Friday, 15 June 2018

Spreadsheet Data Manipulation in R

Today I decided to create a new repository on GitHub where I am sharing code to do spreadsheet data manipulation in R.

The first version of the repository and R script is available here: SpreadsheetManipulation_inR

As an example I am using a csv freely available from the IRS, the US Internal Revenue Service.
https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi

This spreadsheet has around 170'000 rows and 131 columns.

Please feel free to request new functions to be added or add functions and code yourself directly on GitHub.


Sunday, 25 March 2018

Data Visualization Website with Shiny

My second Shiny app is dedicated to data visualization.
Here users can simply upload any csv or txt file and create several plots:

  • Histograms (with option for faceting)
  • Barchart (with error bars, and option for color with dodging and faceting)
  • BoxPlots (with option for faceting)
  • Scatterplots (with options for color, size and faceting)
  • TimeSeries


  • Error bars in barcharts are computed with the mean_se function in ggplot2, which computes error bars as mean ± standard error. When the color option is set, barcharts are plotted one next to the other for each color (option dodging).

    For scatterplots, if the option for faceting is provided each plot will include a linear regression lines.


    Some examples are below:


    For the time being there is no option for saving plots, apart from saving the image from the screen. However, I would like to implement an option to have plots in tiff at 300dpi, but all the code I tried so far did not work. I will keep trying.

    The app can be accessed here: https://fveronesi.shinyapps.io/DataViz/

    The R code is available here: https://github.com/fveronesi/Shiny_DataViz

    Sunday, 11 March 2018

    Street Crime UK - Shiny App

    Introduction

    This is a shiny app to visualize heat maps of Street Crimes across Britain from 2010-12 to 2018-01 and test their spatial pattern.
    The code for both ui.R and server.R is available from my GitHub at: https://github.com/fveronesi/StreetCrimeUK_Shiny

    Usage

    Please be aware that this apps downloads data from my personal Dropbox once it starts and every time the user changes some of the settings. This was the only work-around I could think of to use external data in shinyapps.io for free. However, this also makes the app a bit slow, so please be patient.
    Users can select a date with two sliders (I personally do not like the dateInput tool), then a crime type and click Draw Map to update the map with new data. I also included a option to plot the Ripley K-function (function Kest in package spatstat) and the p-value of the quadrat.test (again from spatstat). Both tools work using the data shown within the screen area, so their results change as users interact with the map. The Ripley K function shows a red dashed line with the expected nearest neighbour distribution of points that are randomly distributed in space (i.e. follow a Poisson distribution). The black line is the one computed from the points shown on screen. If the black line is above the red means the observations shown on the map are clustered, while if it is below the red line means the crimes are scattered regularly in space. A more complete overview of the Ripley K function is available at this link from ESRI.
    The p-value from the quadrat test is testing a null hypothesis that the crimes are scattered randomly in space, against an alternative that they are clustered. If the p-value is below 0.05 (significance level of 5%) we can accept the alternative hypothesis that our data are clustered. Please be aware that this test does not account for regularly space crimes.

    NOTE

    Please not that the code here is not reproducible straight away. The app communicates with my Dropbox, though the package rdrop2, which requires a token to download data from Dropbox. More info github.com/karthik/rdrop2.
    I am sharing the code to potentially use a taken downloaded from elsewhere, but the url that points to my Dropbox will clearly not be shared.

    Preparing the dataset

    Csv files with crime data can be downloaded directly from the data.police.uk website. Please check the dates carefully, since each of these files contains more that one years of monthly data. The main issue with these data is that they are divided by local police forces, so for example we will have a csv for each month from the Bedfordshire Police, which only covers that part of the country. Moreover, these csv contain a lot of data, not only coordinates; they also contain the type of crimes, plus other details, which we do not need and which makes the full collection a couple of Gb in size.
    For these reasons I did some pre-processing, First of all I extracted all csv files into a folder named "CrimeUK" and then I ran the code below:
    lista = list.files("E:/CrimesUK",pattern="street",recursive=T,include.dirs=T,full.names=T,ignore.case = T)
    
    for(i in lista){
      DF = read.csv(i)
    
       write.table(data.frame(LAT=DF$Latitude, LON=DF$Longitude, TYPE=DF$Crime.type),
                   file=paste0("E:/CrimesUK/CrimesUK",substr(paste(DF$Month[1]),1,4),"_",substr(paste(DF$Month[1]),6,7),".csv"),
                   sep=",",row.names=F,col.names=F, append=T)
       print(i)
    }
    
    Here I first create a list of all csv files, with full link, searching inside all sub directory. Then I started a for loop to iterate through the files. The loop simply loads each file and than save part of its contents (namely coordinates and crime type) into new csv named after using year and month. This will help me identify which files to download from Dropbox, based on user inputs.
    Once I had these files I simply uploded them to my Dropbox.

    The link to test the app is:

    fveronesi.shinyapps.io/CrimeUK/


    A snapshot of the screen is below: