Learning R is best done over time and through practice. However, much can be learnt by starting with a common task: getting data into R!

1 Jump right in!

We will start by importing a very common data file type, a csv file. Data stored in csv files are often (but not always) “well-behaved”. By well-behaved we mean that the data are represented in a simple rectangular format with variables in columns and observations in rows. A very well-behaved file will also have the names of the variables on the first line.

The most common situation is that such files are read directly from a computer hard disk, but R can just as easily read files stored on the Internet. Here is a minimal example using the built-in R function read.csv, which by default assumes that the first row is a header containing the variable names and that the remaining rows hold the data.
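As a minimal sketch, the block below writes a tiny csv file to a temporary location and reads it back with read.csv. The column names mirror those used later in this section, but the values are invented for illustration, and the URL in the final comment is a placeholder rather than the real location of the example file.

```r
# Write a tiny made-up csv file so the example is self-contained.
# Column names follow this section; the values are invented.
csv_lines <- c(
  "date,ws,wd,nox",
  "01/01/1998 00:00,0.60,280,285",
  "01/01/1998 01:00,2.16,230,NA",
  "01/01/1998 02:00,2.76,190,NA"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# read.csv() treats the first row as a header by default (header = TRUE)
dat <- read.csv(path)
dat

# The same call accepts a URL in place of a local path, e.g. (placeholder URL):
# dat <- read.csv("https://example.com/ExampleDataLong.csv")
```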

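This code reads the file and stores it in a ‘data frame’, which is the most common way R stores data. If you were reading the file from your own computer you would need to provide the file path, e.g. "C:/Users/david/data/ExampleDataLong.csv". Note the use of the forward slash “/”: in an R string the backslash is an escape character, so a Windows path must be written with forward slashes or doubled backslashes.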

Let’s explore the data in RStudio and in the R console.
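A few base R functions cover most of this first look. The sketch below uses a small made-up data frame standing in for the imported file (the values are invented); in RStudio, clicking the object in the Environment pane or calling View(dat) opens the same data in a spreadsheet-style viewer.

```r
# A small stand-in for the imported data (invented values)
dat <- data.frame(
  date = c("01/01/1998 00:00", "01/01/1998 01:00", "01/01/1998 02:00"),
  ws = c(0.60, 2.16, 2.76),
  wd = c(280L, 230L, 190L)
)

str(dat)      # compact overview: one line per column with its type
head(dat)     # the first few rows
summary(dat)  # per-column summaries (min, mean, max, ...)
```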

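Everything looks fine for the measurements: the variables are stored as either numeric or integer values. The problem, though, is that R has read the date as a Factor, which can loosely be thought of as a character type. (From R 4.0.0 onwards read.csv no longer converts strings to factors by default, so the date will be read as a plain character column instead; either way it is not yet a proper date-time.)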

If we want to work properly with dates and times we need to format the date correctly. Dates and times can be very complex, with time zones, daylight saving time etc. However, the dedicated lubridate package is ideal because it contains many functions to help parse and format dates and times. How we do this depends on the original format of the date or date-time. In this example, the data are stored as dd/mm/yyyy HH:MM, which is very common. We can use one of a family of lubridate functions to tell R what the format is:
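For the dd/mm/yyyy HH:MM layout the matching lubridate function is dmy_hm (day, month, year, then hour and minute). The sketch below uses a two-row made-up data frame and assumes the times are in UTC; use whichever time zone applies to your own data.

```r
library(lubridate)

# Made-up rows in dd/mm/yyyy HH:MM format
dat <- data.frame(
  date = c("01/01/1998 00:00", "31/12/1998 23:00"),
  ws = c(0.60, 2.16)
)

# as.character() guards against the column having been read as a factor
dat$date <- dmy_hm(as.character(dat$date), tz = "UTC")

class(dat$date)  # a POSIXct date-time rather than text
```

Other members of the family follow the same naming pattern, for example ymd_hms for yyyy-mm-dd HH:MM:SS and dmy for a date without a time.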

Look again in RStudio …

In fact, now that we have a data frame with a correctly formatted date and the column names date, ws (wind speed) and wd (wind direction), we can use openair to work with the data …
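As one illustration of what those column names make possible, the sketch below draws a wind rose with the openair function windRose. The data are randomly generated purely so the example runs on its own, and the plot is only drawn if openair is installed.

```r
# Randomly generated stand-in data with the column names openair expects
set.seed(1)
n <- 200
dat <- data.frame(
  date = seq(as.POSIXct("1998-01-01 00:00", tz = "UTC"),
             by = "hour", length.out = n),
  ws = runif(n, min = 0, max = 10),                       # wind speed
  wd = sample(seq(0, 350, by = 10), n, replace = TRUE)    # wind direction
)

# Draw the plot only when the openair package is available
if (requireNamespace("openair", quietly = TRUE)) {
  openair::windRose(dat)
}
```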

2 Modern approaches to data analysis in R

2.1 The tidyverse

There is no doubt that over the last 5 years or so the approach to using R for data analysis has changed. Base R itself is extremely capable and provides the user with all they reasonably need to carry out simple to sophisticated analysis.

However, a collection of optional R packages under a framework called the tidyverse (see https://www.tidyverse.org) has changed the way many of us work with R. Not everyone uses this collection of packages, but if you are learning R and looking for help, you will come across this approach a lot.

Among the packages in the ‘tidyverse’ are:

  • ggplot2 for plotting data; now the most common approach in R
  • dplyr for manipulating and summarising data. This package is used extensively.
  • readr and readxl for reading data from text files and Excel files.
  • tidyr for getting your data into the right “shape” ready for easy analysis
  • lubridate, mentioned already — indispensable for working with dates and date-times

Make it easy and just install all the packages at once!
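A minimal sketch of that one-step installation is shown below. The check with requireNamespace is an addition here so that the packages are only downloaded when they are not already installed.

```r
# Install the whole collection once (skipped if already installed)
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the core tidyverse packages (ggplot2, dplyr, tidyr, readr, ...)
library(tidyverse)
```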

Note that the readr package has a version of read.csv called read_csv, which is very useful for reading very large files and can often do a better job of getting the column formats correct. Similarly, the readxl package is very good for dealing with Excel workbooks, including importing specific sheets and data ranges.
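The sketch below shows one way read_csv can get the column formats right at import time: the col_types argument declares the date column as a date-time in dd/mm/yyyy HH:MM format, so no separate conversion step is needed. The file is a tiny made-up example written to a temporary location.

```r
library(readr)

# Tiny made-up csv file (invented values)
csv_lines <- c(
  "date,ws,wd,nox",
  "01/01/1998 00:00,0.60,280,285",
  "01/01/1998 01:00,2.16,230,NA"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Declare the date format up front; treat every other column as a number
dat <- read_csv(
  path,
  col_types = cols(
    date = col_datetime(format = "%d/%m/%Y %H:%M"),
    .default = col_double()
  )
)
dat
```

The result is a tibble, the tidyverse flavour of a data frame, which prints the column types underneath the column names.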

2.2 A brief example of usage

We will use some of the functions in the tidyverse packages to calculate some annual statistics for NOx and also introduce how to string different steps of analysis together.
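A sketch of what such a calculation might look like is given below, using dplyr verbs joined with the pipe operator %>% and the lubridate function year. The four-row data frame is invented so the example runs on its own; the output that follows in the text comes from the full data set, not from this toy example.

```r
library(dplyr)
library(lubridate)

# Made-up data spanning two years (invented values)
dat <- data.frame(
  date = as.POSIXct(c("1998-03-01 00:00", "1998-09-01 00:00",
                      "1999-03-01 00:00", "1999-09-01 00:00"), tz = "UTC"),
  nox = c(100, 200, NA, 300)
)

annual_nox <- dat %>%
  mutate(year = year(date)) %>%              # add a year column
  group_by(year) %>%                         # one group per year
  summarise(nox = mean(nox, na.rm = TRUE))   # mean NOx, ignoring missing values

annual_nox
```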

## # A tibble: 8 x 2
##    year   nox
##   <dbl> <dbl>
## 1  1998  195.
## 2  1999  204.
## 3  2000  217.
## 4  2001  175.
## 5  2002  157.
## 6  2003  164.
## 7  2004  157.
## 8  2005  144.

Let’s unpick what we just did …
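One way to unpick a piped calculation is to run each step on its own and store the intermediate result. The sketch below assumes the annual mean was built from mutate, group_by and summarise, and uses invented data; the pipe simply passes the result of each step as the first argument of the next.

```r
library(dplyr)
library(lubridate)

# Made-up data spanning two years (invented values)
dat <- data.frame(
  date = as.POSIXct(c("1998-03-01 00:00", "1998-09-01 00:00",
                      "1999-03-01 00:00", "1999-09-01 00:00"), tz = "UTC"),
  nox = c(100, 200, NA, 300)
)

# Step 1: add a column holding the year of each date
with_year <- mutate(dat, year = year(date))

# Step 2: mark the rows as belonging to groups, one per year
by_year <- group_by(with_year, year)

# Step 3: collapse each group to a single row holding the mean
annual_nox <- summarise(by_year, nox = mean(nox, na.rm = TRUE))

annual_nox
```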

Summarise all numeric variables by year-month …
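A sketch of one way to do this is shown below: group by both year and month, then apply mean to every numeric column using across and where. The data are invented, and the table that follows in the text comes from the full data set rather than from this toy example.

```r
library(dplyr)
library(lubridate)

# Made-up data covering three year-month combinations (invented values)
dat <- data.frame(
  date = as.POSIXct(c("1998-01-01 00:00", "1998-01-02 00:00",
                      "1998-02-01 00:00", "1999-01-01 00:00"), tz = "UTC"),
  ws = c(1, 3, 5, 7),
  nox = c(100, NA, 200, 300)
)

monthly <- dat %>%
  mutate(
    year = year(date),
    month = month(date, label = TRUE, abbr = FALSE)  # ordered factor of month names
  ) %>%
  group_by(year, month) %>%
  summarise(
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),  # every numeric column
    .groups = "drop_last"                                 # result stays grouped by year
  )

monthly
```

With na.rm = TRUE, a group in which every value is missing has nothing left to average, so its mean is reported as NaN.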

## # A tibble: 90 x 11
## # Groups:   year [8]
##     year month        ws    wd   nox   no2     o3  pm10   so2    co  pm25
##    <dbl> <ord>     <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1998 January    5.09  171.  168.  42.2   4.35  29.2  5.18  1.95 NaN  
##  2  1998 February   4.26  217.  284.  58.3   2.49  40.2  9.65  2.90 NaN  
##  3  1998 March      4.67  225.  194.  49.7   5.29  32.7  8.98  2.05 NaN  
##  4  1998 April      4.20  206.  177.  47.9   8.81  28.9  6.60  1.78 NaN  
##  5  1998 May        3.46  163.  122.  43.7   9.46  32.5  4.95  1.25  21.0
##  6  1998 June       4.93  219.  197.  47.1   5.51  33.0  6.30  2.05  20  
##  7  1998 July       4.83  234.  191.  46.5 NaN     30.9  6.82  2.10  18.8
##  8  1998 August     3.91  234.  171.  46.9   6.08  31.4  6.31  1.76  20.2
##  9  1998 September  3.16  194.  182.  49.1   5.25  35.7  8.45  2.05  24.3
## 10  1998 October    5.29  213.  205.  45.4   5.22  28.4  5.18  2.28  18.4
## # … with 80 more rows