--- title: "Introduction to {oystr}" author: "Matt Dray" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{oystr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} # Set options knitr::opts_chunk$set( collapse = TRUE, comment = "" ) # Load the package library(oystr) ``` # Summary You can use the {oystr} package to process [Oyster card](https://oyster.tfl.gov.uk/) journey data in CSV files provided by [Transport for London](https://tfl.gov.uk/) (TfL). Provide a file path to `oy_read()` that contains the CSVs, then use `data_clean()` to prepare the data and extract new features ready for further processing and summarisation. ```{r tldr, eval=FALSE} remotes::install_github("matt-dray/oystr") library(oystr) data_raw <- oy_read(path = "data/") # supply your own path data_clean <- oy_clean(x = data_raw) ``` # Background [Transport for London](https://tfl.gov.uk) (TfL) is responsible for the majority of the transport network in the UK's capital. Transport options include overground and underground trains, buses, trams and riverboats. TfL operates the [Oyster card](https://oyster.tfl.gov.uk/) system on its network. You can charge the card with funds and get access to the network by tapping in your card on a special reader. Your card has a unique number and this is recorded each time you 'tap in' and 'tap out' of the network. TfL stores this data for consumption by users for eight weeks. You can view your history on the TfL website or you can be sent a monthly digest by email. In both cases you can collect the data as a CSV file. This is in more of a human-readable form than machine-readable, which can make it difficult to perform further processing from a programmatic perspective. # Oyster data TfL provide the data in a consistent format. Here are the first few rows from an anonymised example where the locations and services have been replaced. ```{r example-raw-no-echo, echo=FALSE} example_raw <- read.csv(file = "../inst/extdata/july.csv") head(example_raw) ``` The resulting data frame has some problems. For example: * the first row of the CSV is blank, so the column headers end up as the first row of data in the data frame * the date is recorded for the journey start only, even thtough the journey end could be the next day * Journey/Action contains multiple types of information, like the start station, end station and bus journeys * the data types for each are not correct (`Date` is not in date-time format, for example) The aim of the {oystr} package is to help address these issues and more. # Read The `oy_read()` function reads and combines multiple CSV files of Oyster data. The function takes one argument: a string that is the path to a folder containing your Oyster data files. The names of the files don't matter, but the structure of each of them must be in the format provided by TfL. For example, the file path in the example below leads to a folder containing four files: two are CSVs containing monthly Oyster data in the format supplied by TfL (`july.csv` and `august.csv`) and the other two are a CSV and a text file that don't contain Oyster data (`not_oyster.csv` and `not_oyster.txt`). ```{r list-files-no-eval, eval=FALSE} # Print the file names in a folder of Oyster data list.files("data/") # supply your own path ``` ```{r list-files-no-echo, echo=FALSE} list.files("../inst/extdata/") ``` So we need to read the two CSVs of Oyster data and ignore the others. The {oystr} package can help you. You can install it from GitHub and then load it. ```{r library-no-eval, eval=FALSE} # Install the {remotes} package to help install {oystr} install.packages("remotes") remotes::install_github("matt-dray/oystr") # Load the package library(oystr) ``` The functions of the {oystr} package are now available to you. Use the `oy_read()` function to read the data sets. To the 'path' argument provide a file path to a folder containing your files. Here I've used the filepath created in teh sectio above. ```{r oy_read-no-eval, eval=FALSE} # Read data from a folder data_raw <- oy_read("data/") ``` ```{r oy_read-no-echo, echo=FALSE} data_raw <- oy_read("../inst/extdata/") ``` This gave a warning for one CSV file that was in the wrong format -- the function identified that it didn't contain Oyster data. The text file was ignored. Let's take a look at the structure of the data. ```{r str} str(data_raw) ``` The output data frame contains rows of data from both CSV files. We can tell because it has data from both months of the input data frames (July and August in this example). ```{r july-august} # Count rows from each month of input data july <- sum(grepl("Jul", data_raw$Date)) august <- sum(grepl("Aug", data_raw$Date)) # Check their sum equals the output from oy_read() july + august == nrow(data_raw) ``` # Clean There's still a number of problems after the data is read in and the blank first row is removed. For example: * dates and times are character types, not datettime * charge and credit are also character types * different types of data (bus trips and train journeys) are in the same `Journey.Action` column You can clean up the formatting and engineer new features with `oy_clean()`. ```{r oy_clean} # Clean the output from oy_read() data_clean <- oy_clean(data_raw) # See the data frame structure str(data_clean) ``` So what did `oy_clean()` do? You can see there are some new and altered columns: * [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) date-times for the start and end of the journey, where applicable (takes into account journeys that start before midnight and end after) * day of the week for separating work-week from weekend * journey duration, a difftime object, in minutes * the mode of travel (train, bus, etc) * the start and end stations separated * the number of the bus route The data are now in a suitable shape for further processing, such as summarisation and visualisation. # Summarise For now, you can summarise the data in two ways: as a list of summary statistcis or as a plot. These functions are underdeveloped while I wait for more information about the possible formats of data from the `Journey/Action` column. ## List You can use the `oy_summary()` function to crate a list where each element is some statistic. By default, the function works for train journeys. ```{r oy_summary-train} oy_summary(data_clean) ``` You can also specify bus journeys. ```{r oy_summary-bus} oy_summary(data_clean, mode = "Bus") ``` See `?oy_summary` for details. ## Plot There is limited suport for plotting too. You can call `oy_lineplot()` for a very simple lineplot of some continuous variables over time. For ecample, here's the balance remaining on the card over time. ```{r oy_lineplot-balance} oy_lineplot(data_clean, y_var = "balance") ``` And here's journey duration with weekends removed. ```{r oy_lineplot-duration} oy_lineplot(data_clean, y_var = "journey_duration", weekdays = FALSE) ```