Data Cleaning with R

Introduction

Divvy is a bike-sharing company with a network of docking stations that spans Chicago and Evanston. Available 24 hours a day, 365 days a year, the bikes can be unlocked from one station and returned to any other station in the system.

Riders can purchase a single-ride pass, a day pass that grants them 24 hours of unlimited 3-hour rides, or a yearly membership that gives them unlimited 45-minute rides. Lastly, there are two bike types: electric and classic.

The data provided by Divvy includes:

  • Trip start day and time
  • Trip end day and time
  • Trip start station
  • Trip end station
  • Rider type (Member, Casual)

You can download Divvy trip history data here. Note: This data is provided according to the Divvy Data License Agreement.

If you just want to see the raw code, click below; otherwise, keep reading as I walk through my process of cleaning and manipulating the dataset.

setwd("/Users/macbook/Desktop/divvy")
  divvy <- read_csv(list.files(pattern="*.csv"))
  divvy <- rename(divvy, membership = member_casual)
  divvy <- divvy %>% filter(!is.na(end_lat) | !is.na(end_lng))
  divvy <- divvy %>% filter(!is.na(start_lat) | !is.na(start_lng))
  divvy <- divvy %>% filter(!is.na(start_station_id) | !is.na(end_station_id))
  divvy <- divvy[!with(divvy, grepl("test", paste(start_station_id, end_station_id), ignore.case=TRUE)),]
  divvy$month <- format(divvy$started_at, "%m")
  divvy$day <- format(divvy$started_at, "%d")
  divvy$year <- format(divvy$started_at, "%Y")
  divvy$day_of_the_week <- wday(divvy$started_at, label=TRUE, abbr=FALSE, week_start=1)
  divvy$trip_duration <- as.numeric(difftime(divvy$ended_at, divvy$started_at), units="secs")
  divvy <- divvy %>% filter(!(trip_duration < 60))
  casual_three_hours_plus <- divvy %>% filter(membership == "casual" & trip_duration >= 10800)
  members_forty_five_plus <- divvy %>% filter(membership == "member" & trip_duration >= 2700)
  divvy <- anti_join(divvy, casual_three_hours_plus)
  divvy <- anti_join(divvy, members_forty_five_plus)
  na_classic <- divvy %>% filter(is.na(start_station_name) | is.na(end_station_name)) %>% filter(rideable_type == "classic_bike")
  divvy <- anti_join(divvy, na_classic)
  rm("casual_three_hours_plus")
  rm("members_forty_five_plus")
  rm("na_classic")
  write.csv(divvy, "/Users/macbook/Desktop/divvy_clean.csv")

Grab your library card, there are some things we need

First things first: pack everything that’s needed.

library(tidyverse) # data wrangling
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
  ## ✔ dplyr     1.1.1     ✔ readr     2.1.4
  ## ✔ forcats   1.0.0     ✔ stringr   1.5.0
  ## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
  ## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
  ## ✔ purrr     1.0.1     
  ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
  ## ✖ dplyr::filter() masks stats::filter()
  ## ✖ dplyr::lag()    masks stats::lag()
  ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

  library(dplyr) # data wrangling (already attached by tidyverse)
  library(lubridate) # date functions
  library(janitor) # data cleaning
## 
  ## Attaching package: 'janitor'
  ## 
  ## The following objects are masked from 'package:stats':
  ## 
  ##     chisq.test, fisher.test
library(ggplot2) # visualization

One variable to rule them all

A year’s worth of data is separated by month into 12 separate CSV files, which I downloaded into a folder on my desktop titled divvy. To import the files you’d usually do something like this:
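The usual approach is one read_csv call per file; a minimal sketch, with the file name assumed for illustration (Divvy's files follow a YYYYMM-divvy-tripdata.csv pattern):

```r
library(readr)

# import a single month's data (file name assumed for illustration)
jan <- read_csv("202201-divvy-tripdata.csv")
```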

But with 12 files that calls for a bunch of typing, copy/pasting, and modifying, and all that redundancy makes it more error-prone, as it’d end up like this (not to mention how messy it could get when transforming the data):
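Something along these lines: one variable and one read_csv call per month (file names assumed), followed by a manual bind_rows to stitch everything together.

```r
library(readr)
library(dplyr)

# one variable per month... twelve times over
jan <- read_csv("202201-divvy-tripdata.csv")
feb <- read_csv("202202-divvy-tripdata.csv")
mar <- read_csv("202203-divvy-tripdata.csv")
# ...nine more read_csv calls...

# ...and then every data frame has to be combined by hand
divvy <- bind_rows(jan, feb, mar)  # plus the other nine
```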

Instead, I set the working directory to the folder I created earlier with all the CSVs and used the list.files function to find every file in that directory ending in “.csv”, assigning the result to divvy.

setwd("/Users/macbook/Desktop/divvy")
  divvy <- list.files(pattern="*.csv")

It’s worth noting that calling divvy will only return the CSV file names, not the actual data. At least not yet. Why? Because reading everything in now would create one big data set, and it’s not a good idea to blindly combine data without taking a look at it first.

The first date - getting to know the data

Luckily, storing the file names in a variable before merging the data makes it easy to call not only individual files, but also multiple files at once.

# single file
  colnames(read_csv(divvy[1]))
## Rows: 103770 Columns: 13
  ## ── Column specification ────────────────────────────────────────────────────────
  ## Delimiter: ","
  ## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
  ## dbl  (4): start_lat, start_lng, end_lat, end_lng
  ## dttm (2): started_at, ended_at
  ## 
  ## ℹ Use `spec()` to retrieve the full column specification for this data.
  ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##  [1] "ride_id"            "rideable_type"      "started_at"        
  ##  [4] "ended_at"           "start_station_name" "start_station_id"  
  ##  [7] "end_station_name"   "end_station_id"     "start_lat"         
  ## [10] "start_lng"          "end_lat"            "end_lng"           
  ## [13] "member_casual"
# or a range of files
  colnames(read_csv(divvy[1:12]))
## Rows: 5667717 Columns: 13
  ## ── Column specification ────────────────────────────────────────────────────────
  ## Delimiter: ","
  ## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
  ## dbl  (4): start_lat, start_lng, end_lat, end_lng
  ## dttm (2): started_at, ended_at
  ## 
  ## ℹ Use `spec()` to retrieve the full column specification for this data.
  ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##  [1] "ride_id"            "rideable_type"      "started_at"        
  ##  [4] "ended_at"           "start_station_name" "start_station_id"  
  ##  [7] "end_station_name"   "end_station_id"     "start_lat"         
  ## [10] "start_lng"          "end_lat"            "end_lng"           
  ## [13] "member_casual"

The colnames function returns just the column names. In this case all 12 CSVs share the same column names and class types, so they can be merged without making any changes:

divvy <- read_csv(list.files(pattern="*.csv"))
## Rows: 5667717 Columns: 13
  ## ── Column specification ────────────────────────────────────────────────────────
  ## Delimiter: ","
  ## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
  ## dbl  (4): start_lat, start_lng, end_lat, end_lng
  ## dttm (2): started_at, ended_at
  ## 
  ## ℹ Use `spec()` to retrieve the full column specification for this data.
  ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The head, str, and summary functions quickly give more insight on the data:

head(divvy) # displays the first few rows (6 by default)
## # A tibble: 6 × 13
  ##   ride_id          rideable_type started_at          ended_at           
  ##   <chr>            <chr>         <dttm>              <dttm>             
  ## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44
  ## 2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17
  ## 3 BD0F91DFF741C66D classic_bike  2022-01-25 04:53:40 2022-01-25 04:58:01
  ## 4 CBB80ED419105406 classic_bike  2022-01-04 00:18:04 2022-01-04 00:33:00
  ## 5 DDC963BFDDA51EEA classic_bike  2022-01-20 01:31:10 2022-01-20 01:37:12
  ## 6 A39C6F6CC0586C0B classic_bike  2022-01-11 18:48:09 2022-01-11 18:51:31
  ## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
  ## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
  ## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
str(divvy) # displays the internal structure
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
  ##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
  ##  $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
  ##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
  ##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
  ##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
  ##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
  ##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
  ##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
  ##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
  ##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
  ##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
  ##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
  ##  $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ...
  ##  - attr(*, "spec")=
  ##   .. cols(
  ##   ..   ride_id = col_character(),
  ##   ..   rideable_type = col_character(),
  ##   ..   started_at = col_datetime(format = ""),
  ##   ..   ended_at = col_datetime(format = ""),
  ##   ..   start_station_name = col_character(),
  ##   ..   start_station_id = col_character(),
  ##   ..   end_station_name = col_character(),
  ##   ..   end_station_id = col_character(),
  ##   ..   start_lat = col_double(),
  ##   ..   start_lng = col_double(),
  ##   ..   end_lat = col_double(),
  ##   ..   end_lng = col_double(),
  ##   ..   member_casual = col_character()
  ##   .. )
  ##  - attr(*, "problems")=<externalptr>
summary(divvy) # statistical summary of data 
##    ride_id          rideable_type        started_at                    
  ##  Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00  
  ##  Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00  
  ##  Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00  
  ##                                        Mean   :2022-07-20 07:21:18.74  
  ##                                        3rd Qu.:2022-09-16 07:21:29.00  
  ##                                        Max.   :2022-12-31 23:59:26.00  
  ##                                                                        
  ##     ended_at                      start_station_name start_station_id  
  ##  Min.   :2022-01-01 00:01:48.00   Length:5667717     Length:5667717    
  ##  1st Qu.:2022-05-28 19:43:07.00   Class :character   Class :character  
  ##  Median :2022-07-22 15:24:44.00   Mode  :character   Mode  :character  
  ##  Mean   :2022-07-20 07:40:45.33                                        
  ##  3rd Qu.:2022-09-16 07:39:03.00                                        
  ##  Max.   :2023-01-02 04:56:45.00                                        
  ##                                                                        
  ##  end_station_name   end_station_id       start_lat       start_lng     
  ##  Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84  
  ##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
  ##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
  ##                                        Mean   :41.90   Mean   :-87.65  
  ##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
  ##                                        Max.   :45.64   Max.   :-73.80  
  ##                                                                        
  ##     end_lat         end_lng       member_casual     
  ##  Min.   : 0.00   Min.   :-88.14   Length:5667717    
  ##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
  ##  Median :41.90   Median :-87.64   Mode  :character  
  ##  Mean   :41.90   Mean   :-87.65                     
  ##  3rd Qu.:41.93   3rd Qu.:-87.63                     
  ##  Max.   :42.37   Max.   :  0.00                     
  ##  NA's   :5858    NA's   :5858

ride_id is obviously the unique identifier, so it’s important to check for duplicates. The duplicated function returns TRUE or FALSE for each row. TRUE has a value of 1 while FALSE has a value of 0, so combining the function with sum returns 0 if there are no duplicates and a positive integer if there are.

sum(duplicated(divvy$ride_id))
## [1] 0
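A toy vector shows the mechanics:

```r
x <- c("a", "b", "a")
duplicated(x)       # FALSE FALSE TRUE
sum(duplicated(x))  # 1, i.e. one duplicate
```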

Cleanliness is next to godliness

Cleaning data is one of the most important steps in data analysis.

Verify that the entries in member_casual are either casual or member, then rename the column to membership so it makes more sense:

distinct(divvy, member_casual)
## # A tibble: 2 × 1
  ##   member_casual
  ##   <chr>        
  ## 1 casual       
  ## 2 member
divvy <- rename(divvy, membership = member_casual)

Same with the bike type, which returns electric_bike, classic_bike, and docked_bike.

distinct(divvy, rideable_type)
## # A tibble: 3 × 1
  ##   rideable_type
  ##   <chr>        
  ## 1 electric_bike
  ## 2 classic_bike 
  ## 3 docked_bike

Next, check for NA/null/empty data. Only the station names, station IDs, and end coordinates have NA values.

# check for NA
  summary(is.na(divvy))
##   ride_id        rideable_type   started_at       ended_at      
  ##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
  ##  FALSE:5667717   FALSE:5667717   FALSE:5667717   FALSE:5667717  
  ##                                                                 
  ##  start_station_name start_station_id end_station_name end_station_id 
  ##  Mode :logical      Mode :logical    Mode :logical    Mode :logical  
  ##  FALSE:4834653      FALSE:4834653    FALSE:4774975    FALSE:4774975  
  ##  TRUE :833064       TRUE :833064     TRUE :892742     TRUE :892742   
  ##  start_lat       start_lng        end_lat         end_lng       
  ##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
  ##  FALSE:5667717   FALSE:5667717   FALSE:5661859   FALSE:5661859  
  ##                                  TRUE :5858      TRUE :5858     
  ##  membership     
  ##  Mode :logical  
  ##  FALSE:5667717  
  ## 
# remove rows where the ending latitude and longitude are NA
  divvy <- divvy %>% filter(!is.na(end_lat) | !is.na(end_lng))

  # remove rows where both station IDs are NA
  divvy <- divvy %>% filter(!is.na(start_station_id) | !is.na(end_station_id))

There are also testing stations that need to be excluded.

# find all station IDs that contain the word 'test' in them and then view them just to double-check
  divvy[with(divvy, grepl("test", paste(start_station_id, end_station_id), ignore.case=TRUE)),] %>% view()
  
  # remove all the testing stations
  divvy <- divvy[!with(divvy, grepl("test", paste(start_station_id, end_station_id), ignore.case=TRUE)),]

From using str earlier we know that started_at and ended_at are POSIXct data types. POSIXct objects provide a standardized way to represent date and time values; they are timezone aware and, most importantly, allow date/time calculations and comparisons.

# create columns with the extracted data
  divvy$month <- format(divvy$started_at, "%m")
  divvy$day <- format(divvy$started_at, "%d")
  divvy$year <- format(divvy$started_at, "%Y")
  
  # label=TRUE displays the day name using characters (as opposed to numeric values)
  # abbr=FALSE displays the full day name
  # week_start=1 sets the week to start on Monday instead of Sunday
  divvy$day_of_the_week <- wday(divvy$started_at, label=TRUE, abbr=FALSE, week_start=1)
  
  # create an abbreviated month name
  divvy$month_name <- month.abb[as.numeric(divvy$month)]
  
  # subtract the start time from the end time to calculate the trip duration
  divvy$trip_duration <- as.numeric(difftime(divvy$ended_at, divvy$started_at), units="secs")

Using summary on trip_duration shows the minimum value in the negatives and the max in the 2 million range - yikes, obviously something is wrong. Per Divvy, trips shorter than 60 seconds (along with staff service trips) are supposed to be removed:

The data has been processed to remove trips that are taken by staff as they service and inspect the system; and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).

According to the pricing page, casual riders get a 3-hour ride limit while members get a 45-minute limit.

summary(divvy$trip_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  ## -621201     359     630    1230    1134 2483235
# remove rows with a trip duration of less than 60 seconds
  divvy <- divvy %>% filter(!(trip_duration < 60))
  
  # set aside the over-limit trips in their own data frames so they can be inspected if needed
  casual_three_hours_plus <- divvy %>% filter(membership == "casual" & trip_duration >= 10800)
  members_forty_five_plus <- divvy %>% filter(membership == "member" & trip_duration >= 2700)
  
  # remove the over-limit trips from the main data frame
  divvy <- anti_join(divvy, casual_three_hours_plus)
## Joining with `by = join_by(ride_id, rideable_type, started_at, ended_at,
  ## start_station_name, start_station_id, end_station_name, end_station_id,
  ## start_lat, start_lng, end_lat, end_lng, membership, month, day, year,
  ## day_of_the_week, month_name, trip_duration)`
divvy <- anti_join(divvy, members_forty_five_plus)
## Joining with `by = join_by(ride_id, rideable_type, started_at, ended_at,
  ## start_station_name, start_station_id, end_station_name, end_station_id,
  ## start_lat, start_lng, end_lat, end_lng, membership, month, day, year,
  ## day_of_the_week, month_name, trip_duration)`
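One last step from the raw code at the top: classic bikes have to be docked at a station, so classic-bike rides missing a station name are likely bad data and get dropped, and the helper data frames are then cleared from the environment:

```r
# classic bikes must dock, so a missing station name on a classic ride is suspect
na_classic <- divvy %>%
  filter(is.na(start_station_name) | is.na(end_station_name)) %>%
  filter(rideable_type == "classic_bike")
divvy <- anti_join(divvy, na_classic)

# remove the helper data frames from the environment
rm("casual_three_hours_plus", "members_forty_five_plus", "na_classic")
```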

And with that the data is clean! All that’s left to do is to export it.

write.csv(divvy, "/Users/macbook/Desktop/divvy_clean.csv")

I then decided to create a visualization in Tableau.