Divvy is a bike-sharing company with a network of docking stations spanning Chicago and Evanston. Available 24 hours a day, 365 days a year, the bikes can be unlocked from one station and returned to any other station in the system.
Riders can either enroll in an annual membership that gives them unlimited 45-minute rides, buy a single-ride pass, or buy a day pass that grants 24 hours of unlimited 3-hour rides. Lastly, there are two bike types: electric and classic.
The data provided by Divvy includes a ride ID, the bike type, the start and end times, the start and end stations (names and IDs), the start and end coordinates, and the rider's membership type.
You can download Divvy trip history data here. Note: this data is provided according to the Divvy Data License Agreement.
If you just want to see the raw code, click below; otherwise, keep reading as I walk through my process of cleaning and manipulating the dataset.
setwd("/Users/macbook/Desktop/divvy")
divvy <- read_csv(list.files(pattern="*.csv"))
divvy <- rename(divvy, membership = member_casual)
divvy <- divvy %>% filter(!is.na(end_lat) | !is.na(end_lng))
divvy <- divvy %>% filter(!is.na(start_lat) | !is.na(start_lng))
divvy <- divvy %>% filter(!is.na(start_station_id) | !is.na(end_station_id))
# drop rides to or from testing stations
divvy <- divvy[!with(divvy, grepl("test", paste(start_station_id, end_station_id), ignore.case=TRUE)),]
# extract the date parts of the start time
divvy$month <- format(divvy$started_at, "%m")
divvy$day <- format(divvy$started_at, "%d")
divvy$year <- format(divvy$started_at, "%Y")
divvy$day_of_the_week <- wday(divvy$started_at, label=TRUE, abbr=FALSE, week_start=1)

# trip duration in seconds
divvy$trip_duration <- as.numeric(difftime(divvy$ended_at, divvy$started_at), units="secs")
# drop trips under 60 seconds and trips over the pass/membership time limits
divvy <- divvy %>% filter(!(trip_duration < 60))
casual_three_hours_plus <- divvy %>% filter(membership == "casual" & trip_duration >= 10800)
members_forty_five_plus <- divvy %>% filter(membership == "member" & trip_duration >= 2700)
divvy <- anti_join(divvy, casual_three_hours_plus)
divvy <- anti_join(divvy, members_forty_five_plus)

# drop classic-bike rides that are missing a station name
na_classic <- divvy %>% filter(is.na(start_station_name) | is.na(end_station_name)) %>% filter(rideable_type == "classic_bike")
divvy <- anti_join(divvy, na_classic)
rm("casual_three_hours_plus")
rm("members_forty_five_plus")
rm("na_classic")
write.csv(divvy, "/Users/macbook/Desktop/divvy_clean.csv")
First things first, load all the packages that are needed:
library(tidyverse) # data wrangling
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # data wrangling
library(lubridate) # date functions
library(janitor) # data cleaning
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2) # visualization
A year's worth of data is split by month into 12 separate CSV files, which I downloaded into a folder on my desktop titled divvy. To import the files you'd usually read each one in individually, something like this:
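(The file name below is just a placeholder; the actual monthly files follow a naming pattern along the lines of 202201-divvy-tripdata.csv.)
jan_2022 <- read_csv("202201-divvy-tripdata.csv")  # one read_csv call per file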
But with 12 files that would mean a lot of typing, copying, pasting, and modifying, and all that redundancy makes the process more error-prone, since it would end up like this (not to mention how messy it could get when transforming the data):
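(Again with placeholder file names, this is roughly what the repetitive version looks like.)
jan_2022 <- read_csv("202201-divvy-tripdata.csv")
feb_2022 <- read_csv("202202-divvy-tripdata.csv")
mar_2022 <- read_csv("202203-divvy-tripdata.csv")
# ...nine more of these, one per remaining month...
divvy <- bind_rows(jan_2022, feb_2022, mar_2022)  # and every object has to be listed again here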
Instead, I set the working directory to the divvy folder I created earlier with all the CSVs and used the list.files function to find every file in that directory ending in ".csv", assigning the resulting file names to divvy.
setwd("/Users/macbook/Desktop/divvy")
divvy <- list.files(pattern="*.csv")
It's worth noting that calling divvy at this point only returns the CSV file names, not the actual data. Why not read everything in right away? Because it's not a good idea to blindly combine data without taking a look at it first.
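A quick way to confirm that is to print the variable; it should return nothing more than a character vector of 12 file names (the names in the comment below are placeholders, not the actual Divvy file names).
divvy
# a character vector of 12 names, along the lines of
# "202201-divvy-tripdata.csv" "202202-divvy-tripdata.csv" ... "202212-divvy-tripdata.csv"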
Luckily, storing the file names in a variable before merging the data makes it easy to inspect not only individual files, but also several at once.
# single file
colnames(read_csv(divvy[1]))
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
# or a range of files
colnames(read_csv(divvy[1:12]))
## Rows: 5667717 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
The colnames function returns just the column names. Surprisingly, in this case all 12 files share the same column names and types, so the CSVs can be merged without making any changes:
divvy <- read_csv(list.files(pattern="*.csv"))
## Rows: 5667717 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The head, str, and summary functions quickly give more insight into the data:
head(divvy) # displays the first few rows (6 by default)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44
## 2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17
## 3 BD0F91DFF741C66D classic_bike 2022-01-25 04:53:40 2022-01-25 04:58:01
## 4 CBB80ED419105406 classic_bike 2022-01-04 00:18:04 2022-01-04 00:33:00
## 5 DDC963BFDDA51EEA classic_bike 2022-01-20 01:31:10 2022-01-20 01:37:12
## 6 A39C6F6CC0586C0B classic_bike 2022-01-11 18:48:09 2022-01-11 18:51:31
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
str(divvy) # displays the internal structure
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
## $ ended_at : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
## $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ start_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ end_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:5667717] "casual" "casual" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(divvy) # statistical summary of data
## ride_id rideable_type started_at
## Length:5667717 Length:5667717 Min. :2022-01-01 00:00:05.00
## Class :character Class :character 1st Qu.:2022-05-28 19:21:05.00
## Mode :character Mode :character Median :2022-07-22 15:03:59.00
## Mean :2022-07-20 07:21:18.74
## 3rd Qu.:2022-09-16 07:21:29.00
## Max. :2022-12-31 23:59:26.00
##
## ended_at start_station_name start_station_id
## Min. :2022-01-01 00:01:48.00 Length:5667717 Length:5667717
## 1st Qu.:2022-05-28 19:43:07.00 Class :character Class :character
## Median :2022-07-22 15:24:44.00 Mode :character Mode :character
## Mean :2022-07-20 07:40:45.33
## 3rd Qu.:2022-09-16 07:39:03.00
## Max. :2023-01-02 04:56:45.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5667717 Length:5667717 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.14 Length:5667717
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5858 NA's :5858
ride_id is obviously the unique identifier, so it's important to check for duplicates. The duplicated function returns TRUE or FALSE for each row; since TRUE has a value of 1 and FALSE a value of 0, wrapping it in sum returns 0 if there are no duplicates and a positive integer if there are.
sum(duplicated(divvy$ride_id))
## [1] 0
Cleaning data is one of the most important steps in data analysis.
First, verify that the entries in member_casual are either casual or member, and then rename the column to membership so the name makes more sense.
distinct(divvy, member_casual)
## # A tibble: 2 × 1
## member_casual
## <chr>
## 1 casual
## 2 member
divvy <- rename(divvy, membership = member_casual)
Same with the bike type, which returns electric_bike, classic_bike, and docked_bike.
distinct(divvy, rideable_type)
## # A tibble: 3 × 1
## rideable_type
## <chr>
## 1 electric_bike
## 2 classic_bike
## 3 docked_bike
Next, check for NA/null/empty data. Only the station names/IDs and the end coordinates have NA values.
# check for NA
summary(is.na(divvy))
## ride_id rideable_type started_at ended_at
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:5667717 FALSE:5667717 FALSE:5667717 FALSE:5667717
##
## start_station_name start_station_id end_station_name end_station_id
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:4834653 FALSE:4834653 FALSE:4774975 FALSE:4774975
## TRUE :833064 TRUE :833064 TRUE :892742 TRUE :892742
## start_lat start_lng end_lat end_lng
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:5667717 FALSE:5667717 FALSE:5661859 FALSE:5661859
## TRUE :5858 TRUE :5858
## membership
## Mode :logical
## FALSE:5667717
##
# remove rides with a missing latitude or longitude
divvy <- divvy %>% filter(!is.na(start_lat) & !is.na(start_lng))
divvy <- divvy %>% filter(!is.na(end_lat) & !is.na(end_lng))
# keep only rides that have at least one station ID
divvy <- divvy %>% filter(!is.na(start_station_id) | !is.na(end_station_id))
There are also testing stations that need to be excluded.
# find all station IDs that contain the word 'test' in them and then view them just to double-check
divvy[with(divvy, grepl("test", paste(start_station_id, end_station_id), ignore.case=TRUE)),] %>% view()
# remove all the testing stations
divvy <- divvy[!with(divvy, grepl("test", paste(start_station_id, end_station_id), ignore.case=TRUE)),]
From using str earlier, we know that started_at and ended_at are POSIXct data types. POSIXct objects provide a standardized way to represent date and time values that are time-zone aware and, most importantly, allow date/time calculations and comparisons.
# create columns with the extracted data
divvy$month <- format(divvy$started_at, "%m")
divvy$day <- format(divvy$started_at, "%d")
divvy$year <- format(divvy$started_at, "%Y")
# label=TRUE displays the day name using characters (as opposed to numeric values)
# abbr=FALSE displays the full day name
# week_start=1 sets the week to start on Monday instead of Sunday
divvy$day_of_the_week <- wday(divvy$started_at, label=TRUE, abbr=FALSE, week_start=1)
# create an abbreviated month name
divvy$month_name <- month.abb[as.numeric(divvy$month)]
# subtract the start time from the end time to calculate the trip duration
divvy$trip_duration <- as.numeric(difftime(divvy$ended_at, divvy$started_at), units="secs")
Using summary on trip_duration shows a minimum value that is negative and a maximum in the 2-million-second range - yikes, obviously something is wrong. Per Divvy, staff are supposed to remove trips shorter than 60 seconds:
The data has been processed to remove trips that are taken by staff as they service and inspect the system; and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).
According to the pricing page, casual riders get a 3-hour limit per ride, while members get a 45-minute limit.
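(For reference, those limits are where the thresholds used below come from: 3 hours = 3 × 60 × 60 = 10,800 seconds and 45 minutes = 45 × 60 = 2,700 seconds.)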
summary(divvy$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -621201 359 630 1230 1134 2483235
# remove rows with a trip duration of less than 60 seconds
divvy <- divvy %>% filter(!(trip_duration < 60))
# put the over-limit rides for each group into their own data frames so they can be inspected more closely if needed
casual_three_hours_plus <- divvy %>% filter(membership == "casual" & trip_duration >= 10800)
members_forty_five_plus <- divvy %>% filter(membership == "member" & trip_duration >= 2700)
# remove those rides from divvy
divvy <- anti_join(divvy, casual_three_hours_plus)
## Joining with `by = join_by(ride_id, rideable_type, started_at, ended_at,
## start_station_name, start_station_id, end_station_name, end_station_id,
## start_lat, start_lng, end_lat, end_lng, membership, month, day, year,
## day_of_the_week, month_name, trip_duration)`
divvy <- anti_join(divvy, members_forty_five_plus)
## Joining with `by = join_by(ride_id, rideable_type, started_at, ended_at,
## start_station_name, start_station_id, end_station_name, end_station_id,
## start_lat, start_lng, end_lat, end_lng, membership, month, day, year,
## day_of_the_week, month_name, trip_duration)`
And with that the data is clean! All that’s left to do is to export it.
write.csv(divvy, "/Users/macbook/Desktop/divvy_clean.csv")
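One optional tweak, in case you don't want R's row numbers written out as an unnamed first column: write.csv includes them by default, and passing row.names = FALSE keeps the export limited to the actual variables.
write.csv(divvy, "/Users/macbook/Desktop/divvy_clean.csv", row.names = FALSE)  # same export, minus the row-number column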
I then decided to create a visualization in Tableau.