Google Data Analytics Capstone

The following is a detailed, step-by-step description of my Google Data Analytics Certification capstone case study. If you prefer, you can scroll straight to the final report/stakeholder presentation included at the end of this write-up.

Scenario

I am a data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Three questions will guide the future marketing program: 

I have been tasked by the director of marketing to answer Question #1: How do annual members and casual riders use Cyclistic bikes differently?

I will produce a report/stakeholder presentation with the following deliverables: 

About the Company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Data Analysis Phase #1: Ask

Guiding Questions:


Data Analysis Phase #1 Ask Deliverable: A clear statement of the business task

The business challenge is to upsell Cyclistic's casual riders into more profitable annual memberships. My business task, then, is to analyze the most relevant historical trip data to identify how Cyclistic's annual members and casual riders use the service differently, so that I can recommend the most effective marketing strategies for converting casual riders into annual members.

Data Analysis Phase #2: Prepare

Guiding Questions:


Data Analysis Phase #2 Prepare Deliverable: A description of all data sources used

The data is located here and has been made available by Motivate International Inc. under this license. Because this is first-party company data, I do not expect issues of bias or credibility, but I will address these as warranted.


One concern I have is that, because ridership is only now recovering from the Covid policies in place since 2020 Q2, an analysis based on "Covid era" data (2020 Q2-2021 Q1) might not be optimal for decisions aimed at the post-Covid environment. I raised this with my manager, who agrees that the most useful analysis would use data from the 12-month period before Covid policies took effect.


So, I decide to use the 12 months of Cyclistic's pre-Covid trip data (April 2019-March 2020) as my analysis dataset. The .csv files for this period are available both monthly and quarterly. I create a folder named Case_study_cyclist with a subfolder named csv to store the .csv files, and another subfolder named xlsx for copies saved as .xlsx files so I can work with them in Microsoft Excel.


I won't be able to combine all the monthly or quarterly files for this period into one spreadsheet, because the full year of data (nearly 3.9 million rows) far exceeds Excel's limit of roughly 1,048,576 rows per worksheet, so I will do the full-dataset analysis in R below. However, I can use Excel to review individual months and familiarize myself with the available data and its general structure. Viewing a file in Excel, I see that each file contains the following columns:


Some months' and quarters' data files contain additional columns, which I will address below...

Data Analysis Phase #3: Process

Guiding Questions:


Data Analysis Phase #3 Process Deliverable: Documentation of any cleaning or manipulation of data

As noted above, I will be doing my full data analysis in R, but first I want to do some quick cleaning and simple data transformations on individual months to investigate possible insights.


I open one of the monthly files and create a column "day_of_week"; using Excel's WEEKDAY function on the "started_at" column, I calculate the day of the week on which each ride started (1 = Sunday, 7 = Saturday). I also create a new column "ride_length" and, by subtracting the "started_at" column from the "ended_at" column, get the length of each ride (formatted HH:MM:SS)...

I then notice that some trips have a ride length of less than 0 seconds, and that many of the very short trips (< 60 seconds) start and end at station "HQ QR". I've confirmed with my manager that these are maintenance artifacts rather than actual trips, so I filter out these "non-trips" to prepare for analysis.
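For anyone following along in R instead of Excel, a minimal sketch of this same single-month preparation might look like the following (the file name is a placeholder, and the column names started_at, ended_at, and start_station_name are assumptions; some months name these columns slightly differently):

library(readr)
library(dplyr)
library(lubridate)

# Read one month of trip data (placeholder file name)
one_month <- read_csv("one_month_of_trips.csv")

# Mirror the Excel steps: day_of_week (1 = Sunday, 7 = Saturday, like WEEKDAY()),
# ride_length in seconds, then drop maintenance artifacts and negative durations
one_month_clean <- one_month %>%
  mutate(day_of_week = wday(started_at),
         ride_length = as.numeric(difftime(ended_at, started_at, units = "secs"))) %>%
  filter(ride_length >= 0, start_station_name != "HQ QR")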

Data Analysis Phase #4: Analyze

Guiding Questions:


Data Analysis Phase #4 Analyze Deliverable: A summary of the analysis

Continuing in Excel, I begin my data analysis by calculating the mean ride_length, the maximum ride_length, and the mode of the day of the week on which rides are taken...
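As an aside, R has no built-in mode function, so reproducing these three summary statistics outside Excel takes a small helper. Here is a sketch, assuming the one_month_clean frame from the earlier sketch (ride_length in seconds, day_of_week coded 1-7):

# Mean and maximum ride length, as in the Excel summary
mean(one_month_clean$ride_length)
max(one_month_clean$ride_length)

# Small helper returning the most frequent value (statistical mode)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Most common day of the week on which rides start (1 = Sunday)
stat_mode(one_month_clean$day_of_week)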

I then create some pivot tables to quickly calculate and visualize some differences between member riders and casual riders like average ride_length...

So, from the above, I learn that casual riders, on average, tend to ride more than 3x longer than member riders, and this is consistent across all months. Let's take a look at how each group's average ride length varies by day of the week (1 is Sunday, 7 is Saturday)...

It would also be good to know the total ride breakdown between member and casual riders, so I analyze the absolute number of rides taken by each group by day of the week. It looks like member ridership outpaces casual ridership every day of the week...

Again, these trends are consistent month-to-month.

So, from the above we see that both the number of casual rides and their average length peak on the weekend.

Analyzing Data with Tableau

At first glance, the geographic information contained in the data looks very enticing. My initial thought is to use this data to calculate member/casual average ride distance using the beginning and ending lat/lons. After considering this further, though, I realize that any "round trip" ride - a ride that begins and ends at the same docking location - would show a ride distance of "0". So, this particular analysis becomes less compelling. 
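To make the idea concrete, here is a rough sketch of the distance calculation I had in mind, using a hand-rolled haversine formula on the q1_2020 frame that gets read into R later in this write-up (the only file format with lat/lng columns). Note that a dock-to-same-dock round trip scores a distance of 0 no matter how far the rider actually traveled:

library(dplyr)

# Straight-line (haversine) distance in meters between start and end points
haversine_m <- function(lat1, lng1, lat2, lng2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a))
}

# Average straight-line distance by rider type; round trips contribute 0
q1_2020 %>%
  mutate(ride_distance_m = haversine_m(start_lat, start_lng, end_lat, end_lng)) %>%
  group_by(member_casual) %>%
  summarise(avg_distance_m = mean(ride_distance_m, na.rm = TRUE))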

However, it occurs to me that there could be some value in understanding any patterns in the stations from which casual rides tend to originate across the city. So, using the starting lat/lngs, I decide to build a bubble chart in Tableau to see whether any patterns emerge...
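For readers without Tableau, roughly the same bubble view can be approximated in ggplot2. This sketch assumes the q1_2020 frame read into R later in this write-up, since that is the only file format containing start_lat/start_lng:

library(dplyr)
library(ggplot2)

# One bubble per start station, sized by the number of casual rides starting there
q1_2020 %>%
  filter(member_casual == "casual") %>%
  group_by(start_station_name) %>%
  summarise(rides = n(),
            lat = mean(start_lat, na.rm = TRUE),
            lng = mean(start_lng, na.rm = TRUE),
            .groups = "drop") %>%
  ggplot(aes(x = lng, y = lat, size = rides)) +
  geom_point(alpha = 0.4) +
  coord_quickmap() +
  labs(title = "Casual ride starts by station", size = "Rides")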

From the Tableau graphics above, it looks like casual ridership is most prevalent along the Chicago Lakefront and Downtown areas with decreasing activity as we move into the suburbs to the west, north, and south. I will dig a bit deeper into these specific stations below.

Up to this point I have been exploring individual monthly data sets. I now want to aggregate the data so that I can run these same (and additional) analyses on a full-year basis. The full year's data is too large to handle in Excel, so I will move to R...
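As an aside, if the trip files all shared one schema, the whole folder could be read and stacked in a single pass with a pattern like the sketch below. The 2019-2020 files do not share a schema, which is why each quarter is read and renamed individually in the next section.

library(readr)
library(dplyr)
library(purrr)

# Read every .csv in the folder and stack the results into one data frame;
# this assumes identical column names across files, which is not the case here
csv_files <- list.files("C:/Users/shawn/Desktop/Divvy_Exercise/csv",
                        pattern = "\\.csv$", full.names = TRUE)
all_months <- map_dfr(csv_files, read_csv)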

Analyzing Data with R

Setting up the R environment

First, I install and load the following:

> install.packages("tidyverse")

Installing package into ‘C:/Users/shawn/Documents/R/win-library/4.0’

(as ‘lib’ is unspecified)

--- Please select a CRAN mirror for use in this session ---

trying URL 'https://cran.microsoft.com/bin/windows/contrib/4.0/tidyverse_1.3.1.zip'

Content type 'application/zip' length 430040 bytes (419 KB)

downloaded 419 KB


package ‘tidyverse’ successfully unpacked and MD5 sums checked


The downloaded binary packages are in

        C:\Users\shawn\AppData\Local\Temp\RtmpysADOy\downloaded_packages

> library(tidyverse)  #helps wrangle data

-- Attaching packages --------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.3     v purrr   0.3.4

v tibble  3.1.1     v dplyr   1.0.5

v tidyr   1.1.3     v stringr 1.4.0

v readr   1.4.0     v forcats 0.5.1

-- Conflicts ------------------------------------------ tidyverse_conflicts() --

x dplyr::filter() masks stats::filter()

x dplyr::lag()    masks stats::lag()

> library(lubridate)  #helps wrangle date attributes


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:


    date, intersect, setdiff, union


> library(dplyr)

> library(readr)

> library(ggplot2)  

I now set my working directory...

> setwd("C:/Users/shawn/Desktop/Divvy_Exercise/csv") 

> getwd() 

[1] "C:/Users/shawn/Desktop/Divvy_Exercise/csv"

Collect and import data into R

The data is available as quarterly .csv files, so I now import the four quarterly datasets into R...

> q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")


-- Column specification --------------------------------------------------------

cols(

  `01 - Rental Details Rental ID` = col_double(),

  `01 - Rental Details Local Start Time` = col_datetime(format = ""),

  `01 - Rental Details Local End Time` = col_datetime(format = ""),

  `01 - Rental Details Bike ID` = col_double(),

  `01 - Rental Details Duration In Seconds Uncapped` = col_number(),

  `03 - Rental Start Station ID` = col_double(),

  `03 - Rental Start Station Name` = col_character(),

  `02 - Rental End Station ID` = col_double(),

  `02 - Rental End Station Name` = col_character(),

  `User Type` = col_character(),

  `Member Gender` = col_character(),

  `05 - Member Details Member Birthday Year` = col_double()

)

> q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")


-- Column specification --------------------------------------------------------

cols(

  trip_id = col_double(),

  start_time = col_datetime(format = ""),

  end_time = col_datetime(format = ""),

  bikeid = col_double(),

  tripduration = col_number(),

  from_station_id = col_double(),

  from_station_name = col_character(),

  to_station_id = col_double(),

  to_station_name = col_character(),

  usertype = col_character(),

  gender = col_character(),

  birthyear = col_double()

)


> q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")


-- Column specification --------------------------------------------------------

cols(

  trip_id = col_double(),

  start_time = col_datetime(format = ""),

  end_time = col_datetime(format = ""),

  bikeid = col_double(),

  tripduration = col_number(),

  from_station_id = col_double(),

  from_station_name = col_character(),

  to_station_id = col_double(),

  to_station_name = col_character(),

  usertype = col_character(),

  gender = col_character(),

  birthyear = col_double()

)


> q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")


-- Column specification --------------------------------------------------------

cols(

  ride_id = col_character(),

  rideable_type = col_character(),

  started_at = col_datetime(format = ""),

  ended_at = col_datetime(format = ""),

  start_station_name = col_character(),

  start_station_id = col_double(),

  end_station_name = col_character(),

  end_station_id = col_double(),

  start_lat = col_double(),

  start_lng = col_double(),

  end_lat = col_double(),

  end_lng = col_double(),

  member_casual = col_character()

)

Wrangle data by making columns consistent and merging into a single dataframe

As the output from the read_csv() calls above suggests, the files do not share a single set of column names: q3_2019 and q4_2019 use one naming scheme, q2_2019 uses verbose "Rental Details" names, and q1_2020 uses the current naming convention...

> colnames(q3_2019)

 [1] "trip_id"           "start_time"        "end_time"         

 [4] "bikeid"            "tripduration"      "from_station_id"  

 [7] "from_station_name" "to_station_id"     "to_station_name"  

[10] "usertype"          "gender"            "birthyear"        

> colnames(q4_2019)

 [1] "trip_id"           "start_time"        "end_time"         

 [4] "bikeid"            "tripduration"      "from_station_id"  

 [7] "from_station_name" "to_station_id"     "to_station_name"  

[10] "usertype"          "gender"            "birthyear"        

> colnames(q2_2019)

 [1] "01 - Rental Details Rental ID"                   

 [2] "01 - Rental Details Local Start Time"            

 [3] "01 - Rental Details Local End Time"              

 [4] "01 - Rental Details Bike ID"                     

 [5] "01 - Rental Details Duration In Seconds Uncapped"

 [6] "03 - Rental Start Station ID"                    

 [7] "03 - Rental Start Station Name"                  

 [8] "02 - Rental End Station ID"                      

 [9] "02 - Rental End Station Name"                    

[10] "User Type"                                       

[11] "Member Gender"                                   

[12] "05 - Member Details Member Birthday Year"        

> colnames(q1_2020)

 [1] "ride_id"            "rideable_type"     

 [3] "started_at"         "ended_at"          

 [5] "start_station_name" "start_station_id"  

 [7] "end_station_name"   "end_station_id"    

 [9] "start_lat"          "start_lng"         

[11] "end_lat"            "end_lng"           

[13] "member_casual"

So, I need to standardize all column names before joining the files. I will rename columns according to the q1_2020 file as this will be the naming convention used by Cyclistic moving forward...

> (q4_2019 <- rename(q4_2019

+                    ,ride_id = trip_id

+                    ,rideable_type = bikeid 

+                    ,started_at = start_time  

+                    ,ended_at = end_time  

+                    ,start_station_name = from_station_name 

+                    ,start_station_id = from_station_id 

+                    ,end_station_name = to_station_name 

+                    ,end_station_id = to_station_id 

+                    ,member_casual = usertype))

# A tibble: 704,054 x 12

    ride_id started_at          ended_at            rideable_type

      <dbl> <dttm>              <dttm>                      <dbl>

 1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20          2215

 2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34          6328

 3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43          3003

 4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43          3275

 5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42          5294

 6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51          1891

 7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45          1061

 8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16          1274

 9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18          6011

10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46          2957

# ... with 704,044 more rows, and 8 more variables:

#   tripduration <dbl>, start_station_id <dbl>,

#   start_station_name <chr>, end_station_id <dbl>,

#   end_station_name <chr>, member_casual <chr>, gender <chr>,

#   birthyear <dbl>

> (q3_2019 <- rename(q3_2019

+                    ,ride_id = trip_id

+                    ,rideable_type = bikeid 

+                    ,started_at = start_time  

+                    ,ended_at = end_time  

+                    ,start_station_name = from_station_name 

+                    ,start_station_id = from_station_id 

+                    ,end_station_name = to_station_name 

+                    ,end_station_id = to_station_id 

+                    ,member_casual = usertype))

# A tibble: 1,640,718 x 12

    ride_id started_at          ended_at            rideable_type

      <dbl> <dttm>              <dttm>                      <dbl>

 1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41          3591

 2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44          5353

 3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42          6180

 4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10          5540

 5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26          6014

 6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31          4941

 7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12          3770

 8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16          5442

 9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57          2957

10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14          6091

# ... with 1,640,708 more rows, and 8 more variables:

#   tripduration <dbl>, start_station_id <dbl>,

#   start_station_name <chr>, end_station_id <dbl>,

#   end_station_name <chr>, member_casual <chr>, gender <chr>,

#   birthyear <dbl>

> (q2_2019 <- rename(q2_2019

+                    ,ride_id = "01 - Rental Details Rental ID"

+                    ,rideable_type = "01 - Rental Details Bike ID" 

+                    ,started_at = "01 - Rental Details Local Start Time"  

+                    ,ended_at = "01 - Rental Details Local End Time"  

+                    ,start_station_name = "03 - Rental Start Station Name" 

+                    ,start_station_id = "03 - Rental Start Station ID"

+                    ,end_station_name = "02 - Rental End Station Name" 

+                    ,end_station_id = "02 - Rental End Station ID"

+                    ,member_casual = "User Type"))

# A tibble: 1,108,163 x 12

    ride_id started_at          ended_at            rideable_type

      <dbl> <dttm>              <dttm>                      <dbl>

 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251

 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226

 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649

 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151

 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270

 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123

 7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41          6418

 8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11          4513

 9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44          3280

10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39          5534

# ... with 1,108,153 more rows, and 8 more variables:

#   01 - Rental Details Duration In Seconds Uncapped <dbl>,

#   start_station_id <dbl>, start_station_name <chr>,

#   end_station_id <dbl>, end_station_name <chr>,

#   member_casual <chr>, Member Gender <chr>,

#   05 - Member Details Member Birthday Year <dbl>

Let's inspect the results of my column name changes using the str() command...

> str(q1_2020)

spec_tbl_df[,13] [426,887 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

 $ ride_id           : chr [1:426887] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...

 $ rideable_type     : chr [1:426887] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...

 $ started_at        : POSIXct[1:426887], format: "2020-01-21 20:06:59" ...

 $ ended_at          : POSIXct[1:426887], format: "2020-01-21 20:14:30" ...

 $ start_station_name: chr [1:426887] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...

 $ start_station_id  : num [1:426887] 239 234 296 51 66 212 96 96 212 38 ...

 $ end_station_name  : chr [1:426887] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...

 $ end_station_id    : num [1:426887] 326 318 117 24 212 96 212 212 96 100 ...

 $ start_lat         : num [1:426887] 42 42 41.9 41.9 41.9 ...

 $ start_lng         : num [1:426887] -87.7 -87.7 -87.6 -87.6 -87.6 ...

 $ end_lat           : num [1:426887] 42 42 41.9 41.9 41.9 ...

 $ end_lng           : num [1:426887] -87.7 -87.7 -87.7 -87.6 -87.6 ...

 $ member_casual     : chr [1:426887] "member" "member" "member" "member" ...

 - attr(*, "spec")=

  .. cols(

  ..   ride_id = col_character(),

  ..   rideable_type = col_character(),

  ..   started_at = col_datetime(format = ""),

  ..   ended_at = col_datetime(format = ""),

  ..   start_station_name = col_character(),

  ..   start_station_id = col_double(),

  ..   end_station_name = col_character(),

  ..   end_station_id = col_double(),

  ..   start_lat = col_double(),

  ..   start_lng = col_double(),

  ..   end_lat = col_double(),

  ..   end_lng = col_double(),

  ..   member_casual = col_character()

  .. )

> str(q4_2019)

spec_tbl_df[,12] [704,054 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

 $ ride_id           : num [1:704054] 25223640 25223641 25223642 25223643 25223644 ...

 $ started_at        : POSIXct[1:704054], format: "2019-10-01 00:01:39" ...

 $ ended_at          : POSIXct[1:704054], format: "2019-10-01 00:17:20" ...

 $ rideable_type     : num [1:704054] 2215 6328 3003 3275 5294 ...

 $ tripduration      : num [1:704054] 940 258 850 2350 1867 ...

 $ start_station_id  : num [1:704054] 20 19 84 313 210 156 84 156 156 336 ...

 $ start_station_name: chr [1:704054] "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...

 $ end_station_id    : num [1:704054] 309 241 199 290 382 226 142 463 463 336 ...

 $ end_station_name  : chr [1:704054] "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...

 $ member_casual     : chr [1:704054] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...

 $ gender            : chr [1:704054] "Male" "Male" "Female" "Male" ...

 $ birthyear         : num [1:704054] 1987 1998 1991 1990 1987 ...

 - attr(*, "spec")=

  .. cols(

  ..   trip_id = col_double(),

  ..   start_time = col_datetime(format = ""),

  ..   end_time = col_datetime(format = ""),

  ..   bikeid = col_double(),

  ..   tripduration = col_number(),

  ..   from_station_id = col_double(),

  ..   from_station_name = col_character(),

  ..   to_station_id = col_double(),

  ..   to_station_name = col_character(),

  ..   usertype = col_character(),

  ..   gender = col_character(),

  ..   birthyear = col_double()

  .. )

> str(q3_2019)

spec_tbl_df[,12] [1,640,718 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

 $ ride_id           : num [1:1640718] 23479388 23479389 23479390 23479391 23479392 ...

 $ started_at        : POSIXct[1:1640718], format: "2019-07-01 00:00:27" ...

 $ ended_at          : POSIXct[1:1640718], format: "2019-07-01 00:20:41" ...

 $ rideable_type     : num [1:1640718] 3591 5353 6180 5540 6014 ...

 $ tripduration      : num [1:1640718] 1214 1048 1554 1503 1213 ...

 $ start_station_id  : num [1:1640718] 117 381 313 313 168 300 168 313 43 43 ...

 $ start_station_name: chr [1:1640718] "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...

 $ end_station_id    : num [1:1640718] 497 203 144 144 62 232 62 144 195 195 ...

 $ end_station_name  : chr [1:1640718] "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...

 $ member_casual     : chr [1:1640718] "Subscriber" "Customer" "Customer" "Customer" ...

 $ gender            : chr [1:1640718] "Male" NA NA NA ...

 $ birthyear         : num [1:1640718] 1992 NA NA NA NA ...

 - attr(*, "spec")=

  .. cols(

  ..   trip_id = col_double(),

  ..   start_time = col_datetime(format = ""),

  ..   end_time = col_datetime(format = ""),

  ..   bikeid = col_double(),

  ..   tripduration = col_number(),

  ..   from_station_id = col_double(),

  ..   from_station_name = col_character(),

  ..   to_station_id = col_double(),

  ..   to_station_name = col_character(),

  ..   usertype = col_character(),

  ..   gender = col_character(),

  ..   birthyear = col_double()

  .. )

> str(q2_2019)

spec_tbl_df[,12] [1,108,163 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

 $ ride_id                                         : num [1:1108163] 22178529 22178530 22178531 22178532 22178533 ...

 $ started_at                                      : POSIXct[1:1108163], format: "2019-04-01 00:02:22" ...

 $ ended_at                                        : POSIXct[1:1108163], format: "2019-04-01 00:09:48" ...

 $ rideable_type                                   : num [1:1108163] 6251 6226 5649 4151 3270 ...

 $ 01 - Rental Details Duration In Seconds Uncapped: num [1:1108163] 446 1048 252 357 1007 ...

 $ start_station_id                                : num [1:1108163] 81 317 283 26 202 420 503 260 211 211 ...

 $ start_station_name                              : chr [1:1108163] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...

 $ end_station_id                                  : num [1:1108163] 56 59 174 133 129 426 500 499 211 211 ...

 $ end_station_name                                : chr [1:1108163] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...

 $ member_casual                                   : chr [1:1108163] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...

 $ Member Gender                                   : chr [1:1108163] "Male" "Female" "Male" "Male" ...

 $ 05 - Member Details Member Birthday Year        : num [1:1108163] 1975 1984 1990 1993 1992 ...

 - attr(*, "spec")=

  .. cols(

  ..   `01 - Rental Details Rental ID` = col_double(),

  ..   `01 - Rental Details Local Start Time` = col_datetime(format = ""),

  ..   `01 - Rental Details Local End Time` = col_datetime(format = ""),

  ..   `01 - Rental Details Bike ID` = col_double(),

  ..   `01 - Rental Details Duration In Seconds Uncapped` = col_number(),

  ..   `03 - Rental Start Station ID` = col_double(),

  ..   `03 - Rental Start Station Name` = col_character(),

  ..   `02 - Rental End Station ID` = col_double(),

  ..   `02 - Rental End Station Name` = col_character(),

  ..   `User Type` = col_character(),

  ..   `Member Gender` = col_character(),

  ..   `05 - Member Details Member Birthday Year` = col_double()

  .. )

Next, I will use mutate() to convert ride_id and rideable_type to character so that the data frames can stack correctly...

> q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)

+                    ,rideable_type = as.character(rideable_type)) 

> q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)

+                    ,rideable_type = as.character(rideable_type)) 

> q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)

+                    ,rideable_type = as.character(rideable_type))

Now, I use bind_rows() to stack each quarter's dataframe into one large dataframe...

> all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)

I have already used the lat/lng data to create my Tableau bubble chart above, so I will remove start_lat, start_lng, end_lat, and end_lng. I will also remove birthyear and gender, since that data is sporadic and no longer collected. Finally, the 2020 Q1 data does not include a "tripduration" column, so I will drop the existing duration columns and recalculate ride length for all trips below...

> all_trips <- all_trips %>%  

+   select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

Clean and add data to prepare for analysis

Let's inspect our new table...

> colnames(all_trips)  #List of column names

[1] "ride_id"            "started_at"        

[3] "ended_at"           "rideable_type"     

[5] "start_station_id"   "start_station_name"

[7] "end_station_id"     "end_station_name"  

[9] "member_casual"   

  

> nrow(all_trips)  #How many rows are in data frame?

[1] 3879822


> dim(all_trips)  #Dimensions of the data frame?

[1] 3879822       9

> head(all_trips)  #See the first 6 rows of data frame.  Also tail(all_trips)

# A tibble: 6 x 9

  ride_id  started_at          ended_at            rideable_type

  <chr>    <dttm>              <dttm>              <chr>        

1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48 6251         

2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30 6226         

3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19 5649         

4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58 4151         

5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13 3270         

6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56 3123         

# ... with 5 more variables: start_station_id <dbl>,

#   start_station_name <chr>, end_station_id <dbl>,

#   end_station_name <chr>, member_casual <chr>

> str(all_trips)  #See list of columns and data types (numeric, character, etc)

tibble[,9] [3,879,822 x 9] (S3: tbl_df/tbl/data.frame)

 $ ride_id           : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...

 $ started_at        : POSIXct[1:3879822], format: "2019-04-01 00:02:22" ...

 $ ended_at          : POSIXct[1:3879822], format: "2019-04-01 00:09:48" ...

 $ rideable_type     : chr [1:3879822] "6251" "6226" "5649" "4151" ...

 $ start_station_id  : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...

 $ start_station_name: chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...

 $ end_station_id    : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...

 $ end_station_name  : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...

 $ member_casual     : chr [1:3879822] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...

> summary(all_trips)  #Statistical summary of data. Mainly for numerics

   ride_id            started_at                 

 Length:3879822     Min.   :2019-04-01 00:02:22  

 Class :character   1st Qu.:2019-06-23 07:49:09  

 Mode  :character   Median :2019-08-14 17:43:38  

                    Mean   :2019-08-26 00:49:59  

                    3rd Qu.:2019-10-12 12:10:21  

                    Max.   :2020-03-31 23:51:34  

                                                 

    ended_at                   rideable_type     

 Min.   :2019-04-01 00:09:48   Length:3879822    

 1st Qu.:2019-06-23 08:20:27   Class :character  

 Median :2019-08-14 18:02:04   Mode  :character  

 Mean   :2019-08-26 01:14:37                     

 3rd Qu.:2019-10-12 12:36:16                     

 Max.   :2020-05-19 20:10:34                     

                                                 

 start_station_id start_station_name end_station_id 

 Min.   :  1.0    Length:3879822     Min.   :  1.0  

 1st Qu.: 77.0    Class :character   1st Qu.: 77.0  

 Median :174.0    Mode  :character   Median :174.0  

 Mean   :202.9                       Mean   :203.8  

 3rd Qu.:291.0                       3rd Qu.:291.0  

 Max.   :675.0                       Max.   :675.0  

                                     NA's   :1      

 end_station_name   member_casual     

 Length:3879822     Length:3879822    

 Class :character   Class :character  

 Mode  :character   Mode  :character  


There are a few problems I will need to fix:

(1) In the "member_casual" column, there are two names for members ("member" and "Subscriber") and two names for casual riders ("Customer" and "casual"). I  will need to consolidate that from four to two labels.

(2) Currently, the data can only be aggregated at the ride-level, which is too granular for all the analyses I want to conduct. I want to add some additional columns of data -- such as day, month, year -- that provide additional opportunities to aggregate the data.

(3) I  want to add a calculated field for length of ride since the 2020Q1 data did not have the "tripduration" column. I will add "ride_length" to the entire dataframe for consistency.

(4) There are some rides where tripduration shows up as negative, including several hundred rides where Cyclistic took bikes out of circulation for Quality Control reasons. I want to delete these rides.

I  will use mutate() again and, in the "member_casual" column, replace "Subscriber" with "member" and "Customer" with "casual"...

> all_trips <-  all_trips %>% 

+   mutate(member_casual = recode(member_casual

+                            ,"Subscriber" = "member"

+                            ,"Customer" = "casual"))

Let's check the results...

> table(all_trips$member_casual)

 casual  member 

 905954 2973868 

I will now add columns that list the date, month, day, and year of each ride. This will allow me to aggregate ride data for each month, day, or year. Before completing these operations I could only aggregate at the ride level...

> all_trips$date <- as.Date(all_trips$started_at)

> all_trips$month <- format(as.Date(all_trips$date), "%m")

> all_trips$day <- format(as.Date(all_trips$date), "%d")

> all_trips$year <- format(as.Date(all_trips$date), "%Y")

> all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

I now add a "ride_length" calculation to all_trips (in seconds) and convert from factor to numeric so we can run calculations on the data...

> all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

> is.factor(all_trips$ride_length)

[1] FALSE

> all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))

> is.numeric(all_trips$ride_length)

[1] TRUE

Now I will remove all trips whose start station is "HQ QR" (maintenance) and all trips with a duration of less than 0 seconds...

> all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

Descriptive analysis

Let's now do a descriptive analysis on ride length (in secs) for all rides (both casual and member)...

> mean(all_trips_v2$ride_length) #straight average (total ride length / rides)

[1] 1479.139

> median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths

[1] 712

> max(all_trips_v2$ride_length) #longest ride

[1] 9387024

> min(all_trips_v2$ride_length) #shortest ride

[1] 1

I will now do this same analysis for casual and member users...

> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)

  all_trips_v2$member_casual all_trips_v2$ride_length

1                     casual                3552.7502

2                     member                 850.0662

> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)

  all_trips_v2$member_casual all_trips_v2$ride_length

1                     casual                     1546

2                     member                      589

> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)

  all_trips_v2$member_casual all_trips_v2$ride_length

1                     casual                  9387024

2                     member                  9056634

> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)

  all_trips_v2$member_casual all_trips_v2$ride_length

1                     casual                        2

2                     member                        1

So, looking at the above, I see that over the course of the year the mean ride length for casual riders is roughly 4x that of member riders (about 3,553 vs. 850 seconds), while the median is about 2.6x as long (1,546 vs. 589 seconds). Let's do a day-of-week breakdown...

> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

   all_trips_v2$member_casual all_trips_v2$day_of_week

1                      casual                   Friday

2                      member                   Friday

3                      casual                   Monday

4                      member                   Monday

5                      casual                 Saturday

6                      member                 Saturday

7                      casual                   Sunday

8                      member                   Sunday

9                      casual                 Thursday

10                     member                 Thursday

11                     casual                  Tuesday

12                     member                  Tuesday

13                     casual                Wednesday

14                     member                Wednesday

   all_trips_v2$ride_length

1                 3773.8351

2                  824.5305

3                 3372.2869

4                  842.5726

5                 3331.9138

6                  968.9337

7                 3581.4054

8                  919.9746

9                 3682.9847

10                 823.9278

11                3596.3599

12                 826.1427

13                3718.6619

14                 823.9996

The days of the week are out of order, so I will fix that by converting day_of_week to an ordered factor...

> all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

> aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

   all_trips_v2$member_casual all_trips_v2$day_of_week

1                      casual                   Sunday

2                      member                   Sunday

3                      casual                   Monday

4                      member                   Monday

5                      casual                  Tuesday

6                      member                  Tuesday

7                      casual                Wednesday

8                      member                Wednesday

9                      casual                 Thursday

10                     member                 Thursday

11                     casual                   Friday

12                     member                   Friday

13                     casual                 Saturday

14                     member                 Saturday

   all_trips_v2$ride_length

1                 3581.4054

2                  919.9746

3                 3372.2869

4                  842.5726

5                 3596.3599

6                  826.1427

7                 3718.6619

8                  823.9996

9                 3682.9847

10                 823.9278

11                3773.8351

12                 824.5305

13                3331.9138

14                 968.9337

I will now calculate, for members and casual riders, both the number of rides and the average ride length by day of week...

> all_trips_v2 %>% 

+   mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()

+   group_by(member_casual, weekday) %>%  #groups by usertype and weekday

+   summarise(number_of_rides = n()#calculates the number of rides and average duration 

+   ,average_duration = mean(ride_length)) %>% # calculates the average duration

+   arrange(member_casual, weekday)


# A tibble: 14 x 4

# Groups:   member_casual [2]

   member_casual weekday number_of_rides average_duration

   <chr>         <ord>             <int>            <dbl>

 1 casual        Sun              181293            3581.

 2 casual        Mon              103296            3372.

 3 casual        Tue               90510            3596.

 4 casual        Wed               92457            3719.

 5 casual        Thu              102679            3683.

 6 casual        Fri              122404            3774.

 7 casual        Sat              209543            3332.

 8 member        Sun              267965             920.

 9 member        Mon              472196             843.

10 member        Tue              508445             826.

11 member        Wed              500329             824.

12 member        Thu              484177             824.

13 member        Fri              452790             825.

14 member        Sat              287958             969.

Recall the Tableau graphic above showing the locations of the bike stations with the greatest number of casual ride starts? Let's list the names of the top 20 stations by number of casual rides...

> all_trips_v2 %>%

+     group_by(start_station_name, member_casual) %>%

+     summarise(number_of_ride = n(), .groups = 'drop') %>%

+     filter(start_station_name != "", member_casual != 'member') %>%

+     arrange(-number_of_ride) %>%

+     head(n=20) %>%

+     select(-member_casual)

# A tibble: 20 x 2

   start_station_name           number_of_ride

   <chr>                                 <int>

 1 Streeter Dr & Grand Ave               53415

 2 Lake Shore Dr & Monroe St             39686

 3 Millennium Park                       21901

 4 Michigan Ave & Oak St                 21633

 5 Shedd Aquarium                        20781

 6 Lake Shore Dr & North Blvd            19150

 7 Theater on the Lake                   15363

 8 Dusable Harbor                        12694

 9 Michigan Ave & Washington St          12379

10 Adler Planetarium                     12031

11 Michigan Ave & 8th St                  9710

12 Montrose Harbor                        8342

13 Indiana Ave & Roosevelt Rd             8224

14 Columbus Dr & Randolph St              7909

15 Field Museum                           7647

16 McClurg Ct & Illinois St               7075

17 Michigan Ave & Jackson Blvd            7033

18 Clark St & Lincoln Ave                 6809

19 Clark St & Armitage Ave                6706

20 Lake Shore Dr & Ohio St                6538

Export summary file for further analysis

Now I will export a summary file in case I want to import this data into another tool at a later date...

counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

head(counts)

write.csv(counts, file = 'C:/Users/shawn/Desktop/Divvy_Exercise/avg_ride_length.csv')

Data Analysis Phase #5: Share

Guiding Questions:


Data Analysis Phase #5 Share Deliverable: Supporting visualizations and key findings

I will now create additional visualizations highlighting casual ridership characteristics and will begin preparing my presentation to stakeholders.

Let's visualize the number of rides by rider type...

all_trips_v2 %>% 

  mutate(weekday = wday(started_at, label = TRUE)) %>% 

  group_by(member_casual, weekday) %>% 

  summarise(number_of_rides = n()

            ,average_duration = mean(ride_length)) %>% 

  arrange(member_casual, weekday)  %>% 

  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +

  geom_col(position = "dodge")

Next, we will create a visualization of average ride duration (in minutes) for both groups...

all_trips_v2 %>% 

  mutate(weekday = wday(started_at, label = TRUE)) %>% 

  group_by(member_casual, weekday) %>% 

  summarise(number_of_rides = n()

            ,average_duration = mean(ride_length)/60) %>% 

  arrange(member_casual, weekday)  %>% 

  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +

  geom_col(position = "dodge")

Let's look at the time (24hr local time) of day that casual riders ride...

all_trips_v2 %>%

    mutate(hour_of_day = hour(round_date(started_at, unit ="hour"))) %>%

    filter(member_casual == 'casual') %>%

    group_by(hour_of_day, member_casual) %>%

    summarise(number_of_ride = n(), .groups = 'drop') %>%

    arrange(-number_of_ride) %>%

    ggplot(aes(x = hour_of_day, y = number_of_ride, fill = member_casual)) +

    geom_bar(position = 'dodge', stat = 'identity') +

    scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))

Finally, let's create two charts - one for casual riders and one for members - showing the daily number of rides over the whole year...

all_trips_v2 %>%

  group_by(date, member_casual) %>%

  summarise(number_of_ride = n(), .groups = 'drop') %>%

##    filter(member_casual == 'casual') %>%

  ggplot(aes(x = date, y = number_of_ride, fill = member_casual)) +

  geom_area(position = 'dodge', stat = 'identity') + 

  theme(axis.text.x = element_text(angle = 45)) +

  facet_wrap(~member_casual)

Data Analysis Phase #6: Act

Guiding Questions:


Data Analysis Phase #6 Act Deliverable: Top recommendations based on analysis

Completed Report/Presentation for Stakeholders

Cyclistic Ridership Characteristics

shawn.rampy@gmail.com   |   281-557-6145