Moshi Moshi, Ruby Desu.

A5 Writeup

Jad, Lynsey, Ruby

Intro to Data Visualization

9 May, 2022

Introduction

We continue to see the societal effects of Covid-19 long after its arrival in the United States in early 2020. One facet of life that has waxed and waned alongside the pandemic is the use of public transit. In New York City, where public transit is the most popular mode of transportation, the impact of the pandemic can be visualized and understood through an analysis of MTA subway ridership. As Yahoo! Finance proclaims, “NYC Subway Ridership Hits Pandemic High, Passes Pre-Omicron Peak”, a headline alluding to a return to the “normal” of pre-pandemic times. That being said, what is the new normal? Where and when have NYC subway stations bounced back to pre-pandemic levels? What inferences can we draw from different visualizations of the data that we could not have drawn otherwise? In short, where has NYC bounced back, and, equally important, where has it not?

Methodology

Data Source and Data Preparation

All of the data in the interactive visualization comes from the MTA turnstile website. The MTA records cumulative turnstile entries and exits for each station every four hours and publishes the readings as one file per one-week period. We wrote a Python script to scrape these files, covering January 2019 through April 2022, and combine them into one dataset. We then processed the data to group turnstiles by station and converted the entries and exits from cumulative numbers to a combined total count for each week.
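The conversion from cumulative readings to weekly station totals can be sketched as follows. This is a minimal illustration rather than our actual script: the tuple layout and the function name `weekly_station_totals` are assumptions made for the example, and each difference between consecutive readings is attributed to the week of the later reading.

```python
from collections import defaultdict


def weekly_station_totals(readings):
    """Turn cumulative turnstile readings into weekly totals per station.

    Each reading is a hypothetical tuple:
    (station, turnstile, week, time, cumulative_entries, cumulative_exits)
    where `time` is any sortable value (for example, hours since the start).
    """
    # Order readings so each turnstile's readings are consecutive and in time order.
    ordered = sorted(readings, key=lambda r: (r[0], r[1], r[3]))
    totals = defaultdict(int)
    previous = {}  # (station, turnstile) -> last (entries, exits) seen
    for station, turnstile, week, _time, entries, exits in ordered:
        key = (station, turnstile)
        if key in previous:
            prev_entries, prev_exits = previous[key]
            # The change in the cumulative counters is the traffic in this interval.
            delta = (entries - prev_entries) + (exits - prev_exits)
            totals[(station, week)] += delta
        previous[key] = (entries, exits)
    return dict(totals)


sample = [
    ("A", "t1", "2019-01-05", 0, 100, 50),
    ("A", "t1", "2019-01-05", 4, 110, 55),
    ("A", "t2", "2019-01-05", 0, 1000, 900),
    ("A", "t2", "2019-01-05", 4, 1020, 910),
    ("A", "t1", "2019-01-12", 8, 130, 60),
]
print(weekly_station_totals(sample))
```

The first reading of each turnstile only sets the starting point, so it contributes nothing to the totals on its own.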

Since the data had a large number of sudden increases, we filtered out any data point that was abnormally high (>1,500,000). On further investigation, we found that the MTA cumulative entry and exit counts were sometimes reset to a new, seemingly random number, which caused these very high and sometimes negative counts. Some turnstiles seemed to be counting down rather than up, so we added the counts of these turnstiles rather than ignoring them. While that removed some of the spikiness of the data, we also needed to check whether a data point was more than 1.6 times the previous known point, and if so, ignore it. We chose the factor of 1.6 by observation: it was low enough that the greatest spikes disappeared, yet high enough not to discard any valid increases.
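The filtering rules above can be sketched roughly as follows. The constant names and the `clean_counts` function are hypothetical, and taking the absolute value is our assumed reading of how the counts from turnstiles that count down were added. For brevity the sketch applies every rule to a single series of counts.

```python
MAX_WEEKLY_COUNT = 1_500_000  # anything above this is treated as a counter reset
SPIKE_FACTOR = 1.6            # maximum allowed jump relative to the last valid point


def clean_counts(weekly_counts):
    """Return the series with rejected points replaced by None."""
    cleaned = []
    last_valid = None
    for count in weekly_counts:
        # Assumption: a turnstile counting down yields a negative difference,
        # so its magnitude is kept instead of discarding the point.
        count = abs(count)
        if count > MAX_WEEKLY_COUNT:
            cleaned.append(None)
            continue
        # Reject points more than SPIKE_FACTOR times the previous known point.
        if last_valid is not None and last_valid > 0 and count > SPIKE_FACTOR * last_valid:
            cleaned.append(None)
            continue
        cleaned.append(count)
        last_valid = count
    return cleaned


print(clean_counts([1000, 1100, 2_000_000, 5000, 1200, -1150]))
```

Because each point is compared against the last accepted point rather than the last raw point, a single spike cannot raise the baseline for the points that follow it.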

Designing and Implementing the Heat Map

Originally, we intended to show which boroughs bounced back to pre-pandemic levels and which did not. We wanted to do this by representing the movement of individuals from borough to borough and station to station. We were working from hypotheses about which areas we expected to have higher rates of frontline workers, as opposed to more affluent areas with higher work-from-home rates. We realized this was not possible, given where train lines start and end and because the entry and exit data cannot track an individual’s path, only the raw counts of how many people entered or exited at a given point. We still wanted to map the changes in subway system utilization over the course of the Covid pandemic, so we decided to do it without creating paths. A simple solution was a heat map of individual subway stations showing how a snapshot of present ridership compares to a snapshot of pre-pandemic levels.

Subway stations and routes were plotted on a map of NYC presented through the Google Maps API. The map was styled to minimize distractions from the data, which meant removing vibrant colors and using grays for non-significant geographical data. We used two color schemes, each with a legend. Subway station color was linked to one of two key variables: the percentage of recovery since the beginning of the pandemic, and the overall busyness of the station in terms of current total entries and exits recorded. We added interactivity to the map, allowing users to zoom and to highlight specific stations for additional information. Dropdown menus allow users to focus on the subway stations with the top ten most or least percent recovery, or the top ten most or least current total riders.
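The two variables behind the station colors and the dropdown filters can be sketched as follows. This is an illustration under assumptions: percent recovery is taken to be current ridership as a percentage of a pre-pandemic baseline, and the function names and sample numbers are invented for the example.

```python
def percent_recovery(baseline, current):
    """Current ridership as a percentage of the pre-pandemic baseline, per station."""
    return {
        station: 100.0 * current[station] / baseline[station]
        for station in baseline
        if station in current and baseline[station] > 0
    }


def top_n(values, n=10, most=True):
    """Station names with the n highest (most=True) or lowest (most=False) values."""
    return sorted(values, key=values.get, reverse=most)[:n]


# Invented numbers, for illustration only.
baseline = {"Station A": 1000, "Station B": 2000, "Station C": 500}
current = {"Station A": 1500, "Station B": 1000, "Station C": 500}

recovery = percent_recovery(baseline, current)
print(top_n(recovery, n=2))              # most recovered
print(top_n(recovery, n=1, most=False))  # least recovered
print(top_n(current, n=2))               # busiest by current total riders
```

The same `top_n` helper serves both dropdown options, since it only needs a mapping from station name to the value being ranked.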

Designing and Implementing the Line Graph

In addition, a line graph depicting overall ridership over time was added to give broader context for the effect of the pandemic on subway ridership, and line graphs for individual stations can be plotted and highlighted for station-to-station comparison over time. If time allows, red vertical lines rising from the time axis can be added to the line graph to give context to peaks and falls, such as lockdowns and variant surges. Ideally, we would also add animation to capture the reader’s attention, drawing the line along a path so that it moves as time passes.

The primary design concern was simply getting the data to display properly. Once that was done, we let the multi-train line graph use a wide range of colors to distinguish between the lines corresponding to those trains. The next focus was highlighting and brushing, which is where graying out the lines that are not hovered over proved very useful. Both line graphs also include tooltips. On the single overall line, the tooltip presents the total entries across all subways at that point in the year. On the multi-line graph, the tooltip displays the subway station name associated with that line of data. We also intended to add a level of interactivity in which users are guided through the data by comparing stations or by testing their own assumptions. For example, which lines do you think recovered the most, and which areas might be staying home or even leaving the city for less densely populated, less disease-susceptible areas? We designed this around dropdown menus, which allow users to focus on the subway stations with the top ten most or least percent recovery, or the top ten most or least current total riders. Users can still hover and use the tooltip alongside the dropdown menu.

Tools, Libraries, Platforms and other Implementation Issues to consider

While we initially began with a scrolly story, we became concerned that our actual story had not yet taken shape, so we had to focus on getting the visualizations working. Other implementation components worth noting include the Google Maps API, Observable notebooks, forking from useful existing visualizations, experimenting with the data in Tableau, using D3.js to recreate the correlations we observed in Tableau, and cleaning data that caused issues. As for the scrolly story, we would love to implement it with our visualizations, but at the moment they take too long to load, which would make the scrolly story difficult to render at a pleasing rate for the viewer.

Discussion – success? Audience reaction? What new insights or practices has your system enabled, for the intended audience, general audience, or yourself? What would we do differently and what did work?

Our major successes were the insights we gained from analyzing the data. The first insight was what we expected: NYC subway ridership had dropped significantly and had not recovered. The insight we were not expecting was that ridership was not returning to all parts of NYC equally. In fact, stations in the Bronx such as Longwood and 138/Grand Concourse have actually increased to 150% of their pre-pandemic traffic (although we must admit they are rare outliers!). Every station that is struggling to recover to its pre-pandemic levels is located in Manhattan, and these are also the busiest subway stations in NYC. However, it is again in the outliers where we find the insights: Flushing-Main in Queens, while among the top 5% busiest stations in New York City, has recovered significantly more than the busy stations in Manhattan. Subway traffic is rebounding at a much faster rate in the outlying boroughs than in Manhattan.

While we don’t have an obvious explanation for these outliers, one thing we could do differently is explore other datasets about New York City to look for a trend among these recovering subway stations. Relating the ridership data to other data could offer interesting and original interpretations and a novel way to look at this dataset. Some possible datasets to look into include housing market activity, rental unit prices, and income by geographic area.

One aspect that could be improved is the user experience of our visualizations. Some of our audience had initial confusion about what exactly the map is saying about which stations have recovered and which haven’t. To mitigate confusion, we could make the components build on each other in a step-by-step process, to guide the user through the data and show them the insights we found ourselves.

Conclusions and Future Work

Something else we could have done differently is to make a clickable station map that lets the user pick two stations and returns their lines in a graph for comparison. We could also have focused more on how to build the scrolly story after realizing our initial idea would not be compatible with some of our new Observable creations. In practice, this likely means not building the scrolly story first, before knowing what the visualizations will look like and how much space and loading time they take up. We also need to make it more obvious that the map can be changed using the same dropdown menu that we used for the line graphs. With more access to private or well-documented open-source data, we would want to see how economy, income, and race relate to what we already have. We searched various places on the web but could not find such data covering all three pandemic years, or the data was not granular enough to show neighborhood-by-neighborhood differences. Overall, we are happy to have been able to connect the dots from our assumptions to the realities of how public transit has changed before, during, and nearing the end of the pandemic, and we would love to expand the project if the data were feasibly attainable.

Acknowledgements and References

Data from: http://web.mta.info/developers/turnstile.html

Observable Forked from: https://observablehq.com/@d3/multi-line-chart

Scrolly-story template

Observable student’s previous MTA projects

User testing commentary from: students and teachers at Columbia University during panels