Tuesday, July 28, 2015

PATH Marey Train Map

The idea for this visualization has been bouncing around my head for a while now, but I have had a tough time figuring out how to execute it.  The original Marey Train Schedule visualization catalogs trips from Paris to Lyon in 1885.  This visualization may look familiar to you if you're an Edward Tufte fan, as it is on the cover of his book The Visual Display of Quantitative Information.

Similar graphs have been created with other tools in the past (most notably in D3 by Mike Bostock), but I was unable to find an equivalent in Tableau.  This visualization focuses on the PATH train, which is primarily used to take commuters between New Jersey and New York City.

Use the drop-downs at the top of the page to select a start and end station.  The Marey plot makes up the top half of the page, with some supporting charts for train times and locations below it.

Since this took a bit to figure out, I thought it might be helpful to talk about how I got from the original data source to the end product.  Below the visualization, I have step-by-step instructions detailing the process that led to the end product.  I have also open sourced all of my code on GitHub which you can access here.  If you have any questions, as always, I am happy to answer them on Twitter - @McGovey.




Here is an image of the original.





The only data source was the GTFS-formatted PATH schedule data, which I pulled into Alteryx and visualized in Tableau.  To start, you can download the full PATH train schedule (or a different GTFS train schedule of your choice).

I knew the Marey Train plot was not something I could build from the data in its native format, so some level of data cleansing was going to be required.  I decided to pull everything into Alteryx and build my tables there.  Below are the specifics of what I had to do in Alteryx for those of you who can't open the yxmd file and see for yourselves.





First, we want a single table to work with as a base, so start by combining the files for routes, individual train trips, station stops, stop times, and days that the train runs.  To build the x-axis for the chart, we need a numeric field; while the arrival_time field in the stop_times table is really helpful, we want to convert it to a decimal format.  For example, 15:30:00 becomes 15.5 (note: the tick marks on the x-axis only use integers).  That is all a single-table file needs, but once you pull it into Tableau, you will have a tough time creating a dynamic station stop parameter.
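For anyone doing this step outside Alteryx, here is a minimal Python sketch of the time conversion. The function name is made up for illustration, and it assumes arrival_time is always a full HH:MM:SS string. GTFS allows hours past 23 for trips that run beyond midnight of the service day, so the sketch splits the string by hand rather than parsing it as a clock time.

```python
def gtfs_time_to_decimal_hours(arrival_time: str) -> float:
    """Convert a GTFS 'HH:MM:SS' string to decimal hours.

    Hypothetical helper for illustration; not part of the Alteryx workflow.
    """
    # Split manually: GTFS hours can exceed 23 (e.g. "25:15:00"),
    # which a normal clock-time parser would reject.
    hours, minutes, seconds = (int(part) for part in arrival_time.strip().split(":"))
    return hours + minutes / 60 + seconds / 3600


print(gtfs_time_to_decimal_hours("15:30:00"))  # 15.5
```

The result is a plain number, so it can go straight onto a continuous x-axis.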

The stops on a given route vary based on the time of day and the route taken, so if you try to use the stop_sequence, you will get multiple data points per stop.  To get around this, we have to do some work.  For a while, my visualization relied on a field that was built by grouping the data by route and taking the average time it takes to reach each stop.  This worked fine with an earlier dataset, and it wasn't until my Slalom colleague, Pankil Shah, and I switched to the PATH data that we noticed just how big a problem this approach would cause.  So instead, what we're ultimately trying to achieve is a unique list of the possible origins and destinations paired with the possible stops between them.

We do this by creating a cross tab (column-wise format) of all the possible stations for a given train trip, then joining it back to the original table on trip ID.  This gives us data in this format:


trip_id station arrtime station1name station2name
1 station1name 0.2 0.2 0.3
2 station2name 0.3 0.2 0.3
3 station3name 0.6 0.3
4 station4name 0.1 0.5


Once this table has been created, transpose it back into a vertical format and filter out the rows that do not have values for both the origin and destination stations, to get a list of possible stations.  To match all unique destination-origin pairs, a TransitID value is created from the above list.

The unique destination-origin pair list is then joined to the original table on trip ID which will create a giant list of all possible stops for a given trip.  To get to a more manageable data set, filter out any stations that do not occur between the time of departure and the time of arrival.
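The cross tab, transpose, and join steps are specific to Alteryx, but the end result can be sketched in plain Python. This is only an illustration under a few assumptions: the sample rows and station names ("A", "B", "C") are made up, the function name is hypothetical, and the TransitID here is just the origin and destination names joined with a pipe, which may not match how the key is built in the actual workflow.

```python
from collections import defaultdict
from itertools import combinations

# Made-up sample rows: (trip_id, station, arrival time in decimal hours)
stop_times = [
    ("t1", "A", 8.0),
    ("t1", "B", 8.1),
    ("t1", "C", 8.3),
    ("t2", "A", 9.0),
    ("t2", "C", 9.25),
]


def stops_between_pairs(rows):
    """Return (transit_id, trip_id, station, arrival) rows for every
    origin-destination pair a trip serves, keeping only the stops that
    fall between departure from the origin and arrival at the destination."""
    by_trip = defaultdict(list)
    for trip_id, station, arrival in rows:
        by_trip[trip_id].append((arrival, station))

    result = []
    for trip_id, stops in by_trip.items():
        ordered = sorted(stops)  # earliest stop first
        # Every earlier/later pair of stops on the trip is a possible
        # origin/destination pair.
        for (o_time, origin), (d_time, dest) in combinations(ordered, 2):
            transit_id = f"{origin}|{dest}"  # assumed key format
            for arrival, station in ordered:
                if o_time <= arrival <= d_time:
                    result.append((transit_id, trip_id, station, arrival))
    return result


for row in stops_between_pairs(stop_times):
    print(row)
```

Trip t2 skips station B, so its A-to-C rows contain only A and C, while trip t1 contributes A, B, and C for the same pair.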

Finally, to get the y-axis, calculate the average time it takes to get to each stop between the selected origin-destination pair.  Start by filtering for only trains going in one direction (not return trips, otherwise our averages would be miscalculated), then find the minimum station time for each trip (you can think of it as the trip start time).  Next, you need the time at which the train reaches each station on a trip.  Once you have these two values, subtract the trip start time from the station stop time.  Then average the results across trips to get the average time it takes to get to each station.  I know that's a lot, so I posted a close-up image of that process in Alteryx.
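The same calculation can be sketched in a few lines of Python. This is a rough equivalent rather than the Alteryx workflow itself: the function name and sample rows are invented for illustration, the rows are assumed to belong to a single origin-destination pair, and direction_id is assumed to be the 0/1 field from the GTFS trips table.

```python
from collections import defaultdict


def average_elapsed_by_station(rows, direction_id=0):
    """rows: list of (trip_id, direction_id, station, arrival_hours).

    Returns {station: average hours from trip start}, one direction only.
    """
    # Keep one direction so return trips do not skew the averages.
    one_way = [r for r in rows if r[1] == direction_id]

    # Trip start time = minimum arrival time on that trip.
    start = {}
    for trip_id, _direction, _station, arrival in one_way:
        start[trip_id] = min(arrival, start.get(trip_id, arrival))

    # Elapsed time to each station = station time minus trip start time.
    elapsed = defaultdict(list)
    for trip_id, _direction, station, arrival in one_way:
        elapsed[station].append(arrival - start[trip_id])

    return {station: sum(v) / len(v) for station, v in elapsed.items()}


# Made-up sample data; trip t3 runs the opposite way and is ignored.
sample = [
    ("t1", 0, "A", 8.0), ("t1", 0, "B", 8.25), ("t1", 0, "C", 8.5),
    ("t2", 0, "A", 9.0), ("t2", 0, "B", 9.5), ("t2", 0, "C", 9.75),
    ("t3", 1, "C", 10.0), ("t3", 1, "B", 10.25), ("t3", 1, "A", 10.5),
]
print(average_elapsed_by_station(sample))
```

Because every trip in one direction shares the same station order, the averaged values give each station a single, stable position on the y-axis.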



Quick side note: an earlier version of this Tableau viz used level of detail calculations. I didn't realize just how powerful these calculations are and how much you can achieve using them.

I have already talked about how I created the sequence order and the arrival time number but I also included trip_id detail for each of the lines, route_id and direction_id for the colors, and added sequence order to the path shelf (otherwise there would be gaps in the lines for stations that were skipped).  The only thing left to do is create the station filters.

To start, I created a station 1 and station 2 parameter.  These will drive the stations that you're viewing.  These parameters control the visualization via a calculated field (used as a filter) that checks if the name of station 1 is equal to the origin name and station 2 is equal to the destination name.  There is also an OR clause included to make sure you can view trains going in both directions.
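The logic of that filter can be written out as a small Python predicate. This is only a sketch of the idea, not the Tableau calculated field itself: the function and argument names are hypothetical, and it assumes the OR clause simply checks the same two stations in the opposite order.

```python
def passes_station_filter(origin, destination, station_1, station_2):
    """Keep a row when its origin/destination match the two selected
    stations in either order (assumed meaning of the OR clause)."""
    forward = origin == station_1 and destination == station_2
    reverse = origin == station_2 and destination == station_1
    return forward or reverse


# A row for a trip from "A" to "C" passes whichever way the
# stations were picked in the two parameters.
print(passes_station_filter("A", "C", station_1="A", station_2="C"))
print(passes_station_filter("A", "C", station_1="C", station_2="A"))
```

Checking both orders in one filter means the viewer does not have to swap the two parameters to see the return trains.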



And that's all there is to it.

I hope you were able to follow along and that you will be able to use some of the stuff in here to create some cool visualizations on your own. If I lost you at any point, I highly recommend taking a look at the files and digging into some of the details to determine what exactly is going on.  I know this may have seemed rather technical but I'm really happy that I was able to get a working version of a historically significant chart.

Thursday, April 16, 2015

Exploring NFL Free Agency

So I know it's been a bit since my last (and first) post but I wanted to check in again with a viz tracking 2015 NFL Free Agency.  My Philadelphia Eagles have had a busy, and at times head-scratching, offseason and I thought it might be interesting to see the free agents who have signed, left and re-signed with the team.  The other objectives here were to show the quality of the players signed and the free agents that have not signed yet.  I used Pro Football Focus' ratings from 2014 to display the free agents by rank.  If you're interested in reading more about the ratings, you can do so here.

As for interacting with the viz, on the NFL Player Movement page you can click the bar plot with grading buckets to see all the players that make up that bucket, or click an individual player in the table to see his journey this offseason.

If you're looking for an overview of your team, head over to the NFL Teams Offseason page and click on your team name.  This will give you an overview of new signings, players leaving, and players re-signing.  I wanted to be able to see exactly which impact players have signed with a team, and I feel like this page does that.

I also want to mention that I used Tableau 9 to create this.  A couple of things to note that I really enjoyed:

  • Loading and interacting with the dashboards is much faster now than it was in previous versions.  I barely encountered the wait messages that used to be the norm.
  • The ability to type directly into a shelf is really cool.  It was great to be able to edit formulas directly on a sheet and see the results immediately without having to edit any calculated fields.  It was then great to be able to save that formula only after I had seen the results and decided it was what I really wanted.


Friday, January 30, 2015

How Weather Affects Flights

So, this truly begins my journey into blogging.  I had previously kicked off blogging on Medium, only to find out that I can't embed Tableau Vizs on the site (great planning by me).

Allow me to introduce myself: my name is Kevin McGovern and I am a Consultant with Slalom in the Data Visualization and Discovery practice.  I have been using QlikView for over two and a half years and recently had my first full introduction to Tableau (thanks @pgilks).  I am a bit of a data nerd and like getting my hands on a fresh data set and seeing what kinds of cool ideas jump out at me.  I'm hoping that Tableau Public will allow me to build some cool stuff that I can share with fellow data enthusiasts.  You can follow me on Twitter as well (@mcgovey).

For my first Tableau Public venture, with winter in full swing, I wanted to illustrate how weather affects flights differently based on both the season and the location of the airport.  Unsurprisingly, the middle of the country seems to have the highest proportion of flights that are delayed or cancelled due to weather, and the winter months appear to be the most brutal.  One surprising result was just how highly American Airlines ranks on this list, given their "major carrier" status.

Select a state on the first page (or use the lasso to select multiple states) to see details for the airports in that state.  You can also select a month in the chart at the bottom of the dashboard to filter the data to only that month.

Hopefully the visualization mostly speaks for itself, but if there are questions, I'm happy to answer them on Twitter, in the comments of this post or in a subsequent blog post.  One thing to note: the time period for the data is January 2012 to November 2014 (December 2014 wasn't available when I wrote this).