Taxi Tip Amounts vs Geocoordinates
A graph of taxi tip amounts based on pickup geocoordinates, programmed in R using ggplot2, ggmap, and OpenStreetMap.
We were tasked with creating a visual that demonstrates where taxi drivers can expect to find the highest tips in New York City.
A standalone exercise in R, this graph is composed of a scatterplot of pickup locations in geocoordinates. The color of these points is mapped to the amount received in tips, and the scatterplot is then overlaid on top of a map of New York City.
The scatterplot was constructed in ggplot2, and ggmap was used to plot it on top of the map, which was obtained from OpenStreetMap with the help of a package named osmdata.
Code Walkthrough
To recreate the graph, we will begin by importing packages used in the plotting process.
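The import chunk is not reproduced here, so this is a minimal sketch that assumes only the three packages named above. Any helper packages the original script loaded are unknown and left out.

```r
# Packages named in the write-up
library(ggplot2)  # scatterplot layers, scales, and themes
library(ggmap)    # get_map() and ggmap() for the map background
library(osmdata)  # OpenStreetMap helpers
```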
The dataset is too large to host on GitHub, so I am unable to provide it. Let’s see exactly how large it is.
[1] 1494926 21
That is a lot of cab rides! There are 1,494,926 cab rides logged in the dataset, each with its own latitude and longitude for pickup location.
There are just too many points to make a readable graph using my available hardware. A minimum tip amount was arbitrarily chosen in order to cut down on the number of observations to be graphed. There is a lot of statistical groundwork that was skipped for this portion, and ultimately this constrains any conclusions to be drawn from the graphic to comparisons between tip amounts greater than the cutoff.
[1] 10509 21
It will be a lot easier to visually represent 10.5 thousand points rather than 1.5 million points.
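The two size checks and the filtering step can be sketched as follows. Since the real dataset is not available, df is a small hypothetical stand-in. The coordinate column names and the $10 cutoff are assumptions (the cutoff is inferred from the lower legend limit mentioned later); only Tip_amount and df1 are names taken from this write-up.

```r
# Hypothetical stand-in for the full trip data (the real file has 1,494,926 rows and 21 columns)
set.seed(42)
df <- data.frame(
  Pickup_longitude = runif(1000, -74.05, -73.75),  # assumed column name
  Pickup_latitude  = runif(1000, 40.60, 40.90),    # assumed column name
  Tip_amount       = round(rexp(1000, rate = 0.2), 2)
)

# Rows and columns before filtering
dim(df)

# Assumed cutoff; the write-up only says the minimum tip was chosen arbitrarily
min_tip <- 10

# Keep only rides whose tip meets the cutoff
df1 <- df[df$Tip_amount >= min_tip, ]

# Rows and columns after filtering
dim(df1)
```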
Digital Pseudo-Cartography (or, Making Data-Rich Backgrounds)
Now that I have the observations to be plotted, I can get the map that will serve as the background for the plot. To make it easier to zero in on the area I wish to include, I set the minimum and maximum latitude and longitude.
Since the bounding box is expected in matrix format, the desired bounds need to be converted into a matrix, and the rows and columns need to be properly named.
Finally, the map is requested with the get_map() function (which ships with ggmap rather than osmdata). This returns a map of New York City, which is stored in the variable nyc_map.
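A minimal sketch of that conversion is below. The bounds are hypothetical values that roughly cover New York City, and the variable names are my own. The row names x and y and column names min and max follow the layout that osmdata's getbb() returns.

```r
# Hypothetical bounds; the values used for the actual graph are not shown in this write-up
min_lon <- -74.05
max_lon <- -73.75
min_lat <- 40.60
max_lat <- 40.90

# Bounding box as a 2 x 2 matrix: one row per axis, one column per extreme
nyc_bb <- matrix(
  c(min_lon, max_lon,
    min_lat, max_lat),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("x", "y"), c("min", "max"))
)

nyc_bb
```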
Building a Graph
The final step is to construct the graph. This is done by painstakingly layering elements and adjusting aesthetics until you realize that the deadline is 20 minutes away and you finally convince yourself that the graph looks great and you won’t push your luck (but let’s be real, it could always look better). In other words, all the prepwork is done and it is time to actually do the heavy lifting. Details below.
The caption which details the sources of the data is stored in a variable. This will later be displayed on the graph (lower right corner).
Building the Graph
First I will show you the entire chunk for the graph, then I will explain the purpose of each piece.
Breakdown begin! (everything is fine….)
To begin, the New York City map is called and set as the graph background using ggmap(). I find that ggplot2 and ggmap do not play well together unless ggmap is called first.
Next the scatterplot is created using ggplot2. The data is called from the dataframe df1 and arranged by tip amount. This is done because the points are plotted in the order they are fed into the scatterplot, and with so many data points, the largest observations would get buried underneath smaller ones if they were plotted first.
After arranging the plotting order, the x and y axes are set for longitude and latitude.
Finally, the color, alpha, and size aesthetics for observations are mapped to the tip amount for each observation. This was done to differentiate tip amounts from each other. It was not enough to differentiate based on color, so size and alpha also had to be mapped.
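Here is a sketch of the scatterplot layer under a few assumptions: df1 is a small hypothetical stand-in, the coordinate column names are guesses, base R's order() stands in for whatever sorting call the original used, and an empty ggplot() takes the place of ggmap(nyc_map) so the example runs without downloading a map.

```r
library(ggplot2)

# Hypothetical stand-in for the filtered data
set.seed(1)
df1 <- data.frame(
  Pickup_longitude = runif(200, -74.05, -73.75),  # assumed column name
  Pickup_latitude  = runif(200, 40.60, 40.90),    # assumed column name
  Tip_amount       = runif(200, 10, 300)
)

# Sort ascending so the largest tips are drawn last and end up on top
df1_sorted <- df1[order(df1$Tip_amount), ]

# ggplot() is a stand-in for ggmap(nyc_map) here
p <- ggplot() +
  geom_point(
    data = df1_sorted,
    aes(
      x = Pickup_longitude, y = Pickup_latitude,
      # color, transparency, and size all follow the tip amount
      color = Tip_amount, alpha = Tip_amount, size = Tip_amount
    )
  )
```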
A theme had to be set in order to adjust values within the theme. Why does ggplot2 require this, and why can’t it use default values? The world may never know. Moving on:
The bounds for the graph are set to match the map of New York City. This is done because the cab data extends past the bounds of the requested map, and if the bounds were not specified there would be white space surrounding the map.
“But why not grab a bigger map for your background,” you may ask. The answer lies with data scarcity at boundaries: I wanted to focus the graph on pertinent data with regard to the highest tip amounts and biggest groupings of tips. In short, showing all of the data would mean having to zoom the graph out and would detract from the primary purpose of the graph.
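One way to pin a plot to fixed bounds in ggplot2 is coord_cartesian(). This is a sketch of the idea, not necessarily the call used for the actual graph, which is not shown here. The bounds, data, and column names are hypothetical.

```r
library(ggplot2)

# Hypothetical map bounds (same role as the min/max latitude and longitude set earlier)
min_lon <- -74.05
max_lon <- -73.75
min_lat <- 40.60
max_lat <- 40.90

# Hypothetical pickups, some of which fall outside the map bounds
set.seed(2)
df1 <- data.frame(
  Pickup_longitude = runif(100, -74.30, -73.50),
  Pickup_latitude  = runif(100, 40.40, 41.10)
)

p <- ggplot(df1, aes(x = Pickup_longitude, y = Pickup_latitude)) +
  geom_point() +
  # Zoom to the bounds with no padding, so the panel ends exactly at the limits
  coord_cartesian(
    xlim = c(min_lon, max_lon),
    ylim = c(min_lat, max_lat),
    expand = FALSE
  )
```

With a ggmap background, adding a coordinate system replaces the one ggmap sets up, so the aspect ratio may need a second look.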
The alpha and size aesthetics are mapped to Tip_amount in an earlier code block, so the points grow in size and become more opaque (less transparent) as the tip amount grows.
In this code block, the start and end points for the alpha scale are set. scale_size_continuous() is called purely to turn the size legend off (which is done by passing guide = "none").
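A sketch of those two scale calls follows. The alpha range of 0.3 to 1 is a hypothetical choice, since the values used for the actual graph are not shown, and the data and coordinate column names are stand-ins.

```r
library(ggplot2)

# Hypothetical stand-in data
set.seed(3)
df1 <- data.frame(
  Pickup_longitude = runif(100, -74.05, -73.75),
  Pickup_latitude  = runif(100, 40.60, 40.90),
  Tip_amount       = runif(100, 10, 300)
)

p <- ggplot(df1, aes(x = Pickup_longitude, y = Pickup_latitude,
                     alpha = Tip_amount, size = Tip_amount)) +
  geom_point() +
  # Smallest tips are drawn at 30% opacity, largest tips fully opaque (assumed range)
  scale_alpha_continuous(range = c(0.3, 1)) +
  # Keep the size mapping but hide its legend
  scale_size_continuous(guide = "none")
```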
In this code block, the color gradient for the scatterplot is fine-tuned.
In preliminary drafts of this graph, I divided the dataset into two dataframes: one with tip amounts from $10 to $149, and the other with tip amounts from $150 to $300. Ultimately I decided to go with a single three-color gradient (in ggplot2 terms, scale_color_gradient2()). The low, mid, and high colors were deliberately chosen to contrast with each other as much as possible, and to be perfectly honest I am not satisfied with the results. I feel that some of the points around the $200 range get washed out and hidden in the graph. I tried to prevent this as much as possible by adjusting the color hues and the midpoint, which I found looked best around $115. The limits argument sets the range covered by the color scale, and the breaks argument sets which values are labeled on the legend. Note that $10 and $300 are passed into both.
Lastly, my favorite part: the text elements of the graph. The variable capt1 that was set earlier is called and displayed as the caption for the graph.
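A sketch of the color scale is below. The midpoint of 115 and the 10 and 300 endpoints come from the description above. The three colors and the intermediate breaks at 100 and 200 are hypothetical, as are the data and coordinate column names.

```r
library(ggplot2)

# Hypothetical stand-in data
set.seed(4)
df1 <- data.frame(
  Pickup_longitude = runif(100, -74.05, -73.75),
  Pickup_latitude  = runif(100, 40.60, 40.90),
  Tip_amount       = runif(100, 10, 300)
)

p <- ggplot(df1, aes(x = Pickup_longitude, y = Pickup_latitude,
                     color = Tip_amount)) +
  geom_point() +
  scale_color_gradient2(
    low = "blue", mid = "yellow", high = "red",  # hypothetical colors
    midpoint = 115,                 # tip amount that receives the mid color
    breaks = c(10, 100, 200, 300),  # legend labels; 100 and 200 are assumed
    limits = c(10, 300)             # range covered by the color scale
  )
```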
My goal with this text was to convey maximally relevant information while minimizing word usage. Rather than go with “Scatterplot” as the title, I recognized that the key visual focus of the graph was the map of New York City. The subtitle conveys the subject matter of the scatterplot and axes in as few as 7 words (this could be reduced to 5 by removing “Green Taxi”).
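The text layer can be sketched with labs(). Every string below is a placeholder: the real title, subtitle, and caption wording are not reproduced in this write-up, and only the variable name capt1 comes from it.

```r
library(ggplot2)

# Placeholder caption; the real one lists the data sources
capt1 <- "Sources: <taxi trip data>, <map data>"

text_layer <- labs(
  title = "Placeholder title",        # hypothetical wording
  subtitle = "Placeholder subtitle",  # hypothetical wording
  caption = capt1                     # caption stored earlier in a variable
)

# ggplot() is a stand-in for the finished map and scatterplot
p <- ggplot() + text_layer
```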