Clear Skies
I have worked over the past year or so to speed up routine reporting tasks with rudimentary code. Python gave me the most success, and I still rely heavily on Excel formulas. Argus prefers the statistics language R.
R easily chews through the massive sheets of data I review as an energy reporter. But coaxing conclusions back out of that maw has proven a much finickier conversation than with Python, and certainly harder than with the formulas I cobbled together over a decade in Excel. Still, I try to figure out new problems in R first as often as possible when coding for work.
The collapse of the air transportation industry offers one of the clearest illustrations of economic devastation wrought by global efforts to mitigate COVID-19. Major US air carriers slashed their capacity in March and April with dizzying speed. But without access to data about scheduled flights, it was tough to get a sense of the drop.
The US Transportation Security Administration began posting data about daily passenger screenings in March. The numbers popped up in a simple HTML table updated every morning with the previous day’s total. Nice! This simple year-over-year comparison is very helpful for several reasons.
First, it’s much more timely than the information government agencies usually make available; you can wait weeks, if not months, for this kind of data to go public. Second, it’s published with context that illustrates how the current data is changing. Sure, it’d be great to see a few years of the same data instead of just one. But year-over-year data still means we can draw some very apples-to-apples comparisons about the current year by setting it against the same travel season last year. One additional helpful bit is that TSA compares each day to the same day of the week in 2019, not simply the exact same date. This matters because, in normal circumstances, travelers favor specific days of the week. Simply comparing 12 April 2020 to 12 April 2019 would muddy the comparison by mixing different days of the week.
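A quick check in R bears this out. This snippet is just for illustration, not part of the scraping script; it shows the same calendar date fell on different weekdays in the two years:
weekdays(as.Date(c("2020-04-12", "2019-04-12")))
# [1] "Sunday" "Friday"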
This is useful data. And because it’s in a simple table, it is very easy to grab with a script, a task called scraping.
R has a package called rvest that will help make this easy. I’m also going to call in some tools from a huge and useful library called tidyverse. Tidyverse also has tools to help the computer understand the dates in the first column, but at the time I wrote this I was using a package called anytime to process those. Finally, I’ll use ggplot2 to show me what’s happening over time.
# load the packages: rvest for scraping, anytime for date parsing,
# tidyverse and ggplot2 for data wrangling and charts
# (anything missing can be installed with install.packages())
library(rvest)
library(anytime)
library(tidyverse)
library(ggplot2)
First, I need to tell my script what to grab. Obviously, it’s going to need the website’s URL. I also need to tell it where on the webpage it should look.
I’m going to do this with the table’s XPath, which tells my script how to navigate the website. It’s the computer equivalent of telling it to look at the table below the masthead. XPath is simple to find: in most modern browsers you can right-click the table, choose Inspect, and copy the XPath straight out of the developer tools.
OK, so I can now describe through R how to find the table we want. I’m still learning rvest at this point, so for now I’m going to read the page with read_html() from the xml2 package and pull the table out with rvest’s html_table(). Why? Because I’m learning R pretty much the same way I learned higher-level Excel: Google the question and try the most recent answers. This is not a great way to go about it! R has changed a lot over the years. But it does let you eavesdrop on lots of conversations among people who struggled with these questions before you.
# fetch the latest screening numbers
url <- "https://www.tsa.gov/coronavirus/passenger-throughput"
TSA <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath = '//*[@id="node-page-25164"]/div[1]/div/div/div/table') %>%
  html_table()
Alright! I parked the url in a variable called “url”. Then I told rvest to look at the code that makes up the TSA website, find the table, and scrape it into a variable called “TSA”. This comes out of rvest in a list format, and I want it in a dataframe, so I’ll need to deal with that next.
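If you’re curious what that list looks like before moving on, a quick sanity check (not part of the script proper) confirms rvest handed back a list holding a single data frame:
length(TSA)      # 1: the page has just one table
class(TSA[[1]])  # "data.frame"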
I’m going to create a dataframe called “passengers.” I’m going to label my columns of data and trim the old headings. I’m going to tell R to treat the text that writes out dates as dates that a computer understands. I’m going to make sure R knows that the rest of the numbers are, actually, numbers. Finally, I’m going to go ahead and create a column that tells me what the 2020 traffic represents as a percentage of last year’s traffic.
I’d use iterative processes called loops to do this in Python. You can do this in R, too — but I can’t, consistently. So this code does it the longer but clearer way.
# process the HTML table into a data table with numbers and a percentage calculation
passengers <- TSA[[1]]                    # pull the lone data frame out of the list
colnames(passengers) <- c("Date", "Current", "Year-ago")
passengers <- passengers[-1, ]            # trim the old heading row
passengers$Date <- anydate(passengers$Date)
passengers$Current <- as.numeric(gsub(",", "", passengers$Current))
passengers$`Year-ago` <- as.numeric(gsub(",", "", passengers$`Year-ago`))
passengers <- mutate(passengers,
                     pc = round(Current / `Year-ago` * 100, 1))
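For what it’s worth, the loop approach I’d reach for in Python translates to R roughly like this. It’s a sketch of the alternative, not the code I actually ran, and it assumes the same two count columns:
# clean both count columns in one pass instead of line by line
for (col in c("Current", "Year-ago")) {
  passengers[[col]] <- as.numeric(gsub(",", "", passengers[[col]]))
}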
There’s one more thing I want to do, right off the bat. This TSA data kind of appeared from nowhere: it showed up in March, backdated to the first of the month. That tells me it could disappear at any time, too. So I’m going to write a copy of that data onto my computer for later, just in case.
write.csv(passengers,"<not a real path posted to the internet>tsa_screen.csv", row.names = FALSE)
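If I wanted to be extra careful, a small variation would stamp each backup with the date it was saved, so no snapshot ever gets overwritten. A sketch, with the file name as a placeholder:
# hypothetical variant: one dated snapshot per day
write.csv(passengers,
          paste0("tsa_screen_", Sys.Date(), ".csv"),
          row.names = FALSE)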
Now for where R shines: instant graphs! Let’s take a look at what the data looks like over time. I’m going to tell R to gather my dataframe into a tall format so I can plot the two years against each other, leaving the percentage column and the dates out of the reshaping. Then I’m going to do some quality-of-life bits, like telling the scale to include commas, giving the chart a title, and making sure I cite my source.
# plot the current and year-ago screenings
passengers %>%
  gather(key = "When", value = "passengers", -Date, -pc) %>%
  ggplot(aes(x = Date, y = passengers, col = When)) +
  geom_line(size = 1.2) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "TSA Passengers Screened",
       caption = "Source: Transportation Security Administration",
       x = NULL,
       y = NULL) +
  theme(plot.title = element_text(hjust = 0.5))
Ouch.
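If I want to keep a copy of the chart itself, ggplot2’s ggsave() will write the most recently displayed plot to disk. The file name and dimensions here are just examples:
# save the latest plot; ggsave() defaults to the last plot displayed
ggsave("tsa_screenings.png", width = 8, height = 5)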
I need one last thing, and it’s the bit that I struggle with in R the most. The plot is great for showing the story at a glance. But I’m a writer. I need the numbers R found. So for want of a pretty way to do this, I’m just going to tell it to show me the latest day of data.
# What is the latest number of passengers screened?
passengers[1,]
        Date Current Year-ago  pc
1 2020-04-11   93645  2059142 4.5
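For what it’s worth, a slightly prettier readout is possible by having R draft the sentence itself with sprintf(). This is a sketch, and it assumes the latest day sits in row 1, as it does on the TSA page:
# build a writer-friendly sentence from the newest row
latest <- passengers[1, ]
sprintf("On %s, TSA screened %s passengers, %.1f%% of the year-ago figure.",
        format(latest$Date, "%d %B %Y"),
        format(latest$Current, big.mark = ","),
        latest$pc)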
Yesterday, the Saturday of Easter weekend, checkpoints saw just 4.5 percent of the travelers screened on the comparable Saturday last year. An incredible collapse.
And now that I’ve written the script, my computer will check and record this information in the background while showing me the latest update on screenings. No more visiting the website, and no risk of transcription errors or botched arithmetic.