Computing Rolling Averages With Pandas

Published on

There’s been a lot of great data reporting amid the COVID-19 crisis (and – let’s be honest – a whole lot of awful abuse of numbers, too). My hometown paper The San Francisco Chronicle has pretty reliably good data visuals, and I think they’ve done a nice job with their Coronavirus Tracker.

The other day, I took note of how the paper integrated a rolling seven-day average into a set of charts about new cases and deaths. Here’s a screen-grab of one of the charts, in anticipation of that hopeful day when the link above will be broken and all of this horror is behind us:

Figure 1: New Reported Cases in the Bay Area (Source: SF Chronicle)

I’ve also recently gotten involved in an open source project to collect and publish COVID-19 infection data from various official sources around the San Francisco Bay Area, and wrote a script to pull in hospitalization stats from an open API made available by the California Department of Public Health. As I looked at the charts from the SF Chron, I started wondering how I could easily compute a rolling seven-day average for this time series data. The feature is especially helpful visually when the data is “spiky,” i.e., has a fair amount of fluctuation with outliers. That’s not necessarily the case with this data set, actually, but I was still curious.

Lo and behold, the fantastic pandas library delivers again with a method for DataFrame and Series objects called (of course) rolling(). Check out the full documentation, but the quick summary is that you can take a DataFrame with a DatetimeIndex and easily get a seven-day average like this:

df.rolling("7D").mean()
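To make that one-liner concrete, here is a minimal sketch on a tiny DataFrame. The column name icu_patients and the numbers are made up for illustration; they are not the real hospitalization data. It also shows one thing worth knowing: a time-based window like "7D" is not quite the same as an integer window of 7 rows.

```python
import pandas as pd

# Made-up daily counts, purely for illustration (not real hospitalization data).
index = pd.date_range("2020-04-01", periods=10, freq="D")
df = pd.DataFrame(
    {"icu_patients": [10, 12, 11, 15, 14, 13, 18, 17, 16, 20]},
    index=index,
)

# "7D" is a time-based window: each row averages the observations from the
# seven days ending on (and including) that row's date. Early rows simply
# average whatever days are available so far.
df["rolling_7d"] = df["icu_patients"].rolling("7D").mean()

# An integer window counts rows instead of days, and by default yields NaN
# until seven rows are available.
df["rolling_7_rows"] = df["icu_patients"].rolling(7).mean()

print(df)
```

With one row per day and no gaps, the two columns agree from the seventh row onward. If the series has missing days, the "7D" version still covers seven calendar days, while the integer version would reach further back in time to find seven rows.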

You can see how I used it in action at this GitHub repo, which contains a Jupyter Notebook and the script I used to pull down the data. Here’s a chart I created showing the reported number of confirmed COVID-19 patients in ICUs at hospitals in San Francisco County. The blue bars are the raw numbers; the orange line is the rolling average computed by rolling().

Figure 2: Confirmed ICU COVID Patients - San Francisco County
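
A chart in this style (bars for the raw numbers, a line for the rolling average) can be sketched with matplotlib roughly as follows. This is not the code from the notebook: the numbers are made up, and the variable names, colors, and labels are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily counts, purely for illustration (not real hospitalization data).
index = pd.date_range("2020-04-01", periods=14, freq="D")
icu = pd.Series(
    [10, 12, 11, 15, 14, 13, 18, 17, 16, 20, 19, 17, 15, 16],
    index=index,
)

# Seven-day rolling average over the DatetimeIndex.
rolling = icu.rolling("7D").mean()

fig, ax = plt.subplots(figsize=(8, 4))

# Raw daily numbers as bars, rolling average as a line on the same axes.
ax.bar(icu.index, icu.values, color="tab:blue", label="Reported ICU patients")
ax.plot(rolling.index, rolling.values, color="tab:orange", linewidth=2,
        label="7-day rolling average")

ax.set_ylabel("Patients")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they don't overlap
```

In a Jupyter Notebook the figure renders inline; in a script, call plt.show() or fig.savefig() with a file name of your choosing to see the result.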