Continuously Publish Fresh Data with Datasette, Heroku, and GitHub Actions

Web scraping is a useful technique for gathering data in an automated fashion. But how can you publish that data, and then keep the published version up to date with the freshest scrape automatically? Enter Datasette, Heroku, and GitHub Actions.

One of the projects that I contributed to over the past year was the Bay Area Pandemic Dashboard, a project initiated by a group within Code for San Francisco to gather data published by local municipalities within the Bay Area and visualize what was happening with the pandemic. As an extension to that, I created the BAPD Open Database, which took the data the project scraped and transformed it from JSON into a SQLite database that could be explored on the web using Datasette.

Datasette provides a nice, user-friendly interface for digging into large amounts of data, allowing simple form-based querying as well as full SQL queries. It lets users export the underlying data in several formats and provides a JSON API. Powerful stuff. It also supports publishing data to several platforms out of the box (documentation here), one of which is Heroku. Heroku, in turn, provides a platform for running web applications, including those written in Python, like Datasette. You can do this by registering for what Heroku calls a "dyno" (either the limited free tier or a cheap-ish hobby plan) and then creating a new app via its web interface. I've found other posts helpful in explaining this process (like this one here by Shawn Graham), so for the rest of this post I'll focus on the GitHub Actions piece.
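To give a sense of what publishing to Heroku looks like, here is a rough sketch of the command; the database filename and app name are placeholders rather than this project's actual values:

```bash
# Push a local SQLite file to a Heroku app using Datasette's built-in publisher
datasette publish heroku data.db -n my-datasette-app
```

This assumes the Heroku CLI is installed and you're logged in, since Datasette shells out to it to create and deploy the app.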

Publishing data started as a manual process invoked via the command line, which I documented pretty thoroughly in the project README, but I had been looking for a way to let the database keep itself up to date. While running a script automatically at a set time might traditionally involve commandeering your own server or cloud instance and setting up a job with cron, GitHub Actions offers a lightweight alternative, especially if you're already using GitHub and the code you want to run lives in a repository. The cost is getting familiar with a not-totally-intuitive YAML syntax, but after looking at some examples, this too is fairly easy to puzzle out.
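As a minimal skeleton (not the project's actual file; the workflow name, schedule, and steps are placeholders), a scheduled workflow lives in .github/workflows/ and looks something like this:

```yaml
name: Scheduled run

on:
  schedule:
    # Standard five-field cron syntax; GitHub evaluates it in UTC
    - cron: '0 6 * * *'

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # ...steps to rebuild and publish the database go here
```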

Simon Willison, the creator of Datasette, wrote a very helpful post on using GitHub Actions to publish data with Google Cloud Run, one of the other platforms his library supports. I found much of the top section of that post reusable for my purposes. However, my workflow wasn't entirely the same, and I needed to deploy to Heroku instead. A few things in particular helped me understand what I needed to get this process working:

- You can see the full workflow I ended up with here (and saved here for posterity).
- Workflows are event-triggered, and you can set the process to run at a specified interval using cron. I found it's also handy to be able to trigger runs manually, which you can do by listing workflow_dispatch as an event trigger (with no additional params; see the sketch below).
- There are some additional tips I learned from other helpful members of the BAPD project, such as how to cache Python dependencies, which saves a little time and bandwidth.
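Putting those pieces together, here is a rough sketch of what such a workflow can look like. It is not the project's exact file: the schedule, script names, and database filename are placeholders, and it assumes the Heroku CLI is available on the runner and that a Heroku API key has been stored as a repository secret named HEROKU_API_KEY.

```yaml
name: Scrape and publish

on:
  schedule:
    - cron: '0 6 * * *'   # run daily at 06:00 UTC
  workflow_dispatch:       # allow manual runs from the Actions tab

jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      # Cache pip downloads so repeat runs save a little time and bandwidth
      - uses: actions/cache@v2
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      # requirements.txt is assumed to include datasette itself
      - name: Install dependencies
        run: pip install -r requirements.txt

      # Placeholder for whatever script turns the scraped JSON into a SQLite file
      - name: Build the database
        run: python build_database.py

      - name: Publish to Heroku
        env:
          HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
        run: datasette publish heroku data.db -n my-datasette-app
```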