Exploring JSON Data From the Command Line With jq

Overview: Why You Might Want to Use jq

JSON is a ubiquitous data interchange format, but the tools for exploring it aren't always great. Why might you want to traverse JSON outside of manipulating it with a full-fledged programming language? Here's a real-world use case to consider:

I contribute to an open source project that scrapes COVID-19 data from local government sources, cleans it, and then stores the output as JSON. Often, I find myself wanting to quickly examine the data from the latest run when testing out someone's code in a pull request or working on fixes myself. For a while, I did this using Python's built-in json library in the REPL, but that becomes quite tedious when iterating: each time, I have to import the module, open the file, load the data into a variable, and so on.

Then I learned about jq, a domain-specific language for processing JSON data, and believe this to be a much better tool for this situation. You can think about jq sort of like awk, but for JSON instead of tabular or delimiter-separated text files. Like awk, it has the advantage of being easy to compose on the fly from the command line.

There are a lot of tutorials out there on jq; I found this one on "How To Geek" to be a pretty good primer. In this post, I'll aim to go a bit more in depth by showing how records can be selected based on their values, and how you can use jq for data validation.

As always, make sure to RTFM.

Getting Started

One thing you might want to do with a JSON blob is get a list of all the keys, so you can begin to get a feel for the data. For the purposes of this post, let's imagine we have a file called people.json that looks like the following. jq is much more useful when we have big, unwieldy objects, but this simplistic example will help us get our bearings.

{
    "people": [
        {
            "first_name": "Dade",
            "last_name": "Murphy",
            "likes": ["hacking", "sunglasses"],
            "age": 18
        },
        {
            "first_name": "Kate",
            "last_name": "Libby",
            "likes": ["hacking", "video games"],
            "age": 1337
        }
    ]
}

Let's start our exploration by using the keys command:

$ jq 'keys[]' people.json
"people"

Notice a couple of things here. First, jq can be invoked by passing the name of the file you want to read at the end of the command; you can also start the pipeline by executing cat people.json and piping to jq, with the same result. This is expected behavior from Unix tools, and makes jq flexible and powerful. Similar to awk or sed, a jq program is written inside single quotes.

Now let's look a bit closer at the syntax. keys is the only command or "filter" in this short program, and as you might expect, it returns an array of all the keys at the top level of the object. Following it with square brackets ([]) unpacks that array into individual elements. Without them, the output will look like this:

$ cat people.json | jq 'keys'
[
  "people"
]

OK, now we have the basics, so let's dig a little deeper.

Selecting Data by Index and Value

We might want to get more familiar with the data by looking at all the keys in the first entry in the array. We can achieve that in the following way:

$ cat people.json | jq '.people[0] | keys'
[
  "age",
  "first_name",
  "last_name",
  "likes"
]

We see here that we can select a key using the dot (.) filter followed by the name of the key, with no quotation marks around it (on its own, . is jq's "identity" filter); if the value is an array, values within that array can be selected by index number. What's more, within jq, we can pass the output of one filter to another with a pipe (|). NB: this pipe is still within the jq program (i.e., inside the single quotes), not part of the shell pipeline.
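
As a quick check that these pieces compose, something like the following should pull the second item out of Dade's likes array in the sample file above:

$ cat people.json | jq '.people[0].likes[1]'
"sunglasses"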

We can also select an object based on the value in a key-value pair, using the handily named select() function. For example, say we wanted to find every entry for someone named "Dade". We could write our program like so:

$ cat people.json | jq '.people[] | select(.first_name == "Dade")'
{
  "first_name": "Dade",
  "last_name": "Murphy",
  "likes": [
    "hacking",
    "sunglasses"
  ],
  "age": 18
}

We could take that a step further, and just get the ages for people named "Dade". In order to handle the possibility that there might be more than one person with this first name, we'll want to treat the selection as an array by surrounding it in square brackets, and then unpack it:

$ cat people.json | jq '.people[] | [ select(.first_name == "Dade") ][].age'
18
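
If you'd rather collect the matching ages into a single array (handy if you want to hand the result off to another filter like length or add), one way is to keep the square brackets around the whole pipeline, something like:

$ cat people.json | jq '[.people[] | select(.first_name == "Dade") | .age]'
[
  18
]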

We know there is a key called "likes". That's plural, so we'll assume that's an array. How would we look for people who have "hacking" among their likes? With the help of any().

$ cat people.json | jq '.people[] | select(any(.likes[]; . == "hacking"))'

This outputs the records for both Dade and Kate, since both of them have "hacking" among their likes. The syntax of this command bears some explanation (and a HT to StackOverflow for helping me find it). This is what's described in the docs as the any(generator; condition) form. Basically, you are unpacking the array and then testing the condition on each element. Outside of that, select grabs any element from people where the condition is true. If you passed in "sunglasses" instead of "hacking", we'd only get Dade back.
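
To see that for yourself, swap in "sunglasses" and the output should shrink to just Dade's record:

$ cat people.json | jq '.people[] | select(any(.likes[]; . == "sunglasses"))'
{
  "first_name": "Dade",
  "last_name": "Murphy",
  "likes": [
    "hacking",
    "sunglasses"
  ],
  "age": 18
}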

Data Validation and Diving Deeper

What if you didn't know that "likes" was an array? Or more generally, what if you wanted to produce an object showing all the key names and the corresponding data types at the level of the object you selected? You'd need to get a little tricky:

$ cat people.json | jq '.people[0] | [ [ keys_unsorted[] ], [ (.[] | type) ] ] |   (1)
› transpose |                                   (2)
› map( {key: .[0], type: .[1]} )'               (3)

Let's walk through each step (the numbered markers are annotations for this walkthrough, not part of the command). At (1), we're creating two separate arrays: one with all the keys, in their original order, and one with all the corresponding types. At (2), the transpose function "zips" the values from each array together, so you have [["first_name", "string"], ["last_name", "string"] ...] and so on. Finally, at (3) we map each of those pairs into a new object, producing output like this:

[
  {
    "key": "first_name",
    "type": "string"
  },
  {
    "key": "last_name",
    "type": "string"
  },
  {
    "key": "likes",
    "type": "array"
  },
  {
    "key": "age",
    "type": "number"
  }
]

Neat, right? The important thing to note is the use of keys_unsorted, instead of just keys, at (1); keys would sort the key names alphabetically while the values stay in their original order, which would scramble the pairing, so keys_unsorted ensures that each key is mapped to its correct corresponding type. (Another HT to StackOverflow for this.)
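
If you wanted to run this check across every record rather than just the first, one option (a sketch, just the same program with the index swapped for the [] iterator) is to print one key/type listing per person and eyeball them for any drift:

$ cat people.json | jq '.people[] | [ [ keys_unsorted[] ], [ (.[] | type) ] ] | transpose | map( {key: .[0], type: .[1]} )'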

Finally, let's inspect our data for invalid values. If you recall from above, Kate's age looks like it got, um, hacked. Maybe we want to find any records where the age value is over 110.

$ cat people.json | jq '.people[] | select(.age > 110)'
{
  "first_name": "Kate",
  "last_name": "Libby",
  "likes": [
    "hacking",
    "video games"
  ],
  "age": 1337
}
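
Since select() just passes matching objects along, you can keep piping to trim the output down; for example, something like this should print just the name on any suspicious record:

$ cat people.json | jq '.people[] | select(.age > 110) | .first_name + " " + .last_name'
"Kate Libby"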

If someone had really managed to mangle their age, maybe through lax rules in a web form, we could look for age values that are not numeric, too. (Imagine for a moment that Kate's age had come through as the string "foobar" rather than a number.)

$ cat people.json | jq '.people[] | select(.age | type != "number")'
{
  "first_name": "Kate",
  "last_name": "Libby",
  "likes": [
    "hacking",
    "video games"
  ],
  "age": "foobar"
}
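
And if all you want is a count of how many records fail a given check, you can wrap the selection in square brackets and pipe it to length; with our sample data, something like this should report the one suspicious age:

$ cat people.json | jq '[.people[] | select(.age > 110)] | length'
1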

Conclusion

There's a lot more you can do with jq; I've seen people build web scraping scripts almost entirely around it to extract and transform complex JSON objects. It's not for every use case, but hopefully this tutorial has shown off some of the strengths of this tool.