Powerful JSON processing with jq
jq is a command line tool for processing JSON documents. It can filter, transform and restructure documents to format them in the way we want.
The JSON documents we have to work with are becoming more and more complex. Web APIs often generate documents with hierarchic structure and optional elements.
Data mining, however, needs a tabular structure, without hierarchic elements, and if possible without missing data. jq helps us with the transformation of relevant parts of input documents into this shape.
Consider the following example document:
{ "count": 3, "category": "example", "elements": [ { "id": 1, "description": "first element", "tags": ["tag1"] }, { "id": 2, "description": "second element", "optional": "optional element", "tags": [] }, { "id": 3, "description": "third element", "tags": ["tag1", "tag2"] } ] }
This shows the usual pitfalls of complex JSON documents:
- Elements on different hierarchy levels: category, elements/id etc.
- Optional elements: elements[2]/optional
- Variable number of elements: elements/tags
It is really hard to process this kind of document in RapidMiner. If you install the Web Mining extension, you get JSON to XML; in Text Processing there's JSON to Data. However, JSON to XML tends to fail on more complex documents, and JSON to Data produces an example set with hundreds or thousands of attributes that might be complex or impossible to process with a series of Transpose, Pivot and similar operators.
So one might want to turn to jq. It can be used as a command line utility to process data. But the easiest way to try it is online at jqplay.org.
Let's create a table with the category, the element id and the tags. The jq expression for this is:
{count, category, elements: .elements[] } | {category, id: .elements.id, tag: .elements.tags[]}
Scary for sure in the first moment! But when you look at it, it’s built of simple elements. You can always execute it step by step at jqplay to see the effects of each transformation.
In the first step (the steps being delimited by the pipe symbol „|“) we declare the elements we want to process. We build an object list with { }, taking count and category from the top level and an array of the elements. count and category are repeated to create a proper table.
In the second step we select category and the object id-s, which were on different levels previously. The tags are selected as an array. Using the syntax name: .element.element we can select elements and name them. The result of this step is a list of objects having category, id and tag in a table, suitable for writing into a relational database or processing in a data mining tool.
Using jq with RapidMiner
It would be useful to process these kinds of documents with jq in RapidMiner. This is what I did, using jackson-jq, a Java implementation of jq. A zip file is attached to this post, it contains the necessary libraries and two example processes.To prepare, we need to copy the jackson-jq jar file and two dependencies into the RapidMiner Studio lib directory. Then we’re able to use the jq functionality in the built-in Groovy scripting operator (Execute Script).
I created two RapidMiner processes to make the application easier. These can be used in other processes. There is one variant working on tables (example sets), here you specify the name of the input attribute containing your documents and the target attribute for the transformation result. The other variant works on Document objects, like those coming from Get Page.
In both cases you specify the jq expression and set up the output options. You can indent the output, and convert the result to CSV. The CSV formatted result can be easily transformed to an example set — this is a frequent use case.
If you want to see CSV output in jqplay, check „Raw Output“ and append the following:
| [ .category, .id, .tag ] | @csv
This creates an array (with the [] syntax) and lists the elements in the output. The result is converted with the @csv step. (The RapidMiner process does this automatically if the csv output is selected.)
This, together with some practicing in jqplay, enables processes that can transform complex JSON documents to straight tables.
To process different parts and structures in the document, just multiply it and apply different jq expressions on the copies.