The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
Tutorial for the "JSON processing with jq" extension
BalazsBarany
Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
You might have some complex JSON documents that you'd like to process in RapidMiner. This tutorial demonstrates the functionality of the JSON processing with jq extension.
The first step is to install the extension from the Marketplace if you don't have it yet. Click the Extensions/Marketplace menu entry, and in the Marketplace window, enter JSON in the search box. Select the extension for installation, and let Studio restart itself when the installation is done.
Let's play with a publicly available data set from the Vienna Open Data server.
Here's a list of playgrounds in the city of Vienna, with some attributes and even geocoordinates:
https://www.data.gv.at/katalog/dataset/bd8b518b-d812-46f9-b367-4c1b660cfc99
When you go there, you see a list of documents in different formats, we're obviously interested in the JSON document, with this URL.
The simplest RapidMiner process to get the contents of this URL and process them with jq looks like this:
In Open File, you set the resource type to URL, and paste the URL of the JSON resource. In Read Document uncheck "extract text only", as we don't want to change the input. Then add a Process Document with jq operator and connect its input and output ports.
By default, Process Document with jq is set up for JSON output, indenting (formatting) the resulting JSON document, and with the simplest jq expression ".", which just copies (and formats) the incoming document.
So we get the first result from the process:
The document contains a kind of a header (type: FeatureCollection and totalFeatures from the GeoJSON standard), and an array of "features" (the playgrounds).
We're interested in the name (ANL_NAME), the playground details (SPIELPLATZ_DETAIL), and the geocoordinates of every playground.
To develop the jq expression, we go to jqplay.org and paste the JSON data into JSON field. The we interactively begin to develop the expression to select the data we want.
The first step is ".features". This selects the features array (discarding the header), and returns every element as an object of one large array.
If we change this to ".features[]", we get a list of different objects, which is better for further processing.
In jq, we use the pipe symbol | for processing steps. Now we list the elements we want after a pipe. The expression is:
.features[] | [.properties.ANL_NAME, .properties.SPIELPLATZ_DETAIL, .geometry.coordinates[0], .geometry.coordinates[1] ]
This gives us a nice flat structure that we can easily process with RapidMiner, especially if we let the operator convert it to CSV.
(Uncheck "first row as names" in the Read CSV operator, and change the separator to comma.)
The result is a normal RapidMiner example set:
Check out the documentation of jq if you need to process even more complex documents. jq offers additional functionality like extracting a variable length array of values (like tags) to a table structure, or counting elements, regular expression replacements, etc.
The first step is to install the extension from the Marketplace if you don't have it yet. Click the Extensions/Marketplace menu entry, and in the Marketplace window, enter JSON in the search box. Select the extension for installation, and let Studio restart itself when the installation is done.
Let's play with a publicly available data set from the Vienna Open Data server.
Here's a list of playgrounds in the city of Vienna, with some attributes and even geocoordinates:
https://www.data.gv.at/katalog/dataset/bd8b518b-d812-46f9-b367-4c1b660cfc99
When you go there, you see a list of documents in different formats, we're obviously interested in the JSON document, with this URL.
The simplest RapidMiner process to get the contents of this URL and process them with jq looks like this:
In Open File, you set the resource type to URL, and paste the URL of the JSON resource. In Read Document uncheck "extract text only", as we don't want to change the input. Then add a Process Document with jq operator and connect its input and output ports.
By default, Process Document with jq is set up for JSON output, indenting (formatting) the resulting JSON document, and with the simplest jq expression ".", which just copies (and formats) the incoming document.
So we get the first result from the process:
The document contains a kind of a header (type: FeatureCollection and totalFeatures from the GeoJSON standard), and an array of "features" (the playgrounds).
We're interested in the name (ANL_NAME), the playground details (SPIELPLATZ_DETAIL), and the geocoordinates of every playground.
To develop the jq expression, we go to jqplay.org and paste the JSON data into JSON field. The we interactively begin to develop the expression to select the data we want.
The first step is ".features". This selects the features array (discarding the header), and returns every element as an object of one large array.
If we change this to ".features[]", we get a list of different objects, which is better for further processing.
In jq, we use the pipe symbol | for processing steps. Now we list the elements we want after a pipe. The expression is:
.features[] | [.properties.ANL_NAME, .properties.SPIELPLATZ_DETAIL, .geometry.coordinates[0], .geometry.coordinates[1] ]
This gives us a nice flat structure that we can easily process with RapidMiner, especially if we let the operator convert it to CSV.
(Uncheck "first row as names" in the Read CSV operator, and change the separator to comma.)
The result is a normal RapidMiner example set:
Check out the documentation of jq if you need to process even more complex documents. jq offers additional functionality like extracting a variable length array of values (like tags) to a table structure, or counting elements, regular expression replacements, etc.
2
Comments
In paid/trial/educational it might be necessary to set "Grant additional permissions to unsigned extensions" in the Preferences on the Startup tab.
Just used the default values with operators "open file" and "read document"
It works in my Studio 9.7.002 with the aforementioned limitation (Grant additional permissions ...).
The library I'm using is doing some operations that RapidMiner doesn't like.
If the extension doesn't work for you, you can right-click the operator, "Open defining process", and use it as a subprocess with macro parametrization. You remove the first five operators that create macros from the operator parameters, and create the same macros in the process context. Then you save the process and use it in other processes by specifying the parameters with the Macros button.
Regards,
Balázs
thanks anyway!
I don't know anything about your process, nor the error message. I need more details to be able to help you.
In the process, the macros you define have to have the exact same names as they used to in the deleted parameter operators. If that's OK, we must find out what causes the error. It could be a wrong JQ expression (which you can check on jqplay.com), something with the data, or a problem with the script.
Regards,
Balázs
Thanks for this tutorial.
Has anyone been able to solve the permissions error that happens for the community edition?
I mean the one causing this error message:
Thanks,
Amnon Khen
More and more customers are using RapidMiner on servers or in the cloud where the scripting would allow things that aren't well defined in that environment.
You can try with "Grant additional permissions to unsigned extensions" in the Settings => Start-up form. If that doesn't work, you can try downgrading Studio to an older version.
My test process for the extension still works in my Studio 9.10.0, but there is no guarantee that the extension will still be usable in future versions. It would be cool to have this functionality in a native operator, but that requires a Java programmer which I am not.
Regards,
Balázs
I even tried writing my own script and RM blocked it due to security reasons.
This is very strange. It is my own code, running on my own computer.
It would be nice to be able to personally approve this specific extension on my computer.
Kind regards,
Amnon
I have a question about the Process document with Jq function.
I want to use read a Json file, but for some reason when I process it I get only the first message of the file. I compared the results after the read document function and the Process document with iq function and it seems that only 1 message gets through.
When I copy the file in jqplay.org, I do get the expected result. So I am not sure what I am doing wrong.
When I use the Json file from your example it seems to be working fine, so that is also strange.
Kind regards
Suzan
that's interesting. Can you add more details, like example contents of the file and the JQ expression you're using?
jqplay.org uses the original jq utility, but the RapidMiner extension uses a Java reimplementation, they might work differently in subtle ways.
Regards,
Balázs
"eventid": "94f707a8-ddde-46e2-99f1-3c83ca83499b",
"visitorid": "01761ba7286200062a2a96d59d1a03072013906a00bd0"
}
{
"eventid": "4b560bdf-de4b-45f1-a6c2-32cd8e68e2bc",
"visitorid": "0172dfb7b47f001ae614ab3b7c4801077001e06f0093c
}
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="open_file" compatibility="9.10.001" expanded="true" height="68" name="Open File" width="90" x="112" y="34">
<parameter key="resource_type" value="file"/>
<parameter key="filename" value="E:/somexample.json"/>
</operator>
<operator activated="true" class="text:read_document" compatibility="9.4.000" expanded="true" height="68" name="Read Document" width="90" x="246" y="34">
<parameter key="extract_text_only" value="false"/>
<parameter key="use_file_extension_as_type" value="true"/>
<parameter key="content_type" value="txt"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<operator activated="true" class="json_processing_with_jq:process_document_with_jq" compatibility="0.8.001" expanded="true" height="68" name="Process Document with jq" width="90" x="380" y="34">
<parameter key="indent_json" value="true"/>
<parameter key="jq_expression" value=".eventid"/>
<parameter key="output_format" value="json"/>
</operator>
<operator activated="true" class="text:write_document" compatibility="9.4.000" expanded="true" height="82" name="Write Document" width="90" x="514" y="34">
<parameter key="overwrite" value="true"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<operator activated="true" class="read_csv" compatibility="9.10.001" expanded="true" height="68" name="Read CSV" width="90" x="782" y="34">
<parameter key="column_separators" value=";"/>
<parameter key="trim_lines" value="false"/>
<parameter key="use_quotes" value="true"/>
<parameter key="quotes_character" value="""/>
<parameter key="escape_character" value="\"/>
<parameter key="skip_comments" value="false"/>
<parameter key="comment_characters" value="#"/>
<parameter key="starting_row" value="1"/>
<parameter key="parse_numbers" value="true"/>
<parameter key="decimal_character" value="."/>
<parameter key="grouped_digits" value="false"/>
<parameter key="grouping_character" value=","/>
<parameter key="infinity_representation" value=""/>
<parameter key="date_format" value=""/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="read_all_values_as_polynominal" value="true"/>
<list key="data_set_meta_data_information"/>
<parameter key="read_not_matching_values_as_missings" value="true"/>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/>
<connect from_op="Read Document" from_port="output" to_op="Process Document with jq" to_port="input 1"/>
<connect from_op="Process Document with jq" from_port="result 1" to_op="Write Document" to_port="document"/>
<connect from_op="Write Document" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Your JSON structure is actually not correct, objects need to be separated with commas.
The official JQ is tolerant enough with this.
The Java reimplementation used in the extension is apparently not.
You might want to fix the JSON. For example, replace "^\}$" with "\},$" in the string and then put [ ] around it so it results in a correct JSON array structure.
Regards,
Balázs
Regards
Anthony
Regards
Anthony
can you post the process or at least the relevant parameters of the jq operator and an example of the incoming data?
Does the same transformation work in JQPlay?
Regards,
Balázs
This is if I follow instructions in the post using the convert operator to CSV. If though I change the column seperator to "\n" to account for newlines. I get the following:
Please advise:
Environment: Windows 64bit, RM 9.10.001 (educational)
Your process almost works.
1. Check the encoding setting in Read Document. The source document is in UTF-8, and your operating system isn't, so the default setting (SYSTEM) is not sufficient on your computer. Change it to UTF-8 to avoid getting the strange special characters.
2. Change output format in Process Document with jq to CSV. The tutorial is supposed to explain how you get tabular data, and that's where the CSV output is more helpful.
3. In Read CSV change the separator back to , .
The process works for me with these changes. Your attempt deliver JSON (in the wrong encoding) and then tries to process that with Read CSV.
Regards,
Balázs
I managed to used the Process Document with jq, however, if the json file has null array how can I modify the operator so that it will ignore this null array and continue with the rest of the non-null arrays in the json file?
Thanks
Ivan