Extracting text from a record
paul_balas
Member Posts: 11 Contributor II
in Help
Hi,
Is there an easy control to use to extract the text from the following field:
{
"data": {
"translations": [
{
"translatedText": "020114 - SECURITAS - Security - AE Menor - 14x7 - Van",
"detectedSourceLanguage": "es"
}
]
}
}
I want to extract just the following text: 020114 - SECURITAS - Security - AE Menor - 14x7 - Van
Best Answers
sgenzer
Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
Hi all. So yes, that is very good "sorcery", @rfuentealba. It is needed because the existing JSON parsing tools are currently out of date. There are some updates in the pipeline, but to be honest, I would STRONGLY suggest just using Old World Computing's new Web Automation extension (from the marketplace). I can do your JSON parsing very elegantly in about 2 min like this:
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" breakpoints="after" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
<parameter key="text" value="{ &quot;data&quot;: { &quot;translations&quot;: [ { &quot;translatedText&quot;: &quot;020114 - SECURITAS - Security - AE Menor - 14x7 - Van&quot;, &quot;detectedSourceLanguage&quot;: &quot;es&quot; } ] } }"/>
<parameter key="add label" value="false"/>
<parameter key="label_type" value="nominal"/>
<description align="center" color="transparent" colored="false" width="126">this is your JSON</description>
</operator>
<operator activated="true" breakpoints="after" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
<parameter key="text_attribute" value="json"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="rmx_webautomation:process_json_object" compatibility="2.2.431" expanded="true" height="82" name="Process Object" width="90" x="179" y="238">
<process expanded="true">
<operator activated="true" class="rmx_webautomation:process_json_object" compatibility="2.2.431" expanded="true" height="82" name="Process Object (2)" width="90" x="179" y="34">
<parameter key="property_name" value="data"/>
<process expanded="true">
<operator activated="true" class="rmx_webautomation:process_json_array" compatibility="2.2.431" expanded="true" height="82" name="Process Array" width="90" x="179" y="34">
<parameter key="property_name" value="translations"/>
<parameter key="array_type" value="objects"/>
<parameter key="create_id_attribute" value="false"/>
<process expanded="true">
<operator activated="true" class="rmx_webautomation:extract_json_properties" compatibility="2.2.431" expanded="true" height="82" name="Extract Properties" width="90" x="112" y="34">
<list key="extract_properties">
<parameter key="translatedText" value="translatedText.polynominal"/>
</list>
<parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
<parameter key="time zone" value="SYSTEM"/>
</operator>
<operator activated="true" class="rmx_webautomation:commit_row" compatibility="2.2.431" expanded="true" height="82" name="Commit Row" width="90" x="246" y="34"/>
<connect from_port="parse specification" to_op="Extract Properties" to_port="parse specifications 1"/>
<connect from_op="Extract Properties" from_port="parse specifications 1" to_op="Commit Row" to_port="parse specifications 1"/>
<connect from_op="Commit Row" from_port="parse specifications 1" to_port="parse specifications 1"/>
<portSpacing port="source_parse specification" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_parse specifications 1" spacing="0"/>
<portSpacing port="sink_parse specifications 2" spacing="0"/>
</process>
</operator>
<connect from_port="parse specification" to_op="Process Array" to_port="parse specification"/>
<connect from_op="Process Array" from_port="parse specifications 1" to_port="parse specifications 1"/>
<portSpacing port="source_parse specification" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_parse specifications 1" spacing="0"/>
<portSpacing port="sink_parse specifications 2" spacing="0"/>
</process>
</operator>
<connect from_port="parse specification" to_op="Process Object (2)" to_port="parse specification"/>
<connect from_op="Process Object (2)" from_port="parse specifications 1" to_port="parse specifications 1"/>
<portSpacing port="source_parse specification" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_parse specifications 1" spacing="0"/>
<portSpacing port="sink_parse specifications 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="rmx_webautomation:parse_json_data" compatibility="2.2.431" expanded="true" height="103" name="Parse JSON from Data" width="90" x="380" y="85">
<parameter key="attribute" value="json"/>
<parameter key="keep_example_set" value="false"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Parse JSON from Data" to_port="example set"/>
<connect from_op="Process Object" from_port="parse specifications 1" to_op="Parse JSON from Data" to_port="parse specifications 1"/>
<connect from_op="Parse JSON from Data" from_port="example set 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Scott
paul_balas
Member Posts: 11 Contributor II
Much easier! Disappointing that some of these controls are so buggy. This solved a problem I've been struggling with for about 4 hours. Thank you!
Answers
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
<parameter key="file" value="/Users/master/files/text.json"/>
<parameter key="extract_text_only" value="false"/>
<parameter key="use_file_extension_as_type" value="true"/>
<parameter key="content_type" value="txt"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="34">
<parameter key="query_type" value="JsonPath"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Nominal"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="true"/>
<parameter key="assume_html" value="true"/>
<list key="index_queries"/>
<list key="jsonpath_queries">
<parameter key="translated" value="$.data.translations[*].translatedText"/>
</list>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="34">
<parameter key="text_attribute" value="text"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
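The process chains three operators: Read Document, Cut Document, and Documents to Data.
Cut Document uses the JSONPath query $.data.translations[*].translatedText to extract the information.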
You can use this site to explore and understand how to use jsonPath: http://jsonpath.com/
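To make the JSONPath query easier to read, here is a minimal plain-Python sketch of what `$.data.translations[*].translatedText` selects from the JSON in the original question. It only uses the standard `json` module and hand-written indexing; it is an illustration of the path's meaning, not how the Cut Document operator is implemented, and the variable names are my own.

```python
import json

# The JSON from the original question, as a string.
raw = """
{
  "data": {
    "translations": [
      {
        "translatedText": "020114 - SECURITAS - Security - AE Menor - 14x7 - Van",
        "detectedSourceLanguage": "es"
      }
    ]
  }
}
"""

doc = json.loads(raw)

# $                   -> doc (the root of the document)
# .data.translations  -> doc["data"]["translations"]
# [*]                 -> every element of that array
# .translatedText     -> the "translatedText" property of each element
translated = [item["translatedText"] for item in doc["data"]["translations"]]

print(translated)
```

Because of the `[*]` wildcard, the result is a list with one entry per element of `translations`, so the same query keeps working if the response contains several translations.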
The goal is to extract the text "Incident with Vehicle", excluding any of the other text before or after it.
Here is my process, which complains that the attribute doesn't exist (but it clearly does). And here is my regex:
(?<=Text": ")(.*)(?=",)
It correctly extracts the text I'm after from the above example.
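For anyone who wants to check the regex outside RapidMiner, here is a small Python sketch using the standard `re` module. The first input is the `translatedText` line from the JSON in the original question; the second, compact record is hypothetical (its `"model"` field and values are invented for illustration) and only shows where a greedy `.*` could overshoot.

```python
import re

# The poster's pattern, unchanged: capture everything between `Text": "` and `",`.
greedy = re.compile(r'(?<=Text": ")(.*)(?=",)')

# One line of the pretty-printed JSON from the original question.
line = '"translatedText": "020114 - SECURITAS - Security - AE Menor - 14x7 - Van",'
print(greedy.search(line).group(1))

# Hypothetical compact record with extra fields (invented for this example).
compact = '{"translatedText": "Incident with Vehicle", "detectedSourceLanguage": "en", "model": "nmt"}'

# Greedy `.*` runs to the LAST `",` on the line, so it captures too much here.
print(greedy.search(compact).group(1))

# Lazy `.*?` stops at the FIRST `",` after the lookbehind instead.
lazy = re.compile(r'(?<=Text": ")(.*?)(?=",)')
print(lazy.search(compact).group(1))
```

Even the lazy form would cut a value short if the text itself contained `",`, which is one reason parsing the JSON properly tends to be more robust than a regex.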
Here is the 'Extract Description' transform which precedes it showing that I can reference the attribute:
I'm also confused about why the 'Process Object' control contains another embedded 'Process Object' control, then the 'Process Array', and finally the controls to 'Extract Properties' (and I'm still unsure what 'Commit Row' does).
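One way to read the nesting is that each operator level mirrors one level of the JSON itself: the root object, then the `data` object, then the `translations` array, then the properties of each array element. The plain-Python sketch below is only an analogy built on that reading. It does not use the Web Automation extension, and the mapping of 'Commit Row' to "append one row to the result" is my assumption from the operator's name and position, not something confirmed in this thread.

```python
import json

# The JSON from the original question, on one line.
raw = (
    '{"data": {"translations": [{"translatedText": '
    '"020114 - SECURITAS - Security - AE Menor - 14x7 - Van", '
    '"detectedSourceLanguage": "es"}]}}'
)

rows = []  # stands in for the resulting example set

root = json.loads(raw)                # outer 'Process Object': the root { ... }
data = root["data"]                   # inner 'Process Object', property_name="data"
for element in data["translations"]:  # 'Process Array', property_name="translations"
    # 'Extract Properties': pick the wanted property out of the current element.
    row = {"translatedText": element["translatedText"]}
    # 'Commit Row' (assumed meaning): emit the collected values as one row.
    rows.append(row)

print(rows)
```

Read this way, the two nested 'Process Object' levels exist simply because the JSON has two nested objects before the array is reached.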
Another strange behavior: after 'Parse JSON from Data', I can't reduce the attributes passed through (I selected 'keep example set', which passes through all the attributes).
I strongly recommend reading our three blog posts about the extension: