The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here

CSV with multiline field won't load up properly

avoronaavorona Member Posts: 2 Learner III
edited December 2018 in Help

I can not figure out how to import a CSV file with multiline values in them.

 

label,value
1,"a
b"
2,"a"

The RapidMinder complains about that with

Value quotes not closed at position 3.
Value quote misplaced at position 2.

How do I make it handle line breaks inside the quoted values?

Best Answer

  • jczogallajczogalla Employee-RapidMiner, Member Posts: 144 RM Engineering
    Solution Accepted

    Hi @avorona!

     

    Right now the Read CSV Operator does not support multiline values. You can however do something similar to what I described in this thread. Here is an example process that works with your samples.csv. (You'll need the text extension, which can be found on the marketplace.)

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="open_file" compatibility="9.0.003" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
    <parameter key="filename" value="sample.csv"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.2.000-SNAPSHOT" expanded="true" height="68" name="Read Document" width="90" x="179" y="34">
    <parameter key="extract_text_only" value="false"/>
    </operator>
    <operator activated="true" class="text:cut_document" compatibility="8.2.000-SNAPSHOT" expanded="true" height="68" name="Cut Document" width="90" x="313" y="34">
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="text" value="((?:[^&quot;]+?|&quot;(.|\n)*?&quot;|)*?)\n"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    <process expanded="true">
    <operator activated="true" class="text:remove_document_parts" compatibility="8.2.000-SNAPSHOT" expanded="true" height="68" name="Remove Document Parts" width="90" x="45" y="34">
    <parameter key="deletion_regex" value="&quot;"/>
    </operator>
    <connect from_port="segment" to_op="Remove Document Parts" to_port="document"/>
    <connect from_op="Remove Document Parts" from_port="document" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="8.2.000-SNAPSHOT" expanded="true" height="82" name="Documents to Data" width="90" x="447" y="34">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="split" compatibility="9.0.003" expanded="true" height="82" name="Split" width="90" x="715" y="34"/>
    <operator activated="true" class="rename_by_example_values" compatibility="9.0.003" expanded="true" height="82" name="Rename by Example Values" width="90" x="849" y="34"/>
    <operator activated="true" class="guess_types" compatibility="9.0.003" expanded="true" height="82" name="Guess Types" width="90" x="983" y="34">
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/>
    <connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
    <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Split" to_port="example set input"/>
    <connect from_op="Split" from_port="example set output" to_op="Rename by Example Values" to_port="example set input"/>
    <connect from_op="Rename by Example Values" from_port="example set output" to_op="Guess Types" to_port="example set input"/>
    <connect from_op="Guess Types" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I used a regular expression to cut the the plain text into rows, convert them to an example set and then split the text by ",". This has limitations, i.e. it only works if multiline values are in quotes, and only if there are no commata in any value.

    I hope this helps!

    Cheers,

    Jan

Answers

  • avoronaavorona Member Posts: 2 Learner III

    Thank you, that makes sense.

     

    Right now I am just pre-processing the CSV files with a simple python script. Having essentially the same as part of the flow would be better.

     

    Thanks!

Sign In or Register to comment.