"[SOLVED]questions of reading csv files"
huaiyanggongzi
Member Posts: 39 Contributor II
When I use the “Read CSV” operator to import a CSV file, I run into the following problem.
If a cell contains a “,”, the text following it is not read. I think this is because “,” is used as the column separator. In this case, however, the “,” is just a character that appears inside a string. How can I make RapidMiner skip the “,” inside the string?
The following is the test CSV file, which contains just one row with two columns. The main content is a single text string. In the generated word list, the word "what" is missing because of the "," that appears before it.
ID | Text Field
1 | wow <Content>, what charm!
The following is the process
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="165">
<parameter key="csv_file" value="C:\Users\LocalRepository\Source_Data\test3.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="GBK"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="ID.true.integer.id"/>
<parameter key="1" value="Text Field.true.text.attribute"/>
</list>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="120">
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="120"/>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (3)" width="90" x="313" y="120"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (3)" width="90" x="447" y="120">
<parameter key="min_chars" value="1"/>
<parameter key="max_chars" value="200"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="255"/>
<connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
<connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
<connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
When ',' is not used to separate columns, you can simply change the column separator in the operator parameters to the character that actually separates the columns in your data. If you only have one column, you can set it to whatever you like (but make sure it never appears in your text). If you have something like "my, text" , 123
then you can keep ',' as the column separator character, but you have to set " as the quote character. Separator characters that appear between quote characters are not treated as separators and are kept as part of the text.
Regards,
Marco
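To illustrate the quote-character advice above outside RapidMiner, here is a minimal sketch using Python's standard `csv` module. The sample lines are modeled on the test file from the question, and Python's parser is only an assumed stand-in for the Read CSV operator, whose behavior may differ in detail.

```python
import csv
import io

# Sample rows modeled on the test file in the question (illustrative only).
unquoted = '1,wow <Content>, what charm!\n'
quoted = '1,"wow <Content>, what charm!"\n'


def parse(text):
    """Parse CSV text with ',' as separator and '"' as quote character."""
    reader = csv.reader(io.StringIO(text), delimiter=',', quotechar='"')
    return list(reader)


# Without quotes, the comma inside the text is treated as a separator,
# so the row splits into three fields instead of two.
print(parse(unquoted))

# With quotes, the comma stays part of the second field.
print(parse(quoted))
```

In the unquoted case the text after the comma lands in a third field. That would be consistent with the word "what" going missing when only two columns are declared.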
Suppose I have several columns that still use “,” as the column separator (because the data was generated as a CSV file). However, some cell entries contain strings like ABC,DEF
How do I handle this kind of scenario? Do I have to modify the CSV file and quote every such entry, for example change ABC,DEF to “ABC,DEF”?
Yes. A CSV file that contains ',' both as part of a string and as a separator character, without any quoting, is syntactically invalid. Such a file cannot be read correctly unless the strings are wrapped in quote characters, because otherwise the parser cannot tell which commas are separators and which are part of a literal.
Regards,
Marco
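If the file is generated by a script, one way to get the quoting described above is to let a CSV library write it. The following is a rough sketch with Python's standard `csv` module. The rows are hypothetical examples, and it assumes the consumer accepts `"` as the quote character.

```python
import csv
import io

# Hypothetical rows; in practice these come from whatever produces the file.
rows = [
    ["ID", "Text"],
    [1, "wow <Content>, what charm!"],
    [2, "ABC,DEF"],
]

buffer = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only fields that contain the separator,
# the quote character, or a line break.
writer = csv.writer(buffer, delimiter=",", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Only the fields that contain a comma end up quoted, so the rest of the file stays unchanged.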
ID,Text
1,wow <Content>, what charm!
I think you might be able to read in the data using RegEx (you could certainly also use RegEx and Notepad++ to clean it up).
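The regex cleanup idea could be sketched roughly as follows in Python. This rests on strong assumptions that are not confirmed by the thread: every data row starts with an integer ID, only the last (text) column contains commas, and the text is not already quoted. The helper name `quote_text_column` is made up for this illustration.

```python
import re

# Assumption: each data row starts with an integer ID, followed by one comma,
# and everything after that first comma belongs to the text column.
ROW_PATTERN = re.compile(r'^(\d+),(.*)$')


def quote_text_column(line):
    """Wrap the text column in double quotes, doubling any embedded quotes."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return line  # header or unexpected row: leave unchanged
    row_id, text = match.groups()
    escaped = text.replace('"', '""')
    return f'{row_id},"{escaped}"'


lines = ['ID,Text', '1,wow <Content>, what charm!']
repaired = [quote_text_column(line) for line in lines]
print("\n".join(repaired))
```

Doubling embedded quotes follows the common CSV convention. Whether the Read CSV operator expects that or a separate escape character should be checked against its parameters before relying on this.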