[SOLVED] Questions about reading CSV files

huaiyanggongzi · July 2014

When I use the “Read CSV” operator to import a CSV file, I run into the following problem.

If a cell contains a “,”, the text following it is not read. I think this is because “,” is used as the column separator. In this case, though, the “,” is just a character that appears inside a string. How can I make RapidMiner treat this “,” as part of the string?

Below is the test CSV file, which contains just one row with two columns. The main content is a single text string. In the generated word list, the word "what" is missing because of the "," that appears before it.

ID Text Field
1 wow <Content>, what charm!

The following is the process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="165">
        <parameter key="csv_file" value="C:\Users\LocalRepository\Source_Data\test3.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="GBK"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="ID.true.integer.id"/>
          <parameter key="1" value="Text Field.true.text.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="120">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="120"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (3)" width="90" x="313" y="120"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (3)" width="90" x="447" y="120">
            <parameter key="min_chars" value="1"/>
            <parameter key="max_chars" value="200"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="255"/>
          <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
          <connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
          <connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
          <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Marco_Boeck · July 2014

Hi,

If ',' is not what separates your columns, simply change the column separator in the operator parameters to the character that actually separates columns in your data. If you only have one column, you can set it to whatever you like (but make sure it never appears in your text). If you have something like "my, text" , 123
then you can keep ',' as the column separator, but you have to set " as the quote character. Separator characters that appear between quote characters are not treated as separators and are kept as part of the text.
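
To illustrate the quoting rule outside RapidMiner, here is a minimal sketch using Python's standard `csv` module. The sample data is an assumption: it is the test row from the question with the text cell wrapped in double quotes. It only shows how a quote-aware parser treats the comma; it is not how Read CSV is implemented.

```python
import csv
import io

# Assumed sample: the question's test row, with the text cell wrapped in
# double quotes because it contains the separator character.
raw = 'ID,Text Field\n1,"wow <Content>, what charm!"\n'

# ',' stays the column separator; '"' is declared as the quote character.
reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"')
rows = list(reader)

for row in rows:
    print(row)
```

With the quote character set, the data row comes back as two fields, and the comma stays inside the text field.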

Regards,
Marco

huaiyanggongzi · July 2014

Marco, Thanks.

Suppose I have several columns that still use “,” as the column separator (because the data was generated as a CSV file). However, some cell entries contain strings like ABC,DEF
How do I handle this kind of scenario? Do I have to modify the CSV file and wrap every such entry in quotes, e.g. turn ABC,DEF into "ABC,DEF"?

Marco_Boeck · July 2014

Hi,

Yes. A CSV file that contains ',' both as part of a string and as the separator character, without any quoting, is syntactically invalid. Such a file cannot be read reliably unless the strings are wrapped in quote characters, because the parser has no other way to tell a separator from part of a literal.
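
As a rough sketch of the difference, again with Python's standard `csv` module rather than RapidMiner: the unquoted row from the question splits into three fields, while a CSV writer that quotes where needed produces a file any quote-aware reader can parse. This assumes you can regenerate the file from the original data.

```python
import csv
import io

# The unquoted row is ambiguous: the parser sees three fields, not two.
unquoted = "1,wow <Content>, what charm!\n"
fields = next(csv.reader(io.StringIO(unquoted)))
print(fields)

# Writing the same data with a CSV writer adds quotes only where a field
# contains the separator (QUOTE_MINIMAL is also the module's default).
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["ID", "Text Field"])
writer.writerow([1, "wow <Content>, what charm!"])
quoted_text = buffer.getvalue()
print(quoted_text)
```

If the file comes from another tool, it is worth checking whether its CSV export has a quoting option before editing the file by hand.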

Regards,
Marco

JEdward · July 2014

Actually, if the file has only two columns, e.g.

ID,Text
1,wow <Content>, what charm!

I think you might be able to read in the data using RegEx (you could certainly also use RegEx and Notepad++ to clean it up).
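
A minimal sketch of that idea in Python, assuming every line has exactly two columns, the first column never contains a comma, and the file has no quote characters around fields yet. The function names `split_first_comma` and `quote_text_column` are made up for this example.

```python
import re

# Everything up to the first comma is column 1; the rest of the line,
# including any further commas, is column 2.
LINE_PATTERN = re.compile(r"^([^,]*),(.*)$")


def split_first_comma(line):
    """Split a two-column line at its first comma only."""
    match = LINE_PATTERN.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"no comma found in line: {line!r}")
    return match.group(1), match.group(2)


def quote_text_column(line):
    """Rewrite a line so that the second column is wrapped in double quotes."""
    first, rest = split_first_comma(line)
    escaped = rest.replace('"', '""')  # double any quotes inside the text
    return f'{first},"{escaped}"'


lines = ["ID,Text", "1,wow <Content>, what charm!"]
repaired = [quote_text_column(line) for line in lines]
for line in repaired:
    print(line)
```

The repaired lines can then be read with ',' as the column separator and " as the quote character. The same pattern `^([^,]*),(.*)$` with the replacement `\1,"\2"` should also work in the Notepad++ regex search-and-replace, as long as the text contains no double quotes.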
