How to create new examples by splitting at punctuation marks?

chrisniem · July 2012

Hi all!

I wonder whether it is possible to split an example containing text at punctuation marks. I have an example set with a text attribute and some metadata attributes. The text attribute contains many sentences. Here are two examples as a demonstration:

2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy. With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return. Others have been cutting their corn early to use for feed, a much less profitable venture."

What I want to do is to split the text attribute by e.g. "." while keeping the metadata for every sentence. The result would be 4 examples:

2012-05-04 Source1 Speaker1 Context1 "The unsettling prospects come at a time of growing uncertainty for the country’s economy."
2012-05-04 Source1 Speaker1 Context1 "With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06 Source2 Speaker2 Context2 "Already some farmers are watching their cash crops burn to the point of no return."
2012-05-06 Source2 Speaker2 Context2 "Others have been cutting their corn early to use for feed, a much less profitable venture."

Is there any way to do this? I tried to use tokenization, but it only delivers vectors (i.e. new attributes), not new examples. If I switch off vectorization, I cannot see any difference in the result set apart from "." being deleted in the text attribute.

Any help is much appreciated!

Thanks

Chris

MariusHelf · July 2012

Hi Chris,

you can use e.g. the Cut Document operator for this. You may have to tune the regular expression a bit, but the process below shows the general idea.

Best,
~Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="505" width="721">
      <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="120">
        <list key="attribute_values">
          <parameter key="meta" value="false"/>
          <parameter key="text" value="&quot;This is also a test. With two sentences.&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="30">
        <list key="attribute_values">
          <parameter key="meta" value="true"/>
          <parameter key="text" value="&quot;Test. Sentence. Blubb.&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
        <parameter key="keep_text" value="true"/>
        <list key="specify_weights"/>
        <process expanded="true" height="505" width="658">
          <operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries">
              <parameter key="t" value="\..\."/>
            </list>
            <list key="regular_expression_queries">
              <parameter key="t" value="([^\.]+)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="523" width="658">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|meta|text"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
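
As a rough illustration of what the regular expression in the Cut Document operator does, and of one possible way to tune it, here is a small Python sketch. It is not RapidMiner code: Python's `re` module stands in for the Java regex engine, the assumption is that each regex match becomes one segment (and thus one new example) with the metadata repeated, and `rows`, `split_rows` and the `TUNED` pattern are hypothetical names introduced only for this example.

```python
import re

# Hypothetical rows mirroring the two examples built in the process: (meta, text).
rows = [
    ("true", "Test. Sentence. Blubb."),
    ("false", "This is also a test. With two sentences."),
]

# The regular expression used in the Cut Document operator of the process.
ORIGINAL = re.compile(r"([^\.]+)")

# An assumed tuned variant (not from the thread): skip leading whitespace,
# then capture everything up to and including the next period.
TUNED = re.compile(r"\s*([^.]+\.)")


def split_rows(rows, pattern):
    """Return one (meta, segment) pair per regex match, repeating the metadata."""
    return [
        (meta, match.group(1))
        for meta, text in rows
        for match in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for label, pattern in (("original", ORIGINAL), ("tuned", TUNED)):
        print(label)
        for meta, segment in split_rows(rows, pattern):
            print(f"  meta={meta!r} text={segment!r}")
```

With the original expression the segments lose their period and keep the leading space (for instance `' Sentence'`), while the tuned variant yields `'Sentence.'`. Note that the tuned pattern still splits at every period, so abbreviations and decimal numbers would be cut as well, and text without a closing period would be dropped.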

chrisniem · July 2012

Hi Marius,

great, that will do it!

Thanks a lot!

Chris
