
Does RapidMiner have "Normalize White Space" in text processing?

nawafpower Member Posts: 34 Contributor II
edited May 2019 in Help
Hi everybody,
I wonder whether RapidMiner has "Normalize White Space" among its built-in functions. I am trying to preprocess text documents by normalizing the case (to lower case) and normalizing the white space in the text files. If anybody can help with this, it would be great.
Thanks

Answers

  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Sorry, I did not get what you are after. Could you give an example of a text before and after the desired transformation, together with a description of what happens in between?

    Cheers,
    Ingo
  • nawafpower Member Posts: 34 Contributor II
    Hi,
    By "Normalize White Space" I mean "removing any leading or trailing space and reducing any internal white space to one space character per occurrence". It is available in the JGAAP application by Patrick Juola. I found that this preprocessing step is very important in the text classification process, and I need to implement it in RapidMiner if possible.
    Regards
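The definition quoted above, combined with the lower-casing step from the original question, can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the intended transformation, not JGAAP's or RapidMiner's implementation; the function names and the sample string are made up for the example.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Remove leading/trailing whitespace and reduce every internal
    run of whitespace (spaces, tabs, newlines) to a single space."""
    return re.sub(r"\s+", " ", text.strip())

def normalize(text: str) -> str:
    """Normalize whitespace, then unify case to lower case."""
    return normalize_whitespace(text).lower()

# Hypothetical sample text with leading, trailing and repeated whitespace.
sample = "  The   Quick\tBrown \n Fox  "
print(repr(normalize(sample)))  # 'the quick brown fox'
```

The order of the two steps does not matter here, because lower-casing never changes whitespace characters.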
  • el_chief Member Posts: 63 Contributor II
    Most text classification processes will tokenize the document, rendering white space removal pointless.

    But if for some reason you really needed to do it, it could be accomplished with one line of Groovy script.

    Why do you need to do this?
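The point about tokenization can be checked with a small Python sketch. It assumes a simple tokenizer that keeps runs of letters, which is only a stand-in and not the exact behavior of any RapidMiner operator. With such a tokenizer the word tokens are the same with or without whitespace normalization. Character n-grams, shown here as one example of a feature that works on raw characters, do change, which may be one reason the step can still matter in some setups.

```python
import re

def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: every run of letters is one token."""
    return re.findall(r"[A-Za-z]+", text)

def char_ngrams(text: str, n: int = 3) -> set[str]:
    """All character n-grams of the text, whitespace included."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

raw = "It  was   the best"
normalized = " ".join(raw.split())  # "It was the best"

# Word tokens are unaffected by the extra spaces.
print(tokenize(raw) == tokenize(normalized))        # True

# Character 3-grams differ, e.g. "t  " exists only in the raw text.
print(char_ngrams(raw) == char_ngrams(normalized))  # False
```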
  • nawafpower Member Posts: 34 Contributor II
    Hi Neil,
    I have been playing with JGAAP and found that the best results for authorship attribution came with normalized whitespace and unified case. You mentioned that one line of code would do this: how can I do my own programming within the RapidMiner GUI? I asked on your YouTube channel whether you could make a short video on authorship. Maybe you don't have time, but if you can, it would be great.
    I value your notes, Neil; they have always been helpful.
  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Well, you could combine the operator "Trim" (which removes leading and trailing white space) with "Replace" (which replaces any "surviving" run of white space by a single space) for this task. Please note that these two operators work on attributes (and not on documents or tokens), so you would have to perform the transformation before you use the text processing operators.

    Below you will find a sample process which demonstrates the two operators.

    "how can I do own programming with Rapid Miner GUI?"
    There is a white paper in our shop which explains that:

    http://rapid-i.com/component/page,shop.product_details/flypage,flypage.tpl/product_id,52/category_id,5/option,com_virtuemart/Itemid,180/

    Cheers,
    Ingo

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="235" width="413">
          <operator activated="true" breakpoints="after" class="subprocess" compatibility="5.1.008" expanded="true" height="76" name="Subprocess" width="90" x="45" y="30">
            <process expanded="true" height="674" width="924">
              <operator activated="true" class="retrieve" compatibility="5.1.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
                <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.1.008" expanded="true" height="76" name="Replace" width="90" x="313" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
                <parameter key="replace_what" value="-"/>
                <parameter key="replace_by" value="            "/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.1.008" expanded="true" height="76" name="Replace (2)" width="90" x="447" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
                <parameter key="replace_what" value="(.*)"/>
                <parameter key="replace_by" value="            $1          "/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="5.1.008" expanded="true" height="76" name="Filter Examples" width="90" x="581" y="30">
                <parameter key="condition_class" value="no_missing_attributes"/>
              </operator>
              <operator activated="true" class="nominal_to_text" compatibility="5.1.008" expanded="true" height="76" name="Nominal to Text" width="90" x="715" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="vacation"/>
              </operator>
              <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Replace" to_port="example set input"/>
              <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
              <connect from_op="Replace (2)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
              <connect from_op="Nominal to Text" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="trim" compatibility="5.1.008" expanded="true" height="76" name="Trim" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="vacation"/>
          </operator>
          <operator activated="true" class="replace" compatibility="5.1.008" expanded="true" height="76" name="Replace (3)" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="vacation"/>
            <parameter key="replace_what" value="\s+"/>
            <parameter key="replace_by" value=" "/>
          </operator>
          <connect from_op="Subprocess" from_port="out 1" to_op="Trim" to_port="example set input"/>
          <connect from_op="Trim" from_port="example set output" to_op="Replace (3)" to_port="example set input"/>
          <connect from_op="Replace (3)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
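For readers who want to follow the data flow of the sample process without opening RapidMiner, here is a rough Python trace of the same steps on plain strings. The attribute values are hypothetical stand-ins, not the actual contents of the Labor-Negotiations sample, and the helper names are made up. The padding step is a simplified imitation of the two "Replace" operators in the subprocess, which only exist to create messy input.

```python
import re

# Hypothetical nominal values standing in for the "vacation" attribute.
values = ["below-average", "average", "generous"]

def add_noise(value: str) -> str:
    """Imitate the subprocess: turn '-' into 12 spaces and pad both ends."""
    return " " * 12 + value.replace("-", " " * 12) + " " * 10

def trim(value: str) -> str:
    """Counterpart of the 'Trim' operator: drop leading/trailing whitespace."""
    return value.strip()

def collapse(value: str) -> str:
    """Counterpart of 'Replace (3)': replace_what=\\s+, replace_by=' '."""
    return re.sub(r"\s+", " ", value)

noisy = [add_noise(v) for v in values]
clean = [collapse(trim(v)) for v in noisy]
print(clean)  # ['below average', 'average', 'generous']

# Without the trim step, a single leading and trailing space would survive.
print(repr(collapse(noisy[1])))  # ' average '
```

The second print shows why both operators are needed: collapsing alone reduces the padding to one space at each end but does not remove it.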