Splitting data

frankie · February 2011

Hi,

I have what I consider a simple problem, but due to poor understanding or perhaps poor documentation I cannot figure out how to split a dataset of, say, 1000 observations into two separate datasets of, say, 700 and 300 observations respectively. That is, I need an operator that has one input and two outputs.

Is this done with the "Split Data" operator? If so, what are these "partitions" I need to define?
The split should be random, preferably with a predefined seed for reproducibility.

-frankie

earmijo · February 2011

Frankie:

Yes, you can do it easily in RM. Take a look at the code below. It uses the "Split Data" operator to split the Iris dataset into two partitions, 70% and 30%. You enter these ratios by clicking the "Edit Enumeration" button for the "partitions" parameter. Notice that you could have k partitions by adding k ratios.

If you select the "use local random seed" option, the partitions will be the same in repeated runs.

Hope this helps.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
    <process expanded="true" height="179" width="346">
      <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="74" y="62">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="5.1.001" expanded="true" height="94" name="Split Data" width="90" x="246" y="75">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_port="result 1"/>
      <connect from_op="Split Data" from_port="partition 2" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
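
To illustrate the "k partitions by adding k ratios" remark, here is a rough sketch of how the Split Data operator might look for a three-way 60/20/20 split. It is only a fragment, not a full process. The `local_random_seed` key, its value of 1992, and the "partition 3" port name are assumptions extrapolated from the process above, so check them against your own RapidMiner version.

```xml
<!-- Sketch only: a three-way 60/20/20 split with an explicit seed. -->
<!-- Assumptions (not in the original post): the local_random_seed key, -->
<!-- the seed value 1992, and the "partition 3" port name. -->
<operator activated="true" class="split_data" compatibility="5.1.001" expanded="true" name="Split Data">
  <enumeration key="partitions">
    <!-- one ratio per output partition -->
    <parameter key="ratio" value="0.6"/>
    <parameter key="ratio" value="0.2"/>
    <parameter key="ratio" value="0.2"/>
  </enumeration>
  <parameter key="use_local_random_seed" value="true"/>
  <parameter key="local_random_seed" value="1992"/>
</operator>
<!-- extra connection for the third partition, alongside the two shown above -->
<connect from_op="Split Data" from_port="partition 3" to_port="result 3"/>
```

If the fragment is right, each added ratio should give the operator one more output port, and keeping the seed fixed should reproduce the same partitions on every run.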

frankie · February 2011

Thank you!
