Best practice adding a new record(s) to example set.

JEdward · August 2013

Hello,

I'm going to feel pretty silly asking this, but what is the best practice for adding a record to an example set & setting the ID field to the next in the series?

ID	Names
1	Jack
2	Lucy
3	Simon
4	George

If I want to add "James" to this table with the ID value of 5 what would be the best practice using RapidMiner for adding the next record in the table.
In SQL I'd have the ID column set to autonumber when I add new entries, but I'm looking to do this only using RapidMiner because otherwise I would lose the metadata. If I aggregate & look for the maxvalue in the table I could then use this value + 1, but I'm not sure if this is the best way of doing this, it seems an extra step somehow. I suppose it would also be possible to use the SQL autonumber tables by having a pseudotable that just holds ID numbers in it, but this seems needlessly excessive (although for millions of records could have some speed advantages).

Any opinions?

awchisholm · August 2013

Hello

I've done something like this before. I've included an example.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="5.3.013" expanded="true" height="60" name="Generate Data" width="90" x="112" y="75"/>
      <operator activated="true" class="generate_id" compatibility="5.3.013" expanded="true" height="76" name="Generate ID" width="90" x="112" y="165"/>
      <operator activated="true" class="generate_data" compatibility="5.3.013" expanded="true" height="60" name="Generate Data (3)" width="90" x="112" y="300">
        <parameter key="number_examples" value="1"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.3.013" expanded="true" height="76" name="Generate ID (3)" width="90" x="112" y="390"/>
      <operator activated="true" class="append" compatibility="5.3.013" expanded="true" height="94" name="Append (2)" width="90" x="313" y="255"/>
      <operator activated="true" class="extract_macro" compatibility="5.3.013" expanded="true" height="60" name="Extract Macro (2)" width="90" x="447" y="255">
        <parameter key="macro" value="numberOfExamples"/>
        <parameter key="statistics" value="max"/>
        <parameter key="attribute_name" value="id"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="set_data" compatibility="5.3.013" expanded="true" height="76" name="Set Data" width="90" x="581" y="255">
        <parameter key="example_index" value="%{numberOfExamples}"/>
        <parameter key="attribute_name" value="id"/>
        <parameter key="value" value="%{numberOfExamples}"/>
        <list key="additional_values"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Append (2)" to_port="example set 1"/>
      <connect from_op="Generate Data (3)" from_port="output" to_op="Generate ID (3)" to_port="example set input"/>
      <connect from_op="Generate ID (3)" from_port="example set output" to_op="Append (2)" to_port="example set 2"/>
      <connect from_op="Append (2)" from_port="merged set" to_op="Extract Macro (2)" to_port="example set"/>
      <connect from_op="Extract Macro (2)" from_port="example set" to_op="Set Data" to_port="example set input"/>
      <connect from_op="Set Data" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Basically, it uses Append to join the example sets together and Extract Macro to work out how many examples are in the new example set. The Set Data operator then allows the id of the last example to be explicity set to the next value in the sequence. I don't know for certain but this last operator is probably quite slow so care would be needed for large volumes of data.

If this is a problem or if you wanted to append multiple new examples, you could modify the process so that the generated id started at an offset calculated from the number of examples in the original plus 1. The Set Data would not be needed in this case.

I'll leave that as an exercise for the reader

regards

Andrew

JEdward · August 2013

Thanks Andrew,

I'm reticent to use append as it states explicitly that it is not suited for merging large example sets. My own way of doing it currently is the below. It takes a name, checks that it does not already exist and if it doesn't creates it in the data table returning the record from the data table (with ID) that it finds.
It's a little complicated ::), what do you & the community think?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.012">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.012" expanded="true" height="60" name="Retrieve People" width="90" x="45" y="210">
        <parameter key="repository_entry" value="People"/>
      </operator>
      <operator activated="true" class="generate_data_user_specification" compatibility="5.3.012" expanded="true" height="60" name="New Data Entry" width="90" x="45" y="75">
        <list key="attribute_values">
          <parameter key="Person_Name" value="&quot;Some Test Data&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="5.3.012" expanded="true" height="148" name="Check for Matches &amp; Separate Streams" width="90" x="179" y="75">
        <process expanded="true">
          <operator activated="true" class="ida:extract_metadata" compatibility="5.1.000" expanded="true" height="76" name="extract_metadata" width="90" x="45" y="120"/>
          <operator activated="true" class="multiply" compatibility="5.3.012" expanded="true" height="94" name="Multiply (Old Data)" width="90" x="45" y="255"/>
          <operator activated="true" class="multiply" compatibility="5.3.012" expanded="true" height="112" name="Multiply (New Data)" width="90" x="112" y="30"/>
          <operator activated="true" class="join" compatibility="5.3.012" expanded="true" height="76" name="Join - Check For Matches" width="90" x="246" y="120">
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="Person_Name" value="Person_Name"/>
            </list>
          </operator>
          <connect from_port="in 1" to_op="Multiply (New Data)" to_port="input"/>
          <connect from_port="in 2" to_op="extract_metadata" to_port="example set input"/>
          <connect from_op="extract_metadata" from_port="example set output" to_port="out 1"/>
          <connect from_op="extract_metadata" from_port="original" to_op="Multiply (Old Data)" to_port="input"/>
          <connect from_op="Multiply (Old Data)" from_port="output 1" to_op="Join - Check For Matches" to_port="right"/>
          <connect from_op="Multiply (Old Data)" from_port="output 2" to_port="out 4"/>
          <connect from_op="Multiply (New Data)" from_port="output 1" to_op="Join - Check For Matches" to_port="left"/>
          <connect from_op="Multiply (New Data)" from_port="output 2" to_port="out 3"/>
          <connect from_op="Multiply (New Data)" from_port="output 3" to_port="out 5"/>
          <connect from_op="Join - Check For Matches" from_port="join" to_port="out 2"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="source_in 3" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
          <portSpacing port="sink_out 4" spacing="0"/>
          <portSpacing port="sink_out 5" spacing="0"/>
          <portSpacing port="sink_out 6" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="branch" compatibility="5.3.012" expanded="true" height="112" name="Branch" width="90" x="313" y="75">
        <description>Handle New Data</description>
        <parameter key="condition_type" value="min_examples"/>
        <parameter key="condition_value" value="1"/>
        <process expanded="true">
          <connect from_port="condition" to_port="input 1"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="source_input 3" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="extract_macro" compatibility="5.3.012" expanded="true" height="60" name="Extract Macro" width="90" x="45" y="30">
            <parameter key="macro" value="MaxID"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="max"/>
            <parameter key="example_index" value="2"/>
            <list key="additional_macros"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="5.3.012" expanded="true" height="94" name="Generate New ID" width="90" x="45" y="165">
            <process expanded="true">
              <operator activated="true" class="generate_attributes" compatibility="5.3.012" expanded="true" height="76" name="Generate Attributes" width="90" x="45" y="75">
                <list key="function_descriptions">
                  <parameter key="ID" value="%{MaxID}+1"/>
                </list>
              </operator>
              <operator activated="true" class="real_to_integer" compatibility="5.3.012" expanded="true" height="76" name="Real to Integer" width="90" x="179" y="75">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="ID"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="set_role" compatibility="5.3.012" expanded="true" height="76" name="Set Role" width="90" x="246" y="210">
                <parameter key="attribute_name" value="ID"/>
                <parameter key="target_role" value="id"/>
                <list key="set_additional_roles"/>
              </operator>
              <connect from_port="in 1" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_op="Real to Integer" to_port="example set input"/>
              <connect from_op="Real to Integer" from_port="example set output" to_op="Set Role" to_port="example set input"/>
              <connect from_op="Set Role" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="source_in 3" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Generate New ID" to_port="in 1"/>
          <connect from_port="input 2" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Generate New ID" to_port="in 2"/>
          <connect from_op="Generate New ID" from_port="out 1" to_port="input 1"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="source_input 3" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="join" compatibility="5.3.012" expanded="true" height="76" name="Join New Data With Old" width="90" x="313" y="210">
        <parameter key="join_type" value="outer"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Person_Name" value="Person_Name"/>
          <parameter key="ID" value="ID"/>
        </list>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.012" expanded="true" height="60" name="Store" width="90" x="447" y="210">
        <parameter key="repository_entry" value="People"/>
      </operator>
      <operator activated="true" class="join" compatibility="5.3.012" expanded="true" height="76" name="Join New Data With Old (2)" width="90" x="581" y="75">
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="Person_Name" value="Person_Name"/>
        </list>
      </operator>
      <connect from_op="Retrieve People" from_port="output" to_op="Check for Matches &amp; Separate Streams" to_port="in 2"/>
      <connect from_op="New Data Entry" from_port="output" to_op="Check for Matches &amp; Separate Streams" to_port="in 1"/>
      <connect from_op="Check for Matches &amp; Separate Streams" from_port="out 1" to_op="Branch" to_port="input 2"/>
      <connect from_op="Check for Matches &amp; Separate Streams" from_port="out 2" to_op="Branch" to_port="condition"/>
      <connect from_op="Check for Matches &amp; Separate Streams" from_port="out 3" to_op="Branch" to_port="input 1"/>
      <connect from_op="Check for Matches &amp; Separate Streams" from_port="out 4" to_op="Join New Data With Old" to_port="left"/>
      <connect from_op="Check for Matches &amp; Separate Streams" from_port="out 5" to_op="Join New Data With Old (2)" to_port="left"/>
      <connect from_op="Branch" from_port="input 1" to_op="Join New Data With Old" to_port="right"/>
      <connect from_op="Join New Data With Old" from_port="join" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_op="Join New Data With Old (2)" to_port="right"/>
      <connect from_op="Join New Data With Old (2)" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

awchisholm · August 2013

Hello

It does more than my example because I always assume the new record is to be added. So this makes it more complex - it also contains an operator I don't have so I can't run it.

However I can see a cunning use of Joins to find if the new record already exists and if not creating it with the new id. The subsequent outer Join may build everything in memory so we would need to do an experiment to see if it outperforms Append.

regards

Andrew

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Best practice adding a new record(s) to example set.

Answers