
Complex Data Preparation

Ilya Member Posts: 3 Contributor I
edited November 2018 in Help
Hello everyone,

First of all, it's important to say that I've been following this forum for some time now, and it helped me a lot – so thank you!
Now it's finally my turn to ask for help, and I really hope you could help me out  :D

I'm working on a project requiring machine learning, currently using an SVM model in Weka, while the data preparation is done in code.
Now I am tasked with transferring all of the coded data preparation into RM, but I'm having difficulties with it.

I'll try to simplify the problem.
Let's say we are trying to predict which students will be suitable for the high school basketball team, using the age and height as attributes.

Basically, I'm creating features for the SVM from every combination of threshold values for the attributes. In this case there are two attributes (in reality I'm currently up to four, possibly more to come…)

foreach student : exampleSet // from repository
    foreach age : constAgeArray // [8, 9, 10]
        foreach height : constHeightArray // [130, 135, 140]
            if (student.age < age && student.height > height)
                // set feature BASKETBALL_POTENTIAL_{age}_{height} = 1
            else
                // set feature BASKETBALL_POTENTIAL_{age}_{height} = 0
1. I've tried all of the Loop operators to create nested loops, but the process became extremely cumbersome and eventually did not work.
2. Is it possible to define const-arrays in RM? Right now I'm using additional exampleSets as arrays…
3. Should I even use RM for this kind of data preparation? Or is the best practice to do it by other means and import the result into RM for further use (i.e. classification and regression)?
4. I would be really grateful if someone could give an RM example for the above basketball data preparation  ;)

Thanks in advance!!

Answers

  • Ilya Member Posts: 3 Contributor I
    Guys, this is still relevant...
    I would really appreciate your help.
  • Marco_Boeck Administrator, Moderator, Employee-RapidMiner, Member, University Professor Posts: 1,996 RM Engineering
    Hi,

    You can probably do this with existing operators, but I think the process would be quite complex.
    In this case I'd actually recommend the "Execute Script" operator (unless you want to run this on the Server often). I have created a small example of how this could look:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.001">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.1.001" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="retrieve" compatibility="6.1.001" expanded="true" height="60" name="Retrieve CustomFeatureCreationData" width="90" x="45" y="30">
           <parameter key="repository_entry" value="CustomFeatureCreationData"/>
         </operator>
         <operator activated="true" class="execute_script" compatibility="6.1.001" expanded="true" height="94" name="Execute Script" width="90" x="179" y="30">
           <parameter key="script" value="import java.util.LinkedList;&#10;import java.util.List;&#10;&#10;import com.rapidminer.example.Attribute;&#10;import com.rapidminer.example.ExampleSet;&#10;import com.rapidminer.example.table.AttributeFactory;&#10;import com.rapidminer.example.table.DoubleArrayDataRow;&#10;import com.rapidminer.example.table.MemoryExampleTable;&#10;import com.rapidminer.tools.Ontology;&#10;&#10;// grab input data&#10;ExampleSet exampleSet = input[0];&#10;&#10;// define constants over which to loop below&#10;int[] ageArray = new int[3];&#10;ageArray[0] = 8;&#10;ageArray[1] = 9;&#10;ageArray[2] = 10;&#10;int[] heightArray = new int[3];&#10;heightArray[0] = 130;&#10;heightArray[1] = 135;&#10;heightArray[2] = 140;&#10;&#10;// loop over all examples (aka rows) in the data&#10;for (Example example : exampleSet) {&#10;&#9;// loop over all constant arrays&#10;&#9;for (int i=0; i&lt;ageArray.length; i++) {&#10;&#9;&#9;for (int j=0; j&lt;heightArray.length; j++) {&#10;&#9;&#9;&#9;// grab data from example&#10;&#9;&#9;&#9;int age = (int) example.getValue(exampleSet.getAttributes().get(&quot;Age&quot;));&#10;&#9;&#9;&#9;int height = (int) example.getValue(exampleSet.getAttributes().get(&quot;Height&quot;));&#10;&#10;&#9;&#9;&#9;// check if attribute (aka column) for this age/height threshold pair already exists&#10;&#9;&#9;&#9;String attName = &quot;BASKETBALL_POTENTIAL_&quot; + ageArray[i] + &quot;_&quot; + heightArray[j];&#10;&#9;&#9;&#9;Attribute newAtt = exampleSet.getAttributes().get(attName);&#10;&#9;&#9;&#9;if (newAtt == null) {&#10;&#9;&#9;&#9;&#9;// does not yet exist, create it&#10;&#9;&#9;&#9;&#9;newAtt = AttributeFactory.createAttribute(attName, Ontology.ATTRIBUTE_VALUE_TYPE.NUMERICAL);&#10;&#9;&#9;&#9;&#9;exampleSet.getExampleTable().addAttribute(newAtt);&#10;&#9;&#9;&#9;&#9;exampleSet.getAttributes().addRegular(newAtt);&#10;&#9;&#9;&#9;}&#10;&#10;&#9;&#9;&#9;// fill newly added attributes with desired values&#10;&#9;&#9;&#9;if (age &lt; ageArray[i] &amp;&amp; height &gt; heightArray[j]) {&#10;&#9;&#9;&#9;&#9;example.setValue(newAtt, 1);&#10;&#9;&#9;&#9;} else {&#10;&#9;&#9;&#9;&#9;example.setValue(newAtt, 0);&#10;&#9;&#9;&#9;}&#10;&#9;&#9;}&#10;&#9;}&#10;}&#10;&#10;// return input data&#10;return exampleSet;"/>
         </operator>
         <connect from_op="Retrieve CustomFeatureCreationData" from_port="output" to_op="Execute Script" to_port="input 1"/>
         <connect from_op="Execute Script" from_port="output 1" to_port="result 1"/>
         <connect from_op="Execute Script" from_port="output 2" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    Input data:

    Robert Hanson 8.0 120.0
    Dennis Muller 9.0 150.0
    Joe Stevens 9.0 110.0
    Marc Bold 7.0 135.0
    Bill Holmes 8.0 110.0
    Result:

    "Name" "Age" "Height" "BASKETBALL_POTENTIAL_8_130" "BASKETBALL_POTENTIAL_8_135" "BASKETBALL_POTENTIAL_8_140" "BASKETBALL_POTENTIAL_9_130" "BASKETBALL_POTENTIAL_9_135" "BASKETBALL_POTENTIAL_9_140" "BASKETBALL_POTENTIAL_10_130" "BASKETBALL_POTENTIAL_10_135" "BASKETBALL_POTENTIAL_10_140"
    "Robert Hanson" 8.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    "Dennis Muller" 9.0 150.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
    "Joe Stevens" 9.0 110.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    "Marc Bold" 7.0 135.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
    "Bill Holmes" 8.0 110.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    Regards,
    Marco
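The thresholding rule used in the script above can also be checked outside RapidMiner. The following is a minimal plain-Java sketch of that rule only, not of the RapidMiner API: the class name `BasketballFeatures` and the method `computeFeatures` are hypothetical names chosen for illustration, and the threshold arrays are copied from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of RapidMiner.
final class BasketballFeatures {
    // Threshold values taken from the question.
    static final int[] AGE_THRESHOLDS = {8, 9, 10};
    static final int[] HEIGHT_THRESHOLDS = {130, 135, 140};

    // Builds one 0/1 feature per (age threshold, height threshold) pair.
    // A feature is 1 when the student is younger than the age threshold
    // AND taller than the height threshold, otherwise 0.
    static Map<String, Integer> computeFeatures(double age, double height) {
        Map<String, Integer> features = new LinkedHashMap<>();
        for (int ageLimit : AGE_THRESHOLDS) {
            for (int heightLimit : HEIGHT_THRESHOLDS) {
                String name = "BASKETBALL_POTENTIAL_" + ageLimit + "_" + heightLimit;
                int value = (age < ageLimit && height > heightLimit) ? 1 : 0;
                features.put(name, value);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // "Marc Bold" from the input data above: age 7, height 135.
        System.out.println(computeFeatures(7.0, 135.0));
    }
}
```

Both comparisons are strict, so a height of exactly 135 does not pass the 135 threshold. That is why the "Marc Bold" row in the result has a 1 only in the three `..._130` columns.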
  • Ilya Member Posts: 3 Contributor I
    Thank you!
    I'll check this as soon as I get to work.