Test existing model on a different dataset

viktorvanbeerse · June 2017

Hi,

I have two very similar datasets (same attributes and label), but one of them is incomplete. The assignment is to develop a predictive model (Decision Tree and Logistic Regression) on the "incomplete" data and to validate it on the other dataset. In other words, the "incomplete" dataset is the training set and the "complete" one is the test set. Does anybody know whether this can be modeled by means of cross-validation/performance?

Thank you in advance

Viktor

Thomas_Ott · June 2017

This sounds backwards. A classification model has to be trained on data with a label, which means you already have some 'truth' in a historical data set. For example, you have a training data set that has labels for churn and loyal. You train on that and then use the "incomplete" data set as your scoring set, for which the model will autogenerate the predictions.

FBT · June 2017

Hi,

not sure what you mean by incomplete data, but assuming it means that some attributes have missing values, it should be straightforward, as long as your training data has sufficient values for the desired labels. See if the sample process below does what you want.

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.5.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.5.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
        <parameter key="attribute_name" value="Survived"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.5.000" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Survived|Age|Passenger Class|Sex|Passenger Fare|No of Siblings or Spouses on Board|No of Parents or Children on Board"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="7.5.000" expanded="true" height="145" name="Cross Validation" width="90" x="447" y="34">
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_random_forest" compatibility="7.5.000" expanded="true" height="82" name="Random Forest" width="90" x="112" y="34"/>
          <connect from_port="training set" to_op="Random Forest" to_port="training set"/>
          <connect from_op="Random Forest" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.5.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.5.000" expanded="true" height="82" name="Performance - &quot;Incomplete&quot;" width="90" x="179" y="85"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance - &quot;Incomplete&quot;" to_port="labelled data"/>
          <connect from_op="Performance - &quot;Incomplete&quot;" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.5.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="313" y="238">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.5.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="187">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance" compatibility="7.5.000" expanded="true" height="82" name="Performance - &quot;Complete&quot;" width="90" x="715" y="187"/>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance - &quot;Complete&quot;" to_port="labelled data"/>
      <connect from_op="Performance - &quot;Complete&quot;" from_port="performance" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
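
The sample process above trains a Random Forest, while the question asked for a Decision Tree and Logistic Regression. A minimal sketch of the swap inside the training subprocess of Cross Validation follows. It assumes the operator key `concurrency:parallel_decision_tree` is available in your 7.5 installation; the key `h2o:logistic_regression` mentioned in the comment is likewise an assumption to check against your own operator list.

```xml
<!-- Sketch only: replaces the Random Forest operator and its two connections
     in the first <process> of Cross Validation. Operator keys are assumptions;
     for Logistic Regression, class="h2o:logistic_regression" would be the
     analogous key to try. -->
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.5.000" expanded="true" height="82" name="Decision Tree" width="90" x="112" y="34"/>
<connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
```

The rest of the process can stay as it is: the model port of Cross Validation still feeds Apply Model (2), which scores the second dataset.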

You may need to do some pre-processing though, depending on the learning algorithm you choose.
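
If "incomplete" does mean missing values, one possible pre-processing step is a Replace Missing Values operator between Select Attributes and Cross Validation. This is a sketch under that assumption, not a recommendation for every learner; the averaging strategy and the layout coordinates are illustrative choices.

```xml
<!-- Sketch only: fills missing values with the attribute average before training.
     The two connections replace the direct Select Attributes -> Cross Validation
     connection of the sample process. -->
<operator activated="true" class="replace_missing_values" compatibility="7.5.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="313" y="136">
  <parameter key="default" value="average"/>
  <list key="columns"/>
</operator>
<connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
```

If the second dataset also contains missing values, it would need the same treatment before Apply Model (2), for example by applying the preprocessing model that Replace Missing Values produces.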
