How can I make training data with multiple labels?

Prentice · April 2019

Hello,

I got a few questions following my post from earlier (https://community.rapidminer.com/discussion/55414/how-can-i-classify-one-example-into-multiple-classes-if-necessary#latest)
I have training data for my model where some examples have multiple labels, how do I put these in RapidMiner?

Say I have this example: I like cats and dogs, the labels are cat and dog.
Do I put them as separate examples? I like cats and dogs -> cat, I like cats and dogs -> dog
Or do I need to make a second label attribute for this? I like cats and dogs -> label1 cat -> label2 dog.

I've also managed to make the following model that has a second prediction when the confidence is lower than 0.7. But I actually want it to make a second prediction more accurately. Is this possible, that the model knows when an example probably has one label or two? Or do I just have to make it around a margin for instance 0.4-0.6?

My final question is how can I make the generate aggregation variable instead of having to select the subset. I know it's possible with regular expression, but I can't figure out the syntax. It just needs to select all the attributes with "confidence" in it.

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="9.2.001" expanded="true" height="103" name="Split Data" width="90" x="179" y="85">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.66"/>
          <parameter key="ratio" value="0.34"/>
        </enumeration>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="naive_bayes" compatibility="9.2.001" expanded="true" height="82" name="Naive Bayes" width="90" x="313" y="34">
        <parameter key="laplace_correction" value="true"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model" width="90" x="313" y="136">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="generate_aggregation" compatibility="9.2.001" expanded="true" height="82" name="Generate Aggregation (3)" width="90" x="447" y="238">
        <parameter key="attribute_name" value="Maximum"/>
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="confidence(Iris-setosa)|confidence(Iris-versicolor)|confidence(Iris-virginica)"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="aggregation_function" value="maximum"/>
        <parameter key="concatenation_separator" value="|"/>
        <parameter key="keep_all" value="true"/>
        <parameter key="ignore_missings" value="true"/>
        <parameter key="ignore_missing_attributes" value="false"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="136">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Maximum.lt.0\.7"/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="generate_prediction_ranking" compatibility="9.2.001" expanded="true" height="82" name="Generate Prediction Ranking" width="90" x="581" y="34">
        <parameter key="number_of_ranks" value="2"/>
        <parameter key="remove_old_predictions" value="true"/>
      </operator>
      <operator activated="true" class="rename_by_replacing" compatibility="9.2.001" expanded="true" height="82" name="Rename by Replacing" width="90" x="715" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="prediction(label)_1"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="replace_what" value="_1"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role" width="90" x="849" y="34">
        <parameter key="attribute_name" value="prediction(label)"/>
        <parameter key="target_role" value="prediction"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="union" compatibility="9.2.001" expanded="true" height="82" name="Union" width="90" x="849" y="187"/>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Generate Aggregation (3)" to_port="example set input"/>
      <connect from_op="Generate Aggregation (3)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Generate Prediction Ranking" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="unmatched example set" to_op="Union" to_port="example set 2"/>
      <connect from_op="Generate Prediction Ranking" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
      <connect from_op="Rename by Replacing" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Union" to_port="example set 1"/>
      <connect from_op="Union" from_port="union" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Thanks a lot

-Prentice

Telcontar120 · April 2019

Independent means the multiple classifications outcomes can happen separately without one affecting the other. In the example you gave, whether a bike has a flat tire does not affect whether the chain is worn or vice versa; a given bike could have one or the other or neither or both. So they are independent and should be modeled as two separate events. If you put them all in one label, then you have to pick which one to use if a bike has both a flat tire and a worn chain, and that is probably not what you want.
So to the extent this example is analogous to your real data, you probably want to create separate binominal labels for each condition, and then build separate predictive models for each. But in that case, I don't see why you would be interested in combining scores across those models, since the events are not related (and as shown above could happen together or not).

Prentice · April 2019

Aha, so if I understand correctly, in the example I gave I need to separate the categories in the example. So it would be:

Example: The bike has been visually inspected. Found a flat tire Label: Flat tire
Example: The bike has been visually inspected. Found the chain is worn. Label: Worn chain

And then when I get the whole example:
The bike has been visually inspected. Found a flat tire and found the chain is worn

I should just use "Generate Prediction Ranking" to get the second prediction of the example.

I think this would work!

Telcontar120 · April 2019

You have a couple of options. You could have a polynominal label that has more than two possible classes: so for example, a single label that could be either {cat,dog,rabbit,bear,etc}. This would allow you to predict the one most likely value from the full set.
Alternatively, if your labels are all truly independent, then you should keep multiple labels separate and then build individual models for each label. This will require you to do some looping since you cannot have more than one label designated in your dataset at the same time. But if these labels are truly independent, then you probably don't want to try to combine those scores in any way at the end anyways because what would that idea really represent?

Prentice · April 2019

What do you exactly mean with independent?

Let's assume this example, which looks a lot like my example set:

The bike has been visually inspected. Found a flat tire and found the chain is worn.

The label of this is the failure mode, in this case, there are two failure modes: Flat tire and Worn chain.
But as you said I can put this as a polynominal label, but how do I write this?

I'm exporting my data from Excel. So do I have to put this in one cell under one column, if yes, how?

Telcontar120 · April 2019

Yes, that should be fine.

Prentice · April 2019

Ok thanks!!

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

How can I make training data with multiple labels?

Best Answers

Answers