Special role attribute in applying model
Hello,
I notice that the Apply Model operator requires the attributes in the unlabelled dataset to correspond exactly to the attributes in the training dataset used to build the model, even if those attributes have a customized special role. I would have expected that only regular attributes have to match (plus possibly weight attributes), since in my understanding modelling works only with regular attributes, the label attribute and possibly some weight attributes. I would like to know whether this is correct, or whether attributes with a customized role affect modelling. If they don't, why does the Apply Model operator request them? Thanks in advance for your hints!
Best Answer
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hi @Gottfried,
Yes, roles are part of the function signature the model expects. No, there is no way to get around this in Apply Model. However, I can give you a tip here.
- Don't select the values out.
- If you have an ID role, it will pass. If you don't, you can use Generate ID to generate one.
- Once you Apply Model, you can join back using the id attribute as key.
Please, see attached.
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.0.002" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.0.002" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Age|Sex|Survived|Passenger Class"/>
</operator>
<operator activated="true" class="h2o:deep_learning" compatibility="9.0.000" expanded="true" height="82" name="Deep Learning" width="90" x="313" y="34">
<parameter key="activation" value="Maxout"/>
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="50"/>
<parameter key="hidden_layer_sizes" value="50"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<operator activated="true" class="retrieve" compatibility="9.0.002" expanded="true" height="68" name="Retrieve Titanic Unlabeled" width="90" x="45" y="187">
<parameter key="repository_entry" value="//Samples/data/Titanic Unlabeled"/>
</operator>
<operator activated="true" class="generate_id" compatibility="9.0.002" expanded="true" height="82" name="Generate ID" width="90" x="179" y="187">
<parameter key="create_nominal_ids" value="true"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.0.002" expanded="true" height="103" name="Multiply" width="90" x="313" y="187"/>
<operator activated="true" class="apply_model" compatibility="9.0.002" expanded="true" height="82" name="Apply Model" width="90" x="447" y="85">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="9.0.002" expanded="true" height="82" name="Join" width="90" x="581" y="187">
<parameter key="use_id_attribute_as_key" value="true"/>
<list key="key_attributes"/>
</operator>
<connect from_op="Retrieve Titanic Training" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Deep Learning" to_port="training set"/>
<connect from_op="Deep Learning" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve Titanic Unlabeled" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="147"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The process above shows what I usually do when I need to score using only a few attributes but want to keep all of them in the result. Notice that I'm selecting attributes only for training the Deep Learning operator, generating the IDs because the Titanic dataset doesn't have any (but you should double-check whether your dataset already has an ID), and then joining the scored results with the rest of the table through these IDs.
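For readers who think more easily in code, here is a minimal Python sketch of the same generate-ID, score, join-back pattern. It is only an illustration: plain dictionaries stand in for the example set, `toy_model` is a made-up placeholder rather than the Deep Learning model, the sample rows are invented, and none of the function names are RapidMiner APIs.

```python
# Hypothetical stand-ins: plain dicts instead of a RapidMiner example set,
# and a toy rule instead of a trained model. The rows below are invented.

MODEL_ATTRIBUTES = ["Age", "Sex", "Passenger Class"]  # attributes used for training


def generate_id(rows):
    """Mimic Generate ID: attach a unique nominal 'id' to every row."""
    return [{**row, "id": f"id_{i}"} for i, row in enumerate(rows, start=1)]


def toy_model(features):
    """Placeholder for a trained model; not a real survival predictor."""
    return "Yes" if features["Sex"] == "Female" else "No"


def apply_model(rows):
    """Score using only the model attributes, but carry the id along."""
    scored = []
    for row in rows:
        features = {name: row[name] for name in MODEL_ATTRIBUTES}
        scored.append({"id": row["id"], "prediction": toy_model(features)})
    return scored


def join_on_id(left, right):
    """Mimic an inner join that uses the id attribute as key."""
    right_by_id = {row["id"]: row for row in right}
    return [
        {**right_by_id[row["id"]], **row}
        for row in left
        if row["id"] in right_by_id
    ]


unlabeled = [
    {"Name": "A. Example", "Age": 29, "Sex": "Female", "Passenger Class": "First"},
    {"Name": "B. Example", "Age": 40, "Sex": "Male", "Passenger Class": "Third"},
]

with_ids = generate_id(unlabeled)       # Generate ID
scored = apply_model(with_ids)          # Apply Model on one copy of the data
result = join_on_id(scored, with_ids)   # Join back with the full table
```

Each row in `result` keeps every original attribute (including `Name`, which the toy model never saw) plus the prediction, which is the point of joining on the ID.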
Hope this clarifies it!
Answers
Hello @Gottfried,
The Apply Model operator requires that your unlabelled data has the same function signature as the labelled data, with the exception of the label. If you trained your model with an id, three regular attributes and a weight attribute, the algorithm would ignore the id and any attributes that aren't part of the model, match the regular attributes, and use the weight attribute if the training algorithm takes it as input.
Since Apply Model has no logic to know what each model requires unless those requirements are passed as parameters (and I have found no evidence that they are), it is much easier to ask for the same function signature (same names, types and roles), no matter which algorithm you are using. Extra attributes are simply ignored and attached to the resulting data.
A function signature in this context is what a function takes as input and what it returns as output. Let's take an example from the C programming language:
/* Takes two integers and returns their sum as an integer. */
int sum(int a, int b) {
    return a + b;
}
The signature of this function is always two integers, a and b, as input and an integer as output. The language, however, won't guarantee a sensible result if you pass a floating point number or a string, so the call will fail or misbehave. Basically, Apply Model works the same way: when you train a model you generate a function that takes a certain data structure, and when you apply that model, you are applying that function to the same data structure (with different values).
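To make the analogy a bit more concrete, here is a small Python sketch of what such a signature check could look like. This is not how Apply Model is implemented (I have not seen its code). The representation of a signature as a mapping from attribute name to (value type, role), the function name and the sample signatures are all assumptions made for illustration.

```python
# Illustrative only: a "signature" here is a mapping
# attribute name -> (value type, role). Not RapidMiner's actual logic.

def signature_mismatches(training_signature, scoring_signature):
    """List attributes the model expects that are missing or differ in the scoring data."""
    problems = []
    for name, (value_type, role) in training_signature.items():
        if role == "label":
            continue  # the label is the one exception: unlabelled data lacks it
        if name not in scoring_signature:
            problems.append(f"missing attribute: {name}")
        elif scoring_signature[name] != (value_type, role):
            problems.append(f"type or role differs: {name}")
    return problems


training = {
    "Age": ("numeric", "regular"),
    "Sex": ("nominal", "regular"),
    "Survived": ("nominal", "label"),
}

scoring_ok = {
    "Age": ("numeric", "regular"),
    "Sex": ("nominal", "regular"),
    "Name": ("nominal", "regular"),  # extra attribute, not checked
}

scoring_bad = {
    "Age": ("nominal", "regular"),  # same name, wrong type; "Sex" is missing
}

print(signature_mismatches(training, scoring_ok))
print(signature_mismatches(training, scoring_bad))
```

The check only walks over what the training data declared, which mirrors the behaviour described above: extra attributes in the scoring data do no harm, while a missing attribute or a changed type or role is reported.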
Hope this helps,
Thanks @rfuentealba,
That was actually the trick I meant: selecting (attributes, not values) before and joining back (yes, using Id) afterwards. So be it, then. Thanks for your help!
Awesome @Gottfried, glad it helped!
Have fun!