ensemble learning

Thiru · February 2020

I tried ensemble learning on a data set (using k-NN, Decision Tree, and Naïve Bayes). I'm not able to see any improvement in accuracy, precision, or recall.

1. Is there any way to view the performance of the individual models while using ensemble learning, so that I can compare the individual models and the ensemble in a single process?

2. When I use the Generate Weight option, there is a warning for the k-NN sub-model saying "input example set has example weights, but learner will ignore them". Even though the learner is said to ignore the weights, the accuracy still drops. Why does this warning appear only for k-NN, and what is its impact?

thanks
thiru

BalazsBarany · February 2020

Hi @Thiru,

you can run a Cross Validation for each base learner inside the ensemble operator (Vote in the example below). You can use the Remember operator to store the individual performance results, or even store them in the repository.

Here's an example process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.5.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="-1"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.5.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="generate_weight_stratification" compatibility="9.5.001" expanded="true" height="82" name="Generate Weight (Stratification)" width="90" x="246" y="34">
<parameter key="total_weight" value="1.0"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="9.5.001" expanded="true" height="145" name="Validation" width="90" x="380" y="34">
<parameter key="split_on_batch_attribute" value="false"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_folds" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="true" class="vote" compatibility="9.5.001" expanded="true" height="68" name="Vote" width="90" x="112" y="34">
<process expanded="true">
<operator activated="true" class="concurrency:cross_validation" compatibility="9.5.001" expanded="true" height="145" name="Validation DT" width="90" x="246" y="34">
<parameter key="split_on_batch_attribute" value="false"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_folds" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.5.001" expanded="true" height="103" name="Decision Tree" width="90" x="45" y="34">
<parameter key="criterion" value="gain_ratio"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.1"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.01"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<description align="left" color="green" colored="true" height="80" resized="false" width="248" x="37" y="158">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="9.5.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance" compatibility="9.5.001" expanded="true" height="82" name="Performance (DT)" width="90" x="179" y="34">
<parameter key="use_example_weights" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance (DT)" to_port="labelled data"/>
<connect from_op="Performance (DT)" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance (DT)" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="false" width="315" x="38" y="158">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
</process>
</operator>
<operator activated="true" class="remember" compatibility="9.5.001" expanded="true" height="68" name="Remember DTResults" width="90" x="447" y="85">
<parameter key="name" value="DTResults"/>
<parameter key="io_object" value="PerformanceVector"/>
<parameter key="store_which" value="1"/>
<parameter key="remove_from_process" value="true"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="9.5.001" expanded="true" height="145" name="Validation k-NN" width="90" x="112" y="187">
<parameter key="split_on_batch_attribute" value="false"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_folds" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="9.5.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
<parameter key="k" value="5"/>
<parameter key="weighted_vote" value="true"/>
<parameter key="measure_types" value="MixedMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<connect from_port="training set" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<description align="left" color="green" colored="true" height="80" resized="false" width="248" x="37" y="158">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="9.5.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="45" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance" compatibility="9.5.001" expanded="true" height="82" name="Performance (k-NN)" width="90" x="179" y="34">
<parameter key="use_example_weights" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (k-NN)" to_port="labelled data"/>
<connect from_op="Performance (k-NN)" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance (k-NN)" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="false" width="315" x="38" y="158">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
</process>
</operator>
<operator activated="true" class="remember" compatibility="9.5.001" expanded="true" height="68" name="Remember k-NN results" width="90" x="447" y="238">
<parameter key="name" value="kNNResults"/>
<parameter key="io_object" value="PerformanceVector"/>
<parameter key="store_which" value="1"/>
<parameter key="remove_from_process" value="true"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="9.5.001" expanded="true" height="145" name="Validation Naive Bayes" width="90" x="246" y="340">
<parameter key="split_on_batch_attribute" value="false"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_folds" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="true" class="naive_bayes" compatibility="9.5.001" expanded="true" height="82" name="Naive Bayes" width="90" x="112" y="34">
<parameter key="laplace_correction" value="true"/>
</operator>
<connect from_port="training set" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<description align="left" color="green" colored="true" height="80" resized="false" width="248" x="37" y="158">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="9.5.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance" compatibility="9.5.001" expanded="true" height="82" name="Performance (Naive Bayes)" width="90" x="179" y="34">
<parameter key="use_example_weights" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
<connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (Naive Bayes)" to_port="labelled data"/>
<connect from_op="Performance (Naive Bayes)" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance (Naive Bayes)" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="false" width="315" x="38" y="158">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
</process>
</operator>
<operator activated="true" class="remember" compatibility="9.5.001" expanded="true" height="68" name="Remember Bayes results" width="90" x="447" y="391">
<parameter key="name" value="NaiveBayesResults"/>
<parameter key="io_object" value="PerformanceVector"/>
<parameter key="store_which" value="1"/>
<parameter key="remove_from_process" value="true"/>
</operator>
<connect from_port="training set 1" to_op="Validation DT" to_port="example set"/>
<connect from_port="training set 2" to_op="Validation k-NN" to_port="example set"/>
<connect from_port="training set 3" to_op="Validation Naive Bayes" to_port="example set"/>
<connect from_op="Validation DT" from_port="model" to_port="base model 1"/>
<connect from_op="Validation DT" from_port="performance 1" to_op="Remember DTResults" to_port="store"/>
<connect from_op="Validation k-NN" from_port="model" to_port="base model 2"/>
<connect from_op="Validation k-NN" from_port="performance 1" to_op="Remember k-NN results" to_port="store"/>
<connect from_op="Validation Naive Bayes" from_port="model" to_port="base model 3"/>
<connect from_op="Validation Naive Bayes" from_port="performance 1" to_op="Remember Bayes results" to_port="store"/>
<portSpacing port="source_training set 1" spacing="0"/>
<portSpacing port="source_training set 2" spacing="0"/>
<portSpacing port="source_training set 3" spacing="0"/>
<portSpacing port="source_training set 4" spacing="0"/>
<portSpacing port="sink_base model 1" spacing="0"/>
<portSpacing port="sink_base model 2" spacing="0"/>
<portSpacing port="sink_base model 3" spacing="0"/>
</process>
</operator>
<connect from_port="training set" to_op="Vote" to_port="training set"/>
<connect from_op="Vote" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="158">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="9.5.001" expanded="true" height="82" name="Apply Model (4)" width="90" x="45" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance" compatibility="9.5.001" expanded="true" height="82" name="Performance (Vote)" width="90" x="179" y="34">
<parameter key="use_example_weights" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
<connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (Vote)" to_port="labelled data"/>
<connect from_op="Performance (Vote)" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance (Vote)" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="158">The model created in the Training step is applied to the current test set (10 %).<br/>The performance is evaluated and sent to the operator results.</description>
</process>
</operator>
<operator activated="true" class="recall" compatibility="9.5.001" expanded="true" height="68" name="Recall Decision Tree" width="90" x="581" y="136">
<parameter key="name" value="DTResults"/>
<parameter key="io_object" value="PerformanceVector"/>
<parameter key="remove_from_store" value="true"/>
</operator>
<operator activated="true" class="recall" compatibility="9.5.001" expanded="true" height="68" name="Recall k-NN" width="90" x="648" y="238">
<parameter key="name" value="kNNResults"/>
<parameter key="io_object" value="PerformanceVector"/>
<parameter key="remove_from_store" value="true"/>
</operator>
<operator activated="true" class="recall" compatibility="9.5.001" expanded="true" height="68" name="Recall Naive Bayes" width="90" x="715" y="340">
<parameter key="name" value="NaiveBayesResults"/>
<parameter key="io_object" value="PerformanceVector"/>
<parameter key="remove_from_store" value="true"/>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Generate Weight (Stratification)" to_port="example set input"/>
<connect from_op="Generate Weight (Stratification)" from_port="example set output" to_op="Validation" to_port="example set"/>
<connect from_op="Validation" from_port="model" to_port="result 1"/>
<connect from_op="Validation" from_port="performance 1" to_port="result 2"/>
<connect from_op="Recall Decision Tree" from_port="result" to_port="result 3"/>
<connect from_op="Recall k-NN" from_port="result" to_port="result 4"/>
<connect from_op="Recall Naive Bayes" from_port="result" to_port="result 5"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="42"/>
<portSpacing port="sink_result 4" spacing="105"/>
<portSpacing port="sink_result 5" spacing="63"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>
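
For anyone who wants to see the idea outside of RapidMiner, here is a minimal Python sketch of the same comparison: score each base model and the majority vote against the same test labels. The predictions are made-up toy values, and the helper names (`accuracy`, `majority_vote`) are hypothetical, not part of any RapidMiner API.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def majority_vote(*model_predictions):
    """Combine per-model predictions by taking the most common label per example."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]

# Toy data, assumed for illustration only.
labels = ["a", "a", "b", "b", "a", "b"]
base_predictions = {
    "decision_tree": ["a", "a", "b", "a", "a", "b"],
    "k_nn":          ["a", "b", "b", "b", "a", "b"],
    "naive_bayes":   ["a", "a", "a", "b", "a", "b"],
}

# One table with every base model plus the ensemble.
results = {name: accuracy(preds, labels) for name, preds in base_predictions.items()}
results["vote"] = accuracy(majority_vote(*base_predictions.values()), labels)

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

In this toy case the vote beats every base model because each model is wrong on a different example. If the base models make the same mistakes, voting cannot improve on them, which is one possible reason for seeing no gain from an ensemble.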

The warning is somewhat unexpected, as k-NN is used to explain example weighting in the Academy:
https://academy.rapidminer.com/learn/video/sampling-weighting-intro
@jmergler @Knut-RM

The impact is: if you use example weighting to make some examples more important (e.g. the minority class, or customers with high revenue), you expect the models to put more effort into predicting these examples correctly. If the model is already good and predicts these examples correctly anyway, you won't see a big impact.
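
Weights can also change the reported numbers themselves when the performance measure is weighted (the example process above sets `use_example_weights` to true on the Performance operators). The following is a rough Python sketch of that effect, not RapidMiner's actual computation: the function name `weighted_accuracy` and the toy data are assumptions, and the stratified weights are simply chosen so that each class carries the same total weight.

```python
def weighted_accuracy(predictions, labels, weights):
    """Share of the total example weight that was predicted correctly."""
    total = sum(weights)
    correct = sum(w for p, y, w in zip(predictions, labels, weights) if p == y)
    return correct / total

# Toy data, assumed for illustration: four majority examples, one minority example.
labels      = ["major", "major", "major", "major", "minor"]
predictions = ["major", "major", "major", "major", "major"]  # misses the minority example

uniform = [1.0] * 5
# Stratified-style weights: each class gets the same total weight (0.5 here).
stratified = [0.125, 0.125, 0.125, 0.125, 0.5]

plain_acc = weighted_accuracy(predictions, labels, uniform)     # 0.8
strat_acc = weighted_accuracy(predictions, labels, stratified)  # 0.5

print(plain_acc, strat_acc)
```

The predictions are identical in both cases; only the weighting of the single minority-class error changes the score. So a lower accuracy after adding weights does not necessarily mean the model got worse.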

@mbs: No, Group Models won't help here. The output of the first model is a model. If you put a second model into it, it will complain because it expects an Example Set as its input.

Regards,
Balázs

[Deleted User] · February 2020

@Thiru

Hello

You can use the Group Models operator. This operator groups the given models into a single combined model. When this combined model is applied, it is equivalent to applying the original models in their respective order.

You can also look at this link:

https://community.rapidminer.com/discussion/comment/61377#Comment_61377

All the best
mbs

[Deleted User] · February 2020

@BalazsBarany

Hello

Thank you for your help

That was a problem with the documentation:
https://docs.rapidminer.com/latest/studio/operators/rapidminer-studio-operator-reference.pdf

@sgenzer
please look at the Group Models section in the PDF.
Regards
mbs

jmergler · February 2020

Great discussion! Thanks @BalazsBarany for calling my attention to it. Yes, k-NN has its own variable weighting scheme, and example weighting will be ignored by that operator; if you are using an ensemble method like voting, the other learners will still use the example weighting. However, in the current training, the lecture portion shows k-NN with example weighting, while the demo process we have uses a decision tree, which does handle weighted examples. We'll work to improve the lecture!
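
To make "ignoring example weights" concrete, here is a small Python sketch. It is not RapidMiner's k-NN implementation; the `vote` helper and the numbers are hypothetical. It only contrasts a learner that counts every training example once with one that sums the example weights.

```python
from collections import defaultdict

def vote(labels, weights=None):
    """Return the label with the largest total weight (unit weights if none are given)."""
    if weights is None:
        weights = [1.0] * len(labels)
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

# Toy neighbourhood, assumed for illustration: two majority examples, one minority example.
neighbour_labels = ["major", "major", "minor"]
example_weights = [0.1, 0.1, 0.5]

ignoring = vote(neighbour_labels)                      # weights are never looked at
respecting = vote(neighbour_labels, example_weights)   # weights decide the outcome

print(ignoring, respecting)
```

The weight-ignoring vote returns "major" and the weight-aware vote returns "minor", so the same weighted example set can lead to different predictions depending on whether the learner supports weights.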
