classification models

bookitsa · January 2019

I have a data set with different variables about a disease and a variable with yes or now about the patient(if he has the disease or not). I have to create two classification models(decision tree and knn) and to make a diagram with a centralized chart of the performance of these individual methods. How I will do it? What is the process I have to follow? I saw the videos but I got confused as a beginner in rapidminer..

lionelderkrikor · January 2019

Hi @bookitsa,

What performance metric do you want to calculate ?
For a binominal problem (like yours), you can use the Compare ROCs operator.
Here a sample process using this operator :

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.2.000-SNAPSHOT" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="136">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Set Role" width="90" x="313" y="136">
        <parameter key="attribute_name" value="Survived"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="compare_rocs" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Compare ROCs" width="90" x="581" y="136">
        <parameter key="number_of_folds" value="10"/>
        <parameter key="split_ratio" value="0.7"/>
        <parameter key="sampling_type" value="stratified sampling"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="use_example_weights" value="true"/>
        <parameter key="roc_bias" value="optimistic"/>
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="k-NN" width="90" x="246" y="85">
            <parameter key="k" value="5"/>
            <parameter key="weighted_vote" value="true"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Decision Tree" width="90" x="246" y="238">
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="10"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.1"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <connect from_port="train 1" to_op="k-NN" to_port="training set"/>
          <connect from_port="train 2" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_port="model 1"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model 2"/>
          <portSpacing port="source_train 1" spacing="0"/>
          <portSpacing port="source_train 2" spacing="0"/>
          <portSpacing port="source_train 3" spacing="0"/>
          <portSpacing port="sink_model 1" spacing="0"/>
          <portSpacing port="sink_model 2" spacing="0"/>
          <portSpacing port="sink_model 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Compare ROCs" to_port="example set"/>
      <connect from_op="Compare ROCs" from_port="exampleSet" to_port="result 2"/>
      <connect from_op="Compare ROCs" from_port="rocComparison" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Hope it helps,

Regards,

Lionel

bookitsa · January 2019

I can not understand the code you write... i suppose that i have to enter the data and somehow to enter the knn and then the decision tree and to measure the accuracy of the results..but i dont know the process...in other words which are the operators i have to use and in what order.

varunm1 · January 2019

Hi @bookitsa

@lionelderkrikor gave perfect example to compare two models. In case you are looking to get the performance indicators like AUC, Kappa etc using Cross-validation (recommended) you can check below code. Here you need to note the performances and check which worked well.

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="9.1.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
        <parameter key="target_function" value="sum classification"/>
        <parameter key="number_examples" value="100"/>
        <parameter key="number_of_attributes" value="5"/>
        <parameter key="attributes_lower_bound" value="-10.0"/>
        <parameter key="attributes_upper_bound" value="10.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="187"/>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="514" y="136">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="5"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="9.1.000" expanded="true" height="82" name="k-NN" width="90" x="112" y="85">
            <parameter key="k" value="5"/>
            <parameter key="weighted_vote" value="true"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
          </operator>
          <connect from_port="training set" to_op="k-NN" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="136">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="true"/>
            <parameter key="AUC (optimistic)" value="false"/>
            <parameter key="AUC" value="true"/>
            <parameter key="AUC (pessimistic)" value="false"/>
            <parameter key="precision" value="false"/>
            <parameter key="recall" value="false"/>
            <parameter key="lift" value="false"/>
            <parameter key="fallout" value="false"/>
            <parameter key="f_measure" value="true"/>
            <parameter key="false_positive" value="false"/>
            <parameter key="false_negative" value="false"/>
            <parameter key="true_positive" value="false"/>
            <parameter key="true_negative" value="false"/>
            <parameter key="sensitivity" value="false"/>
            <parameter key="specificity" value="false"/>
            <parameter key="youden" value="false"/>
            <parameter key="positive_predictive_value" value="false"/>
            <parameter key="negative_predictive_value" value="false"/>
            <parameter key="psep" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation" width="90" x="313" y="34">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="5"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="34">
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="10"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.1"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance" width="90" x="246" y="136">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="true"/>
            <parameter key="AUC (optimistic)" value="false"/>
            <parameter key="AUC" value="true"/>
            <parameter key="AUC (pessimistic)" value="false"/>
            <parameter key="precision" value="false"/>
            <parameter key="recall" value="false"/>
            <parameter key="lift" value="false"/>
            <parameter key="fallout" value="false"/>
            <parameter key="f_measure" value="true"/>
            <parameter key="false_positive" value="false"/>
            <parameter key="false_negative" value="false"/>
            <parameter key="true_positive" value="false"/>
            <parameter key="true_negative" value="false"/>
            <parameter key="sensitivity" value="false"/>
            <parameter key="specificity" value="false"/>
            <parameter key="youden" value="false"/>
            <parameter key="positive_predictive_value" value="false"/>
            <parameter key="negative_predictive_value" value="false"/>
            <parameter key="psep" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Cross Validation (2)" to_port="example set"/>
      <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="result 2"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Thanks,
Varun

lionelderkrikor · January 2019

Hi again @bookitsa,

The "code" you see is the XML of the process. You have to import it in RapidMiner :

The step by step is :

In order to import such an XML description of your process, e.g. to use a process someone else has posted here in the forum, please follow the following steps:

Create a new process and go the the XML panel (see above).
Clear the view and copy the XML code you got into that panel.
Then press the green checkmark icon on top of the panel.
Switch back to the Process panel.

Regards,

Lionel

bookitsa · January 2019

Thank you @lionelderkrikorabout how to insert xml code! I didn't know this!

@varunm1you gave me an operator "generate data". In your code about what data it works? And how i ll made it to work for my data? Where will I insert them?

They dont tell us which method for the diagram to use because we are beginners and obviously they wanted to find one by searching...which diagram to prefer as a beginner?

varunm1 · January 2019

Hi @bookitsa

As we don't have your dataset, I just randomly generated data, In your case you need to delete generate data and import your data into RapidMiner by specifying label column in column attribure and attach your data set to the multiply operator. Multiply operatir just creates a copy of dataset to use for two algorithms which are inside cross validation operator. You can double click cross-validation and see which model is placed inside. You can see below tutorial from RM to see how to import data.

https://www.youtube.com/watch?v=eLR0IiBT76w

bookitsa · January 2019

For accuracy the knn' rate is 63,95 and for decision tree is 94,28. The k for knn is set to 5. The total data are 700records. Which is the appropriate number for k? Because i tried and values bigger from 5 but the rate very little changed. In general decision tree is faster from knn?

varunm1 · January 2019

@bookitsa Based on your dataset decision tree does better job. Appropriate number for k is based on checking different values and see if the accuracy is going up or down. I cannot generalize the results cause it always depends on how the data is? But decision tree has pruning which cuts the unnecessary attributes in the tree for better predictions

bookitsa · January 2019

In this photo we see that decision tree is faster, right? How can i describe it?

varunm1 · January 2019

@bookitsa I am not sure fast is the correct term. But decision tree is accurate. The curves your mentioned are ROC curves. These are plotted based on true and false positive rates. A simple explanation could be given from the confusion matrix which you can see in results where there is 'Accuracy' for each algorithm which looks like in below figure. You can see that there is a true negative and predicted negative '51' this means that 51 outputs in your data which are originally negative are predicted as negative but 2 are predicted as positive (wrong predictions). Similarly for positive labels 5 are predicted as negative (wrong prediction) and 42 are predicted as positive. The curves you mentioned depends on these values of both algorithms. You can search for ROC curves so that you can understand better. Also you can show Decision tree model by connecting 'mod' output to result and explain based on the model as its simple if else.

Image: https://us.v-cdn.net/6030995/uploads/editor/4g/o8q7s0n48rsf.jpg

bookitsa · January 2019

Thank you very much @varunm1 for your help!!
I am looking the two xml codes and i want to see which describes best the difference between the two models:decision tree and knn. I will see the curves googling!

bookitsa · January 2019

In the diagram ROC curves how can we put labels(in other words:names) on the axis?

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

classification models

Answers

Be Safe. Follow precautions and Maintain Social Distancing

Be Safe. Follow precautions and Maintain Social Distancing

Be Safe. Follow precautions and Maintain Social Distancing

Be Safe. Follow precautions and Maintain Social Distancing