
Learning Curve

dK00 Member Posts: 5 Learner I
Hello RapidMiner community,

I would like to compare the learning curves of three models, but I don't know how to do this in RapidMiner. Can anyone explain how I can plot the learning curve for each model?

Much appreciated!

Answers

  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    I would use Optimize Parameters (Grid) for this.
    Connect the incoming data to Optimize Parameters. Inside the Optimize Parameters process, put a Sample operator and configure Optimize Parameters to try different settings of Sample. For example, you could sample 0.05, 0.1, 0.15 and so on from the original data set. After Sample, add a Multiply operator and feed its copies into three Cross Validation operators, one per model. Then use Log to record the performance of each cross validation together with the sampling parameter. You will get a Log output in the Results view that you can visualize, or you can use Log to Data after Optimize Parameters to turn it into a regular data table that you can export.
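
The idea above (sample a growing fraction, cross-validate each model on it, log one row per fraction) can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the data is synthetic, the three "models" (`fit_mean`, `fit_linear`, `fit_1nn`) are hypothetical stand-ins for whatever three learners are being compared, and none of the names below are RapidMiner operators.

```python
import random

def make_data(n, seed=0):
    """Synthetic regression data, y = 3x + 2 + noise (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3 * x + 2 + rng.gauss(0, 1)))
    return rows

def fit_mean(train):
    """Baseline: always predict the mean label of the training rows."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_linear(train):
    """Ordinary least squares with a single feature."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_1nn(train):
    """1-nearest-neighbour regression (linear scan per prediction)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def rmse(model, rows):
    return (sum((model(x) - y) ** 2 for x, y in rows) / len(rows)) ** 0.5

def cross_val_rmse(fit, rows, k=5):
    """Mean RMSE on the held-out fold over k folds."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        scores.append(rmse(fit(train), test))
    return sum(scores) / k

def learning_curve(rows, fractions, models, seed=1):
    """One log row per sample fraction, one column per model."""
    rng = random.Random(seed)
    log = []
    for frac in fractions:
        sample = rng.sample(rows, round(len(rows) * frac))
        entry = {"fraction": frac, "n": len(sample)}
        for name, fit in models.items():
            entry[name] = cross_val_rmse(fit, sample)
        log.append(entry)
    return log

MODELS = {"mean": fit_mean, "linear": fit_linear, "1nn": fit_1nn}
curve = learning_curve(make_data(400), [0.05, 0.1, 0.25, 0.5, 1.0], MODELS)
for row in curve:
    print(row)
```

Each printed row plays the role of one Log entry: plotting a model's column against `n` gives that model's learning curve on held-out data.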

    Regards,
    Balázs

  • dK00 Member Posts: 5 Learner I
    Hello @BalazsBarany,

    Thank you for your response.

    Would the suggested approach generate training and testing curves like those illustrated in the attached picture?
     
  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @dK00,

    The performance returned by a cross validation is measured on the test set, which was not used for building the model. This is the correct way to calculate performance.
    If you want the training performance, you can apply the model to its own input and compute the performance of that result. But in data science we consider that cheating: models should be tested on a test set, not on the training set.

    I would actually expect the validation curve to also get better with more data. Where is this illustration coming from? It's strange. 

    You can generate these curves with varying training samples, but I doubt you will get similar curves.

    Another important aspect for the model performance, especially on the training set, is the model complexity. It is on the X axis in most similar illustrations, which describe how the training performance keeps improving while the test performance gets worse once the point of overfitting has been reached.
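
The complexity effect described above can be illustrated with a small plain-Python sketch. It assumes synthetic data and uses k-nearest-neighbour regression as the example model, where a smaller `k` means a more flexible (more complex) model; all names are hypothetical and nothing here is RapidMiner-specific.

```python
import random

def make_data(n, seed):
    """Synthetic regression data, y = 3x + 2 + noise (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3 * x + 2 + rng.gauss(0, 2)))
    return rows

def knn_predict(train, x, k):
    """Average the labels of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, rows, k):
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in rows]
    return (sum(errors) / len(rows)) ** 0.5

train = make_data(200, seed=1)
test = make_data(200, seed=2)

# Walk from a simple model (large k) to a complex one (k = 1).
curve = []
for k in [50, 20, 10, 5, 2, 1]:
    curve.append({
        "k": k,
        "train_rmse": rmse(train, train, k),  # model applied to its own input
        "test_rmse": rmse(train, test, k),    # model applied to unseen data
    })

for row in curve:
    print(row)
```

At `k = 1` every training point is its own nearest neighbour, so the training error drops to zero while the test error does not. That is why training performance alone says little about how well a model generalizes.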

    Regards,
    Balázs
     


  • earmijo Member Posts: 271 Unicorn
    edited June 2023
    dK00:
    This is what I would do. I had to do it in two steps; someone here more knowledgeable than me can probably do it in one. In Process 1 (not shown below), I split the famous diamonds dataset (from ggplot2) into diamonds1 (80%) and diamonds2 (20%). These are the datasets used in the process below.

    Balázs: the learning curve is a tool to diagnose overfitting (Andrew Ng made it famous). It requires computing both the training error and the test error. When TestError >> TrainingError, this is taken as a sign of overfitting. You can then do two things to fix it: simplify your model or get more data. There used to be an operator in RM to graph learning curves.
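
For readers who want to see the diagnostic in a few lines, here is a minimal plain-Python sketch of the same recipe: hold out 20% as a fixed test set, train on a growing prefix of the remaining 80%, and record both errors. The data is synthetic and the one-feature least-squares model is only a stand-in (the process in this post uses a Generalized Linear Model on the diamonds data); all function names are hypothetical.

```python
import random

def make_data(n, seed=0):
    """Synthetic regression data, y = 3x + 2 + noise with sd 1 (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3 * x + 2 + rng.gauss(0, 1)))
    return rows

def fit_linear(train):
    """One-feature least squares; returns (intercept, slope)."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return my - slope * mx, slope

def rmse(model, rows):
    intercept, slope = model
    errors = [(intercept + slope * x - y) ** 2 for x, y in rows]
    return (sum(errors) / len(rows)) ** 0.5

rows = make_data(1000)
split = len(rows) * 8 // 10              # 80% training pool, 20% fixed test set
train_pool, test = rows[:split], rows[split:]

curve = []
for n in [10, 50, 100, 200, 400, 800]:
    subset = train_pool[:n]              # first n training examples
    model = fit_linear(subset)
    train_err = rmse(model, subset)
    test_err = rmse(model, test)
    curve.append({"n": n, "train_rmse": train_err, "test_rmse": test_err,
                  "gap": test_err - train_err})

for row in curve:
    print(row)
```

A large positive `gap` that persists as `n` grows is the TestError >> TrainingError pattern. For a simple model like this one, both errors should settle near the noise level as `n` increases.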

    Hope this helps. 

    \Ernesto

    P.S. The graph I get for the learning curve is:





     <?xml version="1.0" encoding="UTF-8"?><process version="10.1.001">
      <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="10.1.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="10.1.001" expanded="true" height="68" name="Retrieve diamonds1" width="90" x="112" y="340">
    <parameter key="repository_entry" value="diamonds1"/>
    </operator>
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="10.1.001" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="447" y="340">
    <list key="parameters">
    <parameter key="Filter Example Range.last_example" value="[5000;40000;7;linear]"/>
    </list>
    <parameter key="error_handling" value="fail on error"/>
    <parameter key="log_performance" value="true"/>
    <parameter key="log_all_criteria" value="false"/>
    <parameter key="synchronize" value="false"/>
    <parameter key="enable_parallel_execution" value="true"/>
    <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="10.1.001" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="85">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="200"/>
    <parameter key="invert_filter" value="false"/>
    </operator>
    <operator activated="true" class="h2o:generalized_linear_model" compatibility="10.0.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="380" y="85">
    <parameter key="family" value="AUTO"/>
    <parameter key="link" value="family_default"/>
    <parameter key="solver" value="AUTO"/>
    <parameter key="reproducible" value="false"/>
    <parameter key="maximum_number_of_threads" value="4"/>
    <parameter key="use_regularization" value="false"/>
    <parameter key="lambda_search" value="false"/>
    <parameter key="number_of_lambdas" value="0"/>
    <parameter key="lambda_min_ratio" value="0.0"/>
    <parameter key="early_stopping" value="true"/>
    <parameter key="stopping_rounds" value="3"/>
    <parameter key="stopping_tolerance" value="0.001"/>
    <parameter key="standardize" value="true"/>
    <parameter key="non-negative_coefficients" value="false"/>
    <parameter key="add_intercept" value="true"/>
    <parameter key="compute_p-values" value="false"/>
    <parameter key="remove_collinear_columns" value="false"/>
    <parameter key="missing_values_handling" value="MeanImputation"/>
    <parameter key="max_iterations" value="0"/>
    <parameter key="specify_beta_constraints" value="false"/>
    <list key="beta_constraints"/>
    <parameter key="max_runtime_seconds" value="0"/>
    <list key="expert_parameters"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="10.1.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_regression" compatibility="10.1.001" expanded="true" height="82" name="Performance" width="90" x="782" y="34">
    <parameter key="main_criterion" value="first"/>
    <parameter key="root_mean_squared_error" value="true"/>
    <parameter key="absolute_error" value="false"/>
    <parameter key="relative_error" value="false"/>
    <parameter key="relative_error_lenient" value="false"/>
    <parameter key="relative_error_strict" value="false"/>
    <parameter key="normalized_absolute_error" value="false"/>
    <parameter key="root_relative_squared_error" value="false"/>
    <parameter key="squared_error" value="false"/>
    <parameter key="correlation" value="false"/>
    <parameter key="squared_correlation" value="false"/>
    <parameter key="prediction_average" value="false"/>
    <parameter key="spearman_rho" value="false"/>
    <parameter key="kendall_tau" value="false"/>
    <parameter key="skip_undefined_labels" value="true"/>
    <parameter key="use_example_weights" value="true"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="10.1.001" expanded="true" height="68" name="Retrieve diamonds2" width="90" x="313" y="391">
    <parameter key="repository_entry" value="diamonds2"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="10.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="238">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_regression" compatibility="10.1.001" expanded="true" height="82" name="Performance (2)" width="90" x="782" y="187">
    <parameter key="main_criterion" value="first"/>
    <parameter key="root_mean_squared_error" value="true"/>
    <parameter key="absolute_error" value="false"/>
    <parameter key="relative_error" value="false"/>
    <parameter key="relative_error_lenient" value="false"/>
    <parameter key="relative_error_strict" value="false"/>
    <parameter key="normalized_absolute_error" value="false"/>
    <parameter key="root_relative_squared_error" value="false"/>
    <parameter key="squared_error" value="false"/>
    <parameter key="correlation" value="false"/>
    <parameter key="squared_correlation" value="false"/>
    <parameter key="prediction_average" value="false"/>
    <parameter key="spearman_rho" value="false"/>
    <parameter key="kendall_tau" value="false"/>
    <parameter key="skip_undefined_labels" value="true"/>
    <parameter key="use_example_weights" value="true"/>
    </operator>
    <operator activated="true" class="log" compatibility="10.1.001" expanded="true" height="103" name="Log" width="90" x="1050" y="85">
    <list key="log">
    <parameter key="Training RMSE" value="operator.Performance.value.root_mean_squared_error"/>
    <parameter key="Test RMSE" value="operator.Performance (2).value.root_mean_squared_error"/>
    <parameter key="Iteration Number" value="operator.Optimize Parameters (Grid).value.iteration_number"/>
    </list>
    <parameter key="sorting_type" value="none"/>
    <parameter key="sorting_k" value="100"/>
    <parameter key="persistent" value="false"/>
    </operator>
    <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Generalized Linear Model" to_port="training set"/>
    <connect from_op="Generalized Linear Model" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
    <connect from_op="Retrieve diamonds2" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_op="Log" to_port="through 2"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve diamonds1" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
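
The process above logs only the iteration number, not the sample size. As a rough aid for reading the log, the sketch below lists the values that the grid `[5000;40000;7;linear]` would visit. It assumes that a linear grid includes both endpoints, so 7 steps give 8 values; that is an assumption about the grid semantics, not something stated in this thread. It also does not assume that iterations run in this order, especially with parallel execution enabled.

```python
def linear_grid(lo, hi, steps):
    """Evenly spaced values from lo to hi inclusive (steps + 1 values, by assumption)."""
    return [lo + (hi - lo) * i / steps for i in range(steps + 1)]

# Candidate values for Filter Example Range.last_example
sizes = [round(v) for v in linear_grid(5000, 40000, 7)]
print(sizes)
```

Under that assumption the training sizes increase in steps of 5000 examples.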
  • earmijo Member Posts: 271 Unicorn
    Ok. I got it in one step using the Remember/Recall operators. 

    <?xml version="1.0" encoding="UTF-8"?><process version="10.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="10.1.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="10.1.001" expanded="true" height="68" name="Retrieve diamonds" width="90" x="112" y="391">
    <parameter key="repository_entry" value="diamonds"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="10.1.001" expanded="true" height="103" name="Split Data" width="90" x="380" y="391">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.8"/>
    <parameter key="ratio" value="0.2"/>
    </enumeration>
    <parameter key="sampling_type" value="automatic"/>
    <parameter key="use_local_random_seed" value="false"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    <operator activated="true" class="remember" compatibility="10.1.001" expanded="true" height="68" name="Remember" width="90" x="581" y="595">
    <parameter key="name" value="TestSet"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="store_which" value="1"/>
    <parameter key="remove_from_process" value="true"/>
    </operator>
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="10.1.001" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="715" y="85">
    <list key="parameters">
    <parameter key="Filter Example Range.last_example" value="[5000;40000;7;linear]"/>
    </list>
    <parameter key="error_handling" value="fail on error"/>
    <parameter key="log_performance" value="true"/>
    <parameter key="log_all_criteria" value="true"/>
    <parameter key="synchronize" value="false"/>
    <parameter key="enable_parallel_execution" value="true"/>
    <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="10.1.001" expanded="true" height="82" name="Filter Example Range" width="90" x="112" y="85">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="5000"/>
    <parameter key="invert_filter" value="false"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="10.1.001" expanded="true" height="68" name="Extract Macro" width="90" x="246" y="85">
    <parameter key="macro" value="SampleSize"/>
    <parameter key="macro_type" value="number_of_examples"/>
    <parameter key="statistics" value="average"/>
    <parameter key="attribute_name" value=""/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="h2o:generalized_linear_model" compatibility="10.0.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="380" y="85">
    <parameter key="family" value="gaussian"/>
    <parameter key="link" value="identity"/>
    <parameter key="solver" value="AUTO"/>
    <parameter key="reproducible" value="false"/>
    <parameter key="maximum_number_of_threads" value="4"/>
    <parameter key="use_regularization" value="false"/>
    <parameter key="lambda_search" value="false"/>
    <parameter key="number_of_lambdas" value="0"/>
    <parameter key="lambda_min_ratio" value="0.0"/>
    <parameter key="early_stopping" value="true"/>
    <parameter key="stopping_rounds" value="3"/>
    <parameter key="stopping_tolerance" value="0.001"/>
    <parameter key="standardize" value="true"/>
    <parameter key="non-negative_coefficients" value="false"/>
    <parameter key="add_intercept" value="true"/>
    <parameter key="compute_p-values" value="false"/>
    <parameter key="remove_collinear_columns" value="false"/>
    <parameter key="missing_values_handling" value="MeanImputation"/>
    <parameter key="max_iterations" value="0"/>
    <parameter key="specify_beta_constraints" value="false"/>
    <list key="beta_constraints"/>
    <parameter key="max_runtime_seconds" value="0"/>
    <list key="expert_parameters"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="10.1.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_regression" compatibility="10.1.001" expanded="true" height="82" name="Performance" width="90" x="782" y="34">
    <parameter key="main_criterion" value="first"/>
    <parameter key="root_mean_squared_error" value="true"/>
    <parameter key="absolute_error" value="false"/>
    <parameter key="relative_error" value="false"/>
    <parameter key="relative_error_lenient" value="false"/>
    <parameter key="relative_error_strict" value="false"/>
    <parameter key="normalized_absolute_error" value="false"/>
    <parameter key="root_relative_squared_error" value="false"/>
    <parameter key="squared_error" value="false"/>
    <parameter key="correlation" value="false"/>
    <parameter key="squared_correlation" value="false"/>
    <parameter key="prediction_average" value="false"/>
    <parameter key="spearman_rho" value="false"/>
    <parameter key="kendall_tau" value="false"/>
    <parameter key="skip_undefined_labels" value="true"/>
    <parameter key="use_example_weights" value="true"/>
    </operator>
    <operator activated="true" class="recall" compatibility="10.1.001" expanded="true" height="68" name="Recall" width="90" x="246" y="340">
    <parameter key="name" value="TestSet"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_store" value="false"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="10.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="238">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_regression" compatibility="10.1.001" expanded="true" height="82" name="Performance (2)" width="90" x="782" y="187">
    <parameter key="main_criterion" value="first"/>
    <parameter key="root_mean_squared_error" value="true"/>
    <parameter key="absolute_error" value="false"/>
    <parameter key="relative_error" value="false"/>
    <parameter key="relative_error_lenient" value="false"/>
    <parameter key="relative_error_strict" value="false"/>
    <parameter key="normalized_absolute_error" value="false"/>
    <parameter key="root_relative_squared_error" value="false"/>
    <parameter key="squared_error" value="false"/>
    <parameter key="correlation" value="false"/>
    <parameter key="squared_correlation" value="false"/>
    <parameter key="prediction_average" value="false"/>
    <parameter key="spearman_rho" value="false"/>
    <parameter key="kendall_tau" value="false"/>
    <parameter key="skip_undefined_labels" value="true"/>
    <parameter key="use_example_weights" value="true"/>
    </operator>
    <operator activated="true" class="log" compatibility="10.1.001" expanded="true" height="103" name="Log" width="90" x="1050" y="85">
    <list key="log">
    <parameter key="Training RMSE" value="operator.Performance.value.root_mean_squared_error"/>
    <parameter key="Test RMSE" value="operator.Performance (2).value.root_mean_squared_error"/>
    <parameter key="Iteration" value="operator.Optimize Parameters (Grid).value.iteration_number"/>
    <parameter key="Sample Size" value="operator.Extract Macro.value.macro_value"/>
    </list>
    <parameter key="sorting_type" value="none"/>
    <parameter key="sorting_k" value="100"/>
    <parameter key="persistent" value="false"/>
    </operator>
    <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Generalized Linear Model" to_port="training set"/>
    <connect from_op="Generalized Linear Model" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
    <connect from_op="Recall" from_port="result" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_op="Log" to_port="through 2"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve diamonds" from_port="output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Remember" to_port="store"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>