performance by ID3
Hello
I am having trouble with the performance of a model on my dataset. The data are real; I collected them myself.
When I build a model with an ID3 tree, the accuracy is almost 95%, but when I use 10-fold cross validation, the accuracy drops to between 50 and 60 percent.
Is there a problem with my data or my process? Which accuracy should I report in my project?
The subject of my project is discovering the relation between student performance and personality characteristics using decision trees and regression.
Also, how can I find the best attributes for modelling this relation with ID3, and which operators should I use? Can I use Select Attributes?
Here are my model and validation processes.
Model:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.2.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="C:\Users\yasamin\Documents\dataset.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1:Q301"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="field.true.integer.attribute"/>
<parameter key="1" value="course.true.integer.attribute"/>
<parameter key="2" value="year.true.integer.attribute"/>
<parameter key="3" value="number of children in family.true.integer.attribute"/>
<parameter key="4" value="which member of family?.true.integer.attribute"/>
<parameter key="5" value="Do grandfather and grandmother live with you?.true.integer.attribute"/>
<parameter key="6" value="Did your parents get divorced?.true.integer.attribute"/>
<parameter key="7" value="Did your father get married again?.true.integer.attribute"/>
<parameter key="8" value="Did your mother get married again?.true.integer.attribute"/>
<parameter key="9" value="Information identity.true.integer.attribute"/>
<parameter key="10" value="Normative identity.true.integer.attribute"/>
<parameter key="11" value="Confused or avoidance identity.true.integer.attribute"/>
<parameter key="12" value="Commitment identity.true.integer.attribute"/>
<parameter key="13" value="Positive affection.true.integer.attribute"/>
<parameter key="14" value="Negative affection.true.integer.attribute"/>
<parameter key="15" value="Average of first semester in year 2016.true.polynominal.label"/>
<parameter key="16" value="Average of previous year.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="true"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="false" class="concurrency:parallel_decision_tree" compatibility="8.2.001" expanded="true" height="103" name="Decision Tree" width="90" x="45" y="238">
<parameter key="criterion" value="gini_index"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.1"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="true" class="discretize_by_frequency" compatibility="8.2.001" expanded="true" height="103" name="Discretize" width="90" x="179" y="34">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="use_sqrt_of_examples" value="false"/>
<parameter key="number_of_bins" value="2"/>
<parameter key="range_name_type" value="long"/>
<parameter key="automatic_number_of_digits" value="true"/>
<parameter key="number_of_digits" value="-1"/>
</operator>
<operator activated="false" class="weka:W-J48" compatibility="7.3.000" expanded="true" height="82" name="W-J48" width="90" x="112" y="187">
<parameter key="U" value="false"/>
<parameter key="C" value="0.25"/>
<parameter key="M" value="2.0"/>
<parameter key="R" value="false"/>
<parameter key="B" value="false"/>
<parameter key="S" value="false"/>
<parameter key="L" value="false"/>
<parameter key="A" value="false"/>
</operator>
<operator activated="false" class="chaid" compatibility="8.2.001" expanded="true" height="82" name="CHAID" width="90" x="581" y="187">
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_gain" value="0.1"/>
<parameter key="maximal_depth" value="20"/>
<parameter key="confidence" value="0.25"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
<parameter key="no_pre_pruning" value="false"/>
<parameter key="no_pruning" value="false"/>
</operator>
<operator activated="false" class="discretize_by_size" compatibility="8.2.001" expanded="true" height="103" name="Discretize (3)" width="90" x="246" y="187">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="size_of_bins" value="3"/>
<parameter key="sorting_direction" value="decreasing"/>
<parameter key="range_name_type" value="long"/>
<parameter key="automatic_number_of_digits" value="true"/>
<parameter key="number_of_digits" value="-1"/>
</operator>
<operator activated="false" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="136">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="معدل ترم اول سال 96-95|رشته|سال|نتیجه عاطفه مثبت|نتیجه عاطفه منفی|هویت اطلاعاتی|هویت تعهد|هویت سردرگم یا اجتنابی|هویت هنجاری"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="false" class="discretize_by_bins" compatibility="8.2.001" expanded="true" height="103" name="Discretize (4)" width="90" x="179" y="136">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="number_of_bins" value="4"/>
<parameter key="define_boundaries" value="false"/>
<parameter key="range_name_type" value="long"/>
<parameter key="automatic_number_of_digits" value="true"/>
<parameter key="number_of_digits" value="3"/>
</operator>
<operator activated="true" class="id3" compatibility="8.2.001" expanded="true" height="82" name="ID3" width="90" x="313" y="34">
<parameter key="criterion" value="information_gain"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_gain" value="0.1"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model" width="90" x="447" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="8.2.001" expanded="true" height="82" name="Performance" width="90" x="581" y="34">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="true"/>
<parameter key="kappa" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="true"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="input 1" to_op="Read Excel" to_port="file"/>
<connect from_op="Read Excel" from_port="output" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="ID3" to_port="training set"/>
<connect from_op="ID3" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="ID3" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<connect from_op="Performance" from_port="example set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Validation:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.2.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="C:\Users\yasamin\Documents\dataset.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1:Q301"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="field.true.nominal.attribute"/>
<parameter key="1" value="course.true.nominal.attribute"/>
<parameter key="2" value="year.true.nominal.attribute"/>
<parameter key="3" value="number of children in family.true.nominal.attribute"/>
<parameter key="4" value="which member of family?.true.nominal.attribute"/>
<parameter key="5" value="Do grandfather and grandmother live with you?.true.nominal.attribute"/>
<parameter key="6" value="Did your parents get divorced?.true.nominal.attribute"/>
<parameter key="7" value="Did your father get married again?.true.nominal.attribute"/>
<parameter key="8" value="Did your mother get married again?.true.nominal.attribute"/>
<parameter key="9" value="Information identity.true.nominal.attribute"/>
<parameter key="10" value="Normative identity.true.nominal.attribute"/>
<parameter key="11" value="Confused or avoidance identity.true.nominal.attribute"/>
<parameter key="12" value="Commitment identity.true.nominal.attribute"/>
<parameter key="13" value="Positive affection.true.nominal.attribute"/>
<parameter key="14" value="Negative affection.true.nominal.attribute"/>
<parameter key="15" value="Average of first semester in year 2016.true.polynominal.label"/>
<parameter key="16" value="Average of previous year.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="true"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.2.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
<parameter key="split_on_batch_attribute" value="false"/>
<parameter key="leave_one_out" value="false"/>
<parameter key="number_of_folds" value="10"/>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="false" class="concurrency:parallel_decision_tree" compatibility="8.2.001" expanded="true" height="103" name="Decision Tree" width="90" x="45" y="187">
<parameter key="criterion" value="gini_index"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.1"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="false" class="weka:W-J48" compatibility="7.3.000" expanded="true" height="82" name="W-J48" width="90" x="179" y="187">
<parameter key="U" value="false"/>
<parameter key="C" value="0.25"/>
<parameter key="M" value="2.0"/>
<parameter key="R" value="false"/>
<parameter key="B" value="false"/>
<parameter key="S" value="false"/>
<parameter key="L" value="false"/>
<parameter key="A" value="false"/>
</operator>
<operator activated="false" class="chaid" compatibility="8.2.001" expanded="true" height="82" name="CHAID" width="90" x="179" y="136">
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_gain" value="0.1"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="confidence" value="0.25"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
<parameter key="no_pre_pruning" value="false"/>
<parameter key="no_pruning" value="false"/>
</operator>
<operator activated="false" class="discretize_by_bins" compatibility="8.2.001" expanded="true" height="103" name="Discretize" width="90" x="45" y="85">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="number_of_bins" value="5"/>
<parameter key="define_boundaries" value="false"/>
<parameter key="range_name_type" value="long"/>
<parameter key="automatic_number_of_digits" value="true"/>
<parameter key="number_of_digits" value="3"/>
</operator>
<operator activated="true" class="id3" compatibility="8.2.001" expanded="true" height="82" name="ID3" width="90" x="179" y="34">
<parameter key="criterion" value="gain_ratio"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_gain" value="0.1"/>
</operator>
<operator activated="false" class="weka:W-Id3" compatibility="7.3.000" expanded="true" height="82" name="W-Id3" width="90" x="45" y="187">
<parameter key="D" value="false"/>
</operator>
<connect from_port="training set" to_op="ID3" to_port="training set"/>
<connect from_op="ID3" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="8.2.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="test result set" to_port="result 4"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
I would be grateful if somebody could help me.
Best regards
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @Yasmin,
In fact, you have to validate the feature selection itself, so the Optimize Selection (Evolutionary) operator has to be placed
inside the "training part" of a Cross Validation operator.
It will take even longer to run, but it is the best way to get reliable results. To understand why you have to proceed like this,
here is a resource written by Dr Ingo Mierswa.
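The reason for nesting can be illustrated outside RapidMiner. The following is a minimal Python sketch, not the Optimize Selection (Evolutionary) algorithm: it assumes purely random binary features and labels (so there is nothing real to learn), a simple agreement score for ranking features, and a majority-vote classifier. All function names are hypothetical. It compares selecting features once on all rows with selecting them again inside every training fold.

```python
import random

def score_features(rows, labels):
    """Per feature: fraction of rows where the feature value equals the label."""
    n = len(rows)
    return [sum(r[j] == y for r, y in zip(rows, labels)) / n
            for j in range(len(rows[0]))]

def select_features(rows, labels, k):
    """Keep the k features whose agreement with the label is furthest from 0.5.
    Returns (index, flip) pairs; flip=True means the feature votes inverted."""
    agree = score_features(rows, labels)
    ranked = sorted(range(len(agree)), key=lambda j: abs(agree[j] - 0.5), reverse=True)
    return [(j, agree[j] < 0.5) for j in ranked[:k]]

def predict(selected, row):
    """Majority vote of the selected (possibly inverted) binary features."""
    votes = sum((1 - row[j]) if flip else row[j] for j, flip in selected)
    return 1 if 2 * votes > len(selected) else 0

def cv_accuracy(rows, labels, k_features, folds=5, select_inside=True, seed=0):
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    # Selection done once on ALL rows: this has already seen every test row.
    global_selection = select_features(rows, labels, k_features)
    hits = 0
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        if select_inside:
            # Nested variant: selection only sees the training part of the fold.
            selected = select_features([rows[i] for i in train],
                                       [labels[i] for i in train], k_features)
        else:
            selected = global_selection
        hits += sum(predict(selected, rows[i]) == labels[i] for i in test)
    return hits / len(rows)

# Assumed toy data: 100 rows, 1000 random binary features, random binary labels.
rng = random.Random(7)
rows = [[rng.randint(0, 1) for _ in range(1000)] for _ in range(100)]
labels = [rng.randint(0, 1) for _ in range(100)]

leaky_acc = cv_accuracy(rows, labels, k_features=11, select_inside=False)
honest_acc = cv_accuracy(rows, labels, k_features=11, select_inside=True)
print(f"selection outside CV: {leaky_acc:.2f}")
print(f"selection inside CV:  {honest_acc:.2f}")
```

On data with no real signal, the estimate with selection outside the cross validation typically looks clearly better than chance, while the nested estimate stays close to 50%. That optimistic gap is what the nested process avoids.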
Here is the new process (you have to replace the data and the model(s) inside the Cross Validation operators):
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve Sonar" width="90" x="112" y="85">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.2.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="85">
<parameter key="number_of_folds" value="5"/>
<process expanded="true">
<operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply" width="90" x="45" y="34"/>
<operator activated="true" class="remember" compatibility="8.2.001" expanded="true" height="68" name="Remember" width="90" x="179" y="34">
<parameter key="name" value="trainingSet"/>
</operator>
<operator activated="true" class="optimize_selection_evolutionary" compatibility="8.2.001" expanded="true" height="103" name="Optimize Selection (Evolutionary)" width="90" x="179" y="187">
<process expanded="true">
<operator activated="true" class="concurrency:cross_validation" compatibility="8.2.001" expanded="true" height="145" name="Cross Validation (2)" width="90" x="246" y="34">
<parameter key="number_of_folds" value="5"/>
<process expanded="true">
<operator activated="true" class="naive_bayes" compatibility="8.2.001" expanded="true" height="82" name="Naive Bayes (2)" width="90" x="246" y="34"/>
<connect from_port="training set" to_op="Naive Bayes (2)" to_port="training set"/>
<connect from_op="Naive Bayes (2)" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="8.2.001" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance (2)" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_port="example set" to_op="Cross Validation (2)" to_port="example set"/>
<connect from_op="Cross Validation (2)" from_port="performance 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select_by_weights" compatibility="8.2.001" expanded="true" height="103" name="Select by Weights" width="90" x="313" y="34"/>
<operator activated="true" class="naive_bayes" compatibility="8.2.001" expanded="true" height="82" name="Naive Bayes" width="90" x="447" y="34"/>
<connect from_port="training set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Remember" to_port="store"/>
<connect from_op="Multiply" from_port="output 2" to_op="Optimize Selection (Evolutionary)" to_port="example set in"/>
<connect from_op="Remember" from_port="stored" to_op="Select by Weights" to_port="example set input"/>
<connect from_op="Optimize Selection (Evolutionary)" from_port="weights" to_op="Select by Weights" to_port="weights"/>
<connect from_op="Select by Weights" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Select by Weights" from_port="weights" to_port="through 1"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<portSpacing port="sink_through 2" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="select_by_weights" compatibility="8.2.001" expanded="true" height="103" name="Select by Weights (2)" width="90" x="45" y="85"/>
<operator activated="true" class="recall" compatibility="8.2.001" expanded="true" height="68" name="Recall (4)" width="90" x="45" y="238">
<parameter key="name" value="trainingSet"/>
</operator>
<operator activated="true" class="superset" compatibility="8.2.001" expanded="true" height="82" name="Superset" width="90" x="246" y="187"/>
<operator activated="true" class="apply_model" compatibility="8.2.001" expanded="true" height="82" name="Apply Model" width="90" x="246" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="8.2.001" expanded="true" height="82" name="Performance" width="90" x="380" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Select by Weights (2)" to_port="example set input"/>
<connect from_port="through 1" to_op="Select by Weights (2)" to_port="weights"/>
<connect from_op="Select by Weights (2)" from_port="example set output" to_op="Superset" to_port="example set 1"/>
<connect from_op="Recall (4)" from_port="result" to_op="Superset" to_port="example set 2"/>
<connect from_op="Superset" from_port="superset 1" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="source_through 2" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve Sonar" from_port="output" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>
I hope it helps,
Regards,
Lionel
Answers
Hi @Yasmin,
Here are some elements of an answer:
1/ Model validation:
You have run into the difference between training error and test error:
- In the first case (your process "Model"), you evaluate the training error of your model. Training error is the error you get when you run the trained model back on the training data. This data has already been used to train the model, so a low training error does not necessarily mean that the model will perform accurately when applied to "unseen data".
- In the second case (your process "Validation" with the Cross Validation operator), you evaluate the test error of your model.
Test error is the error you get when you run the trained model on data that it has never been exposed to.
This error is representative of the performance of your model on future "unseen data".
So to answer your question "Which accuracy can I use in my project?", I encourage you to always use Cross Validation before applying a model, and to report the cross-validated accuracy.
NB: In your case, training accuracy = 95% and test accuracy = ~50-60%. Such a gap is expected when a method easily overfits the training data yet generalizes poorly.
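The gap between the two numbers can be reproduced with a small Python sketch. It is only an illustration under stated assumptions: the data are random (300 rows of 15 nominal attributes with unrelated labels), the "memorizer" is a hypothetical stand-in for an unpruned tree rather than RapidMiner's ID3, and the folds are plain shuffled folds, not stratified ones.

```python
import random
from collections import Counter

def fit_memorizer(rows, labels):
    """Stand-in for an unpruned tree: remember the label of every training row."""
    table = {}
    for row, label in zip(rows, labels):
        table.setdefault(row, Counter())[label] += 1
    majority = Counter(labels).most_common(1)[0][0]
    lookup = {row: counts.most_common(1)[0][0] for row, counts in table.items()}
    return lookup, majority

def predict(model, row):
    lookup, majority = model
    # Rows never seen in training fall back to the majority class.
    return lookup.get(row, majority)

def accuracy(model, rows, labels):
    hits = sum(predict(model, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def cross_val_accuracy(rows, labels, k=10, seed=0):
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # The model is fitted on the training part only, then scored on the rest.
        model = fit_memorizer([rows[i] for i in train], [labels[i] for i in train])
        scores.append(accuracy(model, [rows[i] for i in test], [labels[i] for i in test]))
    return sum(scores) / k

# Assumed toy data: attributes and labels are unrelated by construction.
rng = random.Random(42)
rows = [tuple(rng.randint(1, 5) for _ in range(15)) for _ in range(300)]
labels = [rng.choice(["low", "high"]) for _ in range(300)]

train_acc = accuracy(fit_memorizer(rows, labels), rows, labels)
cv_acc = cross_val_accuracy(rows, labels)
print(f"training accuracy:   {train_acc:.2f}")
print(f"10-fold CV accuracy: {cv_acc:.2f}")
```

The memorizer scores perfectly, or very nearly so, on the rows it was trained on, yet its cross-validated accuracy stays around chance level, because memorized rows say nothing about new ones. Only the cross-validated figure tells you what to expect on future data.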
2/ Feature selection:
RapidMiner offers many operators for feature selection.
It can be worthwhile to experiment with these different methods on your data.
To help you understand these methods, you can read this resource from Dr Ingo Mierswa:
Feature Selection
In my personal experience, the Optimize Selection (Evolutionary) operator performs particularly well: by using
this method of feature selection, you can improve the final accuracy of your model (compared with no feature selection).
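As a simpler baseline for the question "which attributes matter most?", the information gain criterion that ID3 uses for its splits can also be computed per attribute and used as a ranking. The Python sketch below is an illustration only: the attribute names and the six toy rows are made up, and it is not how any RapidMiner operator is implemented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the label minus the weighted entropy after splitting on one attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two nominal attributes and a nominal label.
data = {
    "commitment": ["high", "high", "low", "low", "high", "low"],
    "year":       ["1", "2", "1", "2", "1", "2"],
}
grade = ["good", "good", "poor", "poor", "good", "poor"]

# Rank attributes from most to least informative about the label.
ranking = sorted(data, key=lambda a: information_gain(data[a], grade), reverse=True)
for name in ranking:
    print(name, round(information_gain(data[name], grade), 3))
```

In this toy data "commitment" separates the two grades completely, so it ranks first. A ranking like this is itself fitted to the data, so it should be computed inside the training part of the cross validation as well.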
I hope it helps,
Regards,
Lionel
Thank you so much for your response.
Actually, I have added 2 attributes to my dataset and changed my label, which I could do since I built and collected the dataset myself. I then used Forward Selection and Optimize Selection (Evolutionary), as you suggested. The accuracy rose to almost 88%, but the execution time is slightly longer than with Forward Selection. Is that a problem?
Also, I am not sure whether my process is correct. Do I need to use weighting operators such as Optimize Weights (Evolutionary)? Could you please guide me further?
Here is my process.
Thank you in advance.
Regards
Thank you so much for your help.