Choosing the best approach to impute numerical missing values
Hello everybody,
As part of a scientific project for the university, I have to develop a data preprocessing model. Currently I am struggling with missing values.
I have a data set with exclusively numerical attributes, in which numerous values are missing. I would like to implement the following in RapidMiner (RM):
- For each attribute I would like to use 2-3 different methods (e.g. linear interpolation, quadratic interpolation, cubic interpolation, the kNN algorithm; other algorithms that can be used to impute missing values are also welcome) to replace the missing values with statistically calculated values.
- Then I want to calculate the performance of each method for each attribute and at the end select the best method for imputing missing values for each attribute.
It would be great if someone could help me.
Many thanks in advance
Moritz
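As a minimal sketch of one of the methods listed in the question, here is linear interpolation written in plain Python rather than as a RapidMiner process. The function name `linear_interpolate` and the convention of marking missing cells with `None` are assumptions made for illustration. Quadratic or cubic interpolation would need a different fitting step, which is not shown.

```python
def linear_interpolate(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Gaps at the start or end have only one known neighbour, so they are
    filled with that neighbour's value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one known value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            # Position of the gap between its two known neighbours (0..1).
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


# Made-up example column; None marks a missing value.
column = [1.0, None, 3.0, None, None, 6.0, None]
print(linear_interpolate(column))
```

Interpolation assumes that the rows have a meaningful order (for example time), so it may not suit every attribute.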
Answers
Sure, you can measure the performance of an imputation method. Here is an example that validates kNN imputation on the missing Age values in the Titanic data. To compute a regression performance we need two columns: a "ground truth" column and the kNN estimate.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="34"> <parameter key="repository_entry" value="//Samples/data/Titanic"/> </operator> <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="34"> <parameter key="parameter_expression" value=""/> <parameter key="condition_class" value="custom_filters"/> <parameter key="invert_filter" value="false"/> <list key="filters_list"> <parameter key="filters_entry_key" value="Age.is_not_missing."/> </list> <parameter key="filters_logic_and" value="true"/> <parameter key="filters_check_metadata" value="true"/> <description align="center" color="transparent" colored="false" width="126">use the data with non-missing age to validate the knn imputation methods</description> </operator> <operator activated="true" class="split_data" compatibility="9.1.000" expanded="true" height="103" name="Split Data" width="90" x="380" y="85"> <enumeration key="partitions"> <parameter key="ratio" value="0.8"/> <parameter key="ratio" value="0.2"/> </enumeration> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <description align="center" color="transparent" colored="false" width="126">80% for training set, 20% for testing</description> </operator> 
<operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="514" y="136"> <list key="function_descriptions"> <parameter key="new_age" value="0/0"/> <parameter key="class" value="&quot;test&quot;"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34"> <list key="function_descriptions"> <parameter key="new_age" value="Age"/> <parameter key="class" value="&quot;train&quot;"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="103" name="Append" width="90" x="648" y="85"> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> <parameter key="merge_type" value="all"/> </operator> <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values" width="90" x="782" y="85"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value="new_age"/> <parameter key="attributes" value="Age|Cabin|Life Boat"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> <parameter key="iterate" value="true"/> <parameter key="learn_on_complete_cases" value="true"/> <parameter key="order" value="chronological"/> <parameter key="sort" value="ascending"/> <parameter
key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="k_nn" compatibility="9.1.000" expanded="true" height="82" name="k-NN" width="90" x="246" y="34"> <parameter key="k" value="5"/> <parameter key="weighted_vote" value="true"/> <parameter key="measure_types" value="MixedMeasures"/> <parameter key="mixed_measure" value="MixedEuclideanDistance"/> <parameter key="nominal_measure" value="NominalDistance"/> <parameter key="numerical_measure" value="EuclideanDistance"/> <parameter key="divergence" value="GeneralizedIDivergence"/> <parameter key="kernel_type" value="radial"/> <parameter key="kernel_gamma" value="1.0"/> <parameter key="kernel_sigma1" value="1.0"/> <parameter key="kernel_sigma2" value="0.0"/> <parameter key="kernel_sigma3" value="2.0"/> <parameter key="kernel_degree" value="3.0"/> <parameter key="kernel_shift" value="1.0"/> <parameter key="kernel_a" value="1.0"/> <parameter key="kernel_b" value="0.0"/> </operator> <connect from_port="example set source" to_op="k-NN" to_port="training set"/> <connect from_op="k-NN" from_port="model" to_port="model sink"/> <portSpacing port="source_example set source" spacing="0"/> <portSpacing port="sink_model sink" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values</description> </operator> <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (2)" width="90" x="916" y="85"> <parameter key="parameter_expression" value=""/> <parameter key="condition_class" value="custom_filters"/> <parameter key="invert_filter" value="false"/> <list key="filters_list"> <parameter key="filters_entry_key" value="class.equals.test"/> </list> <parameter key="filters_logic_and" value="true"/> <parameter key="filters_check_metadata" value="true"/> <description align="center" 
color="transparent" colored="false" width="126">check performance e.g. RMSE on testing</description> </operator> <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="1050" y="85"> <parameter key="attribute_name" value="Age"/> <parameter key="target_role" value="label"/> <list key="set_additional_roles"> <parameter key="new_age" value="prediction"/> </list> </operator> <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance" width="90" x="1184" y="34"> <parameter key="main_criterion" value="first"/> <parameter key="root_mean_squared_error" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="relative_error" value="true"/> <parameter key="relative_error_lenient" value="false"/> <parameter key="relative_error_strict" value="true"/> <parameter key="normalized_absolute_error" value="false"/> <parameter key="root_relative_squared_error" value="false"/> <parameter key="squared_error" value="true"/> <parameter key="correlation" value="true"/> <parameter key="squared_correlation" value="true"/> <parameter key="prediction_average" value="true"/> <parameter key="spearman_rho" value="true"/> <parameter key="kendall_tau" value="true"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> </operator> <connect from_op="Retrieve Titanic" from_port="output" to_op="Filter Examples" to_port="example set input"/> <connect from_op="Filter Examples" from_port="example set output" to_op="Split Data" to_port="example set"/> <connect from_op="Split Data" from_port="partition 1" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Split Data" from_port="partition 2" to_op="Generate Attributes (2)" to_port="example set input"/> <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/> <connect 
from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/> <connect from_op="Append" from_port="merged set" to_op="Impute Missing Values" to_port="example set in"/> <connect from_op="Impute Missing Values" from_port="example set out" to_op="Filter Examples (2)" to_port="example set input"/> <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/> <connect from_op="Set Role" from_port="example set output" to_op="Performance" to_port="labelled data"/> <connect from_op="Performance" from_port="performance" to_port="result 1"/> <connect from_op="Performance" from_port="example set" to_port="result 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> </process> </operator> </process>
The two "Generate Attributes" operators create a new column, new_age, on which the imputation is run, plus a class column that marks each row as train or test. 80% of the rows with a known Age keep that value in new_age; you can change the split ratio in "Split Data". For the remaining 20% we pretend that Age is missing (new_age is generated as 0/0) and use kNN to impute it.
To impute the missing values in new_age, I dropped the ground-truth "Age" and skipped imputation for "Life Boat" and "Cabin".
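The validation idea described above (hide known values, impute them, compare with the ground truth) can also be sketched in plain Python. This is not a translation of the RapidMiner process: `knn_impute` and `holdout_rmse` are hypothetical helper names, the neighbours are averaged without distance weighting, all predictor columns are assumed to be numeric and complete, and the demo rows are synthetic.

```python
import math
import random


def knn_impute(rows, target, k=5):
    """Fill missing rows[i][target] with the mean target of the k nearest donors.

    Donors are rows whose target is known; distance is Euclidean over all
    other columns.
    """
    features = [c for c in range(len(rows[0])) if c != target]
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("no row with a known target value")
    out = []
    for r in rows:
        new = list(r)
        if r[target] is None:
            point = [r[c] for c in features]
            nearest = sorted(
                donors, key=lambda d: math.dist(point, [d[c] for c in features])
            )[:k]
            new[target] = sum(d[target] for d in nearest) / len(nearest)
        out.append(new)
    return out


def holdout_rmse(rows, target, test_ratio=0.2, k=5, seed=2001):
    """Hide the target in a share of the complete rows and score the imputation."""
    complete = [r for r in rows if r[target] is not None]
    n_test = max(1, int(len(complete) * test_ratio))
    test_idx = set(random.Random(seed).sample(range(len(complete)), n_test))
    # Pretend the target is missing in the test rows.
    masked = [
        [None if (i in test_idx and c == target) else v for c, v in enumerate(r)]
        for i, r in enumerate(complete)
    ]
    imputed = knn_impute(masked, target, k)
    errors = [(complete[i][target] - imputed[i][target]) ** 2 for i in test_idx]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic data: the third column depends on the first two.
rng = random.Random(0)
rows = [[x, y, 2 * x + y]
        for x, y in ((rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(50))]
print(holdout_rmse(rows, target=2))
```

Only rows with a known target enter the evaluation, which mirrors the role of the first "Filter Examples" operator in the process.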
YY
Many thanks for your answer.
Correct. With your XML code, I now have one way to replace missing values, along with an evaluation (RMSE). Now I would like to try two more methods to see whether they perform better than the kNN algorithm. In the end, I want to select the imputation method that gives the best RMSE value. I currently don't know how to do this.
Can you help me?
BR
Moritz
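One way to do this selection step, sketched in plain Python under stated assumptions: every candidate method is scored on the same hidden cells, and the method with the lowest RMSE is kept for each attribute. The three fill functions, `masked_rmse` and `select_best` are hypothetical names standing in for whatever imputation methods are being compared, and the example table is made up.

```python
import math
import random
import statistics


def fill_mean(col):
    m = statistics.mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def fill_median(col):
    m = statistics.median(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def fill_previous(col):
    """Carry the last known value forward; a leading gap uses the first known value."""
    last = next(v for v in col if v is not None)
    out = []
    for v in col:
        if v is not None:
            last = v
        out.append(last)
    return out


def masked_rmse(col, impute, hide_fraction=0.2, seed=2001):
    """Hide some known cells, impute them, and return the RMSE on the hidden cells."""
    known = [i for i, v in enumerate(col) if v is not None]
    if len(known) < 2:
        raise ValueError("need at least two known values")
    n_hide = min(len(known) - 1, max(1, int(len(known) * hide_fraction)))
    # A fixed seed hides the same cells for every method, so scores are comparable.
    hidden = random.Random(seed).sample(known, n_hide)
    masked = [None if i in hidden else v for i, v in enumerate(col)]
    filled = impute(masked)
    return math.sqrt(sum((col[i] - filled[i]) ** 2 for i in hidden) / len(hidden))


def select_best(table, methods):
    """Return {attribute: (best method name, {method name: RMSE})}."""
    result = {}
    for name, col in table.items():
        scores = {m: masked_rmse(col, f) for m, f in methods.items()}
        result[name] = (min(scores, key=scores.get), scores)
    return result


methods = {"mean": fill_mean, "median": fill_median, "previous": fill_previous}
table = {
    "constant": [5.0, 5.0, None, 5.0, 5.0, 5.0, 5.0, None, 5.0, 5.0],
    "trend": [1.0, 2.0, 3.0, None, 5.0, 6.0, 7.0, 8.0, None, 10.0, 11.0, 12.0],
}
for name, (best, scores) in select_best(table, methods).items():
    print(name, best, scores)
```

With a single random mask the ranking can be noisy, so repeating the scoring with several seeds and averaging the RMSE is probably a safer basis for the final choice.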
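You can easily replicate the imputation with other machine learning algorithms. Check out this process, which runs the same validation for kNN, Gradient Boosted Trees (GBT), a Generalized Linear Model (GLM) and Deep Learning (DL):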
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="34"> <parameter key="repository_entry" value="//Samples/data/Titanic"/> </operator> <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="34"> <parameter key="parameter_expression" value=""/> <parameter key="condition_class" value="custom_filters"/> <parameter key="invert_filter" value="false"/> <list key="filters_list"> <parameter key="filters_entry_key" value="Age.is_not_missing."/> </list> <parameter key="filters_logic_and" value="true"/> <parameter key="filters_check_metadata" value="true"/> <description align="center" color="transparent" colored="false" width="126">use the data with non-missing age to validate the knn imputation methods</description> </operator> <operator activated="true" class="split_data" compatibility="9.1.000" expanded="true" height="103" name="Split Data" width="90" x="380" y="85"> <enumeration key="partitions"> <parameter key="ratio" value="0.8"/> <parameter key="ratio" value="0.2"/> </enumeration> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <description align="center" color="transparent" colored="false" width="126">80% for training set, 20% for testing</description> </operator> 
<operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="514" y="136"> <list key="function_descriptions"> <parameter key="new_age" value="0/0"/> <parameter key="class" value="&quot;test&quot;"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34"> <list key="function_descriptions"> <parameter key="new_age" value="Age"/> <parameter key="class" value="&quot;train&quot;"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="103" name="Append" width="90" x="648" y="85"> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> <parameter key="merge_type" value="all"/> </operator> <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="782" y="85"> <parameter key="attribute_name" value="Age"/> <parameter key="target_role" value="label"/> <list key="set_additional_roles"> <parameter key="Name" value="NAME"/> <parameter key="Ticket Number" value="TICKET"/> </list> </operator> <operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="145" name="Multiply" width="90" x="916" y="85"/> <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values" width="90" x="1117" y="34"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value="new_age"/> <parameter key="attributes" value="Age|Cabin|Life Boat"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter
key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> <parameter key="iterate" value="true"/> <parameter key="learn_on_complete_cases" value="true"/> <parameter key="order" value="chronological"/> <parameter key="sort" value="ascending"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="k_nn" compatibility="9.1.000" expanded="true" height="82" name="k-NN" width="90" x="246" y="34"> <parameter key="k" value="5"/> <parameter key="weighted_vote" value="true"/> <parameter key="measure_types" value="MixedMeasures"/> <parameter key="mixed_measure" value="MixedEuclideanDistance"/> <parameter key="nominal_measure" value="NominalDistance"/> <parameter key="numerical_measure" value="EuclideanDistance"/> <parameter key="divergence" value="GeneralizedIDivergence"/> <parameter key="kernel_type" value="radial"/> <parameter key="kernel_gamma" value="1.0"/> <parameter key="kernel_sigma1" value="1.0"/> <parameter key="kernel_sigma2" value="0.0"/> <parameter key="kernel_sigma3" value="2.0"/> <parameter key="kernel_degree" value="3.0"/> <parameter key="kernel_shift" value="1.0"/> <parameter key="kernel_a" value="1.0"/> <parameter key="kernel_b" value="0.0"/> </operator> <connect from_port="example set source" to_op="k-NN" to_port="training set"/> <connect from_op="k-NN" from_port="model" to_port="model sink"/> <portSpacing port="source_example set source" spacing="0"/> <portSpacing port="sink_model sink" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with KNN</description> </operator> <operator 
activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (2)" width="90" x="1251" y="34"> <parameter key="parameter_expression" value=""/> <parameter key="condition_class" value="custom_filters"/> <parameter key="invert_filter" value="false"/> <list key="filters_list"> <parameter key="filters_entry_key" value="class.equals.test"/> </list> <parameter key="filters_logic_and" value="true"/> <parameter key="filters_check_metadata" value="true"/> </operator> <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Impute by KNN" width="90" x="1452" y="34"> <parameter key="attribute_name" value="new_age"/> <parameter key="target_role" value="prediction"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: KNN" width="90" x="1586" y="34"> <parameter key="main_criterion" value="first"/> <parameter key="root_mean_squared_error" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="relative_error" value="true"/> <parameter key="relative_error_lenient" value="false"/> <parameter key="relative_error_strict" value="true"/> <parameter key="normalized_absolute_error" value="false"/> <parameter key="root_relative_squared_error" value="false"/> <parameter key="squared_error" value="true"/> <parameter key="correlation" value="true"/> <parameter key="squared_correlation" value="true"/> <parameter key="prediction_average" value="true"/> <parameter key="spearman_rho" value="true"/> <parameter key="kendall_tau" value="true"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> <description align="center" color="transparent" colored="false" width="126">check performance e.g. 
RMSE on testing</description> </operator> <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values (2)" width="90" x="1117" y="289"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value="new_age"/> <parameter key="attributes" value="Age|Cabin|Life Boat"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> <parameter key="iterate" value="true"/> <parameter key="learn_on_complete_cases" value="true"/> <parameter key="order" value="chronological"/> <parameter key="sort" value="ascending"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="9.0.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="313" y="34"> <parameter key="number_of_trees" value="20"/> <parameter key="reproducible" value="false"/> <parameter key="maximum_number_of_threads" value="4"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="maximal_depth" value="5"/> <parameter key="min_rows" value="10.0"/> <parameter key="min_split_improvement" value="0.0"/> <parameter key="number_of_bins" value="20"/> <parameter key="learning_rate" value="0.1"/> <parameter key="sample_rate" value="1.0"/> <parameter key="distribution" value="AUTO"/> <parameter key="early_stopping" value="false"/> <parameter 
key="stopping_rounds" value="1"/> <parameter key="stopping_metric" value="AUTO"/> <parameter key="stopping_tolerance" value="0.001"/> <parameter key="max_runtime_seconds" value="0"/> <list key="expert_parameters"/> </operator> <operator activated="false" class="h2o:deep_learning" compatibility="9.0.000" expanded="true" height="82" name="Deep Learning" width="90" x="313" y="289"> <parameter key="activation" value="Rectifier"/> <enumeration key="hidden_layer_sizes"> <parameter key="hidden_layer_sizes" value="50"/> <parameter key="hidden_layer_sizes" value="50"/> </enumeration> <enumeration key="hidden_dropout_ratios"/> <parameter key="reproducible_(uses_1_thread)" value="false"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="epochs" value="10.0"/> <parameter key="compute_variable_importances" value="false"/> <parameter key="train_samples_per_iteration" value="-2"/> <parameter key="adaptive_rate" value="true"/> <parameter key="epsilon" value="1.0E-8"/> <parameter key="rho" value="0.99"/> <parameter key="learning_rate" value="0.005"/> <parameter key="learning_rate_annealing" value="1.0E-6"/> <parameter key="learning_rate_decay" value="1.0"/> <parameter key="momentum_start" value="0.0"/> <parameter key="momentum_ramp" value="1000000.0"/> <parameter key="momentum_stable" value="0.0"/> <parameter key="nesterov_accelerated_gradient" value="true"/> <parameter key="standardize" value="true"/> <parameter key="L1" value="1.0E-5"/> <parameter key="L2" value="0.0"/> <parameter key="max_w2" value="10.0"/> <parameter key="loss_function" value="Automatic"/> <parameter key="distribution_function" value="AUTO"/> <parameter key="early_stopping" value="false"/> <parameter key="stopping_rounds" value="1"/> <parameter key="stopping_metric" value="AUTO"/> <parameter key="stopping_tolerance" value="0.001"/> <parameter key="missing_values_handling" value="MeanImputation"/> <parameter key="max_runtime_seconds" 
value="0"/> <list key="expert_parameters"/> <list key="expert_parameters_"/> </operator> <connect from_port="example set source" to_op="Gradient Boosted Trees" to_port="training set"/> <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model sink"/> <portSpacing port="source_example set source" spacing="0"/> <portSpacing port="sink_model sink" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with GBT</description> </operator> <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (3)" width="90" x="1251" y="289"> <parameter key="parameter_expression" value=""/> <parameter key="condition_class" value="custom_filters"/> <parameter key="invert_filter" value="false"/> <list key="filters_list"> <parameter key="filters_entry_key" value="class.equals.test"/> </list> <parameter key="filters_logic_and" value="true"/> <parameter key="filters_check_metadata" value="true"/> </operator> <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Impute by GBT" width="90" x="1452" y="289"> <parameter key="attribute_name" value="new_age"/> <parameter key="target_role" value="prediction"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: GBT" width="90" x="1586" y="289"> <parameter key="main_criterion" value="first"/> <parameter key="root_mean_squared_error" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="relative_error" value="true"/> <parameter key="relative_error_lenient" value="false"/> <parameter key="relative_error_strict" value="true"/> <parameter key="normalized_absolute_error" value="false"/> <parameter key="root_relative_squared_error" value="false"/> <parameter key="squared_error" value="true"/> 
<parameter key="correlation" value="true"/> <parameter key="squared_correlation" value="true"/> <parameter key="prediction_average" value="true"/> <parameter key="spearman_rho" value="true"/> <parameter key="kendall_tau" value="true"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> <description align="center" color="transparent" colored="false" width="126">check performance e.g. RMSE on testing</description> </operator> <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values (3)" width="90" x="1117" y="493"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value="new_age"/> <parameter key="attributes" value="Age|Cabin|Life Boat"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> <parameter key="iterate" value="true"/> <parameter key="learn_on_complete_cases" value="true"/> <parameter key="order" value="chronological"/> <parameter key="sort" value="ascending"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.0.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="313" y="34"> <parameter key="family" value="AUTO"/> <parameter key="link" value="family_default"/> <parameter key="solver" value="AUTO"/> <parameter key="reproducible" value="false"/> <parameter 
key="maximum_number_of_threads" value="4"/> <parameter key="use_regularization" value="true"/> <parameter key="lambda_search" value="false"/> <parameter key="number_of_lambdas" value="0"/> <parameter key="lambda_min_ratio" value="0.0"/> <parameter key="early_stopping" value="true"/> <parameter key="stopping_rounds" value="3"/> <parameter key="stopping_tolerance" value="0.001"/> <parameter key="standardize" value="true"/> <parameter key="non-negative_coefficients" value="false"/> <parameter key="add_intercept" value="true"/> <parameter key="compute_p-values" value="false"/> <parameter key="remove_collinear_columns" value="false"/> <parameter key="missing_values_handling" value="MeanImputation"/> <parameter key="max_iterations" value="0"/> <parameter key="specify_beta_constraints" value="false"/> <list key="beta_constraints"/> <parameter key="max_runtime_seconds" value="0"/> <list key="expert_parameters"/> </operator> <connect from_port="example set source" to_op="Generalized Linear Model (2)" to_port="training set"/> <connect from_op="Generalized Linear Model (2)" from_port="model" to_port="model sink"/> <portSpacing port="source_example set source" spacing="0"/> <portSpacing port="sink_model sink" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with GLM</description> </operator> <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (4)" width="90" x="1251" y="493"> <parameter key="parameter_expression" value=""/> <parameter key="condition_class" value="custom_filters"/> <parameter key="invert_filter" value="false"/> <list key="filters_list"> <parameter key="filters_entry_key" value="class.equals.test"/> </list> <parameter key="filters_logic_and" value="true"/> <parameter key="filters_check_metadata" value="true"/> </operator> <operator activated="true" class="set_role" compatibility="9.1.000" 
expanded="true" height="82" name="Impute by GLM" width="90" x="1452" y="493"> <parameter key="attribute_name" value="new_age"/> <parameter key="target_role" value="prediction"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: GLM" width="90" x="1586" y="493"> <parameter key="main_criterion" value="first"/> <parameter key="root_mean_squared_error" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="relative_error" value="true"/> <parameter key="relative_error_lenient" value="false"/> <parameter key="relative_error_strict" value="true"/> <parameter key="normalized_absolute_error" value="false"/> <parameter key="root_relative_squared_error" value="false"/> <parameter key="squared_error" value="true"/> <parameter key="correlation" value="true"/> <parameter key="squared_correlation" value="true"/> <parameter key="prediction_average" value="true"/> <parameter key="spearman_rho" value="true"/> <parameter key="kendall_tau" value="true"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> <description align="center" color="transparent" colored="false" width="126">check performance e.g. 
RMSE on testing</description> </operator> <operator activated="true" class="impute_missing_values" compatibility="9.1.000" expanded="true" height="68" name="Impute Missing Values (4)" width="90" x="1117" y="697"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value="new_age"/> <parameter key="attributes" value="Age|Cabin|Life Boat"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> <parameter key="iterate" value="true"/> <parameter key="learn_on_complete_cases" value="true"/> <parameter key="order" value="chronological"/> <parameter key="sort" value="ascending"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="h2o:deep_learning" compatibility="9.0.000" expanded="true" height="82" name="Deep Learning (2)" width="90" x="447" y="34"> <parameter key="activation" value="Rectifier"/> <enumeration key="hidden_layer_sizes"> <parameter key="hidden_layer_sizes" value="50"/> <parameter key="hidden_layer_sizes" value="50"/> </enumeration> <enumeration key="hidden_dropout_ratios"/> <parameter key="reproducible_(uses_1_thread)" value="false"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="epochs" value="10.0"/> <parameter key="compute_variable_importances" value="false"/> <parameter key="train_samples_per_iteration" value="-2"/> <parameter key="adaptive_rate" value="true"/> <parameter key="epsilon" 
value="1.0E-8"/> <parameter key="rho" value="0.99"/> <parameter key="learning_rate" value="0.005"/> <parameter key="learning_rate_annealing" value="1.0E-6"/> <parameter key="learning_rate_decay" value="1.0"/> <parameter key="momentum_start" value="0.0"/> <parameter key="momentum_ramp" value="1000000.0"/> <parameter key="momentum_stable" value="0.0"/> <parameter key="nesterov_accelerated_gradient" value="true"/> <parameter key="standardize" value="true"/> <parameter key="L1" value="1.0E-5"/> <parameter key="L2" value="0.0"/> <parameter key="max_w2" value="10.0"/> <parameter key="loss_function" value="Automatic"/> <parameter key="distribution_function" value="AUTO"/> <parameter key="early_stopping" value="false"/> <parameter key="stopping_rounds" value="1"/> <parameter key="stopping_metric" value="AUTO"/> <parameter key="stopping_tolerance" value="0.001"/> <parameter key="missing_values_handling" value="MeanImputation"/> <parameter key="max_runtime_seconds" value="0"/> <list key="expert_parameters"/> <list key="expert_parameters_"/> </operator> <connect from_port="example set source" to_op="Deep Learning (2)" to_port="training set"/> <connect from_op="Deep Learning (2)" from_port="model" to_port="model sink"/> <portSpacing port="source_example set source" spacing="0"/> <portSpacing port="sink_model sink" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">apply imputation on the missing values with DL</description> </operator> <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (5)" width="90" x="1251" y="697"> <parameter key="parameter_expression" value=""/> <parameter key="condition_class" value="custom_filters"/> <parameter key="invert_filter" value="false"/> <list key="filters_list"> <parameter key="filters_entry_key" value="class.equals.test"/> </list> <parameter key="filters_logic_and" value="true"/> <parameter key="filters_check_metadata" 
value="true"/> </operator> <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Impute by DL" width="90" x="1452" y="697"> <parameter key="attribute_name" value="new_age"/> <parameter key="target_role" value="prediction"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance: DL" width="90" x="1586" y="697"> <parameter key="main_criterion" value="first"/> <parameter key="root_mean_squared_error" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="relative_error" value="true"/> <parameter key="relative_error_lenient" value="false"/> <parameter key="relative_error_strict" value="true"/> <parameter key="normalized_absolute_error" value="false"/> <parameter key="root_relative_squared_error" value="false"/> <parameter key="squared_error" value="true"/> <parameter key="correlation" value="true"/> <parameter key="squared_correlation" value="true"/> <parameter key="prediction_average" value="true"/> <parameter key="spearman_rho" value="true"/> <parameter key="kendall_tau" value="true"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> <description align="center" color="transparent" colored="false" width="126">check performance e.g. 
RMSE on testing</description> </operator> <connect from_op="Retrieve Titanic" from_port="output" to_op="Filter Examples" to_port="example set input"/> <connect from_op="Filter Examples" from_port="example set output" to_op="Split Data" to_port="example set"/> <connect from_op="Split Data" from_port="partition 1" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Split Data" from_port="partition 2" to_op="Generate Attributes (2)" to_port="example set input"/> <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/> <connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/> <connect from_op="Append" from_port="merged set" to_op="Set Role" to_port="example set input"/> <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="Impute Missing Values" to_port="example set in"/> <connect from_op="Multiply" from_port="output 2" to_op="Impute Missing Values (2)" to_port="example set in"/> <connect from_op="Multiply" from_port="output 3" to_op="Impute Missing Values (3)" to_port="example set in"/> <connect from_op="Multiply" from_port="output 4" to_op="Impute Missing Values (4)" to_port="example set in"/> <connect from_op="Impute Missing Values" from_port="example set out" to_op="Filter Examples (2)" to_port="example set input"/> <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Impute by KNN" to_port="example set input"/> <connect from_op="Impute by KNN" from_port="example set output" to_op="Performance: KNN" to_port="labelled data"/> <connect from_op="Performance: KNN" from_port="performance" to_port="result 1"/> <connect from_op="Performance: KNN" from_port="example set" to_port="result 2"/> <connect from_op="Impute Missing Values (2)" from_port="example set out" to_op="Filter Examples (3)" to_port="example set input"/> 
<connect from_op="Filter Examples (3)" from_port="example set output" to_op="Impute by GBT" to_port="example set input"/> <connect from_op="Impute by GBT" from_port="example set output" to_op="Performance: GBT" to_port="labelled data"/> <connect from_op="Performance: GBT" from_port="performance" to_port="result 3"/> <connect from_op="Performance: GBT" from_port="example set" to_port="result 4"/> <connect from_op="Impute Missing Values (3)" from_port="example set out" to_op="Filter Examples (4)" to_port="example set input"/> <connect from_op="Filter Examples (4)" from_port="example set output" to_op="Impute by GLM" to_port="example set input"/> <connect from_op="Impute by GLM" from_port="example set output" to_op="Performance: GLM" to_port="labelled data"/> <connect from_op="Performance: GLM" from_port="performance" to_port="result 5"/> <connect from_op="Performance: GLM" from_port="example set" to_port="result 6"/> <connect from_op="Impute Missing Values (4)" from_port="example set out" to_op="Filter Examples (5)" to_port="example set input"/> <connect from_op="Filter Examples (5)" from_port="example set output" to_op="Impute by DL" to_port="example set input"/> <connect from_op="Impute by DL" from_port="example set output" to_op="Performance: DL" to_port="labelled data"/> <connect from_op="Performance: DL" from_port="performance" to_port="result 7"/> <connect from_op="Performance: DL" from_port="example set" to_port="result 8"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="210"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="168"/> <portSpacing port="sink_result 6" spacing="0"/> <portSpacing port="sink_result 7" spacing="147"/> <portSpacing port="sink_result 8" spacing="0"/> <portSpacing port="sink_result 9" spacing="0"/> </process> </operator> </process>
Thank you very much for your answers.
@yyhuang: Thanks for your work. Even if your approach doesn't do everything I would like to do, it helps me a lot. With my dataset, the GBT, GLM and DL algorithms do not work. I always get the error message shown in the screenshot. Can you tell me what the problem is? (To be honest, I don't know exactly how the algorithms work; I hope that's not mandatory.) Do you have any idea how I can integrate linear interpolation or quadratic interpolation into the process?
To everybody:
BG
Moritz
Thanks for sharing the screenshot. If the input contains any nominal ID-like attribute, the H2O model will fail with errors. That's why I dropped the ID-like columns with "Set Role" before the imputation. To troubleshoot, you can use invert selection to exclude some attributes. A quick check is to right-click the input port of the Deep Learning learner and view the example set. You can also share your data and process here so we can investigate further.
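If it helps with the troubleshooting, here is a rough way to spot ID-like attributes outside RapidMiner. This is only a sketch in plain Python: the helper name `id_like_columns`, the 0.9 threshold and the four toy rows are my own assumptions, not part of the process above. It flags nominal columns where almost every value is distinct.

```python
def id_like_columns(rows, threshold=0.9):
    """Return names of nominal (string) columns whose values are almost all distinct.

    `rows` is a list of dicts with identical keys; None marks a missing value.
    `threshold` is an assumed cut-off for the distinct-value ratio.
    """
    flagged = []
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        # Only nominal (string) columns are checked; numeric ones are skipped.
        if not values or not all(isinstance(v, str) for v in values):
            continue
        if len(set(values)) / len(values) >= threshold:
            flagged.append(name)
    return flagged


# Toy rows for illustration only (not the real Titanic sample).
rows = [
    {"Name": "Allen", "Ticket Number": "24160", "Sex": "female", "Age": 29.0},
    {"Name": "Allison", "Ticket Number": "113781", "Sex": "male", "Age": None},
    {"Name": "Anderson", "Ticket Number": "19952", "Sex": "male", "Age": 48.0},
    {"Name": "Andrews", "Ticket Number": "13502", "Sex": "female", "Age": 63.0},
]

print(id_like_columns(rows))
```

Columns flagged this way are the candidates to exclude from the learner's input before running the imputation.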
YY
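On the linear interpolation question: the same hold-out idea as in the process above (hide values you actually know, impute them, compare with the ground truth) also works for interpolation. Below is a minimal sketch in plain Python, not a RapidMiner process. All function names, the two toy columns and the hidden positions are made up for illustration, and it assumes the rows have a meaningful order (e.g. a time series), since interpolation depends on neighbouring rows.

```python
import math


def linear_interpolate(values):
    """Fill None entries by linear interpolation between the nearest known
    neighbours; leading/trailing gaps take the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


def mean_impute(values):
    """Fill None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    m = sum(known) / len(known)
    return [m if v is None else v for v in values]


def rmse_on_hidden(values, hidden, impute):
    """Hide the known values at the `hidden` indices, impute, return the RMSE."""
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    filled = impute(masked)
    return math.sqrt(sum((filled[i] - values[i]) ** 2 for i in hidden) / len(hidden))


def best_method(columns, hidden, methods):
    """For each attribute, return the name of the method with the lowest RMSE."""
    best = {}
    for name, values in columns.items():
        scores = {m: rmse_on_hidden(values, hidden, f) for m, f in methods.items()}
        best[name] = min(scores, key=scores.get)
    return best


# Toy, fully observed columns; indices 1 and 4 are hidden for validation.
columns = {
    "trend": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "flat": [5.0, 5.0, 1.0, 9.0, 5.0, 5.0],
}
methods = {"linear": linear_interpolate, "mean": mean_impute}
chosen = best_method(columns, hidden={1, 4}, methods=methods)
print(chosen)
```

Quadratic or cubic interpolation, or a kNN imputer, would simply be further entries in `methods`. The selection is per attribute, so different columns can end up with different imputation methods.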