How to use Polynomial Regression in RapidMiner correctly
Hello, everyone. This is my first forum post, and I have a question about polynomial regression in RapidMiner.
The original data is x: 4194.06, 3466.45, 2070.08, 874.98, with the corresponding y: 91540.07, 109460.36, 120338.64, 102182.19.
In the first process, the first result expression is obtained with the Polynomial Regression operator.
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.6.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="85">
<parameter key="excel_file" value="C:\Users\1\Desktop\question data.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="x.true.real.attribute"/>
<parameter key="1" value="y.true.real.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="set_role" compatibility="9.6.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="85">
<parameter key="attribute_name" value="y"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles">
<parameter key="x" value="regular"/>
</list>
</operator>
<operator activated="true" class="polynomial_regression" compatibility="9.6.000" expanded="true" height="82" name="Polynomial Regression" width="90" x="313" y="85">
<parameter key="max_iterations" value="5000"/>
<parameter key="replication_factor" value="2"/>
<parameter key="max_degree" value="2"/>
<parameter key="min_coefficient" value="-100.0"/>
<parameter key="max_coefficient" value="100.0"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Polynomial Regression" to_port="training set"/>
<connect from_op="Polynomial Regression" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The second process starts from the same original data, creates a new attribute z = x^2, and uses the Linear Regression operator to obtain the second result expression.
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.6.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="85">
<parameter key="excel_file" value="C:\Users\1\Desktop\question data.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="x.true.real.attribute"/>
<parameter key="1" value="y.true.real.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="85">
<list key="function_descriptions">
<parameter key="z" value="x*x"/>
</list>
<parameter key="keep_all" value="true"/>
</operator>
<operator activated="false" class="rename" compatibility="9.6.000" expanded="true" height="82" name="Rename" width="90" x="246" y="238">
<parameter key="old_name" value="x"/>
<parameter key="new_name" value="x^2"/>
<list key="rename_additional_attributes"/>
</operator>
<operator activated="true" class="set_role" compatibility="9.6.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="85">
<parameter key="attribute_name" value="y"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles">
<parameter key="x" value="regular"/>
</list>
</operator>
<operator activated="true" class="linear_regression" compatibility="9.6.000" expanded="true" height="103" name="Linear Regression" width="90" x="514" y="85">
<parameter key="feature_selection" value="none"/>
<parameter key="alpha" value="0.05"/>
<parameter key="max_iterations" value="10"/>
<parameter key="forward_alpha" value="0.05"/>
<parameter key="backward_alpha" value="0.05"/>
<parameter key="eliminate_colinear_features" value="false"/>
<parameter key="min_tolerance" value="0.05"/>
<parameter key="use_bias" value="true"/>
<parameter key="ridge" value="1.0E-8"/>
</operator>
<operator activated="false" class="polynomial_regression" compatibility="9.6.000" expanded="true" height="82" name="Polynomial Regression" width="90" x="581" y="238">
<parameter key="max_iterations" value="5000"/>
<parameter key="replication_factor" value="2"/>
<parameter key="max_degree" value="2"/>
<parameter key="min_coefficient" value="-100.0"/>
<parameter key="max_coefficient" value="100.0"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Linear Regression" to_port="training set"/>
<connect from_op="Linear Regression" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I would like to ask why the results of the two processes are not the same. The original data shows a quadratic (nonlinear) relationship, so why can't Polynomial Regression produce the quadratic expression?
Thank you very much!
Answers
Thanks for sharing the data and process.
With only four (4) examples, training a polynomial regression may fail to produce a usable model. So I filled the gaps by interpolation to add more data. Also, the Polynomial Regression operator will not perform well without normalization...
Process attached here for your reference.
These two models are close, but I cannot guarantee that Polynomial Regression will output similar coefficients without normalization.
I would strongly suggest using GLM, either with the new attribute created manually or with attributes from Auto Feature Engineer.
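To illustrate the normalization point outside RapidMiner, here is a minimal sketch in Python with NumPy (not a RapidMiner process). It fits a degree-2 polynomial by ordinary least squares to the four posted points, once on the raw values and once after z-score normalization, and checks the coefficients against the range [-100, 100] taken from the min_coefficient and max_coefficient parameters in the posted process. Whether those bounds are what actually limits the operator on this data is an assumption, not something confirmed here; the helper name `zscore` is only for illustration.

```python
import numpy as np

# The four (x, y) points from the question.
x = np.array([4194.06, 3466.45, 2070.08, 874.98])
y = np.array([91540.07, 109460.36, 120338.64, 102182.19])


def zscore(v):
    """Standardize a vector to zero mean and unit (population) standard deviation."""
    return (v - v.mean()) / v.std()


# Degree-2 least-squares fits; np.polyfit returns the highest power first.
raw_coeffs = np.polyfit(x, y, deg=2)
norm_coeffs = np.polyfit(zscore(x), zscore(y), deg=2)

# Bounds copied from min_coefficient / max_coefficient in the posted process.
# Assumption: the operator only searches for coefficients inside this range.
lo, hi = -100.0, 100.0
raw_in_bounds = bool(np.all((raw_coeffs >= lo) & (raw_coeffs <= hi)))
norm_in_bounds = bool(np.all((norm_coeffs >= lo) & (norm_coeffs <= hi)))

print("raw coefficients (x^2, x, 1):       ", raw_coeffs)
print("all within [-100, 100]?             ", raw_in_bounds)
print("normalized coefficients (x^2, x, 1):", norm_coeffs)
print("all within [-100, 100]?             ", norm_in_bounds)
```

If the raw fit needs a coefficient outside the configured range, a bounded search cannot reach it, which would be consistent with the advice to normalize first or to widen the coefficient range.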
Happy Rapid-Mining and Stay Healthy!
YY
Sorry in advance: I don't know how to use the features of this forum, which is why it took me so long to reply.
First of all, thank you for your answer. If I understand your description correctly, the poor result is caused by having too little data and by the data not being normalized? But these four samples are real data, and I need to build a univariate quadratic polynomial from these four points. Because a nonlinear equation can be converted into a linear one, I used z in place of x^2 and obtained the linear regression equation. But why does Polynomial Regression not produce it? How do you explain that? Does the Polynomial Regression operator have any limitations?
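The substitution described above (z in place of x^2) can be checked outside RapidMiner. The following is a minimal sketch in Python with NumPy, not a RapidMiner process: it fits y on [1, x, z] by ordinary least squares and compares the result with a direct degree-2 polynomial fit of the same four points. The variable names are illustrative only.

```python
import numpy as np

# The four (x, y) points from the question.
x = np.array([4194.06, 3466.45, 2070.08, 874.98])
y = np.array([91540.07, 109460.36, 120338.64, 102182.19])

# Mirror of the second process: add z = x^2, then fit y = b0 + b1*x + b2*z
# by ordinary least squares.
z = x * x
design = np.column_stack([np.ones_like(x), x, z])
b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]

# Direct degree-2 least-squares fit for comparison (highest power first).
p2, p1, p0 = np.polyfit(x, y, deg=2)

# Compare the two models through their predictions on the training points.
linear_pred = design @ np.array([b0, b1, b2])
poly_pred = np.polyval([p2, p1, p0], x)

print("linear regression on [x, z]:", b0, b1, b2)
print("degree-2 polynomial fit:    ", p0, p1, p2)
print("max prediction difference:  ", np.max(np.abs(linear_pred - poly_pred)))
```

In exact least squares the two formulations describe the same model, so a difference between the two RapidMiner results would point to how the Polynomial Regression operator searches for coefficients (for example its iteration limit and coefficient bounds) rather than to the z substitution itself. That interpretation is an assumption drawn from the operator parameters visible in the posted process.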