PCA (Kernel) in RapidMiner vs Python: different results
Hi,
Apologies in advance if I made a mistake, but I found significant differences between RapidMiner and Python in the kpc_i values computed by PCA (Kernel).
1. But first, why does PCA (Kernel), unlike the "classic" PCA operator, not offer:
- the dimensionality reduction parameter in its parameters?
- the eigenvector and eigenvalue tables (with standard deviation, proportion of variance, etc.) in its results?
How can this operator be used in practice?
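For reference, here is a minimal sketch of how the missing "proportion of variance" table and a variance-threshold cut-off could be rebuilt by hand once the kernel PCA eigenvalues are available as an array. The function names and the eigenvalues below are hypothetical and only illustrate the arithmetic; the proportions are relative to the components that were retained.

```python
import numpy as np

def variance_table(eigenvalues):
    """Return (proportion, cumulative proportion) of variance per component."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    ev = np.clip(ev, 0.0, None)  # guard against tiny negative round-off values
    proportion = ev / ev.sum()
    return proportion, np.cumsum(proportion)

def components_for_threshold(eigenvalues, threshold=0.95):
    """Smallest number of components whose cumulative variance reaches threshold."""
    _, cumulative = variance_table(eigenvalues)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Hypothetical eigenvalues, for illustration only (not taken from the cereals data).
eigenvalues = [4.0, 3.0, 2.0, 1.0]
proportion, cumulative = variance_table(eigenvalues)
k = components_for_threshold(eigenvalues, threshold=0.85)
print(proportion, cumulative, k)
```

This mirrors what the classic PCA operator's variance threshold does: keep the leading components until the cumulative proportion reaches the chosen value.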
2. As mentioned above, the kpc_i values differ by several orders of magnitude (I used kernel = "polynomial" and degree = "3"):
RM: kpc_i ~10e12 / Python: kpc_i ~10e5
After some research, it seems that kpc_i = eigenvectors x sqrt(eigenvalues). Perhaps RM does not take the sqrt into account.
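The relation kpc_i = eigenvectors x sqrt(eigenvalues) can be checked with a small NumPy-only sketch. It is not the RapidMiner or scikit-learn implementation: the helper names and the random data are hypothetical, and the kernel form (gamma * x·y + coef0)^degree with gamma = 1 / number of attributes and coef0 = 1 is an assumption chosen to match what I understand to be scikit-learn's defaults for the "poly" kernel.

```python
import numpy as np

def centered_poly_kernel(X, degree=3, gamma=None, coef0=1.0):
    """Polynomial kernel matrix, centered in feature space."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if gamma is None:
        gamma = 1.0 / p  # assumed default, see the note above
    K = (gamma * (X @ X.T) + coef0) ** degree
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

def kernel_pca_scores(K_centered, n_components):
    """Top components: scores = unit-norm eigenvectors * sqrt(eigenvalues)."""
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]    # largest first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = eigenvectors * np.sqrt(eigenvalues)
    return scores, eigenvalues, eigenvectors

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))  # hypothetical data, not the cereals set
K = centered_poly_kernel(X)
scores, lam, alpha = kernel_pca_scores(K, n_components=3)

# Projecting the training points onto the normalized feature-space axes
# gives K @ (alpha / sqrt(lam)), which equals alpha * sqrt(lam).
projected = K @ (alpha / np.sqrt(lam))
print(np.allclose(scores, projected))
```

Since K·alpha = lambda·alpha, a tool that returned K·alpha without dividing by sqrt(lambda) would produce values larger by a factor of sqrt(lambda) per component. That is one possible explanation for a gap of this kind, not a confirmed diagnosis of the operator.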
You can find the process here, and the dataset in the attached file:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\12_Feature_12.2_cereals-PCA.xlsx"/>
<parameter key="imported_cell_range" value="A1:P78"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="name.true.polynominal.attribute"/>
<parameter key="1" value="mfr.true.polynominal.attribute"/>
<parameter key="2" value="type.true.polynominal.attribute"/>
<parameter key="3" value="calories.true.integer.attribute"/>
<parameter key="4" value="protein.true.integer.attribute"/>
<parameter key="5" value="fat.true.integer.attribute"/>
<parameter key="6" value="sodium.true.integer.attribute"/>
<parameter key="7" value="fiber.true.numeric.attribute"/>
<parameter key="8" value="carbo.true.numeric.attribute"/>
<parameter key="9" value="sugars.true.integer.attribute"/>
<parameter key="10" value="potass.true.integer.attribute"/>
<parameter key="11" value="vitamins.true.integer.attribute"/>
<parameter key="12" value="shelf.true.integer.attribute"/>
<parameter key="13" value="weight.true.numeric.attribute"/>
<parameter key="14" value="cups.true.numeric.attribute"/>
<parameter key="15" value="rating.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
<parameter key="attribute_name" value="name"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles">
<parameter key="mfr" value="id"/>
<parameter key="type" value="id"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="name|mfr|type"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="principal_component_analysis_kernel" compatibility="8.0.001" expanded="true" height="103" name="PCA (Kernel)" width="90" x="514" y="34">
<parameter key="kernel_type" value="polynomial"/>
</operator>
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel (2)" width="90" x="112" y="391">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\12_Feature_12.2_cereals-PCA.xlsx"/>
<parameter key="imported_cell_range" value="A1:P78"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="name.true.polynominal.attribute"/>
<parameter key="1" value="mfr.true.polynominal.attribute"/>
<parameter key="2" value="type.true.polynominal.attribute"/>
<parameter key="3" value="calories.true.integer.attribute"/>
<parameter key="4" value="protein.true.integer.attribute"/>
<parameter key="5" value="fat.true.integer.attribute"/>
<parameter key="6" value="sodium.true.integer.attribute"/>
<parameter key="7" value="fiber.true.numeric.attribute"/>
<parameter key="8" value="carbo.true.numeric.attribute"/>
<parameter key="9" value="sugars.true.integer.attribute"/>
<parameter key="10" value="potass.true.integer.attribute"/>
<parameter key="11" value="vitamins.true.integer.attribute"/>
<parameter key="12" value="shelf.true.integer.attribute"/>
<parameter key="13" value="weight.true.numeric.attribute"/>
<parameter key="14" value="cups.true.numeric.attribute"/>
<parameter key="15" value="rating.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="124" name="PCA Kernel Python" width="90" x="313" y="391">
<parameter key="script" value="import pandas as pd&#10;from sklearn.decomposition import KernelPCA&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    X = data.iloc[:,3:]&#10;    attribute = list(X)&#10;    kpca = KernelPCA(n_components = 13, kernel = 'poly',degree = 3)&#10;    #Calculation of kpca_i&#10;    k_PCA = kpca.fit_transform(X)&#10;    #Calculation of eigenvalues&#10;    eigen_values = kpca.lambdas_&#10;    #Calculation of eigenvectors&#10;    eigen_vectors = kpca.alphas_&#10;    #Writing of results in datatables&#10;    K_PCA = pd.DataFrame(data = k_PCA, columns = ['kpc_1','kpc_2','kpc_3','kpc_4','kpc_5','kpc_6','kpc_7','kpc_8','kpc_9','kpc_10','kpc_11','kpc_12','kpc_13'])&#10;    components = pd.DataFrame(data = ['PC 1','PC 2','PC 3','PC 4','PC 5','PC 6','PC 7','PC 8','PC 9','PC 10','PC 11','PC 12','PC 13'],columns = ['Components'])&#10;    eigenvalues = pd.DataFrame(data = eigen_values, columns = ['Eigenvalues'])&#10;    components = components.join(eigenvalues)&#10;    attributes = pd.DataFrame(data = attribute,columns = ['Attribute'])&#10;    eigenvectors = pd.DataFrame(data = eigen_vectors, columns = ['PC 1','PC 2','PC 3','PC 4','PC 5','PC 6','PC 7','PC 8','PC 9','PC 10','PC 11','PC 12','PC 13'])&#10;    attributes = attributes.join(eigenvectors)&#10;    # connect 2 output ports to see the results&#10;    return K_PCA,components,attributes"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="PCA (Kernel)" to_port="example set input"/>
<connect from_op="PCA (Kernel)" from_port="example set output" to_port="result 1"/>
<connect from_op="PCA (Kernel)" from_port="preprocessing model" to_port="result 5"/>
<connect from_op="Read Excel (2)" from_port="output" to_op="PCA Kernel Python" to_port="input 1"/>
<connect from_op="PCA Kernel Python" from_port="output 1" to_port="result 2"/>
<connect from_op="PCA Kernel Python" from_port="output 2" to_port="result 3"/>
<connect from_op="PCA Kernel Python" from_port="output 3" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>
Can you enlighten me on these points?
Thank you,
Best regards,
Lionel
Comments
Hi Lionel,
I checked your results and I've noticed that with the Kernel PCA operator the number of principal components is 77 (equal to the number of examples)! I also tried the tutorial process on Kernel PCA and it goes from 5 attributes to 200 PCs (again 200 examples). Furthermore, all PCs have the same variance (I calculated it with the Covariance Matrix operator). This is surely incorrect.
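One way to reproduce this kind of check outside RapidMiner is sketched below with NumPy only. The data are random and hypothetical; only the sizes (200 examples, 5 attributes) mirror the tutorial case, and the degree-3 polynomial kernel with gamma = 1 / number of attributes and a shift of 1 is an assumption. The kernel matrix is n x n, so a decomposition can formally return up to n components; what should distinguish them is their variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examples, n_attributes = 200, 5
X = rng.normal(size=(n_examples, n_attributes))  # hypothetical data

# Degree-3 polynomial kernel (assumed form), centered in feature space.
K = (X @ X.T / n_attributes + 1.0) ** 3
J = np.eye(n_examples) - 1.0 / n_examples  # centering matrix I - (1/n) 1 1^T
Kc = J @ K @ J

eigenvalues, eigenvectors = np.linalg.eigh(Kc)  # ascending order
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# Keep only components whose eigenvalue is not numerically zero.
keep = eigenvalues > 1e-10 * eigenvalues[0]
scores = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])

# Each score column has variance eigenvalue / n, so the variances decrease.
variances = scores.var(axis=0)
print(int(keep.sum()), variances[:3], variances[-3:])
```

For this particular kernel the feature space has only 56 monomials of degree at most 3 in 5 attributes, so far fewer than 200 components carry any variance, and the ones that do have clearly different variances. Other kernels and data can behave differently.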
It pains me to say this, but I would use the Python script for your task.
Best,
Sebastian
Hi Sebastian,
Thank you for your feedback and your analysis.
Best regards,
Lionel.
NB: I suppose there will be a fix in an upcoming release of RapidMiner?
Moving to Product Feedback.
Scott
I have already forwarded the problem to development. Can you confirm my observations on your end?
I noticed the same issue. The result of PCA (Kernel) always has a number of principal components equal to the number of records in the example set. This is certainly a bug. Please let me know when this gets fixed.