Clustering
A RapidMiner user wants to know the answer to this question: "Hey, I am looking to run a clustering model but all my data is qualitative. I was wondering if RapidMiner supports clustering algorithms for qualitative data?"
If all the attributes are nominal (qualitative), the k-means operator works with the measure types set to NominalMeasures and the nominal measure set to NominalDistance.
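To make the idea of a distance on nominal data concrete, here is a minimal sketch of a simple matching distance, which counts the attributes on which two records disagree. This illustrates the general technique only; it is an assumption that RapidMiner's NominalDistance behaves along these lines, and the function name and the example records are hypothetical.

```python
from typing import Sequence


def nominal_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two records disagree (simple matching)."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical records with three nominal attributes each.
r1 = ("engineering", "low", "europe")
r2 = ("engineering", "high", "europe")
r3 = ("medicine", "high", "asia")

print(nominal_distance(r1, r2))  # the records differ on one attribute
print(nominal_distance(r1, r3))  # the records differ on all three attributes
```

With a measure like this, two records are close when they share most of their category values, so no numeric encoding is needed before clustering.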
Hope this helps
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Ingo
Did you try adding a "Text to Nominal" operator before the clustering algorithm?
I think that will do it.
Varun
I see that it is working fine except for the cluster visualization part, because of some missing values in the centroid table. I am not sure about the cause; maybe my friend @lionelderkrikor can help with this.
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Data para clusterizar" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Local Repository/data/Data para clusterizar"/>
</operator>
<operator activated="true" class="text_to_nominal" compatibility="9.3.001" expanded="true" height="82" name="Text to Nominal" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="text"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="text"/>
<parameter key="block_type" value="value_matrix"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:k_means" compatibility="9.3.001" expanded="true" height="82" name="Clustering" width="90" x="514" y="34">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="false"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="6"/>
<parameter key="max_runs" value="10"/>
<parameter key="determine_good_start_values" value="true"/>
<parameter key="measure_types" value="MixedMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="136"/>
<operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" width="90" x="782" y="34"/>
<operator activated="true" class="sort" compatibility="9.3.001" expanded="true" height="82" name="Sort" width="90" x="514" y="187">
<parameter key="attribute_name" value="cluster"/>
<parameter key="sorting_direction" value="increasing"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="238">
<list key="function_descriptions">
<parameter key="cluster_label" value="cluster"/>
</list>
<parameter key="keep_all" value="true"/>
</operator>
<operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="849" y="238">
<parameter key="attribute_name" value="cluster_label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="187">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="cluster_label|id|idCarrera|idCat_inversion|idDestinoDeseado"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="1117" y="136">
<parameter key="criterion" value="gain_ratio"/>
<parameter key="maximal_depth" value="20"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>
<parameter key="apply_prepruning" value="false"/>
<parameter key="minimal_gain" value="0.01"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<connect from_op="Retrieve Data para clusterizar" from_port="output" to_op="Text to Nominal" to_port="example set input"/>
<connect from_op="Text to Nominal" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Sort" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Cluster Model Visualizer" to_port="clustered data"/>
<connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 4"/>
<connect from_op="Cluster Model Visualizer" from_port="model output" to_port="result 3"/>
<connect from_op="Sort" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Set Role" from_port="original" to_port="result 2"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
Varun
Yes, indeed, there is something weird with this process, and it seems to be linked to the fact that the features are nominal.
Honestly, I don't know how RapidMiner handles nominal features internally. So, to avoid this bug, I used the Nominal to Numerical operator (dummy coding). I also updated the Select Attributes operator to include the newly generated dummy variables, and now the process and the visualizations are working.
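For readers unfamiliar with dummy coding, the following is a minimal sketch of the transformation in plain Python. It is not RapidMiner's implementation; the function name, the `column=value` naming of the generated columns, and the sample values are assumptions made for illustration (only the attribute name `idCarrera` comes from the process above).

```python
from typing import Dict, List


def dummy_code(rows: List[Dict[str, str]], column: str) -> List[Dict[str, object]]:
    """Replace one nominal column with 0/1 indicator columns, one per observed value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row: Dict[str, object] = {k: v for k, v in row.items() if k != column}
        # Add one indicator column per category (the naming scheme is arbitrary).
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data.
rows = [
    {"id": "1", "idCarrera": "A"},
    {"id": "2", "idCarrera": "B"},
    {"id": "3", "idCarrera": "A"},
]
encoded = dummy_code(rows, "idCarrera")
print(encoded[0])
```

After this step every generated column is numeric, which is why any later operator that selects attributes by name has to list the new dummy columns instead of the original nominal one.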
The process:
Regards,
Lionel
This thread is very interesting because it opens up a debate:
Firstly, for distance-based algorithms (like k-means), is it always relevant to one-hot encode the features of type "category" in RapidMiner?
I ask because, although RapidMiner can handle features of type "category" directly, Auto Model one-hot encodes such features in its pre-processing step.
Looking further at this pre-processing step in Auto Model, if a feature of type "category" has more than 10 values, it is removed before modelling.
After some searching, I found that this corresponds to the "Max nominal values" parameter (10 by default) of the Remove Low Quality function under CLEANSE in Turbo Prep.
My question is: is there any reason for this hard-coded value of 10 in Auto Model?
Intuitively, I would say that this parameter should be related to the size of the initial dataset rather than hard-coded (11 possible values in a 10M-row dataset do not mean the same thing as 11 possible values in a 100-row dataset), but maybe there are other reasons (computation time, curse of dimensionality...).
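The difference between a fixed and a size-dependent threshold can be sketched as follows. This is only an illustration of the argument above, not how Auto Model or Turbo Prep is implemented: the function names, the `rows_per_value` rule, and the generated data are hypothetical, and only the default of 10 and the attribute names come from this thread.

```python
from typing import Dict, List


def high_cardinality_columns(
    columns: Dict[str, List[str]], max_nominal_values: int = 10
) -> List[str]:
    """Return the names of columns with more distinct values than the threshold."""
    return [
        name
        for name, values in columns.items()
        if len(set(values)) > max_nominal_values
    ]


def relative_threshold(n_rows: int, rows_per_value: int = 20, floor: int = 10) -> int:
    """Hypothetical alternative: let the threshold grow with the number of rows."""
    return max(floor, n_rows // rows_per_value)


# Hypothetical dataset of 1000 rows.
columns = {
    "idCarrera": [f"c{i % 11}" for i in range(1000)],       # 11 distinct values
    "idCat_inversion": [f"k{i % 4}" for i in range(1000)],  # 4 distinct values
}

# Fixed threshold of 10: the 11-valued column is flagged for removal.
print(high_cardinality_columns(columns))
# Size-dependent threshold (50 for 1000 rows): nothing is flagged.
print(high_cardinality_columns(columns, relative_threshold(1000)))
```

With the fixed rule, a column with 11 values is dropped regardless of how many rows support each value; with the size-dependent rule, it is kept once the dataset is large enough.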
Moreover, I want to mention that with this strategy, in some cases (for example, @WalterRioja's current dataset), Auto Model shows all your features with "green" status (and thus, in theory, used for modelling), but in reality only a subset of these features is actually used for modelling (and thus only a subset appears in the built model). I think that may surprise the user...
Once again, I just want to open the debate, in the spirit of improving RapidMiner (and, more generally, data science) knowledge, and to try to make RapidMiner even better than it already is...
To conclude, have a nice day (or night ... )
Regards,
Lionel
Ingo
A second question: why do I see negative values for some clusters when I run Auto Model without making any changes?
Thank you all
Ingo
Is this supported in Auto Model? I ask because when I ran my data through Auto Model, the tree was built on the negative values I mentioned before.
Thanks!
Ingo