Clustering

MarlaBot · June 2019

A RapidMiner user wants to know the answer to this question: "Hey, I am looking to run a clustering model but all my data is qualitative. I was wondering if RapidMiner supports clustering algorithms for qualitative data?"

varunm1 · June 2019

Hello @MarlaBot

If all the attributes are nominal (qualitative), the k-Means operator works when you set the measure types to NominalMeasures and the nominal measure to NominalDistance.

Hope this helps
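
To make the idea of a nominal distance concrete, here is a minimal Python sketch of a simple-matching distance: it counts the attributes on which two rows disagree. This illustrates the concept only; it is an assumption that RapidMiner's NominalDistance behaves the same way, and the rows and values below are hypothetical.

```python
from typing import Sequence


def nominal_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two rows disagree (simple matching)."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical rows: (career, investment category, desired destination)
row_a = ("Economía", "Inversión alta", "Canadá")
row_b = ("Economía", "Inversión básica", "Canadá")
row_c = ("Derecho", "Inversión básica", "Perú")

print(nominal_distance(row_a, row_b))  # differs in one attribute
print(nominal_distance(row_a, row_c))  # differs in all three attributes
```

No numeric encoding is needed for this kind of measure, because it only compares values for equality.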

IngoRM · June 2019

Yes, many of our clustering algorithms do. Just make sure you use a distance measure that supports nominal (qualitative) column types. This is the default for most of them, and if you use clustering in Auto Model it will take care of this for you. In addition, there are data transformations you can apply to convert your data into numerical formats before you use any of the clustering algorithms.

Hope this helps,
Ingo
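
One common transformation of this kind is dummy (one-hot) coding. The Python sketch below shows the idea outside RapidMiner; the function name, the column names, and the "column=value" naming are illustrative assumptions, not RapidMiner's implementation.

```python
def dummy_code(rows, columns):
    """One-hot encode the given nominal columns of a list of dict rows."""
    # Collect the distinct values of each column in a stable (sorted) order.
    levels = {c: sorted({r[c] for r in rows}) for c in columns}
    encoded = []
    for r in rows:
        out = {}
        for c in columns:
            for v in levels[c]:
                # One 0/1 column per (column, value) pair.
                out[f"{c}={v}"] = 1 if r[c] == v else 0
        encoded.append(out)
    return encoded


# Hypothetical data with two nominal columns.
rows = [
    {"career": "Law", "destination": "Peru"},
    {"career": "Economics", "destination": "Peru"},
]
for row in dummy_code(rows, ["career", "destination"]):
    print(row)
```

Each distinct value becomes its own 0/1 column, so the resulting table can be fed to any numerical distance measure.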

WalterRioja · June 2019

@IngoRM when I run Auto Model, it only considers the quantitative attribute for clustering. What would be your suggestion? Thanks in advance!

WalterRioja · June 2019

@varunm1 would you mind helping me a bit more? I tried it, but it shows a message about a non-nominal attribute even though the attribute is of type text. Any suggestions?

varunm1 · June 2019

Hello @WalterRioja

Did you try adding the "Text to Nominal" operator before the clustering algorithm?

I think that will do it.

WalterRioja · June 2019

@varunm1 it's still not working. Would you mind running it the way you explained? I can upload my data here.

varunm1 · June 2019

Sure, please provide your data and the XML process (View --> Show Panel --> XML).

WalterRioja · June 2019

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">

</context>

</operator>

</operator>

</operator>

</operator>

</list>

</operator>

</operator>

</operator>

</process>

</operator>

</process>

varunm1 · June 2019

Hello @WalterRioja

I see that it is working fine except for the cluster visualization part, which fails because of some missing values in the centroid table. I am not sure about the cause; maybe my friend @lionelderkrikor can help with this.

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Data para clusterizar" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Local Repository/data/Data para clusterizar"/>
</operator>
<operator activated="true" class="text_to_nominal" compatibility="9.3.001" expanded="true" height="82" name="Text to Nominal" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="text"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="text"/>
<parameter key="block_type" value="value_matrix"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="concurrency:k_means" compatibility="9.3.001" expanded="true" height="82" name="Clustering" width="90" x="514" y="34">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="false"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="6"/>
<parameter key="max_runs" value="10"/>
<parameter key="determine_good_start_values" value="true"/>
<parameter key="measure_types" value="MixedMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="136"/>
<operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" width="90" x="782" y="34"/>
<operator activated="true" class="sort" compatibility="9.3.001" expanded="true" height="82" name="Sort" width="90" x="514" y="187">
<parameter key="attribute_name" value="cluster"/>
<parameter key="sorting_direction" value="increasing"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="238">
<list key="function_descriptions">
<parameter key="cluster_label" value="cluster"/>
</list>
<parameter key="keep_all" value="true"/>
</operator>
<operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="849" y="238">
<parameter key="attribute_name" value="cluster_label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="187">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="cluster_label|id|idCarrera|idCat_inversion|idDestinoDeseado"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="1117" y="136">
<parameter key="criterion" value="gain_ratio"/>
<parameter key="maximal_depth" value="20"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>
<parameter key="apply_prepruning" value="false"/>
<parameter key="minimal_gain" value="0.01"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<connect from_op="Retrieve Data para clusterizar" from_port="output" to_op="Text to Nominal" to_port="example set input"/>
<connect from_op="Text to Nominal" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Sort" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Cluster Model Visualizer" to_port="clustered data"/>
<connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 4"/>
<connect from_op="Cluster Model Visualizer" from_port="model output" to_port="result 3"/>
<connect from_op="Sort" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Set Role" from_port="original" to_port="result 2"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>

WalterRioja · June 2019

@lionelderkrikor please I need your help, thanks!

lionelderkrikor · June 2019

Hi @varunm1, hi @WalterRioja,

Yes, indeed there is something odd with this process, and it seems to be linked to the fact that the features are nominal.
Honestly, I don't know how RapidMiner handles nominal features internally. To avoid this bug, I used the Nominal to Numerical operator (dummy coding). I also updated the Select Attributes operator with the newly generated dummy variables, and now the process and the visualizations work.
The process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.3.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <parameter key="excel_file" value="C:\Users\Lionel\Downloads\Data para clusterizar.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="idCat_inversion.true.polynominal.attribute"/>
          <parameter key="1" value="idCarrera.true.polynominal.attribute"/>
          <parameter key="2" value="idDestinoDeseado.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="text_to_nominal" compatibility="9.3.001" expanded="true" height="82" name="Text to Nominal" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="text"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="text"/>
        <parameter key="block_type" value="value_matrix"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="9.3.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="313" y="34">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value="idCat_inversion"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="coding_type" value="dummy coding"/>
        <parameter key="use_comparison_groups" value="false"/>
        <list key="comparison_groups"/>
        <parameter key="unexpected_value_handling" value="all 0 and warning"/>
        <parameter key="use_underscore_in_name" value="false"/>
      </operator>
      <operator activated="true" breakpoints="after" class="concurrency:k_means" compatibility="9.3.001" expanded="true" height="82" name="Clustering" width="90" x="514" y="34">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="6"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="determine_good_start_values" value="true"/>
        <parameter key="measure_types" value="MixedMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="238"/>
      <operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" width="90" x="782" y="34"/>
      <operator activated="true" class="sort" compatibility="9.3.001" expanded="true" height="82" name="Sort" width="90" x="514" y="238">
        <parameter key="attribute_name" value="cluster"/>
        <parameter key="sorting_direction" value="increasing"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="238">
        <list key="function_descriptions">
          <parameter key="cluster_label" value="cluster"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="849" y="238">
        <parameter key="attribute_name" value="cluster_label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="983" y="187">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="cluster_label|id|idCarrera = Administración|idCarrera = Antropología|idCarrera = Arqueología|idCarrera = ArquitecturayUrbanismo|idCarrera = ArtesEscénicas|idCarrera = ArteyDiseño|idCarrera = Biología|idCarrera = CienciasdelaComunicación|idCarrera = CienciasdelaSalud|idCarrera = CienciasSociales|idCarrera = Contabilidad|idCarrera = Derecho|idCarrera = DiseñoGráfico|idCarrera = Economía|idCarrera = Educación|idCarrera = Enfermería|idCarrera = Finanzas|idCarrera = GestiónyAltaDirección|idCarrera = Hoteleríayturismo|idCarrera = Idiomas|idCarrera = IngenieríaAmbiental|idCarrera = IngenieríaCivil|idCarrera = IngenieríadeSistemas|idCarrera = IngenieríaIndustrial|idCarrera = IngenieríaMecanica|idCarrera = IngenieríaQuimica|idCarrera = Marketing|idCarrera = MedicinaHumana|idCarrera = MedicinaVeterinaria|idCarrera = Música|idCarrera = NegociosInternacionales|idCarrera = Otros|idCarrera = Psicología|idCarrera = Publicidadyafines|idCarrera = TrabajoSocial|idCat_inversion = Inversiónalta|idCat_inversion = Inversiónbásica|idCat_inversion = Inversiónpromedio|idDestinoDeseado = Afganistán|idDestinoDeseado = Albania|idDestinoDeseado = Alemania|idDestinoDeseado = Argelia|idDestinoDeseado = Argentina|idDestinoDeseado = Armenia|idDestinoDeseado = Australia|idDestinoDeseado = Austria|idDestinoDeseado = Azerbaiyán|idDestinoDeseado = Bahrein|idDestinoDeseado = Bangladesh|idDestinoDeseado = Benin|idDestinoDeseado = Bielorrusia|idDestinoDeseado = Bolivia|idDestinoDeseado = BosniaHerzegovina|idDestinoDeseado = Botsuana|idDestinoDeseado = Brasil|idDestinoDeseado = Bulgaria|idDestinoDeseado = BurkinaFaso|idDestinoDeseado = Bélgica|idDestinoDeseado = CaboVerde|idDestinoDeseado = Camboya|idDestinoDeseado = Camerún|idDestinoDeseado = Canadá|idDestinoDeseado = Chile|idDestinoDeseado = ChinaContinental|idDestinoDeseado = Colombia|idDestinoDeseado = Corea|idDestinoDeseado = CostadeMarfil|idDestinoDeseado = CostaRica|idDestinoDeseado = Croacia|idDestinoDeseado = Dinamarca|idDestinoDeseado = EAU|idDestinoDeseado = Ecuador|idDestinoDeseado = Egipto|idDestinoDeseado = ElSalvador|idDestinoDeseado = Eslovaquia|idDestinoDeseado = Eslovenia|idDestinoDeseado = España|idDestinoDeseado = EstadosUnidos|idDestinoDeseado = Estonia|idDestinoDeseado = Etiopía|idDestinoDeseado = Fiji|idDestinoDeseado = Filipinas|idDestinoDeseado = Finlandia|idDestinoDeseado = Francia|idDestinoDeseado = Gabón|idDestinoDeseado = Georgia|idDestinoDeseado = Ghana|idDestinoDeseado = Grecia|idDestinoDeseado = Guatemala|idDestinoDeseado = HongKong|idDestinoDeseado = Hungría|idDestinoDeseado = India|idDestinoDeseado = Indonesia|idDestinoDeseado = Irlanda|idDestinoDeseado = Irán|idDestinoDeseado = Islandia|idDestinoDeseado = Italia|idDestinoDeseado = Japón|idDestinoDeseado = Jordán|idDestinoDeseado = Kazajstán|idDestinoDeseado = Kenia|idDestinoDeseado = Kirguizstán|idDestinoDeseado = Kuwait|idDestinoDeseado = Laos|idDestinoDeseado = Letonia|idDestinoDeseado = Liberia|idDestinoDeseado = Lituania|idDestinoDeseado = Líbano|idDestinoDeseado = Macedonia|idDestinoDeseado = Malasia|idDestinoDeseado = Malawi|idDestinoDeseado = Malta|idDestinoDeseado = Marruecos|idDestinoDeseado = Mauricio|idDestinoDeseado = Moldavia|idDestinoDeseado = Mongolia|idDestinoDeseado = Montenegro|idDestinoDeseado = Mozambique|idDestinoDeseado = Myanmar|idDestinoDeseado = México|idDestinoDeseado = Namibia|idDestinoDeseado = Nepal|idDestinoDeseado = Nicaragua|idDestinoDeseado = Nigeria|idDestinoDeseado = Noruega|idDestinoDeseado = NuevaZelanda|idDestinoDeseado = Omán|idDestinoDeseado = Pakistán|idDestinoDeseado = Panamá|idDestinoDeseado = Paraguay|idDestinoDeseado = PaísesBajos|idDestinoDeseado = Perú|idDestinoDeseado = Polonia|idDestinoDeseado = Portugal|idDestinoDeseado = PuertoRico|idDestinoDeseado = ReinoUnido|idDestinoDeseado = RepublicaCheca|idDestinoDeseado = RepúblicaDominicana|idDestinoDeseado = Ruanda|idDestinoDeseado = Rumania|idDestinoDeseado = Rusia|idDestinoDeseado = Senegal|idDestinoDeseado = Serbia|idDestinoDeseado = Seychelles|idDestinoDeseado = Singapur|idDestinoDeseado = SriLanka|idDestinoDeseado = Sudáfrica|idDestinoDeseado = Suecia|idDestinoDeseado = Suiza|idDestinoDeseado = Tailandia|idDestinoDeseado = Taiwán|idDestinoDeseado = Tanzania|idDestinoDeseado = Tayikistan|idDestinoDeseado = Togo|idDestinoDeseado = Turquía|idDestinoDeseado = Túnez|idDestinoDeseado = Ucrania|idDestinoDeseado = Uganda|idDestinoDeseado = Uruguay|idDestinoDeseado = Venezuela|idDestinoDeseado = Vietnam"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="false" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="391">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="cluster_label|id|idCarrera|idCat_inversion|idDestinoDeseado"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="1117" y="187">
        <parameter key="criterion" value="gain_ratio"/>
        <parameter key="maximal_depth" value="20"/>
        <parameter key="apply_pruning" value="true"/>
        <parameter key="confidence" value="0.25"/>
        <parameter key="apply_prepruning" value="false"/>
        <parameter key="minimal_gain" value="0.01"/>
        <parameter key="minimal_leaf_size" value="2"/>
        <parameter key="minimal_size_for_split" value="4"/>
        <parameter key="number_of_prepruning_alternatives" value="3"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Text to Nominal" to_port="example set input"/>
      <connect from_op="Text to Nominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Sort" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Cluster Model Visualizer" to_port="clustered data"/>
      <connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 4"/>
      <connect from_op="Cluster Model Visualizer" from_port="model output" to_port="result 3"/>
      <connect from_op="Sort" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Set Role" from_port="original" to_port="result 2"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Hope this helps,

Regards,

Lionel
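
The long attribute list in "Select Attributes (2)" above appears to follow the pattern "attribute = value", joined with "|". As a rough sketch, assuming that naming pattern (inferred from the posted XML, not from documentation), such a subset string could be generated rather than typed by hand. The helper names below are hypothetical.

```python
from typing import Dict, List, Sequence


def dummy_attribute_names(levels: Dict[str, List[str]]) -> List[str]:
    """Build 'attribute = value' names, the pattern seen in the posted XML."""
    return [f"{attr} = {value}" for attr, values in levels.items() for value in values]


def subset_parameter(fixed: Sequence[str], levels: Dict[str, List[str]]) -> str:
    """Join fixed attribute names and dummy names with '|' for a subset filter."""
    return "|".join(list(fixed) + dummy_attribute_names(levels))


# Values taken from the process above (only idCat_inversion, for brevity).
levels = {
    "idCat_inversion": ["Inversiónalta", "Inversiónbásica", "Inversiónpromedio"],
}
print(subset_parameter(["cluster_label", "id"], levels))
```

If the naming in your RapidMiner version differs (for example with use_underscore_in_name enabled), the format string would need to be adjusted accordingly.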

lionelderkrikor · June 2019

Dear all,

This thread is very interesting because it opens a debate:
First, for a distance-based algorithm (like k-Means), is it always relevant to one-hot encode the features of type "category" in RapidMiner?
I ask because, although RapidMiner can handle features of type "category" directly, Auto Model one-hot encodes such features in its pre-processing step...
Looking further into this pre-processing step in Auto Model, we see that if a feature of type "category" has more than 10 values, it is removed from the modelling step.
After some searching I found that this corresponds to the "Max nominal values" parameter (10 by default) of the Remove Low Quality function of CLEANSE in Turbo Prep.
My question is: is there any reason for this hard-coded value of 10 in Auto Model?
Intuitively, I would say that this parameter should be related to the size of the initial dataset instead of being hard-coded (11 possible values in a 10M-row dataset and 11 possible values in a 100-row dataset do not have the same meaning), but maybe there are other reasons (computation time, curse of dimensionality...).
Moreover, I want to mention that with this strategy, in some cases (for example @WalterRioja's current dataset), all your features have the status "green" in Auto Model (and thus in theory are used for modelling), but in reality only a subset of them is actually used (and thus only a subset appears in the built model). I think that may surprise the user...

Once again, I just want to open the debate, in the spirit of improving RapidMiner (and, more generally, data science) knowledge, and to try to make RapidMiner even better than it already is...

To conclude, have a nice day (or night...)

Regards,

Lionel

IngoRM · June 2019

Yeah, the hard-coded 10 bugs me as well. However, the problem with one-hot encoding is that it can easily make your feature space explode, and it is hard to predict beforehand what is going to happen. AM aims at robust results in all cases, not necessarily the optimal results in some. That is why we let you open the process at the end, so you can make changes and see what they do for you...

Hope this makes sense,
Ingo
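
The trade-off discussed here can be sketched in a few lines of Python: count the distinct values per nominal column, apply a cut-off of 10, and compare how many dummy columns one-hot encoding would produce with and without that cut-off. The function names and the toy data are hypothetical, and this is not Auto Model's actual code.

```python
def cardinality_report(rows, columns, max_nominal_values=10):
    """Report distinct-value counts and whether each column passes the cut-off."""
    report = {}
    for c in columns:
        n = len({r[c] for r in rows})
        # A column with more than max_nominal_values distinct values is dropped.
        report[c] = {"distinct": n, "kept": n <= max_nominal_values}
    return report


def dummy_columns(report, only_kept):
    """Number of 0/1 columns that one-hot encoding would create."""
    return sum(
        info["distinct"]
        for info in report.values()
        if info["kept"] or not only_kept
    )


# Toy data: 3 investment categories, 12 distinct destinations.
categories = ["alta", "básica", "promedio"]
rows = [
    {"investment": categories[i % 3], "destination": f"country_{i}"}
    for i in range(12)
]
report = cardinality_report(rows, ["investment", "destination"])
print(report)
print(dummy_columns(report, only_kept=True))   # dummy columns after the cut-off
print(dummy_columns(report, only_kept=False))  # dummy columns without any cut-off
```

With real data containing a column of well over a hundred distinct values, the difference between the two counts grows accordingly, which is the feature-space growth mentioned above.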

WalterRioja · June 2019

Hello everyone! Thanks for the support. I have a question. If I wanted to run an automodel to cluster my data (the same I've shared before) would I get an 'incomplete' wrong result? The fact is that all three attributes I need to process are of type "category" (they are IDs from other tables in my database).
A second question would be, why when I run an automodel -without making any changes- I see negative values for some clusters. Why does this happen?

Thank you all

IngoRM · June 2019

If I wanted to run an automodel to cluster my data (the same I've shared before) would I get an 'incomplete' wrong result?

No, the results are not wrong. These are just some of the millions of choices you need to make as a data scientist. As I said before, what AM is doing works for most people / use cases, but may not be what you desire in your case. That can happen. In situations where this is more likely, Auto Model exposes the relevant parameter to the user in the UI. This is not the case here, but you can still open the process in Studio, make the desired change, and run it again to get the new results.

A second question would be, why when I run an automodel -without making any changes- I see negative values for some clusters. Why does this happen?

For clustering (or in fact all distance-based methods in machine learning) you normalize the data before the ML algorithm is applied. This prevents columns with a bigger range of values from dominating the other columns. The normalization we perform is a so-called z-standardization, and the resulting values have a mean of 0 and a standard deviation of 1. Hence the negative values...

Hope this helps,
Ingo
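
A small Python sketch of z-standardization shows why negative values appear: every value below the column mean maps to a negative number. The sketch uses the population standard deviation, which is an assumption; whether RapidMiner uses the population or the sample formula is not stated in this thread. The ages are made-up numbers.

```python
from statistics import mean, pstdev


def z_standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


ages = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical column; mean 5, std 2
z = z_standardize(ages)
print(z)  # values below the mean of 5 come out negative
```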

WalterRioja · June 2019

@IngoRM about the second question: how could I see the cluster rules in a tree based not on the z-standardized values but on the original data (for example, age between 1 and 10 is cluster 1, between 11 and 12 is cluster 2, etc.)?
Is this supported in Auto Model? When I ran my data with Auto Model, the tree was shown based on the negative values I mentioned before.

Thanks!

IngoRM · June 2019

Hi @WalterRioja

Good idea! The change is actually not that hard, so I will look into getting this into AM for one of the future releases. If you want to try it yourself, you can open the clustering process from AM at the end and use the De-Normalize operator on the preprocessing model from the Normalize operator. You can then apply this de-normalization model to the training data before the tree is built. Below is a screenshot of the necessary changes.

Stay tuned,
Ingo

Image: https://us.v-cdn.net/6030995/uploads/editor/fe/1wds8kwk1qz4.png
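
Conceptually, de-normalization inverts the z-standardization: the original value is recovered as z * std + mean, using the mean and standard deviation stored when the data was normalized. The Python sketch below illustrates this with made-up ages and a hypothetical tree threshold; it is not the De-Normalize operator itself.

```python
from statistics import mean, pstdev


def fit_z(values):
    """Return the (mean, population std) needed to normalize and de-normalize."""
    return mean(values), pstdev(values)


def normalize(value, mu, sigma):
    return (value - mu) / sigma


def denormalize(z, mu, sigma):
    """Map a standardized value back to the original scale."""
    return z * sigma + mu


ages = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical column
mu, sigma = fit_z(ages)

# A hypothetical tree split "age <= -0.5" on standardized data ...
threshold_z = -0.5
# ... reads as a split on the original scale after de-normalization.
threshold_age = denormalize(threshold_z, mu, sigma)
print(threshold_age)
```

Applying the same mapping to the training data before the tree is built makes the tree's split values readable in the original units.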

WalterRioja · June 2019

@IngoRM that's exactly what I needed. Thank you very much!!
