How to combine Logistic regression with SOM as a hybrid model?

komeil_shaeri · October 2016

Hi,

I need to combine Logistic regression with SOM or DBSCAN as a hybrid model. This will be a hybrid "Classification + Clustering" model in which a classifier can be trained first, and its output is used as the input for the cluster to improve the clustering results.

Thanks,

Thomas_Ott · October 2016

Just take your pre-processed (ETL'd) data, feed it into a X-val with your Logistic Regression, the use an apply model on the outside to to score your training set and put it into the clustering algo. Of course I'm simplifying it, but it should be quite easy to do.

Update: Something like this?

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.2.003" expanded="true" height="68" name="Sonar" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="split_validation" compatibility="7.2.003" expanded="true" height="124" name="Validation" width="90" x="179" y="136">
        <process expanded="true">
          <operator activated="true" class="h2o:logistic_regression" compatibility="7.2.000" expanded="true" height="82" name="Logistic Regression" width="90" x="156" y="34"/>
          <connect from_port="training" to_op="Logistic Regression" to_port="training set"/>
          <connect from_op="Logistic Regression" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.2.003" expanded="true" height="82" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.2.003" expanded="true" height="82" name="Performance" width="90" x="179" y="30"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.2.003" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="34">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="dbscan" compatibility="7.2.003" expanded="true" height="82" name="Clustering" width="90" x="648" y="34"/>
      <connect from_op="Sonar" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="training" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="18"/>
      <portSpacing port="sink_result 2" spacing="90"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

komeil_shaeri · October 2016

Thanks for your response ...

The problem is when I hybridize the algorithms, the performance measures (accuracy, precision, recall) don't change even if I disable the x-validation operator which contains the logistic regression. I don't know why logistic regression cannot affect the overall performance...

Please see the attached file.

Thanks

komeil_shaeri · October 2016

Hi,

In this example, first I have applied decision tree (DT) on Titanic data. The resulting accuracy is 80.29%.

When the DT is hybridized with Fuzzy C-means (FCM), still the performance accuracy is 80.29%. This means that the system does not take into account the FCM. Is there another way to integrate the Classification and Clustering models? Can you help me on this issue?

DT process:

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator class="process" name="Process" expanded="true" compatibility="7.2.002" activated="true">
<parameter value="init" key="logverbosity"/>
<parameter value="2001" key="random_seed"/>
<parameter value="never" key="send_mail"/>
<parameter value="" key="notification_email"/>
<parameter value="30" key="process_duration_for_mail"/>
<parameter value="SYSTEM" key="encoding"/>
<process expanded="true">
<operator class="retrieve" name="Retrieve Titanic" expanded="true" compatibility="7.2.002" activated="true" y="34" x="45" width="90" height="68">
<parameter value="//Samples/data/Titanic" key="repository_entry"/>
</operator>
<operator class="replace_missing_values" name="Replace Missing Values" expanded="true" compatibility="7.2.002" activated="true" y="136" x="45" width="90" height="103">
<parameter value="false" key="return_preprocessing_model"/>
<parameter value="false" key="create_view"/>
<parameter value="all" key="attribute_filter_type"/>
<parameter value="" key="attribute"/>
<parameter value="" key="attributes"/>
<parameter value="false" key="use_except_expression"/>
<parameter value="attribute_value" key="value_type"/>
<parameter value="false" key="use_value_type_exception"/>
<parameter value="time" key="except_value_type"/>
<parameter value="attribute_block" key="block_type"/>
<parameter value="false" key="use_block_type_exception"/>
<parameter value="value_matrix_row_start" key="except_block_type"/>
<parameter value="false" key="invert_selection"/>
<parameter value="false" key="include_special_attributes"/>
<parameter value="average" key="default"/>
<list key="columns"/>
</operator>
<operator class="set_role" name="Set Role" expanded="true" compatibility="7.2.002" activated="true" y="289" x="45" width="90" height="82">
<parameter value="Survived" key="attribute_name"/>
<parameter value="label" key="target_role"/>
<list key="set_additional_roles"/>
</operator>
<operator class="x_validation" name="Validation" expanded="true" compatibility="7.2.002" activated="true" y="34" x="246" width="90" height="124">
<parameter value="false" key="create_complete_model"/>
<parameter value="true" key="average_performances_only"/>
<parameter value="false" key="leave_one_out"/>
<parameter value="10" key="number_of_validations"/>
<parameter value="automatic" key="sampling_type"/>
<parameter value="false" key="use_local_random_seed"/>
<parameter value="1992" key="local_random_seed"/>
<process expanded="true">
<operator class="parallel_decision_tree" name="Decision Tree" expanded="true" compatibility="7.2.002" activated="true" y="34" x="162" width="90" height="82">
<parameter value="gain_ratio" key="criterion"/>
<parameter value="20" key="maximal_depth"/>
<parameter value="true" key="apply_pruning"/>
<parameter value="0.25" key="confidence"/>
<parameter value="true" key="apply_prepruning"/>
<parameter value="0.1" key="minimal_gain"/>
<parameter value="2" key="minimal_leaf_size"/>
<parameter value="4" key="minimal_size_for_split"/>
<parameter value="3" key="number_of_prepruning_alternatives"/>
</operator>
<connect to_port="training set" to_op="Decision Tree" from_port="training"/>
<connect to_port="model" from_port="model" from_op="Decision Tree"/>
<connect to_port="through 1" from_port="exampleSet" from_op="Decision Tree"/>
<portSpacing spacing="0" port="source_training"/>
<portSpacing spacing="0" port="sink_model"/>
<portSpacing spacing="0" port="sink_through 1"/>
<portSpacing spacing="0" port="sink_through 2"/>
</process>
<process expanded="true">
<operator class="apply_model" name="Apply Model" expanded="true" compatibility="7.2.002" activated="true" y="34" x="112" width="90" height="82">
<list key="application_parameters"/>
<parameter value="false" key="create_view"/>
</operator>
<operator class="performance" name="Performance" expanded="true" compatibility="7.2.002" activated="true" y="136" x="246" width="90" height="82">
<parameter value="true" key="use_example_weights"/>
</operator>
<connect to_port="model" to_op="Apply Model" from_port="model"/>
<connect to_port="unlabelled data" to_op="Apply Model" from_port="test set"/>
<connect to_port="labelled data" to_op="Performance" from_port="labelled data" from_op="Apply Model"/>
<connect to_port="averagable 1" from_port="performance" from_op="Performance"/>
<portSpacing spacing="0" port="source_model"/>
<portSpacing spacing="0" port="source_test set"/>
<portSpacing spacing="0" port="source_through 1"/>
<portSpacing spacing="0" port="source_through 2"/>
<portSpacing spacing="0" port="sink_averagable 1"/>
<portSpacing spacing="0" port="sink_averagable 2"/>
</process>
</operator>
<operator class="apply_model" name="Apply Model (3)" expanded="true" compatibility="7.2.002" activated="true" y="34" x="447" width="90" height="82">
<list key="application_parameters"/>
<parameter value="false" key="create_view"/>
</operator>
<operator class="performance" name="Performance (2)" expanded="true" compatibility="7.2.002" activated="true" y="85" x="648" width="90" height="82">
<parameter value="true" key="use_example_weights"/>
</operator>
<connect to_port="example set input" to_op="Replace Missing Values" from_port="output" from_op="Retrieve Titanic"/>
<connect to_port="example set input" to_op="Set Role" from_port="example set output" from_op="Replace Missing Values"/>
<connect to_port="training" to_op="Validation" from_port="example set output" from_op="Set Role"/>
<connect to_port="model" to_op="Apply Model (3)" from_port="model" from_op="Validation"/>
<connect to_port="unlabelled data" to_op="Apply Model (3)" from_port="training" from_op="Validation"/>
<connect to_port="labelled data" to_op="Performance (2)" from_port="labelled data" from_op="Apply Model (3)"/>
<connect to_port="result 1" from_port="performance" from_op="Performance (2)"/>
<portSpacing spacing="0" port="source_input 1"/>
<portSpacing spacing="0" port="sink_result 1"/>
<portSpacing spacing="0" port="sink_result 2"/>
</process>
</operator>
</process>

DT-FCM process:

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator class="process" name="Process" expanded="true" compatibility="7.2.002" activated="true">
<parameter value="init" key="logverbosity"/>
<parameter value="2001" key="random_seed"/>
<parameter value="never" key="send_mail"/>
<parameter value="" key="notification_email"/>
<parameter value="30" key="process_duration_for_mail"/>
<parameter value="SYSTEM" key="encoding"/>
<process expanded="true">
<operator class="retrieve" name="Retrieve Titanic" expanded="true" compatibility="7.2.002" activated="true" y="34" x="45" width="90" height="68">
<parameter value="//Samples/data/Titanic" key="repository_entry"/>
</operator>
<operator class="replace_missing_values" name="Replace Missing Values" expanded="true" compatibility="7.2.002" activated="true" y="136" x="45" width="90" height="103">
<parameter value="false" key="return_preprocessing_model"/>
<parameter value="false" key="create_view"/>
<parameter value="all" key="attribute_filter_type"/>
<parameter value="" key="attribute"/>
<parameter value="" key="attributes"/>
<parameter value="false" key="use_except_expression"/>
<parameter value="attribute_value" key="value_type"/>
<parameter value="false" key="use_value_type_exception"/>
<parameter value="time" key="except_value_type"/>
<parameter value="attribute_block" key="block_type"/>
<parameter value="false" key="use_block_type_exception"/>
<parameter value="value_matrix_row_start" key="except_block_type"/>
<parameter value="false" key="invert_selection"/>
<parameter value="false" key="include_special_attributes"/>
<parameter value="average" key="default"/>
<list key="columns"/>
</operator>
<operator class="set_role" name="Set Role" expanded="true" compatibility="7.2.002" activated="true" y="289" x="45" width="90" height="82">
<parameter value="Survived" key="attribute_name"/>
<parameter value="label" key="target_role"/>
<list key="set_additional_roles"/>
</operator>
<operator class="x_validation" name="Validation" expanded="true" compatibility="7.2.002" activated="true" y="34" x="246" width="90" height="124">
<parameter value="false" key="create_complete_model"/>
<parameter value="true" key="average_performances_only"/>
<parameter value="false" key="leave_one_out"/>
<parameter value="10" key="number_of_validations"/>
<parameter value="automatic" key="sampling_type"/>
<parameter value="false" key="use_local_random_seed"/>
<parameter value="1992" key="local_random_seed"/>
<process expanded="true">
<operator class="parallel_decision_tree" name="Decision Tree" expanded="true" compatibility="7.2.002" activated="true" y="34" x="162" width="90" height="82">
<parameter value="gain_ratio" key="criterion"/>
<parameter value="20" key="maximal_depth"/>
<parameter value="true" key="apply_pruning"/>
<parameter value="0.25" key="confidence"/>
<parameter value="true" key="apply_prepruning"/>
<parameter value="0.1" key="minimal_gain"/>
<parameter value="2" key="minimal_leaf_size"/>
<parameter value="4" key="minimal_size_for_split"/>
<parameter value="3" key="number_of_prepruning_alternatives"/>
</operator>
<connect to_port="training set" to_op="Decision Tree" from_port="training"/>
<connect to_port="model" from_port="model" from_op="Decision Tree"/>
<connect to_port="through 1" from_port="exampleSet" from_op="Decision Tree"/>
<portSpacing spacing="0" port="source_training"/>
<portSpacing spacing="0" port="sink_model"/>
<portSpacing spacing="0" port="sink_through 1"/>
<portSpacing spacing="0" port="sink_through 2"/>
</process>
<process expanded="true">
<operator class="apply_model" name="Apply Model" expanded="true" compatibility="7.2.002" activated="true" y="34" x="112" width="90" height="82">
<list key="application_parameters"/>
<parameter value="false" key="create_view"/>
</operator>
<operator class="performance" name="Performance" expanded="true" compatibility="7.2.002" activated="true" y="136" x="246" width="90" height="82">
<parameter value="true" key="use_example_weights"/>
</operator>
<connect to_port="model" to_op="Apply Model" from_port="model"/>
<connect to_port="unlabelled data" to_op="Apply Model" from_port="test set"/>
<connect to_port="labelled data" to_op="Performance" from_port="labelled data" from_op="Apply Model"/>
<connect to_port="averagable 1" from_port="performance" from_op="Performance"/>
<portSpacing spacing="0" port="source_model"/>
<portSpacing spacing="0" port="source_test set"/>
<portSpacing spacing="0" port="source_through 1"/>
<portSpacing spacing="0" port="source_through 2"/>
<portSpacing spacing="0" port="sink_averagable 1"/>
<portSpacing spacing="0" port="sink_averagable 2"/>
</process>
</operator>
<operator class="apply_model" name="Apply Model (3)" expanded="true" compatibility="7.2.002" activated="true" y="34" x="447" width="90" height="82">
<list key="application_parameters"/>
<parameter value="false" key="create_view"/>
</operator>
<operator class="nominal_to_numerical" name="Nominal to Numerical" expanded="true" compatibility="7.2.002" activated="true" y="187" x="447" width="90" height="103">
<parameter value="false" key="return_preprocessing_model"/>
<parameter value="false" key="create_view"/>
<parameter value="all" key="attribute_filter_type"/>
<parameter value="" key="attribute"/>
<parameter value="" key="attributes"/>
<parameter value="false" key="use_except_expression"/>
<parameter value="nominal" key="value_type"/>
<parameter value="false" key="use_value_type_exception"/>
<parameter value="file_path" key="except_value_type"/>
<parameter value="single_value" key="block_type"/>
<parameter value="false" key="use_block_type_exception"/>
<parameter value="single_value" key="except_block_type"/>
<parameter value="false" key="invert_selection"/>
<parameter value="false" key="include_special_attributes"/>
<parameter value="dummy coding" key="coding_type"/>
<parameter value="false" key="use_comparison_groups"/>
<list key="comparison_groups"/>
<parameter value="all 0 and warning" key="unexpected_value_handling"/>
<parameter value="false" key="use_underscore_in_name"/>
</operator>
<operator class="prules:FCM" name="Fuzzy C-Means" expanded="true" compatibility="7.0.000" activated="true" y="391" x="447" width="90" height="103">
<parameter value="true" key="add_cluster_attribute"/>
<parameter value="false" key="add_as_label"/>
<parameter value="false" key="Add partition matrix"/>
<parameter value="3" key="Clusters"/>
<parameter value="50" key="Iterations"/>
<parameter value="2.0" key="Fuzzynes"/>
<parameter value="1.0E-4" key="MinGain"/>
<parameter value="MixedMeasures" key="measure_types"/>
<parameter value="MixedEuclideanDistance" key="mixed_measure"/>
<parameter value="NominalDistance" key="nominal_measure"/>
<parameter value="EuclideanDistance" key="numerical_measure"/>
<parameter value="GeneralizedIDivergence" key="divergence"/>
<parameter value="radial" key="kernel_type"/>
<parameter value="1.0" key="kernel_gamma"/>
<parameter value="1.0" key="kernel_sigma1"/>
<parameter value="0.0" key="kernel_sigma2"/>
<parameter value="2.0" key="kernel_sigma3"/>
<parameter value="3.0" key="kernel_degree"/>
<parameter value="1.0" key="kernel_shift"/>
<parameter value="1.0" key="kernel_a"/>
<parameter value="0.0" key="kernel_b"/>
<parameter value="false" key="use_local_random_seed"/>
<parameter value="1992" key="local_random_seed"/>
</operator>
<operator class="performance" name="Performance (2)" expanded="true" compatibility="7.2.002" activated="true" y="391" x="648" width="90" height="82">
<parameter value="true" key="use_example_weights"/>
</operator>
<connect to_port="example set input" to_op="Replace Missing Values" from_port="output" from_op="Retrieve Titanic"/>
<connect to_port="example set input" to_op="Set Role" from_port="example set output" from_op="Replace Missing Values"/>
<connect to_port="training" to_op="Validation" from_port="example set output" from_op="Set Role"/>
<connect to_port="model" to_op="Apply Model (3)" from_port="model" from_op="Validation"/>
<connect to_port="unlabelled data" to_op="Apply Model (3)" from_port="training" from_op="Validation"/>
<connect to_port="example set input" to_op="Nominal to Numerical" from_port="labelled data" from_op="Apply Model (3)"/>
<connect to_port="exampleSet" to_op="Fuzzy C-Means" from_port="example set output" from_op="Nominal to Numerical"/>
<connect to_port="labelled data" to_op="Performance (2)" from_port="exampleSet" from_op="Fuzzy C-Means"/>
<connect to_port="result 1" from_port="performance" from_op="Performance (2)"/>
<portSpacing spacing="0" port="source_input 1"/>
<portSpacing spacing="0" port="sink_result 1"/>
<portSpacing spacing="0" port="sink_result 2"/>
</process>
</operator>
</process>

Many thanks,

Komeil

Thomas_Ott · October 2016

I'm a bit confused as to why you want to first classify the data and then segment it? These are two different methods of learning (Supervised and Unsupervised). In the supervised method you start with knowing the truth, you know who died and didn't die in the Titanic disaster. Normally, in the Unsupervised way, you typically don't have a class label and look for statisical characteristics that 'segment' like groups together. In what you are trying to do here is build a model on the Titatinic data set with a label and then throw out that label and segement out the regular attributes. You will get different performance measures for sure, one for a classification problem and the other for a segementation problem.

If you're looking to combine multiple algorithms, have you tried our stacking (ensembing) operator?

komeil_shaeri · October 2016

Stacked Generalization is good for combining multiple classifiers. I'm wondering if is there any way to combine clustering techniques with each other? I heard about "Consensus Clustering" which is similar to stacking but for clustering methods.

Thomas_Ott · October 2016

Maybe what you can do is select one class from the Logistic Regression result and then pass that to the clustering process. This way you can segment out those attributes for the single class.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

How to combine Logistic regression with SOM as a hybrid model?

Answers