
How can I validate a DBSCAN clustering using only internal criteria?

agucaba123 Member Posts: 3 Learner III
edited December 2018 in Help

Hello, I'm trying to validate different clustering models using ONLY internal criteria. With centroid-based clustering, like K-means and K-medoids, I used the DB index and an extension that evaluates the silhouette index. My problem is that the DB and silhouette indices are not available for DBSCAN, and the other operators of RapidMiner Studio, like density or item distribution, make no sense to me in this case.

 

I saw this post, but I couldn't find an answer: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Cluster-Performance-DBScan-and-agglomerative-Clustering/m-p/40748#M27683

By the way, I read that previous versions of RapidMiner had an operator called "Cluster internal validation". https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Cannot-find-the-cluster-internal-validation-operator-in-rapid/m-p/25745

Is this operator still available? 

 

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @agucaba123,

     

    I'm not aware of an operator called "Cluster internal validation".

    However, you may be able to calculate the Silhouette Coefficient using a Python script.

    If you are interested, can you share your dataset and your process so that we can see whether it's possible?

     

    Regards,

     

    Lionel

  • agucaba123 Member Posts: 3 Learner III

    Hi Lionel. I can't share the dataset, but I tried applying a Silhouette coefficient and the result was this:

     

    DBSCAN.png

     

    I looped the epsilon parameter between 0.1 and 2. MinPoints was set to 5, 10 and 20. What does the Silhouette index mean in each case? Is it useful for validating this clustering method? I ask because the segmentation gets worse as epsilon rises (the numbers under each epsilon value are the cluster sizes).

     

    Thanks for your time. 
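
For reference, the silhouette of a single point compares a, its mean distance to the other members of its own cluster, with b, its mean distance to the members of the nearest other cluster: s = (b - a) / max(a, b). Values near 1 indicate a point that sits well inside its cluster, values near 0 a point on a boundary, and negative values a point that is closer to another cluster. The index reported for a clustering is the mean over the scored points. Below is a minimal pure-Python sketch of that definition. It assumes Euclidean distance and that DBSCAN noise points carry the label -1 and are skipped. The function name is hypothetical, and the thread does not say how the extension used above treats noise.

```python
from math import dist
from statistics import mean


def silhouette_values(points, labels, noise_label=-1):
    """Per-point silhouette s = (b - a) / max(a, b), skipping noise points."""
    # Group the non-noise points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != noise_label:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                values.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the point's own cluster
            # (the distance of p to itself is 0, hence the n - 1 divisor).
            a = sum(dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance to the members of the nearest other cluster.
            b = min(
                mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items()
                if other_lab != lab
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom else 0.0)
    return values


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1), (50, 50)]
    labs = [0, 0, 1, 1, -1]  # the last point is noise
    vals = silhouette_values(pts, labs)
    print(round(mean(vals), 3))
```

Two points follow from the definition. When a large epsilon merges everything into one cluster, the index is undefined. And because the index rewards compact, well-separated groups, it can score elongated or irregular density-based clusters poorly even when they are reasonable, so it is probably best read as one indication among several.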

  • septian_bagus Member Posts: 2 Contributor I
    Hi agucaba, 

    can you let me know how you got those silhouette numbers using RapidMiner?
  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @septian_bagus,

    I'm not aware of a Silhouette coefficient metric implemented in RapidMiner (please correct me if I'm wrong).
    However, you can obtain the Silhouette coefficient after building a model by using a Python script inside RapidMiner.
    Here is a process with a DBSCAN model and the associated Silhouette coefficient computed by a Python script:
    (You have to install Python on your computer and install the Python Scripting extension from the Marketplace.)
    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="85">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="85">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="a1|a4"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="9.1.000" expanded="true" height="103" name="Execute Python" width="90" x="380" y="85">
            <parameter key="script" value="import pandas as pd&#10;from sklearn.cluster import DBSCAN&#10;from sklearn import metrics&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;epsilon = 0.3&#10;minPts = 10&#10;def rm_main(data):&#10;&#10;  X = data[['a1','a4']]&#10;  db = DBSCAN(eps=epsilon, min_samples=minPts).fit(X)&#10;&#10;  labels = db.labels_&#10;&#10;  n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)&#10;  &#10;  Silhouette = metrics.silhouette_score(X,labels)&#10;&#10;  data['labels'] = labels&#10;&#10;  data['cluster'] = n_clusters_&#10;&#10;  data['silhouette'] = Silhouette&#10;&#10;  &#10;  &#10;&#10;    &#10;    # connect 2 output ports to see the results&#10;  return data"/>
            <parameter key="use_default_python" value="true"/>
            <parameter key="package_manager" value="conda (anaconda)"/>
          </operator>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
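
One caveat about the script in the process above: it passes every label to `silhouette_score`, so DBSCAN's noise label (-1) is scored as if it were a cluster, and the call raises an error when fewer than two distinct labels are found. Here is a minimal standalone sketch of one way around this. It assumes scikit-learn is installed and that columns 0 and 3 of scikit-learn's bundled Iris data correspond to a1 and a4. The helper name `dbscan_silhouette` and the epsilon grid are illustrative and not part of the original process.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score


def dbscan_silhouette(X, eps, min_samples):
    """Fit DBSCAN and score only the points that were assigned to a cluster."""
    X = np.asarray(X, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels != -1  # DBSCAN marks noise with the label -1
    n_clusters = len(set(labels[clustered]))
    noise_ratio = 1.0 - clustered.mean()
    # silhouette_score needs at least 2 clusters and more samples than clusters.
    if n_clusters < 2 or clustered.sum() <= n_clusters:
        return n_clusters, noise_ratio, None
    score = silhouette_score(X[clustered], labels[clustered])
    return n_clusters, noise_ratio, float(score)


if __name__ == "__main__":
    # Assumption: a1 and a4 map to sepal length and petal width.
    X = load_iris().data[:, [0, 3]]
    for min_samples in (5, 10, 20):
        for eps in (0.1, 0.3, 0.5, 1.0, 2.0):
            n, noise, score = dbscan_silhouette(X, eps, min_samples)
            shown = "n/a" if score is None else f"{score:.3f}"
            print(f"minPts={min_samples} eps={eps}: clusters={n} "
                  f"noise={noise:.2f} silhouette={shown}")
```

Reporting the noise ratio next to the score seems worthwhile, because dropping noise points can make a clustering that discards most of the data look better than it is.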
    
    Hope it helps,

    Regards,

    Lionel
     
  • Maerkli Member Posts: 84 Guru
    Hello Lionel,

    The obsolete operator "Cluster internal evaluation" implemented the following internal evaluation measures:
    • Global Silhouette Index
    • Min Max Cut
    • XB Index
    • Davies Bouldin
    Source: RapidMiner: Data Mining Use Cases and Business Analytics Applications [Hofmann & Klinkenberg 2013-10-25].
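
Of the measures listed, Davies Bouldin is straightforward to compute outside RapidMiner from cluster labels alone. The sketch below uses the common centroid-based definition: for each cluster, take the largest ratio of summed within-cluster scatter to centroid distance against any other cluster, then average those ratios (lower is better). It assumes Euclidean distance and skips points labelled -1 as DBSCAN noise. The function name is hypothetical, and whether the obsolete operator computed the index exactly this way is not confirmed by the source. scikit-learn also provides `sklearn.metrics.davies_bouldin_score` for the same index.

```python
from math import dist


def davies_bouldin(points, labels, noise_label=-1):
    """Davies-Bouldin index over non-noise points (lower is better)."""
    # Group the non-noise points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != noise_label:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid and mean member-to-centroid distance (scatter) per cluster.
    centroids, scatter = {}, {}
    for lab, members in clusters.items():
        c = tuple(sum(coord) / len(members) for coord in zip(*members))
        centroids[lab] = c
        scatter[lab] = sum(dist(p, c) for p in members) / len(members)

    # For each cluster, keep its worst (largest) similarity to another cluster.
    # Note: two clusters with identical centroids would divide by zero here.
    worst = []
    for i in clusters:
        worst.append(max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ))
    return sum(worst) / len(worst)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1), (50, 50)]
    labs = [0, 0, 1, 1, -1]  # the last point is noise
    print(davies_bouldin(pts, labs))
```

Like the silhouette, this index is built around centroids and compactness, so it may undervalue the irregularly shaped clusters that DBSCAN is designed to find.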

    Have a nice day,
    Maerkli



  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany