
Text Clustering using RapidMiner

singing_bird_1 Member Posts: 16 Contributor I
edited November 2018 in Help

Hi all,

I am new to RapidMiner.

I have a set of documents that I want to cluster using the k-medoids algorithm with cosine distance.

I have watched many videos, read tutorials, and tried many things, but I keep getting wrong results (I compared them with the results of another program).

Could you please write out the full steps to load, cluster, and evaluate the documents?

Note: the documents are stored in a CSV file, with each document in a single cell, for a total of 396 rows (documents).

Please help.

Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    What do you mean by "it gives you wrong results"?  Can you be more specific?  Also if you can attach your RapidMiner process xml it would be easier to troubleshoot.

    Thanks,

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • singing_bird_1 Member Posts: 16 Contributor I

    Thank you so much for your help.

    By "wrong results" I mean the distribution of the documents among the clusters:

    cluster0:22 items

    cluster1:31 items

    cluster2:343 items

     

    I have attached my process.

    Thanks

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Clustering using k-means (or any of its variations) is not designed to divide the records evenly into clusters, but rather to minimize distance within clusters while maximizing distance between clusters.  Thus, if your only reason for thinking the clustering didn't work is that you have a very lumpy distribution of documents across clusters, I don't think that is a valid inference.
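As a rough illustration of that point (not part of the attached process), here is a minimal sketch of plain 1-D k-means in Python. The data points and starting centroids are made up for the example. When the data itself is lumpy, the cluster sizes come out lumpy too, and that is the expected behavior.

```python
# Minimal 1-D k-means (Lloyd's algorithm) with fixed starting centroids.
# The data and the starting centroids are hypothetical, for illustration only.
def kmeans_1d(points, centroids, iterations=20):
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# One large dense group and two small, distant groups.
points = [1.0, 1.1, 1.2, 0.9, 1.05, 0.95, 1.15, 1.3, 10.0, 10.2, 20.0]
clusters = kmeans_1d(points, centroids=[0.0, 9.0, 21.0])
sizes = [len(c) for c in clusters]
print(sizes)  # [8, 2, 1]
```

The sizes are uneven because the points are unevenly spread, not because the algorithm failed.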

    I looked at your process, but since I don't have access to the data, I was not able to run it to validate the results.  There did not appear to be any process errors, but a couple of things are unusual.  For example, why are you running "Data to Similarity" after text processing and then running the clustering on that output?  "Data to Similarity" is going to generate a record for every pairwise comparison among your original data elements, so you end up with many more records than you started with.  More conventionally, you would run the clustering directly on the output of the text processing.  I was also not able to interpret your performance operator.  Is it a custom extension you coded or purchased in the marketplace?  If not, which extension is it from?  My installation of RapidMiner did not recognize it.
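To give a sense of how quickly pairwise comparisons grow, here is a small Python sketch. It only counts unordered pairs of distinct documents. Whether the operator also emits both orderings or self-comparisons is not stated in this thread, so treat the numbers as a lower bound. The helper name `pairwise_count` is made up for the example.

```python
from itertools import combinations


def pairwise_count(n_docs):
    """Number of unordered pairs of distinct documents: n * (n - 1) / 2."""
    return n_docs * (n_docs - 1) // 2


# Small check against an explicit enumeration of pairs.
docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
pairs = list(combinations(docs, 2))
print(len(pairs))           # 6

# The dataset in this thread has 396 documents.
print(pairwise_count(396))  # 78210
```

So 396 documents already turn into tens of thousands of pairwise records, which is a very different input for a clustering operator than 396 word vectors.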

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • singing_bird_1 Member Posts: 16 Contributor I

    Thank you so much for your reply.

    I don't know how to attach my process.

    I am attaching the dataset that I am using.

    The performance operator comes from an extension that I added to RapidMiner (it computes the silhouette coefficient).

    I used Data to Similarity to represent the docs as binary vectors.

    Can you please tell me how to attach the process, so that you can see exactly what the problem is?

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    With your data file I created a modified version of your process.  This version runs without errors.  I replaced k-medoids with k-means clustering since it is much faster.  I also changed your word vector to term frequency (you had it set to binary term occurrences) and changed your distance metric to cosine similarity.  I deactivated the Data to Similarity operator since it was not needed.
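To show why the word vector setting matters, here is a minimal Python sketch comparing binary term occurrences with term frequency under cosine similarity. It is not RapidMiner's implementation. The two toy documents, the whitespace tokenization, and the length normalization of the term frequencies are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term (normalization by length is an assumption)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def binary_occurrence(tokens):
    """1.0 for every term that appears at least once."""
    return {term: 1.0 for term in set(tokens)}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two hypothetical documents with the same vocabulary but different emphasis.
doc1 = "data data data mining".split()
doc2 = "data mining mining mining".split()

binary_sim = cosine_similarity(binary_occurrence(doc1), binary_occurrence(doc2))
tf_sim = cosine_similarity(term_frequency(doc1), term_frequency(doc2))

print(round(binary_sim, 3))  # 1.0
print(round(tf_sim, 3))      # 0.6
```

With binary occurrences the two documents look identical, while term frequency keeps the difference in emphasis. That is one reason the two settings can give quite different clusters.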

    What extension is the performance operator from?  I could not find it.  So I left it off.  But the new clusters are more evenly distributed if you are concerned about that.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.000" expanded="true" name="Process">
    <parameter key="random_seed" value="2001"/>
    <process expanded="true">
    <operator activated="false" class="data_to_similarity" compatibility="7.6.000" expanded="true" height="82" name="Data to Similarity" width="90" x="380" y="187">
    <parameter key="measure_types" value="NumericalMeasures"/>
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <operator activated="false" class="multiply" compatibility="7.6.000" expanded="true" height="68" name="Multiply" width="90" x="514" y="187"/>
    <operator activated="false" class="dummy" compatibility="7.6.000" expanded="true" height="68" name="Performance" width="90" x="648" y="187"/>
    <operator activated="true" class="read_csv" compatibility="7.6.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="C:\Users\brian\Downloads\All_clusters_RM.csv"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" breakpoints="after" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text" width="90" x="112" y="136"/>
    <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="179" y="34">
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" breakpoints="after" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="313" y="34">
    <parameter key="vector_creation" value="Term Frequency"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="85"/>
    <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
    <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="k_means" compatibility="7.6.000" expanded="true" height="82" name="Clustering (2)" width="90" x="581" y="34">
    <parameter key="k" value="3"/>
    <parameter key="measure_types" value="NumericalMeasures"/>
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <connect from_op="Data to Similarity" from_port="example set" to_op="Multiply" to_port="input"/>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_op="Clustering (2)" to_port="example set"/>
    <connect from_op="Clustering (2)" from_port="cluster model" to_port="result 1"/>
    <connect from_op="Clustering (2)" from_port="clustered set" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • singing_bird_1 Member Posts: 16 Contributor I

    Thank you so much.

    I have a question:

    How can I run your modified process?

    How can I attach it and run it?

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    See the instructions here: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-can-I-share-processes-without-RapidMiner-Server/ta-p/37047

    Basically, just copy the XML into the XML tab in RapidMiner and then hit the green check mark.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • singing_bird_1 Member Posts: 16 Contributor I
    Thank you so much for your help.