
Text Clustering using K-Medoids Algorithm

puteri_prameswa Member Posts: 3 Contributor I
edited November 2018 in Help

Hi All!

 

I'm new to RapidMiner. I have 1000+ online reviews collected from Tripadvisor.com, and I want to apply the K-Medoids algorithm to group those reviews into clusters. I chose K-Medoids because I want to find the "medoid" of each cluster, which I believe can represent the contents of the entire cluster. I have already applied the following operators:

- Read Excel

- Select Attributes

- Nominal to Text

- Process Documents from Data (Tokenization, Stemming, Stopwords Removal)

- and the Clustering node itself

 

But I can't seem to get proportional clusters. With 1000+ reviews and k = 2, the ratio of members in clusters 1 and 2 is 99 : 1.

 

 

Please help me!

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    Have you tried using:

     

    a) TF-IDF

    b) cosine similarity as distance measure

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
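
To make the two suggestions above concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine distance. It is not RapidMiner's implementation; the sample reviews, the tiny stopword list, and all function names are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, far smaller than a real one.
STOPWORDS = {"the", "was", "and", "a"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Made-up sample reviews: two about a hotel, one about a tour.
reviews = [
    "The hotel room was clean and the staff was friendly",
    "Friendly staff and a clean room",
    "The tour bus was late and the guide was rude",
]
vectors = tfidf_vectors(reviews)
d_hotel_hotel = cosine_distance(vectors[0], vectors[1])
d_hotel_tour = cosine_distance(vectors[0], vectors[2])
print(d_hotel_hotel < d_hotel_tour)  # the two hotel reviews are closer
```

Cosine distance compares the direction of the term vectors rather than their length, so a short review and a long review on the same topic can still end up close together. That tends to matter for review text, where document lengths vary a lot.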
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I agree with @mschmitz's suggestions. However, none of the k-means family of clustering algorithms guarantees that the clusters will be of equal size. The algorithm doesn't look at cluster sizes directly, but rather at intra-cluster similarity versus inter-cluster similarity. You may want to try X-Means, which tests a range of possible k values and suggests the best one based on BIC (the Bayesian Information Criterion).

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
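
The point about cluster sizes can be seen on a toy example. The sketch below is an illustration only, not what RapidMiner does internally: it finds the best medoids by brute force over all candidate sets, which is feasible only for tiny inputs. The data points and function names are made up. The objective it minimizes is the total distance from each item to its nearest medoid, so cluster sizes never enter into it.

```python
from itertools import combinations


def total_cost(dist, medoids):
    """k-medoids objective: sum of each item's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def best_medoids(dist, k):
    """Exhaustively try every set of k medoids (tiny inputs only)."""
    return min(combinations(range(len(dist)), k),
               key=lambda ms: total_cost(dist, ms))


def cluster_sizes(dist, medoids):
    """Count how many items are nearest to each medoid."""
    sizes = {m: 0 for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        sizes[nearest] += 1
    return sizes


# Made-up 1-D data: five similar items and one outlier.
points = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
dist = [[abs(a - b) for b in points] for a in points]

medoids = best_medoids(dist, k=2)
sizes = cluster_sizes(dist, medoids)
print([points[m] for m in medoids])  # medoids are actual data points
print(sizes)                         # sizes are lopsided: 5 items vs 1
```

Here the lowest-cost solution puts the single outlier in its own cluster. A 99 : 1 split on real reviews can arise the same way, when a handful of documents sit far from everything else under the chosen distance measure.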