Clustering of the Text

gunjanamit · June 2012

I want to cluster survey comments into different categories, for example:

Comment | Category

Restrooms Stinks | FMG
Food was costly | Restaurant
Poor service in restaurant | Restaurant

I want to read the comments from Excel and write them back to Excel with the category.

Can anyone please suggest how to do this?

MariusHelf · June 2012

Hi,

If you already know which categories you are looking for, you should label your training data manually with these categories and then train a classification algorithm on it. A good choice for text processing could be a support vector machine (SVM).
If you can't or don't want to label your data, just run a clustering algorithm like k-Means on your preprocessed documents, and have a look at the clusters afterwards to see if they make sense for you.
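
To make the labelled-data route a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: it swaps in a small multinomial Naive Bayes classifier for the SVM to stay dependency-free, the three training rows are the example comments from the first post, and the names tokenize, train and classify are made up for this sketch (they are not RapidMiner operators).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the comment and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_comments):
    """Count documents per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_comments:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Return the category with the highest Naive Bayes log-score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Log prior: share of training comments in this category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Manually labelled examples taken from the first post.
training_data = [
    ("Restrooms Stinks", "FMG"),
    ("Food was costly", "Restaurant"),
    ("Poor service in restaurant", "Restaurant"),
]
model = train(training_data)

for comment in ["washroom stinks", "service was poor"]:
    print(comment, "->", classify(comment, model))
```

With only three labelled rows this is a toy; in practice you would label many more comments per category before trusting the predictions.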

Best, Marius

gunjanamit · June 2012

I followed the process below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
<process expanded="true" height="252" width="681">
<operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
<parameter key="excel_file" value="C:\Users\guagg\Desktop\All\RapidMiner\read.xls"/>
<parameter key="imported_cell_range" value="A1:A6"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="k_means" compatibility="5.2.006" expanded="true" height="76" name="Clustering" width="90" x="313" y="75">
<parameter key="add_as_label" value="true"/>
<parameter key="remove_unlabeled" value="true"/>
<parameter key="k" value="3"/>
<parameter key="measure_types" value="NominalMeasures"/>
<parameter key="nominal_measure" value="RussellRaoSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="5.2.006" expanded="true" height="76" name="Numerical to Binominal" width="90" x="514" y="120"/>
<connect from_op="Read Excel" from_port="output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

But it's not giving me the correct results.

Results
cluster_0 I love food
cluster_1 washroom stinks
cluster_2 service is poor
cluster_0 food is great
cluster_0 not great service

The last one should be cluster_2, not cluster_0.

Please advise.

MariusHelf · June 2012

You are processing texts, so you should have a close look at the Text Extension. You'll find links to tutorials in the post linked in my signature.
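
To illustrate why text preprocessing matters: as far as the posted process shows, each comment reaches k-Means as one whole nominal value, so the algorithm cannot see that two comments share a word. Text processing first turns every comment into a vector of word counts. The following is a rough Python sketch of that idea, not the Text Extension's actual implementation; word_vector and cosine_similarity are names invented for this example.

```python
import math
import re
from collections import Counter


def word_vector(comment):
    """Turn a comment into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", comment.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The five comments from the results listing above.
comments = [
    "I love food",
    "washroom stinks",
    "service is poor",
    "food is great",
    "not great service",
]
vectors = [word_vector(c) for c in comments]

# Compare the last comment against all the others.
target = vectors[-1]
for comment, vector in zip(comments[:-1], vectors[:-1]):
    print(f"{comment!r}: {cosine_similarity(target, vector):.2f}")
```

Note that "not great service" shares exactly one word with "service is poor" and one with "food is great", so plain word counts alone may not be enough to separate them. Stop-word filtering, term weighting and more data are the usual next steps.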

Best, Marius

gunjanamit · June 2012

Marius,

I can't find the link. Please post it again.

Regards
gunjan

MariusHelf · June 2012

Just click my signature where it says "click here" in big red letters, and read the first item in the linked post.


Altair RapidMiner Community
