Looping Clusters and store them in Repository
Hi everybody,
My dataset consists 4000 examples, 4 special attributes (ID, cluster, text and outlier), and 570 regular attributes from textprocessing. What I have done with the data so far was only to cluster it. Now I have 37 clusters and I want to store the 1 example set for each cluster in my repository.
Thats where my problem is: I think it should be possible with macros, "loop cluster" - and the "store" -operator, but I cant figure out how to set the parameters right.
I have a snippet attached from the data.
And the XML of my process so far:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Daten KAM clustered (opt.)" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Datenbearbeitung MA/Filter Outliers/Daten KAM clustered (opt.)"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="ID|label|text"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="34">
<parameter key="attribute_name" value="label"/>
<parameter key="target_role" value="cluster"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="loop_clusters" compatibility="8.2.000" expanded="true" height="82" name="Loop Clusters" width="90" x="648" y="34">
<process expanded="true">
<operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="label.equals.%{myMacro_0}"/>
</list>
</operator>
<operator activated="true" class="store" compatibility="8.2.000" expanded="true" height="68" name="Store" width="90" x="648" y="34">
<parameter key="repository_entry" value="999TEST"/>
</operator>
<connect from_port="cluster subset" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="out 1"/>
<portSpacing port="source_cluster subset" spacing="0"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_macros" compatibility="8.2.000" expanded="true" height="68" name="Set Macros" width="90" x="313" y="136">
<list key="macros">
<parameter key="myMacro_0" value=""cluster_0""/>
</list>
</operator>
<connect from_op="Retrieve Daten KAM clustered (opt.)" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Loop Clusters" to_port="example set"/>
<connect from_op="Loop Clusters" from_port="out 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
My goal is to apply the "Extract Topics from Document (LDA)" operator on every cluster with number of topics = 1 so that I can see the top words for each cluster.
Thank you all in advance
flo
Best Answers
-
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi,
Group into Collection and Loop Collection from Toolbox does it.
Let me know if you need help with LDA. It's somewhat my baby.
BR,
Martin
Edit: I guess you do not want to use LDA, but simple process documents or so.
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany2 -
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi @flo,
have a look at the attached process. Is should do what you want?
BR,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve OpenRanks Reviews Beijing" width="90" x="45" y="34">
<parameter key="repository_entry" value="data/OpenRanks Reviews Beijing"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Review"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="5.0"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="85"/>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="concurrency:k_means" compatibility="8.2.001" expanded="true" height="82" name="Clustering" width="90" x="447" y="34"/>
<operator activated="true" class="operator_toolbox:group_into_collection" compatibility="1.3.000-SNAPSHOT" expanded="true" height="82" name="Group Into Collection" width="90" x="715" y="34">
<parameter key="group_by_attribute" value="cluster"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="8.2.001" expanded="true" height="82" name="Loop Collection" width="90" x="849" y="34">
<process expanded="true">
<operator activated="true" class="extract_macro" compatibility="8.2.001" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
<parameter key="macro" value="clusterId"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="cluster"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="112" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="cluster"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="aggregate" compatibility="8.2.001" expanded="true" height="82" name="Aggregate (2)" width="90" x="179" y="34">
<parameter key="use_default_aggregation" value="true"/>
<parameter key="default_aggregation_function" value="sum"/>
<list key="aggregation_attributes"/>
</operator>
<operator activated="true" class="transpose" compatibility="8.2.001" expanded="true" height="82" name="Transpose" width="90" x="313" y="34"/>
<operator activated="true" class="sort" compatibility="8.2.001" expanded="true" height="82" name="Sort" width="90" x="447" y="34">
<parameter key="attribute_name" value="att_1"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="8.2.001" expanded="true" height="82" name="Filter Example Range" width="90" x="581" y="34">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="5"/>
<description align="center" color="transparent" colored="false" width="126">Take Top5</description>
</operator>
<operator activated="true" class="replace" compatibility="8.2.001" expanded="true" height="82" name="Replace" width="90" x="715" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="id"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="replace_what" value="sum\((.+)\)"/>
<parameter key="replace_by" value="$1"/>
</operator>
<operator activated="true" class="rename" compatibility="8.2.001" expanded="true" height="82" name="Rename" width="90" x="849" y="34">
<parameter key="old_name" value="att_1"/>
<parameter key="new_name" value="sum"/>
<list key="rename_additional_attributes"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.2.001" expanded="true" height="82" name="Generate Attributes" width="90" x="983" y="34">
<list key="function_descriptions">
<parameter key="cluster" value="%{clusterId}"/>
</list>
</operator>
<connect from_port="single" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Aggregate (2)" to_port="example set input"/>
<connect from_op="Aggregate (2)" from_port="example set output" to_op="Transpose" to_port="example set input"/>
<connect from_op="Transpose" from_port="example set output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve OpenRanks Reviews Beijing" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Group Into Collection" to_port="exa"/>
<connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="50" resized="true" width="481" x="271" y="235">Task: Calculate the top 5 most frequent words per cluster</description>
</process>
</operator>
</process>Edit: Also have a look at this blog post: https://medium.com/@mSchmitz_/understanding-clustering-cf0117148ef4
i think this is closer to what you really want.
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany2
Answers
Hi @mschmitz,
This topic inspire me 2 questions about your (nice) baby Martin :
In deed, I executed the tutorial of this operator. For recall, in this tutorial, we create and analyze 5 documents which are strictly the same :
- when number of topics = 5, all documents have the same topic :
- when number of topics = 10, the document have different topics :
My first question is why, in this last case for similar documents, we don't have the same topic (like in the first case) ?
My second question is how should we interpret the weight of words : The more the weight is high, the more the word is "caracteristic" of the topic / the more the word "explain" the topic ?
Thank you,
Regards,
Lionel
Hey @lionelderkrikor,
this is totally artificial since the data is the same. The optimization uses some randomness for the start. It assings a word to a topic and so on. Thus the different names. If there would be something in, then this would change. I think you just get the priors out.
I got a topic extraction on Tripadvisor Reviews somewhere. I thought i posted a blog post on it - but i can't find it? @sgenzer did i maybe just not post it?
BR,
Martin
Dortmund, Germany
Hi,
Thank you @mschmitz those operators were exactly what I was looking for.
Since I have each of the clusters in one Collection I thought I could use the "Extract Topics from Document" (with number of topics : 1) on those Collections to see the TOP words for each cluster....
But I have been thinking now:
What I did was to cluster my text data by k means first and after that I did the LDA "Extract Topics from Document", so my question is:
Isn't that somewhat the same ? I mean both operators seperates texts into "clusters" or "topics" except LDA can give me the TOP x words for each topic.
Best regards
flo
Hi @flo,
exactly. LDA is somewhat like a clustering. It also groups your documents into k-groups. The big difference is, that LDA is a Latent model.
This means:
Which makes it different to normal clusterings. I think what you want is just a Process Documents on each cluster and use WordList to Data to get the frequency overview.
Best,
Martin
Dortmund, Germany
@mschmitz thank you for the fast reply.
...haha you know better what I want than I do :P
Process Documents on each cluster with the top frequent words as a WordList was what I thought that I can achieve with the LDA.
Anyway thank you very much.
Best regards
flo
Hello
Dear friend @flo
Did you perform the LDA algorithm on any cluster?
Thanks if you tell me
With respect
Hello @m_keshavarz_com,
I tried to perform LDA on the clusters but it didnt work (log 0.0000). But what I will try is just to get a wordlist from each cluster and sort them top down. That should deliver a similar result to the LDA hopefully.
Sorry I cant help more than that ....
Best
flo
Hi @mschmitz
I hope I am not bothering you.
Thank you so far for your input - the process documents ( vector creation: term occurrences) on each cluster gives a good overview.
What I end up with is the following table:
My question is now is there a way to show only the top 5 words per cluster ( no occurrences ) through some magic ETL which I dont know yet or is there no other choice than to transpose this table and and sort each cluster in deacreasing order manually ?
Best regards
flo
Hi @mschmitz
Yes that was very much what I wanted to do. I have modified the process a little bit so that it shows me the TOP 5 words for each cluster in one example set more or less like this:
ClusterID TOP1 TOP2 TOP3 TOP4 TOP5
Cluster_1
Cluster_2
Cluster_3
Thank you very much.
Best regards
flo
Hello Dear friends and forum professors sorry..... I also want to find repetitive words in each cluster and the centers of each cluster But I do not know how Somebody tell me?
and
I'm from @mschmitz
I used . But for 8 clusters, only cluster words are 0,1,2,5,6,7
Gave the And the words did not give clusters 4 and 5
what's wrong?
thanks for your help
Hi dear friend @flo
Thank you very much for your help
For me, lda also had the result likelihood = 0 on clusters
I did not understand your sentence
Can you explain more?and how?
"But what I will try is just to get a wordlist from every cluster and sort them down the top. That would hopefully bring a similar result to the LDA."
Thanks a lot
Hi @m_keshavarz_com , @student_compute
maybe this can help you, if you are looking for the most frequent words for each cluster:
Of course thanks to @mschmitz for most of the process.
Best
flo
Hello. thank you very much dear friend:smileyhappy:
But I want to know how to find the repetitive words of each cluster and the center of each cluster?
Thank you so much for your kindness