"Text Clustering"
Ingo - I've taken this as far as I can and now I'm stuck! I've created the following experiment that attempts to cluster text extracted from a sample Excel file containing 14 examples, 0 special attributes and 8 regular attributes. Here's the syntax so far ...
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="datamanagement" value="long_array"/>
        <parameter key="excel_file" value="C:\feedback.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="attribute_name_filter"/>
        <parameter key="parameter_string" value="comments"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="default_content_language" value="english"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="target_function" value="gaussian mixture clusters"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
</operator>
This process produces 3 clusters. Cluster 0 has 33 items, Cluster 1 has 55 items, and Cluster 2 has 12, out of a total of 100 examples. At this point, I want to apply a meaningful, user-friendly label to each cluster that captures its key theme. How can I figure out the key theme for each cluster? What steps are next?
Please help!
Answers
First of all, I do not really understand what you are trying to achieve with the ExampleSetGenerator in your process. Including this operator means that you preprocess some texts from your Excel file, then artificially generate completely independent data and cluster that generated data instead of your texts.
Secondly, it is not that easy to assign a meaningful label without user interaction. You can, however, gain insight into what describes your clusters (which words occur more often in one cluster than in another) by looking at the cluster centroids or by learning a (descriptive) classification model.
Kind regards,
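The centroid idea can be illustrated outside RapidMiner with a minimal Python sketch. It assumes the texts have already been turned into term-occurrence vectors and that each example carries a cluster name; the function name, the toy vocabulary, and the toy vectors are all hypothetical and only serve as illustration.

```python
from collections import defaultdict

def top_terms_per_cluster(vectors, clusters, vocabulary, n=3):
    """Return the n terms with the highest centroid value for each cluster."""
    sums = defaultdict(lambda: [0.0] * len(vocabulary))
    counts = defaultdict(int)
    for vec, cluster in zip(vectors, clusters):
        counts[cluster] += 1
        for i, value in enumerate(vec):
            sums[cluster][i] += value
    result = {}
    for cluster, total in sums.items():
        # The centroid is the per-term mean over all examples in the cluster.
        centroid = [s / counts[cluster] for s in total]
        ranked = sorted(range(len(vocabulary)),
                        key=lambda i: centroid[i], reverse=True)
        result[cluster] = [vocabulary[i] for i in ranked[:n]]
    return result

# Hypothetical toy data: term-occurrence vectors over a four-word vocabulary.
vocabulary = ["tool", "great", "rain", "sunni"]
vectors = [[2, 1, 0, 0], [3, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 0]]
clusters = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"]

top = top_terms_per_cluster(vectors, clusters, vocabulary, n=2)
print(top)
```

Terms that rank high in one centroid but not in the others are candidates for that cluster's label; choosing the final wording still needs a human.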
Tobias
Regarding your advice in the second paragraph, can you clarify which steps are required? The guidance provided is too general. Please provide a clearer set of instructions that will help me complete what you are advising.
Thanks!
I think Tobias is right to give only general advice here, since you also asked a general question. Don't get me wrong, but without more information it is often not possible to give more concrete hints. In other cases, the processes for the desired task get complex enough to eat an hour of our time. This is actually a case which combines both aspects (as Tobias pointed out by saying "it is not that easy to...").
However, to be more constructive and give you some additional hints about which operators could help you in your trials, here are some basic recipes (which will probably not directly deliver the desired solution):
Option 1: use ChangeAttributeRole to change the role of the cluster to label and learn a predictive and understandable model describing your clusters based on the other attributes (e.g. a decision tree operator). Define labels based on such a model and apply the operator "Mapper" to map "cluster0" to "descriptive name 1" and so on. This is only a semi-automatic approach but often delivers the best results.
Option 2: a more automatic approach is to define a label attribute for each cluster distinguishing the current cluster from all others (e.g. with AttributeConstruction, change it to the label afterwards with ChangeAttributeRole). Now learn a weighting from this artificially labeled data (e.g. with Relief). Transform the weights into a data set with the AttributeWeights2ExampleSet operator and sort it according to the weight (Sorting). Keep only the k rows with highest weight (ExampleRangeFilter). Use macros (e.g. DataMacroDefinition) or any other means to retrieve the data from the data set and use the result again as base for the "Mapper" operator. Put everything in a loop (ValueIterator) and you get your generic and automatic result.
Option 3: for texts the operator "CorpusBasedWeighting" together with the recipe of Option 2 and a loop can also be useful.
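The logic of Option 2 can be sketched in plain Python to make the data flow explicit. This is not the RapidMiner process itself: the weighting below is a simple difference of mean term occurrence between the target cluster and all other examples, swapped in for Relief to keep the sketch short, and the function name and toy data are hypothetical.

```python
def describe_clusters(vectors, clusters, vocabulary, k=2):
    """Map each cluster name to a label built from its k most distinctive terms."""
    mapping = {}
    for target in sorted(set(clusters)):  # one pass per cluster (the loop step)
        # One-vs-rest split: examples of the target cluster against all others.
        inside = [v for v, c in zip(vectors, clusters) if c == target]
        outside = [v for v, c in zip(vectors, clusters) if c != target]
        weights = []
        for i, term in enumerate(vocabulary):
            mean_in = sum(v[i] for v in inside) / len(inside)
            mean_out = sum(v[i] for v in outside) / len(outside) if outside else 0.0
            # Assumed stand-in for Relief: how much more often the term occurs inside.
            weights.append((mean_in - mean_out, term))
        # Sort by weight (descending) and keep only the k best terms.
        weights.sort(key=lambda pair: pair[0], reverse=True)
        top = [term for _, term in weights[:k]]
        mapping[target] = target + "_" + "+".join(top)
    return mapping

# Hypothetical toy data: term-occurrence vectors over a four-word vocabulary.
vocabulary = ["tool", "great", "rain", "sunni"]
vectors = [[2, 1, 0, 0], [3, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 0]]
clusters = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"]

labels = describe_clusters(vectors, clusters, vocabulary, k=2)
print(labels)
```

The resulting dictionary plays the role of the mapping from generic cluster names to descriptive ones; any other attribute weighting scheme could replace the mean difference without changing the surrounding steps.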
Hope that helps. Cheers,
Ingo
Error in: Sorting (Sorting) The attribute 'weights' does not exist. The example set does not contain an attribute with the given name.
Where did I go wrong in configuring the Sorting operator?
Kind regards,
Tobias
<operator name="ExampleSet2AttributeWeights" class="ExampleSet2AttributeWeights">
</operator>
<operator name="Sorting" class="Sorting" breakpoints="after">
    <parameter key="attribute_name" value="Weights"/>
    <parameter key="sorting_direction" value="decreasing"/>
</operator>
Thoughts?
Regards,
Tobias
I made the change, but am still getting ...
Error in: Sorting (Sorting) The attribute 'Weight' does not exist. The example set does not contain an attribute with the given name.
What else might I try?
By the way, you can insert breakpoints into the tree by double-clicking on an operator. The process execution then stops at the specified point, so you can inspect the intermediate results and observe that there is no attribute called Weights in any example set you have ...
Kind regards,
Tobias
Regards,
Tobias
<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\Program Files\Rapid-I\RapidMiner\test.log"/>
    <parameter key="resultfile" value="C:\Program Files\Rapid-I\RapidMiner\test.res"/>
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\feedback.xls"/>
        <parameter key="first_row_as_names" value="true"/>
        <parameter key="datamanagement" value="long_array"/>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="attribute_name_filter"/>
        <parameter key="parameter_string" value="comments"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="no">
        <parameter key="default_content_language" value="english"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="gaussian mixture clusters"/>
        <parameter key="number_examples" value="14"/>
        <parameter key="number_of_attributes" value="8"/>
        <parameter key="datamanagement" value="long_array"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
    <operator name="AttributeConstruction" class="AttributeConstruction">
        <list key="function_descriptions">
        </list>
    </operator>
    <operator name="Relief" class="Relief">
    </operator>
    <operator name="AttributeWeights2ExampleSet" class="AttributeWeights2ExampleSet">
    </operator>
    <operator name="Sorting" class="Sorting">
        <parameter key="attribute_name" value="Weight"/>
        <parameter key="sorting_direction" value="decreasing"/>
    </operator>
</operator>
Before I attempt the DataMacroDefinition that Ingo recommends as the next step, I would like to understand where I can find the key words associated with the 3 clusters that this experiment produces. With only 3 clusters, manually extracting and applying the key words to my cluster analysis will probably be faster than building a macro at this stage.
The process looks fine up to the text processing. Then things go wrong:
- What's the purpose of the ExampleSetGenerator here? Remove it.
- You added the attribute construction. Fine. But you forgot to define the necessary parameters. Create a new attribute with value "target" if the cluster attribute equals the target cluster and "non_target" otherwise.
- Before you apply Relief you have to specify that the newly created attribute should be used as label with the ChangeAttributeRole operator.
- The rest seems to be fine but using a loop together with macros would be much more elegant and scales to larger numbers of clusters as well.
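The attribute construction described in the list above can be sketched in Python as follows. The sketch assumes each example is a dictionary holding a `cluster` value; the function name and the key names `cluster` and `label` are hypothetical.

```python
def add_target_label(examples, target_cluster, cluster_key="cluster", label_key="label"):
    """Return copies of the examples with a binary target/non_target label added."""
    labeled = []
    for example in examples:
        row = dict(example)  # copy, so the input examples stay unchanged
        if row[cluster_key] == target_cluster:
            row[label_key] = "target"
        else:
            row[label_key] = "non_target"
        labeled.append(row)
    return labeled

# Hypothetical examples as they might look after clustering.
examples = [
    {"cluster": "cluster_0", "tool": 2},
    {"cluster": "cluster_1", "tool": 0},
    {"cluster": "cluster_0", "tool": 3},
]

labeled = add_target_label(examples, "cluster_0")
print([row["label"] for row in labeled])
```

The new column is what the weighting step then treats as the label; repeating the call once per cluster gives the loop variant.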
Cheers,
Ingo
Error in: KMeans (KMeans) The example set contains non-numerical attribute #4: comments (string/single_value)/values=[mediocre, sunny, overcast, rain, good tool, great tool] which is not allowed for value based similarities. Some learning schemes and algorithms can handle only numerical attributes, for example KMeans clustering or most support vector machines (SVM). You can use one of the preprocessing operator before applying this operator in order to transform the nominal attributes.
If I keep the ExampleSetGenerator in play, the error is eliminated. Thoughts?
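The error message says that k-means needs numeric attributes, so the comment strings have to become numeric vectors before clustering. A rough Python sketch of such a term-occurrence step follows. It assumes a tiny hypothetical stopword list, treats the minimum token length as inclusive, and omits stemming; it is not a description of how the text operators are implemented.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "and", "was", "but", "this", "for"}

def term_occurrences(texts, min_chars=3):
    """Turn raw strings into term-occurrence vectors over a shared vocabulary."""
    tokenized = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())  # simple tokenizer
        tokens = [t for t in tokens
                  if len(t) >= min_chars and t not in STOPWORDS]
        tokenized.append(Counter(tokens))
    vocabulary = sorted(set().union(*tokenized))
    # Counter returns 0 for terms that do not occur in a text.
    vectors = [[counts[term] for term in vocabulary] for counts in tokenized]
    return vocabulary, vectors

vocabulary, vectors = term_occurrences(["good tool", "great tool", "rain"])
print(vocabulary)
print(vectors)
```

Every resulting attribute is a number, which is the form a value-based similarity measure can work with.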
<operator name="AttributeConstruction" class="AttributeConstruction">
    <list key="function_descriptions">
        <parameter key="target" value="cluster=target"/>
    </list>
</operator>
Produces this error ...
Error in: AttributeConstruction (AttributeConstruction) Generation exception: 'Unrecognized symbol "target"
Syntax Error (assignment not enabled)
' An operator failed to generate a new attribute, macro, or other object which is calculated on the fly.
the correct format is
Here is the basic process. At the end, you could replace "cluster_0" by "cluster_0_att3+att4+att" with the "Mapping" operator. You could also perform everything in an automatic loop, try other weighting schemes etc.
By the way: I moved this topic into the "problems" board.
Cheers,
Ingo