"Text Clustering"

Legacy User · April 2009

Ingo - I've taken this as far as I can and now I'm stuck! I've created the following experiment that attempts to cluster text extracted from a sample Excel file containing 14 examples, 0 special attributes and 8 regular attributes. Here's the syntax so far ...

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="datamanagement" value="long_array"/>
        <parameter key="excel_file" value="C:\feedback.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="attribute_name_filter"/>
        <parameter key="parameter_string" value="comments"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="default_content_language" value="english"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="target_function" value="gaussian mixture clusters"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
</operator>

This process produces 3 clusters. Cluster 0 has 33 items, Cluster 1 has 55 items, and Cluster 2 has 12, out of a total of 100 examples. At this point, I want to apply a meaningful, user-friendly label to each cluster that captures its key theme. How can I figure out the key theme for each cluster? What steps are next?

Please help!

TobiasMalbrecht · April 2009

Hi,

first of all, I do not really understand what you are trying to achieve with the [tt]ExampleSetGenerator[/tt] in your process. Including that operator means that you preprocess some texts from your Excel file, then artificially generate completely independent data, and cluster the latter data.

Secondly, it is not that easy to assign a meaningful label without user interaction. You can, however, gain insight into what describes your clusters (which words occur more often in one cluster than in another) by having a look at the cluster centroids or by learning a (descriptive) classification model.

Kind regards,
Tobias

Legacy User · April 2009

OK, if the ExampleSetGenerator should be removed as it is redundant, then consider it done.

Regarding your advice in the second paragraph, can you clarify what steps are required? The guidance provided is too general. Please provide a clearer set of instructions that will help me complete what you are advising.

Thanks!

IngoRM · April 2009

Hi,

I think Tobias is right here in giving only general advice, since you also asked a general question. Don't get me wrong, but without more information it is often not possible to give more concrete hints. And in other cases, the processes for the desired task get complex enough to eat an hour of our time. And this is actually a case which even combines both aspects (as Tobias pointed out by saying "it is not that easy to...").

However, to be more constructive and give you some additional hints about which operators could help you in your trials, here are some basic recipes (which will probably not directly deliver the desired solution):

Option 1: use ChangeAttributeRole to change the role of the cluster to label and learn a predictive and understandable model describing your clusters based on the other attributes (e.g. a decision tree operator). Define labels based on such a model and apply the operator "Mapper" to map "cluster0" to "descriptive name 1" and so on. This is only a semi-automatic approach but often delivers the best results.

Option 2: a more automatic approach is to define a label attribute for each cluster distinguishing the current cluster from all others (e.g. with AttributeConstruction, change it to the label afterwards with ChangeAttributeRole). Now learn a weighting from this artificially labeled data (e.g. with Relief). Transform the weights into a data set with the AttributeWeights2ExampleSet operator and sort it according to the weight (Sorting). Keep only the k rows with highest weight (ExampleRangeFilter). Use macros (e.g. DataMacroDefinition) or any other means to retrieve the data from the data set and use the result again as base for the "Mapper" operator. Put everything in a loop (ValueIterator) and you get your generic and automatic result.

Option 3: for texts the operator "CorpusBasedWeighting" together with the recipe of Option 2 and a loop can also be useful.
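
As a rough illustration of Option 1, the fragment below could follow the KMeans operator. It is only a sketch: the attribute name "cluster" is the one KMeans writes (it also appears in the expression used later in this thread), while the class name "DecisionTree" for the tree learner is an assumption that should be checked against the operator list of the installed version.

```xml
<!-- Sketch of Option 1: treat the cluster assignment as a label and learn a readable model -->
<operator name="ChangeAttributeRole" class="ChangeAttributeRole">
    <!-- "cluster" is the attribute created by KMeans -->
    <parameter key="name" value="cluster"/>
    <parameter key="target_role" value="label"/>
</operator>
<!-- Assumption: the decision tree learner is registered under the class name "DecisionTree" -->
<operator name="DecisionTree" class="DecisionTree">
</operator>
```

The resulting tree shows which terms separate the clusters, and those terms can then be used to choose the descriptive names by hand.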

Hope that helps. Cheers,
Ingo

Legacy User · April 2009

Ingo, yes, your follow-up is helpful. It offers a set of options that I can attempt to get to my end-point. Since my experience with RM to date is essentially trial-and-error, some overall guidance, or "recipes", is very helpful in moving a novice RM user down the learning curve! I completely support learning-by-doing, but frameworks are always useful.

Legacy User · April 2009

Ingo Mierswa wrote:

Option 2: a more automatic approach is to define a label attribute for each cluster distinguishing the current cluster from all others (e.g. with AttributeConstruction, change it to the label afterwards with ChangeAttributeRole). Now learn a weighting from this artificially labeled data (e.g. with Relief). Transform the weights into a data set with the AttributeWeights2ExampleSet operator and sort it according to the weight (Sorting). Keep only the k rows with highest weight (ExampleRangeFilter). Use macros (e.g. DataMacroDefinition) or any other means to retrieve the data from the data set and use the result again as base for the "Mapper" operator. Put everything in a loop (ValueIterator) and you get your generic and automatic result.

Hope that helps. Cheers,
Ingo

Ingo, I chose to implement Option 2. All worked well until I reached the Sorting operator. I sorted according to weights and got the following error:

Error in: Sorting (Sorting) The attribute 'weights' does not exist. The example set does not contain an attribute with the given name.

Where did I go wrong in configuring the Sorting operator?

TobiasMalbrecht · April 2009

Hi,

newstuff wrote:

Error in: Sorting (Sorting) The attribute 'weights' does not exist. The example set does not contain an attribute with the given name.

Where did I go wrong in configuring the Sorting operator?

The attribute containing the weight, which is generated by the [tt]AttributeWeights2ExampleSet[/tt] operator, is called [tt]Weight[/tt]. Did you specify exactly this as the sorting attribute?

Kind regards,
Tobias

Legacy User · April 2009

Yes, I ran the trial using the term Weights. Here is the syntax ....

<operator name="ExampleSet2AttributeWeights" class="ExampleSet2AttributeWeights">
</operator>
<operator name="Sorting" class="Sorting" breakpoints="after">
    <parameter key="attribute_name" value="Weights"/>
    <parameter key="sorting_direction" value="decreasing"/>
</operator>

Thoughts?

TobiasMalbrecht · April 2009

Hi,

newstuff wrote:

<operator name="ExampleSet2AttributeWeights" class="ExampleSet2AttributeWeights">
</operator>
<operator name="Sorting" class="Sorting" breakpoints="after">
    <parameter key="attribute_name" value="Weights"/>
    <parameter key="sorting_direction" value="decreasing"/>
</operator>

Thoughts?

Yes of course: the attribute is called [tt]Weight[/tt] not [tt]Weights[/tt] ... without the "s"!

Regards,
Tobias

Legacy User · April 2009

Tobias Malbrecht wrote:

Yes of course: the attribute is called [tt]Weight[/tt] not [tt]Weights[/tt] ... without the "s"!

Tobias

I made the change, but am still getting ...

Error in: Sorting (Sorting) The attribute 'Weight' does not exist. The example set does not contain an attribute with the given name.

What else might I try?

TobiasMalbrecht · April 2009

Hi,

newstuff wrote:

<operator name="ExampleSet2AttributeWeights" class="ExampleSet2AttributeWeights">
</operator>
<operator name="Sorting" class="Sorting" breakpoints="after">
    <parameter key="attribute_name" value="Weights"/>
    <parameter key="sorting_direction" value="decreasing"/>
</operator>

Oops, I just read your post again and realized you did not use the [tt]AttributeWeights2ExampleSet[/tt] operator as Ingo suggested but the [tt]ExampleSet2AttributeWeights[/tt] operator. This of course makes the difference. So follow Ingo's suggestion: use a weighting scheme and the [tt]AttributeWeights2ExampleSet[/tt] operator. Then sort by the attribute [tt]Weight[/tt] and it should work.

By the way, you can insert breakpoints into the tree by double-clicking on an operator, i.e. the process execution stops at the specified point and you can inspect the intermediate results and observe that there is no attribute called [tt]Weights[/tt] in any example set you have ...
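
A sketch of how the corrected tail of the process might look. It reuses only operators and parameters already named in this thread, and assumes a RapidMiner version that includes [tt]AttributeWeights2ExampleSet[/tt] and an example set that already has a label attribute for Relief to work with:

```xml
<!-- Weighting scheme: Relief is the example Ingo named; it expects a label attribute -->
<operator name="Relief" class="Relief">
</operator>
<!-- Converts the attribute weights into an example set containing the attribute "Weight" -->
<operator name="AttributeWeights2ExampleSet" class="AttributeWeights2ExampleSet">
</operator>
<operator name="Sorting" class="Sorting" breakpoints="after">
    <!-- Singular "Weight", matching the attribute generated by the operator above -->
    <parameter key="attribute_name" value="Weight"/>
    <parameter key="sorting_direction" value="decreasing"/>
</operator>
```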

Kind regards,
Tobias

Legacy User · April 2009

Tobias Malbrecht wrote:

Oops, I just read your post again and realized you did not use the [tt]AttributeWeights2ExampleSet[/tt] operator as Ingo suggested but the [tt]ExampleSet2AttributeWeights[/tt] operator. This of course makes the difference. So follow Ingo's suggestion: use a weighting scheme and the [tt]AttributeWeights2ExampleSet[/tt] operator. Then sort by the attribute [tt]Weight[/tt] and it should work.

I could not find an AttributeWeights2ExampleSet in RM4.3 so I used ExampleSet2AttributeWeights. Assuming that they are the same operator appears to be problematic. Will I find the AttributeWeights2ExampleSet in RM 4.3?

TobiasMalbrecht · April 2009

Hi,

newstuff wrote:

I could not find an AttributeWeights2ExampleSet in RM4.3 so I used ExampleSet2AttributeWeights. Assuming that they are the same operator appears to be problematic. Will I find the AttributeWeights2ExampleSet in RM 4.3?

no, they are definitely not the same. And another no: the [tt]AttributeWeights2ExampleSet[/tt] operator was only introduced in 4.3.1, i.e. after the release of 4.3. If you update to version 4.4 you will of course be able to use it.

Regards,
Tobias

Legacy User · April 2009

OK, RM 4.4 is installed and the following experiment runs without error ....
<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\Program Files\Rapid-I\RapidMiner\test.log"/>
    <parameter key="resultfile" value="C:\Program Files\Rapid-I\RapidMiner\test.res"/>
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\feedback.xls"/>
        <parameter key="first_row_as_names" value="true"/>
        <parameter key="datamanagement" value="long_array"/>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="attribute_name_filter"/>
        <parameter key="parameter_string" value="comments"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="no">
        <parameter key="default_content_language" value="english"/>
        <parameter key="vector_creation" value="TermOccurrences"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="PorterStemmer" class="PorterStemmer">
        </operator>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="gaussian mixture clusters"/>
        <parameter key="number_examples" value="14"/>
        <parameter key="number_of_attributes" value="8"/>
        <parameter key="datamanagement" value="long_array"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
    <operator name="AttributeConstruction" class="AttributeConstruction">
        <list key="function_descriptions">
        </list>
    </operator>
    <operator name="Relief" class="Relief">
    </operator>
    <operator name="AttributeWeights2ExampleSet" class="AttributeWeights2ExampleSet">
    </operator>
    <operator name="Sorting" class="Sorting">
        <parameter key="attribute_name" value="Weight"/>
        <parameter key="sorting_direction" value="decreasing"/>
    </operator>
</operator>

Before I attempt the DataMacroDefinition that Ingo recommends as the next step, I would like to understand where I can find the key words that are associated with the 3 clusters that this experiment produces. With only 3 clusters, manually extracting and applying the key words to my cluster analysis will probably be faster than building a macro at this stage.

IngoRM · April 2009

Hello,

the process looks fine until the text processing. Then things go wrong:

What's the purpose of the ExampleSetGenerator here? Remove it.
You added the attribute construction. Fine. But you forgot to define the necessary parameters. Create a new attribute with value "target" if the cluster attribute equals the target cluster and "non_target" otherwise.
Before you apply Relief you have to specify that the newly created attribute should be used as label with the ChangeAttributeRole operator.
The rest seems to be fine but using a loop together with macros would be much more elegant and scales to larger numbers of clusters as well.

Cheers,
Ingo

Legacy User · April 2009

Ingo Mierswa wrote:

the process looks fine until the text processing. Then things go wrong:
What's the purpose of the ExampleSetGenerator here? Remove it.

You're right, things definitely go wrong with the text processing. Tobias previously advised that I take it out; however, when I do so, leaving everything else unchanged, I receive the following error ...

Error in: KMeans (KMeans) The example set contains non-numerical attribute #4: comments (string/single_value)/values=[mediocre, sunny, overcast, rain, good tool, great tool] which is not allowed for value based similarities. Some learning schemes and algorithms can handle only numerical attributes, for example KMeans clustering or most support vector machines (SVM). You can use one of the preprocessing operator before applying this operator in order to transform the nominal attributes.

If I keep the ExampleSetGenerator in play, then the error is eliminated. Thoughts?

IngoRM · April 2009

Not without seeing the (intermediate) results, sorry.

Legacy User · April 2009

Ingo Mierswa wrote:

You added the attribute construction. But you forgot to define the necessary parameters. Create a new attribute with value "target" if the cluster attribute equals the target cluster and "non_target" otherwise.

No worries on the previous post regarding the K-means error. I got that one fixed! But I am not having much success properly configuring the AttributeConstruction operator. I am consistently generating a syntax error. Can you elaborate on how best to define this new attribute? This syntax ...

<operator name="AttributeConstruction" class="AttributeConstruction">
    <list key="function_descriptions">
        <parameter key="target" value="cluster=target"/>
    </list>
</operator>

Produces this error ...

Error in: AttributeConstruction (AttributeConstruction) Generation exception: 'Unrecognized symbol "target"
Syntax Error (assignment not enabled)
' An operator failed to generate a new attribute, macro, or other object which is calculated on the fly.

IngoRM · April 2009

Hi again,

the correct format is


if (cluster=="cluster_0","target","non_target")

Here is the basic process. At the end, you could replace "cluster_0" by "cluster_0_att3+att4+att" with the "Mapping" operator. You could also perform everything in an automatic loop, try other weighting schemes etc.


<operator name="Root" class="Process" expanded="yes">
    <operator name="Create Data" class="OperatorChain" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="gaussian mixture clusters"/>
            <parameter key="number_examples" value="500"/>
        </operator>
        <operator name="AttributeFilter" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="parameter_string" value="label"/>
            <parameter key="invert_filter" value="true"/>
            <parameter key="apply_on_special" value="true"/>
        </operator>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
    <operator name="AttributeConstruction" class="AttributeConstruction">
        <list key="function_descriptions">
            <parameter key="label" value="if (cluster==&quot;cluster_0&quot;,&quot;target&quot;,&quot;non_target&quot;)"/>
        </list>
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
    </operator>
    <operator name="Relief" class="Relief">
    </operator>
    <operator name="AttributeWeights2ExampleSet" class="AttributeWeights2ExampleSet">
    </operator>
    <operator name="Sorting" class="Sorting">
        <parameter key="attribute_name" value="Weight"/>
        <parameter key="sorting_direction" value="decreasing"/>
    </operator>
</operator>
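
With only three clusters, the process above could also simply be rerun once per cluster instead of building a loop, changing nothing but the condition. A sketch of that one change, assuming the remaining cluster values follow the same naming pattern as "cluster_0" (that is, "cluster_1" and "cluster_2"), which should be verified in the clustered example set:

```xml
<operator name="AttributeConstruction" class="AttributeConstruction">
    <list key="function_descriptions">
        <!-- Second run: cluster_1 is the target; use cluster_2 for the third run (names assumed) -->
        <parameter key="label" value="if (cluster==&quot;cluster_1&quot;,&quot;target&quot;,&quot;non_target&quot;)"/>
    </list>
</operator>
```

The top rows of the sorted result in each run are then the attributes that best distinguish that cluster from the other two.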

By the way: I moved this topic into the "problems" board.

Cheers,
Ingo
