Text clustering in RapidMiner Studio
I'm trying to do unsupervised clustering of text in RM. The data is in a .CSV file. One attribute is a free-text field that I would like to cluster. I have configured the file as a data source in my repository, marking that field as type text and the id field as type id. I believe I need to create a word vector for each example in my set, and I think I do this using "Process Documents from Data". I have it set to create word vectors using TF-IDF.
Inside Process Documents, I have a tokenizer, case transformer, stopword filter, stemmer, and n-gram builder in sequence. I wired the output of Process Documents to the input of k-means clustering. Everything runs for a while and then halts with an error that the example set contains non-numeric values in a column. Is there a way to focus the clustering on only the attributes of interest (i.e. the terms found by Process Documents)? Or do I have to filter out the other attributes first?
I also tried switching the k-means measure type to mixed, but then I get an error that I have missing values.
All of the articles I read on clustering text describe the process I'm using, but it doesn't work for me. Please help.
Answers
Filtering the attributes did the trick. I stripped out everything except the id and text. The resulting term vectors came through and clustered. Progress...
Dear dsackin,
Could you provide an example process? In general, I would recommend clustering only on the values returned by Process Documents, i.e. the TF-IDF values.
~martin
Dortmund, Germany
I got this working. I had to use the Select Attributes operator to filter the input to the Process Documents operator down to just an id and the text field. Then the output of Process Documents was just the id plus all of the term attributes, with that document's TF-IDF score for each term.
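For readers following along outside RapidMiner, the same idea can be sketched in plain Python: drop everything except the id and the text, then build one numeric TF-IDF vector per document. This is only an illustration under assumptions. The rows, the `city` column, and the helper name `tfidf_vector` are hypothetical, and the weighting shown (term frequency times log inverse document frequency) is one common TF-IDF variant, not necessarily the exact formula RapidMiner applies.

```python
import math
from collections import Counter

# Hypothetical rows standing in for the CSV: an id, the free text, and an
# extra non-numeric column of the kind that broke the clustering step.
rows = [
    {"id": 1, "text": "boat on the lake", "city": "Boston"},
    {"id": 2, "text": "car and boat show", "city": "Salem"},
    {"id": 3, "text": "car repair shop", "city": "Lowell"},
]

# Rough equivalent of Select Attributes: keep only the id and the text.
docs = {row["id"]: row["text"].lower().split() for row in rows}

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
vocabulary = sorted(doc_freq)
n_docs = len(docs)


def tfidf_vector(tokens):
    """Return one TF-IDF score per vocabulary term for a tokenized document."""
    counts = Counter(tokens)
    return [
        (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term in vocabulary
    ]


# Every value is numeric, so a distance-based clusterer can consume it.
vectors = {doc_id: tfidf_vector(tokens) for doc_id, tokens in docs.items()}
```

Because the `city` column never enters `docs`, the resulting vectors contain only term scores, which is the situation the numeric-only clustering operators expect.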
Now I'm trying to figure out how to assign a "top terms" summary to each cluster. I used Extract Cluster Prototypes on the Cluster Model output. I get a new example set with one example per cluster. Each example has a cluster label plus a prototype score for each term. What I would like to do is find a way to pivot that so I can get a list of terms and scores, sort it, and threshold the top N for each cluster.
Going from this:
CLUSTER,BOAT,CAR,PLANE,TRAIN
cluster_0,0.02,0.31,0.23,0.00
cluster_1,0.22,0.01,0.0,0.0
To this:
CLUSTER,TERM,SCORE
cluster_0,boat,0.02
cluster_0,car,0.31
cluster_0,plane,0.23
cluster_0,train,0.00
cluster_1,boat,0.22
...
then group by cluster label, sort by score, and output top N scoring terms.
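The reshaping described above (wide to long, then top N per cluster) can be sketched in plain Python to make the target output concrete. This is not a RapidMiner process. The function name `top_terms` and the choice of N = 2 are assumptions for illustration, and the input is the small example table from this post.

```python
import csv
import io

# The wide prototype table from the example above.
wide_csv = """CLUSTER,BOAT,CAR,PLANE,TRAIN
cluster_0,0.02,0.31,0.23,0.00
cluster_1,0.22,0.01,0.0,0.0
"""


def top_terms(wide_text, n):
    """Unpivot a wide prototype table and keep the n highest-scoring terms per cluster."""
    reader = csv.DictReader(io.StringIO(wide_text))
    result = {}
    for row in reader:
        cluster = row.pop("CLUSTER")
        # One (term, score) pair per remaining column: the long format.
        pairs = [(term.lower(), float(score)) for term, score in row.items()]
        # Sort by score, highest first, then cut off at the top n.
        pairs.sort(key=lambda pair: pair[1], reverse=True)
        result[cluster] = pairs[:n]
    return result


top = top_terms(wide_csv, 2)
for cluster, pairs in top.items():
    print(cluster, pairs)
# cluster_0 [('car', 0.31), ('plane', 0.23)]
# cluster_1 [('boat', 0.22), ('car', 0.01)]
```

Each cluster row is handled on its own, so the "group by cluster label" step falls out of the loop rather than needing a separate aggregation.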
I tried using both Transpose and Pivot on the Extract Cluster Prototypes results, but can't seem to get to what I think I need. I need help w/ that or some other way to generate descriptive labels for the resulting clusters.
Thanks
Hi,
attached is my process to do a similar thing. I usually apply a feature selection technique in a one-vs-all fashion. This answers the question "what are the most distinguishing attributes for this cluster?". I use the top 3 features (= words) as a new name for the cluster.
Taking the cluster centroid is a bit problematic. Just because it has a high value somewhere does not make this attribute important for the cluster.
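To illustrate the one-vs-all idea outside RapidMiner, here is a minimal Python sketch. It is not the attached process. It assumes Pearson correlation as the weight, which is only one possible choice, and the data and the names `pearson` and `distinguishing_terms` are made up for the example. Note that "plane" has a high value in both clusters, so a centroid would rank it highly, yet it does nothing to tell the clusters apart.

```python
import math

# Hypothetical TF-IDF rows with a cluster label; the values are invented.
examples = [
    {"cluster": "cluster_0", "boat": 0.0, "car": 0.9, "plane": 0.5},
    {"cluster": "cluster_0", "boat": 0.1, "car": 0.8, "plane": 0.4},
    {"cluster": "cluster_1", "boat": 0.9, "car": 0.1, "plane": 0.5},
    {"cluster": "cluster_1", "boat": 0.8, "car": 0.0, "plane": 0.4},
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def distinguishing_terms(rows, cluster, n):
    """Rank terms by correlation with a one-vs-all indicator for the given cluster."""
    # 1.0 for examples in the cluster of interest, 0.0 for all others.
    target = [1.0 if row["cluster"] == cluster else 0.0 for row in rows]
    terms = [name for name in rows[0] if name != "cluster"]
    weights = {t: pearson([row[t] for row in rows], target) for t in terms}
    return sorted(terms, key=lambda t: weights[t], reverse=True)[:n]


print(distinguishing_terms(examples, "cluster_0", 1))  # ['car']
print(distinguishing_terms(examples, "cluster_1", 1))  # ['boat']
```

Joining the top few terms returned for each cluster gives a readable cluster name, in the spirit of the top-3 naming described above.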
Best,
Martin
Dortmund, Germany
Martin,
Thanks for the guidance and sample process. I added your term weighting, concatenation, filtering, and dictionary lookup to my process, but it fails to run. I get an error dialog that says something like "Process failed. There are no obvious errors but you should run in debug mode or check the log"
Here is the log:
Dec 7, 2016 12:06:38 PM INFO: Loading initial data.
Dec 7, 2016 12:06:38 PM INFO: Process //Local Repository/processes/datarole/clustering starts
Dec 7, 2016 12:06:51 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Dec 7, 2016 12:06:51 PM SEVERE: Here:
Dec 7, 2016 12:06:51 PM SEVERE: Process[1] (Process)
Dec 7, 2016 12:06:51 PM SEVERE: subprocess 'Main Process'
Dec 7, 2016 12:06:51 PM SEVERE: +- Retrieve MABostonPlumbing[1] (Retrieve)
Dec 7, 2016 12:06:51 PM SEVERE: +- Sample[1] (Sample)
Dec 7, 2016 12:06:51 PM SEVERE: +- Select Attributes[1] (Select Attributes)
Dec 7, 2016 12:06:51 PM SEVERE: +- Process Documents from Data[1] (Process Documents from Data)
Dec 7, 2016 12:06:51 PM SEVERE: subprocess 'Vector Creation'
Dec 7, 2016 12:06:51 PM SEVERE: | +- Transform Cases (3)[200] (Transform Cases)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Tokenize (3)[200] (Tokenize)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Filter Stopwords (English)[200] (Filter Stopwords (English))
Dec 7, 2016 12:06:51 PM SEVERE: | +- Stem (Snowball)[200] (Stem (Snowball))
Dec 7, 2016 12:06:51 PM SEVERE: | +- Generate n-Grams (Terms)[200] (Generate n-Grams (Terms))
Dec 7, 2016 12:06:51 PM SEVERE: ==> +- X-Means[1] (X-Means)
Dec 7, 2016 12:06:51 PM SEVERE: +- Set Role[0] (Set Role)
Dec 7, 2016 12:06:51 PM SEVERE: +- Loop Values[0] (Loop Values)
Dec 7, 2016 12:06:51 PM SEVERE: subprocess 'Iteration'
Dec 7, 2016 12:06:51 PM SEVERE: | +- Replace[0] (Replace)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Replace (2)[0] (Replace)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Weight by Correlation[0] (Weight by Correlation)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Weights to Data[0] (Weights to Data)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Filter Example Range[0] (Filter Example Range)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Aggregate[0] (Aggregate)
Dec 7, 2016 12:06:51 PM SEVERE: | +- Generate Attributes[0] (Generate Attributes)
Dec 7, 2016 12:06:51 PM SEVERE: +- Append[0] (Append)
Dec 7, 2016 12:06:51 PM SEVERE: +- Replace (Dictionary)[0] (Replace (Dictionary))
Dec 7, 2016 12:06:51 PM SEVERE: java.lang.ArrayIndexOutOfBoundsException
I also notice that I have an error on WeightByCorrelation within LoopValues. It says "metadata.error.missing_role". However, the input data does have an attribute (cluster) whose role is "label" (applied using the SetRole operator in the parent process). I can verify this on the input connector.
I'm attaching the current process XML. I need to see if I can also post sample data.
Also, the same (I think) missing role error is propagated into the parent process on the LoopValues operator input where it says "The attribute 'cluster' is missing in the input data set", but you can see in the screenshot that it is present.