Set of unique strings - ways to organize and structure, group related elements?
Hi. I need to automate a task.
I have a list of strings (each line is a keyword, i.e. a search query) that looks like this:
cord to connect laptop to tv
how do i connect my laptop to my tv
cable to connect laptop to tv
how to connect laptop to smart tv
connect laptop to tv hdmi windows 10
...
Each of these strings is unique, in that none of them exactly matches any other, but most of them can be grouped by topic, and most of the topics can be further split into subtopics, and so on. That's what I want to do. I also want as many different groupings as possible, to see all the ways these elements relate to each other. I would like to extract any information that would help to organize and structure this data set.
I already know how to calculate word frequency for my lists in RapidMiner (RM). That's a start. I can use the most frequent relevant tokens as topic candidates for manual grouping. But I'm not sure where to go from there if I want to do it automatically. The problem is that all the clustering examples I find online deal with documents, not with lists of strings, and I don't think any of those techniques can be used in my case.
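To make the "frequent tokens as topic candidates" idea concrete, here is a minimal Python sketch (outside RapidMiner). It counts in how many queries each token occurs and then lists every query under each candidate token it contains, so one query can appear in several groups. The stop-word list, the `max_share` cutoff and the function names are assumptions made for illustration, not part of any tool mentioned here.

```python
from collections import Counter, defaultdict

# Sample queries copied from the list above.
queries = [
    "cord to connect laptop to tv",
    "how do i connect my laptop to my tv",
    "cable to connect laptop to tv",
    "how to connect laptop to smart tv",
    "connect laptop to tv hdmi windows 10",
]

# Hypothetical stop-word list; extend it for real data.
STOP_WORDS = {"to", "how", "do", "i", "my"}


def tokenize(query):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in query.lower().split() if tok not in STOP_WORDS]


def document_frequency(queries):
    """Count in how many queries each token occurs (once per query)."""
    return Counter(tok for q in queries for tok in set(tokenize(q)))


def group_by_token(queries, max_share=0.8):
    """Map each candidate token to the queries that contain it.

    Tokens found in more than max_share of all queries are skipped,
    because a token shared by nearly every query separates nothing.
    """
    doc_freq = document_frequency(queries)
    groups = defaultdict(list)
    for q in queries:
        for tok in set(tokenize(q)):
            if doc_freq[tok] / len(queries) <= max_share:
                groups[tok].append(q)
    return dict(groups)


doc_freq = document_frequency(queries)
groups = group_by_token(queries)
for token in sorted(groups):
    print(token, "->", groups[token])
```

On this tiny sample, "connect", "laptop" and "tv" occur in every query and are dropped as uninformative, while the last query is listed under "hdmi", "windows" and "10" at once. On a real list the cutoff and the stop words would need tuning.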
Answers
Hi,
Well, you can just do clustering or topic detection on the bag of words of your short search strings. There is absolutely no reason not to do this. The only question is how you define similarity between two bags of words.
Attached is an example doing both clustering and topic detection on your example data. It needs the Operator Toolbox extension to run.
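As a rough illustration of the similarity question (this is not the attached RapidMiner process, just a Python sketch): one common choice is Jaccard similarity, the number of shared tokens divided by the number of distinct tokens in both strings. The sketch below treats each bag of words as a set, so repeated words count once, and merges queries into the same cluster whenever a chain of pairs with similarity at or above a threshold connects them (single-linkage grouping). The threshold value of 0.6 and the function names are assumptions.

```python
from itertools import combinations

# Sample queries copied from the question.
queries = [
    "cord to connect laptop to tv",
    "how do i connect my laptop to my tv",
    "cable to connect laptop to tv",
    "how to connect laptop to smart tv",
    "connect laptop to tv hdmi windows 10",
]


def bag(query):
    """Bag of words reduced to a set: word counts are ignored."""
    return set(query.lower().split())


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens in a and b."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(queries, threshold=0.6):
    """Single-linkage grouping: link every pair at or above the threshold.

    Returns one integer label per query; equal labels mean same cluster.
    """
    parent = list(range(len(queries)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    bags = [bag(q) for q in queries]
    for i, j in combinations(range(len(queries)), 2):
        if jaccard(bags[i], bags[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(queries))]


labels = cluster(queries)
for label, query in sorted(zip(labels, queries)):
    print(label, query)
```

With this threshold only the "cord" and "cable" queries end up together, since they differ in a single word. Lowering the threshold merges more queries; removing stop words first would change the similarities as well.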
BR,
Martin
Dortmund, Germany
OK, thanks. That looks like what I'm after.
So I now have, as a result, the list of IDs of all strings, with a cluster assigned to every ID.
Now, how do I get a column with the actual string in that results table? IDs alone are not very helpful. Or am I missing something?
Thanks again for your help.
Hi @far_in_out,
I think you just need to tick "keep text" in the Process Documents operator. Alternatively, join the result back to the original table using the Join operator.
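For anyone doing the same step outside RapidMiner, joining back on the ID amounts to the following Python sketch. The two small tables, the column names and the cluster labels are made up for illustration.

```python
# Hypothetical clustering output: one row per ID, without the text.
clustered = [
    {"id": 1, "cluster": "cluster_0"},
    {"id": 2, "cluster": "cluster_1"},
    {"id": 3, "cluster": "cluster_0"},
]

# Hypothetical original table: the same IDs with the actual strings.
original = [
    {"id": 1, "text": "cord to connect laptop to tv"},
    {"id": 2, "text": "how do i connect my laptop to my tv"},
    {"id": 3, "text": "cable to connect laptop to tv"},
]


def inner_join(left, right, key):
    """Merge rows of left and right that share the same value for key."""
    index = {row[key]: row for row in right}  # look up right rows by key
    return [
        {**row, **index[row[key]]}
        for row in left
        if row[key] in index
    ]


joined = inner_join(clustered, original, "id")
for row in joined:
    print(row["id"], row["cluster"], row["text"])
```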
BR,
Martin
Ok, thanks. That worked.
Are there clustering algorithms in RapidMiner that can put one element into multiple clusters?
Hi,
I think Expectation Maximization does that. But honestly, I would rather consider LDA in that case. It's not exactly clustering, but it also assigns documents to multiple topics.
BR,
Martin