Topic extraction in RapidMiner
Hello everyone. I am new to RapidMiner.
I've been googling but haven't found a way to do this yet. Is there a way for RapidMiner to detect the topic of a set of documents and extract it? Is there also a way to compute, for each document, a similarity score that shows how well it matches a specific keyword? If there is, could someone describe it or link me to a thread on the subject? Thanks.
Answers
Although I am no expert in text mining, your question can be solved by following the usual pattern, as described for instance at http://vancouverdata.blogspot.be/2010/11/text-analytics-with-rapidminer-loading.html. The topic of a document is related to its tags, if available, or to the keywords you quantify through text mining.
Cheers
Sven
There are most likely solutions for you inside RapidMiner. I would say there are basically three ways to go:
- Supervised learning
If you have documents with a tag (e.g. China), you can go for supervised learning and build a model for each tag that detects the corresponding topic. If you have tagged data, I would go this way. The tutorial above should help you with this.
- Clustering
If you do not have tagged examples, you can go for clustering, which groups similar documents together. Most likely you want to use either K-Means or K-Medoids for this task. The problems here are: how many topics are we searching for, and how do we interpret the results? And, as with tags, a text might belong to more than one topic (e.g. Hotel and China).
- Simple similarity
You can calculate the similarity between two texts using cross distances. This might be helpful in a lot of cases.
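To make the similarity idea concrete outside RapidMiner, here is a minimal sketch in plain Python. It is not how RapidMiner computes cross distances internally; it assumes TF-IDF term weights and cosine similarity as the measure, and the function names (`tokenize`, `tfidf_vectors`, `keyword_scores`) and sample documents are made up for this illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    # A real process would usually also remove stop words and apply stemming.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (a dict term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_scores(docs, keyword):
    # Score every document against the keyword, treated as a tiny query document.
    vectors, idf = tfidf_vectors(docs)
    query = {
        t: count * idf[t]
        for t, count in Counter(tokenize(keyword)).items()
        if t in idf  # terms that never occur in the corpus carry no weight
    }
    return [cosine(v, query) for v in vectors]


if __name__ == "__main__":
    sample = [
        "China raises interest rates",
        "Hotel prices in Paris fall",
        "China hotel market grows",
    ]
    for doc, score in zip(sample, keyword_scores(sample, "china")):
        print(f"{score:.3f}  {doc}")
```

A document that never mentions the keyword scores exactly zero here, so in practice a threshold on the score would have to be tuned on real data.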
Cheers,
Martin
Dortmund, Germany
The analysis I am doing involves downloading thousands of documents from a database via a headline search. The thing is, just because the headline contains a certain word does not mean the document is about that topic, hence the topic search. My idea is to cluster the documents in RapidMiner using a suitable value of k and then take the cluster with the largest number of documents as the most topical one. The reasoning is this: if, say, 10,000 documents in a database all have the word "china" in the title, then the cluster whose documents are most closely related probably has something to do with the heading/search term. The documents are financial. I want to ask, from your experience, whether this is a viable way to determine the topic of financial documents through clustering. Thank you for your advice.
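The plan described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not a RapidMiner process: it assumes L2-normalised TF-IDF vectors, a basic k-means with a fixed random seed, and that "most topical" simply means the cluster with the most members. The helper names (`tfidf_matrix`, `kmeans`, `largest_cluster`) and the sample headlines are hypothetical.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_matrix(docs):
    # Return (vocabulary, rows); each row is a dense, L2-normalised TF-IDF vector.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(docs)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    rows = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for t, count in Counter(tokens).items():
            row[index[t]] = count * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=50, seed=0):
    # Basic k-means: k must not exceed the number of rows.
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def largest_cluster(labels):
    # Cluster id with the most documents (ties broken by first occurrence).
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    headlines = [
        "China central bank cuts rates",
        "China bank lending rises",
        "China bank rates unchanged",
        "China hotel chain expands",
        "Hotel bookings in China climb",
        "China pandas arrive at zoo",
    ]
    _, rows = tfidf_matrix(headlines)
    labels = kmeans(rows, k=2)
    top = largest_cluster(labels)
    print("largest cluster:", [h for h, lab in zip(headlines, labels) if lab == top])
```

Note that the largest cluster is not guaranteed to be the one about the search term; k-means results also depend on k and on the initial centroids, so the outcome should be inspected rather than trusted blindly.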
Cheers,
BadBoy20
A small tip: it is often useful to add a supervised-learning feature selection step after your clustering. The result tells you which words make a cluster different from the others. I would use a one-vs-all strategy here.
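As a rough illustration of this tip, the sketch below treats the cluster labels as classes and ranks words for one cluster against all the others. It assumes a chi-squared score on word presence as the feature-selection criterion, which is one possible choice and not necessarily the weighting you would pick in RapidMiner; `distinguishing_words` and the sample data are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def distinguishing_words(docs, labels, target, top_n=5):
    # One-vs-all: documents in cluster `target` versus all other documents.
    token_sets = [set(tokenize(d)) for d in docs]
    n = len(docs)
    n_in = sum(1 for lab in labels if lab == target)
    n_out = n - n_in
    in_df = Counter(t for s, lab in zip(token_sets, labels) if lab == target for t in s)
    out_df = Counter(t for s, lab in zip(token_sets, labels) if lab != target for t in s)

    scores = {}
    for term in in_df:
        a = in_df[term]   # in target cluster, contains the word
        b = out_df[term]  # in other clusters, contains the word
        c = n_in - a      # in target cluster, lacks the word
        d = n_out - b     # in other clusters, lacks the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue  # word occurs everywhere, or one side is empty
        if a * d <= b * c:
            continue  # word is not over-represented in the target cluster
        # Chi-squared statistic for the 2x2 contingency table.
        scores[term] = n * (a * d - b * c) ** 2 / denom

    # Highest score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


if __name__ == "__main__":
    docs = [
        "china bank rates",
        "china bank loans",
        "hotel china travel",
        "hotel travel deals",
    ]
    labels = [0, 0, 1, 1]  # e.g. the output of a previous clustering step
    for cluster in sorted(set(labels)):
        print(cluster, distinguishing_words(docs, labels, cluster))
```

Reading the top words per cluster is a quick way to check whether a cluster actually corresponds to the search term or to something unrelated.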
cheers,
Martin
Dortmund, Germany