Autotagging and autocategorizing text pieces
Hello RapidMiner community!
First of all, thank you for taking the time to read my question. Secondly, I apologize for my ignorance. I am totally new to data mining, and I have looked around the community but did not find any other post answering my question. Perhaps this is because of my lack of knowledge. Okay, so this is my problem:
I have around 5000 text pieces. I have categorized and tagged them. I want to build a rulebook that can autotag and autocategorize new text pieces. I have about 600 tags and about 20 categories. Every snippet can have different tags but only one category. Specifically, I want:
- to analyze the text so I can automatically give the snippet the correct tags (up to 4) from a list I have made myself.
- to analyze the text (or analyze the tags, whichever is easier) and find rules for putting the snippets in a category automatically.
I have no idea how to even begin this process, and I would be forever grateful if someone would be willing to guide me through it!
Answers
Hi @mayageudens
I can advise you on the second part, text categorization (I have done this before as a big project for categorizing web sites based on their content and detecting restricted categories like adult, drugs, weapons etc.). I am not ready at this moment to advise on tagging the texts, as it seems to be a pretty different task that I haven't approached before.
1. Start by installing the "Text Processing" RapidMiner extension from the Marketplace, as this is going to be your main tool.
2. Study the operators "Process Documents from Files" and "Process Documents from Data", and pick one depending on the way your text data is stored. I used the first one, as I had all the data stored in text files which were then read by this operator.
3. The important thing is that you have to vectorize the text data before you can classify it. I used TF-IDF for creating word vectors from the text files.
4. For classifying the text documents, I found that even the simple k-NN classification algorithm can produce really good results.
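For readers who want to see what steps 3 and 4 do under the hood, here is a minimal plain-Python sketch of TF-IDF vectors plus k-NN voting. It is not the RapidMiner process itself: the tokenizer, the smoothed IDF formula, the cosine similarity measure, the toy event texts and all function names are assumptions made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(token_lists):
    """Smoothed inverse document frequency for every term seen in training."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vectors, train_labels, idf, k=3):
    """Majority vote among the k most similar training documents."""
    query = tfidf_vector(tokenize(text), idf)
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy data: two short event descriptions per category.
train_texts = [
    "live band concert with guitar and drums",
    "jazz concert tonight with a live saxophone",
    "pottery workshop for beginners learn to work with clay",
    "painting workshop learn watercolor techniques",
    "dj set at the club dance party all night",
    "techno party with guest dj in the club",
]
train_labels = ["concert", "concert", "workshop", "workshop", "party", "party"]

token_lists = [tokenize(t) for t in train_texts]
idf = fit_idf(token_lists)
train_vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]

prediction = knn_predict("learn to paint in this beginners workshop",
                         train_vectors, train_labels, idf, k=3)
print(prediction)
```

With only six training texts this is just a toy. On real data you would most likely also want stop-word removal and stemming, which the Text Processing extension provides as operators.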
Here are also some screenshots of the process I used for the task. Simply copying the structure will not necessarily do the trick on your data, but it can at least give you plenty of hints about how to approach the problem.
[Screenshot: whole process]
[Screenshot: Process Documents from Files]
[Screenshot: vectorizing settings]
[Screenshot: labelling and files structure (I used a separate directory for storing documents for each category)]
[Screenshot: cross validation]
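The folder-per-category layout and the cross-validation step can be sketched roughly as follows in plain Python. This is only an illustration under stated assumptions: the function names, the `.txt` file extension, the toy documents and the majority-class placeholder model are all invented here, and RapidMiner's own Cross Validation operator may split the data differently.

```python
import random
import tempfile
from collections import Counter
from pathlib import Path


def load_labelled_documents(root):
    """Read every .txt file one level below root; the folder name is the label."""
    texts, labels = [], []
    for path in sorted(Path(root).glob("*/*.txt")):
        texts.append(path.read_text(encoding="utf-8"))
        labels.append(path.parent.name)
    return texts, labels


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices once and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(texts, labels, train_and_predict, k=5, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once.

    Assumes k is not larger than the number of documents.
    """
    accuracies = []
    for held_out in k_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        predictions = train_and_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in held_out],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, held_out))
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


def majority_baseline(train_texts, train_labels, test_texts):
    """Placeholder model: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * len(test_texts)


# Build a throwaway directory tree with one folder per (invented) category.
with tempfile.TemporaryDirectory() as root:
    samples = {
        "concert": ["live jazz concert", "rock band on stage"],
        "workshop": ["pottery workshop", "learn to paint"],
        "party": ["dj set in the club", "dance party all night"],
    }
    for category, docs in samples.items():
        folder = Path(root) / category
        folder.mkdir()
        for n, doc in enumerate(docs):
            (folder / f"{n}.txt").write_text(doc, encoding="utf-8")
    texts, labels = load_labelled_documents(root)

accuracy = cross_validate(texts, labels, majority_baseline, k=3)
print(f"{len(texts)} documents, mean accuracy {accuracy:.2f}")
```

Any function with the same three arguments as `majority_baseline` can be passed in instead, so a real classifier can be compared against this baseline on the same folds.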
I am also attaching the slides about the whole project, which I presented at the RapidMiner Wisdom 2015 conference in Ljubljana. They might be another useful source of information.
Vladimir
http://whatthefraud.wtf
Agreed that the full scope of everything you have requested would be quite a complicated project, and quite likely beyond the scope of a forum answer. Thanks to @kypexin for a great starting point of resources!
A few additional comments/questions for your consideration:
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you for answering!
I realize now that this project is maybe too big for me to handle or to set up. I will give you guys a little more information. I have a website that takes information about a large number of events and categorizes and tags them. You can see the website here: http://findout.be/.
- As you can see, the tagging is really necessary. The category in itself is not enough to give people enough information about the event.
- Sadly, it is also impossible for me to simplify the categories. However, every event takes place in a venue, and every venue has only about 3 possible categories (a club would almost never organize a workshop). Perhaps this will help me along?
- I really don't need a 'rulebook' if it is possible to set up this system and link it to my website database.
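One way the venue idea from the list above could be used, sketched in plain Python: let any classifier produce a score per category, then only consider the categories that the event's venue allows. The venue names, the category scores and the `pick_category` function are hypothetical and exist only for this illustration.

```python
def pick_category(scores, venue, venue_categories):
    """Return the best-scoring category among those the venue allows.

    Falls back to all categories when the venue is unknown or when none
    of its allowed categories received a score.
    """
    allowed = venue_categories.get(venue)
    candidates = {category: score for category, score in scores.items()
                  if allowed is None or category in allowed}
    if not candidates:
        candidates = scores
    return max(candidates, key=candidates.get)


# Hypothetical lookup table: venue -> the few categories it ever hosts.
venue_categories = {
    "Club Example": {"party", "concert"},
    "Community Center Example": {"workshop", "lecture", "exhibition"},
}

# Hypothetical classifier output for one event description.
scores = {"workshop": 0.5, "party": 0.3, "concert": 0.2}

print(pick_category(scores, "Club Example", venue_categories))
print(pick_category(scores, "Unknown Venue", venue_categories))
```

In this example the text alone would suggest "workshop", but because the club only ever hosts parties and concerts, the restricted choice is "party". For an unknown venue the unrestricted best score is used.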
What do you guys think would be the best way to achieve this? I think I have realized that I need help. I would be okay with spending some money on this, but my budget is very, very limited.
I truly appreciate the help you've already given me!
One other option you have would be to post this as a project in the RapidMiner Experfy data science channel: https://www.experfy.com/channels/rapid-miner/marketplaces
There you can post a brief project description and your requirements, provide some sample data, state your budget and timeframe, and invite qualified data scientists to bid on the project. You'll probably be pleasantly surprised as to what you can get there.