Text Classification/Categorization Help
Hey there!
Just looking for some help regarding a project I'm currently working on. I'm very new to RapidMiner and AI in general and I'm looking for some direction.
I have a NoSQL MongoDB database storing 8000 scraped jobs. The main attributes are Description, Title, Text and Keywords, and I have assigned the label "jobs" to all of them.
I want to automatically classify all my jobs into job sectors based on their job titles; for example, a software development job would go into the technology sector. I have no idea how to implement this or how RapidMiner's different classification models work. Any help would be greatly appreciated.
Thanks for reading!
Answers
Hi @1505993,
There are two methods:
1. You can cluster your 8000 scraped jobs into k clusters, where k is the number of job categories you are considering.
For example, if your envisaged categories are "technology sector", "Human Resources sector", "engineering sector" and "Marketing sector", then k = 4.
You can use the k-means operator to do that.
However, it is difficult to say a priori whether this method will be effective on your data, because the clusters are formed from text similarity alone and need not correspond to your sectors.
2. For me, a more reliable method is to train a classification model, but it takes more work:
You first have to train a model (kNN, Naive Bayes, neural network, etc.; it is difficult to say a priori which model is best) on a part of your data, for example 1000 of the 8000 jobs. For this part of your data you have to label the job category (to return to my last example, label your 1000 jobs with "technology sector", "Human Resources sector", "engineering sector" or "Marketing sector"). Then evaluate the performance of your model with the Cross Validation operator, and finally apply the model to your 7000 (8000 - 1000) unlabelled jobs.
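To make method 1 more concrete, here is a minimal sketch of k-means on job titles in plain Python, outside RapidMiner. It is only an illustration under some assumptions: the sample titles are made up, each title is turned into a simple bag-of-words vector, and the helper names (`vectorize`, `kmeans`, etc.) are hypothetical. It is not what the RapidMiner k-means operator does internally.

```python
import math
import random
from collections import Counter


def vectorize(title):
    """Bag-of-words vector for one title, scaled to unit length."""
    counts = Counter(title.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    return sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for word, weight in vec.items():
            total[word] += weight
    return {word: weight / len(vectors) for word, weight in total.items()}


def kmeans(titles, k, iterations=20, seed=0):
    """Return one cluster index (0..k-1) per title."""
    vectors = [vectorize(t) for t in titles]
    # Start from k randomly chosen titles; the result depends on this choice.
    centroids = random.Random(seed).sample(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each title goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            for vec in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


# Made-up sample titles, for illustration only.
titles = [
    "software developer",
    "software engineer",
    "marketing manager",
    "marketing assistant",
    "software developer",
]
clusters = kmeans(titles, k=2)
for title, cluster in zip(titles, clusters):
    print(cluster, title)
```

Note that the output is only a cluster number per job. You still have to look at each cluster and decide which sector (if any) it represents, and a different starting point can give a different grouping.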
I hope it helps,
Regards,
Lionel
I will try to train a classification model and compare each model to see how accurate the results are.
The only concern I have with this method is the labeling of the 1000 jobs. I will write a function in C# and change the labels in the database to the sectors, but doesn't that make the classification model redundant? Couldn't I just do that for all of the jobs?
Appreciate the help, just needed some direction.
Hi @1505993,
I have difficulty understanding:
Do you already have an AI program in C# that can automatically label jobs according to different variables (job title, etc.)?
If that's the case, then indeed you don't need to train a model and you don't need RapidMiner.
But one question: have you evaluated the performance of this program (accuracy = total correct predictions / total predictions)?
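As a small illustration of that accuracy formula, here is a sketch in Python. The function name `accuracy` and the sample labels are made up; the idea is to compare the program's output against a sample of jobs whose sector you have checked by hand.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the hand-checked labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up example: labels produced by the program vs. labels checked by hand.
program_labels = ["technology", "marketing", "technology", "technology"]
checked_labels = ["technology", "marketing", "marketing", "technology"]
print(accuracy(program_labels, checked_labels))  # 3 correct out of 4
```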
To explain my approach in more detail:
1. You first have to label 1000 jobs manually. I insist on "manually" because these 1000 jobs have to be 100% correctly labeled (an AI program can't reach 100% accuracy), and that's why I said "it takes more work".
2. Train several models (kNN, Neural Networks, etc.) on this labeled dataset of 1000 jobs.
3. Evaluate the accuracy of these models using the Cross Validation operator (this accuracy is an estimate of how well your models will perform on unlabelled data).
4. Select the best model and apply it to your unlabelled dataset (your remaining 7000 jobs).
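The four steps above can be sketched in plain Python as follows. This is only an illustration under assumptions, not the RapidMiner process itself: the labelled titles are made up, the model is a small hand-written multinomial Naive Bayes with add-one smoothing (one of several models you could try), and the names `NaiveBayes` and `cross_validate` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(title):
    return title.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes on title words, with add-one smoothing."""

    def fit(self, titles, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for title, label in zip(titles, labels):
            for word in tokenize(title):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, title):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior of the class
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(title):
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def cross_validate(titles, labels, k=3):
    """Mean accuracy over k folds; fold i holds out every k-th example."""
    accuracies = []
    for fold in range(k):
        train_x = [t for i, t in enumerate(titles) if i % k != fold]
        train_y = [y for i, y in enumerate(labels) if i % k != fold]
        held_out = [(t, y) for i, (t, y) in enumerate(zip(titles, labels))
                    if i % k == fold]
        model = NaiveBayes().fit(train_x, train_y)
        correct = sum(model.predict(t) == y for t, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Step 1: manually labelled jobs (made-up sample standing in for the 1000).
labelled = [
    ("software developer", "technology"),
    ("marketing manager", "marketing"),
    ("software engineer", "technology"),
    ("marketing assistant", "marketing"),
    ("web developer", "technology"),
    ("digital marketing executive", "marketing"),
]
train_titles = [t for t, _ in labelled]
train_labels = [y for _, y in labelled]

# Steps 2 and 3: train and estimate accuracy with cross-validation.
print("estimated accuracy:", cross_validate(train_titles, train_labels, k=3))

# Step 4: train on all labelled jobs and apply to the unlabelled ones.
model = NaiveBayes().fit(train_titles, train_labels)
for title in ["junior software developer", "marketing director"]:
    print(title, "->", model.predict(title))
```

To compare several models, you would run the same cross-validation for each one and keep the one with the best estimated accuracy. With only a handful of examples, as in this sketch, the estimate is very rough; it becomes more meaningful with the full 1000 labelled jobs.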
I hope that's clearer.
Regards,
Lionel
Hi @1505993
I did a project on text classification once, so I'll point you to one of my answers in another thread on text classification. I hope it is helpful or inspiring in some way: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/autotagging-and-autocategorizing-text-pieces/m-p/43717/highlight/true#M29049
Vladimir
http://whatthefraud.wtf