Classification of different comments
Hey RapidMiner community!
So, I've spent the last 10 hours or so trying to do what I believe is a very simple task, yet I can't seem to get it right.
I have this Excel file with the columns ActionType | CreatedDate | ActorName | TextValue | Category
This file has around 14,000 rows.
I have manually entered a Category for some of the rows, based on the TextValue, which is a Facebook comment.
I need RapidMiner to assign a Category to the remaining rows in my file, based on the TextValue.
How do I do this the best way?
Thanks a lot!
Answers
hello @mavi16ab - welcome to the community. I'm happy to help here but can you please give me a little more to go on? If you could please post your current XML process (see "Read Before Posting" on right) and some sample rows of data, that would make things go better.
Scott
Hey @sgenzer
Thanks a lot for taking the time to answer my question! Tbh, I'm not really sure how to post the process as an XML file.
But regarding your question for my data, it looks like this in the spreadsheet:
As you can see, based on the TextValue, I chose a corresponding Category. I need RapidMiner to do this same process on all the comments that I have not categorized.
ah I see. OK that's pretty standard text mining.
So for posting XML, the instructions are here when you post a message:
And can you just attach that spreadsheet to a post? You can do it here:
@sgenzer alright, this is what I got for you.
The process:
And I have attached the spreadsheet for you.
Thanks man.
haha PokemonGo. Nice. Well, this should get you going in the right direction...
Good luck!
Scott
EDIT: I should have added the Apply Model part.
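The process itself is not reproduced in this thread. For readers following along outside RapidMiner, here is a rough sketch in plain Python of the general idea: train a text classifier on the rows that already have a Category, then apply it to the rows that don't. This is not Scott's process; the choice of a word-count Naive Bayes classifier, the sample comments, and the category names "Bug" and "Praise" are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase the comment and split it into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


def train(rows):
    # rows: list of (text, category) pairs that were labeled by hand.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in rows:
        class_counts[category] += 1
        for token in tokenize(text):
            word_counts[category][token] += 1
            vocab.add(token)
    return class_counts, word_counts, vocab


def predict(model, text):
    # Multinomial Naive Bayes with add-one smoothing; unseen words are skipped.
    class_counts, word_counts, vocab = model
    total_rows = sum(class_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_rows in class_counts.items():
        score = math.log(n_rows / total_rows)
        denom = sum(word_counts[category].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:
                score += math.log((word_counts[category][token] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical hand-labeled rows (TextValue, Category).
labeled = [
    ("the app keeps crashing on my phone", "Bug"),
    ("game crashes when I open the map", "Bug"),
    ("love this game so much", "Praise"),
    ("best game ever love it", "Praise"),
]
# Hypothetical rows that still need a Category.
unlabeled = ["it crashes on my phone", "love it"]

model = train(labeled)            # the "train" step
predictions = [predict(model, text) for text in unlabeled]  # the "apply model" step
print(list(zip(unlabeled, predictions)))
```

With only four training rows this is a toy, but the two steps (fit on labeled rows, apply to unlabeled rows) are the same shape as the task described above.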
@sgenzer
Alright, I tried to import your XML code, but when trying to use it I get a few errors. Do I need any plugins?
This is what I get:
@sgenzer
As a complete rookie to all this, how do I improve the accuracy? As it stands now, it's about 60%, which is a bit better than what I achieved. How do I train it?
Thanks for taking the time.
Thanks @Telcontar120 for answering the question about the Text Processing extension.
As for how to train the model, that's a much bigger question than one thread can answer. I just wrote that process to whet your appetite. I would strongly suggest you a) go through the built-in tutorials in RapidMiner Studio, and b) go through the "Getting Started with RapidMiner" YouTube video series to begin to answer that. RapidMiner makes data science fast and simple, but it does not do it for you. We're always here to help on the community when you have questions.
Good luck!
Scott
YouTube playlist
@sgenzer
Sorry for contacting you through my thread, but I could find no way to PM you.
I have been working and reading everything I can on RapidMiner, text mining, and classification in general, but no matter how many things I try, I CAN'T get an accuracy above 36%. Please help a desperate student in need.
EDIT: Nvm, I need to be logged in before I can PM you. My bad.
You're getting a bad model because your data is all over the place. I just loaded in @sgenzer's process and your @mavi16ab CSV file.
From the looks of the categories, 56% of your dataset is labeled "Other". What does that mean? The other categories are so small in some cases that the model is suffering from a highly imbalanced dataset. Plus there are all kinds of missing data points too. I would suggest doing some missing value replacements where you can and trying to balance the data set a bit.
Just by getting rid of the missing values and cleaning up the data set I get almost 60% accuracy. Text Processing is a lot of fun but there are so many ways to mess up your model. It requires a lot of up front thinking.
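A minimal sketch of that kind of cleanup, written in plain Python rather than RapidMiner. It is not necessarily what was done above: it simply drops rows with no comment text, separates labeled from unlabeled rows, and prints each category's share so the imbalance is visible. The column names TextValue and Category come from the original post; the sample rows and the "Complaint" label are made up.

```python
from collections import Counter

# Hypothetical rows; None or "" stands for a missing value.
rows = [
    {"TextValue": "Servers are down again", "Category": "Complaint"},
    {"TextValue": "", "Category": "Other"},
    {"TextValue": None, "Category": "Other"},
    {"TextValue": "Click here for free coins", "Category": "Spam"},
    {"TextValue": "Nice update!", "Category": None},  # not labeled yet
    {"TextValue": "lol", "Category": "Other"},
    {"TextValue": "ok", "Category": "Other"},
]


def has_text(row):
    # A row is usable only if it has a non-empty comment.
    text = row["TextValue"]
    return isinstance(text, str) and text.strip() != ""


cleaned = [row for row in rows if has_text(row)]
labeled = [row for row in cleaned if row["Category"]]
unlabeled = [row for row in cleaned if not row["Category"]]

# Class distribution of the labeled rows, as a share of the total.
counts = Counter(row["Category"] for row in labeled)
total = sum(counts.values())
shares = {category: n / total for category, n in counts.items()}

for category, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{category}: {share:.0%}")
```

If one category takes up more than half of the labeled rows, as "Other" does here, that is the imbalance being described.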
Hey @Thomas_Ott
It means so much that you took some time to go through my thread - what an awesome community this is! Regarding the dataset, I figured the balance of the categories was very skewed, which is why I have corrected it between my replies - I should obviously have noted this. The "Other" category is kinda like a "catch-all", so that if a post doesn't fit in any of the categories it will go to "Other". To be honest, I don't have much knowledge of optimizing the data, as it's actually used for a school business project, and I have no prior experience within this field, which is why I have tried to learn as much as possible these last few weeks (to no avail).
I did manage to get an accuracy of 70% myself, but then it just placed everything in the "Other" category, which was obviously not the point.
Perhaps you could guide me through the steps you have taken?
Thanks again man, means a lot!
The biggest problem I see is the number of classes you have relative to the small sample size of each class. What I would look at is either 1) consolidating the classes into maybe a total of three or four classes, or 2) getting more examples for each class. The learner is making too broad a generalization for your data set, so it lumps everything into the "Other" category.
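Option 1 can be sketched in a few lines of Python. The mapping below is purely hypothetical (only "Other" and "Spam" are category names mentioned in this thread); the point is just that several fine-grained labels are mapped onto a handful of broader ones before training.

```python
from collections import Counter

# Hypothetical mapping from fine-grained labels to broader ones.
# Labels not listed here are kept unchanged.
MERGE = {
    "Bug report": "Complaint",
    "Server issue": "Complaint",
    "Feature request": "Suggestion",
    "Idea": "Suggestion",
}


def consolidate(label):
    # Return the broader class for a label, or the label itself if unmapped.
    return MERGE.get(label, label)


original = [
    "Bug report", "Server issue", "Server issue",
    "Feature request", "Idea", "Other", "Other", "Spam",
]
merged = [consolidate(label) for label in original]

print(Counter(original))  # six classes, several with one or two rows
print(Counter(merged))    # four classes, each with more rows
```

Which labels belong together is a judgment call about the data, so the mapping should come from reading the comments, not from the counts alone.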
Also, a word of caution. Don't rely solely on the 'accuracy' performance. It can be misleading when you have imbalanced datasets. Look at the precision and recall stats in your confusion matrix too. They will help you identify which classes are being correctly classified and which are not. That, in itself, can be a clue to help you build a better model.
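To make that concrete, here is a small Python sketch with made-up labels. It mimics a model that predicts "Other" for every row: accuracy comes out at 70%, yet recall for every other class is zero. The data and the "Complaint" label are assumptions for illustration; precision is reported as 0.0 when a class is never predicted.

```python
from collections import Counter


def per_class_metrics(actual, predicted):
    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[a] += 1  # actual class a was missed
    metrics = {}
    for label in sorted(set(actual) | set(predicted)):
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / predicted_total if predicted_total else 0.0,
            "recall": tp[label] / actual_total if actual_total else 0.0,
        }
    return metrics


# Hypothetical test set: 7 "Other", 2 "Complaint", 1 "Spam".
actual = ["Other"] * 7 + ["Complaint"] * 2 + ["Spam"]
# A model that lumps everything into "Other".
predicted = ["Other"] * 10

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
metrics = per_class_metrics(actual, predicted)

print(f"accuracy: {accuracy:.0%}")
for label, m in metrics.items():
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

The accuracy number alone looks respectable; the per-class recall shows that "Complaint" and "Spam" are never found.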
Thanks again @Thomas_Ott.
I did follow your instructions and added more examples to each of the categories, as you can see in my attached data set. Still, it's not moving my accuracy by much, so I assume I must be doing something fundamentally wrong. I'm hoping you could add some more insight.
Thanks again!
Like I said before, look at your classes again and work backwards from there. Do you really need that many classes? It appears that the Spam class gets lumped in with the Other class a lot. Is there anything different about them? Plus, have you tried an ensemble model? You could do a combination of Voting, Bagging, or Boosting using different algos.
@Thomas_Ott technically, I guess I don't need the Spam category. Also, I was wondering if it would help to remove the "Other" category?
You said that you reached about 60% accuracy after some cleaning. What did you do?
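As a rough illustration of the Voting idea only (Bagging and Boosting are not shown), here is a plain-Python sketch that combines the predictions of three models by majority vote. The three prediction lists are invented; in practice each would come from a different learner trained on the same data.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    # predictions_per_model: one list of predicted labels per model,
    # all of the same length and in the same row order.
    voted = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the most votes; on a tie,
        # the label that appears first among the votes wins.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Hypothetical predictions from three different models for three comments.
model_a = ["Other", "Spam", "Complaint"]
model_b = ["Other", "Other", "Complaint"]
model_c = ["Spam", "Other", "Other"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Voting only helps when the models make different mistakes, so combining three learners that all default to "Other" will not fix the imbalance by itself.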
Thanks for your time and help.