"Text mining from Excel file and Split validation"
Hi,
thanks to my teacher I've entered the fantastic world of RapidMiner. I love it, even though I'm still a newbie.
I'm trying to build a text classification model starting from an Excel file with two columns:
Column 1: attribute (text)
Column 2: label (binomial: simply 0 for negative review and 1 for positive review)
Up till now we have only worked with positive reviews stored as txt files in one folder and negative reviews stored as txt files in another folder, and we defined the two folders as the positive class and the negative class.
I've tried to proceed like this: Read Excel -> Process Documents (tokenize, remove stopwords, transform case) -> Validation (training with SVM + Apply Model and Performance).
I've used Nominal to Numerical to avoid SVM capability problems, but as a result I get only the root mean squared error in the performance vector.
I was looking for the accuracy of my model instead... sorry for the bad question, I hope somebody can help.
Can I use a txt file as an alternative? See attached file.
Thanks a lot in advance.
Answers
Hi @federico_schiro,
Have you tried using the Performance (Classification) or Performance (Binominal Classification) operator as the Performance operator?
Regards,
Lionel
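To illustrate why the choice of performance operator matters, here is a minimal Python sketch (not RapidMiner code) that computes both measures on the same predictions. The labels are made up for illustration, and the function names are hypothetical. Accuracy counts exact class matches, while root mean squared error treats the 0/1 labels as numbers, which is a regression-style view.

```python
import math

def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

def rmse(true_values, predicted_values):
    """Root mean squared error, a regression-style measure."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy(y_true, y_pred))  # 6 of 8 correct
print("rmse:", rmse(y_true, y_pred))
```

Both numbers describe the same predictions, but only accuracy answers the question "how often is the predicted class right?".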
Thanks a lot, I got what I was looking for. Wow.
Do you think I'm likely to get a better accuracy if I work with more reviews as corpus? I have 2000 more reviews (from Amazon and IMDb).
With the Yelp reviews I have 56%.
Thanks a lot, it works.
I have a question regarding the degree of accuracy. I got 56% here. Is it possible to raise it by adding more reviews to my corpus?
Hopefully the other reviews won't make it worse. Do you think it makes sense to work with 2000 more reviews from 2 different platforms, or would that make things worse?
Thanks a lot again.
There are lots of ways to possibly improve your classification results. Some that could help right off the bat are pruning, n-grams, and filtering out short words (tokens with few characters). You might want to review how you tokenize the words too. If you have lots of numbers in the corpus, the default tokenization parameter of 'non letters' will wipe those out.
Next you can try another algorithm, like Linear SVM or Deep Learning. I would use them in conjunction with a Cross Validation, not a Split Validation.
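As a rough illustration of what pruning, n-grams, and short-word filtering do to the vocabulary, here is a small Python sketch. It is not how RapidMiner implements these steps; the function names, the toy reviews, and the 30%/70% thresholds are assumptions chosen so the effect is visible on four documents.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent token pairs joined with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def prune_vocabulary(docs, min_fraction, max_fraction):
    """Keep terms whose document frequency lies within the given fractions."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term for term, df in doc_freq.items()
            if min_fraction <= df / n_docs <= max_fraction}

# Hypothetical toy corpus.
reviews = [
    "the food was not good",
    "the service was good",
    "the food was great",
    "not good at all",
]

docs = []
for text in reviews:
    # Drop tokens shorter than 3 characters, then add bigrams.
    tokens = [t for t in text.lower().split() if len(t) >= 3]
    docs.append(tokens + bigrams(tokens))

vocab = prune_vocabulary(docs, min_fraction=0.3, max_fraction=0.7)
print(sorted(vocab))
```

Terms that appear in almost every document (like "the") or in only one document are dropped, while a bigram such as "not_good" survives and keeps the negation attached to the word it modifies.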
Fantastic, thanks a lot. You people are very supportive.
I forgot about stemming. Using a Stem (Porter) operator, I got 3% more accuracy.
Do you think 3%-30% pruning is OK, or can I change it to get better results?
What are the options with Tokenize? I've selected "non letters", the default.
Text processing is almost as much an art form as it is analytics, and it requires some thinking from the domain expert. I don't know what corpus you're trying to classify, but sometimes 3%/30% pruning is right, other times 5%/80% is good. The short answer is that it depends.
Of course, if you used an Optimize Parameters operator, you could tune the actual pruning percentages to find the values that give the best performance measure.
With respect to tokenization, I talk about that in my video here: https://www.youtube.com/watch?v=ia2iV5Ws3zo. I do a lot of Twitter mining, so a hashtag like #datascience would be obliterated using the non letters parameter. Whereas with specify characters, I could just split on ".,![]"
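The difference between the two tokenization modes can be sketched in plain Python with regular expressions. This only mimics the behaviour described above and is not RapidMiner's implementation; the function names and the example tweet are hypothetical, and splitting on whitespace in the second function is an added assumption.

```python
import re

def tokenize_non_letters(text):
    """Split on every run of non-letter characters (digits and '#' are lost)."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def tokenize_on_characters(text, split_chars=".,![]"):
    """Split only on whitespace and the listed characters."""
    pattern = "[" + re.escape(split_chars) + r"\s]+"
    return [t for t in re.split(pattern, text) if t]

tweet = "Loving #datascience in 2018, thanks!"  # hypothetical example

print(tokenize_non_letters(tweet))
# ['Loving', 'datascience', 'in', 'thanks']

print(tokenize_on_characters(tweet))
# ['Loving', '#datascience', 'in', '2018', 'thanks']
```

With the character-based split, the hashtag and the number both survive as tokens.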
Another question:
so far I still haven't fully understood what the blue curve here (ROC threshold) represents.
I get what the red one expresses, but what about the blue one?
(I know, my accuracy isn't that great, that's why my red curve looks like that, "cringe")
Thanks!
Hi,
it represents a confidence threshold. The ROC curve is calculated like this:
Take a confidence threshold of 0.99 and calculate TPR/FPR for it -> data point
Take a confidence threshold of 0.98 and calculate TPR/FPR for it -> data point
...and so on for lower thresholds.
The red curve shows the TPR/FPR values. The blue curve shows the corresponding thresholds needed to get these values.
Best,
Martin
Dortmund, Germany
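The threshold sweep described above can be sketched in a few lines of Python. The labels, confidences, and threshold list are invented for illustration, and `roc_points` is a hypothetical helper, not a RapidMiner function. Each threshold yields one (FPR, TPR) pair, which is one point on the red curve; the threshold that produced it is what the blue curve records.

```python
def roc_points(true_labels, confidences, thresholds):
    """For each threshold, return (threshold, FPR, TPR)."""
    positives = sum(true_labels)
    negatives = len(true_labels) - positives
    points = []
    for thr in thresholds:
        # Predict positive when the confidence reaches the threshold.
        predicted = [1 if c >= thr else 0 for c in confidences]
        tp = sum(1 for y, p in zip(true_labels, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(true_labels, predicted) if y == 0 and p == 1)
        points.append((thr, fp / negatives, tp / positives))
    return points

# Hypothetical data: true label and confidence for the positive class.
y_true = [1, 1, 0, 1, 0, 0]
conf = [0.95, 0.80, 0.70, 0.60, 0.40, 0.10]

points = roc_points(y_true, conf, [0.9, 0.75, 0.65, 0.5, 0.3, 0.0])
for thr, fpr, tpr in points:
    print(f"threshold={thr:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold marks more examples as positive, so TPR and FPR can only stay the same or rise together; a very high threshold gives few true positives and few false positives.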
Hi Martin, and thanks for the answer.
I can understand what the ROC curve is, but it's the threshold curve (blue) that confuses me.
I watched several videos about it and thought this: when the threshold is high, I have a higher TPR (because I "accept" only high predicted probabilities, so it's easier to get predictions right), whereas when the threshold is low (for instance <0.5 predicted probability) I see a higher FPR.
Also, I tried TF-IDF without pruning, and my accuracy skyrocketed!
Hi,
the threshold does not tell you much about your performance. Read it like this: if you want to get this TPR/FPR value, you need to use the threshold shown by the blue curve.
Does this make more sense?
Best,
Martin
Dortmund, Germany