Binary text classification - help with our process needed.
Hey guys,
We want to do binary classification on a text data set with a class distribution of 80% negative and 20% positive. To get statistically meaningful results, we want to evaluate with 10-fold cross-validation.
If we model this within RapidMiner, we are unsuccessful, since it doesn't output any statistical metrics (like precision, recall, etc.).
We found a workaround that works, but it doesn't make sense from an ML perspective: if we first split into training and test sets and then use 10-fold cross-validation, it works. But the train/test split should be part of the cross-validation itself (9 training folds, 1 test fold, 10 iterations). So right now the only way to get this working is to FIRST split into training and test and THEN use X-Validation. Did we model it the right way, or did we miss anything?
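For reference, this is roughly the evaluation we are expecting, sketched in Python/scikit-learn terms (the data and model here are just placeholders, not our actual RapidMiner process):

```python
# Sketch of the 10-fold cross-validation we have in mind; the data and
# model are placeholders, not our actual setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data: roughly 80% negative, 20% positive.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# Each of the 10 iterations trains on 9 folds and tests on the remaining fold,
# so no separate train/test split is needed beforehand.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores.mean())
```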
If you need any more information for helping us, just comment.
Thank you very much in advance.
Best regards!
Answers
Ok, silly question, but did you set a label role in your data set?
This sounds like a strange problem, but it's very hard to troubleshoot from a screenshot of a process--can you post the process itself for review? You can export it from the file menu and attach it as a file.
Thanks,
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hey T-Bone,
yes, I set a label role.
Regards,
Hey Brian,
thank you for your answer.
Here is the process that gives me results but makes no sense:
It would be great if you could help me. If you need any more information, I am happy to provide it.
Best regards,
Thiemo
I would double-check your process; something doesn't appear to be correct, because I can easily extract precision/recall and a confusion matrix.
See the sample XML below. This process takes Tweets, does a bit of processing up front, and generates a random label. The Process Documents from Data operator then processes them to TF-IDF (you can select Binary Occurrences) and spits out the confusion matrix.
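If it helps to see the same idea outside RapidMiner, here is a rough Python/scikit-learn sketch of that kind of pipeline (TF-IDF features, a classifier, 10-fold cross-validation, confusion matrix); the texts and labels are made up purely for illustration:

```python
# Rough scikit-learn equivalent of the described pipeline: TF-IDF features,
# a classifier, 10-fold cross-validation, and a confusion matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "not bad at all", "awful experience"] * 25
labels = [1, 0, 1, 0] * 25  # made-up binary labels, for illustration only

# binary=True roughly corresponds to binary term occurrences; drop it for plain TF-IDF.
pipeline = make_pipeline(TfidfVectorizer(binary=True), LogisticRegression(max_iter=1000))

# Cross-validated predictions, then the confusion matrix over all folds.
predictions = cross_val_predict(pipeline, texts, labels, cv=10)
print(confusion_matrix(labels, predictions))
```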
Hi @thiemo,
I took your original process and modified it only by inputting a simple toy example set using the identical Excel format (since I don't have your original dataset). Then I removed your outer split validation and ran it again using only the cross-validation that you had as an inner operator. And it works fine! Here's the modified process. So if you are having problems, I suspect it must be something strange related to your original dataset. There's nothing that appears to be wrong with the process or with the cross-validation operator. Sorry I couldn't be more definitive.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
And here's the Excel file I used as input in case you are interested.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hey Brian,
thank you very much for your solution. I downloaded the process and the Excel file and tried it, and it works perfectly, but I do not get the performance parameters such as accuracy, recall, precision, and AUC.
How can I use this process and receive those 4 parameters?
Regards,
Thiemo
Hi @thiemo,
I'm not sure what you mean--those performance metrics are all available in the performance tab output from the process when it runs. See the attached screenshot. This is part of the output for the process I supplied with no changes. Of course, the values are useless with my test examples since there are only 10 of them, but you can see that AUC, accuracy, precision, and recall are all available. If you run it on a larger dataset then they should all be there.
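If you ever want to sanity-check those numbers outside RapidMiner, the same four values can be pulled from a cross-validation in Python along these lines (toy data, purely for illustration):

```python
# Accuracy, AUC, precision, and recall from a 10-fold cross-validation
# (toy data for illustration only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=10,
    scoring=["accuracy", "roc_auc", "precision", "recall"],
)
for metric in ("accuracy", "roc_auc", "precision", "recall"):
    print(metric, results["test_" + metric].mean())
```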
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi Brian,
thanks again for the quick answer.
However, if I take the process you uploaded and use your Excel file, I get a result but not the statistical parameters such as precision and recall.
Did you do anything special while importing the data? I just set the type of the relevant data to binominal. What can I do to get the precision and recall for the data?
Thank you and best regards,
Thiemo
What you see in the Statistics tab is just some basic descriptive statistics of your data set; there will be no precision/recall or confusion matrix, because you haven't done any modeling yet. This view is similar to the summary or head commands in Python/R.
You need to attach a Cross Validation operator with a machine learning algorithm embedded inside, plus a Performance operator, to generate the precision/recall values and the confusion matrix.
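In Python terms, the difference looks roughly like this (toy data, just to show the distinction):

```python
# The Statistics tab is like basic descriptive statistics: no model involved.
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
df = pd.DataFrame(X).assign(label=y)

# Analogue of the Statistics tab / summary() in R: counts, means, quartiles, ...
print(df.describe())

# Nothing here says anything about precision or recall; those only appear after
# a learner has been trained and evaluated, e.g. inside a cross-validation.
```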
Hi T-Bone,
thank you for the answer.
Exactly this was my initial problem. If I add another Cross Validation operator with a Performance operator around the actual process, then it makes no sense anymore, right?
Regards,
Thiemo
From that point in your process (where you show the Statistics tab), connect a Cross Validation operator (insert your algorithm on the Training side, and an Apply Model and a Performance operator on the Testing side), THEN connect the "per" port on the Cross Validation operator to the results port. This will output the precision/recall etc. for you.