
"Text mining from Excel file and Split validation"

federico_schiro Member Posts: 6 Contributor II
edited June 2019 in Help

hi.

 

Thanks to my teacher, I've entered the fantastic world of RapidMiner. I love it, even though I'm still a newbie.

I'm trying to proceed with text classification modeling, starting from an Excel file with two columns:

Column 1: attribute (text) containing the review
Column 2: label (binominal: 0 for a negative review, 1 for a positive review)

 

Up till now we only worked with positive reviews stored as TXT files in one folder and negative reviews stored as TXT files in another folder, and we defined the two of them as the positive class and negative class.

I've tried to proceed like this: Read Excel - Process Documents (tokenize, remove stopwords, transform cases) - Validation (training with SVM + Apply Model and Performance).

I've used Nominal to Numerical to avoid SVM capacity problems, but as a result I only get the root mean squared error in the performance vector.

I was looking for the accuracy of my model instead... sorry for the bad question, I hope somebody can help.
Can I use a TXT file as an alternative? See attached file.
Thanks a lot in advance.

 

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @federico_schiro,

     

    Have you tried using the Performance (Classification) or Performance (Binominal Classification) operator as your Performance operator?

     

    Regards,

     

    Lionel

  • federico_schiro Member Posts: 6 Contributor II

    Thanks a lot, I got what I was looking for. Wow.
    Do you think it's likely I'd get better accuracy if I work with more reviews in my corpus? I have 2,000 more reviews (from Amazon and IMDb).

    With the Yelp reviews I get 56%:

    PerformanceVector:
    accuracy: 56.00%
    ConfusionMatrix:
    True: 0 1
    0: 41 29
    1: 59 71

     

  • federico_schiro Member Posts: 6 Contributor II

    Thanks a lot, it works.

    I have a question regarding the degree of accuracy. I got 56% here. Is it possible to raise it by adding more reviews to my corpus?
    Hopefully the other reviews won't make it worse. Do you think it makes sense to work with 2,000 more reviews from two different platforms, or would that make things worse?
    Thanks a lot again.

    PerformanceVector:
    accuracy: 56.00%
    ConfusionMatrix:
    True: 0 1
    0: 41 29
    1: 59 71
    AUC: 0.614 (positive class: 1)
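    For reference, the accuracy line follows directly from the confusion matrix; here is a quick stdlib-Python check (layout assumed from RapidMiner's output, where rows are predictions and columns are true classes):

```python
# Recompute accuracy from the confusion matrix above.
# Assumed layout (RapidMiner-style): rows = predicted class, columns = true class.
#           True: 0    1
# pred 0:        41   29
# pred 1:        59   71
confusion = {("pred0", "true0"): 41, ("pred0", "true1"): 29,
             ("pred1", "true0"): 59, ("pred1", "true1"): 71}

correct = confusion[("pred0", "true0")] + confusion[("pred1", "true1")]  # 41 + 71
total = sum(confusion.values())                                          # 200 reviews

accuracy = correct / total
print(f"accuracy: {accuracy:.2%}")  # accuracy: 56.00%
```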

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    There are lots of ways to possibly improve your classification results. Some that could help right off the bat are pruning, n-grams, and filtering out short words. You might want to review how you tokenize the words too. If you have lots of numbers in the corpus, the default tokenization parameter of 'non letters' will wipe those out.

     

    Next, you can use another algorithm, like a linear SVM or Deep Learning. I would use them in conjunction with a Cross Validation, not a Split Validation.
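    As an illustration of two of those ideas, here is a rough stdlib-Python sketch of term bigrams and document-frequency pruning; the function names, thresholds, and toy documents are illustrative, not RapidMiner's API:

```python
from collections import Counter

def tokens_with_bigrams(text):
    """Tokenize on whitespace and append word bigrams (conceptually
    similar to generating term n-grams on the token stream)."""
    words = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def prune_by_document_frequency(docs, low=0.03, high=0.30):
    """Keep only terms appearing in between `low` and `high` fraction
    of documents -- analogous to percentual pruning in Process Documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))   # count each term once per document
    n = len(docs)
    return {t for t, df in doc_freq.items() if low <= df / n <= high}

docs = [tokens_with_bigrams(t) for t in [
    "great food great service",
    "terrible service slow food",
    "great place",
    "slow and terrible",
]]
vocab = prune_by_document_frequency(docs, low=0.30, high=0.60)
# Rare terms (in only 1 of 4 docs) and any overly common terms are dropped.
```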

  • federico_schiro Member Posts: 6 Contributor II

    Fantastic, thanks a lot. You people are very supportive.

    I forgot about stemming. Using a Stem (Porter) operator, I got 3% more accuracy.

    Do you think 3%-30% pruning is OK, or can I change it to get better results?
    What are the options with Tokenize? I've selected 'non letters', the default.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Text processing is almost as much an art form as it is analytics, and it requires some thinking from the domain expert. I don't know what corpus you're trying to classify, but sometimes 3/30% pruning is right, other times 5/80 is good. The short answer is that it depends.

     

    Of course, if you used an Optimize Parameters operator, you could tune the actual pruning percentages to find the optimal values for the best performance measure.
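    A minimal sketch of what a grid-based parameter optimization does under the hood: enumerate parameter combinations and keep the best-scoring one. The `evaluate` function here is a toy stand-in for a full validated pipeline run, not real results:

```python
from itertools import product

def evaluate(low, high):
    """Toy stand-in for running the whole text pipeline with these
    pruning percentages and returning a validated accuracy."""
    return 0.56 + 0.10 * (high - low) - abs(low - 0.05)

grid_low = [0.03, 0.05, 0.10]    # candidate 'prune below' fractions
grid_high = [0.30, 0.50, 0.80]   # candidate 'prune above' fractions

# Try every combination, keep the one with the best score.
best = max(product(grid_low, grid_high), key=lambda p: evaluate(*p))
print(best)  # (0.05, 0.8)
```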

     

    With respect to tokenization, I talk about that in my video here: https://www.youtube.com/watch?v=ia2iV5Ws3zo. I do a lot of Twitter mining, so a hashtag like #datascience would be obliterated using the non-letters parameter. Whereas with specify characters, I could just split on ".,![]"
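    To make the difference concrete, here is a small stdlib-Python sketch; the regexes only approximate the two Tokenize modes, and the tweet is made up:

```python
import re

tweet = "Loving #datascience today! Check rapidminer.com, it's great"

# 'non letters' mode: every non-letter character is a split point,
# so '#' is consumed and the hashtag is destroyed.
non_letters = [t for t in re.split(r"[^a-zA-Z]+", tweet) if t]

# 'specify characters' mode: split only on ".,![]" (plus whitespace here),
# so '#datascience' survives as a single token.
specified = [t for t in re.split(r"[.,!\[\]\s]+", tweet) if t]

print("#datascience" in specified)    # True
print("#datascience" in non_letters)  # False
```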

  • federico_schiro Member Posts: 6 Contributor II

    Another question :)

    So far I still haven't fully understood what the blue curve here (ROC threshold) represents.
    I get what the red one expresses, but what about the blue one?
    (I know, my accuracy isn't that great; that's why my red curve looks like that, "cringe".)

    Thanks!

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    hi,

     

    It represents a confidence threshold. ROC is calculated like this:

    Take a confidence threshold of 0.99 and calculate the TPR/FPR for it -> data point.

    Take a confidence threshold of 0.98 and calculate the TPR/FPR for it -> data point.

     

    The red curve shows the TPR/FPR values. The blue curve shows the corresponding thresholds needed to get those values.
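    Martin's recipe can be sketched in stdlib Python with made-up confidence scores and labels: each threshold yields one (TPR, FPR) point on the red curve, and the blue curve records which threshold produced it:

```python
# Made-up example data: model confidences for class 1, and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def tpr_fpr(threshold):
    """Classify everything with confidence >= threshold as positive,
    then compute the true-positive and false-positive rates."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / labels.count(1), fp / labels.count(0)

# One data point per candidate threshold, exactly as described above.
curve = [(t, *tpr_fpr(t)) for t in (0.99, 0.85, 0.65, 0.35, 0.0)]
# Each tuple is (threshold, TPR, FPR); lowering the threshold moves
# the point up and to the right along the ROC curve.
```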

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • federico_schiro Member Posts: 6 Contributor II

    Hi Martin, and thanks for the answer.

     

    I can understand what the ROC curve is, but it's the threshold curve (blue) that I feel confused about.

    I watched several videos about it and thought of it like this: when the threshold is high, I have a higher TPR (because I "accept" only high predicted probabilities, so it's easier to get predictions right), whereas when the threshold is low (for instance, < 0.5 predicted probability) I see a higher FPR.

     

    Also, I tried TF-IDF without pruning, and my accuracy skyrocketed!
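    For what it's worth, the TF-IDF scheme can be sketched in a few lines of stdlib Python; this uses the common tf * log(N/df) form, RapidMiner's exact normalization may differ, and the toy documents are illustrative:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by term frequency times inverse document
    frequency, so words appearing in most documents score low."""
    n = len(docs)
    df = Counter()                      # number of documents per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term counts in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["great", "food", "great"],
        ["bad", "food"],
        ["great", "place"]]
w = tfidf(docs)
# 'bad' occurs in only one of three documents, so it gets the largest
# per-occurrence weight; 'food' and 'great' are more common and score lower.
```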

     

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    The threshold doesn't tell you too much about your performance. Read it like this: if you want to get this TPR/FPR value, you need to use the blue threshold value.

     

    Does this make more sense?

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany