
Polynominal sentiment analysis with SVM

HeikoeWin786 Member Posts: 64 Contributor I
Dear all,

I am trying to run an SVM on a dataset where the customer review is a polynominal attribute and the sentiment score is a binominal label. I have read the tutorials and learned that SVM can only handle numerical attributes, so nominal values need to be converted to numerical. However, do I need to convert both the customer reviews and the sentiment score to numerical? At which step should the conversion happen, after the data has been processed? I am a bit confused about how sentiment analysis works with SVM in RapidMiner: the RapidMiner tutorial under the sample templates uses text and a binominal label without converting anything to numerical.
Can anyone suggest how to fix this correctly?
I have attached my process flow for your reference.

thanks.
Heikoe
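[Editor's note] A minimal sketch of the conversion the question asks about, written in scikit-learn rather than RapidMiner (the data and names below are invented for illustration): the polynominal review text is turned into numerical features, while the binominal sentiment stays a nominal class label, since SVM classifiers accept nominal labels and only the regular attributes must be numerical.

```python
# Hedged sketch (scikit-learn, not RapidMiner; toy data for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

reviews = ["great product", "terrible service", "love it", "very bad"]
sentiment = ["positive", "negative", "positive", "negative"]  # binominal label

# Convert the nominal text (regular attribute) into a numerical matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# The label itself stays nominal; no conversion is needed for it.
clf = SVC(kernel="linear")
clf.fit(X, sentiment)

pred = clf.predict(vectorizer.transform(["great service"]))
```

This mirrors what RapidMiner's text-processing operators do internally, which is why the sample template can feed text and a binominal label to an SVM without an explicit nominal-to-numerical step.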

Answers

  • HeikoeWin786 Member Posts: 64 Contributor I
    Hello,

    Could anyone please help me understand this? :(

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited July 2020
    I am not sure at what point your model fails. This is just a binomial classification of reviews, in which your binomial label happens to represent sentiment. I cannot see any major problems with your model training and its cross-validation, and I suspect this is not where the process fails. However, the process will definitely fail in your honest testing (the lower leg of your process), because your pre-processing for training and cross-validation differs from the pre-processing for model testing: you do not create an ID, do not define a label, and do not convert the nominal attributes to numerical. You must also apply exactly the same pre-processing model here. So the process will fail regardless of whether you use SVM or some other model (which I'd also recommend trying).
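[Editor's note] The point above, that the pre-processing fitted on the training data must be re-applied rather than re-fitted on the honest test set, can be sketched in scikit-learn (not RapidMiner; data and names are invented). Re-fitting the vectorizer on the test reviews would produce an incompatible feature space, which is the analogue of the mismatched lower leg in the process.

```python
# Hedged sketch (scikit-learn stand-in for "apply the same pre-processing model").
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

train_reviews = ["great product", "terrible service", "love it", "very bad"]
train_labels = ["positive", "negative", "positive", "negative"]
test_reviews = ["bad product", "great service"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_reviews)  # fit ONCE, on training data only

clf = SVC(kernel="linear").fit(X_train, train_labels)

# Correct: apply the SAME fitted pre-processing to the test data.
X_test = vectorizer.transform(test_reviews)
predictions = clf.predict(X_test)

# Wrong (what a mismatched test leg amounts to): a fresh vectorizer fitted
# on the test set yields columns that no longer line up with the ones the
# model was trained on.
X_wrong = TfidfVectorizer().fit_transform(test_reviews)
assert X_wrong.shape[1] != X_train.shape[1]
```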
  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Yes, I agree with @jacobcybulski of course :) I would also wonder why you insist on using SVM. It's very possible that another model might give you much better performance. Also, have you explored the samples in the Community repo?



    If you shared your Excel file, we could run your process and see what's going on.

    Scott
  • HeikoeWin786 Member Posts: 64 Contributor I
    @sgenzer
    @jacobcybulski


    Hello both,

    Thanks for your kind input.
    Yes, I have changed the label to binominal and processed the data.
    I am really not sure how to pick the optimal model (I have tried SVM and NBC so far).
    I see that the sample sentiment analysis uses SVM inside cross-validation as well.
    Let me check the sample repository once more and explore other models.
    And yes, for sure, I can share the file with you.

    Much appreciated for all your input, truly!

    thanks and regards,
    Heikoe
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    When you test different binomial models, you can of course select the best by accuracy, kappa or AUC. The issue is that models such as SVM are difficult to optimise. I suggest using the Grid Optimizer, inside which you can place the whole cross-validation, or its holdout equivalent when you have lots of data (for efficiency's sake). Then you can vary your SVM parameters (which depend on the selected kernel, e.g. C and gamma), and when you execute the process you will be able to view the log of performance indicators to see which combination of SVM parameters gives the best performance. Once you find these optimal parameters, go back to the process you previously created and plug the values into the SVM.
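[Editor's note] The structure described above, cross-validation nested inside a grid optimizer that varies C and gamma and logs a performance score per combination, can be sketched with scikit-learn's GridSearchCV (not RapidMiner's operator; the data and parameter ranges are invented for illustration).

```python
# Hedged sketch: grid optimization with cross-validation nested inside it.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy numerical data standing in for the vectorized reviews.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],          # regularization strength
    "gamma": [0.001, 0.01, 0.1, 1],  # RBF kernel width
}

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=5,                 # the cross-validation sits inside the optimizer
    scoring="accuracy",
)
search.fit(X, y)

# One full cross-validation ran per (C, gamma) combination; the winner
# is what you would plug back into the original process.
print(search.best_params_)  # best C and gamma found by the search
```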
  • HeikoeWin786 Member Posts: 64 Contributor I
    Hello @jacobcybulski

    Thanks very much again.
    Does this mean I need to place my cross-validation process inside the Grid Optimizer?
    Currently, the SVM sits inside the cross-validation.
    So, if I now put the cross-validation inside the grid and run the process, it will return the best-fitting parameters; I then take those parameters and apply them in the actual SVM process. Am I correct?

    thanks and regards,
    Heikoe
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Let's say you logged 1000 results of your grid optimisation (make sure you pick the option to log all performance measurements: kappa, accuracy and AUC). You can then order them by accuracy (I'd avoid this if your label has a class imbalance), kappa (pretty good) or AUC (especially if you are prepared to optimise performance later via the threshold), and pick the best performance with its SVM parameters. However, I'd recommend plotting parameters against performance (a challenge in its own right when you have multiple dimensions) and picking not necessarily the overall best, but rather the best within a stable range of parameter combinations (e.g. avoid a maximum kappa surrounded by cliffs of poor performance).
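[Editor's note] The logging-and-stability idea above can be sketched in scikit-learn (not RapidMiner; the data, grid and column names are invented): record several metrics per parameter combination and look at the spread around the best cell, since a high mean with a large standard deviation across folds is exactly the "cliff" to avoid.

```python
# Hedged sketch: multi-metric grid log, sorted by kappa, with fold spread.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

scoring = {
    "accuracy": "accuracy",
    "kappa": make_scorer(cohen_kappa_score),
    "auc": "roc_auc",
}

search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
    scoring=scoring,
    refit="kappa",  # kappa is more robust to class imbalance than accuracy
)
search.fit(X, y)

# The full "log": one row per (C, gamma) combination.
log = pd.DataFrame(search.cv_results_).sort_values(
    "mean_test_kappa", ascending=False
)
cols = ["param_C", "param_gamma", "mean_test_kappa", "std_test_kappa",
        "mean_test_auc"]
print(log[cols].head())
# Prefer a slightly lower mean_test_kappa with a small std_test_kappa and
# similar neighbours over a lone maximum surrounded by poor cells.
```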