Figure out if there is any problem with my dataset.

njasaj Member Posts: 18 Contributor II
edited November 2018 in Help
Hi,
I am trying to classify a data set with three labels and 7 attributes using the LibSVM operator. My data set is imbalanced: the class distribution is 882, 237, 273. Whenever I try to classify this data set, the computed model cannot discriminate between the classes and assigns all the points (except 30 of them) to the biggest class. I tried undersampling by taking 200 points from every class with the simple sampling operator implemented in RapidMiner, but the result is not acceptable.
Is there any problem with my data set? I repeated this procedure with the Iris data set and it worked.
Thanks.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Then probably your data is not separable with the learning method (SVM) or the parameters you are using. Did you try to optimize the parameters of the SVM? You should try different kernels (linear/dot and radial/rbf are good choices to start with) and optimize the C parameter. When using the rbf kernel, the gamma parameter also needs to be optimized.
    Try an Optimize Parameters or Loop Parameters operator with a sensible Log operator inside to get an overview of the impact of the parameters. A good starting range for both C and gamma is 10^-4 to 10^4 on a logarithmic scale.
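    If you prefer to prototype this outside RapidMiner, a rough scikit-learn equivalent of such a log-scale grid search could look like the sketch below. It is only an illustration of the approach: X and y are placeholders standing in for your 7 attributes and 3-class label, and the score and ranges are reasonable defaults rather than values tested on your data.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder data: replace with your own 7 attributes and 3-class label.
    X = np.random.rand(1392, 7)
    y = np.repeat([0, 1, 2], [882, 237, 273])

    # Scale the attributes first; SVMs are sensitive to feature ranges.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("svm", SVC(kernel="rbf")),
    ])

    # Log-scale grid for C and gamma, roughly 10^-4 .. 10^4 as suggested above.
    param_grid = {
        "svm__C": np.logspace(-4, 4, 9),
        "svm__gamma": np.logspace(-4, 4, 9),
    }

    search = GridSearchCV(
        pipe,
        param_grid,
        scoring="balanced_accuracy",  # plain accuracy is misleading on imbalanced data
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        n_jobs=-1,
    )
    search.fit(X, y)

    print("best parameters:", search.best_params_)
    print("best balanced accuracy:", search.best_score_)
    # search.cv_results_ plays the role of the Log operator: one row per combination.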

    Best, Marius
  • njasajnjasaj Member Posts: 18 Contributor II
    Thank you Marius. The poor results were obtained even with parameter optimization. I have tested evolutionary parameter optimization and tried to tune C and gamma of the rbf kernel. I will try polynomial and sigmoid kernels too. Would you mind describing, or posting XML code for, how to use cost-sensitive meta learning with parameter optimization in RapidMiner for an imbalanced data set? I guess that simply lowering the number of samples of the larger class at random is not the proper approach, and a more advanced sampling technique must be used.
    Thanks a lot.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    instead of the evolutionary search I would go for a systematic grid search (Optimize Parameters (Grid)), since the evolutionary one has some disadvantages (e.g. very long execution times if by chance a bad parameter combination is chosen in one of the generations). By logging the values, you then get a very nice overview of the impact of different parameters.

    For the balancing, I personally would optimize the balancing in a separate step/process with the same technique, i.e. trying different balancing values with Loop Parameters (Grid) and logging the values, then fix the best value (which will in most cases be near a balanced data set) and use it in the actual SVM parameter optimization.
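    As a rough illustration of that separate balancing search, again sketched in scikit-learn rather than RapidMiner, you could loop over candidate per-class sample sizes, log a score for each, and then fix the best one before tuning the SVM. The candidate sizes and the fixed C/gamma below are just assumptions to make the sketch runnable, not recommended values.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score, StratifiedKFold
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def downsample(X, y, per_class, rng):
        """Randomly keep at most `per_class` examples of each class."""
        keep = []
        for label in np.unique(y):
            idx = np.where(y == label)[0]
            rng.shuffle(idx)
            keep.extend(idx[:per_class])
        keep = np.array(keep)
        return X[keep], y[keep]

    rng = np.random.default_rng(42)
    X = np.random.rand(1392, 7)                # placeholder data
    y = np.repeat([0, 1, 2], [882, 237, 273])  # class sizes from the question

    # Fixed, assumed SVM settings just for comparing balancing values;
    # re-tune C and gamma afterwards on the chosen sample.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma=0.1))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Loop over candidate per-class sizes and log the score for each,
    # analogous to Loop Parameters (Grid) plus a Log operator.
    for per_class in [150, 200, 237, 300, 500]:
        Xs, ys = downsample(X, y, per_class, rng)
        score = cross_val_score(model, Xs, ys, cv=cv,
                                scoring="balanced_accuracy").mean()
        print(f"per_class={per_class:4d}  balanced_accuracy={score:.3f}")
    # Fix the best per_class value, then run the C/gamma grid search on that sample.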

    Best, Marius
  • njasajnjasaj Member Posts: 18 Contributor II
    Thank you for your answers and support.