High deviation?
Legacy User
Member Posts: 0 Newbie
For many learners (supervised learning), I get a standard deviation of about 40-50% on my learning set (for an accuracy of 70-80%). What can I infer from this value? Is this deviation high, i.e. are the computed classifiers too weak? Or are these standard values? If not, what could I do to reduce the standard deviation?
I'd be happy to get a detailed answer since I'm a newbie to data mining and your cool tool RapidMiner. :-)
Thx.
klaus
Answers
If you use cross-validation, the performance is estimated by averaging the performance over a number of disjoint training and classification runs. Since you then have, for example, ten performance values, you can calculate the standard deviation from them.
A high standard deviation indicates that the performance in some of the runs was much better than the average and in others much worse.
This points to a very unstable classification result or to the use of a very small training set. With small training sets (for example gene expression data with hundreds of thousands of attributes but only a small number of examples), a single misclassified example more or less in a run already causes several percent of standard deviation.
If your example set contains enough examples, you should try another learning algorithm or tune the parameters in order to obtain more reliable results.
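The averaging described above can be sketched in a few lines of plain Python (not RapidMiner; the fold accuracies below are made-up illustration values, chosen to match the 70% accuracy and large deviation discussed in this thread):

```python
import statistics

# Hypothetical per-fold accuracies from a 10-fold cross-validation run.
fold_accuracies = [0.30, 0.95, 0.55, 1.00, 0.40, 0.85, 0.90, 0.35, 0.80, 0.90]

mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.stdev(fold_accuracies)  # sample standard deviation across folds
print(f"accuracy: {mean_acc:.0%} +/- {std_acc:.0%}")  # → accuracy: 70% +/- 27%

# Why small training sets inflate the deviation: with 10-fold
# cross-validation on 500 examples, each test fold holds only 50 examples,
# so one extra misclassification shifts that fold's accuracy by
print(1 / 50)  # → 0.02, i.e. 2 percentage points per example
```

The deviation here is large not because the learner's accuracy estimate of 70% is wrong, but because the individual folds disagree wildly with each other.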
Greetings,
Sebastian
Your assumption is right, the training set is relatively small, about 500 examples. I've tried a couple of algorithms and also played around with their parameters. But in most cases I get these high deviations. Do you have any suggestions which algorithms/parameters or pre-/post-processing steps might be promising for reducing the standard deviation?
And a general question: are such high deviations acceptable when the accuracy, as in my case, is relatively high (or is 80% not that good)? Or is it better for a model to sacrifice some accuracy for the sake of a smaller standard deviation?
klaus
500 examples is not that big, but one or two misclassified examples more or less won't cause the accuracy to deviate that much.
The standard deviation is an indicator of the reliability of the performance estimate obtained from cross-validation. A deviation of 80% says: don't know. Your final model trained on all data might be really bad, or really good, but you don't know, and you cannot test it (since you have already used all your training data).
Usually a deviation of around 4% is tolerable, sometimes more, sometimes less, depending on the size of the training data. So you should indeed look for another learning algorithm or do some preprocessing. In many cases the real trick is in the correct preprocessing...
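To illustrate that last point outside of RapidMiner: a hedged sketch (scikit-learn on synthetic data, not Klaus's actual process) of putting the preprocessing steps inside the cross-validated pipeline, so that scaling and attribute selection are re-fit on each training fold rather than leaking information from the test folds:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small, wide data set: 500 examples, 50 attributes,
# only a few of which are actually informative.
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)

# Preprocessing lives *inside* the pipeline, so each cross-validation fold
# scales the data and selects attributes using its own training split only.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),        # keep the 10 most informative attributes
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(model, X, y, cv=10)  # one accuracy value per fold
print(f"accuracy: {scores.mean():.0%} +/- {scores.std():.0%}")
```

Dropping noisy attributes before learning typically makes the per-fold results agree more closely, which shows up directly as a smaller standard deviation.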
Greetings,
Sebastian