"[Solved] SVM parameter optimization of C and epsilon"

qwertz · January 2013

Dear all,

I have never been in the favorable situation to attend lectures on data mining. Thus, I lack the mathematical details on how they actually work. Nevertheless, I'd like to have at least an understanding of what is being influenced by the parameters.

To start into this topic I read some papers and searched the forum. Promising results were:
- "A User's Guide to Support Vector Machines" by Ben-Hur and Weston
- http://www.svms.org/parameters/
- http://rapid-i.com/rapidforum/index.php/topic,5863.0.html

In my case I have the "SVM (linear)" with a linear kernel. In the above mentioned post it is said that for this type of SVM only the parameter C has to be optimized. However, there are yet other parameters that can be set.

So I was wondering
- What effect do the parameters "C", "convergence epsilon" and "epsilon" have on the model at all?
- What are reasonable threshholds for these parameters for a (logarithmic grid) optimization?

I'd appreciate if someone could help with these questions or give a link for an "easy to understand" introduction to SVMs.

Best regards
Sachs

MariusHelf · January 2013

Hi,

depending on what you consider "easy to understand" I would recommend the book "Elements of Statistical Learning" by Tibshirani et.al. to you. It is a very comprehensive introduction to data mining, and given the complexity of the matter it is well written and easy to understand. You can download it for free from the author's website: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
For serious working with it I would recommend to buy a paper copy, though.

To answer your questions about the SVM on a very high-level scale:
Consider a classification problem with a fuzzy, ragged separation line between the two classes. Now when creating a classification model, you can go into two extremes:
1. you create a model which classifies each and every data point in the training data correctly. That will result in a very complex description of the data: you have to describe every bend and every outlier. To create such a model, the creating algorithm must have "high capacity".
2. model the data with a single straight line. This results in a very simple model, and you probably make a lot of mistakes on the training data. We say, the algorithm has a low capacity.

The choice of the capacity depends on the data - if the training data is e.g. very noisy, you do not want an algorithm with a high capacity, because in that case it would model it noise point as part of the model and thus generalize bad to new data. As well does a method with a low capacity have disadvantages on problems which are imminently complex per se.

Different learning algorithms have different capacities: a neural net for example has a high capacity, whereas a decision tree with only one or two levels has a rather low capacity.

The cool thing about the SVM is that you can control its capacity with a single parameter, without changing the algorithm itself. That is the C that you can configure.

Internally, the SVM calculates the model not with a simple formula, but optimizes the model stepwise. When the model converges against a certain measure, the algorithm stops and considers the current model as "good enough". The strictness of this optimization is controlled by the epsilon parameters. Usually you can use the default settings.

For C, as written elsewhere, a good choice are values in the range from 10^-6 to 10^3. If you see a constant increase towards one of the borders, you should of course increase the search space. Same applies to the kernel_gamma parameter, which has to be optimized when using an SVM with rbf/radial kernel.

Best regards,
Marius

qwertz · January 2013

Hi Marius,

Thank you very much for taking the time to give such helpful and detailed explanation!!
Your support in this forum is as outstanding as the Rapidminer software itselt!

May I ask a last question on SVMs? The "SVM (Linear)" operator has got two parameters: "convergence epsilon" and just "epsilon". What is the difference then?

Best regards
Sachs

MariusHelf · February 2013

As described above, the SVM internally optimizes a hyperplane. In each iteration, the costs (or so-called loss) is calculated, i.e. wrongly classified examples are "counted" in a way. In the next iteration the hyperplane is adjusted according to the errors discovered in the last iteration. The algorithm stops when the hyperplane is "good enough" according to some constraints, the so-called KKT-constraints. "Good enough" here is influenced by the convergence_epsilon.
The other epsilon values specify a delta around the true value of each example for which a prediction is still considered as "correct".

But as I said, usually you do not need to touch any of these parameters.

Best regards,
Marius

qwertz · February 2013

You are marvelous! Thanks again!

Best regards
Sachs

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"[Solved] SVM parameter optimization of C and epsilon"

Answers