
"Averaging cross-validation results"

_paul_ Member Posts: 14 Contributor II
edited May 2019 in Help
Hi,

I have a general and a RapidMiner-specific question concerning cross-validation.

In the meta sample "07_EvolutionaryParameterOptimization" you are performing an evolutionary
parameter optimization for LibSVMLearner based on the performance results from a cross-validation.
Between the EvolutionaryParameterOptimization and XValidation operators you are using the operator
"IteratingPerformanceAverage". Is it recommended to always use it in order to get more unbiased results?
If so, what is a typical value for the parameter "number_of_validations"?

I would expect that the "IteratingPerformanceAverage" operator modifies the random seed. In the sample mentioned
above it is not clear to me how this happens. The operator "Process" uses the fixed value of "2001" for the parameter
"random_seed". The operator "XValidation" uses "-1" for "local_random_seed", i.e. the global setting. So it looks
to me as if the same seed, namely 2001, is used for all iterations of the cross-validation. Wouldn't it make more
sense to use "-1" for "random_seed" in "Process" so that each validation run gets a different seed?

Regards,
Paul 

Answers

  • steffen Member Posts: 347 Maven
    Hello Paul
    • The IteratingPerformanceAverage is used to average the PerformanceVectors (= the output of the cross-validation). Yes, this is in general a recommended strategy. Kohavi suggests repeating a 10-fold cross-validation 6-10 times. (A small sketch of this strategy follows after this list.)
    • The global random generator with seed 2001 is initialized once every time you start a process. Hence XValidation splits the data in a different way every time the operator is executed within that process run (please note the difference between a single operator and the whole process). If you set the mentioned parameter to a fixed value other than -1, the random generator would be re-initialized every time the operator is executed and would therefore always split the data in the same way. I hope it is clear now.
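
    As an illustration of the first point, here is a minimal sketch in scikit-learn terms (an assumption made purely for illustration; these are not the RapidMiner operators, and the data set and learner are only placeholders): a 10-fold cross-validation is repeated six times on different splits and all performance values are averaged, which is roughly what wrapping XValidation in IteratingPerformanceAverage achieves.

        # Sketch only: repeated 10-fold cross-validation with averaged results.
        from sklearn.datasets import load_iris
        from sklearn.model_selection import RepeatedKFold, cross_val_score
        from sklearn.svm import SVC

        X, y = load_iris(return_X_y=True)   # placeholder data set
        learner = SVC()                     # placeholder learner (stands in for LibSVMLearner)

        # 6 repetitions of a 10-fold cross-validation, each repetition with different splits
        cv = RepeatedKFold(n_splits=10, n_repeats=6, random_state=2001)
        scores = cross_val_score(learner, X, y, cv=cv)
        print(scores.mean())                # the averaged performance estimate
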
    regards,

    Steffen

    PS: I guess I have found the first topic for the wiki  ;D
  • _paul_ Member Posts: 14 Contributor II
    Hi Steffen,
    "Yes, this is in general a recommended strategy. Kohavi suggests repeating a 10-fold cross-validation 6-10 times."
    Do you have a reference (paper or book) for me where I could find Kohavi's suggestion?
    "The global random generator with seed 2001 is initialized once every time you start a process."
    Maybe I got it wrong, but I think you meant "-1" here and not "2001", right? To my understanding, you always
    get the same pseudo-random numbers when you use a fixed value != -1. Using -1, on the other hand,
    might be a problem when you want reproducible results, since a different seed is generated every time.

    I think that the most suitable approach in combination with the IteratingPerformanceAverage operator would be a mix
    of both seed specifications: RapidMiner should perform the cross-validation 6-10 times with different seeds
    that are nevertheless specified statically. That way the results would be reproducible each time you run your
    process, while the validation result would still be an average over multiple seeds and therefore not
    biased towards one specific seed (a sketch of this idea follows below).

    Is there a way to tell RapidMiner to perform a cross-validation with a set of pre-defined seeds that are
    specified manually?
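
    A minimal sketch of this idea, again in scikit-learn terms rather than as a RapidMiner process (an assumption for illustration only; the seed list, data set and learner below are placeholders): one 10-fold cross-validation is run per hand-picked seed and the per-seed averages are averaged again. Because the seed list is fixed, every run reproduces the same number, yet the estimate is not tied to a single split.

        # Sketch only: hand-picked seeds, one 10-fold cross-validation per seed.
        from statistics import mean
        from sklearn.datasets import load_iris
        from sklearn.model_selection import KFold, cross_val_score
        from sklearn.svm import SVC

        X, y = load_iris(return_X_y=True)           # placeholder data set
        learner = SVC()                             # placeholder learner

        seeds = [11, 42, 101, 2001, 31337, 65537]   # manually defined example seeds
        per_seed_means = [
            cross_val_score(learner, X, y,
                            cv=KFold(n_splits=10, shuffle=True, random_state=s)).mean()
            for s in seeds
        ]
        print(mean(per_seed_means))                 # reproducible averaged performance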

    Regards,
    Paul
  • steffen Member Posts: 347 Maven
    Hello Paul

    First of all: -1 means that you use the global random generator, which (as specified in the preferences) is initialized with 2001.

    Then:
    The global random generator is initialized with 2001 every time a process is executed (by clicking the arrow button). The local generators, on the other hand, are initialized with the specified seed (!= -1) every time the operator where this seed has been specified is executed. Hence the results are always reproducible. (A small sketch of this difference follows below.)
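
    A tiny sketch of this distinction in plain Python (an analogy only, not RapidMiner internals): a generator seeded once per "process" run keeps producing new numbers across repeated operator executions within that run, while a generator re-seeded with a fixed local seed on every execution always produces the same numbers. Re-running the whole process reproduces both outputs.

        import random

        def run_process():
            # "global" generator: seeded once when the process starts
            global_rng = random.Random(2001)
            # an operator using local_random_seed = -1, executed twice:
            # it draws from the global generator, so the two executions differ
            print("global:", global_rng.random(), global_rng.random())
            # an operator with a fixed local seed, executed twice:
            # the generator is re-initialized each time, so both draws are identical
            print("local: ", random.Random(42).random(), random.Random(42).random())

        run_process()   # both calls print exactly the same lines -> reproducible
        run_process()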

    To use self-specified seeds inside IteratingPerformanceAverage, you can type

    %{a}
    as the seed argument (a RapidMiner macro, a powerful thing; see the tutorial.pdf for more details), which replaces the seed with the number of the current iteration (1, 2, 3, ...).

    I suggest you continue to play with the RapidMiner example processes to see what I mean. I hope I didn't increase your confusion :)

    Regarding Kohavi: here is the link to his Ph.D. thesis (http://ai.stanford.edu/~ronnyk/teza.pdf), where you can find a detailed discussion of the validation issue. Long text, but fun to read.

    hope this was helpful

    Steffen
  • _paul_ Member Posts: 14 Contributor II
    Hi Steffen,

    thank you for your help.

    What I meant by "non-reproducible results" was that using "-1" as the global and local seed would always
    yield different random numbers due to the system time, which usually changes when a process is
    executed multiple times.  :)

    Paul