
"Averaging cross-validation results"

_paul_ Member Posts: 14 Contributor II
edited May 2019 in Help
Hi,

I have a general and a RapidMiner-specific question concerning cross-validation.

In the meta sample "07_EvolutionaryParameterOptimization" you are performing an evolutionary
parameter optimization for LibSVMLearner based on the performance results from a cross-validation.
Between the EvolutionaryParameterOptimization and XValidation operators you are using the operator
"IteratingPerformanceAverage". Is it recommended to always use it in order to get more unbiased results?
If so, what is a typical value for the parameter "number_of_validations"?

I would expect that the "IteratingPerformanceAverage" operator modifies the random seed. In the sample mentioned
above it is not clear to me how this happens. The operator "Process" uses the fixed value of "2001" for the parameter
"random_seed". The operator "XValidation" uses "-1" for "local_random_seed", i.e. the global setting. So it looks
to me as if the same seed, namely 2001, is used for all iterations of the cross-validation. Wouldn't it make more
sense to use "-1" for "random_seed" in "Process" so that each validation run gets a different seed?

Regards,
Paul 

Answers

  • steffen Member Posts: 347 Maven
    Hello Paul
    • The IteratingPerformanceAverage is used to average the PerformanceVectors (= the output of the cross-validation). Yes, this is in general a recommended strategy. Kohavi suggests repeating a 10-fold cross-validation 6-10 times. (A small sketch of this strategy follows after this list.)
    • The global random generator with seed 2001 is initialized once every time you start a process. Hence XValidation splits the data in a different way every time the operator is executed within that process run (please note the difference between a single operator and the whole process). If you set the mentioned parameter to a fixed value other than -1, the random generator would be re-initialized every time the operator is executed and would therefore always split the data in the same way. I hope it is clear now.
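
    As an illustration of the first point, here is a minimal sketch in scikit-learn terms (an assumption made purely for illustration; these are not the RapidMiner operators, and the data set and learner are only placeholders): a 10-fold cross-validation is repeated six times on different splits and all performance values are averaged, which is roughly what wrapping XValidation in IteratingPerformanceAverage achieves.

        # Sketch only: repeated 10-fold cross-validation with averaged results.
        from sklearn.datasets import load_iris
        from sklearn.model_selection import RepeatedKFold, cross_val_score
        from sklearn.svm import SVC

        X, y = load_iris(return_X_y=True)   # placeholder data set
        learner = SVC()                     # placeholder learner (stands in for LibSVMLearner)

        # 6 repetitions of a 10-fold cross-validation, each repetition with different splits
        cv = RepeatedKFold(n_splits=10, n_repeats=6, random_state=2001)
        scores = cross_val_score(learner, X, y, cv=cv)
        print(scores.mean())                # the averaged performance estimate
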
    regards,

    Steffen

    PS: I guess I have found the first topic for the wiki  ;D
  • _paul_ Member Posts: 14 Contributor II
    Hi Steffen,
    "Yes, this is in general a recommended strategy. Kohavi suggests repeating a 10-fold cross-validation 6-10 times."
    Do you have a reference (paper or book) for me where I could find Kohavi's suggestion?
    "The global random generator with seed 2001 is initialized once every time you start a process."
    Maybe I got it wrong, but I think you meant "-1" here and not "2001", right? To my understanding, you always
    get the same pseudo-random numbers when you use a fixed value != -1. Using -1, on the other hand,
    might be a problem when you want reproducible results, since a different seed is generated every time.

    I think that the most suitable approach in combination with the IteratingPerformanceAverage operator would be a mix
    of both seed specifications: RapidMiner should perform the cross-validation 6-10 times with different seeds
    that are nevertheless specified statically. That way the results would be reproducible each time you run your
    process, while the validation result would still be an average over multiple seeds and therefore not
    biased towards one specific seed (a sketch of this idea follows below).

    Is there a way to tell RapidMiner to perform a cross-validation with a set of pre-defined seeds that are
    specified manually?
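
    A minimal sketch of this idea, again in scikit-learn terms rather than as a RapidMiner process (an assumption for illustration only; the seed list, data set and learner below are placeholders): one 10-fold cross-validation is run per hand-picked seed and the per-seed averages are averaged again. Because the seed list is fixed, every run reproduces the same number, yet the estimate is not tied to a single split.

        # Sketch only: hand-picked seeds, one 10-fold cross-validation per seed.
        from statistics import mean
        from sklearn.datasets import load_iris
        from sklearn.model_selection import KFold, cross_val_score
        from sklearn.svm import SVC

        X, y = load_iris(return_X_y=True)           # placeholder data set
        learner = SVC()                             # placeholder learner

        seeds = [11, 42, 101, 2001, 31337, 65537]   # manually defined example seeds
        per_seed_means = [
            cross_val_score(learner, X, y,
                            cv=KFold(n_splits=10, shuffle=True, random_state=s)).mean()
            for s in seeds
        ]
        print(mean(per_seed_means))                 # reproducible averaged performance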

    Regards,
    Paul
  • steffen Member Posts: 347 Maven
    Hello Paul

    First of all: -1 means that you use the global random generator, which (as specified in the preferences) is initialized with 2001.

    Then:
    The global random generator is initialized with 2001 every time a process is executed (by clicking the arrow button). The local generators, on the other hand, are initialized with the specified seed (!= -1) every time the operator where this seed has been specified is executed. Hence the results are always reproducible. (A small sketch of this difference follows below.)
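
    A tiny sketch of this distinction in plain Python (an analogy only, not RapidMiner internals): a generator seeded once per "process" run keeps producing new numbers across repeated operator executions within that run, while a generator re-seeded with a fixed local seed on every execution always produces the same numbers. Re-running the whole process reproduces both outputs.

        import random

        def run_process():
            # "global" generator: seeded once when the process starts
            global_rng = random.Random(2001)
            # an operator using local_random_seed = -1, executed twice:
            # it draws from the global generator, so the two executions differ
            print("global:", global_rng.random(), global_rng.random())
            # an operator with a fixed local seed, executed twice:
            # the generator is re-initialized each time, so both draws are identical
            print("local: ", random.Random(42).random(), random.Random(42).random())

        run_process()   # both calls print exactly the same lines -> reproducible
        run_process()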

    To use self-specified seeds inside IteratingPerformanceAverage, you can type

    %{a}
    as the seed argument (a RapidMiner macro, a powerful thing; see the tutorial.pdf for more details), which replaces the seed with the number of the current iteration (1, 2, 3, ...).

    I suggest you continue to play with the RapidMiner example processes to see what I mean. I hope I didn't increase your confusion :)

    Regarding Kohavi: here is the link to his Ph.D. thesis (http://ai.stanford.edu/~ronnyk/teza.pdf), where you can find a detailed discussion of the validation issue. Long text, but fun to read.

    hope this was helpful

    Steffen
  • _paul_ Member Posts: 14 Contributor II
    Hi Steffen,

    thank you for your help.

    What I meant by "non-reproducible results" was that using "-1" as the global and local seed would always
    yield different random numbers due to the system time, which usually changes when a process is
    executed multiple times.  :)

    Paul