"repeated holdout method and test/Traindata from X-Validation?"

Fred12 Member Posts: 344 Unicorn
edited June 2019 in Help

Hi,

Are there operators available in RM for the repeated holdout method (slide 24)?
http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-14-evaluation-and-credibility

Furthermore, I would like to know whether X-Validation allows extracting the training data and the test data of each round, so that performance can be tested separately on the test data and on the training data.

Best Answer

  • bhupendra_patil Employee-RapidMiner, Member Posts: 168 RM Data Scientist
    Solution Accepted

    Yes, you are right: you can restart from anywhere by changing the settings of "Optimize Parameters".

    However, when you restart, it does not know that models 1 through 50 were already generated, so the best model it delivers will be based only on the current cycle.

    Hence we need to build a separate process that combines the models built in the first run and the second run and compares the whole set of models.
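
The combining step described above can be illustrated outside RapidMiner in plain Python. This is only a minimal sketch, assuming each run left behind a mapping from the optimized parameter value (for example, depth) to a performance value. The function name `best_across_runs` and all numbers are hypothetical.

```python
def best_across_runs(*runs):
    """Merge {parameter_value: performance} results from several runs
    and return the best (parameter_value, performance) pair."""
    combined = {}
    for run in runs:
        # If a parameter value occurs in several runs, the later run wins.
        combined.update(run)
    best_param = max(combined, key=combined.get)
    return best_param, combined[best_param]


# Hypothetical results: depth -> accuracy.
first_run = {10: 0.81, 30: 0.86, 50: 0.84}   # run interrupted at depth 50
second_run = {60: 0.85, 80: 0.83}            # run restarted from depth 60

print(best_across_runs(first_run, second_run))
```

Because the comparison covers both runs, the best entry can come from the first run even though the second run never saw it.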

Answers

  • bhupendra_patil Employee-RapidMiner, Member Posts: 168 RM Data Scientist

    I think X-Validation does exactly that: it performs k-fold validation, repeating training and testing k times.

     

    The input ExampleSet is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing data set (i.e. input of the testing subprocess), and the remaining k − 1 subsets are used as training data set (i.e. input of the training subprocess). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations then can be averaged (or otherwise combined) to produce a single estimation. The value k can be adjusted using the number of validations parameter.

     

    You can also choose the kind of sampling to use via the sampling type parameter.

    Also see here how I am saving models; in a similar fashion you can store the data if needed.

     

    https://www.youtube.com/watch?v=4q629LRYByA
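
The k-fold partitioning described in the answer above can be sketched in plain Python to show how the training and test data of each round could be kept for separate evaluation. This is a minimal illustration outside RapidMiner, not the operator's implementation. The function name `k_fold_splits` is hypothetical, and shuffling before partitioning is an assumption.

```python
import random


def k_fold_splits(n_examples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # assumption: shuffled sampling
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # The remaining k - 1 folds form the training data of this round.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Keep the data of every round so that train and test performance
# can be measured separately afterwards.
rounds = list(k_fold_splits(n_examples=10, k=5))
for round_no, (train, test) in enumerate(rounds, start=1):
    print(round_no, sorted(train), sorted(test))
```

Each example index appears in exactly one test fold, which is the property that distinguishes cross-validation from repeated random splits.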

     

  • Fred12 Member Posts: 344 Unicorn

    X-Validation doesn't do that: it partitions the data into subsets of equal size, where each partition in each run has not seen the other partitions.

    Hold-out just draws two different subsets (test/train) by random composition on each run, so there is no strict cross-validation concept behind it...

  • bhupendra_patil Employee-RapidMiner, Member Posts: 168 RM Data Scientist

    Not sure if this works for your case.

    Use the "Loop and Average" operator; you can specify the number of iterations by changing the loop value.

    Then, inside the loop, use a split to get your training and test data sets, build and test the models, and it will average the performance for you.

    This operator does not deliver the model, though.
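
The loop-split-average idea above corresponds to repeated holdout, which can be sketched in plain Python as follows. This is a minimal illustration and not RapidMiner code. The names `repeated_holdout` and `majority_class_accuracy`, the 70/30 split ratio, and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean


def repeated_holdout(examples, evaluate, repetitions=10, train_ratio=0.7, seed=0):
    """Draw a fresh random train/test split in every repetition and
    average the performance returned by evaluate(train, test)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(evaluate(train, test))
    return mean(scores), scores


def majority_class_accuracy(train, test):
    """Toy stand-in for 'build and test a model':
    always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)


# Toy data: (value, label) pairs.
data = [(x, "pos" if x % 3 else "neg") for x in range(30)]
average, per_run = repeated_holdout(data, majority_class_accuracy, repetitions=5)
print(average, per_run)
```

Unlike cross-validation, the test sets of different repetitions may overlap, and, as with the operator described above, only the averaged performance is returned, not a model.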

  • Fred12 Member Posts: 344 Unicorn

    One more question about the YouTube video you mentioned:

    What happens if the process was interrupted at depth 50?

    If you generate and set the macro with the Optimize Parameters operator and you get depth 60, but performance and model do not exist for depth > 50, what happens?

    Does it throw an error and stop the process? Or could you start model building from depth 60 onward and continue making your models from there?

  • Fred12 Member Posts: 344 Unicorn

    OK, but just one more question...

    I sometimes test an SVM with the C and gamma parameters, and sometimes that means 10000 combinations.

    Is it really recommended to save 10000 performances and models into the repository? Are you serious? I mean, is that really the only possible approach?

    Or what would be a better solution? It's a pity that there is no operator to retrieve a process that remembers all the parameter settings used so far, including the settings at the moment the process broke down, so that it could continue from that offset instead of starting everything again. But maybe that's impossible, as a computer is not aware of itself when a process actually breaks down or stops, I guess...

    And is there no timer function to save intermediate results?
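
The idea of continuing from an offset can be illustrated outside RapidMiner with a small checkpoint file that stores only the score of each parameter combination, not the models. This is a minimal sketch in plain Python and says nothing about what RapidMiner itself offers. The function `resumable_grid_search`, the scoring function `toy_score`, and the file name are hypothetical.

```python
import itertools
import json
import os
import tempfile


def resumable_grid_search(evaluate, grid, checkpoint_path):
    """Evaluate every parameter combination, skipping those whose score
    is already stored in the checkpoint file from an earlier run."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        key = json.dumps(params, sort_keys=True)
        if key in results:
            continue  # finished before the interruption
        results[key] = evaluate(**params)
        # Write to a temporary file first so an interruption during
        # saving cannot corrupt the existing checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    best_key = max(results, key=results.get)
    return json.loads(best_key), results[best_key]


def toy_score(C, gamma):
    """Hypothetical stand-in for training and validating an SVM."""
    return -abs(C - 10) - abs(gamma - 0.1)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "grid_checkpoint.json")
    grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
    print(resumable_grid_search(toy_score, grid, path))  # evaluates all 6 combinations
    print(resumable_grid_search(toy_score, grid, path))  # reuses the checkpoint
```

Only the winning combination then needs to be retrained to obtain a model. Rewriting the whole file after every combination keeps the sketch short; for 10000 combinations, appending one line per result would scale better.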