Repeated holdout method and test/train data from X-Validation?
hi,
Are there operators available in RapidMiner for the repeated holdout method (slide 24)?
http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-14-evaluation-and-credibility
Furthermore, I would like to know whether X-Validation somehow allows extracting the training data and the test data of each round, so that performance can be tested separately on the test data and the training data.
Best Answer
bhupendra_patil Employee-RapidMiner, Member Posts: 168 RM Data Scientist
Yes, you are right: you can restart from anywhere by changing the settings of "Optimize Parameters".
However, when you restart, it obviously does not know that models 1 through 50 were already generated, so the best model it delivers will be based only on the current cycle.
Hence we need to build a separate process to combine the models built in the first run and the second run and compare the whole set of models.
Answers
I think X-Validation does exactly that: it performs k-fold validation, where training and testing are repeated k times.
The input ExampleSet is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing data set (i.e. input of the testing subprocess), and the remaining k − 1 subsets are used as training data set (i.e. input of the training subprocess). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations then can be averaged (or otherwise combined) to produce a single estimation. The value k can be adjusted using the number of validations parameter.
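The partitioning described above can be sketched in a few lines of Python. This is an illustrative sketch of the k-fold idea, not RapidMiner's actual implementation; the function name and list-based "ExampleSet" are my own assumptions:

```python
# Illustrative sketch of k-fold partitioning (not RapidMiner's code):
# each of the k subsets is used exactly once as the test set; the
# remaining k-1 subsets form the training set.
def k_fold_splits(examples, k):
    folds = [examples[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Across the k rounds, every example appears in exactly one test set,
# and no example is ever in train and test at the same time.
for train, test in k_fold_splits(list(range(10)), k=5):
    assert not set(train) & set(test)
```

The k performance values obtained on the k test sets would then be averaged into the single estimate the operator reports.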
You can also decide what kind of sampling to use via the sampling type parameter.
Also see here how I save models; in a similar fashion you can store the data if needed.
https://www.youtube.com/watch?v=4q629LRYByA
X-Validation doesn't do that: it partitions the data into equally sized subsets, and in each run the test partition has not been seen during training.
Repeated holdout just draws two different subsets (test/train) at random on each run, so there is no strict cross-validation concept behind it...
Not sure if this works for your case
Use the "Loop and Average" operator; you can specify how many iterations by changing the value of the iterations parameter.
Then, inside the loop, use a Split operator to get your training and test data sets, build and test your models, and it will average the performance values for you.
This operator does not deliver the model, though.
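For comparison, here is a minimal Python sketch of what such a loop amounts to: repeat a random train/test split n times, evaluate each split, and average the performances. The function names and the `evaluate` callback are assumptions for illustration, not RapidMiner API:

```python
import random

# Hedged sketch of repeated holdout ("Loop and Average" with a split
# inside): n random train/test splits, one performance value per split,
# averaged at the end. Note it returns only the averaged performance,
# not a model, matching the limitation mentioned above.
def repeated_holdout(examples, n_iterations, train_ratio, evaluate, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducible splits
    scores = []
    for _ in range(n_iterations):
        shuffled = examples[:]
        rng.shuffle(shuffled)  # random composition on every run
        cut = int(len(shuffled) * train_ratio)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

Usage would pass in whatever training/evaluation routine your process wraps, e.g. `repeated_holdout(data, 10, 0.7, my_evaluate)`.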
One more question about the YouTube video you mentioned:
What happens if the process was interrupted at depth 50?
If you generate and set the macro via the Optimize Parameters operator and you get depth 60, but performance and model do not exist for depth > 50, what happens?
Does it throw an error and stop the process? Could you continue building models from depth 60 onwards, or does it interrupt and throw an error?
OK, but just one more question...
I sometimes test an SVM with C and gamma parameters, and sometimes that means 10000 combinations.
Is it really recommended to save 10000 performances and models into the repository? Are you serious? I mean, is that really the only possible approach/solution?
Or what would be a better solution? It's a pity that there is no operator to retrieve a process that remembers all currently used parameter settings, and the actual settings at the moment the process broke down, so it could continue from an offset instead of starting everything again... but maybe that's impossible, as a computer is not aware of itself when a process actually breaks down or stops, I guess...
And is there no timer function to save intermediate results?