Stratification: How to get the same number of examples for each class?
I have a data set with two labels: label A (6000 items) and label B (500 items).
I want to run a 10-fold cross-validation, but with sampling. For example, the 1st fold has 600 examples of label A and 50 of label B. We want to sample 50 examples of label A and create a new 1st fold with 50 of label A and 50 of label B. The same process is applied to the other 8 training folds, and we then use the 9 sampled folds together for training and the 1 remaining fold of non-sampled data for testing. The process should loop through the entire data set and collect the performance.
So far I have only been able to do this one fold at a time, which is time-consuming. I was hoping to set up a process that does it automatically.
Thanks in advance for your support
John Quest
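The procedure described in the question can be sketched outside RapidMiner as follows. This is only a minimal, tool-neutral illustration in plain Python, not a RapidMiner process: the function names (`stratified_folds`, `downsample`, `balanced_cv_splits`) are hypothetical, and the class counts (6000 of label A, 500 of label B) are taken from the question. It assumes that downsampling the 9 training folds together is acceptable, which yields the same class counts as sampling each fold separately.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split example indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def downsample(indices, labels, rng):
    """Keep as many examples of every class as the smallest class has."""
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    return balanced


def balanced_cv_splits(labels, k=10, seed=0):
    """Yield (balanced training indices, untouched test indices) per fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # Only the training part is balanced; the test fold stays non-sampled.
        yield downsample(train, labels, rng), test


# Class counts from the question: 6000 of label A, 500 of label B.
labels = ["A"] * 6000 + ["B"] * 500
splits = list(balanced_cv_splits(labels, k=10))
```

Each of the 10 splits then pairs a balanced training set with a test fold that keeps the original class distribution; a learner and a performance measure would be applied per split and the results averaged.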
There is no need to repeat your question. What is the difference between doing what you describe and using standard XValidation with stratified sampling, applied on an example set with 50% label A and 50% label B? If you post your XML people will take more interest.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="386" width="681">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="38" y="77">
<parameter key="repository_entry" value="../data talbe/157000_85"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="75">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="back_freq|back_avg_distance|candidate_len|freq_keyword|snippets|suppE|suppC|keyword_id_ch|correctness|roverd|ranking|dis|lift|front_freq"/>
</operator>
<operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="75">
<process expanded="true" height="431" width="373">
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="112" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=wrong"/>
</operator>
<operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="246" y="30">
<parameter key="sample_size" value="5661"/>
</operator>
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="112" y="165">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=correct"/>
</operator>
<operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="246" y="165"/>
<operator activated="true" class="naive_bayes" expanded="true" height="76" name="Naive Bayes" width="90" x="246" y="300"/>
<connect from_port="training" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="414" width="373">
<operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="51" y="43">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="227" y="44">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_port="result 2"/>
<connect from_op="Validation" from_port="training" to_port="result 1"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
This is clearly going far beyond the scope of this board (and actually also of this forum). A process like this isn't made within a minute.
However, I have created a process for the desired task and uploaded it with the Community Extension of RapidMiner under the name "Same Number of Examples per Class (Stratification; Loops and Macros)". Just download and install the Community Extension and search for the process (search this forum for more information; some information can also be found in my signature below).
Cheers,
Ingo
You beat me to it! Drat! Can we not have a badge/smiley pointing folks there, lest we have to repeat ourselves (this exact question of balancing data comes up repeatedly)?
Anyway, I moved the discussion into this board here and also made it sticky so that we can easily link to it in the future.
Cheers,
Ingo
I think this covers the points you made. I must say I found the placement of the 'Append' operator a challenge; still, it does show the world of collections at work.
John
I am still having some problems understanding the last XML post by haddock; I cannot connect the macros to two outputs.
My question is still about my XML post of 10 June. I have made it simpler and am only looking at the problem itself this time; please see the attached XML code.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="396" width="779">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
<parameter key="repository_entry" value="//Project CE/cep8/data talbe/157000_85"/>
</operator>
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="179" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=wrong"/>
</operator>
<operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="165">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="correctness=correct"/>
</operator>
<operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="380" y="30">
<parameter key="sample_size" value="1662"/>
</operator>
<operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="514" y="120"/>
<connect from_op="Retrieve" from_port="output" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
We want the operator "sample_stratified" to take exactly as many examples as the operator "Filter Examples" (value="correctness=correct") returns, instead of a fixed sample size. Any ideas? Thanks in advance for your support.
John
Cheers,
Ingo
Sorry for this question, but how do I access the files uploaded with the Community Extension? Thanks.
Best regards
John
No problem. You can find some explanations here in the forum:
- Look here: http://rapid-i.com/rapidforum/index.php/topic,1992.0.html (first hit in forum search for "Community Extension" by the way...)
- Or here: http://rapid-i.com/rapidforum/index.php/topic,2254.msg8888.html#msg8888
- Follow the description and the link in my signature (yes, the small text under each of my posts)
The bottom line is: you can simply download and install our Community Extension via the Update and Installation option in the Help menu and then activate the "myExperiment Browser" in the View menu of RapidMiner. In this view, you can search for the process stated above and download it directly into RapidMiner with a single click on "Open".
Cheers,
Ingo
Thanks, and sorry for the late reply. Sometimes it is difficult to get back to my posts: besides "show new replies", the only way I can find my posts is through my profile. Could you tell me another way? Thanks.
I found your process named "same number of examples per class". I cannot understand what "extract macro" and "loop process" do, since there is no output after "loop process". Thanks in advance for your support.
John Quest
(I thought we were already at the stage of using "John" and "Ingo") What exactly do you not understand? The first "Loop Values" operator is only used for calculating the size of the smallest class and storing this size in a macro.
Cheers,
Ingo (Mierswa)
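The two-pass idea described in this reply (first determine the size of the smallest class and store it, then sample every class down to that size) can be sketched in plain Python. This is a hedged illustration only, not the uploaded RapidMiner process: the helper names `smallest_class_size` and `balance` are hypothetical, the attribute name `correctness` and its values come from the XML posted above, and the example counts (6000 and 500) are borrowed from the original question rather than from the real data.

```python
import random
from collections import Counter


def smallest_class_size(labels):
    """First pass: count every class and return the smallest count."""
    return min(Counter(labels).values())


def balance(examples, label_of, seed=0):
    """Second pass: draw the same number of examples from every class."""
    rng = random.Random(seed)
    # Plays the role of the macro: computed once, reused for every class.
    n_min = smallest_class_size([label_of(e) for e in examples])
    balanced = []
    for value in sorted({label_of(e) for e in examples}):
        subset = [e for e in examples if label_of(e) == value]  # filter
        balanced.extend(rng.sample(subset, n_min))              # sample
    return balanced                                             # appended result


# Illustrative data only: 6000 "wrong" and 500 "correct" examples.
examples = [{"id": i, "correctness": "wrong"} for i in range(6000)]
examples += [{"id": i, "correctness": "correct"} for i in range(6000, 6500)]
balanced = balance(examples, lambda e: e["correctness"])
```

Because the sample size is derived from the data instead of being typed in as a fixed number, the same code keeps working when the class counts change, which is the point of storing the size in a macro.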
Thanks. I may modify it into something more interesting and upload it to the community. I may need your help if I run into problems; thanks in advance for your support.
Best Regards
John