Stratification: How to get the same number of examples for each class?

JohnQuest · June 2010

I have a data set with two labels: label A (6000 items) and label B (500 items).
I want to run a 10-fold cross-validation, but with sampling. For example, the 1st fold has 600 examples of label A and 50 of label B. We want to sample 50 examples of label A and create a new 1st fold with 50 of label A and 50 of label B. The same process is applied to the other 8 training folds, so that we use 9 sampled folds together for training and 1 fold of non-sampled data for testing. The process loops through the entire data set and collects the performance.

So far I am able to do the above process one fold at a time, which is time-consuming. I was hoping to set up a process that does this automatically.

Thanks in advance for your support

John Quest
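
For readers who want to see the idea outside RapidMiner, the procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a RapidMiner process: the function names (`stratified_folds`, `downsample`, `balanced_cv_splits`) are hypothetical, the data is a synthetic label list with the 6000/500 split from the question, and the training part is downsampled after merging the 9 folds rather than fold by fold, which gives the same class counts.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Assign example indices to k folds, keeping class proportions per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds

def downsample(indices, labels, rng):
    """Sample every class down to the size of the smallest class."""
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, target))
    return balanced

def balanced_cv_splits(labels, k=10, seed=0):
    """Yield (train, test) index lists: balanced training, untouched test fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield downsample(train, labels, rng), test

# Synthetic labels matching the sizes in the question (an assumption for the demo).
labels = ["A"] * 6000 + ["B"] * 500
splits = list(balanced_cv_splits(labels))

train, test = splits[0]
print(sum(labels[i] == "A" for i in train), sum(labels[i] == "B" for i in train))  # 450 450
print(sum(labels[i] == "A" for i in test), sum(labels[i] == "B" for i in test))    # 600 50
```

The point of the design is that only the training side is balanced. Each test fold keeps the original 600:50 ratio, so the measured performance reflects the real class distribution.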

haddock · June 2010

Hi,

There is no need to repeat your question. What is the difference between doing what you describe and using standard XValidation with stratified sampling, applied to an example set with 50% label A and 50% label B? If you post your XML, people will take more interest.

JohnQuest · June 2010

My setup is as follows. I am wondering how to make the "Sample" operator set its sample size automatically, according to the number of examples coming out of the "Filter Examples" operator that uses the parameter setting correctness=correct.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="386" width="681">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="38" y="77">
        <parameter key="repository_entry" value="../data talbe/157000_85"/>
      </operator>
      <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="back_freq|back_avg_distance|candidate_len|freq_keyword|snippets|suppE|suppC|keyword_id_ch|correctness|roverd|ranking|dis|lift|front_freq"/>
      </operator>
      <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="75">
        <process expanded="true" height="431" width="373">
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="112" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="correctness=wrong"/>
          </operator>
          <operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="246" y="30">
            <parameter key="sample_size" value="5661"/>
          </operator>
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="112" y="165">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="correctness=correct"/>
          </operator>
          <operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="246" y="165"/>
          <operator activated="true" class="naive_bayes" expanded="true" height="76" name="Naive Bayes" width="90" x="246" y="300"/>
          <connect from_port="training" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="414" width="373">
          <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="51" y="43">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="227" y="44">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_port="result 2"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

IngoRM · June 2010

Hi,

this is clearly going far beyond the scope of this board (and actually also of this forum). A process like this isn't made within a minute.

However, I have created a process for the desired task and uploaded it with the Community Extension of RapidMiner under the name "Same Number of Examples per Class (Stratification; Loops and Macros)". Just download and install the Community Extension and search for the process (search this forum for more information; some information can also be found in my signature below).

Cheers,
Ingo

haddock · June 2010

Greetings O Pointy One,

You beat me to it! Drat! Can we not have a badge/smiley pointing folks there, lest we have to repeat ourselves? (This exact question of balancing data comes up repeatedly.)

IngoRM · June 2010

I might have been faster, but the solution can still be optimized ;D A good idea would be to extract the label automatically without having the user define it via a macro. The second thing is that I lose one example in the minority class ::)

Anyway, I moved the discussion into this board and also made it sticky so that we can easily link to it in the future.

Cheers,
Ingo

haddock · June 2010

Hi,

I think this covers the points you made. I must say I found the placement of the 'Append' operator a challenge; still, it does show the world of collections at work.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="335" width="791">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="120">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="120">
        <parameter key="macro" value="exs"/>
      </operator>
      <operator activated="true" class="loop_values" expanded="true" height="76" name="Loop Values" width="90" x="313" y="120">
        <parameter key="attribute" value="class"/>
        <process expanded="true" height="453" width="809">
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="141" y="94">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="class=%{loop_value}"/>
          </operator>
          <operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro (2)" width="90" x="313" y="75">
            <parameter key="macro" value="subexs"/>
          </operator>
          <operator activated="true" class="generate_macro" expanded="true" height="76" name="Generate Macro" width="90" x="447" y="75">
            <list key="function_descriptions">
              <parameter key="exs" value="min(%{subexs},%{exs})"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/>
          <connect from_op="Extract Macro (2)" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Generate Macro" from_port="through 1" to_port="out 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="loop_collection" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="120">
        <parameter key="unfold" value="true"/>
        <parameter key="parallelize_iteration" value="true"/>
        <process expanded="true" height="353" width="809">
          <operator activated="true" class="sample" expanded="true" height="76" name="Sample" width="90" x="269" y="53">
            <parameter key="sample_size" value="%{exs}"/>
          </operator>
          <connect from_port="single" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_single" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" expanded="true" height="76" name="Append" width="90" x="581" y="120"/>
      <connect from_op="Retrieve" from_port="output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Loop Values" to_port="example set"/>
      <connect from_op="Loop Values" from_port="out 1" to_op="Loop Collection" to_port="collection"/>
      <connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
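
As far as can be read from the XML above, the process first stores an example count in the macro `exs`, then loops over the values of the `class` attribute, lowers `exs` to the smallest per-class count, and finally samples `exs` examples from each class before appending the pieces. The same logic can be sketched in Python as follows. This is a hedged illustration only: `balance_classes` is a hypothetical helper, the rows are toy dictionaries rather than the Sonar data, and the sketch is not a translation of RapidMiner's operators.

```python
import random
from collections import defaultdict

def balance_classes(rows, label_key, seed=0):
    """Return a sample with the same number of rows for every class value."""
    rng = random.Random(seed)

    # Counterpart of "Loop Values" + "Filter Examples": one subset per class value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    # Counterpart of the macro update exs = min(subexs, exs),
    # starting from the total number of rows.
    smallest = len(rows)
    for members in groups.values():
        smallest = min(len(members), smallest)

    # Counterpart of "Loop Collection" + "Sample" + "Append".
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced

# Toy data (an assumption for the demo): 7 rows of one class, 3 of the other.
rows = ([{"class": "Rock", "id": i} for i in range(7)]
        + [{"class": "Mine", "id": i} for i in range(3)])
balanced = balance_classes(rows, "class")
print(sum(r["class"] == "Rock" for r in balanced),
      sum(r["class"] == "Mine" for r in balanced))  # 3 3
```

Because the class values are discovered from the data, nothing about the label values has to be typed in by hand, and the sketch works for more than two classes.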

JohnQuest · June 2010

Thanks, I will try it out

John

JohnQuest · June 2010

Dear All,
I am still having some problems understanding the last XML posted by haddock; I cannot connect the macros to two outputs.
My question is still about my XML post of 10 June. I have made it simpler and am only looking at the problem itself this time. Please see the attached XML code.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="396" width="779">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
        <parameter key="repository_entry" value="//Project CE/cep8/data talbe/157000_85"/>
      </operator>
      <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="179" y="30">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="correctness=wrong"/>
      </operator>
      <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="165">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="correctness=correct"/>
      </operator>
      <operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="380" y="30">
        <parameter key="sample_size" value="1662"/>
      </operator>
      <operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="514" y="120"/>
      <connect from_op="Retrieve" from_port="output" to_op="Filter Examples (2)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

We want the "Sample (Stratified)" operator to take exactly as many examples as come out of the "Filter Examples" operator with parameter_string value="correctness=correct". Any ideas? Thanks in advance for your support.

John

IngoRM · June 2010

Did you try the process I have uploaded with the Community Extension? Could help here...

Cheers,
Ingo

JohnQuest · June 2010

Dear Ingo
Sorry for this question, but how do I access the files uploaded with the Community Extension? Thanks.

Best regards

John

IngoRM · June 2010

Hi,

no problem. You can find some explanations here in the forum:

Look here: http://rapid-i.com/rapidforum/index.php/topic,1992.0.html (first hit in forum search for "Community Extension" by the way...)
Or here: http://rapid-i.com/rapidforum/index.php/topic,2254.msg8888.html#msg8888
Follow the description and the link in my signature (yes, the small text under each of my posts)

The bottom line is: you can simply download and install our Community Extension via the Update and Installation option in the Help menu, and afterwards activate the "myExperiment Browser" in the View menu of RapidMiner. In this view, you can search for the process mentioned above and download it directly into RapidMiner with a single click on "Open".

Cheers,
Ingo

JohnQuest · June 2010

Dear Ingo Mierswa
Thanks, and sorry for the late reply. Sometimes it is difficult to get back to my posts: apart from "show new replies", the only way I can find my posts is through my profile. Could you tell me another way? Thanks.

I found your process named "same number of examples per class". I cannot understand what "extract macro" and "loop process" do, since there is no output after "loop process". Thanks in advance for your support.

John Quest

IngoRM · June 2010

Dear John Quest,

(I thought we were already at the stage of using "John" and "Ingo".)

"I found your process named "same number of examples per class". I cannot understand what "extract macro" and "loop process" do, since there is no output after "loop process". Thanks in advance for your support."

What exactly do you not understand? The first Loop Values operator is only used to calculate the size of the smallest class and to store this size in a macro.

Cheers,
Ingo (Mierswa)

JohnQuest · July 2010

Dear Ingo

Thanks. I may modify it into something more interesting and upload it to the community. I may need your help if I run into problems. Thanks in advance for your support.

Best Regards

John
