How to log the number of positive and negative examples?

DrGary · January 2010

When you stop the GUI on an ExampleSet, you can look at the "label" attribute row to see how many positive and negative examples there are in the dataset. But I want to run from the command line and see the dataset class counts in the log.

The DataStatistics operator will write dataset info to the log, but it doesn't include the counts of the label classes. You can add in a DataMacroDefinition operator, but it only offers the total ExampleSet size, not the class counts.

Is there a way to log the class sizes?

land · January 2010

Hi,
you could first filter the example set according to the label value and then count the examples using DataStatistics or DataMacroDefinition.
For this purpose I recommend using a ValueIterator, which provides each value of an attribute as a macro, so you can filter the examples by that value.

Greetings,
Sebastian

DrGary · January 2010

Sebastian, thanks for the suggestion. Here's what I came up with:


        <operator name="Count class sizes" class="OperatorChain" expanded="yes">
            <operator name="ValueIterator" class="ValueIterator" expanded="no">
                <parameter key="attribute" value="target_"/>
                <parameter key="iteration_macro" value="target_value"/>
                <operator name="ExampleFilter" class="ExampleFilter">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <parameter key="parameter_string" value="target_=%{target_value}"/>
                </operator>
                <operator name="DataMacroDefinition" class="DataMacroDefinition">
                    <parameter key="macro" value="class_size"/>
                </operator>
                <operator name="echo the target value" class="CommandLineOperator">
                    <parameter key="command" value="echo &quot; class &#39;%{target_value}&#39; size = %{class_size}&quot;"/>
                    <parameter key="log_stderr" value="false"/>
                </operator>
            </operator>
            <operator name="ExampleSetMerge" class="ExampleSetMerge">
            </operator>
        </operator>

Seems to work pretty well. Is there a way to keep the original ExampleSet and drop the new ones instead of merging the new ones?

I pushed it with large datasets, and it doesn't seem to use as much memory as you might expect from creating new ExampleSets. I assume that's because views into the current ExampleSet are being created and rows are not duplicated.

Still, it seems like a lot of overhead for a simple count...

land · January 2010

Hi,
you could use IOStorer and IORetriever to store the original ExampleSet and retrieve it later if it cannot be passed along the usual way. IOMultiplier and IOConsumer might help as well.
In general I would recommend switching to RM 5.0 RC, because the flow layout gives you a much more intuitive way of handling such problems.

Greetings,
Sebastian

IngoRM · January 2010

Hi,

Hmm, maybe I got it wrong, but why not simply aggregate and count? Use the label as the group-by attribute and a count of the label as the aggregation attribute. Just one operator and you are done.

Here is the process for RM 5 RC (based on the Iris sample data set):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="280" width="413">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="112" y="165">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="aggregate" expanded="true" height="76" name="Aggregate" width="90" x="246" y="165">
        <list key="aggregation_attributes">
          <parameter key="label" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="label"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
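
For anyone who wants to sanity-check the counts outside RapidMiner, the group-by-and-count idea can be sketched in a few lines of plain Python. This is only an illustration of the logic, not RapidMiner code; the `count_labels` helper, the `label` key, and the sample rows are made up for the example.

```python
from collections import Counter


def count_labels(rows, label_key="label"):
    """Return a mapping from each label value to the number of rows carrying it.

    Hypothetical helper: it mirrors grouping by the label and counting.
    """
    return Counter(row[label_key] for row in rows)


# Made-up rows standing in for an ExampleSet.
rows = [
    {"a1": 5.1, "label": "positive"},
    {"a1": 4.9, "label": "negative"},
    {"a1": 6.3, "label": "positive"},
]

# Print one line per class, similar to the echo output earlier in the thread.
for value, size in sorted(count_labels(rows).items()):
    print(f"class '{value}' size = {size}")
```

The counts come from a single pass over the data, which is why one aggregation step can replace the iterate, filter, and count loop.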

Cheers,
Ingo

ui3o · March 2010

Hi,

can anyone help me set up a process that filters out examples for which an attribute has a rarely occurring value? The Aggregate operator (count) calculates the occurrences as described above, but how can I use the result to filter?

Thanks for advice.

Greetings,

ui3o

ui3o · April 2010

Hi there,

does anybody have an idea on that?
Thanks for any help.

Greetings,

ui3o

IngoRM · April 2010

Hey, usually the creation of a process like this is more a consulting task than a simple example process for technical support. However, I just felt like "would be funny to create a nice looping process before the holidays" and here we are:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="341" width="614">
      <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
        <parameter key="target_function" value="single gaussian cluster"/>
        <parameter key="number_examples" value="500"/>
        <parameter key="number_of_attributes" value="3"/>
      </operator>
      <operator activated="true" class="discretize_by_bins" expanded="true" height="94" name="Discretize" width="90" x="179" y="30">
        <parameter key="number_of_bins" value="5"/>
        <parameter key="range_name_type" value="short"/>
      </operator>
      <operator activated="true" class="remember" expanded="true" height="60" name="Remember" width="90" x="313" y="30">
        <parameter key="name" value="filtered_data"/>
        <parameter key="io_object" value="ExampleSet"/>
      </operator>
      <operator activated="true" class="loop_attributes" expanded="true" height="60" name="Loop Attributes" width="90" x="447" y="30">
        <process expanded="true" height="603" width="626">
          <operator activated="true" class="aggregate" expanded="true" height="76" name="Aggregate" width="90" x="45" y="30">
            <list key="aggregation_attributes">
              <parameter key="label" value="count"/>
            </list>
            <parameter key="group_by_attributes" value="%{loop_attribute}"/>
          </operator>
          <operator activated="true" class="sort" expanded="true" height="76" name="Sort" width="90" x="179" y="30">
            <parameter key="attribute_name" value="count(label)"/>
            <parameter key="sorting_direction" value="decreasing"/>
          </operator>
          <operator activated="true" class="filter_example_range" expanded="true" height="76" name="Filter Example Range" width="90" x="313" y="30">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="3"/>
            <parameter key="invert_filter" value="true"/>
          </operator>
          <operator activated="true" class="loop_values" expanded="true" height="60" name="Loop Values" width="90" x="447" y="30">
            <parameter key="attribute" value="%{loop_attribute}"/>
            <process expanded="true" height="603" width="626">
              <operator activated="true" class="recall" expanded="true" height="60" name="Recall (2)" width="90" x="45" y="30">
                <parameter key="name" value="filtered_data"/>
                <parameter key="io_object" value="ExampleSet"/>
              </operator>
              <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
                <parameter key="condition_class" value="attribute_value_filter"/>
                <parameter key="parameter_string" value="%{loop_attribute} = %{loop_value}"/>
                <parameter key="invert_filter" value="true"/>
              </operator>
              <operator activated="true" class="remember" expanded="true" height="60" name="Remember (2)" width="90" x="313" y="30">
                <parameter key="name" value="filtered_data"/>
                <parameter key="io_object" value="ExampleSet"/>
              </operator>
              <connect from_op="Recall (2)" from_port="result" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Remember (2)" to_port="store"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Sort" to_port="example set input"/>
          <connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Loop Values" to_port="example set"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="recall" expanded="true" height="60" name="Recall" width="90" x="447" y="120">
        <parameter key="name" value="filtered_data"/>
        <parameter key="io_object" value="ExampleSet"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Discretize" to_port="example set input"/>
      <connect from_op="Discretize" from_port="example set output" to_op="Remember" to_port="store"/>
      <connect from_op="Remember" from_port="stored" to_op="Loop Attributes" to_port="example set"/>
      <connect from_op="Recall" from_port="result" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The first operators simply create a Gaussian-distributed data set and discretize it to create "seldom" values. You will of course have to adapt some of the parameters for your concrete data set.
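
The filtering logic of the process above can also be sketched in plain Python, which may help when adapting the parameters. This is a rough illustration under stated assumptions, not RapidMiner code: the `drop_rare_values` helper and the sample data are hypothetical, `keep_top=3` corresponds to the example range 1 to 3 in the process, and ties between equally frequent values are broken by order of first appearance.

```python
from collections import Counter


def drop_rare_values(rows, attributes, keep_top=3):
    """Drop rows whose value for any attribute is not among its keep_top most frequent values.

    Hypothetical helper: value counts are taken on the original rows,
    while the filtering accumulates from one attribute to the next.
    """
    kept = list(rows)
    for attr in attributes:
        # Count how often each value of this attribute occurs.
        counts = Counter(row[attr] for row in rows)
        # The keep_top most frequent values are the ones to keep.
        frequent = {value for value, _ in counts.most_common(keep_top)}
        # Remove every remaining row that holds a rarer value.
        kept = [row for row in kept if row[attr] in frequent]
    return kept


# Made-up data: value "d" is the rare one.
rows = [{"att1": v} for v in ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 1]
filtered = drop_rare_values(rows, ["att1"])
print(len(rows), "->", len(filtered))
```

A threshold on the count (for example, drop values seen fewer than N times) would be a natural variation on keeping a fixed number of top values.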

Cheers and happy holidays,
Ingo

ui3o · May 2010

Ingo,

thanks a lot! I didn't know that my question was more than a matter of setting the right parameter in the right operator ... great work, and thanks again for your effort.

Best Regards & Viele Grüße

ui3o


Altair RapidMiner Community
