calculating statistics

TheBear Member Posts: 18 Maven
edited November 2018 in Help
Hello,
I am new to RapidMiner. So far I am not quite sure whether I have understood the concept
or the syntax correctly.

My data set consists of some instances (by "instances" I mean the rows of my spreadsheet)
and several attributes (columns). What I need to do is condense the data,
e.g. calculate the mean and deviation of attributes 3-7 for each instance.
(For example: let's say I have a set of process parameters X describing my process, and
I measure some output characteristic several times: O1, O2, O3, O4.
Now I want to investigate O further, which is characterised by the mean of O1-O4.)

I found the FeatureGeneration operator, which might be used for that purpose, but its syntax is
not really easy to use (e.g. there is no function for mean or deviation).

Is there any other operator or operator chain that is better suited to computing statistics within instances?

Answers

  • steffen Member Posts: 347 Maven
    Hello and welcome to RapidMiner

    I suggest using the "Aggregation" operator.
    Example: calculating the average of attribute "a1" of the iris data set (available with RapidMiner), grouped by each value of the class label.
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="iris.aml"/>
        </operator>
        <operator name="Aggregation" class="Aggregation">
            <parameter key="aggregation_attribute" value="a1"/>
            <parameter key="group_by_attribute" value="label"/>
            <parameter key="keep_example_set" value="false"/>
        </operator>
    </operator>
    In combination with the "ParameterIteration" operator (please use the CVS version) and the "ExampleSetJoin" operator, you can calculate the average for all attributes of a data set.

    hope this was helpful

    Steffen

    PS: I will add an example for the second suggestion as soon as my CVS update is complete  ;)



    ...which is not possible  :(
    @RapidMiner-Team:

    I got a
    java.lang.ClassCastException: com.rapidminer.parameter.ParameterTypeStringCategory cannot be cast to com.rapidminer.parameter.ParameterTypeCategory
    Here is the slightly changed setup; the error occurred while moving "aggregation_attribute" from "Parameters" to "SelectedParameters".
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="iris.aml"/>
        </operator>
        <operator name="ParameterIteration" class="ParameterIteration" expanded="yes">
            <list key="parameters">
            </list>
            <operator name="Aggregation" class="Aggregation">
                <parameter key="aggregation_attribute" value="a1"/>
                <parameter key="group_by_attribute" value="label"/>
                <parameter key="keep_example_set" value="false"/>
            </operator>
        </operator>
    </operator>
    I downloaded the CVS version 40 minutes ago and ran the Ant build script with default settings before starting the GUI via RapidMinerGUI.bat.
  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Yes, the "Aggregation" operator (possibly in combination with ExampleSetJoin) should be the solution. We just improved Aggregation so that it can handle multiple groups and also multiple value attributes, even with different aggregation functions. Here is an example on the iris data set calculating the average of the four attributes:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="sample/data/iris.aml"/>
        </operator>
        <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
            <parameter key="name" value="label"/>
        </operator>
        <operator name="Aggregation" class="Aggregation">
            <list key="aggregation_attributes">
              <parameter key="a1" value="average"/>
              <parameter key="a2" value="average"/>
              <parameter key="a3" value="average"/>
              <parameter key="a4" value="average"/>
            </list>
            <parameter key="keep_example_set" value="false"/>
        </operator>
    </operator>
    You can now join the resulting example set with your original set if desired. By the way, we have just released version 4.2, so you no longer need to get it via CVS. We will add the link to the new release on our website within the next few hours.


    @Steffen:

    I just tested it myself but did not get the class cast exception. Maybe there was some inconsistency in CVS during the delay between the developer and anonymous CVS. On the other hand, maybe the error was caused by the changed parameters (see above) of this operator. Could you please try again in a few hours and check whether this still happens?

    Thanks and cheers,
    Ingo
  • steffen Member Posts: 347 Maven
    Hello

    ;D Yeah Release Time  ;D

    Well, with RapidMiner 4.2 it is not possible to add aggregation_attributes as a parameter for ParameterIteration, because it is a list of parameters. But this is OK.
    It would be nice, though, to remove from the ParameterIteration configuration dialog all parameters that are not available for ParameterIteration (or to mark them as such), just to avoid confusion.

    greetings

    Steffen
  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    good idea. I will add it to our Todo list.

    Cheers,
    Ingo
  • TheBear Member Posts: 18 Maven
    Hi,

    I am not quite sure whether I just didn't understand the function of the Aggregation operator or whether
    my description was unclear. (Sorry, I am not a native speaker...)

    What I want to do is generate a new attribute (Average). RapidMiner should
    compute the values of that attribute by calculating the mean of O1 to O3.
    Label          O1   O2   O3   Average
    Instance 1      1    2    3         2
    Instance 2      1    3    2         2
    Instance 3      1    4    7         4
    Aggregation     1    3    4
    In my opinion, Aggregation averages over one attribute (down a column), not across the attributes of each instance.
    Please correct me if I am wrong.
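To make the distinction in the table concrete, here is a minimal Python sketch (not RapidMiner code) that computes both kinds of mean on the toy values above. The names `rows`, `row_means` and `column_means` are made up for this illustration.

```python
# Toy data copied from the table above: three instances, attributes O1..O3.
rows = {
    "Instance 1": [1, 2, 3],
    "Instance 2": [1, 3, 2],
    "Instance 3": [1, 4, 7],
}


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Row-wise: one new "Average" value per instance (what the question asks for).
row_means = {name: mean(values) for name, values in rows.items()}

# Column-wise: one value per attribute over all instances
# (the "Aggregation" row in the table).
column_means = [mean(column) for column in zip(*rows.values())]

print(row_means)
print(column_means)
```

The row-wise result matches the "Average" column of the table and the column-wise result matches its "Aggregation" row.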
  • TobiasMalbrecht Moderator, Employee-RapidMiner, Member Posts: 295 RM Product Management
    Hi,

    You are right. Aggregation aggregates the values of one attribute over all instances, not across the attributes of a single instance. Hence, a normal aggregation is not suitable for your needs, at least not without a complicated process structure. As far as I know, there is no operator that lets you directly average the values of several attributes. Nevertheless, you can use the "FeatureGeneration" operator and write the expression by hand. Suppose you want to average the three attributes att1, att2 and att3. Then the corresponding XML code for averaging is

        <operator name="FeatureGeneration" class="FeatureGeneration">
            <list key="functions">
              <parameter key="average" value="/(+(att1,+(att2,att3)),const[3]())"/>
            </list>
            <parameter key="keep_all" value="true"/>
        </operator>
    I think we are already planning a more sophisticated and easier-to-use feature generation. Maybe we will even be able to make it part of the next release.
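For longer attribute lists, writing the nested prefix expression by hand gets tedious. The following is a small Python sketch that builds the string for the "value" field. It only generalises the three-attribute pattern shown in the XML above; whether FeatureGeneration accepts this nesting for arbitrarily many attributes is an assumption, and the helper names `sum_expression` and `average_expression` are hypothetical.

```python
def sum_expression(attributes):
    """Nest binary '+' calls from right to left, e.g. +(a,+(b,c))."""
    if not attributes:
        raise ValueError("need at least one attribute name")
    expression = attributes[-1]
    for name in reversed(attributes[:-1]):
        expression = f"+({name},{expression})"
    return expression


def average_expression(attributes):
    """Divide the nested sum by the attribute count, following the pattern above."""
    return f"/({sum_expression(attributes)},const[{len(attributes)}]())"


# Reproduces the expression from the XML example for att1..att3.
print(average_expression(["att1", "att2", "att3"]))
```

The printed string can be pasted into the "value" attribute of the parameter in the "functions" list.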

    Regards,
    Tobias
  • TheBear Member Posts: 18 Maven
    Thanks Tobias.
    All right, I'll wait for the next release :).

    I already used FeatureGeneration, but to be honest it is a bit of a pain to get the syntax right (especially for the deviation with ten or more attributes to be condensed).
    I have up to several hundred attributes and I need the average and deviation of certain groups of these attributes.
    (Not a big deal, I'll precalculate these values in my spreadsheet.)
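As an alternative to a spreadsheet, the precalculation could also be scripted before loading the data. This is only a rough Python sketch under assumptions: the group mapping, the column names and the `_mean` / `_dev` suffixes are invented for illustration, and "deviation" is taken to mean the sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical attribute groups: new attribute prefix -> source columns.
groups = {"O": ["O1", "O2", "O3"]}

# Toy rows reusing the O1..O3 values from earlier in the thread.
table = [
    {"O1": 1, "O2": 2, "O3": 3},
    {"O1": 1, "O2": 3, "O3": 2},
    {"O1": 1, "O2": 4, "O3": 7},
]


def condense(row, groups):
    """Return a copy of row with <prefix>_mean and <prefix>_dev added per group."""
    result = dict(row)
    for prefix, columns in groups.items():
        values = [row[column] for column in columns]
        result[f"{prefix}_mean"] = mean(values)
        # stdev is the sample standard deviation (n - 1 in the denominator);
        # statistics.pstdev gives the population form instead.
        result[f"{prefix}_dev"] = stdev(values)
    return result


condensed = [condense(row, groups) for row in table]
for row in condensed:
    print(row)
```

With several hundred attributes, only the `groups` mapping needs to grow; the per-row logic stays the same.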

    Keep up the good work!
