
[SOLVED] Binomial test on examples whose attributes specify the parameters

tennenrishin Member Posts: 177 Contributor II
edited November 2018 in Help
Pardon me if this is a silly question.

Suppose I have an ExampleSet with two attributes, e.g.:
TC   SC
005  3
010  7
150  83
etc...

...where TC is the (Bernoulli) "Trial Count" and SC is the "Success Count", and the probability of success at each trial is 0.5. I would like to perform the binomial test on each example. In other words, I would like to generate a new attribute SS (Statistical Significance) indicating (for each example) the probability that TC trials will result in at least SC successes. How should I approach this?
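
Written out, with X denoting the number of successes in TC trials (X is a symbol introduced here for illustration, not part of the original post), the quantity being asked for is the upper tail of the binomial distribution:

```latex
\mathrm{SS} = P(X \ge \mathrm{SC})
            = \sum_{k=\mathrm{SC}}^{\mathrm{TC}} \binom{\mathrm{TC}}{k}\, p^{k} (1-p)^{\mathrm{TC}-k},
\qquad p = 0.5
```

For the first row (TC = 5, SC = 3) this gives (10 + 5 + 1)/32 = 0.5.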

I can't see how I could construct the cumulative binomial distribution from the functions available in the Generate Attributes operator, except perhaps if TC is small, and using loops. I'm going to look into how much Hoeffding's inequality and Chernoff's inequality can help, but am I overlooking any simpler way of doing this? Perhaps some statistical tests already implemented in one of the RM extensions?
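
On the Hoeffding idea, here is a minimal sketch in Python (outside RapidMiner) that compares the exact upper tail with the Hoeffding bound exp(-2t²/TC), where t = SC - TC·p, for the three rows above. The function names are illustrative assumptions, not part of any RapidMiner or R API.

```python
from math import comb, exp

def exact_upper_tail(sc, tc, p=0.5):
    """P(X >= sc) for X ~ Binomial(tc, p), by direct summation."""
    start = max(sc, 0)  # the tail over k < 0 contributes nothing
    return sum(comb(tc, k) * p**k * (1 - p)**(tc - k) for k in range(start, tc + 1))

def hoeffding_bound(sc, tc, p=0.5):
    """Hoeffding upper bound on P(X >= sc); only informative when sc > tc * p."""
    t = sc - tc * p
    if t <= 0:
        return 1.0  # at or below the mean the bound is trivial
    return exp(-2 * t * t / tc)

# (TC, SC) pairs taken from the example table in the question
rows = [(5, 3), (10, 7), (150, 83)]
for tc, sc in rows:
    print(tc, sc, exact_upper_tail(sc, tc), hoeffding_bound(sc, tc))
```

For these rows the bound holds but is loose (for TC = 5, SC = 3 the exact value is 0.5 while the bound is about 0.905), so it is unlikely to replace an exact test here.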

Thanks in advance.

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    It's very likely R will have something (I haven't checked ;)). If so, it's not too hard to call an R script from an RM process.

    Alternatively, does a Java library exist to do this calculation? If so, you could use a Groovy script.


    regards

    Andrew
  • tennenrishin Member Posts: 177 Contributor II
    Thanks Andrew! I've installed R and set up the R extension.

    I've never used R before. I'm willing to learn, but do I need to take on the full learning curve at this point, or is it possible to give me some pointers to get me up and running quickly on this particular application?

    Best,
    Isak
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Here's an example of an R script being called:

    http://rapidminernotes.blogspot.co.uk/2011/06/counting-clusters-part-r.html

    You'll have to do a bit of Googling to find the right R library for your specific requirement.

    regards

    Andrew
  • tennenrishin Member Posts: 177 Contributor II
    Simpler than I thought!

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="431" width="681">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="30">
            <list key="attribute_values">
              <parameter key="TC" value="5"/>
              <parameter key="SC" value="3"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="120">
            <list key="attribute_values">
              <parameter key="TC" value="10"/>
              <parameter key="SC" value="7"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (3)" width="90" x="45" y="210">
            <list key="attribute_values">
              <parameter key="TC" value="150"/>
              <parameter key="SC" value="83"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="112" name="Append" width="90" x="246" y="120"/>
          <operator activated="true" class="r:execute_script_r" compatibility="5.2.000" expanded="true" height="76" name="Execute Script (R)" width="90" x="447" y="120">
            <parameter key="script" value="attach(inp)&#10;# pbinom(q, ..., lower.tail=FALSE) is P(X &gt; q), so use SC-1 to include SC itself&#10;SS &lt;- pbinom(SC-1,TC,0.5,lower.tail=FALSE)&#10;out &lt;- data.frame(SC=SC,TC=TC,SS=SS)&#10;detach(inp)"/>
            <enumeration key="inputs">
              <parameter key="name_of_variable" value="inp"/>
            </enumeration>
            <list key="results">
              <parameter key="out" value="Data Table"/>
            </list>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
          <connect from_op="Append" from_port="merged set" to_op="Execute Script (R)" to_port="input 1"/>
          <connect from_op="Execute Script (R)" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    This R extension really opens up a new dimension on the span of problems that RM can tackle. Thanks for your help, Andrew.
  • tennenrishin Member Posts: 177 Contributor II
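    For anyone who references this in future, the R script for generating the attribute can actually be a one-liner:

    data$SS <- pbinom(data$SC-1,data$TC,0.5,lower.tail=FALSE)
    ... if input and output are both named 'data'. Note the SC-1: with lower.tail=FALSE, pbinom(q, ...) returns P(X > q), so passing q = SC-1 gives the probability of at least SC successes.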
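
As a cross-check of that off-by-one, here is a small Python sketch (not R, and not part of the process above) that mimics the upper tail of pbinom for integer q. The helper name `pbinom_upper` and the list-of-dicts layout are assumptions made for illustration only.

```python
from math import comb

def pbinom_upper(q, size, prob=0.5):
    """P(X > q) for X ~ Binomial(size, prob), i.e. the strict upper tail for integer q."""
    start = max(q + 1, 0)  # first outcome strictly greater than q
    return sum(comb(size, k) * prob**k * (1 - prob)**(size - k)
               for k in range(start, size + 1))

# The three example rows from the question
data = [{"TC": 5, "SC": 3}, {"TC": 10, "SC": 7}, {"TC": 150, "SC": 83}]
for row in data:
    row["SS"] = pbinom_upper(row["SC"] - 1, row["TC"])    # P(X >= SC), what was asked for
    row["strict"] = pbinom_upper(row["SC"], row["TC"])    # P(X > SC), without the -1
    print(row)
```

For TC = 5, SC = 3 the two values are 0.5 and 0.1875, so the difference is far from negligible at small trial counts.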