"Correlation Matrix When to use Squared Correlation"

mobmob Member Posts: 37 Contributor II
edited June 2019 in Help
While researching a project involving polynominal datasets, I forgot to check whether RapidMiner had an operator to help, so I'm a bit confused by the Correlation Matrix operator and when to use the "squared correlation".

Is the squared correlation the same as a chi-squared calculation, and if so, is the Correlation Matrix similar to "weight by chi-square" but without the need to have a class label defined?

The tutorial example for the Correlation Matrix appears to show that it's suitable for use with the default parameters on non-numeric data, but other tools like R seem to prefer numeric-only datasets, so I'm a bit confused about how to handle non-numeric datasets in RM when I need to see the correlation.

Any pointers to help clear the fog?

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    As far as I know, squared correlation is equivalent to R² in Excel.

    Does this help?

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
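
To make the R² comparison concrete, here is a minimal Python sketch (not RapidMiner code) showing that the square of the Pearson correlation between two numeric columns equals the R² of a least-squares line fitted to them. The data values are made up for illustration, and the sketch assumes RapidMiner's "squared correlation" is simply the squared Pearson coefficient, which the thread suggests but does not confirm.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def r_squared_of_fit(xs, ys):
    """R² of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical numeric data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
r2 = r_squared_of_fit(xs, ys)
print(r ** 2, r2)  # the two values agree up to floating-point error
```

Both functions need numeric input, which is why this measure does not carry over directly to nominal attributes.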
  • mobmob Member Posts: 37 Contributor II
    Hi Martin,

    Thanks for helping. If you are talking about the RSQ() function in Excel, which "can be interpreted as the proportion of the variance in y attributable to the variance in x" according to the Excel help docs, then that function isn't suitable for non-numeric data.

    Is RM able to process non-numeric data to see if attributes are related, or do I need to convert them? If so, how do I do that without losing the essence of the relationships between categorical attributes?
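
For comparison with the squared correlation, a chi-squared statistic is computed from a contingency table of category counts rather than from numeric values, so it applies to nominal attributes directly. The following is a minimal Python sketch of the plain Pearson chi-squared statistic for two nominal columns. The column names and values are hypothetical, and the result is not necessarily what RapidMiner's chi-square weighting operator reports, since that operator may normalize its output.

```python
from collections import Counter

def chi_square(a, b):
    """Pearson chi-squared statistic for two equal-length nominal columns."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    observed = Counter(zip(a, b))  # missing pairs count as 0
    stat = 0.0
    for value_a, n_a in count_a.items():
        for value_b, n_b in count_b.items():
            expected = n_a * n_b / n  # count expected under independence
            stat += (observed[(value_a, value_b)] - expected) ** 2 / expected
    return stat

# Hypothetical columns: colour and size move together perfectly.
colour = ["red", "red", "blue", "blue", "red", "blue"]
size = ["S", "S", "L", "L", "S", "L"]

# Hypothetical columns with no association at all.
left = ["red", "red", "blue", "blue"]
right = ["S", "L", "S", "L"]

print(chi_square(colour, size))
print(chi_square(left, right))
```

The statistic is symmetric in its two arguments, so neither column has to be treated as a class label for the calculation itself.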
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    I think what you want is not possible with a single operator; you need to use a loop here. Attached is a process that calculates such a matrix (as a list) using the Gini Index. You can use any other Weight by operator if you want to. Comments are inside the process.

    ~Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="generate_data" compatibility="6.4.000" expanded="true" height="60" name="Generate Data" width="90" x="45" y="165">
           <parameter key="target_function" value="non linear"/>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="6.4.000" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="165">
           <parameter key="attribute_filter_type" value="single"/>
           <parameter key="attribute" value="label"/>
           <parameter key="invert_selection" value="true"/>
           <parameter key="include_special_attributes" value="true"/>
         </operator>
         <operator activated="true" class="discretize_by_bins" compatibility="6.4.000" expanded="true" height="94" name="Discretize (2)" width="90" x="313" y="165">
           <parameter key="number_of_bins" value="5"/>
           <parameter key="range_name_type" value="short"/>
         </operator>
         <operator activated="true" class="loop_attributes" compatibility="6.4.000" expanded="true" height="94" name="Loop Attributes" width="90" x="447" y="165">
           <process expanded="true">
             <operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="11" y="52"/>
             <operator activated="true" class="set_role" compatibility="6.4.000" expanded="true" height="76" name="Set Role" width="90" x="179" y="210">
               <parameter key="attribute_name" value="%{loop_attribute}"/>
               <parameter key="target_role" value="label"/>
               <list key="set_additional_roles"/>
             </operator>
             <operator activated="true" class="weight_by_gini_index" compatibility="6.4.000" expanded="true" height="76" name="Weight by Gini Index" width="90" x="313" y="210"/>
             <operator activated="true" class="weights_to_data" compatibility="6.4.000" expanded="true" height="60" name="Weights to Data" width="90" x="447" y="210"/>
             <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="210">
               <list key="function_descriptions">
                 <parameter key="Iteration" value="&quot;%{loop_attribute}&quot;"/>
               </list>
             </operator>
             <operator activated="true" class="order_attributes" compatibility="6.4.000" expanded="true" height="76" name="Reorder Attributes" width="90" x="715" y="210">
               <parameter key="attribute_ordering" value="Iteration|Attribute|Weight"/>
             </operator>
             <connect from_port="example set" to_op="Multiply" to_port="input"/>
             <connect from_op="Multiply" from_port="output 1" to_port="example set"/>
             <connect from_op="Multiply" from_port="output 2" to_op="Set Role" to_port="example set input"/>
             <connect from_op="Set Role" from_port="example set output" to_op="Weight by Gini Index" to_port="example set"/>
             <connect from_op="Weight by Gini Index" from_port="weights" to_op="Weights to Data" to_port="attribute weights"/>
             <connect from_op="Weights to Data" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
             <connect from_op="Generate Attributes" from_port="example set output" to_op="Reorder Attributes" to_port="example set input"/>
             <connect from_op="Reorder Attributes" from_port="example set output" to_port="result 1"/>
             <portSpacing port="source_example set" spacing="0"/>
             <portSpacing port="sink_example set" spacing="0"/>
             <portSpacing port="sink_result 1" spacing="0"/>
             <portSpacing port="sink_result 2" spacing="0"/>
             <description align="center" color="yellow" colored="false" height="193" resized="true" width="283" x="153" y="129">Weight by Gini Index always calcs the Index for the label. Same stuff works with information Gain etc.</description>
             <description align="center" color="yellow" colored="false" height="189" resized="true" width="412" x="441" y="133">Transform it a bit to make it easier readable</description>
           </process>
           <description align="center" color="transparent" colored="false" width="126">Loop so that each attribute is label once</description>
         </operator>
         <operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="76" name="Append" width="90" x="581" y="210"/>
         <connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="Discretize (2)" to_port="example set input"/>
         <connect from_op="Discretize (2)" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
         <connect from_op="Loop Attributes" from_port="example set" to_port="result 1"/>
         <connect from_op="Loop Attributes" from_port="result 1" to_op="Append" to_port="example set 1"/>
         <connect from_op="Append" from_port="merged set" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
         <description align="center" color="yellow" colored="false" height="249" resized="true" width="394" x="23" y="101">Get some polynomal data</description>
       </process>
     </operator>
    </process>
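
The idea behind the process above (make each attribute the label once, weight every other attribute against it, and collect the results as a list) can be sketched outside RapidMiner as follows. This is a rough Python illustration under stated assumptions: the weight used here is the reduction in Gini impurity of the label after splitting on the attribute, which may differ from the exact value or normalization that the Weight by Gini Index operator produces, and the table contents are made up.

```python
from collections import Counter, defaultdict

def gini_impurity(values):
    """Gini impurity of a list of nominal values."""
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())

def gini_weight(attribute, label):
    """Reduction in Gini impurity of `label` after splitting on `attribute`."""
    groups = defaultdict(list)
    for attribute_value, label_value in zip(attribute, label):
        groups[attribute_value].append(label_value)
    n = len(label)
    weighted = sum(len(g) / n * gini_impurity(g) for g in groups.values())
    return gini_impurity(label) - weighted

def pairwise_gini(table):
    """Return (label name, attribute name, weight) rows for every ordered pair.

    `table` maps column names to equal-length lists of nominal values.
    """
    rows = []
    for label_name, label in table.items():        # each attribute is label once
        for attr_name, attr in table.items():
            if attr_name != label_name:
                rows.append((label_name, attr_name, gini_weight(attr, label)))
    return rows

# Hypothetical nominal table: att2 mirrors att1, att3 is unrelated to att1.
table = {
    "att1": ["a", "a", "b", "b"],
    "att2": ["x", "x", "y", "y"],
    "att3": ["p", "q", "p", "q"],
}

for row in pairwise_gini(table):
    print(row)
```

The three-element rows correspond to the Iteration, Attribute and Weight columns that the process builds with Generate Attributes and Reorder Attributes.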
  • mobmob Member Posts: 37 Contributor II
    Is there a reason why you discretized the dataset before looping?
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Just to get nominal values, because you asked for them.

    So: not really.
  • mobmob Member Posts: 37 Contributor II
    Thanks for that, I appreciate the help. Are there other ways to accomplish the same calculation? I have a fairly large dataset with a large number of columns.
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    None that I know of.

    How do you want to use it? Of course, you can delete a column within an iteration so that it is not tested again in the next iteration. That might make everything faster.

    Edit: If you want to use it for feature selection, have a look at this extension: http://sourceforge.net/projects/rm-featselext/
    The MRMR operator there might be useful. Sadly, it is not on the RM Marketplace.
  • mobmob Member Posts: 37 Contributor II
    I'm really looking to get a general sense of the dataset before any data mining starts and, to be honest, have driven myself crazy trying to do things in R with dummy variables for categorical values to glean some relationships between polynominal categorical data. What I wouldn't give for a numeric dataset and a Pearson correlation  :)
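
For readers unfamiliar with the dummy-variable route mentioned here, this is a minimal Python sketch of it: each nominal column is expanded into one 0/1 column per category, and Pearson correlation is then computed between every pair of dummy columns. The column names and values are hypothetical. Two attributes with k and m categories already yield k × m coefficients, which hints at why the approach becomes hard to read on wide datasets.

```python
from math import sqrt

def dummy_code(values):
    """Expand a nominal column into one 0/1 column per distinct category."""
    return {category: [1.0 if v == category else 0.0 for v in values]
            for category in sorted(set(values))}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical nominal columns, for illustration only.
colour = ["red", "blue", "green", "red", "blue", "green"]
size = ["S", "L", "L", "S", "L", "L"]

colour_dummies = dummy_code(colour)  # 3 dummy columns
size_dummies = dummy_code(size)      # 2 dummy columns

# One coefficient per pair of dummy columns: 3 x 2 = 6 values for two attributes.
for colour_name, colour_col in colour_dummies.items():
    for size_name, size_col in size_dummies.items():
        r = pearson_r(colour_col, size_col)
        print(f"colour={colour_name} vs size={size_name}: r = {r:+.2f}")
```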
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Well, Gini Index and Information Gain (which is based on entropy) are quite good for polynominal values.

    The other option would be to use Nominal to Numerical with dummy coding. But I think Pearson correlation is "wrong" for a binominal (numerical) attribute.

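
As a rough illustration of the entropy-based measure mentioned above, information gain can be computed for nominal columns as the entropy of the label minus the entropy that remains after splitting on the attribute. The Python sketch below uses made-up column values and is not a claim about how RapidMiner's Weight by Information Gain operator scales its output.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((count / n) * log2(count / n)
                for count in Counter(values).values())

def information_gain(attribute, label):
    """Entropy of `label` minus its entropy after splitting on `attribute`."""
    groups = defaultdict(list)
    for attribute_value, label_value in zip(attribute, label):
        groups[attribute_value].append(label_value)
    n = len(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(label) - conditional

# Hypothetical columns: `informative` determines the label, `noise` does not.
label = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]
noise = ["p", "q", "p", "q"]

print(information_gain(informative, label))
print(information_gain(noise, label))
```

No numeric encoding of the categories is needed, which is what makes this family of measures convenient for polynominal attributes.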
  • mobmob Member Posts: 37 Contributor II
    Thanks Martin,

    I went crazy trying the dummy variables route. I'll check out the operators you suggest