[Solved] Average mutual information / correlation matrix on massive data set
Dear community,
There is a massive data set with a couple of thousand regular attributes and a single label. The primary goal is to get a table with two columns showing 1) the attribute names and 2) each attribute's average mutual information with the label.
As there are so many attributes, computing the full average mutual information matrix is slow and memory-consuming. So I thought I would work on a subset: calculate label vs. att1, then label vs. att2, and so on, looping through all attributes.
However, I didn't manage to combine each iteration's result into a single table. Recall and Remember don't seem to work here, as the initial Recall is empty.
The secondary goal is to select the five attributes with the highest average mutual information from the initial massive data set.
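For illustration outside RapidMiner, the per-attribute computation described above can be sketched in plain Python. This is a minimal sketch, not the RapidMiner operator: `binned_mi` (histogram plug-in estimate of mutual information) and the toy attributes are my own names.

```python
import math
import random

def binned_mi(x, y, bins=10):
    """Estimate mutual information I(X;Y) in nats by discretizing both
    variables into equal-width bins and applying the plug-in formula
    I = sum p(i,j) * log(p(i,j) / (p(i) * p(j)))."""
    def to_bins(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0
        return [min(int((t - lo) / width), bins - 1) for t in v]
    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint, px, py = {}, [0.0] * bins, [0.0] * bins
    for i, j in zip(bx, by):
        joint[(i, j)] = joint.get((i, j), 0.0) + 1.0 / n
        px[i] += 1.0 / n
        py[j] += 1.0 / n
    return sum(p * math.log(p / (px[i] * py[j])) for (i, j), p in joint.items())

random.seed(0)
n = 500
label = [random.gauss(0, 1) for _ in range(n)]
data = {
    "att1": [v + random.gauss(0, 0.1) for v in label],   # strongly related
    "att2": [v * v for v in label],                      # non-linear relation
    "att3": [random.gauss(0, 1) for _ in range(n)],      # independent noise
}
# One score per attribute vs. the label -- no full matrix needed.
scores = sorted(((name, binned_mi(col, label)) for name, col in data.items()),
                key=lambda t: -t[1])
for name, mi in scores:
    print(f"{name}\t{mi:.3f}")
```

Taking the first five entries of the sorted score list would cover the secondary goal.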
PS: I have the Converters extension installed in order to convert the matrix to an example set.
PPS: The matrix operators don't seem to be able to handle special attributes; that's why I used Set Role to make the label regular.
Looking forward to any advice...
Cheers
Sachs
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.5.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
<parameter key="number_of_attributes" value="5000"/>
</operator>
<operator activated="true" class="concurrency:loop_attributes" compatibility="7.5.000" expanded="true" height="103" name="Loop Attributes" width="90" x="179" y="34">
<parameter key="regular_expression" value="%{loop_attribute}|label"/>
<process expanded="true">
<operator activated="true" class="work_on_subset" compatibility="7.5.000" expanded="true" height="103" name="Work on Subset" width="90" x="45" y="34">
<parameter key="attribute_filter_type" value="regular_expression"/>
<parameter key="regular_expression" value="%{loop_attribute}|label"/>
<parameter key="include_special_attributes" value="true"/>
<process expanded="true">
<operator activated="true" class="set_role" compatibility="7.5.000" expanded="true" height="82" name="Set Role" width="90" x="45" y="34">
<parameter key="attribute_name" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="mututal_information_matrix" compatibility="7.5.000" expanded="true" height="82" name="Mutual Information Matrix" width="90" x="179" y="34"/>
<operator activated="true" class="converters:matrix_2_example_set" compatibility="0.2.000" expanded="true" height="82" name="Matrix to ExampleSet" width="90" x="313" y="85"/>
<operator activated="true" class="recall" compatibility="7.5.000" expanded="true" height="68" name="Recall" width="90" x="313" y="187">
<parameter key="name" value="temp"/>
</operator>
<operator activated="true" class="append" compatibility="7.5.000" expanded="true" height="103" name="Append" width="90" x="447" y="136"/>
<operator activated="true" class="remember" compatibility="7.5.000" expanded="true" height="68" name="Remember" width="90" x="581" y="136">
<parameter key="name" value="temp"/>
</operator>
<connect from_port="exampleSet" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Mutual Information Matrix" to_port="example set"/>
<connect from_op="Mutual Information Matrix" from_port="example set" to_port="example set"/>
<connect from_op="Mutual Information Matrix" from_port="matrix" to_op="Matrix to ExampleSet" to_port="matrix"/>
<connect from_op="Matrix to ExampleSet" from_port="example set" to_op="Append" to_port="example set 1"/>
<connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Remember" to_port="store"/>
<connect from_op="Remember" from_port="stored" to_port="through 1"/>
<portSpacing port="source_exampleSet" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<portSpacing port="sink_through 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Work on Subset" to_port="example set"/>
<connect from_op="Work on Subset" from_port="example set" to_port="output 1"/>
<connect from_op="Work on Subset" from_port="through 1" to_port="output 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Loop Attributes" to_port="input 1"/>
<connect from_op="Loop Attributes" from_port="output 1" to_port="result 1"/>
<connect from_op="Loop Attributes" from_port="output 2" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Best Answer
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Dear Sachs,
Mutual information bins the data internally anyway, so I would recommend using Weight by Information Gain on a discretized label.
~Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
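Martin's point can be checked numerically: once the label is discretized, information gain and mutual information are the same quantity, since IG(Y;X) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = I(X;Y). A small Python sketch (function names are my own, not RapidMiner operators):

```python
import math
import random
from collections import Counter

def entropy(values):
    """Shannon entropy in nats of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def information_gain(att, label):
    """IG(label; att) = H(label) - H(label | att) for discrete samples."""
    n = len(att)
    cond = 0.0
    for a in set(att):
        subset = [l for x, l in zip(att, label) if x == a]
        cond += len(subset) / n * entropy(subset)
    return entropy(label) - cond

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

random.seed(1)
att = [random.randrange(5) for _ in range(1000)]
disc_label = [(a + random.randrange(3)) % 5 for a in att]  # a discretized label

ig = information_gain(att, disc_label)
mi = mutual_information(att, disc_label)
print(ig, mi)  # identical up to floating-point noise
```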
Answers
Hi,
Isn't Weight by Correlation / Weight by Information Gain what you want?
Best,
Martin
Dear Martin,
The Weight by operators are basically what I want: I go through Work on Subset, assign weights, and finally select the five attributes with the highest weights. The thing is that there is no "Weight by Mutual Information" operator, and the Mutual Information Matrix has no weight output...
Maybe a chain like Mutual Information Matrix -> Matrix to ExampleSet -> Data to Weights? But how do I proceed from there? Are the weights stored internally so that I can apply Select by Weights after the loop? And for some reason the weight coming out of Data to Weights is always 1. Please advise...
Best regards
Sachs
Check out the "Weight by Maximum Relevance" operator which is part of the free Feature Selection Extension. It outputs either attribute weights based on correlation (for numerical labels) or mutual information (for nominal labels). It also has several other operators that you may find useful for dealing with such a large set of attributes.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hey,
any reason why you need weighting by mutual information and information gain is not fine? Otherwise, the fastest way might be to quickly build something like this with Groovy.
~Martin
Hi Brian, hi Martin,
Thank you very much for taking the time to have a look into my issue!
Probably my knowledge of this matter is not deep enough, but I cannot use Weight by Relevance or Weight by Information Gain, as I have a label of type real, not nominal. What I generally want is a measure of non-linear correlation, so I thought mutual information was a good way to go, and it works with my real label.
Meanwhile I have made some progress realizing my approach: the process can now determine the n attributes with the highest mutual information. However, the whole process looks pretty complicated and clumsy.
I would highly appreciate your advice on
- whether my approach is generally the right one for detecting non-linear correlation;
- how to tweak the latest version of my process.
Kind regards
Sachs
Dear Sachs,
it's always tricky to judge dependencies. There are several measures around, but no clear argument for which is best. I know that we used a combination of all of them for a science project; I could ask for the process if you like.
For nominal attributes I usually go for the Gini index or the information gain ratio. But mutual information is very close to information gain (a.k.a. entropy) anyway, so I would recommend going with information gain.
For numerical attributes I have used rank correlation a few times, but I am not sure whether we have this as a Weight by operator in RapidMiner. If not, it needs to go on our list to build.
Best,
Martin
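For reference, the two measures Martin mentions for nominal attributes can be written down in a few lines. This is an illustrative Python sketch, not RapidMiner internals; a C4.5-style gain ratio (information gain divided by the attribute's split information) is assumed.

```python
import math
from collections import Counter

def gini(values):
    """Gini impurity 1 - sum(p_i^2) of a discrete sample."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

def entropy(values):
    """Shannon entropy in bits of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(values).values())

def gain_ratio(att, label):
    """Information gain of `att` w.r.t. `label`, normalized by the
    attribute's own entropy (split information), as in C4.5."""
    n = len(att)
    cond = sum(len(s) / n * entropy(s)
               for a in set(att)
               for s in [[l for x, l in zip(att, label) if x == a]])
    ig = entropy(label) - cond
    return ig / entropy(att) if entropy(att) else 0.0

att   = ["a", "a", "b", "b", "c", "c"]
label = ["+", "+", "-", "-", "+", "-"]
print(gini(label))             # 0.5 for a balanced binary label
print(gain_ratio(att, label))
```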
Hi Martin,
Yes, I would highly appreciate if you would share the process of your science project.
From your feedback I understand that mutual information is close to information gain. However, in RapidMiner the mutual information operator can handle numerical labels while information gain can't. Hence it might not be a good idea to stay with mutual information for numerical labels: though the process works syntactically, the mutual information algorithm may not be intended for numerical data.
So what is your recommendation, given that my source data consists of numerical time series? Should I rather
- stay with my clumsy process built around the mutual information matrix, or
- convert my numerical series to nominal values?
Rank correlation doesn't seem to exist in RapidMiner; it would be a great feature. Additionally, it would come in handy if the matrix operators offered a way to calculate only the combinations label <-> all other attributes (a single column) instead of all combinations of all attributes <-> all attributes (a whole matrix).
Best regards
Sachs
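Spearman rank correlation is also straightforward to compute directly if no extension is at hand, since it is just Pearson correlation applied to rank vectors. A minimal sketch (my own helper functions; ties handled via average ranks):

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spearman(x, [v ** 3 for v in x]))  # 1.0: monotone non-linear relation
print(spearman(x, [-v for v in x]))      # -1.0: perfectly anti-monotone
```

Because it only looks at ranks, Spearman picks up any monotone non-linear dependence, which is part of what the thread is after.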
Actually, rank correlation (Spearman) is available in the Statistics extension, which can be downloaded from the Marketplace and licensed from Old World Computing @land. You may find it helpful for your process.
Dear Martin & Brian,
Thank you very much for guiding me in the right direction. Using Weight by Information Gain on a discretized label finally brought success and happiness. It's amazing how only three operators can replace my former complicated process, and it provides the same results. Moreover, it is x times faster!
Cheers
Sachs