[Solved] Average mutual information / correlation matrix on massive data set
Dear community,
There is a massive data set with a couple of thousand regular attributes and a single label. The primary goal is to get a table with two columns showing 1) the attribute names and 2) each attribute's average mutual information with the label.
As there are so many attributes, computing the full average mutual information matrix is slow and memory-consuming. So I thought I would work on a subset: calculate label vs. att1, then label vs. att2, and so on, looping through all attributes.
However, I didn't manage to combine each iteration's result into a single table. Recall and Remember don't seem to work here, as the initial Recall is empty.
The secondary goal is to select the five attributes with the highest average mutual information from the initial massive data set.
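For illustration outside RapidMiner, the per-attribute computation described above can be sketched in plain Python. This is a minimal sketch, not the RapidMiner operator: `binned_mi` (histogram plug-in estimate of mutual information) and the toy attributes are my own names.

```python
import math
import random

def binned_mi(x, y, bins=10):
    """Estimate mutual information I(X;Y) in nats by discretizing both
    variables into equal-width bins and applying the plug-in formula
    I = sum p(i,j) * log(p(i,j) / (p(i) * p(j)))."""
    def to_bins(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0
        return [min(int((t - lo) / width), bins - 1) for t in v]
    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint, px, py = {}, [0.0] * bins, [0.0] * bins
    for i, j in zip(bx, by):
        joint[(i, j)] = joint.get((i, j), 0.0) + 1.0 / n
        px[i] += 1.0 / n
        py[j] += 1.0 / n
    return sum(p * math.log(p / (px[i] * py[j])) for (i, j), p in joint.items())

random.seed(0)
n = 500
label = [random.gauss(0, 1) for _ in range(n)]
data = {
    "att1": [v + random.gauss(0, 0.1) for v in label],   # strongly related
    "att2": [v * v for v in label],                      # non-linear relation
    "att3": [random.gauss(0, 1) for _ in range(n)],      # independent noise
}
# One score per attribute vs. the label -- no full matrix needed.
scores = sorted(((name, binned_mi(col, label)) for name, col in data.items()),
                key=lambda t: -t[1])
for name, mi in scores:
    print(f"{name}\t{mi:.3f}")
```

Taking the first five entries of the sorted score list would cover the secondary goal.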
PS: I have the Converters extension installed in order to convert the matrix to an example set.
PPS: The matrix operators don't seem to be able to handle special attributes; that's why I used Set Role to make the label regular.
Looking forward to any advice...
Cheers
Sachs
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.5.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
<parameter key="number_of_attributes" value="5000"/>
</operator>
<operator activated="true" class="concurrency:loop_attributes" compatibility="7.5.000" expanded="true" height="103" name="Loop Attributes" width="90" x="179" y="34">
<parameter key="regular_expression" value="%{loop_attribute}|label"/>
<process expanded="true">
<operator activated="true" class="work_on_subset" compatibility="7.5.000" expanded="true" height="103" name="Work on Subset" width="90" x="45" y="34">
<parameter key="attribute_filter_type" value="regular_expression"/>
<parameter key="regular_expression" value="%{loop_attribute}|label"/>
<parameter key="include_special_attributes" value="true"/>
<process expanded="true">
<operator activated="true" class="set_role" compatibility="7.5.000" expanded="true" height="82" name="Set Role" width="90" x="45" y="34">
<parameter key="attribute_name" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="mututal_information_matrix" compatibility="7.5.000" expanded="true" height="82" name="Mutual Information Matrix" width="90" x="179" y="34"/>
<operator activated="true" class="converters:matrix_2_example_set" compatibility="0.2.000" expanded="true" height="82" name="Matrix to ExampleSet" width="90" x="313" y="85"/>
<operator activated="true" class="recall" compatibility="7.5.000" expanded="true" height="68" name="Recall" width="90" x="313" y="187">
<parameter key="name" value="temp"/>
</operator>
<operator activated="true" class="append" compatibility="7.5.000" expanded="true" height="103" name="Append" width="90" x="447" y="136"/>
<operator activated="true" class="remember" compatibility="7.5.000" expanded="true" height="68" name="Remember" width="90" x="581" y="136">
<parameter key="name" value="temp"/>
</operator>
<connect from_port="exampleSet" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Mutual Information Matrix" to_port="example set"/>
<connect from_op="Mutual Information Matrix" from_port="example set" to_port="example set"/>
<connect from_op="Mutual Information Matrix" from_port="matrix" to_op="Matrix to ExampleSet" to_port="matrix"/>
<connect from_op="Matrix to ExampleSet" from_port="example set" to_op="Append" to_port="example set 1"/>
<connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Remember" to_port="store"/>
<connect from_op="Remember" from_port="stored" to_port="through 1"/>
<portSpacing port="source_exampleSet" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<portSpacing port="sink_through 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Work on Subset" to_port="example set"/>
<connect from_op="Work on Subset" from_port="example set" to_port="output 1"/>
<connect from_op="Work on Subset" from_port="through 1" to_port="output 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Loop Attributes" to_port="input 1"/>
<connect from_op="Loop Attributes" from_port="output 1" to_port="result 1"/>
<connect from_op="Loop Attributes" from_port="output 2" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Best Answer
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Dear Sachs,
Mutual information bins the data internally anyway, so I would recommend using Weight by Information Gain on a discretized label.
~Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
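Martin's point can be checked numerically: once the label is discretized, information gain and mutual information are the same quantity, since IG(Y;X) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = I(X;Y). A small Python sketch (function names are my own, not RapidMiner operators):

```python
import math
import random
from collections import Counter

def entropy(values):
    """Shannon entropy in nats of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def information_gain(att, label):
    """IG(label; att) = H(label) - H(label | att) for discrete samples."""
    n = len(att)
    cond = 0.0
    for a in set(att):
        subset = [l for x, l in zip(att, label) if x == a]
        cond += len(subset) / n * entropy(subset)
    return entropy(label) - cond

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

random.seed(1)
att = [random.randrange(5) for _ in range(1000)]
disc_label = [(a + random.randrange(3)) % 5 for a in att]  # a discretized label

ig = information_gain(att, disc_label)
mi = mutual_information(att, disc_label)
print(ig, mi)  # identical up to floating-point noise
```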
Answers
Hi,
Isn't Weight by Correlation / Weight by Information Gain what you want?
Best,
Martin
Dear Martin,
The Weight by operators are basically what I want: I go through Work on Subset, assign weights, and finally select the five attributes with the highest weights. The thing is that there is no "Weight by Mutual Information" operator, and the Mutual Information Matrix has no weight output...
Maybe a chain like Mutual Information Matrix -> Matrix to ExampleSet -> Data to Weights? But how do I proceed from there? Are the weights stored internally so that I can apply Select by Weights after the loop? And for some reason the weight coming out of Data to Weights is always 1. Please advise...
Best regards
Sachs
Check out the "Weight by Maximum Relevance" operator which is part of the free Feature Selection Extension. It outputs either attribute weights based on correlation (for numerical labels) or mutual information (for nominal labels). It also has several other operators that you may find useful for dealing with such a large set of attributes.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hey,
any reason why you need weighting by mutual information and information gain is not fine? Otherwise, the fastest way might be to quickly build something like this with Groovy.
~Martin
Hi Brian, hi Martin,
Thank you very much for taking the time to have a look into my issue!
Probably my knowledge of this matter is not deep enough, but I cannot use Weight by Relevance or Weight by Information Gain, as I have a label of type real, not nominal. What I generally want is a measure of non-linear correlation, so I thought mutual information was a good way to go, and it works with my real label.
Meanwhile I have made some progress realizing my approach: the process can now determine the n attributes with the highest mutual information. However, the whole process looks pretty complicated and clumsy.
I would highly appreciate your advice on
- whether my approach is generally the right one for detecting non-linear correlation;
- how to tweak the latest version of my process.
Kind regards
Sachs
Dear Sachs,
it's always tricky to judge dependencies. There are several measures around, but no clear argument for which is best. I know that we used a combination of all of them for a science project; I could ask for the process if you like.
For nominal attributes I usually go for the Gini index or the information gain ratio. But mutual information is very close to information gain (a.k.a. entropy) anyway, so I would recommend going with information gain.
For numerical attributes I have used rank correlation a few times, but I am not sure whether we have this as a Weight by operator in RapidMiner. If not, it needs to go on our list to build.
Best,
Martin
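For reference, the two measures Martin mentions for nominal attributes can be written down in a few lines. This is an illustrative Python sketch, not RapidMiner internals; a C4.5-style gain ratio (information gain divided by the attribute's split information) is assumed.

```python
import math
from collections import Counter

def gini(values):
    """Gini impurity 1 - sum(p_i^2) of a discrete sample."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

def entropy(values):
    """Shannon entropy in bits of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(values).values())

def gain_ratio(att, label):
    """Information gain of `att` w.r.t. `label`, normalized by the
    attribute's own entropy (split information), as in C4.5."""
    n = len(att)
    cond = sum(len(s) / n * entropy(s)
               for a in set(att)
               for s in [[l for x, l in zip(att, label) if x == a]])
    ig = entropy(label) - cond
    return ig / entropy(att) if entropy(att) else 0.0

att   = ["a", "a", "b", "b", "c", "c"]
label = ["+", "+", "-", "-", "+", "-"]
print(gini(label))             # 0.5 for a balanced binary label
print(gain_ratio(att, label))
```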
Hi Martin,
Yes, I would highly appreciate if you would share the process of your science project.
From your feedback I understand that mutual information is close to information gain. However, in RapidMiner the mutual information operator can handle numerical labels while information gain can't. Hence it might not be a good idea to stay with mutual information for numerical labels: though the process works syntactically, the mutual information algorithm may not be intended for numerical data.
So what is your recommendation, given that my source data consists of numerical time series? Should I rather
- stay with my clumsy process built around the mutual information matrix, or
- convert my numerical series to nominal values?
Rank correlation doesn't seem to exist in RapidMiner; it would be a great feature. Additionally, it would come in handy if the matrix operators offered a way to calculate only the combinations label <-> all other attributes (a single column) instead of all combinations of all attributes <-> all attributes (a whole matrix).
Best regards
Sachs
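Spearman rank correlation is also straightforward to compute directly if no extension is at hand, since it is just Pearson correlation applied to rank vectors. A minimal sketch (my own helper functions; ties handled via average ranks):

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spearman(x, [v ** 3 for v in x]))  # 1.0: monotone non-linear relation
print(spearman(x, [-v for v in x]))      # -1.0: perfectly anti-monotone
```

Because it only looks at ranks, Spearman picks up any monotone non-linear dependence, which is part of what the thread is after.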
Actually, rank correlation (Spearman) is available in the Statistics extension, which can be downloaded from the Marketplace and licensed from Old World Computing @land. You may find it helpful for your process.
Dear Martin & Brian,
Thank you very much for guiding me in the right direction. Using Weight by Information Gain on a discretized label finally brought success and happiness. It's amazing how only three operators can replace my former complicated process, and it provides the same results. Moreover, it is x times faster!
Cheers
Sachs