"how to handle missing values while calculating correlation"

venkatesh20 · January 2010

Hi Gurus,
I am working on the MovieLens data set. Consider the sample below:

userid, movieid, rating
1,100,5
1,101,2
1,102,4
2,100,5
2,102,1

I want to compute the correlation between user IDs 1 and 2 based only on the items that both users have rated, ignoring the items that only one of them has rated. For example, in the case above I want to compute the correlation using only the ratings for movie IDs 100 and 102, which users 1 and 2 have in common. Can anyone guide me on how to do this in RapidMiner?
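To make the intended computation concrete, here is a minimal sketch in plain Python (outside RapidMiner) of a Pearson correlation restricted to the items two users have both rated. The helper names `ratings_by_user`, `pearson`, and `common_item_correlation` are hypothetical and chosen only for this illustration; the rows are the sample data from the post.

```python
from math import sqrt

# Sample rows from the post: (userid, movieid, rating).
rows = [(1, 100, 5), (1, 101, 2), (1, 102, 4), (2, 100, 5), (2, 102, 1)]


def ratings_by_user(rows):
    """Group ratings into {userid: {movieid: rating}}."""
    by_user = {}
    for user, item, rating in rows:
        by_user.setdefault(user, {})[item] = rating
    return by_user


def pearson(xs, ys):
    """Pearson correlation; returns None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant rating vector has no defined correlation
    return cov / sqrt(var_x * var_y)


def common_item_correlation(by_user, a, b):
    """Correlate users a and b using only the items both have rated."""
    common = sorted(by_user[a].keys() & by_user[b].keys())
    xs = [by_user[a][item] for item in common]
    ys = [by_user[b][item] for item in common]
    return common, pearson(xs, ys)


by_user = ratings_by_user(rows)
common, r = common_item_correlation(by_user, 1, 2)
print(common, r)
```

For the sample data the common items are 100 and 102. With only two shared ratings the correlation can only be +1, -1, or undefined, so a realistic comparison needs users with more overlap.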

I tried the process below, but the data has missing values and the process does not give proper results:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="449" width="681">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="120">
<parameter key="repository_entry" value="jester/jester_sub"/>
</operator>
<operator activated="true" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="179" y="120">
<parameter key="group_attribute" value="userid"/>
<parameter key="index_attribute" value="jokeid"/>
</operator>
<operator activated="true" class="data_to_similarity" expanded="true" height="76" name="Data to Similarity" width="90" x="447" y="120">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CorrelationSimilarity"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Pivot" to_port="example set input"/>
<connect from_op="Pivot" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="126"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

land · January 2010

Hi,
I guess the easiest solution would be to replace the missing values. If you simply removed all attributes with missing values, you would lose information, because not rating a movie is itself information about a user. If you replace the missing values with -1, this might capture the real connection much better.

Greetings,
Sebastian
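As a rough illustration of the replacement approach suggested above, the following plain-Python sketch (not a RapidMiner process) pivots the sample ratings into one vector per user, fills unrated movies with -1, and then correlates the full vectors. The names `pivot_with_fill` and `pearson` are hypothetical, and -1 is simply the placeholder value proposed in the reply.

```python
from math import sqrt

# Sample rows from the original question: (userid, movieid, rating).
rows = [(1, 100, 5), (1, 101, 2), (1, 102, 4), (2, 100, 5), (2, 102, 1)]


def pivot_with_fill(rows, fill=-1):
    """Return (sorted item ids, {userid: ratings vector}) with gaps filled."""
    items = sorted({item for _, item, _ in rows})
    lookup = {(user, item): rating for user, item, rating in rows}
    users = sorted({user for user, _, _ in rows})
    vectors = {
        user: [lookup.get((user, item), fill) for item in items]
        for user in users
    }
    return items, vectors


def pearson(xs, ys):
    """Pearson correlation; returns None when it is undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


items, vectors = pivot_with_fill(rows)
r_filled = pearson(vectors[1], vectors[2])
print(items, vectors, r_filled)
```

Here user 2's vector becomes `[5, -1, 1]`, so the unrated movie 101 now contributes to the result. This answers a different question from a correlation over commonly rated items only: the fill value is treated as if it were a real rating.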
