Nearest neighbours always gives the same prediction ¿? !!

traveria · March 2009

Hello, I am getting a surprising result:

I ran a very simple test with nearest neighbours (see the XML code below), using a training dataset and a test dataset (see the short datasets below).

The strange result is that I always get the same predicted value, regardless of which test example I use.

If I use the "ExampleSetGenerator" operators instead of reading the datasets from files (activate them in the process I include below), I get a different prediction for every new test example, as expected.

Can anyone explain why I always get the same prediction when I read the data from a file?

Any hint or solution will be welcome!

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/prova_suma.aml"/>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator" activated="no">
        <parameter key="number_examples" value="10000"/>
        <parameter key="target_function" value="polynomial"/>
    </operator>
    <operator name="NearestNeighbors" class="NearestNeighbors">
    </operator>
    <operator name="ExampleSource (4)" class="ExampleSource">
        <parameter key="attributes" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/prova_suma_test.aml"/>
        <parameter key="permutate" value="true"/>
    </operator>
    <operator name="ExampleSetGenerator (2)" class="ExampleSetGenerator" activated="no">
        <parameter key="number_examples" value="10"/>
        <parameter key="target_function" value="polynomial"/>
    </operator>
    <operator name="ExampleRangeFilter" class="ExampleRangeFilter">
        <parameter key="first_example" value="2"/>
        <parameter key="last_example" value="2"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
        <parameter key="create_view" value="true"/>
        <parameter key="keep_model" value="true"/>
    </operator>
</operator>

TRAINING DATA

1 1.6 0.84 0 0.76
2 2.17 0.91 0.3 0.96
3 1.61 0.14 0.48 1
4 0.84 -0.76 0.6 1
5 0.74 -0.96 0.7 1
6 1.5 -0.28 0.78 1
7 2.5 0.66 0.85 1
8 2.89 0.99 0.9 1
9 2.37 0.41 0.95 1
10 1.46 -0.54 1 1
11 1.04 -1 1.04 1
12 1.54 -0.54 1.08 1
13 2.53 0.42 1.11 1
14 3.14 0.99 1.15 1
15 2.83 0.65 1.18 1
16 1.92 -0.29 1.2 1
17 1.27 -0.96 1.23 1
18 1.5 -0.75 1.26 1
19 2.43 0.15 1.28 1
20 3.21 0.91 1.3 1

TEST DATA

21 3.16 0.84 1.32 1
22 2.33 -0.01 1.34 1
23 1.52 -0.85 1.36 1
24 1.47 -0.91 1.38 1
25 2.27 -0.13 1.4 1
26 3.18 0.76 1.41 1
27 3.39 0.96 1.43 1
28 2.72 0.27 1.45 1
29 1.8 -0.66 1.46 1
30 1.49 -0.99 1.48 1

haddock · March 2009

Hi,

I'm not sure the result is as surprising as you think. I can replicate your problem on your own data if I simply include the left-hand column as a normal attribute, even though it looks much more like an Id attribute. If you treat it as one, your "surprising" result disappears.

So I think you should check your AML file to see how you've been handling that column.

Here's some code to illustrate the point. If you leave "1" as the value for "select_which" in the very first operator, all the predictions are the same, but they are not all the same if you set it to "2" instead. That is because the second example source marks column one as an Id column, whereas the first does not.

<operator name="Root" class="Process" expanded="yes">
    <operator name="OperatorSelector" class="OperatorSelector" expanded="yes">
        <operator name="SimpleExampleSource" class="SimpleExampleSource">
            <parameter key="filename"	value="C:\Users\CJFP\Documents\rm_workspace\prob.txt"/>
            <parameter key="label_column"	value="2"/>
        </operator>
        <operator name="SimpleExampleSource (2)" class="SimpleExampleSource">
            <parameter key="filename"	value="C:\Users\CJFP\Documents\rm_workspace\prob.txt"/>
            <parameter key="label_column"	value="2"/>
            <parameter key="id_column"	value="1"/>
        </operator>
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object"	value="ExampleSet"/>
    </operator>
    <operator name="ExampleRangeFilter" class="ExampleRangeFilter">
        <parameter key="first_example"	value="1"/>
        <parameter key="last_example"	value="20"/>
    </operator>
    <operator name="NearestNeighbors" class="NearestNeighbors">
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object"	value="ExampleSet"/>
    </operator>
    <operator name="ExampleRangeFilter (2)" class="ExampleRangeFilter">
        <parameter key="first_example"	value="21"/>
        <parameter key="last_example"	value="30"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <parameter key="keep_model"	value="true"/>
        <list key="application_parameters">
        </list>
        <parameter key="create_view"	value="true"/>
    </operator>
</operator>

I've attached the datafile with all 30 examples; you'll need to adjust the path to it in order to run the demo.

[attachment deleted by admin]
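
For readers without RapidMiner to hand, the effect described above can be reproduced outside the tool. The following is a minimal Python sketch, not RapidMiner's implementation. It assumes k = 1, plain Euclidean distance and no normalisation, and it takes the column roles from the process above (column 1 is the Id, column 2 the label). The helper name `predict_1nn` is made up for this illustration. When the Id is counted as a regular attribute, the Id differences are much larger than the differences in the other columns, so every test row (Ids 21 to 30) ends up closest to the training row with Id 20.

```python
import math

# Rows copied from the thread: id, label, then three regular attributes.
# Column roles are an assumption taken from the process above
# (label_column=2, id_column=1).
TRAIN = [
    (1, 1.6, 0.84, 0, 0.76), (2, 2.17, 0.91, 0.3, 0.96),
    (3, 1.61, 0.14, 0.48, 1), (4, 0.84, -0.76, 0.6, 1),
    (5, 0.74, -0.96, 0.7, 1), (6, 1.5, -0.28, 0.78, 1),
    (7, 2.5, 0.66, 0.85, 1), (8, 2.89, 0.99, 0.9, 1),
    (9, 2.37, 0.41, 0.95, 1), (10, 1.46, -0.54, 1, 1),
    (11, 1.04, -1, 1.04, 1), (12, 1.54, -0.54, 1.08, 1),
    (13, 2.53, 0.42, 1.11, 1), (14, 3.14, 0.99, 1.15, 1),
    (15, 2.83, 0.65, 1.18, 1), (16, 1.92, -0.29, 1.2, 1),
    (17, 1.27, -0.96, 1.23, 1), (18, 1.5, -0.75, 1.26, 1),
    (19, 2.43, 0.15, 1.28, 1), (20, 3.21, 0.91, 1.3, 1),
]
TEST = [
    (21, 3.16, 0.84, 1.32, 1), (22, 2.33, -0.01, 1.34, 1),
    (23, 1.52, -0.85, 1.36, 1), (24, 1.47, -0.91, 1.38, 1),
    (25, 2.27, -0.13, 1.4, 1), (26, 3.18, 0.76, 1.41, 1),
    (27, 3.39, 0.96, 1.43, 1), (28, 2.72, 0.27, 1.45, 1),
    (29, 1.8, -0.66, 1.46, 1), (30, 1.49, -0.99, 1.48, 1),
]


def predict_1nn(train, test, use_id_as_attribute):
    """Return, for each test row, the label of its closest training row."""

    def features(row):
        # Columns 3..5 are always attributes; the id is added only on request.
        feats = list(row[2:])
        if use_id_as_attribute:
            feats.insert(0, row[0])
        return feats

    predictions = []
    for test_row in test:
        nearest = min(
            train,
            key=lambda r: math.dist(features(r), features(test_row)),
        )
        predictions.append(nearest[1])  # column 2 is the label
    return predictions


with_id = predict_1nn(TRAIN, TEST, use_id_as_attribute=True)
without_id = predict_1nn(TRAIN, TEST, use_id_as_attribute=False)

print("id used as attribute:", with_id)
print("id set aside:        ", without_id)
```

With the Id included, every prediction is the label of training row 20; with the Id set aside, the predictions differ from row to row.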

traveria · March 2009

Many thanks Haddock,

After some investigation I realized that the algorithm produces the same prediction in all cases because the training dataset does not have the same data description (metadata in the aml file) as the test dataset; hence the algorithm does not know what to predict and keeps producing the last correct prediction.

I still do not understand why the two datasets do not have the same structure. Try the minimal process at the end of this message to see for yourself: what it writes first is not the same as what it writes afterwards.

After solving this little inconvenience I can run the Nearest Neighbors correctly.

Many thanks for your comments anyway ;D!!!!!

Miquel

<?xml version="1.0" encoding="UTF-8"?>
<process version="4.2">

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="number_examples" value="1000"/>
            <parameter key="target_function" value="polynomial"/>
        </operator>
        <operator name="ExampleSetWriter (2)" class="ExampleSetWriter">
            <parameter key="attribute_description_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.aml"/>
            <parameter key="example_set_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.dat"/>
            <parameter key="quote_whitespace" value="false"/>
        </operator>
        <operator name="SimpleExampleSource (2)" class="SimpleExampleSource">
            <parameter key="filename" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.dat"/>
            <parameter key="label_column" value="6"/>
            <parameter key="use_quotes" value="true"/>
        </operator>
        <operator name="FeatureRangeRemoval" class="FeatureRangeRemoval">
            <parameter key="first_attribute" value="6"/>
            <parameter key="last_attribute" value="6"/>
        </operator>
        <operator name="ExampleSetWriter" class="ExampleSetWriter">
            <parameter key="attribute_description_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set_test.aml"/>
            <parameter key="example_set_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set_test.dat"/>
            <parameter key="quote_whitespace" value="false"/>
        </operator>
    </operator>

</process>

haddock · March 2009

Hi,

It is rather difficult to comment on this unless you show what you put in "polinomi_set.aml"; perhaps you will oblige us?

However, some things are obvious, whatever you put in that file:

1. The generator produces 5 attributes and 1 label = 6 columns.

2. Removing attribute number 6 cannot work, unless there are 6 attributes.

3. There can only be 6 attributes if the label column is set to 0.

4. But in your code it is marked as being in column 6!

5. So this code NEVER could work, whatever is in "polinomi_set.aml".
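
The column arithmetic in the points above can be sketched in a few lines of Python. This is only a model of the reasoning, not of how RapidMiner itself counts attributes. The function names `regular_attributes` and `can_remove` are hypothetical, and the convention that a label column of 0 means "no label" is taken from point 3.

```python
def regular_attributes(total_columns, label_column):
    """1-based column numbers left as regular attributes.

    label_column == 0 is taken to mean "no label column" (see point 3).
    """
    return [c for c in range(1, total_columns + 1) if c != label_column]


def can_remove(attribute_index, total_columns, label_column):
    """True if the 1-based attribute index refers to an existing attribute."""
    count = len(regular_attributes(total_columns, label_column))
    return 1 <= attribute_index <= count


# Label in column 6: only five regular attributes remain, so there is no sixth.
print(can_remove(6, total_columns=6, label_column=6))

# No label column: all six columns are regular attributes.
print(can_remove(6, total_columns=6, label_column=0))
```

Under these assumptions the first call prints False and the second prints True, which matches points 2 to 4.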

Which leaves me with a question, what on earth were you trying to achieve with this post?
