"Bug in ModelApplier?"
Intro: First of all, I would like to congratulate the Rapid-I team on this great piece of software. The user interface and the philosophy behind the data and operator handling are well designed and intuitive, and the set of algorithms and visualizations is very rich.
However, I stumbled over quite a bug when I tried, as an exercise, to solve the DMC'2007 challenge with RapidMiner. It seems to me that something goes wrong in the ModelApplier when MetaCost is combined with certain datasets.
Bug: ModelApplier seems to change the label values in a dataset, and this leads to completely different classification errors on the same data.
How to reproduce: Two small datasets, dmc2007_test_small.csv and dmc2007_test_sm_2.csv, are attached to this post. Each contains exactly the same set of 149 records; the only difference is that the order of the records is slightly rearranged: the labels read N…NBN…NA… in dmc2007_test_small.csv and N…NAN…NB… in dmc2007_test_sm_2.csv (only two lines are interchanged).
When you run dmc2007_test_small.csv through the following script, the number of B labels changes completely (from 11 to 23) as the data passes through ModelApplier (see the attached screenshots in my_results.pdf; the classification error goes from 30% to 26%). This is not the case with dmc2007_test_sm_2.csv; there everything is OK. The script is:
<operator name="Root" class="Process" expanded="yes">
<!-- test process: read the test set, load the trained model, apply it, and compute the classification error -->
<operator name="CSVExampleSource" class="CSVExampleSource">
<parameter key="filename" value="dmc2007_test_small.csv"/>
<parameter key="id_column" value="1"/>
<parameter key="label_column" value="22"/>
</operator>
<operator name="ModelLoader" class="ModelLoader">
<parameter key="model_file" value="dmc2007-dt.mod"/>
</operator>
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="ClassificationPerformance" class="ClassificationPerformance">
<list key="class_weights">
<parameter key="N" value="1.0"/>
<parameter key="A" value="999.0"/>
<parameter key="B" value="1.0"/>
</list>
<parameter key="classification_error" value="true"/>
<parameter key="keep_example_set" value="true"/>
</operator>
</operator>
Remark: The model dmc2007-dt.mod can be trained using the script below; dmc2007_test_sm_2.csv has the same order of label appearance as the training data set. Here is the training script:
<operator name="Root" class="Process" expanded="yes">
<operator name="CSVExampleSource" class="CSVExampleSource" breakpoints="after">
<parameter key="filename" value="dmc2007_train_small.csv"/>
<parameter key="id_column" value="1"/>
<parameter key="label_column" value="22"/>
</operator>
<operator name="DecisionTree" class="DecisionTree">
<parameter key="keep_example_set" value="true"/>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="dmc2007-dt2.mod"/>
</operator>
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="ClassificationPerformance" class="ClassificationPerformance">
<list key="class_weights">
<parameter key="N" value="1.0"/>
<parameter key="A" value="999.0"/>
<parameter key="B" value="1.0"/>
</list>
<parameter key="classification_error" value="true"/>
<parameter key="keep_example_set" value="true"/>
</operator>
<operator name="CostEvaluator" class="CostEvaluator">
<parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
<parameter key="keep_exampleSet" value="true"/>
</operator>
</operator>
This seems somewhat disturbing to me, since ModelApplier changes the incoming data ("label"), which it is expected only to read.
And of course things can get much worse: if we put a record with label "B" as the first record of the dataset (again, the set is exactly the same), we get an apparent classification error of 86% (which is again due to the wrong labels; the predictions of the model are exactly the same).
Recently I found out that the bug does not depend on the MetaCost part of the training model; the same thing happens if we just use a decision tree as the model.
Another topic: it is not clear to me how the rows and columns of the cost matrix connect to the labels (at least I cannot find it in the documentation; by trial and error I found that the order of occurrence in the training set probably defines the rows). It would be nice to have the cost matrix interface extended so that it is clear which dimension is true and which is predicted (row or column?) and which line corresponds to which label. My current guess is sketched below.
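For what it is worth, here is my trial-and-error reading of the cost matrix [0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0] from the training script above, assuming that rows are the true class and columns the predicted class, both in the order of first occurrence in the training set (N, A, B). This is a guess, not documented behaviour:

            pred N   pred A   pred B
true N       0.0      0.0      0.0
true A       1.0     -3.0      1.0
true B       1.0      1.0     -6.0

Read this way, a correct A prediction yields a gain of 3, a correct B prediction a gain of 6, and any misclassification of a true A or B costs 1.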
I wish you all the best for your product; we are currently considering using it in some of our Master's and Bachelor's data mining courses.
Best regards
Wolfgang Konen
Institut für Informatik,
FH Köln - Campus Gummersbach
Steinmüllerallee 1
51643 Gummersbach
www.gm.fh-koeln.de/~konen
P.S.: Since no one responded to my bug report ID 2686544 in SourceForge's Rapid-I bug tracker (March 13th), I post it here again. I have tried to put it in a more concise form so that you can see the error better. Just as a note: if you solve this, the bug with ID 2686544 is also done. Hope to see some sort of reaction this time...
P.P.S.: If you do not maintain the bug tracker at SourceForge (which I can understand; you already have a lot to do with the forum), it would perhaps be nice to put a note saying so at http://sourceforge.net/tracker/?group_id=114160&atid=667390
WK
[attachment deleted by admin]
Answers
I've downloaded and unzipped the data and rearranged the XML to include the two test files and to keep the generated model going throughout, like this. But if I run this code I cannot replicate your problem, because the decision tree just produces "N" in all cases, for all datasets; "A" and "B" never show up. I must have made a stupid mistake somewhere, but I'm damned if I can see where it is. On the other hand, the cost evaluator does seem to switch the As and Bs around in the subsequent test datasets, which can't be good.
Perhaps others can have a bash and lighten my darkness...
I am, however, using RapidMiner 4.3; perhaps this makes a difference.
Anyhow, I give you the following code (which is similar to your script). It has the DT-building operators disabled; instead it reads the DT model from dmc2007-dt2.mod, which I attach below in the ZIP (along with dmc2007_test_sm_3.csv). With this you should be able to reproduce first a classification error of 86.5% and then one of 24.1%, and you can see that the labels for "true B" and "true N" are interchanged.
This leaves us with the bug in its cleanest form...
Regards
WK
[attachment deleted by admin]
Yep, I get the same. For those of us who optimise classifications this is pretty scary stuff, but many thanks for bringing it to our attention.
If it is related to issues mentioned in http://rapid-i.com/rapidforum/index.php/topic,782.0.html then it is high time it was put to bed.
Thanks again.
Same here. So I tried to load the files separately and saved them in RapidMiner format (that is, *.aml and *.dat). As you can clearly see, the labels are interchanged because the internal mapping has changed. What does that mean? The standard RapidMiner format for ExampleSets stores all data in an array of numbers; nominal values are stored using a mapping, which maps every internal number to the real (external) string value.
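To make this concrete (assuming the mapping is built in the order of first appearance, which is what the behaviour suggests): the training data and dmc2007_test_sm_2.csv present the labels in the order N, A, B, giving the internal mapping 0=N, 1=A, 2=B; dmc2007_test_small.csv presents them in the order N, B, A, giving 0=N, 1=B, 2=A. The stored numbers are identical, but they decode to interchanged A and B labels.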
So... I have changed the sequence manually in the .aml files... which results in a constant "quality" of 22.82%.
Here is the process:
- Run the complete process to see the labels interchange and the performance jump.
- Disable the first operator chain and change the sequence manually in the stored .aml files.
- Rerun the process to see what I have seen.
Conclusion: It is not a bug in ModelApplier; it is a bug in the way RM stores the data internally. Normally, the data storage should not affect the usage of the data (if the values are retrieved correctly). I guess this is the same problem as here (http://rapid-i.com/rapidforum/index.php/topic,281.0.html), which has not been fixed yet.
Workaround: Store the data in the RM format and adjust the critical parts of the *.aml file manually; a sketch of the relevant fragment follows below.
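For illustration, the critical part of the .aml file looks roughly like this (a sketch in the RapidMiner 4.x attribute description format; the file name and column numbers are placeholders, so check them against your own files). The order of the <value> tags fixes the internal mapping, so it must be identical for the training and the test data:
<attributeset default_source="dmc2007_test_small.dat">
<id name="id" sourcecol="1" valuetype="integer"/>
<attribute name="att2" sourcecol="2" valuetype="real"/>
<!-- ... further attributes ... -->
<label name="label" sourcecol="22" valuetype="nominal">
<!-- keep this order identical in every .aml file: N=0, A=1, B=2 -->
<value>N</value>
<value>A</value>
<value>B</value>
</label>
</attributeset>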
kind regards,
Steffen
Thanks for confirming my results.
Hello Steffen,
thanks for your reply, which came just minutes before I was about to post a similar workaround I found in the last hour. Yep, the workaround works! When I use the ExampleSource operator instead of CSVExampleSource, with appropriate AML and DAT files, and when I take care that the order of the <value> tags is the same in each AML file, then I get the same results with each rearranged dataset.
It is, however, still quite a scary trap for a newcomer to RapidMiner. :-\
But thanks again for your fast reply.
Wolfgang
The good news is that we are currently working on a new way of meta data storage and handling for RM 5.0, which will allow (re-)using meta data stored together with the data in a data repository. Each operator will then transform the meta data accordingly, which has a nice side effect: in future versions you will also be able to see what the meta data looks like at almost arbitrary places in the process without running it...
The bad news: until then, you have to ensure the correctness of the meta data yourself, which can easily be done by using the same .aml file for corresponding data sets (just replace the path to the data file in the header).
Cheers,
Ingo
By the way, I'm also getting this error message:
[Warning] Kernel Model: The order of attributes is not equal for the training and the application example set. This might lead to problems for some models.
Here's my code - please help!
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.4">
<operator name="Root" class="Process" expanded="yes">
<parameter key="logverbosity" value="init"/>
<parameter key="logfile" value="OUT_%{process_name}_RootLog0.log"/>
<parameter key="resultfile" value="OUT_%{process_name}_RootResults0.res"/>
<parameter key="random_seed" value="2001"/>
<parameter key="encoding" value="SYSTEM"/>
<operator name="ExampleSource_WkArnd2" class="ExampleSource">
<parameter key="attributes" value="C:\Desktop\RapidMiner\_20090404_NPScs_ALL_Dec08_KWA\OUT__20090404_csNPS_Dec08_WholeShebang_VAL_AttDescFile_ModVal.aml"/>
<parameter key="sample_ratio" value="1.0"/>
<parameter key="sample_size" value="-1"/>
<parameter key="permutate" value="false"/>
<parameter key="decimal_point_character" value="."/>
<parameter key="column_separators" value=",\s*|;\s*|\s+"/>
<parameter key="use_comment_characters" value="true"/>
<parameter key="comment_chars" value="#"/>
<parameter key="use_quotes" value="false"/>
<parameter key="quote_character" value="""/>
<parameter key="quoting_escape_character" value="\"/>
<parameter key="trim_lines" value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="local_random_seed" value="-1"/>
</operator>
<operator name="ModelLoader" class="ModelLoader">
<parameter key="model_file" value="C:\Desktop\RapidMiner\_20090404_NPScs_ALL_Dec08_KWA\OUT__20090404_csNPS_Dec08_WholeShebang_Model_ModDevOutput2.mod"/>
</operator>
<operator name="ModelApplier_ModVal" class="ModelApplier">
<parameter key="keep_model" value="true"/>
<list key="application_parameters">
</list>
<parameter key="create_view" value="false"/>
</operator>
<operator name="ExampleSetWriter_ModVal" class="ExampleSetWriter">
<parameter key="example_set_file" value="OUT_%{process_name}_ExampleSetFile_ModValOutput_LiftCurve.dat"/>
<parameter key="format" value="special_format"/>
<parameter key="special_format" value="$i $l $p $d"/>
<parameter key="fraction_digits" value="-1"/>
<parameter key="quote_nominal_values" value="true"/>
<parameter key="zipped" value="false"/>
<parameter key="overwrite_mode" value="overwrite"/>
</operator>
</operator>
</process>
I knew I was going to love RapidMiner even more when I could get the model validated -- woo hoo!
Have fun and all the best,
Ingo
Wait for it... (little drum roll)
Warning: As the java ResultSetMetaData interface does not provide information about the possible values of nominal attributes, the internal indices the nominal values are mapped to will depend on the ordering they appear in the table. This may cause problems only when processes are split up into a training process and an application or testing process. For learning schemes which are capable of handling nominal attributes, this is not a problem. If a learning scheme like a SVM is used with nominal data, RapidMiner pretends that nominal attributes are numerical and uses indices for the nominal values as their numerical value. A SVM may perform well if there are only two possible values. If a test set is read in another process, the nominal values may be assigned different indices, and hence the SVM trained is useless. This is not a problem for label attributes, since the classes can be specified using the classes parameter and hence, all learning schemes intended to use with nominal data are safe to use.
Rapidminer-4.3-tutorial.pdf page 103.
So there we have it: we were all warned. Moving swiftly along, it turns out that this problem can also be avoided completely if you use a database example set and fill in the blanks. Here's a rework of Wokon's example which returns exactly what it should, namely 23.49% in both cases under 4.4 Enterprise. The thing that saves the day is the parameter <parameter key="classes" value="N A B"/> (see the fragment below). Having a similar (and probably required) parameter on all the file input operators as well would put this one to bed and save us from the "me too" bug posts.
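The rework itself is not reproduced here, but the fragment that matters looks something like this (a sketch only; apart from the classes parameter quoted above, the operator and parameter names are from memory of the 4.x operator set, and the connection details are placeholders):
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<!-- connection and query details are placeholders -->
<parameter key="database_url" value="jdbc:..."/>
<parameter key="query" value="SELECT * FROM dmc2007_test"/>
<parameter key="label_attribute" value="label"/>
<!-- pins the internal nominal mapping (N=0, A=1, B=2) independently of record order -->
<parameter key="classes" value="N A B"/>
</operator>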