Yet Another Model Applier Problem
I have recently integrated RapidMiner into some software in a practical situation; however, I have found some behaviour today that has me greatly concerned.
Reading back through my email I sound very angry and bitter. I'm sorry about that, because I know that you guys are working as hard as you can to make RapidMiner a great product and I was just venting my stress at working to a deadline.
I loaded data with the excel example set loader to create and save a model.
I then load a single instance that I want to classify using the excel example set loader and output the data and the classification with the excel example set writer.
Unfortunately, for the nominal variables only a single value per variable survives in the output. If the nominal value in the single instance is different from the 'default', the only output is a question mark.
SEX MARSTAT EDUC EMPLOY
Male ? ? Employed
So if MARSTAT is "Current Long-Term" it will be output, but for any other value only a question mark is returned.
However, if I load the data with the excel example source but don't check the 'use first row as headers' box then the variables are displayed correctly but of course the predictions etc. are not at all sensible.
SEX MARSTAT EDUC EMPLOY
Male Previous Long-Term Junior (Yr 10) Employed
How can I know whether the classification result output is correct, or whether it is just not outputting the data correctly?
It seems that this is a problem with the internal representation of the nominal variables. I would have thought that the model would contain the appropriate information on how to convert the nominal variables for a single new test case into the appropriate internal representation.
It is the only way that it could be expected to behave consistently. Does this happen?
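What I suspect is happening can be sketched in a few lines of Python. This is purely illustrative, not RapidMiner's actual internals: each loader builds its own value-to-index mapping for a nominal attribute, and a single-row file only ever sees one value per attribute.

```python
# Illustrative sketch only -- not RapidMiner internals. Each loader builds
# its own value-to-index mapping for each nominal attribute.

# Training data: the loader sees every category, so the mapping is complete.
train_values = ["Current Long-Term", "Previous Long-Term", "Never Married"]
train_map = {v: i for i, v in enumerate(train_values)}

# A single test instance loaded on its own: the loader only ever sees the
# one value that occurs, so the mapping knows nothing about the others.
test_map = {"Previous Long-Term": 0}

def decode(index, mapping):
    """Turn an internal index back into a nominal value, or '?' if unknown."""
    inverse = {i: v for v, i in mapping.items()}
    return inverse.get(index, "?")

# Index 0 means different values in the two mappings, and any index the
# one-row mapping has never seen decodes to a missing value.
print(decode(0, train_map))  # Current Long-Term
print(decode(1, test_map))   # ?
```

If this picture is right, it would explain both symptoms: copying the complete value lists from train.aml into test.aml makes the two mappings agree, while loading the single instance on its own leaves every non-'default' value undecodable.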
If this is not the case, then RapidMiner as it stands is almost completely useless for what is probably the most common practical real-world task for which data mining could be used, whenever nominal variables are involved!
If I read the information in using the CSV reader, save it using the example set writer, and then copy and paste the header information from the training set's .aml file into the .aml file for the single example, I can get what seems to be the correct output. I haven't found a way to automate this process, as there is no way to associate an .aml file with a new data file as far as I can see.
So, do I have to write a script to replace the header each time for the most simple task imaginable?
Does anyone have any suggestions about what can be done here?
I have tried something like this:
- Loaded training data, wrote out training data with examplesetwriter
- Made a copy of the .aml file from the training data, saved as testo
- Opened testo manually and changed its associated file to test
<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="test.dat">
And then in a different process
- Read in data with excel example set
- Write out data with examplesetwriter (but only the test.dat file)
- Tried to read in testo.aml using exampleset
I thought that this would read exampleset, look at which file it is associated with and load the appropriate .dat file.
If this approach works then I could keep the correct .aml description and just load the appropriate .dat file each time
It doesn't appear to work, and I can't understand why. It appears to ignore the attributeset's default_source and only look for a .dat file with the same name.
Ok, another approach.
I read in the single example with the Excel example source and write it out to data.aml and data.dat.
I copy the appropriate part from train.aml (the files are of different length because data.dat doesn't have a label variable).
Apply the model
Output with excelwriter.
Now all of the variables appear correctly but no prediction is made!
A last approach
Put everything in the same stream.
Examplesource (train)
W-J48
ExcelExampleSource (single prediction instance)
ModelApplier
ExcelExampleSetWriter
Output is the same as before: it comes with a prediction, but nominal variables not of the 'default' type are shown as ?.
To summarise:
If you create a model with nominal variables and wish to apply it to a single example to make a prediction and output the results into Excel, there is no simple way to preserve the information for the nominal variables that do not correspond to a 'default' nominal value.
In other words, what I think is happening is that if you wish to apply a model to a single example, there is no way to get the nominal variables in the single example to map appropriately to the internal representations.
Does what I am saying make sense?
I can't think of anything else to try here.
Anyone have any ideas?
Answers
I had tried all of what I have repeated below on the full data set and was getting incredibly weird results that I couldn't understand, so I thought that the approach suggested in the other thread was not working.
Just to reiterate that approach for anyone reading:
1. Read in your training data with whatever examplesource type that you need to use
2. Check that all of your variables are of the correct type and special category
3. Write out your data using the examplesetwriter
4. Repeat these steps 1-3 for your test data
5. Copy the attribute information from your training.aml file and replace it in your test.aml file (careful not to copy over the header with the file information!)
6. Read in your training data, build a model with it and write out the model
7. When you want to classify a new instance
a. Read in your new instance data using whatever examplesource you want
b. Save your new instance data using the examplesetwriter, BUT ONLY FILL IN THE EXAMPLE_SET_FILE (the .dat file), and use the same name as you chose for your test.dat file before. Be aware that you may want to change the overwrite mode on the examplesetwriter. I personally want to classify one instance and then later have that data overwritten, so I chose overwrite.
c. Read in your new instance using examplesource with test.aml.
d. Load your model
e. Apply your model
f. Write out your new data in whatever form you are happy with.
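Step 5 is the only manual part of the recipe above. It can be scripted; the following is a hedged sketch under some assumptions: the filenames are examples, the files are small enough to read whole, and a flat regex stands in for a proper XML parser, which would be more robust for real .aml files.

```python
# Sketch: copy the attribute definitions from train.aml into test.aml while
# keeping test.aml's own XML prolog and <attributeset> opening tag, so that
# default_source still points at test.dat. Filenames/encoding are assumptions.
import re

def transplant_attributes(train_aml, test_aml):
    with open(train_aml, encoding="windows-1252") as f:
        train = f.read()
    with open(test_aml, encoding="windows-1252") as f:
        test = f.read()

    # Everything between <attributeset ...> and </attributeset> in train.aml.
    body = re.search(r"<attributeset[^>]*>(.*)</attributeset>", train, re.S)
    # test.aml's XML declaration plus its own <attributeset ...> opening tag.
    head = re.search(r".*?<attributeset[^>]*>", test, re.S)

    with open(test_aml, "w", encoding="windows-1252") as f:
        f.write(head.group(0) + body.group(1) + "</attributeset>")

# Example (hypothetical filenames):
# transplant_attributes("train.aml", "test.aml")
```

Because only the attribute block is transplanted, the test file keeps its own default_source and the full nominal value lists come across from training.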
Your reply made me sit down and try it all again with a smaller, cut-down data set to make sure that I was getting it all right. Using the proposed workaround did work on the smaller data set in the end, and when I went back and ran the larger data set it worked. I don't know what was going on.
Also, in the process I found that I had misnamed one of the nominal categories, but that didn't seem to be causing the weirdness.
I will post in another thread about the possibility of including the mapping in the model.
Finally, I wonder whether the fact that the ID and LABEL columns can get switched around when transforming the data from Excel (for example) to an example set could cause a problem in the future, once the internal mapping problem is licked (see below the dotted line for further explanation; saving as an example set can reorder the special attributes relative to the rest of the data).
I won't include all the files as I had intended, and if you want you can safely ignore all of my experimental details below.
---------------------
In the first step I load the training set (RapidTrainCut.xls) with the excel loader and then write it out with the example set writer to the files train.dat and train.aml (also attached). In the excel file the ID column is 11th (UR) and the label column is 12th (Success). In the data view Success (label) is in the first column with UR (ID) in the second column followed by the rest of the data in the original order.
In the train.aml/train.dat files it can be noted that these two columns have switched position relative to the .xls file that I started with. Perhaps this is causing a problem. We shall see.
The xls file:
GROUP UR SUCCESS
CBT only 132.00 Unsuccessful
CBT only 191.00 Unsuccessful
The train.aml file:
<attribute
name = "GROUP"
sourcecol = "10"
valuetype = "nominal">
<value>CBT only</value>
<value>Combination</value>
<value>Refuseniks</value>
<value>Acamprosate</value>
<value>St Judes</value>
<value>Naltrexone</value>
</attribute>
<label
name = "SUCCESS"
sourcecol = "11"
valuetype = "nominal">
<value>Unsuccessful</value>
<value>Successful</value>
</label>
<id
name = "UR"
sourcecol = "12"
valuetype = "integer"/>
</attributeset>
In the second step I load the single test example (tempCUTwithSUCCESS.xls) with the excel loader and then write it out with the example set writer to the files test.dat and test.aml. The same reversal of SUCCESS and UR has taken place.
In the third step I use the examplesource to load train.aml, create a model using W-J48 and then write the model (model1.mod) with the modelwriter in plain XML.
So, now we have a ‘correct’ and fully fleshed out train.aml/train.dat. We have a model created with those files, model1.mod, and we have a single item file for which we want a prediction of SUCCESS that has a correct test.dat but a test.aml which only has a single value for each nominal variable and should not be expected to work.
Now I will run a series of tests starting with predictions that we don’t expect to work and finishing with one that does.
FIRST TEST
My program outputs .xls files, so I wanted to be able to simply output a test case to be predicted in .xls and get a prediction. I load tempCUTwithSUCCESS.xls with the Excel loader, with 11 for the ID and 12 for the label, as in the original data.
I loaded the model with the modelloader, applied the model with the modelapplier, and then wrote out the results to the file test1.xls with the Excel example set writer.
There are a number of warnings, as was to be expected:
Apr 10, 2009 9:20:13 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'SEX', training: 2, application: 1
Apr 10, 2009 9:20:13 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'MARSTAT', training: 3, application: 1
Apr 10, 2009 9:20:13 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EDUC', training: 5, application: 1
Apr 10, 2009 9:20:13 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EMPLOY', training: 3, application: 1
Apr 10, 2009 9:20:13 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'ACCOM', training: 3, application: 1
Apr 10, 2009 9:20:13 PM: [Warning] W-J48: The value types between training and application differ for attribute 'USE', training: integer, application: nominal
Apr 10, 2009 9:20:13 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'GROUP', training: 6, application: 1
The file test1.xls is as follows. I have placed the original alongside for comparison:
SEX Male
MARSTAT Current Long-Term
EDUC ?
EMPLOY ?
ACCOM ?
USE
ONSET 12.0
USUAL 40.0
MAX 280.0
GROUP ?
SUCCESS
UR 228.0
prediction(SUCCESS) Unsuccessful
confidence(Unsuccessful) .8
confidence(Successful) .2
Original tempCUTwithSUCCESS
SEX Male
MARSTAT Current Long-Term
EDUC Tertiary (Non-Uni)
EMPLOY Unemployed
ACCOM Own Home
USE
ONSET 12
USUAL 40
MAX 280
GROUP Combination
UR 228
SUCCESS
As you can see, there are question marks in place of EDUC, EMPLOY, ACCOM and GROUP.
For the second test I will load the ‘incomplete’ test.aml file and run the same test as above but writing out with excelwriter to a file test2.
A set of rather useful warnings is produced:
Apr 10, 2009 9:34:54 PM: [Warning] ExampleSource: At least one of the attributes is defined with a nominal value type but the possible values are not defined! Please specify the possible values by inner tags <value>first</value><value>second</value>.... Otherwise it might happen that the same nominal values of two example sets are handled in different ways which might cause less accurate models.
Apr 10, 2009 9:34:54 PM: [Error] ExampleSource: The label attribute (class) 'SUCCESS' is defined with a nominal value type but the possible values are not defined! Please specify the possible values by inner tags <value>first</value><value>second</value>.... Otherwise it might happen that the same nominal values of two example sets are handled in different ways which might cause flipped predictions.
Apr 10, 2009 9:34:54 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'SEX', training: 2, application: 1
Apr 10, 2009 9:34:54 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'MARSTAT', training: 3, application: 1
Apr 10, 2009 9:34:54 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EDUC', training: 5, application: 1
Apr 10, 2009 9:34:54 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EMPLOY', training: 3, application: 1
Apr 10, 2009 9:34:54 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'ACCOM', training: 3, application: 1
Apr 10, 2009 9:34:54 PM: [Warning] W-J48: The value types between training and application differ for attribute 'USE', training: integer, application: nominal
Apr 10, 2009 9:34:54 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'GROUP', training: 6, application: 1
SEX Male
MARSTAT Current Long-Term
EDUC ?
EMPLOY ?
ACCOM ?
USE
ONSET 12.0
USUAL 40.0
MAX 280.0
GROUP ?
SUCCESS
UR 228.0
prediction(SUCCESS) Unsuccessful
confidence(Unsuccessful) .8
confidence(Successful) .2
We get the same prediction as above.
Test 3, The suggested work-around:
For this I have copied the attribute information from the train.aml file and replaced the attribute information in the test.aml file. This should give the same internal mappings for test.aml when used with the model applier.
I renamed the old test.aml file to test - Copy.aml. The new test.aml file is attached.
I won’t include the xml code here as it is the same as for Test 2 except that I write out to Test3.xls.
This time it ran without warnings:
SEX Male
MARSTAT Current Long-Term
EDUC Tertiary (Non-Uni)
EMPLOY Unemployed
ACCOM Own Home
USE
ONSET 12.0
USUAL 40.0
MAX 280.0
GROUP Combination
SUCCESS
UR 228.0
prediction(SUCCESS) Unsuccessful
confidence(Unsuccessful) .8
confidence(Successful) .2
Finally we have success!
Before, I could not get it to work, but sitting down and working through it step by step seems to have got me somewhere.
So, finally, to get a single stream that will work with my program (assuming that the model has already been trained and the test.aml file has been altered to have the same attributes as the train.aml file), what I am going to do is load the new Excel file using the ExcelExampleSource. Secondly, I will write out the data using the ExampleSetWriter, BUT ONLY FILLING IN THE DAT row with test.dat.
Hopefully this will keep the test.aml file and only replace the test.dat file, so the data will have the correct attribute headers.
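For what it's worth, the same staging can also be done outside RapidMiner with a single file operation. A minimal sketch, assuming the filenames used above:

```python
# Sketch of the per-instance staging step (filenames are assumptions):
# test.aml, with its full attribute definitions and nominal value lists,
# is left untouched; only the data file it points at is overwritten.
import shutil

def stage_new_instance(fresh_dat, target_dat="test.dat"):
    # Overwrite only the data file; the .aml description is never touched,
    # so the attribute headers stay correct between runs.
    shutil.copyfile(fresh_dat, target_dat)
```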
Then the model loader, the model applier and finally writing out to test4.xls:
Compared with the new data (the second block below): it worked!
SEX Female
MARSTAT Current Long-Term
EDUC Uni
EMPLOY Unemployed
ACCOM Other
USE 20.0
ONSET 12.0
USUAL 40.0
MAX 280.0
GROUP Combination
SUCCESS
UR 228.0
prediction(SUCCESS) Successful
confidence(Unsuccessful) .0
confidence(Successful) 1.0
SEX Female
MARSTAT Current Long-Term
EDUC Uni
EMPLOY Unemployed
ACCOM Other
USE 20
ONSET 12
USUAL 40
MAX 280
GROUP Combination
UR 228
SUCCESS
Personally I'd rush off and install some SQL freebie; not only do you avoid all this buggeration, but you can do some pretty whacky stuff using SQL and RM macros, like optimising label columns on the fly to find unexpected patterns that validate well. Just thinking of doing that with files makes me want to sit in a darkened room with tranquillisers ;D