Error Apply Model [SOLVED]
Hi,
I have a classification problem with 2 classes. Unfortunately one cannot access the prediction label when using a Feature Selection process, so I saved the attribute weights and started a new process with the loaded weights. I applied the model to the "unseen" test set and compared its performance with the test set performance of the FS process, which used the same weights. The performance of the applied model differs a lot from the test set performance of the FS process. Can you fix this bug, and maybe offer a possibility to access the prediction label directly from the FS process?
Furthermore I have to report that when saving the model to an XML file (or similar) and reloading it, the performance also differs a lot from the FS process performance. Can you fix it?
Thanks in advance, Daniel
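For readers following the thread, here is a rough Python/scikit-learn sketch of the workflow Daniel describes (the real processes are RapidMiner XML; the file paths, column names, learner, and attribute weights below are placeholders): keep the attributes indicated by the stored weights, apply the trained model to the held-out test set, and compare the resulting accuracy with the one reported inside the FS process.

```python
# Rough Python analogue of the described workflow (not the actual RapidMiner process).
# "weights" mimics the stored attribute weights; paths and names are placeholders.
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

weights = {"att1": 1.0, "att2": 0.0, "att3": 1.0, "att4": 0.0}   # saved by the FS process
selected = [name for name, w in weights.items() if w > 0]        # keep non-zero-weight attributes

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")                                   # the "unseen" test set

model = GaussianNB().fit(train[selected], train["label"])
print(accuracy_score(test["label"], model.predict(test[selected])))
# If the feature set, the learner and the test rows are identical, this accuracy
# should match the test performance reported by the FS process itself.
```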
Answers
Can you please describe in detail what you are doing with the loaded weights, and how you are performing the Feature Selection? Most useful would be example processes.
Best,
Marius
Thank you very much for your reply. Unfortunately I only just found out that you had written to me; maybe you could offer automatic notification e-mails.
Anyway, I have a Feature Selection process (linear Split Validation). I need the predicted labels for the test set, but unfortunately this is currently not possible with RM. So I use the "Save Model", "Load Model" and "Apply Model" operators and run the process again, only on the test set, in order to get the predictions I need for further processing. The problem is that the model is not at all the same as the one I saved before. The classification accuracies differ a lot, although the test set in the FS process and the test set the loaded model is applied to are identical. It is the same problem as here:
http://rapid-i.com/rapidforum/index.php/topic,3438.msg16533.html#msg16533
Can I send you my process??
Thanks in advance and again sorry for the late reply.
Daniel
You can post your processes here in the forum. Just open the process in RapidMiner, go to the XML tab on top of the process view and copy the XML code into your post, surrounding it with code tags via the "#" button above the input field here in the forum.
Best,
Marius
This is the code. Instead of using the "Store (Model)" and "Recall (Model)" operators, one can also use "Write Model" and "Load Model". Since I cannot directly access the prediction label (for the test set) of the Feature Selection process, I need to save the built model after the FS process in order to load it and apply it to the identical test data. And since I cannot see the predicted label and the performance evaluation at the same time, I need to run the process again, this time with a performance evaluation operator at the end, to be able to compare the performance results of the FS process with those of the built and applied model. Actually the performance should be the same, since the test set data is identical, but the results differ for no apparent reason. I have checked it a hundred times. Do you have an explanation?
P.S. Since the maximum number of characters was reached, I deleted some features from the code, but this shouldn't be a problem...
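As a rough analogue of the Store/Recall (or Write/Load Model) step described above, here is a Python sketch with joblib standing in for RapidMiner's model persistence; the paths and attribute names are placeholders. A correctly saved and reloaded model must reproduce its predictions exactly on identical test data, so any accuracy difference points at the surrounding process rather than at the persistence step.

```python
# Sketch only: joblib stands in for the model-writing/loading operators.
import pandas as pd
import joblib
from sklearn.naive_bayes import GaussianNB

train = pd.read_csv("train.csv")               # placeholder paths
test = pd.read_csv("test.csv")
selected = ["att1", "att3", "att7"]            # placeholder attribute list

model = GaussianNB().fit(train[selected], train["label"])
joblib.dump(model, "fs_model.joblib")          # "write model" analogue

reloaded = joblib.load("fs_model.joblib")      # "load model" analogue in the second process
# The reloaded model must reproduce the original predictions exactly.
assert (model.predict(test[selected]) == reloaded.predict(test[selected])).all()
```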
Btw, you can simplify your process a bit (see below).
Best, Marius
Thanks a lot, this really makes it easier. I only needed to add another CSV reader (and disable the Multiply operator), since the model should only be applied to the test set of the data, not to the whole dataset, which also includes the training data...
I thought that, since the test set is identical to the set the model is applied to, the performance should not differ, right? The model is built after the validation process; how can it be that the test set is not classified identically? The accuracy sometimes differs by only 3% (67% vs. 64%), but sometimes by 22% (68% vs. 46%). The latter happens when the validation process keeps running for a long time even though the performance has stopped improving. The strange thing is that the applied model predicts every data point into the same class, never into the other one (it is a 2-class case). That is why its accuracy is only 46%, while the test set accuracy of the forward selection process is 68%.
It is really annoying that I cannot fix it.
Can you help me?
Thanks in advance
Daniel
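One quick sanity check for the degenerate behaviour Daniel describes (every row predicted into the same class) is to look at the prediction counts and the confusion matrix of the applied test set. A small Python sketch with placeholder labels, only an analogue of inspecting the labelled ExampleSet:

```python
# Diagnostic sketch: check whether the applied model collapses to a single class.
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true, y_pred stand for the label and prediction columns of the applied test set.
y_true = np.array(["A", "B", "A", "B", "B", "A"])   # placeholder values
y_pred = np.array(["A", "A", "A", "A", "A", "A"])   # every row in the same class

print(np.unique(y_pred, return_counts=True))        # a single predicted class is a red flag
print(confusion_matrix(y_true, y_pred))             # the column of the missing class is all zeros
```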
I had another look at the process, and the way it was set up before does not make sense. The Forward Selection executes its subprocess for many attribute combinations, and there is no guarantee that the last execution takes place on the best feature set, so the last stored model is not necessarily the best one. You have to output the weights, apply them to the training data and then create the final model. Then you can apply it to the weighted test data.
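Sketched in Python/scikit-learn terms (only an analogue of the RapidMiner setup; attribute names, paths and the learner are placeholders), the approach described above is: take the attributes selected by the forward selection, train the final model once on the reduced training data, and only then apply it to the reduced test data.

```python
# Sketch, assuming a 2-class problem and placeholder file names.
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# 1) attributes selected by the forward selection (the "weights" output)
selected = ["att1", "att3", "att7"]          # placeholder

# 2) build the final model ONCE, on the weighted (reduced) training data
final_model = GaussianNB().fit(train[selected], train["label"])

# 3) apply it to the weighted test data; do not reuse whatever model happened
#    to be trained in the last iteration of the feature selection loop
pred = final_model.predict(test[selected])
print(accuracy_score(test["label"], pred))
```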
By the way, you should exchange your Split Validation for an X-Validation for more reliable results, even though it will take more time to run.
Best,
Marius
Thanks, that really makes sense. It works for the Naive Bayes classifier, but unfortunately when I change it to Linear Discriminant Analysis the error still occurs: the test set accuracy of the Forward Selection process is 71.15%, while the accuracy of the model applied to the identical dataset is 46.15% (all the labelled data is classified into the same class). The selected attributes are the same, so that is not the underlying error...
I really have no idea how this can happen.
Here is the process:
The outer Model training does not suffer from that problem, since it uses unsampled data. But even after that fix the performances won't be exactly the same, because the outer Train/Apply combination uses the whole dataset both for training and for testing, whereas the FS uses only a part of the data for training and the other part for testing.
Btw, in the Log operator you should log the "performance" of the Validation, not of the FS.
And I still strongly suggest exchanging the Split Validation for an X-Validation.
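To illustrate why an X-Validation gives more reliable numbers than a single Split Validation, here is a small scikit-learn analogue (placeholder path and a plain Naive Bayes learner; the RapidMiner operators work on the same principle): one fixed split yields a single, split-dependent estimate, while cross-validation averages over several folds.

```python
# Sketch: single split vs. cross-validation on the same data (scikit-learn analogue).
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv("train.csv")              # placeholder path
X, y = data.drop(columns="label"), data["label"]

# Split Validation analogue: one fixed 70/30 split, one performance number
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
single = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)

# X-Validation analogue: 10 folds, averaged, less dependent on one lucky or unlucky split
folds = cross_val_score(GaussianNB(), X, y, cv=10)

print("single split:", single)
print("10-fold mean:", folds.mean(), "+/-", folds.std())
```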
Best, Marius
The second CSV file therefore only comprises the test set, hence the 260 data points. That is why I think the classification performance on the test set of the FS process should be at least nearly the same as the performance of the second process.
Thanks a lot. Now it works without any difference between the two classification accuracies. I have implemented the mentioned filter and put the process Log operator inside the Validation process.
Is this code correct?
Best, Marius
Is this code correct for this procedure?
Best, Marius
Since I really need the test set to be unseen, and the Split Validation optimizes with regard to the test set, is there a possibility to train and optimize only on the training set and, AFTER having optimized it, apply the result to the unseen test set? Because if not, I am cheating myself, since the test set is not really unseen...
All the best,
Marius
I mean, with Split Validation the accuracy on the test set is very high, but it is optimized with regard to the test set, and since I really need the test set to be unseen, I would be cheating myself...
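A sketch of the setup being asked for, in scikit-learn terms (only an analogue; the actual process would use RapidMiner's Forward Selection with an inner validation, and the paths are placeholders): all feature selection and optimization are evaluated by cross-validation inside the training data only, and the held-out test set is touched exactly once at the very end.

```python
# Sketch: keep the test set truly unseen (scikit-learn analogue of the RapidMiner setup).
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

train = pd.read_csv("train.csv")             # optimization happens only on this data
test = pd.read_csv("test.csv")               # touched exactly once, at the very end

X_tr, y_tr = train.drop(columns="label"), train["label"]

# forward selection evaluated with cross-validation INSIDE the training data
fs = SequentialFeatureSelector(GaussianNB(), direction="forward", cv=10).fit(X_tr, y_tr)
selected = X_tr.columns[fs.get_support()]

final_model = GaussianNB().fit(X_tr[selected], y_tr)

# single, final evaluation on the unseen test set; nothing was tuned against it
print(final_model.score(test[selected], test["label"]))
```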
Btw, what is the algorithm behind the feature selection? If I use forward selection it starts with an empty feature set, but how does the algorithm choose the next feature(s) for the following generation? Greedy, hill climbing, random? How does "keep best" work? Does it add a certain number of random features and keep the best x of them, or how does it work?
1. For each remaining feature:
   - add the feature to the current feature set,
   - evaluate the performance,
   - remove the feature again and continue with the next one.
2. Add the feature with the best performance to the feature set.
3. Continue with 1. until no features are left or the maximum number of features is reached.
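In plain Python, that greedy scheme looks roughly like this (a sketch; evaluate() is a placeholder for whatever inner validation the feature selection uses, e.g. an X-Validation on the training data):

```python
# Greedy forward selection, as described above (sketch; evaluate() is a placeholder
# for the inner validation, e.g. a cross-validated accuracy on the training data).
def forward_selection(all_features, evaluate, max_features=None):
    selected, remaining = [], list(all_features)
    best_so_far = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # 1. try every remaining feature on top of the current set
        scores = {f: evaluate(selected + [f]) for f in remaining}
        # 2. add the single best candidate of this round
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_so_far:
            break                      # optional early stop when nothing improves anymore
        best_so_far = scores[best_feature]
        selected.append(best_feature)
        remaining.remove(best_feature)
        # 3. continue until no features are left or max_features is reached
    return selected
```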
Best,
Marius
great job!