Cannot compute the performance of a linear regression model
Hi there,
I'm a first-time user of RapidMiner and need to carry out a project for a course.
The goal is to create a linear regression model from some data, apply it to a new set of similar data and validate the model. The approach I adopted is the following:
1. Load the data
2. Select the interesting attributes (predictor variables which I believe affect the target)
3. Transform a categorical attribute into dummy variables
4. Apply the linear regression model
5. Load the new data set, apply the model and see the results
However, I get an error at the end when I try to connect the "lab" output port of the Apply Model block to the "lab" input port of the Performance block:
"Input ExampleSet does not have a label attribute performance"
Do you have any insights on this issue that could help me?
Please find attached the .rmp process.
Thanks in advance, A.
Answers
Hi,
The fix is relatively easy: just do what RapidMiner asks you to do and keep the label attribute "Avg_Sale_Amount" also in the test data (also change its role to label just like you did for the training data).
Think about it: how is RM supposed to calculate a performance if it does not know what the true values are? That is why the performance operators need both attributes, the label and the predictions so it can do the comparisons.
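To make this concrete outside of RapidMiner, here is a minimal Python sketch of what a regression performance measure does. The function name `rmse` and the numbers are made up for illustration and are not taken from the attached process; the point is only that the calculation cannot start without both the true values and the predictions.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between true labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one labelled example")
    # Every term compares one true value with one predicted value.
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, for illustration only.
true_values = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
print(rmse(true_values, predicted))
```

If `true_values` is missing, there is simply nothing to subtract the predictions from, which is what the error message is telling you.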
Hope this helps,
Ingo
Hi IngoRM,
thanks for your reply.
Actually, it's not that clear to me, I'm sorry. The attribute "Avg_Sale_Amount" is not present in the test data as it is the target variable that is to be predicted. Indeed, after feeding the "Apply model" block with the output of the linear regression and with the test data, in the results I see the attribute named "prediction(Avg_Sale_Amount)".
How should I keep the label attribute "Avg_Sale_Amount" also in the test data?
Thanks for your help, A.
Hi,
I don't have your original data so I do not know if the column "Avg_Sale_Amount" is in the original data or not. If it is, just include it in the list of attributes you are choosing with the operator Select Attributes. And also set the role to label (just as you did for the training part of the process).
If it is NOT part of the test data then... it is actually not test data :-) The idea of a test data set is that you have the true labels so that you can actually make the comparison with the predictions. If you do not know the truth, there is nothing to compare to.
In this case forget about your "test" data for now and just do a split on the training data with the operator Split Data to actually create your test data set (including the label column!). Alternatively you can use one of the validation operators like cross-validation etc.
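The hold-out idea can be sketched in a few lines of plain Python. This is not what Split Data does internally, just an illustration under stated assumptions: the helper names (`split_data`, `fit_line`, `rmse`), the 70/30 ratio, the seed, and the synthetic one-attribute data are all hypothetical.

```python
import math
import random


def split_data(rows, train_ratio=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into a training and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(rows, slope, intercept):
    """Compare the held-out labels with the model's predictions."""
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in rows]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic labelled data (x, label); every row keeps its label.
data = [(float(x), 3.0 * x + 5.0) for x in range(20)]
train, test = split_data(data)
slope, intercept = fit_line(train)
print(len(train), len(test))
print(rmse(test, slope, intercept))
```

The model never sees the rows in `test` while fitting, but those rows still carry their labels, so the comparison at the end is possible.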
In case this is all not clear at this point, I really recommend doing the tutorials in RapidMiner, which you find in the "Need Help?" menu in the top right corner of the screen under "Tutorials". I especially recommend the tutorials in the section "Modeling, Scoring, and Validation".
Hope this helps,
Ingo
For building the model you need a dataset with a label (= an attribute marked with the role label) and additional regular attributes.
Apply Model takes a dataset with or without a label (the label is ignored), but it must contain all the necessary regular attributes.
It then adds a prediction column, with the role prediction.
The Performance operators need both the label and the prediction, because they compare the two to determine the machine learning performance. For just making predictions you don't need the label in the new dataset, and you still get the prediction from Apply Model.
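As a rough illustration of these roles in plain Python (this is not RapidMiner code; `apply_model`, `performance`, the field names and the model coefficients are all hypothetical), scoring works with or without a label, while evaluation refuses to run when the label is missing:

```python
def apply_model(rows, slope, intercept, feature="x"):
    """Return copies of the rows with a 'prediction' field added; any label is ignored."""
    return [dict(row, prediction=slope * row[feature] + intercept) for row in rows]


def performance(rows, label="label"):
    """Mean absolute error; needs both the label and the prediction in every row."""
    if not rows or any(label not in row for row in rows):
        raise ValueError("input does not have a label attribute")
    return sum(abs(row[label] - row["prediction"]) for row in rows) / len(rows)


# Hypothetical model (slope 2, intercept 1) and hypothetical data.
scored_new = apply_model([{"x": 1.0}, {"x": 2.0}], slope=2.0, intercept=1.0)
scored_test = apply_model(
    [{"x": 1.0, "label": 3.5}, {"x": 2.0, "label": 5.0}], slope=2.0, intercept=1.0
)

print(scored_new)                # predictions only, fine for scoring new data
print(performance(scored_test))  # works because every row has a label
try:
    performance(scored_new)      # no label, so nothing to compare against
except ValueError as err:
    print("cannot evaluate:", err)
```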
Regards,
Balázs