Is my model good enough?
Hello.
I'm trying to build a predictive regression model, but I'm having a hard time telling whether it is good enough.
I'm using X-Validation, and I've read somewhere that you can judge the fit from the difference between the training error and the validation error. But how do I get X-Validation to report the training error?
Currently my model has an RMSE of about 1,000, and my label ranges from 0 to about 32,000. From this alone I can't really tell whether it is any good. Is there another way to measure whether it's a good model?
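One way to put an RMSE in context is to compare it with the label range and with a trivial baseline that always predicts the mean label. The following is only a sketch in plain Python, not something RapidMiner produces; the function names are made up for illustration, and it assumes the labels are not all identical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rmse_in_context(actual, predicted):
    """Put the model's RMSE next to a mean-only baseline and the label range.

    Hypothetical helper; assumes the labels are not all the same value.
    """
    model_rmse = rmse(actual, predicted)
    mean_label = sum(actual) / len(actual)
    baseline_rmse = rmse(actual, [mean_label] * len(actual))
    label_range = max(actual) - min(actual)
    return {
        "rmse": model_rmse,
        "baseline_rmse": baseline_rmse,               # error of always predicting the mean
        "nrmse": model_rmse / label_range,            # RMSE as a fraction of the label range
        "r2": 1 - (model_rmse / baseline_rmse) ** 2,  # share of variance explained
    }

# The figures quoted above: RMSE of about 1,000 on a label range of about 32,000
print(1000 / 32000)  # 0.03125, i.e. roughly 3% of the range

# Tiny made-up example
print(rmse_in_context([0, 10, 20, 30], [1, 9, 21, 29]))
```

An RMSE that is small relative to the range can still be poor if most labels sit in a narrow band, which is why the comparison with the mean-only baseline is usually the more telling number.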
Oh, and one more thing: I can make my model better if I use a k-NN global anomaly score and remove some of the outliers that come from noise, but I'm afraid of removing too much information. How can I decide how many outliers to remove?
Thanks in advance!
Answers
Hi mathias,
It really depends on the use case whether this is good or not, so it is hard to judge. But I would of course look at the testing error, not the training error.
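If you still want both numbers side by side, the idea is to score each fold's model once on the rows it was trained on and once on the held-out rows. Here is a rough sketch in plain Python rather than a RapidMiner process; the one-feature least-squares model, the synthetic data and all names are assumptions chosen only to keep the example self-contained.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cross_validate(xs, ys, k=5, seed=0):
    """Return one (training RMSE, validation RMSE) pair per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all rows
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        train_rmse = rmse([ys[i] for i in train], [a + b * xs[i] for i in train])
        valid_rmse = rmse([ys[i] for i in fold], [a + b * xs[i] for i in fold])
        scores.append((train_rmse, valid_rmse))
    return scores

# Synthetic data (hypothetical): a linear trend plus Gaussian noise
rng = random.Random(42)
xs = [rng.uniform(0, 100) for _ in range(200)]
ys = [50 + 3 * x + rng.gauss(0, 10) for x in xs]

for fold_no, (tr, va) in enumerate(cross_validate(xs, ys), start=1):
    print(f"fold {fold_no}: train RMSE {tr:.2f}, validation RMSE {va:.2f}")
```

A validation RMSE far above the training RMSE points towards overfitting; two similar but high values point towards a model that is too simple or features that do not carry enough signal.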
What might help you get a better feeling is a plot of the scored set returned by the new Cross Validation in 7.3. Just plot label against prediction(label) in a scatter plot. From it you can read off statements like "if the truth is 5, my prediction is between 3 and 7".
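The same reading can be done numerically on an exported scored set. The sketch below is plain Python, not a RapidMiner operator; the helper name, the equal-width binning and the 90% coverage are assumptions for illustration. It groups the rows by label value and reports the band that most predictions fall into.

```python
def prediction_bands(labels, predictions, n_bins=4, coverage=0.9):
    """For each label bin, return (bin_low, bin_high, pred_low, pred_high, count).

    pred_low and pred_high bracket roughly `coverage` of the predictions
    made for rows whose true label falls inside the bin.
    """
    lo, hi = min(labels), max(labels)
    if hi == lo:
        raise ValueError("labels must not all be identical")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, predictions):
        i = min(int((y - lo) / width), n_bins - 1)  # the maximum label goes into the last bin
        bins[i].append(p)
    tail = (1 - coverage) / 2
    bands = []
    for i, preds in enumerate(bins):
        if not preds:
            continue  # skip label ranges with no rows
        preds.sort()
        low_i = int(tail * (len(preds) - 1))
        high_i = int(round((1 - tail) * (len(preds) - 1)))
        bands.append((lo + i * width, lo + (i + 1) * width,
                      preds[low_i], preds[high_i], len(preds)))
    return bands

# Made-up scored set: true label and prediction(label) for each row
labels = [0, 1, 2, 3, 4, 5, 6, 7]
predictions = [0.5, 0.8, 2.4, 2.9, 4.6, 4.4, 6.3, 7.5]
for band in prediction_bands(labels, predictions, n_bins=2, coverage=1.0):
    print(band)
```

Bands that widen towards one end of the label range show where the model is least reliable, which a single overall RMSE hides.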
Best,
Martin
Dortmund, Germany
Okay, I've tried to do that.
I figured out that if I use the k-NN Global Anomaly and filter out some outliers, I can get the RMSE lower. Is that OK to do?
Here is my process, if you'd like to look it over:
Mathias,
I do think it makes some sense to filter out the outliers; it often makes models better. The downside is that your model then does not cover examples with a high outlier score. I would argue that you want to do it anyway, because you cannot find good statistical reasoning for these outliers.
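To get a feeling for how many rows to drop, one option is to score every row, remove a chosen fraction of the highest-scoring ones, and compare the validation error for a few different fractions. The sketch below is plain Python and only an approximation of the idea: it assumes the outlier score is the mean distance to the k nearest neighbours, and the function names and the tiny data set are made up.

```python
import math

def knn_scores(points, k=3):
    """Mean Euclidean distance from each point to its k nearest neighbours.

    Assumes len(points) > k. Larger scores mean more isolated points.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def drop_top_fraction(points, labels, fraction=0.05, k=3):
    """Remove the `fraction` of rows with the highest outlier score.

    Returns (kept_points, kept_labels, dropped_indices).
    """
    scores = knn_scores(points, k)
    n_drop = int(len(points) * fraction)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    dropped = set(order[:n_drop])
    kept = [i for i in range(len(points)) if i not in dropped]
    return [points[i] for i in kept], [labels[i] for i in kept], sorted(dropped)

# Made-up data: four clustered rows and one isolated row
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = [1, 2, 3, 4, 99]
kept_points, kept_labels, dropped = drop_top_fraction(points, labels, fraction=0.2, k=2)
print(dropped)      # indices of the removed rows
print(kept_labels)  # labels of the rows that remain
```

If the filter is applied only to the training rows of each fold and the held-out rows are left untouched, the validation RMSE still reflects the kind of data the model will meet later. A fraction beyond which that validation RMSE stops improving is a reasonable place to stop removing rows.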
~Martin
Dortmund, Germany
Hello again.
I have tried to remove the outliers, but it turns out that it doesn't really have an effect on my RMSE.
I'm having a hard time telling how I could improve my model. Right now I get an RMSE of around 850, and I would like to cut it at least in half. Could someone tell me what I'm doing wrong?
Here is my process:
And here is my data:
https://www.dropbox.com/s/w9a5545nn1vs0b8/traindata.csv?dl=0
I'm trying to predict the shaft power for a ship.
Thank you in advance!