Algorithms are "cheating" and copying the right label from other instances

sebasvog · November 2020

Hi everyone,

I have a problem with my model. It should predict a monthly product volume from some given attributes.
My training data covers about 60 past months. Each instance in the dataset represents one day. Two of the attributes are "month" and "year". The label is the product volume at the end of the month, so every instance of a given month (~30 days/month, hence ~30 instances) has the same label. When I train the algorithm (Deep Learning, validated with Cross Validation) and look at the performance measure (relative_error), it seems that the algorithm looks at the attributes "month" and "year" and adopts the label value of another row with the same month and year as its prediction for this instance.
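The suspected effect can be reproduced on synthetic data. The following is only a minimal sketch in plain Python, not the RapidMiner process itself: the data, the 80/20 split and the `memorizer_error` helper are all made up for illustration. The "model" does nothing but memorize the label per (year, month), which is the worst case of what a flexible learner could do.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical data: 60 months, 30 daily rows each, one label per month.
months = [(2015 + i // 12, i % 12 + 1) for i in range(60)]
volume = {m: random.uniform(800, 1200) for m in months}
rows = [(year, month, day, volume[(year, month)])
        for (year, month) in months for day in range(1, 31)]

def memorizer_error(train, test):
    """Mean relative error of a 'model' that only memorizes (year, month) -> label."""
    lookup = {(y, m): label for y, m, _, label in train}
    fallback = mean(label for *_, label in train)  # prediction for unseen months
    errors = [abs(lookup.get((y, m), fallback) - label) / label
              for y, m, _, label in test]
    return mean(errors)

# Row-wise random split: days of the same month land in both train and test.
shuffled = rows[:]
random.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
row_split_error = memorizer_error(shuffled[:cut], shuffled[cut:])

# Month-wise split: all days of a month stay on the same side.
held_out = set(random.sample(months, 12))
train = [r for r in rows if (r[0], r[1]) not in held_out]
test = [r for r in rows if (r[0], r[1]) in held_out]
month_split_error = memorizer_error(train, test)

print(f"row-wise split:   {row_split_error:.4f}")
print(f"month-wise split: {month_split_error:.4f}")
```

With the row-wise split the memorizer looks perfect, because every test row has a sibling from the same month in the training set. With the month-wise split its error becomes visible, which is the more honest estimate.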

I hope you can follow my description. If there is something you don't understand, feel free to ask.
I would be very thankful if someone could tell me whether my guess is right and how I can avoid this mistake.

Now I am trying to avoid this by just having the month as an attribute, not month+year.

Thanks for your replies,
Sebastian

MartinLiebig · November 2020

Hi,

I would recommend using a Sliding Window Validation rather than a Cross Validation. This gives you a fair estimate of the performance.
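The idea of such a validation can be sketched independently of any tool. This is a hedged illustration in plain Python, not the RapidMiner operator: the function name `sliding_window_splits` and the window sizes (36 training months, 6 test months, step 6) are assumptions chosen for the example. Splitting happens on month indices, so all daily rows of one month end up on the same side, and the test months always lie after the training months.

```python
def sliding_window_splits(n_periods, train_width, test_width=1, step=1):
    """Yield (train, test) lists of period indices; test always follows train in time."""
    start = 0
    while start + train_width + test_width <= n_periods:
        train = list(range(start, start + train_width))
        test = list(range(start + train_width, start + train_width + test_width))
        yield train, test
        start += step

# 60 months of history: train on 36 months, test on the following 6, slide by 6.
splits = list(sliding_window_splits(60, train_width=36, test_width=6, step=6))
for train, test in splits:
    print(f"train months {train[0]}-{train[-1]}, test months {test[0]}-{test[-1]}")
```

Each daily row would be assigned to a fold through its month index, for example `(year - first_year) * 12 + month - 1`. The learner trained inside each window can be any regression model.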

Best,

Martin

sebasvog · November 2020

Hi Martin,

Thank you very much for your answer. I guess this validation method could help me a lot in estimating the performance of my current model!

However, I think I have to create a new process with a modified dataset (without year and month as attributes, maybe only month) to get a valid solution for my problem.

Regards,
Sebastian

MartinLiebig · November 2020

Hi,

Either that, or change the preprocessing so that you get the month or quarter of the year. That may help.
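As a rough illustration of that preprocessing step, assuming each row carries a calendar date (the helper name `calendar_features` is made up for this sketch):

```python
from datetime import date

def calendar_features(d):
    """Return seasonal attributes that repeat every year instead of a unique year+month key."""
    return {"month": d.month, "quarter": (d.month - 1) // 3 + 1}

print(calendar_features(date(2020, 11, 17)))  # {'month': 11, 'quarter': 4}
```

Because these values recur every year, they describe seasonality without pinpointing one specific month of the history.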

BR,

Martin

sebasvog · November 2020

Hi,

I tried to apply "Sliding Window Validation" to my model, but it seems that this type of validation is only applicable to time series data.
I know that my data is "some kind of" time series data, but I am trying to solve the problem by using a Regression with Neural Networks (Deep Learning).
So I cannot use Sliding Window Validation, right?

I tried to apply time series models (ARIMA) to my data (period=day, period=month), but the results were very bad (I guess I do not have enough historical data, just 60 months).

Regards,
Sebastian

Altair RapidMiner Community
