Can more data be harmful for a prediction?
Hello everyone, as part of a university project I decided to experiment a bit with the data set I was given and tried feeding different aggregation levels of the data into Auto Model to compare the results.
Even at that point I was a bit confused that my aggregated data often delivered better outputs than the disaggregated data.
Since the data accumulates over 27 weeks and more regular attributes are added every week, I also tried to build a model for each week to see when a model would be theoretically operational for a first deployment.
I expected accuracy and gain to increase slowly over the weeks, but instead I got an extreme peak in week 7 with very high accuracy and very good gain, which then declines drastically and is only surpassed by the best model in week 19. From week 19 on the model's performance decreases again but stays good until the predictions stop changing between weeks 23 and 27.
My questions now are: is such behavior normal, and why does it happen? Looking at the problem, I cannot really think of a reason why more information would be harmful to a prediction, but that clearly seems to be the case. Furthermore, if the prediction were actually used, should I stop at the model from week 19 or still use the model from week 27?
Sadly, I am not allowed to share the data.
Thanks in advance for any help.
Best Answer
Telcontar120 (RapidMiner Certified Analyst, RapidMiner Certified Expert):
Very interesting question! This is quite a complex topic, and there is actually a lot going on here.
Many people start with the assumption that the more information available for building a predictive model, the better. While this is probably true as a first-order effect, it is definitely not a universal rule. There are some important corollary principles to keep in mind when trying to answer this question.
First, it is important to define what "more information" means and how it affects the model, because it can mean different things: more information in the form of additional attributes to examine for a given set of cases, versus more information in the form of additional examples with the same attributes as your initial set. It also matters whether the additional information is contemporaneous or time-differentiated (as in a time series).
Thus, it is not always the case that more information leads to better outcomes. For example, too much information in the form of too many attributes can definitely lead to less robust models, because it encourages overfitting and in some cases can even make it difficult for the algorithm to identify the true signal amidst all the extra noise. This is why feature selection is a standard step in data science projects: it reduces the factors considered to those with stronger, more consistent relationships with the target.
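Outside RapidMiner, this effect is easy to reproduce with a quick scikit-learn sketch (not from the original post; the dataset is synthetic and the model choice is arbitrary): a classifier trained on a few informative attributes buried among hundreds of noise attributes, compared with the same classifier after univariate feature selection.

```python
# Illustrative sketch: many pure-noise attributes can degrade a model,
# and feature selection can recover much of the lost accuracy.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 10 informative attributes plus 490 pure-noise attributes
X, y = make_classification(n_samples=300, n_features=500,
                           n_informative=10, n_redundant=0,
                           random_state=0)

model = LogisticRegression(max_iter=1000)
acc_all = cross_val_score(model, X, y, cv=5).mean()

# Keep only the 10 attributes most associated with the target
selected = make_pipeline(SelectKBest(f_classif, k=10),
                         LogisticRegression(max_iter=1000))
acc_selected = cross_val_score(selected, X, y, cv=5).mean()

print(f"all 500 attributes:     {acc_all:.3f}")
print(f"10 selected attributes: {acc_selected:.3f}")
```

With most attributes carrying no signal, the selected-feature pipeline typically scores noticeably higher in cross-validation than the model fed everything.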
It is also not always the case that more information in the form of more examples is better. Additional cases that add little information about the underlying patterns tend to bog down the algorithm with longer computations, and they can contribute to overfitting as well: the more cases there are, the more likely it is that some random patterns will appear to be real. This is why sampling for model development is also considered a standard practice in data science projects.
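The diminishing returns from extra examples can be seen with a learning curve; this is a synthetic sketch (data and tree depth are arbitrary assumptions), where validation accuracy typically plateaus well before all examples are used:

```python
# Sketch: validation accuracy as a function of training-set size.
# Past a certain sample size the curve tends to flatten out, while
# training cost keeps growing.
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=5, random_state=1)

sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=1), X, y,
    train_sizes=[0.1, 0.25, 0.5, 1.0], cv=5)

for n, v in zip(sizes, valid_scores.mean(axis=1)):
    print(f"{n:5d} examples -> validation accuracy {v:.3f}")
```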
If the additional information comes from time periods different from those of the smaller (original) set, then the further complicating factor is that the relationships between the underlying data and the predictive attributes may not be stable over time. There is quite a lot of discussion in data science about concept drift and related ideas, all of which deal with exactly this problem. With a time series, it is always inherently part of the question whether the relationships are stable and repeatable, or whether they are shifting over time.
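One simple way to probe for drift is to train on an early window and score the frozen model on each later window. A hypothetical sketch (fully synthetic: the true decision boundary is made to rotate a little each "week"):

```python
# Sketch of concept drift: a model trained on week 1 is scored on later
# weeks whose attribute-target relationship gradually shifts. A decaying
# accuracy curve suggests the relationship is not stable over time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def make_week(n, angle):
    """One week of data; the true decision boundary direction is `angle`."""
    X = rng.normal(size=(n, 2))
    w = np.array([np.cos(angle), np.sin(angle)])
    y = (X @ w > 0).astype(int)
    return X, y

# Train on week 1, where the boundary points along angle 0
X_train, y_train = make_week(500, 0.0)
model = LogisticRegression().fit(X_train, y_train)

# Score the same frozen model on later weeks as the boundary drifts
accs = []
for week in range(1, 6):
    angle = 0.3 * (week - 1)          # drift grows each week
    X_w, y_w = make_week(500, angle)
    accs.append(model.score(X_w, y_w))
    print(f"week {week}: accuracy {accs[-1]:.3f}")
```

In a real project the windows would be the actual weeks of data, and a steady decline like this would argue for retraining on recent data rather than simply accumulating everything.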
So these are just some of the hypothetical issues that you may be facing as you consider your own case. Without looking at the data, it is hard to say which of these may be at play in your particular dataset, but it is possible that you are seeing the impact of one or more of these effects.
It is also possible that you have other issues specific to time series data, which typically require a good sense of the underlying periodicity of relationships in your data. It may be that your model is misspecified, and that lags and other relationships which work well in the shorter window do not hold in the longer window. Time series often need a lot of data exploration to tease out the different frequencies of the underlying patterns and to decide how the data should be transformed to capture them for machine learning.
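A basic exploration step of that kind is checking the autocorrelation of the target series at candidate lags before building lag features. A sketch on a synthetic series with an assumed (hypothetical) weekly period of 7:

```python
# Sketch: a strong autocorrelation peak at lag 7 would suggest weekly
# periodicity, making lag-7 features good candidates for the model.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 7) + 0.3 * rng.normal(size=t.size)

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

for lag in (1, 3, 7, 14):
    print(f"lag {lag:2d}: autocorrelation {autocorr(series, lag):+.2f}")
```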