Not normally distributed data

jeroenheijlen · May 2020

Hi,
I'm trying to find a model that predicts the execution time of a process step. I have data from over 200 different recurring process steps from the past 2 years (160,000 rows in an Excel sheet). When I plot the execution-time data per event, the data is not normally distributed but looks more like a Poisson distribution. Just loading the data into RapidMiner Studio and applying the models does not return a good fit. What can I do? (For data pre-processing in Python or R I would need a step-by-step guide, because I'm pretty new to all of this.)
Some help would really be appreciated!
Best regards
Jeroen

lionelderkrikor · May 2020

Hi @jeroenheijlen,

Have you tried submitting your data to Auto Model (the AutoML tool of RapidMiner)?

Regards,

Lionel

jeroenheijlen · May 2020

Hi @lionelderkrikor, thanks for your reply.
Yes, I tried Auto Model, but even after I seriously reduced the variation in the input data, no model does a good job on my data:

Image: https://us.v-cdn.net/6030995/uploads/editor/n3/cx6t4k887wx5.png

Image: https://us.v-cdn.net/6030995/uploads/editor/60/zmxdgwaig5km.png

lionelderkrikor · May 2020

Hi @jeroenheijlen,

Maybe there is no relationship between your independent features and your label (your target).
In that case, it is impossible to find a good model and machine learning is of no use...
In the meantime, you can try to:
- enable feature selection / feature generation in the options of Auto Model
- tune the hyper-parameters of your best models to try to increase the accuracy / decrease the error rate.
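One quick way to test whether a feature carries any signal at all is to compare two naive baselines: always predicting the overall median execution time, versus predicting the median of each process step. The sketch below is only an illustration under assumptions: the function names, the `(step_name, hours)` row layout and the sample values are made up, and it uses only the Python standard library. The comparison is done on the same rows the medians were computed from, so it is slightly optimistic.

```python
from collections import defaultdict
from statistics import median


def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def baseline_comparison(rows):
    """rows: list of (step_name, hours) tuples (hypothetical layout).

    Returns (global_mae, per_step_mae): the error of always predicting the
    overall median, and the error of predicting each step's own median.
    """
    times = [hours for _, hours in rows]
    global_median = median(times)

    # Group execution times by process step.
    by_step = defaultdict(list)
    for step, hours in rows:
        by_step[step].append(hours)
    step_median = {step: median(values) for step, values in by_step.items()}

    global_mae = mean_abs_error(times, [global_median] * len(times))
    per_step_mae = mean_abs_error(times, [step_median[step] for step, _ in rows])
    return global_mae, per_step_mae


# Made-up sample data: two process steps with clearly different durations.
rows = [("A", 1.0), ("A", 1.5), ("A", 2.0), ("B", 6.0), ("B", 7.0), ("B", 8.0)]
global_mae, per_step_mae = baseline_comparison(rows)
print(f"global median MAE: {global_mae:.2f} h, per-step median MAE: {per_step_mae:.2f} h")
```

If the per-step error is barely lower than the global one on the real data, the process step alone explains little of the variation, and a more complex model built on the same features is unlikely to do much better.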

Regards,

Lionel

jeroenheijlen · May 2020

Hi @lionelderkrikor,
I'm indeed afraid that the variation within each of the process steps is too large, and that therefore no model can find a correlation or a predictive fit.
Thanks for your advice.
I will try a few more things (automatic feature selection fails), such as starting with a smaller dataset (data from only a few of the process steps, with more of the outliers removed, although the data will still never be normally distributed) and also recasting the target as a binary outcome (more than 2 hours versus less than 2 hours, or similar).
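A minimal sketch of the binary-outcome idea, assuming the execution times are available in hours. The 2-hour threshold comes from the post; the names `to_binary_label` and `execution_hours`, the "long"/"short" labels and the sample values are made up for illustration. Counting the classes afterwards shows how balanced the resulting label is, which matters when judging a classifier's accuracy.

```python
from collections import Counter

THRESHOLD_HOURS = 2.0  # threshold mentioned in the post; adjust as needed


def to_binary_label(hours, threshold=THRESHOLD_HOURS):
    """Map an execution time to 'long' (above threshold) or 'short'."""
    return "long" if hours > threshold else "short"


# Made-up execution times in hours.
execution_hours = [0.4, 1.9, 2.0, 2.5, 7.3]
labels = [to_binary_label(h) for h in execution_hours]
class_counts = Counter(labels)
print(labels)
print(class_counts)
```

Note that a time of exactly 2 hours falls into the "short" class here; whether the boundary belongs to one class or the other is a choice to make up front.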

If I ever succeed, I will post the outcome ;-).
Best regards
Jeroen

lionelderkrikor · May 2020

You're welcome, @jeroenheijlen.

Good luck!

Regards,

Lionel
