Predicting Project length of time
I've only dabbled in RapidMiner a bit and don't have a data mining background, so I'm looking for some direction/help on this. I'd like to predict how long a project will take to complete once it's handed off from sales to project management. Historically, this time can range from 14 days to well over 300 days, depending on a number of factors.
In my current model, my label is Days from Contract Signing to Project Complete and is numeric. I have about 20 other attributes in my dataset that I've included for training, though I'm not sure I'm using them correctly. I'm using a Deep Learning operator, but my RMSE is too high.
My goal is to predict a five-day window of when the project should be completed based on the 20 attributes mentioned above. Is there a better way to accomplish this?
Answers
It sounds to me like you are approaching it basically correctly, although it would help if you could post your process so others could review it.
The strength of the model is a function of the data that you have available at the time of the prediction. It could be that you simply don't have data that has a strong enough relationship with your outcome to provide a very precise prediction when you are trying to solve for a specific number of days.
In such cases, you might consider reformulating your label as a nominal class: pick a threshold for completion and create a yes/no indicator for whether the project would be completed within that timeframe, say 30 days from assignment. Labels like this are generally easier to predict than a continuous outcome and may still give you a usable model.
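As a minimal sketch of that reformulation (written in plain Python rather than as a RapidMiner process; the function name `to_completion_class`, the 30-day default, and the sample durations are illustrative assumptions, not taken from your data), the numeric label could be mapped to a yes/no class like this:

```python
def to_completion_class(days_to_complete, threshold_days=30):
    """Map a numeric duration (in days) to a nominal yes/no label."""
    # "yes" means the project finished within the chosen threshold
    return "yes" if days_to_complete <= threshold_days else "no"


# Hypothetical durations in days, for illustration only
durations = [14, 30, 45, 310]
labels = [to_completion_class(d) for d in durations]
print(labels)
```

Inside RapidMiner, an operator such as Generate Attributes could probably do the equivalent step before training; treat that as a suggestion to verify against your version rather than a tested recipe.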
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi Thomas, no, because I'm trying to predict how long the project process will take at the time of sale, which means I won't have any data on the project milestones (engineering, manager approval, permitting, etc.). I'll only have the data known at time of sale (polynominal and binominal attributes, integers, and a date), such as the location of the project, the project management team, the size of the project, and the date the project was won.
I apologize if I'm not explaining this clearly.
Thanks Brian,
Yes, I did that first. I segmented the projects into two groups based on "did the project get completed in X days?" and had great results. But this only partly satisfies our needs.
Here's the context: we currently treat all project timelines the same, meaning they all have the same milestones and priority. I'd like to change that. While it's valuable to know whether a project will be completed within a certain timeframe, what I really want is to predict, at the moment a project is sold, a small window of time in which its completion date will fall. That prediction would use only the criteria known at time of sale: the city, price, PM team, how the project is being paid for, and a handful of project-specific attributes. With it, we could begin prioritizing jobs into A, B, and C categories and setting unique milestones for each job.
You may be right, though: my model probably works, and I just might not have enough data.
Thanks, that additional information is helpful. It sounds like the problem is probably in the underlying data relationships.
You can try some additional tricks, like building multiple models based on different performance timeframes (complete within 30 days, 40 days, 50 days, 60 days, etc.) and then getting multiple scores for each project, and coming up with a way to use those scores to estimate your most likely completion window. That may be better than a single threshold, but not quite as good as a continuous prediction.
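One possible way to combine those scores, sketched in plain Python under stated assumptions: each threshold model is assumed to output a probability that the project completes within its cutoff, and the function name `most_likely_window`, the cutoffs, and the example scores are all hypothetical. The idea is to treat the scores as a cumulative curve and take differences between neighboring cutoffs to get a rough probability for each window.

```python
def most_likely_window(thresholds, scores):
    """Estimate window probabilities from 'complete within t days' scores.

    thresholds: ascending day cutoffs, e.g. [30, 40, 50, 60]
    scores: one estimated probability per cutoff, from separate models
    """
    # Independent models can disagree, so clip each score to [0, 1] and
    # force the curve to be non-decreasing: P(done by 40) >= P(done by 30).
    cumulative = []
    running = 0.0
    for s in scores:
        running = max(running, min(1.0, max(0.0, s)))
        cumulative.append(running)

    # Difference neighboring cutoffs to get the mass in each window,
    # where a window (a, b) means "more than a days, up to b days".
    windows = []
    prev_day, prev_p = 0, 0.0
    for day, p in zip(thresholds, cumulative):
        windows.append(((prev_day, day), p - prev_p))
        prev_day, prev_p = day, p

    # Whatever mass is left belongs to "later than the last cutoff".
    windows.append(((prev_day, None), 1.0 - prev_p))

    best = max(windows, key=lambda w: w[1])
    return best, windows


# Hypothetical scores for one project, for illustration only
best, windows = most_likely_window([30, 40, 50, 60], [0.10, 0.25, 0.70, 0.85])
print(best)
```

The windows here are only as narrow as the spacing between cutoffs, so getting close to a five-day window would mean training many more threshold models, each with fewer distinguishing examples.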