How to forecast and improve model simultaneously
Hello!
I’m new to Data Science and RM.
I am asking for help with the following task. I am building a model that would forecast energy consumption for every day. I have a lot of training data, and I have already prepared the input parameters for one month of test data. Because the test data are from the past, I also have the exact energy consumption figures for the whole month. So I would like to validate my model against this test data.
Is there any function in RapidMiner that would predict energy consumption for the first day of the month, then take the exact consumption figure from an additional file, use it as training data, and after that predict energy consumption for the second day of the month? Then again, take the exact consumption for the second day, use it as training data, and predict consumption for day three of the month, and so on for the whole month.
What I actually need is an algorithm that would predict, then learn from some extra information (not previously known), train again, and repeat this whole cycle.
I would appreciate some good advice, thank you in advance!
Best Answer
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hello @gp3354,
Welcome to the RapidMiner Community!
I am willing to help you, but the scenario you describe can have a lot of variables. Hence, I sat down and ran the experiment myself. This is what I could come up with.
When I sit down to work with RapidMiner on a forecasting model, I first write down the question. "What will my energy consumption forecast be for today?" is a great start. Then I look at the data I have: you prepared it already, and that's great too. Now, where is your data stored? There are three possibilities (well, there are more, but let's focus on the simple ones):
- Spreadsheet files.
- RapidMiner IOObjects.
- An SQL database.
If you have your data in spreadsheet files, it will be more difficult to keep them updated, since you can always hit the "Play" button twice by accident. I recommend that you store your data in either a RapidMiner object or an SQL database.
Your flow would be something like:
- Retrieve past training data using the Retrieve operator. (A month)
- Retrieve recently labeled data, also using the Retrieve operator (Yesterday)
- Prepare your labeled data to have the same structure as the past training data. (Select Attributes, Set Role, Generate Attributes, Rename and so on... there are many more operators for data preparation but if you kept your data simple, these are the ones I would take a look at)
- Combine both example sets to form the new training data using the Append operator (Append stacks the examples of both sets; Join would instead merge attributes by a key).
- Remove the recently labeled data, so that it doesn't get duplicated (there is a Remove Example Set operator).
- Use your new training data (the combined example set) to train your algorithm (I don't know which algorithm you are using).
- Retrieve the unlabeled data (Today). If you have more data, you might want to filter examples at this point. It doesn't matter if you have this one on files, since you are just reading that data. At this point, I think you know it's either Retrieve, Read Excel, Read CSV or Read Database.
- Apply the model to your unlabeled data. (Apply Model, that was easy!)
- Store the results as the new labeled data with the Store operator.
- Remove the unlabeled data (or mark it in some way so that you can filter it out and RapidMiner doesn't consume it again). I can't help you much with this, as I don't know where you store your data in the first place.
- You are ready for the day. The next day, the process will be the same: retrieve past training data...
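Outside RapidMiner, the daily cycle above can be sketched in a few lines of plain Python. This is only a rough illustration, not the RapidMiner process itself: the function names (train, predict, walk_forward), the toy "average per weekday" learner and all the numbers are made up for the example.

```python
from statistics import mean

def train(history):
    """Fit a toy model: average consumption per weekday (0=Monday .. 6=Sunday)."""
    by_weekday = {}
    for weekday, consumption in history:
        by_weekday.setdefault(weekday, []).append(consumption)
    return {
        "per_weekday": {d: mean(v) for d, v in by_weekday.items()},
        "overall": mean(c for _, c in history),
    }

def predict(model, weekday):
    """Predict from the weekday average, falling back to the overall mean."""
    return model["per_weekday"].get(weekday, model["overall"])

def walk_forward(history, test_days):
    """Predict each test day, then add its actual value and retrain."""
    history = list(history)  # work on a copy, leave the caller's data alone
    results = []
    for weekday, actual in test_days:
        model = train(history)              # train on everything known so far
        forecast = predict(model, weekday)  # forecast the next day
        results.append((weekday, forecast, actual))
        history.append((weekday, actual))   # the real figure becomes training data
    return results

# Made-up numbers, purely for illustration.
past = [(0, 101.0), (1, 97.0), (2, 98.0), (3, 94.0),
        (4, 104.0), (5, 119.0), (6, 93.0)]
month = [(0, 105.0), (1, 99.0), (0, 103.0)]

results = walk_forward(past, month)
mae = mean(abs(forecast - actual) for _, forecast, actual in results)
```

Each pass through the loop is one "day": train, forecast, then store the real figure. The mean absolute error at the end gives one number to judge the whole month by.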
I don't know of any function in RapidMiner that would do this recursively for you, except for a creative use of the Split Validation operator, maybe. But since you are a learner, I would refrain from going that route until you are confident.
Now, your second question: you want to validate and optimize your model before running it. That's wise of you, congratulations! That can be done with the Cross Validation operator (since you have data from only a month, you want to get the best from it).
Remember the step where I told you to use your new training data to train your algorithm? You can either use the Multiply operator to feed a copy of the data into a Cross Validation, or train your model inside the Cross Validation. I sense that the first one is better for your goals, but nothing beats experimentation.
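To give an idea of what a cross validation does with your data, here is a minimal sketch in plain Python. It is not how the Cross Validation operator is implemented; the helper names (k_fold_indices, cross_validate), the toy learner that always predicts the training mean, and the numbers are assumptions for illustration only.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(values, k):
    """Return the mean absolute error of a mean-value predictor for each fold."""
    errors = []
    for fold in k_fold_indices(len(values), k):
        held_out = set(fold)
        training = [v for i, v in enumerate(values) if i not in held_out]
        testing = [values[i] for i in fold]
        prediction = mean(training)  # toy learner: always predict the training mean
        errors.append(mean(abs(prediction - t) for t in testing))
    return errors

# Made-up daily consumption figures.
consumption = [101.0, 97.0, 98.0, 94.0, 104.0, 119.0, 93.0, 100.0]
fold_errors = cross_validate(consumption, k=4)
```

Every example is held out exactly once, so all of your limited data is used both for training and for testing. The folds here are contiguous blocks and nothing is shuffled.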
Now, I don't have RapidMiner Studio on this computer, so I can't build an example for you but will happily check your XML if you are in doubt.
Hope it helps,
Answers
Hello rfuentealba,
thank you for the elaborate answer, you're amazing.
I was hoping there might be a built-in function that would solve that problem recursively, but your answer was helpful anyway.
Hi @gp3354!
Glad it helped.
A little while after I replied, I thought about something else that you should take into consideration. As I don't know what your data looks like, I'll make something up to explain my point.
Let's say this is your data:
Monday, 101 kWh
Tuesday, 97 kWh
Wednesday, 98 kWh
Thursday, 94 kWh
Friday, 104 kWh
Saturday, 119 kWh
Sunday, 93 kWh
Let's say you apply a decision tree (I don't care about the algorithm, so I chose this one to make it easy), and that since it's Monday, the decision tree is confident that your consumption will be 101 kWh...
If you store this prediction as your new data, that's OK, but what if on that Monday your brother showed up at home with some beers to watch a soccer game, your neighbour asked if she could use your washing machine, and you used the coffee machine more than expected because you couldn't sleep? The real consumption would be higher than the 101 kWh you predicted, yet you would still be reinforcing your algorithm with its own prediction instead of the new data, which may be different. Decide whether you want to feed back the prediction or the actual outcome, and adjust the process accordingly.
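A tiny made-up example in plain Python shows the difference. The toy "model" here is just the mean of past Mondays, and the function name next_forecast and every number are hypothetical:

```python
from statistics import mean

def next_forecast(history, value_to_store):
    """Append one value to the history and forecast the next day as the mean."""
    return mean(history + [value_to_store])

mondays = [101.0, 99.0, 103.0]  # made-up consumption on past Mondays
prediction = mean(mondays)      # what the model forecast for this Monday
actual = 113.0                  # what really happened (hypothetical busy day)

with_prediction = next_forecast(mondays, prediction)  # model only hears itself
with_actual = next_forecast(mondays, actual)          # model learns from reality
```

Storing the prediction leaves the next forecast where it was, because the model only sees its own output again. Storing the actual figure moves the forecast towards what really happened.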
Never forget this rule (I forget it more often than not): Machine Learning isn't about forecasting the future but about using data to drive your decision making, by creating a mathematical idea of what will happen if the behaviour you are studying continues. I guess you already know how to use the operators I mentioned; these are enough to solve this minor inconvenience.
All the best,
Rodrigo.
Thank you so much, Mr. rfuentealba.
It would be really helpful to the whole community if you shared the XML version of that process. I'll be grateful for your support.
Best Regards.
Have you solved your problem? I haven't been in front of a computer for the past few days; if you want, I can send it tomorrow.
Best regards,
Hello everyone,
No, Mr. rfuentealba, I couldn't create it.
I would be grateful if you could send it to me, as you said.
Thank you so much.
Best regards.
Hello @puserc
Please find attached. There are three important processes:
02 Predict contains just the executable prediction and works as follows:
02-1 Generate Prediction helps update the historical information with recently scored information (a very rudimentary thing).
02-2 Generate Unlabeled Data is just filters and negated queries. Every time you execute your algorithm, your predictions for the future "improve".
This process was way more complex than what I described. I am pretty sure it can be improved an awful lot, but at least you will have something to work with.
All the best,
Rodrigo.
Thank you so much Mr rfuentealba for your tremendous help.
Best Regards.