Getting started help: predict sales based on several attributes for several products
Hi Community
Disclaimer:
First-timer here, data science newbie, unfamiliar with the correct technical terminology. I'm reasonably good with concepts but not strong in statistics, higher math, or programming. As I'm working towards a bachelor's degree I do of course have basic statistics and programming knowledge, but I have been out of practice for years.
Background:
As part of my business information technology studies I'm working on my bachelor thesis "improved future sales forecasting by applying machine learning" (as opposed to simple compare-to-last-year-figures based prediction) together with a company operating convenience stores.
I have access to their BI system to pull historical sales data with several attributes, for example: date, shop, article, number sold.
Data preparation:
To develop a model, I have selected two customer contexts which may trigger a visit to the store to buy very specific goods: "grill party at lake" and "students breakfast".
I then looked at a handful of shops close to lakes ("grill party") and/or universities ("students breakfast") and pulled the BI data of the affected articles (Chips, Beers, Sausages, Bagels, Coffee, etc).
I then added several hopefully relevant attributes such as HasLake (is the shop close to a lake), HasUniversity (is the shop close to a university), HasSemester (is the transaction during or in between university semesters), HasHoliday (is it a public holiday) and weather figures (temperature, amount of sunshine, amount of rain).
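For readers who want to reproduce this step outside the BI export, deriving the flags could look roughly like the following Python sketch. The shop names, semester dates and holiday list are made-up placeholders, not values from the real dataset, and the weather figures are left out.

```python
from datetime import date

# Hypothetical lookup tables: shop names, semester dates and holidays are invented.
SHOPS_NEAR_LAKE = {"ShopA", "ShopC"}
SHOPS_NEAR_UNIVERSITY = {"ShopB", "ShopC"}
SEMESTERS = [
    (date(2019, 2, 18), date(2019, 5, 31)),
    (date(2019, 9, 16), date(2019, 12, 20)),
]
PUBLIC_HOLIDAYS = {date(2019, 5, 1), date(2019, 12, 25)}


def add_context_flags(row):
    """Return a copy of a transaction row with the four binary flags added."""
    day = row["date"]
    enriched = dict(row)
    enriched["HasLake"] = int(row["shop"] in SHOPS_NEAR_LAKE)
    enriched["HasUniversity"] = int(row["shop"] in SHOPS_NEAR_UNIVERSITY)
    enriched["HasSemester"] = int(
        any(start <= day <= end for start, end in SEMESTERS)
    )
    enriched["HasHoliday"] = int(day in PUBLIC_HOLIDAYS)
    return enriched


# Two invented transactions in the shape "date, shop, article, number sold".
rows = [
    {"date": date(2019, 5, 1), "shop": "ShopA", "article": "Beer", "sold": 40},
    {"date": date(2019, 7, 10), "shop": "ShopB", "article": "Bagel", "sold": 12},
]
enriched_rows = [add_context_flags(r) for r in rows]
for r in enriched_rows:
    print(r)
```

Keeping the flags as 0/1 integers means they can be imported as numerical or binominal attributes later.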
My current (anonymized simplified) example dataset is attached as Excel.
Trying my luck:
I am now asking for help on how best to proceed.
I remodelled my exampleset several times (articles as rows, articles as columns; more attributes, fewer attributes; ...) and tried to put together a process but failed horribly every time.
I then went for Auto Model. Deep Learning and Gradient Boosted Trees yielded quite good results, but a) they produce a "black box" model, which is difficult to get away with in a bachelor thesis, and b) the automated feature selection seems to primarily target attributes which are not "generic" but highly specific to the exampleset, e.g. a single shop. This makes sense, as in the data one specific shop has very high numbers for beer. But it makes the model inapplicable to other customer contexts in other shops (which are not included in the exampleset; there are ~200 shops in total with 3000 articles each and at least a dozen contexts that apply to some shops but not others, e.g. a high-volume highway petrol station has nothing to do with either a university or a grill party at a lake).
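One way to check whether a model leans on shop-specific attributes is to validate on shops it has never seen, holding out one shop at a time. The sketch below is plain Python rather than a RapidMiner process; the data, the helper names and the toy "mean per flag" model are all assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_shop_out(rows):
    """Yield (held_out_shop, train_rows, test_rows), one shop held out per split."""
    shops = sorted({r["shop"] for r in rows})
    for shop in shops:
        train = [r for r in rows if r["shop"] != shop]
        test = [r for r in rows if r["shop"] == shop]
        yield shop, train, test


def fit_mean_by_flag(train, flag):
    """Toy model: average units sold for each value of one binary flag."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in train:
        sums[r[flag]] += r["sold"]
        counts[r[flag]] += 1
    overall = sum(sums.values()) / sum(counts.values())
    means = {value: sums[value] / counts[value] for value in sums}
    # Fall back to the overall mean for flag values unseen in training.
    return lambda r: means.get(r[flag], overall)


def mean_absolute_error(model, test):
    return sum(abs(model(r) - r["sold"]) for r in test) / len(test)


# Invented beer sales for three shops.
rows = [
    {"shop": "ShopA", "HasLake": 1, "sold": 40},
    {"shop": "ShopA", "HasLake": 1, "sold": 44},
    {"shop": "ShopB", "HasLake": 0, "sold": 10},
    {"shop": "ShopB", "HasLake": 0, "sold": 14},
    {"shop": "ShopC", "HasLake": 1, "sold": 38},
    {"shop": "ShopC", "HasLake": 1, "sold": 42},
]

errors = {}
for shop, train, test in leave_one_shop_out(rows):
    model = fit_mean_by_flag(train, "HasLake")
    errors[shop] = mean_absolute_error(model, test)
    print(shop, round(errors[shop], 2))
```

Because the shop identifier is only used for splitting and never as an input, a low error on the held-out shop suggests the generic attributes carry the signal. A high error, as for the only non-lake shop in this toy data, suggests the training shops do not cover that context.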
I tried to get inspired by the Auto Models created and reproduce the results to a degree, but they are way too complex for me to properly understand what's happening and why certain parameters are tuned the way they are.
I figured setting "Shop" to cluster and setting "quarter" or "week" to batch (I also tried it the other way round, shop as batch and time period as cluster) should improve feature selection. Apparently not, as set roles and special attributes are purged by Auto Model. Is deep learning or GBT the wrong approach? Should I do something with "forecast" given the exampleset? I'm at a loss.
Could I ask you guys and gals to support me to get off the starting line? Many many thanks in advance!
Answers
Could I loop through the exampleset shop by shop (all transactions, all articles, all dates, one shop only) and create a separate model for each shop? Then adding or removing shops and/or articles in the exampleset wouldn't matter.
Or as an analogy, loop through an article at a time (all transactions, all dates, all shops, one article only)?
Or more generally speaking, loop through one specific attribute at a time and generate a model fitting to each specific loop.
That sounds a lot like clustering, but for clustering wouldn't I need to know in advance how many distinct articles there are? That is something I can't know: in terms of the articles available, each shop is (slightly or massively) different from the next one...
If you used AM then you had the option to try many different ML algorithms. GBT and DL are more powerful but as you say they produce black box models that are hard to interpret, although the "Explain Predictions" operator is helpful in identifying patterns. Only you can decide whether the tradeoff in performance vs simpler methods like Naive Bayes or Decision Trees is worthwhile.
If the shops really are very different, then looping through those and building a separate model would be sensible. But you should compare that to a global model to see whether the performance difference is worth it.
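As a rough illustration of that comparison, the Python sketch below fits one global line and one line per shop on invented data and scores both on the same held-out rows. The single temperature predictor, the shop names and the figures are assumptions; a real comparison would use the actual attributes and a proper validation scheme.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = a + b * x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return mean_y, 0.0  # constant predictor: fall back to the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return mean_y - slope * mean_x, slope


def mae(line, points):
    """Mean absolute error of a fitted (intercept, slope) on (x, y) pairs."""
    intercept, slope = line
    return sum(abs(intercept + slope * x - y) for x, y in points) / len(points)


# Invented rows: (shop, temperature, units sold).
train = [("ShopA", 10, 20), ("ShopA", 20, 40), ("ShopA", 30, 60),
         ("ShopB", 10, 12), ("ShopB", 20, 14), ("ShopB", 30, 16)]
test = [("ShopA", 25, 50), ("ShopB", 25, 15)]

# Global model: one line fitted on all shops together.
global_line = fit_line([(t, s) for _, t, s in train])
global_mae = mae(global_line, [(t, s) for _, t, s in test])

# Per-shop models: loop over shops and fit one line each.
by_shop = defaultdict(list)
for shop, temp, sold in train:
    by_shop[shop].append((temp, sold))
shop_lines = {shop: fit_line(points) for shop, points in by_shop.items()}
per_shop_mae = sum(
    mae(shop_lines[shop], [(temp, sold)]) for shop, temp, sold in test
) / len(test)

print("global MAE:  ", round(global_mae, 2))
print("per-shop MAE:", round(per_shop_mae, 2))
```

In this toy data the two shops react very differently to temperature, so the per-shop models win clearly. With only a few rows per shop the result can easily flip, which is why the side-by-side check is worth doing.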
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
So I took a step back and looked at my problem again. I now think it's not so much a regression problem as a time series forecast (with some regression-like cherries on top). I have since studied several blogs and some papers on the topic, and came up with the idea of building an LSTM (long short-term memory neural network) to tackle the task. But I struggle with the setup, namely the layers and their parameters. My main inspiration for the idea has been Jason Brownlee.
So, I have installed the deep learning extension and went through the tutorial.
I have looked at the Airline Passengers LSTM sample process.
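For context, sequence models such as an LSTM are usually trained on sliding windows of the series rather than on raw rows. The Python sketch below shows that reshaping on invented weekly sales figures; `make_windows` is a hypothetical helper, not part of the deep learning extension, and the window length of 3 is an arbitrary choice.

```python
def make_windows(series, window, horizon=1):
    """Build (inputs, target) pairs from a sequence.

    Each sample uses `window` consecutive past values as inputs and the
    value `horizon` steps after the window as the target.
    """
    samples = []
    for i in range(len(series) - window - horizon + 1):
        inputs = series[i:i + window]
        target = series[i + window + horizon - 1]
        samples.append((inputs, target))
    return samples


# Invented weekly units sold for one article in one shop.
weekly_sales = [12, 15, 14, 20, 22, 19, 25]

samples = make_windows(weekly_sales, window=3)
for inputs, target in samples:
    print(inputs, "->", target)
```

Each printed pair is one training example: three past weeks as input and the following week as the label. Extra attributes such as weather or holiday flags would be added as further values per time step.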
I have set up a process, but this is as far as I have got at the moment. Could any of you provide some rough guidance to get me on the right track?
Below is my XML; the example set is very close to the one in the opening post. As the original should not go into the wild, I can provide the one I am actually working with by PM.