Processing high volumes
Hello fellows.
We need to process a considerable volume of data, about 1 million retail ticket lines per day. Although this is a high volume, it may not actually deserve to be considered a 'big data' scenario.
Can anyone confirm or deny this assumption? And if this should be considered big data, what would be the recommended approach using RapidMiner?
Thanks
Answers
There are different points to consider:
1. What is the actual data size? Smaller than 32 GB?
2. What do you want to do with it? Aggregate? Or learn on 1 million examples?
If the data set is smaller than your RAM, everything should be fine, as long as the actual number of examples is low enough for reasonable runtimes. Otherwise you might simply sample beforehand.
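If you go the sampling route, a one-pass reservoir sample keeps memory bounded without reading the data twice. Here is a minimal Python sketch, assuming the ticket lines have been exported to a CSV file; the file name and sample size below are made up:

```python
import csv
import random

def reservoir_sample(path, k):
    """One-pass uniform sample of k rows from a CSV too large for RAM."""
    sample = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)             # keep the header row aside
        for i, row in enumerate(reader):
            if i < k:
                sample.append(row)        # fill the reservoir first
            else:
                j = random.randint(0, i)  # replace with probability k/(i+1)
                if j < k:
                    sample[j] = row
    return header, sample

# Hypothetical file name and sample size
header, rows = reservoir_sample("ticket_lines.csv", k=100_000)
```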
Cheers,
Martin
Dortmund, Germany
1. We might take representative samples that could fit into 32 GB. The full data set far exceeds that.
2. Aggregation can be handled directly in SQL, since this is a relational database (sketched below). But for mining purposes (association detection such as market basket analysis (MBA), or other predictive methods like decision trees or linear/logistic regression), the big question is whether we would need some big data processing architecture (e.g. Hadoop-based) standing between the RDBMS and the mining software.
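For illustration, the aggregation side would look something like this; the table and column names are invented, and sqlite3 merely stands in for our actual RDBMS driver:

```python
import sqlite3

# Push the aggregation into the database before mining; ticket_lines and its
# columns (store_id, sku, sold_at, qty, amount) are hypothetical names.
conn = sqlite3.connect("retail.db")
daily = conn.execute("""
    SELECT store_id,
           sku,
           DATE(sold_at) AS sale_date,
           SUM(qty)      AS units,
           SUM(amount)   AS revenue
    FROM ticket_lines
    GROUP BY store_id, sku, DATE(sold_at)
""").fetchall()
# One row per store/SKU/day instead of ~1M raw ticket lines per day
```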
Regards
There are a few ways to handle this. Since the total data size is most likely larger than your RAM, you need a special infrastructure.
Way 1: Use a Hadoop cluster, sample your data, learn on the sampled data in-memory, and apply in-Hadoop (see the sketch below).
Way 2: Use a Hadoop cluster and learn directly in-Hadoop. Radoop currently supports quite a few algorithms (Decision Tree, Naive Bayes, Logistic Regression), and more are on the way.
Way 3: Use either a Hadoop cluster or a SQL DWH and work only on aggregates/representatives.
I think Way 3 might not be suited for you. Since this is about Radoop, I would ask you to contact our sales team (e.g. here: https://rapidminer.com/contact-sales-request-demo/ ). Then we (or one of my colleagues) could set up a WebEx or similar about it.
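To make Way 1 a bit more concrete, here is a rough sketch. It uses plain PySpark and scikit-learn as stand-ins (Radoop's operators are configured in RapidMiner Studio rather than written as code), so treat the HDFS path, the feature columns, and the "label" column as placeholders:

```python
from pyspark.sql import SparkSession
from sklearn.tree import DecisionTreeClassifier

spark = SparkSession.builder.appName("sample-learn-apply").getOrCreate()
lines = spark.read.parquet("hdfs:///retail/ticket_lines")  # hypothetical path

# Sample on the cluster until the result fits in memory, then learn locally
sample = lines.sample(fraction=0.01, seed=42).toPandas()
model = DecisionTreeClassifier(max_depth=8)
model.fit(sample.drop(columns=["label"]), sample["label"])  # numeric features assumed

# Apply back on the cluster: ship the fitted model to each partition
bc = spark.sparkContext.broadcast(model)

def score(rows):
    import pandas as pd
    df = pd.DataFrame([r.asDict() for r in rows])
    if df.empty:
        return iter([])
    return iter(bc.value.predict(df.drop(columns=["label"])))

predictions = lines.rdd.mapPartitions(score)
```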
Cheers,
Martin
Dortmund, Germany
Do you think solving this via Hadoop/Radoop is a typical situation in the retail industry? (E.g. one retail chain with 20 branches and 2M potential customers)
Regards
Since I am a consultant in Germany, I can hardly speak about the non-German market. What I have experienced is that more and more companies are shifting towards such an infrastructure. In Germany, however, the shift is still very much in progress. It is visible that the use of data is becoming more and more a requirement instead of a nice-to-have.
From what I have heard, U.S. companies are faster in the process of adopting it.
Cheers,
Martin
Dortmund, Germany