scoring and storing very large datasets in RM - any hint?
The Stream Database operator is one of the essential operators for handling very large datasets in RM (if not the most important one).
I did the following experiment: using this operator, I sampled a large dataset stored in a database, such that the sample fits into main memory and can be handled there to learn a model (data preprocessing included). So far so good: I obtained and evaluated the model and was happy with its performance parameters, so I saved it. Then I applied the saved model to the whole dataset, which was, logically, accessed via the same Stream Database operator, with the intention of saving the result in a new table of the database.
The process failed, with the suggestion to materialize the dataset in memory first (!!), which is not a solution given the size of the dataset.
Although I find it obvious how to implement this in an established data mining suite such as SPSS Clementine/Modeler or SAS Enterprise Miner, I cannot see another way of scoring and storing the whole (large) dataset with RM. I assume it should be possible. Many thanks to anyone who would like to share their experience or provide a useful hint.
Best
Dan
Answers
As with my response to your last post, I work on databases and do not experience the issues you describe, so it would be helpful to see your process XML and to know your configuration.
Assume the dataset is so large that it does not fit into main memory (obviously, you want the scored dataset saved back into the database).
1) How would you score the dataset using your approach?
2) How would you correct the following simplified process so that it works? Assume you have the appropriate model, the appropriate connection details, and the appropriate dataset in the database. You get the message "Process failed..." and are advised "to transform the dataset into a memory based data table first or materialize the data table in memory"!
In conclusion, since you say you regularly mine databases, how do you score large datasets?
Thanks for your input,
Dan
Here is the simplified code (saving the scored dataset in the database was omitted here)
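(The original process XML is not reproduced in this thread. As a conceptual stand-in, here is a minimal Python sketch of the same naive approach, using pandas and SQLAlchemy; the connection string, table, column, and model file names are all hypothetical. The whole-table read below is exactly the in-memory materialization the error message asks for, and it is what cannot work once the table exceeds RAM.)

```python
import joblib
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection and artifacts, for illustration only.
engine = create_engine("postgresql://user:pass@host/dbname")
model = joblib.load("model.pkl")  # the previously trained and saved model

# Naive approach: pull the WHOLE table into memory, then score it.
# This is the "materialize the data table in memory" step, and it is
# precisely what fails when the table is larger than the available RAM.
df = pd.read_sql_table("big_table", engine)
df["prediction"] = model.predict(df.drop(columns=["label"]))
df.to_sql("scored_table", engine, if_exists="replace", index=False)
```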
the general contract is that you cannot write into the data table you are currently reading from. Hence you have to materialize your data first, as the exception suggests. Since you cannot materialize the complete dataset, you have to do this in chunks. These chunks can be appended to a new table after being classified.
An example process would look like the sketch below. Please tell me if you experience any problems with that.
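Here is a minimal sketch of this chunked pattern, written in Python with pandas and SQLAlchemy for illustration; table, column, batch-size, and model names are hypothetical.

```python
import joblib
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/dbname")  # hypothetical
model = joblib.load("model.pkl")

# Stream the source table in fixed-size batches instead of all at once;
# each batch is "materialized" in memory on its own, scored, and the
# scored rows are appended to a new table.
for chunk in pd.read_sql_table("big_table", engine, chunksize=10_000):
    chunk["prediction"] = model.predict(chunk.drop(columns=["label"]))
    chunk.to_sql("scored_table", engine, if_exists="append", index=False)
```

In RapidMiner terms, this is Stream Database feeding Loop Batches, whose inner subprocess materializes the current batch, applies the model, and appends the results to a new table.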
Greetings,
Sebastian
Most non-native English speakers on this forum may not have expressed themselves perfectly when posting.
While language mistakes are tolerated by everybody here, rudeness is not. Other people have complained about your behaviour on this forum. It seems you are consistently rude (perhaps out of frustration?) when interacting with other users. Please ignore my postings; I will certainly be ignoring yours.
PS: It also occurs to me that your training set must be smaller than your test set, which is unusual.
For sure the model was not tested on the whole dataset, so it is a mistake to take it for the test set. But, as mentioned, in this experiment the model had been evaluated before any attempt to score the whole dataset. How? Both the training and the test datasets were parts of the data sample, chosen with the usual 2/3 and 1/3 split. The matter is closed, and I will not respond to any of your postings any more.
that's enough guys, please calm down. This is a forum for helping each other, not for battling!
This is no competition for the most exquisite language skills or for the best data miner on earth.
Maybe Haddock should not have made fun of your mistake, but in fact it had a humorous aspect, especially if you think of asking SPSS for help with a problem when you haven't paid for their software: it's like praying to God... it simply won't help you with your software problem.
And, let me add this as another non-native speaker, Haddock has to live with the fact that we mess up his mother language. At least with my German mother tongue, most of my English sentences will sound either rude or simply confusing. Probably this is a reason to get sarcastic sometimes.
Last but not least, Haddock is the most active community member and has helped many of our users with valuable tips. It's definitely a good idea to listen to what he has to say.
Please continue this discussion as professionals; I don't want to have to clean up the mess if this escalates.
Greetings,
Sebastian
Anyway, let's just deal with RM and data mining in a professional manner; that's why we are on this forum.
Cheers,
Dan
"Read the fucking manual" may sound rude and I try to avoid this four words where ever possible, but sometimes when I have bad times and someone asks an idiotic question that could have been answered with just taking a single look at the documentation or just switching on the own brain, then I would like to be allowed to write them.
You have to see it in this way: Whenever someone asks a question here and nobody answers it, we feel obliged to help him. For this we have to sacrifice some minutes of our working time. On the one hand this is good, because user need to get over the steep learning curve, on the other it consumes much time we could use for improve the program (and the documentation). And if someone asks questions even before having reached the start of learning curve, this just costs unnecessary time. Time that we could use to answer more important question like the one you originally asked in this thread. Please keep in mind that we are not being paid for maintaining this forum, so stupid questions do not increase our human resources, just consuming it...
So I personally find some of these questions just impolite, because they are asked before people start thinking on their own. Of course I won't answer in the same impolite way...
But I cannot remember, any participant of this discussion every asked such a question. Nor do I believe that haddocks joke was made to insult either you or your language skills. I think we could settle the hole matter NOW.
Keep calm and carry on.
Sebastian
The method Sebastian describes indeed looks very promising; at first I didn't realize that the Stream Database operator should be combined with Loop Batches. So that has been cleared up, thanks!
I'm wondering, does this method also work for large datasets when training the model? For example, I want to use an SVM on a large text database. Can I just use Loop Batches to train, or will this interfere with the iterative inner workings of the SVM algorithm?
I don't think this will work, since the SVM needs all training examples at once, as far as I know. And be assured: since the runtime grows with the cube of the number of examples, you simply don't want to train an SVM on that many examples...
Nevertheless, what you could do is group the examples into chunks with some sort of consistent label distribution and learn a separate SVM on each. Later you can combine them into one voting model; this might even improve performance further... (see the sketch below)
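A rough sketch of this committee idea, purely illustrative (scikit-learn; the chunking via stratified folds and all parameter choices are assumptions, not Sebastian's exact setup): train one SVM per label-stratified chunk, then combine the chunk models by majority vote.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def train_svm_committee(X, y, n_chunks=5):
    """Train one SVM per stratified chunk of the data.

    The folds of StratifiedKFold partition the data into chunks with a
    consistent label distribution, as suggested above.
    """
    skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=0)
    return [SVC(kernel="rbf").fit(X[idx], y[idx]) for _, idx in skf.split(X, y)]

def vote(models, X):
    """Combine the committee by majority vote (assumes integer class labels)."""
    preds = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```

Since SVM training scales roughly with the cube of the number of examples, k chunk models on n/k examples each cost on the order of n^3/k^2, a substantial saving over one SVM on all n examples.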
Greetings,
Sebastian