Process step caching
Greetings Programs!
We are currently evaluating a few tools (SAS Enterprise Miner, IBM SPSS Modeler, RapidMiner, KNIME). This question is NOT about a comparison between those, but rather about a feature I really like in SPSS Modeler that I haven't found in RapidMiner.
When you are creating a process, SPSS Modeler allows you to set a flag on any process step, which tells it to cache the output when run. This allows for a rapid development cycle of your process, because the tool is smart enough not to restart from the beginning of the process, but rather from a cached intermediate result.
For example: I have a CSV file with 12 million records, where I'm doing a lot of transformation and aggregation. At a certain point in the process, the intermediate result set is only 100 thousand records. I mark this spot as 'to be cached'. Next I continue developing my process, and add a few steps. Checking the result is really fast, since it can simply start with the cached set of 100k records each time I run it, and not from the starting set of 12M.
The thing I like about this feature is that it is totally transparent: I only have to mark the spot, and SPSS Modeler handles the rest.
I haven't found this in RapidMiner, which means that each time I want to check the result of my process, it has to start from scratch, running through each and every step again.
Did I overlook something? Is a similar feature available in RapidMiner?
Thanks for your input.
Tim
Answers
While you can view intermediate step results via so-called "breakpoints", you currently cannot start a process from such a breakpoint. You either have to continue the process after entering a breakpoint or start it from the beginning again. There is a workaround using the "Store" and "Retrieve" operators, but I'll admit it's clunky and not exactly convenient to use.
Remembering data at a certain point and allowing the process to be started from there is on our list of "cool features we want to have"; however, it does not exist yet.
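To make the store-then-retrieve idea concrete, here is a rough, tool-independent sketch in plain Python. It is not RapidMiner code and does not use any RapidMiner API; the names `cached_step`, `expensive_aggregation`, and the cache file are made up purely for illustration. The expensive part runs once, its result is written to disk, and later runs start from the stored result instead of from the raw data.

```python
import os
import pickle
import tempfile


def cached_step(cache_path, compute):
    """Return the stored intermediate result if it exists, else compute and store it.

    Hypothetical helper: a stand-in for the manual "store, then retrieve" pattern.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)  # the "retrieve" half
    result = compute()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)  # the "store" half
    return result


calls = {"expensive": 0}  # counts how often the slow step really runs


def expensive_aggregation():
    """Stand-in for the slow transformation/aggregation over the full input."""
    calls["expensive"] += 1
    totals = {}
    for row in range(100_000):
        key = row % 100
        totals[key] = totals.get(key, 0) + row
    return totals


cache_file = os.path.join(tempfile.mkdtemp(), "intermediate.pkl")

first = cached_step(cache_file, expensive_aggregation)   # computes and stores
second = cached_step(cache_file, expensive_aggregation)  # loads the stored result
```

The inconvenient part is the same as with the workaround described above: nothing invalidates the stored result automatically, so if an upstream step changes you have to delete the cache (or overwrite the stored object) yourself.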
Regards,
Marco
Is process caching anywhere further in development? This would be great to see soon.