Caching in RapidMiner using Old World Computing's Jackhammer Extension
Hello everyone,
We would like to present a feature of the versatile Jackhammer Extension and describe in more detail how to use the caching operators provided by the extension to speed up your processes and significantly reduce overhead. We think this is a really useful feature and hope it'll help you with your work!
The caching function is useful for all processes that suffer from long run times due to repeated data retrieval. This is not only a nuisance when designing and testing a process; constant requests also put more stress on a database than necessary and affect other users and applications on the same database. Most importantly, caching can greatly improve response times for web services, lower resource utilization under high request volumes and reduce reaction times when deploying a web application with RapidMiner Server.
The Jackhammer Extension comes with more than sixty operators, three of which deal with caching: Cache, Clear Cache and Retrieve Cache. For this first tutorial, we want to show how the Cache operator works and how to integrate it into your process. In later tutorials, we will talk about more advanced functions like data validity settings and user rights, how to use the Clear Cache operator, and how to take dependencies into account when caching.
Basically, the Cache operator offers a subprocess in which you place the operators that load the relevant data. This subprocess is executed once, and its output is kept in the cache from then on. When you run the process again, it returns the cached data rather than reloading the same data over and over again. With the Jackhammer Extension, losing time waiting for the process to load the data becomes a thing of the past, as you can put entire preprocessing chains into the subprocess.
It is also possible to cache objects other than data: you can put the training of prediction models, or any other complex process that generates static results, into the subprocess so that the results are cached and can be used without running them each time.
Step 1
For this first step, we will take a look at the Cache operator itself and its parameters. Search for the Cache operator and add it to your process, then click on it to be able to see the parameter settings.
As you can see, you can set a Cache name (1) of your choosing (which will be relevant in our following tutorial covering the Clear Cache operator) or clear the cache manually (2). You can also tick the box to restrict validity (3) and enter cache dependencies (4); again, these are functions we will discuss in later tutorials. If you choose not to enter a name now, RapidMiner will simply use the operator’s name, i.e. Cache. Names will become important for more advanced functions like the Clear Cache or the Retrieve Cache operators. In this example, we will use “wind turbine”, but again, for these steps it is not yet necessary to enter a name. Then double-click on the operator to open the subprocess.
Step 2
On the subprocess level, add your database connection.
Step 3
On this level, you can also add your data preprocessing steps. Their final result will be cached as well, meaning you will only have to run the preprocessing once and can save even more time. This is what the subprocess looks like with an added preprocessing step – of course you can use any and as many as you need! Do not forget to make all necessary connections between the operators and to the output ports in order to be able to receive your results.
When you run the process now, it will load the data and output an example set. Run it again and the results will be there virtually instantly! The extension’s process execution performance monitoring feature backs up the time saved by caching with numbers:
This screenshot shows the run time in milliseconds for the first execution of the process. The following one illustrates the improvement due to caching the Read Database and Select Attributes operators:
As you can see, the tasks inside the Cache operator are no longer executed, which shaves 344 ms off your run time. While this is of course only a small example process, you can surely imagine how much time caching saves with increasingly complex processes and larger databases!
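If you would like to reproduce this kind of measurement outside RapidMiner, a small timing sketch in Python could look like the following. Everything in it is an assumption for illustration: read_database and select_attributes are hypothetical functions that merely imitate the two cached operators, the 50 ms delay is made up, and a plain dictionary stands in for the cache. The numbers it prints will differ from the ones in the screenshots.

```python
import time

_cache = {}          # stand-in for the cache, keyed by cache name
executed_steps = []  # records which simulated operators actually ran


def read_database():
    """Simulated slow query (hypothetical; sleeps instead of contacting a database)."""
    executed_steps.append("Read Database")
    time.sleep(0.05)
    return [
        {"id": 1, "speed": 7.2, "note": "ok"},
        {"id": 2, "speed": 9.1, "note": "ok"},
    ]


def select_attributes(rows, keep):
    """Simulated preprocessing step that keeps only the listed columns."""
    executed_steps.append("Select Attributes")
    return [{key: row[key] for key in keep} for row in rows]


def run_process(cache_name="wind turbine"):
    """Return the result and the elapsed time in milliseconds for one run."""
    start = time.perf_counter()
    if cache_name not in _cache:
        # Only the first run executes the loading and preprocessing steps.
        _cache[cache_name] = select_attributes(read_database(), ["id", "speed"])
    result = _cache[cache_name]
    return result, (time.perf_counter() - start) * 1000.0


first_result, first_ms = run_process()
second_result, second_ms = run_process()
print(f"first run:  {first_ms:.1f} ms")
print(f"second run: {second_ms:.3f} ms")
```

The list executed_steps contains each simulated operator only once after both runs, which mirrors what the performance monitor shows: on the second execution the steps inside the cache are skipped entirely.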
Now you know how to use the Cache operator and can integrate it into your own processes. In the following tutorial, we will pick up where we left off here and show how to set data validity periods and how to manually clear and reload the cache.