About Data Pipeline Structure
I am using RapidMiner for a big data analysis project, and I have constructed my workflow using only Execute Python operators. The workflow reads a large pandas DataFrame at the very beginning and processes it row by row in the following operators. I have noticed that each operator only starts once the previous operator has finished processing all rows. Is it possible for a row, once finished, to be passed immediately to the next operator? In other words, does RapidMiner support data pipelines?
Best Answer
CKönig Employee-RapidMiner, Member Posts: 70 RM Team Member
RapidMiner does support building data pipelines for streaming data. For enterprise projects this can involve writing to a Kafka queue, with multiple worker nodes listening to continue the calculations. For the basic operators, you are right that these are usually "atomic" in nature: calculations are performed "in order". In your specific case, since you are mainly using Execute Python operators, it would make sense to build the data pipeline there as well.
Another comment: if you are only using the visual workflows of RapidMiner to orchestrate steps done purely in Python, be aware that any Python-based operator (Execute Python, Python Learner, Python Transformer, ...) introduces a small overhead into your overall runtime. Currently, each time such an operator runs, a Python environment is started and all of the data is serialized and read back into a pandas DataFrame. When chaining multiple Python operators, this overhead can be significant compared to doing everything in one Python operator.
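To illustrate the idea of moving the pipeline into a single operator: a minimal sketch of a row-streaming pipeline built with Python generators, so each row flows through every step before the next row starts. The step functions `clean_row` and `score_row` are hypothetical placeholders for your own per-row logic, not part of RapidMiner's API.

```python
import pandas as pd

def clean_row(row):
    # hypothetical step: strip whitespace from string fields
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def score_row(row):
    # hypothetical step: derive a new column from existing ones
    row["total"] = row.get("a", 0) + row.get("b", 0)
    return row

def pipeline(df, steps):
    # Generator: each row is pushed through all steps immediately,
    # instead of each step waiting for the whole DataFrame to finish.
    for row in df.to_dict("records"):
        for step in steps:
            row = step(row)
        yield row

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "name": [" x ", " y "]})
result = pd.DataFrame(pipeline(df, [clean_row, score_row]))
```

Placed inside one Execute Python operator, this avoids repeated environment start-up and serialization between chained operators, while still giving you the row-at-a-time behavior you asked about.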
Further information can be found here: RapidMiner and Python - RapidMiner Documentation