SparkRM, Hive, TEZ, Python, R, PySpark, SparkR - What is the Sequence? Or, The Radoop Matryoshka
Question: If I put a Hive operator inside a SparkRM, does it become a Spark job?
No. Only standard RapidMiner operators can be used inside SparkRM; Hive operators cannot. However, you can configure Hive to use Spark as its execution engine, in which case all Hive operators in Radoop run on Spark. The relevant Hive option is hive.execution.engine, which you can set in the connection.
Question: If using Hortonworks and Hive with embedded TEZ, do my Hive operators automatically leverage TEZ?
As in the previous question, you just need to set the hive.execution.engine option in the connection to “tez”.
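To make the setting from the two answers above concrete, here is a minimal sketch of the key/value pair involved. The helper names are made up for illustration and are not part of Radoop; the only facts relied on are the option name hive.execution.engine and its standard Hive values (mr, tez, spark). Where exactly the pair is entered in the connection dialog is not covered here.

```python
# Values that Hive accepts for hive.execution.engine.
VALID_ENGINES = ("mr", "tez", "spark")


def engine_setting(engine):
    """Return the Hive key/value pair that selects the given execution engine.

    Hypothetical helper for illustration only; not a Radoop API.
    """
    engine = engine.strip().lower()
    if engine not in VALID_ENGINES:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {VALID_ENGINES}"
        )
    return ("hive.execution.engine", engine)


def as_set_statement(engine):
    """Render the same setting as a HiveQL SET statement."""
    key, value = engine_setting(engine)
    return f"SET {key}={value};"


print(as_set_statement("tez"))    # SET hive.execution.engine=tez;
print(as_set_statement("spark"))  # SET hive.execution.engine=spark;
```

The SET form can be handy for checking in a plain Hive session that the cluster actually supports the engine before putting the same key and value into the connection.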
Question: Can I execute Python or R inside a Radoop nest and will it execute on the cluster?
You can use SparkR or PySpark with the “Spark Script” operator. That would be the easiest way.
If you need, for example, a package that is not available in SparkR, you can run the R code through SparkRM instead (see the first question), but in that case R must be installed on the cluster nodes, all in the same path.
Question: Can I run Hive operators on Spark without Hadoop?
No, we don’t integrate with Spark without Hadoop. You need a Hive server and YARN installed. You can, however, have Spark as Hive’s execution engine.
Question: When writing PySpark, where should I execute the code? Radoop nest, SparkRM or Studio?
Use the “Spark Script” operator, which should be placed inside the Radoop nest.
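As a rough illustration of what such a script might look like, here is a minimal PySpark sketch. It assumes that the operator calls an entry-point function named rm_main and passes its input as a Spark DataFrame; check the operator's help text for the exact contract in your Radoop version. The column name "amount" is invented for the example.

```python
# Sketch of a script body for the "Spark Script" operator (PySpark flavour).
# Assumption: the operator invokes rm_main(data) with a Spark DataFrame and
# hands whatever the function returns back to the process.


def rm_main(data):
    """Keep only the rows with a positive amount.

    "amount" is a hypothetical column name; replace it with a column
    that exists in your own data.
    """
    # DataFrame.filter with a column expression is evaluated by Spark on the
    # cluster; nothing is pulled back to Studio at this point.
    return data.filter(data["amount"] > 0)
```

Because the function only uses DataFrame operations, the filtering itself stays inside Spark rather than running on the machine where Studio is installed.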