
Increase Radoop Performance

kevin_m Member Posts: 5 Contributor I
edited November 2018 in Help

Hello, is it possible to increase the performance or speed of the Spark query? If so, how? Thanks in advance!


Best Answers

  • phellinger Employee-RapidMiner, Member Posts: 103 RM Engineering
    Solution Accepted

    Hi,

    That depends on which Spark queries are examined here.

    Before getting into specifics: Hadoop (YARN) jobs have an annoyingly large overhead, which is especially noticeable when running simple operations on small data sets. The overhead only becomes relatively small when you run the "real" thing, that is, distributed and/or complex jobs on huge data sets, where it is minor compared to the overall job runtime.

    In case of larger jobs, the overall performance may depend on how well the cluster resources are allocated, so the Spark resource allocation settings can have a significant effect on it.
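
    As an illustration, these are standard Spark configuration properties that control resource allocation (the property names are standard Spark keys; the values below are placeholders to tune for your cluster, not recommendations). In Radoop, such properties can typically be set among the advanced Spark parameters of the connection; in plain Spark they would go into spark-defaults.conf.

        # Standard Spark resource allocation properties; the values are
        # illustrative placeholders, tune them to your cluster.
        spark.executor.memory=4g              # heap memory per executor
        spark.executor.cores=2                # CPU cores per executor
        spark.executor.instances=4            # executor count (static allocation)
        # Alternatively, let YARN scale the executor count with the workload:
        spark.dynamicAllocation.enabled=true
        spark.shuffle.service.enabled=true    # required by dynamic allocation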

    In case of smaller jobs, the overhead itself should be decreased. For pure Spark operators (you can recognize them by the Spark star icon), however, there is no general way to achieve that. For Hive-based operators (look for the Hive bee icon), the overhead can be greatly decreased when Hive-on-Spark is enabled on the cluster. In the following screenshot from the Resource Manager interface of the cluster (accessible via a web browser at <resource_manager_host>:8088 by default), you can distinguish between the two types of jobs by looking at the User column: the first is a Hive-on-Spark job, the second is a pure Spark job.

    [Screenshot: Resource Manager application list showing a Hive-on-Spark job and a pure Spark job]
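
    If the web interface is not at hand, the same application list can be queried from a shell on the cluster with the standard YARN CLI (the state filter is optional):

        # List YARN applications; the User and Application-Type columns
        # help to tell Hive-on-Spark jobs and pure Spark jobs apart.
        yarn application -list -appStates RUNNING,FINISHED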

    The overhead of the Hive-on-Spark jobs can be decreased via the "Connection pool" settings in the Preferences, although the default heuristics should already provide good results when operations are executed frequently.

    Let me know if you can describe your challenges more specifically.

    Best,

    Peter

    Edit: formatting

  • phellinger Employee-RapidMiner, Member Posts: 103 RM Engineering
    Solution Accepted

    Also, please note that you can expect a performance improvement from upgrading to Spark 2.x.

    Switching Radoop to Spark 2.x is very simple: the required Spark archive can be uploaded to HDFS, and Radoop can use it from there. There is no need to install or upgrade any services on the cluster side.
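
    As a sketch, the upload could look like the following (the archive version and the HDFS path here are assumptions; check the Radoop documentation for the exact Spark version supported by your setup):

        # Upload a Spark 2.x archive to HDFS so that the Radoop connection
        # can reference it (file name and path are illustrative):
        hdfs dfs -mkdir -p /user/spark
        hdfs dfs -put spark-2.1.0-bin-hadoop2.7.tgz /user/spark/
        # Then point the Spark archive path of the Radoop connection to
        # hdfs:///user/spark/spark-2.1.0-bin-hadoop2.7.tgz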

    Peter
