
RapidMiner Studio Spark Cluster Connection issue

skaloger Member Posts: 1 Learner I
edited September 2020 in Help
Hi, I am currently using RapidMiner Studio to design my streaming workflows, and I have an issue with the Spark connector. What I am trying to achieve is to design a simple workflow and submit it to my local Spark cluster. My workflow consists of a Spark connection (Retrieve) and a Streaming Nest containing a Kafka Source, an Aggregate Stream, and a Kafka Sink.

I have managed to execute the same workflow with the Flink connector, upload it to my Flink cluster, and test it against my Kafka source and sink, so the rest of the workflow is definitely working properly. I assume the problem is with the configuration settings of my Spark connection, so I would appreciate being guided through the steps and all the configuration needed to connect successfully to my local Spark cluster. Note that after all the configuration tests I performed, "Test connection" reported that the connection was successful, which does not seem to be right in my case.
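For reference, this is roughly what I expect the job to do once it reaches the cluster. It is only a sketch in plain PySpark Structured Streaming, not the code RapidMiner generates, and it assumes a standalone master at spark://localhost:7077, a Kafka broker at localhost:9092, and the spark-sql-kafka connector package on the classpath; all of those are guesses about a typical local setup and may differ from yours.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, sum as sum_

# Assumed local endpoints -- adjust to the actual cluster and broker addresses.
SPARK_MASTER = "spark://localhost:7077"
KAFKA_BROKER = "localhost:9092"

spark = (
    SparkSession.builder
    .appName("JobShow-sanity-check")
    .master(SPARK_MASTER)
    .getOrCreate()
)

# Consume the same "input" topic the Kafka Source operator reads from.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("subscribe", "input")
    .option("startingOffsets", "latest")  # start_from_earliest = false
    .load()
)

# Kafka delivers key and value as binary; interpret both as strings here.
words = events.selectExpr(
    "CAST(key AS STRING) AS word",
    "CAST(value AS STRING) AS value",
    "timestamp",
)

# 10-second tumbling window summing the numeric payload per word, mirroring
# the Aggregate Stream operator (key=word, value_key=value, function=Sum).
sums = (
    words
    .withColumn("value", col("value").cast("double"))
    .withWatermark("timestamp", "10 seconds")
    .groupBy(window(col("timestamp"), "10 seconds"), col("word"))
    .agg(sum_("value").alias("sum_value"))
)

# Publish the aggregates to the "output" topic, like the Kafka Sink operator.
query = (
    sums
    .selectExpr("word AS key", "to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("topic", "output")
    .option("checkpointLocation", "/tmp/jobshow-sanity-check")
    .outputMode("update")
    .start()
)
query.awaitTermination()

The full process XML exported from Studio is below.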

<?xml version="1.0" encoding="UTF-8"?><process version="9.7.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.7.002" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.7.002" expanded="true" height="68" name="Retrieve SparkConnection" width="90" x="112" y="136">
        <parameter key="repository_entry" value="SparkConnection"/>
      </operator>
      <operator activated="true" class="streaming:streaming_nest" compatibility="0.1.000-SNAPSHOT" expanded="true" height="82" name="Streaming Nest (2)" width="90" x="380" y="136">
        <parameter key="job_name" value="JobShow"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.7.002" expanded="true" height="68" name="Retrieve MyKafka (5)" width="90" x="45" y="34">
            <parameter key="repository_entry" value="MyKafka"/>
          </operator>
          <operator activated="true" class="streaming:kafka_source" compatibility="0.1.000-SNAPSHOT" expanded="true" height="68" name="Kafka Source (2)" width="90" x="179" y="85">
            <parameter key="topic" value="input"/>
            <parameter key="start_from_earliest" value="false"/>
          </operator>
          <operator activated="true" class="streaming:aggregate" compatibility="0.1.000-SNAPSHOT" expanded="true" height="68" name="Aggregate Stream (2)" width="90" x="380" y="136">
            <parameter key="key" value="word"/>
            <parameter key="value_key" value="value"/>
            <parameter key="window_length" value="10"/>
            <parameter key="function" value="Sum"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.7.002" expanded="true" height="68" name="Retrieve MyKafka (6)" width="90" x="380" y="34">
            <parameter key="repository_entry" value="MyKafka"/>
          </operator>
          <operator activated="true" class="streaming:kafka_sink" compatibility="0.1.000-SNAPSHOT" expanded="true" height="82" name="Kafka Sink (4)" width="90" x="581" y="34">
            <parameter key="topic" value="output"/>
          </operator>
          <connect from_op="Retrieve MyKafka (5)" from_port="output" to_op="Kafka Source (2)" to_port="connection"/>
          <connect from_op="Kafka Source (2)" from_port="output stream" to_op="Aggregate Stream (2)" to_port="input stream"/>
          <connect from_op="Aggregate Stream (2)" from_port="output stream" to_op="Kafka Sink (4)" to_port="input stream"/>
          <connect from_op="Retrieve MyKafka (6)" from_port="output" to_op="Kafka Sink (4)" to_port="connection"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve SparkConnection" from_port="output" to_op="Streaming Nest (2)" to_port="connection"/>
      <connect from_op="Streaming Nest (2)" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
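Since "Test connection" reports success but the job does not appear to run, one check I can do after executing the process is whether the Spark master actually received an application named JobShow. Below is a small sketch for that; it assumes a standalone master whose web UI runs at http://localhost:8080 and serves its state as JSON under /json, and that the returned fields include workers, activeapps, and completedapps (this may differ between Spark versions and cluster managers).

import json
import urllib.request

# Assumed address of the standalone master's web UI -- adjust if needed.
MASTER_UI = "http://localhost:8080"

# The standalone master web UI can serve its status as JSON at /json
# (availability and field names may vary between Spark versions).
with urllib.request.urlopen(f"{MASTER_UI}/json") as resp:
    status = json.load(resp)

print("Registered workers    :", len(status.get("workers", [])))
print("Active applications   :", [app.get("name") for app in status.get("activeapps", [])])
print("Completed applications:", [app.get("name") for app in status.get("completedapps", [])])

If JobShow never appears here even though "Test connection" succeeds, that would suggest the problem is in how the connection submits the job rather than in the cluster itself.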

