Radoop on Amazon EMR fails to initialize
I'm very close to getting Radoop working with an Amazon EMR cluster. My setup involves RapidMiner Studio and Radoop on a Windows laptop which has full, unfettered firewall access to the EMR machines. I am not using SOCKS (although I started with this). I am using the latest Spark, Hive and Hadoop components that Amazon makes available.
The full connection test fails at the point where components are being uploaded to the /tmp/radoop/_shared/db_default/ HDFS location. I can see that the data nodes are being contacted on port 50010, and it looks like this fails from my laptop because the IP addresses are not known to it. I have tried the dfs.client.use.datanode.hostname true/false workaround and I see this changes the name that it attempts to use - in one setting the node is <name>/<ipaddress>:50010 (which is odd) while in the other it is <ipaddress>:50010 (which is believable but doesn't resolve).
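One way to see which addresses the cluster is advertising (a sketch; run on the EMR master node, assuming the hdfs command is available there) is:
hdfs dfsadmin -report | grep -E 'Name:|Hostname:'
The Name: entries should be the internal <ipaddress>:50010 endpoints that the client is being told to contact.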
I don't have the luxury of installing RapidMiner components on the EMR cluster, so my question is: what is the best way to get the data nodes exposed to the PC running RapidMiner Studio and Radoop?
Best Answer
Hello Peter,
I'm happy to say the Spark suggestion worked and now I can get Radoop connections working completely.
As promised here is the list of things to do to get to this happy place.
Create an EMR cluster and use the advanced options to select Hadoop, Pig, Spark, Hive and Mahout.
Log on to the master node and determine the internal IP address of the eth0 interface using the command line.
ifconfig
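If you just want the address on its own, the EC2 instance metadata service returns the same internal IPv4 address (a sketch, run on the master; it assumes the standard EC2 metadata endpoint is reachable, which it normally is):
curl -s http://169.254.169.254/latest/meta-data/local-ipv4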
While logged in, there are some configuration steps needed to make the environment work. These are described in the Radoop documentation here. I observed that Java did not need any special configuration since EMR is up to date. The commands to create various staging locations in HDFS are required; I've repeated them below.
hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history
hadoop fs -chmod -R 777 /tmp/hadoop-yarn
hadoop fs -mkdir /user
hadoop fs -chmod 777 /user
An earlier version of Spark needs to be installed. Here are the steps.
wget -O /home/hadoop/spark-1.6.3-bin/hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
cd /home/hadoop
tar -xzvf spark-1.6.3-bin-hadoop2.6.tgz
hadoop fs -put spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar /tmp/
Continue to follow the instructions to set up the network connection. Use the IP address found above as the NameNode Address, Resource Manager Address and JobHistory Server Address. Don't be tempted to use any other name or IP address since it will not work.
Set the Hive Server address to localhost.
Set the Hive port to 1235.
Set the Spark version to Spark 1.6 and set the assembly jar location to
hdfs:///tmp/spark-assembly-1.6.3-hadoop2.6.0.jar
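It's worth double-checking that the assembly really is at that location before running the test (a quick sanity check on the master):
hadoop fs -ls /tmp/spark-assembly-1.6.3-hadoop2.6.0.jar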
Set the advanced Hadoop parameters as follows
dfs.client.use.legacy.blockreader true
hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.SocksSocketFactory
hadoop.socks.server localhost:1234
Now create the SOCKS connection. On Linux the command is like this.
ssh -i <yourkey>.pem -N -D 1234 -L localhost:1235:<ifconfig ip address>:10000 hadoop@<nameofmaster>
In the command above, the values between < > need to be replaced with details from your own environment.
On Windows, use Putty to create the SOCKS connection. The Radoop documentation gives a nice picture here. Make sure you replace hive-internal-address with the IP address determined using the ifconfig command.
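If you prefer the command line on Windows, something equivalent should be possible with plink from the PuTTY suite (a sketch; the key needs to be in .ppk format and the placeholders are the same as above):
plink.exe -ssh -i <yourkey>.ppk -N -D 1234 -L 1235:<ifconfig ip address>:10000 hadoop@<nameofmaster>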
Now you can run the Radoop connection tests and with luck, all will be well...
yay!
Andrew
Answers
Hi Andrew,
You will need to use some networking trick, because the datanode IP addresses that you are receiving from the cluster are AWS-internal IP addresses that your PC cannot route to. The dfs.client.use.datanode.hostname setting will not do the trick, as the Hadoop services are not exposed on the public-facing IPs.
If you can start another EC2 instance in the same local network (VPC in AWS lingo) as the EMR cluster, then I suggest installing a RapidMiner Server on that EC2 instance and enabling the Radoop Proxy. See here for more details: https://docs.rapidminer.com/radoop/installation/networking-setup.html#radoop-proxy
If you cannot start another instance then you either need to set up the SOCKS proxy or a VPN.
Best, Zoltan
Hello Zoltan
I initially tried with SOCKS but I couldn't make it work - probably a misconfiguration of some sort. Can I be confident that it will eventually be possible using the SOCKS approach? I just need to be sure that I can get it working before I spend more time on it. I promise to write up what I did.
regards
Andrew
I have almost got it working - the last part is now a failure in the Spark location
And yet on the EMR master node, I can see local jar files at that location. Is there a specific file that is needed?
Hi Andrew,
I was able to reproduce your problem on EMR-5.6.0 with Spark 2.1.
It's important to note that Amazon is quite agile in pushing new EMR versions out :) - sometimes the latest versions have changes that affect the initial RapidMiner connection setup. Let me take a look at this one, but it may take some time.
Meanwhile, you can always use Spark 1.6 on this cluster as well, just download it from http://spark.apache.org, put the assembly on HDFS and change the Radoop connection to point to that. For example, run these commands as hadoop user on the master (I hope I have no typos there):
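Something along these lines should do it (a sketch matching the sequence in the Best Answer above; the file names assume the Spark 1.6.3 / Hadoop 2.6 build):
# run as the hadoop user on the master node
wget https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
tar -xzf spark-1.6.3-bin-hadoop2.6.tgz
hadoop fs -put spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar /tmp/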
Best,
Peter
oops - I made a typo in the instructions above. The wget command should be
wget -O /home/hadoop/spark-1.6.3-bin-hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
and also, the SOCKS instructions for Windows Putty are incorrect. The address to use as the tunnel destination is localhost - confusing - but it seems to work.
Hi Andrew,
thanks for the great summary!
The only thing I did not get is the localhost address comment on Windows. Do you mean you had to use "localhost" as the address (with port 10000) instead of the Hive node's IP address? I would expect that to only work if the HiveServer2 ran on the master node.
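A quick way to check would be to see whether anything is listening on port 10000 on the master node (a sketch, assuming you can SSH in and netstat is available; ss -tlnp is an alternative):
sudo netstat -tlnp | grep 10000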
Best,
Peter
Hello Peter
I have these Putty settings.
[Screenshot: PuTTY SSH tunnel settings]
If I change the destination for the forwarded local port 1235 to other likely candidate names or IP addresses, I get a failure in the Quick Test of the Radoop connection.
regards
Andrew
We've made a small update to the Amazon EMR guide at https://docs.rapidminer.com/radoop/installation/distribution-notes.html.
Both Spark 1.x and Spark 2.x can be used easily. The most efficient configuration is described: upload Spark assembly / Spark jars to HDFS in a compressed format and provide the HDFS URL in the Radoop connection.
(The error came from the fact that Spark libraries are only installed on the master node, so the submitted jobs could not find them on worker nodes.)
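For Spark 2.x the idea is roughly the following (a sketch; the /usr/lib/spark/jars location and the target HDFS path are assumptions - check your cluster and the guide above for the exact details):
# on the master: zip up the Spark 2.x jars and upload the archive to HDFS
zip -j /tmp/spark-jars.zip /usr/lib/spark/jars/*.jar
hadoop fs -put /tmp/spark-jars.zip /tmp/spark-jars.zip
# then point the Radoop connection's Spark path at hdfs:///tmp/spark-jars.zip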
Best,
Peter