Connecting to CDH 5 on an EC2 instance
Dear all,
I have recently launched an EC2 instance with CDH 5.11 installed on it. All services seem to be up and running, and I have run several tests to validate the installation.
I have also installed RapidMiner Studio on my desktop, along with the Radoop extension, and I am now trying to connect to my Hadoop cluster. The EC2 instance is not configured to use Elastic IPs, so I am using SSH tunnels.
I am currently trying to pass the full test to validate the connection. The configuration was initially imported from Cloudera Manager, and I then modified several properties to match my environment. The Hive, Java version, MapReduce and NameNode networking tests all pass, but I am stuck at the upload of a jar file to HDFS. I suspect the problem is related to an earlier warning from the DataNode networking test:
WARNING: Reverse DNS lookup failed! Expected hostname for ip <public-ip>: <fqdn>, but received <public DNS>.
WARNING: DataNode port 50010 on the ip/hostname <fqdn> cannot be reached. Please check that you can access the DataNodes of your cluster.
I believe the tunnel on port 50010 is working, but there is something I am missing. The output of netstat shows the port listening on all interfaces (0.0.0.0).
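As a sanity check, I run a small probe like the following from my desktop (sketch only; host and port are just whatever the tunnel exposes locally), and the tunnelled port does accept TCP connections:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

// Quick connectivity probe from the Studio machine: can we open a TCP
// connection to the DataNode port through the tunnel? Host and port are
// placeholders for whatever the local tunnel endpoint is.
public class PortProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 50010;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("TCP connection to " + host + ":" + port + " succeeded");
        } catch (Exception e) {
            System.out.println("TCP connection to " + host + ":" + port + " failed: " + e);
        }
    }
}
```

Of course, a successful connect only proves the local tunnel endpoint is open; the HDFS client still has to reach the DataNode under the address the NameNode hands back, which is presumably where the reverse DNS warning comes in.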
Things I have tried:
- Edited my local hosts file to resolve the public IP to the internal server hostname. Radoop then complains that the server is unreachable.
- Formatted the NameNode after deleting all data in the HDFS data directory.
- Set dfs.client.use.datanode.hostname and dfs.datanode.use.datanode.hostname to true in the client configuration.
- Tried to upload a file with another client such as Toad: same error (a minimal sketch of that kind of upload is included below).
- Setting dfs.datanode.address on the server to hostname:port is not allowed by Cloudera Manager; it can only be set to the port number.
- Changing dfs.datanode.address in the client configuration does not change Radoop's behaviour.
The error when trying to upload the jar file is the following:
[----] SEVERE: File /tmp/radoop/_shared/db_default/radoop_hive-v4_UPLOADING_1498636293395_dy8gaul.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
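For reference, this is roughly the kind of upload I tried outside Radoop; a minimal sketch (the NameNode URI and paths are placeholders for my environment) that sets the hostname-related client property mentioned above hits the same error:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS upload outside Radoop. The NameNode URI and file paths are
// placeholders; the dfs.* property is the client-side setting mentioned above.
public class HdfsUploadTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask the client to connect to DataNodes by hostname instead of the
        // (private) IP addresses the NameNode reports.
        conf.set("dfs.client.use.datanode.hostname", "true");

        FileSystem fs = FileSystem.get(new URI("hdfs://<namenode-host>:8020"), conf);
        fs.copyFromLocalFile(new Path("/tmp/test.jar"),
                             new Path("/tmp/radoop/_shared/test.jar"));
        System.out.println("Upload finished");
        fs.close();
    }
}
```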
Somehow the client knows the number of DataNodes in the HDFS service. Can I conclude that the SSH tunnel on port 50010 is working fine? Can someone point me in the right direction?
Thank you!!
Best Answer
phellinger Employee-RapidMiner, Member Posts: 103 RM Engineering
Hi,
that is already some progress!
The client knows the number of DataNodes from the NameNode's response.
The client almost certainly won't be able to access the DataNodes directly, only through a SOCKS proxy, so the traffic goes through a master node.
You need to follow the instructions of "Configuring SOCKS Proxy and SSH tunneling" at
https://docs.rapidminer.com/radoop/installation/networking-setup.html
In this case, you don't need to create tunnels one by one; only one additional tunnel is needed for Hive, see the description.
Or is it something you have already configured?
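Under the hood, the SOCKS setup boils down to standard Hadoop client properties. Just for illustration (in Radoop you configure this through the connection settings as the page above describes; the NameNode host and proxy port below are placeholders), a plain Hadoop client pointed at a local SOCKS proxy opened with ssh -D looks roughly like this:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: route the Hadoop client's connections through a local SOCKS proxy
// (e.g. one opened with "ssh -D 1080 user@master-node"). Host and proxy port
// are placeholders; the property names are standard Hadoop ones.
public class SocksProxyClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.rpc.socket.factory.class.default",
                 "org.apache.hadoop.net.SocksSocketFactory");
        conf.set("hadoop.socks.server", "localhost:1080");

        FileSystem fs = FileSystem.get(new URI("hdfs://<namenode-host>:8020"), conf);
        for (FileStatus status : fs.listStatus(new Path("/tmp"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```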
This thread may also be helpful.
Best,
Peter
Answers
Hi phellinger,
Thanks a lot, this was very helpful. I had not read that documentation and was setting up a thousand tunnels one by one.
I am now able to pass the quick test. The full test fails at the Hive table load. The error tells me to check user permissions on LOAD or CREATE statements, which I have already done and they seem to be OK.
Can you point me in the right direction?
Thank you in advance!
Best,
Pau
Hi Pau,
great!
The Hive load test uploads a file to a temporary HDFS directory and then uses the LOAD DATA Hive statement, which effectively moves the file into the Hive warehouse directory.
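To illustrate the permission-sensitive step (this is only a rough equivalent, not the exact statements Radoop runs; host, table and path are placeholders), the same pattern over Hive JDBC looks like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Rough illustration: a file already sitting in an HDFS temp directory is
// moved into the Hive warehouse by LOAD DATA INPATH. Requires the Hive JDBC
// driver on the classpath. Host, user, table and path are placeholders.
public class HiveLoadSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://<hiveserver2-host>:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive-user", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS radoop_load_test (id INT, val STRING)");
            stmt.execute("LOAD DATA INPATH '/tmp/radoop_load_test.csv' "
                       + "INTO TABLE radoop_load_test");
        }
    }
}
```

If either statement fails with a permission error in the log, that narrows down whether it is the CREATE or the LOAD/move step that the Hive user is not allowed to perform.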
If you enable the Log panel in Studio (View -> Show Panel -> Log) and set the log level (right click on the panel -> Set log level -> FINER), you will see the details.
Can you share more details (log) in PM or here?
Best,
Peter