
Connecting to CDH5 in an EC2 instance

pau_fernandez_q Member Posts: 2 Contributor I
edited September 2019 in Help

Dear all,

 

I have recently launched an EC2 instance with CDH 5.11 installed on it. All services seem to be up and running, and I have run several tests to validate the installation.

 

I have also installed RapidMiner Studio on my desktop, along with the Radoop extension, and I am now trying to connect to my Hadoop cluster. The EC2 instance is not configured to use Elastic IPs, so I am reaching it through SSH tunnels.
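
For context, this is roughly the kind of local port forwarding I have in mind (a minimal sketch only; the key file, user and host are placeholders, and the ports are the CDH5 defaults for the NameNode RPC, DataNode transfer and HiveServer2 services):

    ssh -i my-key.pem -N \
        -L 8020:localhost:8020 \
        -L 50010:localhost:50010 \
        -L 10000:localhost:10000 \
        ec2-user@<ec2-public-dns>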

 

I am currently trying to pass the full test to validate the connection. The configuration was initially imported from Cloudera Manager, and I then modified several properties to fit my environment. The Hive, Java version, MapReduce and NameNode networking tests pass successfully, but I am stuck at the upload of a jar file to HDFS. I suspect the problem is related to an earlier warning from the DataNode networking test:

 

 WARNING: Reverse DNS lookup failed! Expected hostname for ip <public-ip>: <fqdn>, but received <public DNS>.

 WARNING: DataNode port 50010 on the ip/hostname <fqdn> cannot be reached. Please check that you can access the DataNodes of your cluster.

 

I believe the tunnel on port 50010 is working, but there is something I am missing. The output of netstat on the server shows that this port is listening on all interfaces (0.0.0.0).
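
For reference, this is roughly the check I mean (the exact flags can vary by distribution; this assumes a Linux server with net-tools installed):

    sudo netstat -tlnp | grep 50010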

 

Things I have tried:

 

- Edited my local hosts file to map the internal server hostname to the public IP. Radoop then complains that the server is unreachable.

- Formatted the NameNode after deleting all data in the HDFS data directory.

- Set dfs.client.use.datanode.hostname and dfs.datanode.use.datanode.hostname to true in the client configuration (see the snippet after this list).

- Tried to upload a file with another client such as Toad; same error.

- Tried to set dfs.datanode.address on the server to hostname:port, but Cloudera Manager does not allow this; it can only be set to the port number.

- Setting dfs.datanode.address in the client configuration does not change Radoop's behaviour.
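
For reference, the client-side override mentioned above has this shape (a sketch of hdfs-site.xml entries on the client; the same key/value pairs can typically be supplied as advanced Hadoop parameters in the Radoop connection instead):

    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.datanode.use.datanode.hostname</name>
      <value>true</value>
    </property>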

 

The error when trying to upload the jar file is the following:

[----] SEVERE: File /tmp/radoop/_shared/db_default/radoop_hive-v4_UPLOADING_1498636293395_dy8gaul.jar could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.

 

Somehow the client knows the number of DataNodes in the HDFS service. Does that mean the SSH tunnel on port 50010 is working fine? Can someone point me in the right direction?

 

Thank you!!



Answers

  • pau_fernandez_q Member Posts: 2 Contributor I

    Hi phellinger,

     

    Thanks a lot, this was helpful. I had not read that documentation and was trying a thousand different tunnels.

     

    I am now able to pass the quick test. The full test fails at the Hive table load. The error tells me to check the user permissions for the LOAD and CREATE statements, which I have already done, and they seem to be OK.
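
    For what it's worth, this is the kind of permission check I mean (a sketch only; the warehouse path is the CDH default and the temp directory is the one from the earlier error):

        hdfs dfs -ls -d /user/hive/warehouse
        hdfs dfs -ls -d /tmp/radoop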

     

    Can you point me in the right direction?

     

    Thank you in advance!

     

    Best,
    Pau

  • phellinger Employee-RapidMiner, Member Posts: 103 RM Engineering

    Hi Pau,

     

    great!

     

    The Hive load test uploads a file to a temporary HDFS directory and then uses the LOAD DATA Hive statement, which effectively moves the file into the Hive warehouse directory.
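
    In other words, the test does something along these lines (an illustrative sketch only; the path and table name here are placeholders, not the exact ones Radoop uses):

        LOAD DATA INPATH '/tmp/radoop/<temp_file>' INTO TABLE <radoop_test_table>;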

    If you enable the Log panel in Studio (View -> Show Panel -> Log) and set the log level (right click on the panel -> Set log level -> FINER), you will see the details.

     

    Can you share more details (log) in PM or here?

     

    Best,

    Peter
