Radoop Full Test failing
I am new to Radoop and trying to set up a development environment. My setup is:
- Virtual machine (Ubuntu) running in VirtualBox (I am not using the HDP image)
- 5 GB RAM assigned to the VM
- Spark 2.0.0
- Hadoop 2.8.5
- Hive 2.3.3
My quick tests are all okay. When I run the full test, I get the following error:
[Nov 4, 2018 7:50:46 PM]: Running test 17/25: Hive load data
[Nov 4, 2018 7:50:52 PM]: Test succeeded: Hive load data (6.356s)
[Nov 4, 2018 7:50:52 PM]: Running test 18/25: Import job
[Nov 4, 2018 7:51:07 PM] SEVERE: Test failed: Import job
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Import job
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Hive load data
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Radoop jar upload
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: HDFS upload
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Create permanent UDFs
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: UDF jar upload
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Spark assembly jar existence
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Spark staging directory
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: MapReduce staging directory
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Radoop temporary directory
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: MapReduce
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: HDFS
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: YARN services networking
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: DataNode networking
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: NameNode networking
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Java version
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Fetch dynamic settings
[Nov 4, 2018 7:51:07 PM]: Cleaning after test: Hive connection
[Nov 4, 2018 7:51:07 PM]: Total time: 22.634s
[Nov 4, 2018 7:51:07 PM]: java.lang.Exception: Import job failed, see the job logs on the cluster for details.
at eu.radoop.connections.service.test.integration.TestHdfsImport.call(TestHdfsImport.java:95)
at eu.radoop.connections.service.test.integration.TestHdfsImport.call(TestHdfsImport.java:40)
at eu.radoop.connections.service.test.RadoopTestContext.lambda$runTest$1(RadoopTestContext.java:279)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[Nov 4, 2018 7:51:07 PM] SEVERE: java.lang.Exception: Import job failed, see the job logs on the cluster for details.
[Nov 4, 2018 7:51:07 PM] SEVERE: Test data import from the distributed file system to Hive server 2 failed. Please check the logs of the MapReduce job on the ResourceManager web interface at http://${yarn.resourcemanager.hostname}:8088.
[Nov 4, 2018 7:51:07 PM] SEVERE: Test failed: Import job
[Nov 4, 2018 7:51:07 PM] SEVERE: Integration test for 'VirtualBoxVM' failed.
In the YARN container logs, I see the following error:
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Further, if I run just the Spark tests, I get the errors below.
My Spark settings in Radoop:
- Spark 2.0
- Assembly path -> hdfs:///spark/jars/*
- Resource Allocation Policy -> Static, Default Configuration
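For reference, this is roughly how the jars were staged on HDFS (a sketch from memory; $SPARK_HOME points at my Spark 2.0.0 install):

    # Stage the Spark 2.x jars directory on HDFS (paths from my setup)
    hdfs dfs -mkdir -p /spark/jars
    hdfs dfs -put $SPARK_HOME/jars/*.jar /spark/jars/
    # Sanity check that the assembly path resolves
    hdfs dfs -ls /spark/jars/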
Logs
[Nov 4, 2018 7:55:44 PM]: Running test 3/4: HDFS upload
[Nov 4, 2018 7:55:44 PM]: Uploaded test data file size: 5642
[Nov 4, 2018 7:55:44 PM]: Test succeeded: HDFS upload (0.075s)
[Nov 4, 2018 7:55:44 PM]: Running test 4/4: Spark job
[Nov 4, 2018 7:55:44 PM]: Assuming Spark version Spark 2.0.
[Nov 4, 2018 7:56:38 PM]: Assuming Spark version Spark 1.4 or below.
[Nov 4, 2018 7:56:38 PM] SEVERE: Test failed: Spark job
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: Spark job
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: HDFS upload
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: Spark staging directory
[Nov 4, 2018 7:56:38 PM]: Cleaning after test: Fetch dynamic settings
[Nov 4, 2018 7:56:38 PM]: Total time: 53.783s
[Nov 4, 2018 7:56:38 PM] SEVERE: com.rapidminer.operator.UserError: The specified Spark assembly jar, archive or lib directory does not exist or cannot be read.
[Nov 4, 2018 7:56:38 PM] SEVERE: The Spark test failed. Please verify your Hadoop and Spark version and check if your assembly jar location is correct. If the job failed, check the logs on the ResourceManager web interface at http://${yarn.resourcemanager.hostname}:8088.
[Nov 4, 2018 7:56:38 PM] SEVERE: Test failed: Spark job
[Nov 4, 2018 7:56:38 PM] SEVERE: Integration test for 'VirtualBoxVM' failed.
ResourceManager logs (full logs attached to the post):
User class threw exception: org.apache.spark.SparkException: Spark test failed: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/radoop/training-vm/tmp_1541357744748_x0migqc
Apart from this, I have also attached my yarn-site.xml and mapred-site.xml.
Any help would be much appreciated.
Answers
Hope we get a response! I have the same error.
- Your MapReduce classpath seems to be set up incorrectly. I don't see your connection XML, so it's possible you haven't added the mapreduce.application.classpath Advanced Hadoop property from your mapred-site.xml. It's also possible that the necessary jars are not available at the provided location. Could you please double-check? (See the sketch after this list.)
- The Spark exception indicates that Spark is looking for files on the local file system (file:/tmp/radoop/training-vm/tmp_1541357744748_x0migqc) instead of HDFS (hdfs:///tmp/radoop/training-vm/tmp_1541357744748_x0migqc). We should look deeper into why this happens, but before that, see my following point.
- You are using a single-node VM with 5 GB of memory. We can safely assume that you only want to use this setup as a proof of concept, but 5 GB of memory is probably not enough even for that. We always advise at least 8 GB of memory for Hadoop, even for the simplest use cases. More importantly, if you only want to play around with Radoop, I strongly suggest using one of the Quickstart VM guides provided in the Radoop documentation: https://docs.rapidminer.com/latest/radoop/installation/distribution-notes.html Please note that making your current VM work with Radoop will take considerably more effort than using our step-by-step guides.
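For the first point, here is a minimal sketch of how the property typically looks in mapred-site.xml on a stock Apache Hadoop 2.x install (the value below is the mapred-default.xml default and assumes HADOOP_MAPRED_HOME is set on the cluster nodes; substitute absolute paths if it is not, and add the same key/value as an Advanced Hadoop property in the Radoop connection):

    <!-- Sketch only: the exact value depends on your Hadoop 2.8.5 layout -->
    <property>
      <name>mapreduce.application.classpath</name>
      <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>

For the second point, one quick thing to check (an assumption on my part, not a confirmed diagnosis) is whether the client configuration resolves HDFS as the default file system; if fs.defaultFS is not picked up, Hadoop silently falls back to file:/ paths:

    # Should print something like hdfs://<namenode-host>:8020, not file:///
    hdfs getconf -confKey fs.defaultFS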
Cheers, Máté
Hopefully without disrupting the troubleshooting here, let me just share my thoughts.
I see the point of having a VM with as few components as possible.
Years ago, we automated the creation of such VMs with tools like Vagrant and Packer, but they are no longer in use and no longer work. (Docker requires fewer resources and is much easier to automate.) Those VMs were also based on the Cloudera and Hortonworks VMs, because these distributions do so much configuration automatically during installation that it is hard to replicate.
You may also consider these options. For example, removing all unnecessary services from a Cloudera VM using the Cloudera Manager interface is simple: keep only HDFS, Hive, YARN, Spark, and ZooKeeper (and Sentry, if you need security). Besides, as Máté says, 5 GB of RAM is still too little for Apache Hadoop; YARN is not prepared to handle that properly even if you are an expert at tweaking its memory settings. It is simply not supported.
Regarding the troubleshooting:
The Export Logs output after the Full Test would be very helpful; it contains much more information. It is a zip file, so feel free to send it to us.
Thank you for your suggestion. I will give the Docker option a try in parallel, and I will also try increasing the memory to see how it behaves.
Did you happen to do that? Please look for the logs of the following application ID (please double-check that the application type is "MAPREDUCE" and the application name is "Radoop Import CSV job"). It should look something like this:
The details of the application, including its logs, are then available after clicking on the application ID. If the error is not straightforward, please send all of the logs, i.e. container-localizer-syslog, stdout, stderr, syslog, etc.
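If it is easier from a terminal, the aggregated container logs can also be pulled with the YARN CLI, assuming log aggregation is enabled (the application ID below is a made-up placeholder; substitute the real one from the ResourceManager UI at http://${yarn.resourcemanager.hostname}:8088):

    # Fetch all container logs for the failed application (placeholder ID)
    yarn logs -applicationId application_1541357744748_0001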