[CDAP-8367] Hive on Spark in CM Created: 07/Feb/17  Updated: 24/Feb/17  Resolved: 23/Feb/17

Status: Resolved
Project: CDAP
Component/s: CDAP, Explore, Spark
Affects Version/s: 4.0.1, 4.0.0
Fix Version/s: 4.1.0, 4.0.2

Type: Bug Priority: Blocker
Reporter: Matt Wuenschel Assignee: Ali Anwar
Resolution: Fixed Votes: 0
Labels: None

Release Notes: Fixed an issue with "Hive-on-Spark" on newer versions of CDH.

 Description   

When Hive on Spark is enabled on a cluster managed by Cloudera Manager, queries that trigger a Spark job fail to start an application in YARN.
The CDAP master log has the following error:

2017-02-07 00:28:49,635 - ERROR [HiveServer2-Background-Pool: Thread-97:o.a.h.s.c.SparkClientImpl@113] - Error while waiting for client to connect.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancel client 'ecb6be99-ea53-4121-b9fa-4be5d0ffc1c4'. Error: Child process exited before connecting back with error log Error: Cannot load main class from JAR file:/yarn/nm/usercache/cdap/appcache/application_1486425431946_0002/container_1486425431946_0002_01_000003/4
Run with --help for usage help or --verbose for debug output

	at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:106) ~[hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:99) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:95) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:67) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:111) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1969) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1682) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1419) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1203) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1198) [hive-exec-1.1.0-cdh5.9.1.jar:1.1.0-cdh5.9.1]
	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:187) [1486427023191-hive-jdbc-1.1.0-cdh5.9.1-standalone.jar:1.1.0-cdh5.9.1]
	at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:76) [1486427023191-hive-jdbc-1.1.0-cdh5.9.1-standalone.jar:1.1.0-cdh5.9.1]
	at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:241) [1486427023191-hive-jdbc-1.1.0-cdh5.9.1-standalone.jar:1.1.0-cdh5.9.1]
	at java.security.AccessController.doPrivileged(Native Method) [na:1.7.0_67]
	at javax.security.auth.Subject.doAs(Subject.java:415) [na:1.7.0_67]
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714) [hadoop-common-2.6.0-cdh5.9.1.jar:na]
	at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:254) [1486427023191-hive-jdbc-1.1.0-cdh5.9.1-standalone.jar:1.1.0-cdh5.9.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_67]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_67]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]


 Comments   
Comment by Ali Anwar [ 08/Feb/17 ]

Worked with Terence Yim and Matt Wuenschel on this a bit, and we concluded that the spark-submit shell command is launching with Java 6, since that is the default version on the PATH (so the problem is due to, or related to, the environment setup). We need to run it with the same version of Java as the explore container, since Spark does not support Java 6.
Potentially useful:
http://apache-spark-developers-list.1001551.n3.nabble.com/Setting-YARN-executors-JAVA-HOME-td18672.html
https://spark.apache.org/docs/1.3.0/configuration.html#environment-variables
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html#configuring-java8
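A minimal sketch of the idea above: since the child spark-submit process inherits its environment from the parent, the explore container can override JAVA_HOME in that environment with its own JVM's `java.home` instead of letting the child fall back to the cluster-default Java 6 on the PATH. The class name and the `echo` placeholder command are hypothetical, for illustration only.

```java
import java.util.Map;

public class JavaHomeOverride {
    public static void main(String[] args) {
        // Placeholder command; in the real scenario this would be the
        // spark-submit invocation built by SparkClientImpl.
        ProcessBuilder pb = new ProcessBuilder("echo", "spark-submit");

        // Point the child at the JVM running this (explore container) process,
        // rather than whatever Java happens to be first on the PATH.
        Map<String, String> env = pb.environment();
        env.put("JAVA_HOME", System.getProperty("java.home"));

        // JAVA_HOME is now explicitly set for the child process.
        System.out.println(env.containsKey("JAVA_HOME"));
    }
}
```

For the YARN containers that spark-submit itself launches, the equivalent would be the `spark.yarn.appMasterEnv.JAVA_HOME` and `spark.executorEnv.JAVA_HOME` Spark configuration properties.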

Comment by Ali Anwar [ 08/Feb/17 ]

On a side note, it seems that in later versions of Hive, the Spark job is launched as the 'hive' user by default:
https://github.com/cloudera/hive/blame/cdh5-1.1.0_5.10.0/spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java#L370-L378
This will be problematic if the data being queried is not readable by the 'hive' user.

Comment by Ali Anwar [ 08/Feb/17 ]

Setting the Hive logs to DEBUG level (instead of WARN) in logback-container.xml yields this information:

2017-02-08 20:57:13,480 - INFO  [HiveServer2-Background-Pool: Thread-80:o.a.h.s.c.SparkClientImpl@437] - Running client driver with argv: /data/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/spark/bin/spark-submit --principal <PRINCIPAL-REVOKED> --keytab  --executor-cores 1 --executor-memory 268435456 --properties-file /yarn/nm/usercache/cdap/appcache/application_1486516812701_0002/container_1486516812701_0002_01_000003/tmp/spark-submit.4819283235897699082.properties --class org.apache.hive.spark.client.RemoteDriver /yarn/nm/usercache/cdap/appcache/application_1486516812701_0002/container_1486516812701_0002_01_000003/hive-exec-1.1.0-cdh5.9.1.jar --remote-host <HOSTNAME-REVOKED> --remote-port 42523 --conf hive.spark.client.connect.timeout=1000 --conf hive.spark.client.server.connect.timeout=90000 --conf hive.spark.client.channel.log.level=null --conf hive.spark.client.rpc.max.size=52428800 --conf hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256 --conf hive.spark.client.rpc.server.address=null

The value for the --keytab argument is missing, which leads to the error in the JIRA description.
The value is missing because `hive.server2.authentication.kerberos.keytab` is absent from the hive-site.xml available to the explore container (the principal property is present, though). Additionally, even if it were available, the Spark job would run as the 'hive' user due to the changes in HIVE-14383. This would cause permission issues, since it should run as 'cdap' or whatever the impersonating user is.
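The missing-property half of this can be sketched as a conf merge: before handing the configuration to the explore container, copy the keytab entry over from a source configuration (e.g. the HiveServer2 hive-site.xml) if the container's copy lacks it. This is an illustration only; the class name `KeytabConfCheck` and the `withKeytab` helper are hypothetical, not CDAP's actual fix (see the PRs below).

```java
import java.util.Properties;

public class KeytabConfCheck {
    static final String KEYTAB_KEY =
        "hive.server2.authentication.kerberos.keytab";
    static final String PRINCIPAL_KEY =
        "hive.server2.authentication.kerberos.principal";

    // Return a copy of containerConf with the keytab property filled in
    // from sourceConf when the container's copy is missing it.
    static Properties withKeytab(Properties containerConf, Properties sourceConf) {
        Properties merged = new Properties();
        merged.putAll(containerConf);
        if (!merged.containsKey(KEYTAB_KEY) && sourceConf.containsKey(KEYTAB_KEY)) {
            merged.setProperty(KEYTAB_KEY, sourceConf.getProperty(KEYTAB_KEY));
        }
        return merged;
    }

    public static void main(String[] args) {
        // As observed above: principal present, keytab absent.
        Properties container = new Properties();
        container.setProperty(PRINCIPAL_KEY, "hive/_HOST@EXAMPLE.COM");

        Properties source = new Properties();
        source.setProperty(KEYTAB_KEY, "/etc/hive/conf/hive.keytab");

        System.out.println(withKeytab(container, source).getProperty(KEYTAB_KEY));
    }
}
```

With the keytab filled in, spark-submit would at least receive a value for --keytab; the run-as-'hive' problem from HIVE-14383 would still need to be handled separately.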

Comment by Ali Anwar [ 09/Feb/17 ]

Fix in:
release/4.1: https://github.com/caskdata/cdap/pull/7896 (reverted in https://github.com/caskdata/cdap/pull/7933)
release/4.0: https://github.com/caskdata/cdap/pull/7921
release/3.5: https://github.com/caskdata/cdap/pull/7922 (reverted in https://github.com/caskdata/cdap/pull/8025)

Comment by Ali Anwar [ 15/Feb/17 ]

Improved fix for release/4.0: https://github.com/caskdata/cdap/pull/7991
Merging the fix into release/4.1: https://github.com/caskdata/cdap/pull/8131

Generated at Mon Dec 17 04:30:57 UTC 2018 using Jira 7.13.0#713000-sha1:fbf406879436de2f3fb1cfa09c7fa556fb79615a.