CDAP / CDAP-13934

Missing delegation tokens upon launch of 'master.services' yarn application

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.0
    • Fix Version/s: 5.0.0
    • Component/s: Master
    • Labels:
      None

      Description

      When the master.services YARN application is launched, it is missing the HBase, MapReduce (MR), and Hive delegation tokens; only the RM and HDFS delegation tokens appear to be present at launch.

      This results in errors in the container logs, such as the following from the transaction service:

      2018-07-25 22:20:55,898 - INFO  [TransactionPruningService STARTING:o.a.t.t.TransactionPruningService@72] - Starting TransactionPruningService...
      2018-07-25 22:20:55,898 - INFO  [ThriftRPCServer:o.a.t.s.TThreadedSelectorServerWithFix@224] - Starting TThreadedSelectorServerWithFix
      2018-07-25 22:20:55,898 - INFO  [TransactionPruningService STARTING:o.a.t.t.TransactionPruningService@137] - Initializing invalid list prune plugins [data.tx.pruning.plugin]
      2018-07-25 22:20:56,292 - INFO  [TransactionPruningService STARTING:o.a.t.h.t.HBaseTransactionPruningPlugin@131] - Initializing plugin with state table cdap_system:tephra.state
      2018-07-25 22:20:56,624 - WARN  [hconnection-0x3162dbd-shared--pool1-t1:o.a.h.s.UserGroupInformation@1923] - PriviledgedActionException as:cdap (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initia
      te failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      2018-07-25 22:20:56,628 - WARN  [hconnection-0x3162dbd-shared--pool1-t1:o.a.h.h.i.RpcClientImpl@675] - Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate faile
      d [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      2018-07-25 22:20:56,634 - ERROR [hconnection-0x3162dbd-shared--pool1-t1:o.a.h.h.i.RpcClientImpl@685] - SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
      javax.security.sasl.SaslException: GSS initiate failed
              at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) ~[na:1.8.0_161]
              at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:181) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:618) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:163) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:744) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:741) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_161]
              at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_161]
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) ~[hadoop-common-2.6.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:741) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:907) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:874) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094) [hbase-protocol-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:400) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:204) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:65) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:397) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:371) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80) [hbase-client-1.2.0-cdh5.13.3.jar:na]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161]
              at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
      Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
              at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) ~[na:1.8.0_161]
              at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122) ~[na:1.8.0_161]
              at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187) ~[na:1.8.0_161]
              at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224) ~[na:1.8.0_161]
              at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212) ~[na:1.8.0_161]
              at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) ~[na:1.8.0_161]
              at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192) ~[na:1.8.0_161]
              ... 26 common frames omitted
      2018-07-25 22:20:56,849 - INFO  [leader-election-tx.service-leader:o.a.t.d.TransactionService@115] - Transaction Thrift Service didn't start on <HOSTNAME>/<IP>:40979
      2018-07-25 22:20:56,854 - ERROR [leader-election-tx.service-leader:c.c.c.c.t.AbstractMasterTwillRunnable@162] - Service org.apache.tephra.distributed.TransactionService failed
      com.google.common.util.concurrent.UncheckedExecutionException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
      Wed Jul 25 22:20:56 UTC 2018, RpcRetryingCaller{globalStartTime=1532557256377, pause=100, retries=2}, java.io.IOException: Could not set up IO Streams to <HOSTNAME>/<IP>:60020
      Wed Jul 25 22:20:56 UTC 2018, RpcRetryingCaller{globalStartTime=1532557256377, pause=100, retries=2}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: <HOSTNAME>/<IP>:60020
      
      
              at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1015) ~[com.google.guava.guava-13.0.1.jar:na]
              at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1001) ~[com.google.guava.guava-13.0.1.jar:na]
              at com.google.common.util.concurrent.AbstractService.startAndWait(AbstractService.java:220) ~[com.google.guava.guava-13.0.1.jar:na]
              at com.google.common.util.concurrent.AbstractIdleService.startAndWait(AbstractIdleService.java:106) ~[com.google.guava.guava-13.0.1.jar:na]
              at org.apache.tephra.distributed.TransactionService$1.leader(TransactionService.java:111) ~[org.apache.tephra.tephra-core-0.13.0-incubating.jar:0.13.0-incubating]
              at org.apache.twill.internal.zookeeper.LeaderElection.becomeLeader(LeaderElection.java:234) [org.apache.twill.twill-zookeeper-0.13.0.jar:0.13.0]
              at org.apache.twill.internal.zookeeper.LeaderElection.access$1900(LeaderElection.java:54) [org.apache.twill.twill-zookeeper-0.13.0.jar:0.13.0]
              at org.apache.twill.internal.zookeeper.LeaderElection$5.onSuccess(LeaderElection.java:212) [org.apache.twill.twill-zookeeper-0.13.0.jar:0.13.0]
              at org.apache.twill.internal.zookeeper.LeaderElection$5.onSuccess(LeaderElection.java:191) [org.apache.twill.twill-zookeeper-0.13.0.jar:0.13.0]
              at com.google.common.util.concurrent.Futures$6.run(Futures.java:799) [com.google.guava.guava-13.0.1.jar:na]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161]
              at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
      Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
      Wed Jul 25 22:20:56 UTC 2018, RpcRetryingCaller{globalStartTime=1532557256377, pause=100, retries=2}, java.io.IOException: Could not set up IO Streams to <HOSTNAME>/<IP>:60020
      Wed Jul 25 22:20:56 UTC 2018, RpcRetryingCaller{globalStartTime=1532557256377, pause=100, retries=2}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: <HOSTNAME>/<IP>:60020
      
      
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              ... 3 common frames omitted
      Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: <HOSTNAME>/<IP>:60020
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:708) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:907) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:874) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094) ~[hbase-protocol-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:400) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:204) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:65) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:397) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:371) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136) ~[hbase-client-1.2.0-cdh5.13.3.jar:na]
              ... 4 common frames omitted 

      Once the CDAP master pushes a delegation token update, which seems to happen soon thereafter, the next attempt of the container runs fine.

      22:21:03.944 [message-callback] INFO  o.a.t.i.y.AbstractYarnTwillService - Secure store updated from hdfs://<HOSTNAME>:8020/cdap/twill/master.services/871c3dd4-4366-41c3-ada3-65ad71fbf51e/credentials.store. 

      For comparison, logs from a v4.3.4 cluster, where all expected tokens are present at launch:

      16:08:26.204 [ApplicationMasterService] INFO  o.a.t.i.a.ApplicationMasterService - Starting runnable messaging.service in Container: [ContainerId: container_1532554860735_0002_01_000002, NodeId: <HOSTNAME>:42700, NodeHttpAddress: <HOSTNAME>:8042, Resource: <memory:512, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.250.0.5:42700 }, ]
      16:08:26.675 [ApplicationMasterService] WARN  o.apache.hadoop.security.token.Token - Cannot find class for token kind MR_DELEGATION_TOKEN
      16:08:26.676 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: MR_DELEGATION_TOKEN, Service: 10.250.0.5:10020, Ident: 00 37 63 64 61 70 2f 72 34 33 34 33 33 35 33 32 2d 31 30 30 30 2e 64 65 76 2e 63 6f 6e 74 69 6e 75 75 69 74 79 2e 6e 65 74 40 43 4f 4e 54 49 4e 55 55 49 54 59 2e 4e 45 54 04 79 61 72 6e 00 8a 01 64 d7 58 c3 92 8a 01 64 fb 65 47 92 15 02
      16:08:26.844 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: HBASE_AUTH_TOKEN, Service: 6f4d0667-5d01-4072-a660-d189b1d18bf1, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@14)
      16:08:26.844 [ApplicationMasterService] WARN  o.apache.hadoop.security.token.Token - Cannot find class for token kind HIVE_DELEGATION_TOKEN
      16:08:26.844 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: HIVE_DELEGATION_TOKEN, Service: hive.metastore.service, Ident: 00 04 63 64 61 70 04 63 64 61 70 00 8a 01 64 d7 58 b5 08 8a 01 64 fb 65 39 08 29 01
      16:08:26.844 [ApplicationMasterService] WARN  o.apache.hadoop.security.token.Token - Cannot find class for token kind HIVE_DELEGATION_TOKEN
      16:08:26.845 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: HIVE_DELEGATION_TOKEN, Service: hiveserver2ClientToken, Ident: 00 04 63 64 61 70 04 63 64 61 70 00 8a 01 64 d7 58 b6 d7 8a 01 64 fb 65 3a d7 15 01
      16:08:26.845 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: HDFS_DELEGATION_TOKEN, Service: 10.250.0.5:8020, Ident: (HDFS_DELEGATION_TOKEN token 22 for cdap)
      16:08:26.846 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: RM_DELEGATION_TOKEN, Service: 10.250.0.5:8032, Ident: (owner=cdap/<HOSTNAME>@<DOMAIN>, renewer=cdap, realUser=, issueDate=1532621294815, maxDate=1533226094815, sequenceNumber=24, masterKeyId=2)
      16:08:26.847 [ApplicationMasterService] INFO  o.a.t.i.a.RunnableProcessLauncher - Launching in container container_1532554860735_0002_01_000002 at <HOSTNAME>:42700, [$JAVA_HOME/bin/java -Djava.io.tmpdir=tmp -Dyarn.container=$YARN_CONTAINER_ID -Dtwill.runnable=$TWILL_APP_NAME.$TWILL_RUNNABLE_NAME -cp launcher.jar:$HADOOP_CONF_DIR -Xmx308m -XX:MaxPermSize=128M -verbose:gc -Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dhdp.version=2.3.4.7-4 -Dtwill.container.class.loader=co.cask.cdap.data.runtime.main.MasterServiceMainClassLoader org.apache.twill.launcher.TwillLauncher org.apache.twill.internal.container.TwillContainerMain true 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr]
      16:08:26.973 [ApplicationMasterService] INFO  o.a.t.i.a.ApplicationMasterService - Runnable messaging.service fully provisioned with 1 instances.
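
      The `Ident` hex dumps in these log lines appear to follow the Writable layout of Hadoop's `AbstractDelegationTokenIdentifier`: a leading version byte, then length-prefixed owner, renewer, and realUser strings, then vlong-encoded timestamps. As a sketch (not CDAP code), the principals can be recovered from such a dump, assuming the single-byte form of Hadoop's vint length encoding, which holds for strings this short:

```python
def parse_ident(hex_dump):
    """Decode owner/renewer/realUser from a delegation-token Ident hex dump.

    Assumes Hadoop's AbstractDelegationTokenIdentifier layout with
    single-byte vint string lengths (valid for lengths <= 127).
    """
    data = bytes.fromhex(hex_dump.replace(" ", ""))
    pos = 1  # skip the leading identifier version byte
    fields = []
    for _ in range(3):  # owner, renewer, realUser (length-prefixed UTF-8)
        length = data[pos]
        pos += 1
        fields.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return dict(zip(("owner", "renewer", "realUser"), fields))

# MR_DELEGATION_TOKEN Ident from the v4.3.4 log above
ident = ("00 37 63 64 61 70 2f 72 34 33 34 33 33 35 33 32 2d 31 30 30 30 2e "
         "64 65 76 2e 63 6f 6e 74 69 6e 75 75 69 74 79 2e 6e 65 74 40 43 4f "
         "4e 54 49 4e 55 55 49 54 59 2e 4e 45 54 04 79 61 72 6e 00 8a 01 64 "
         "d7 58 c3 92 8a 01 64 fb 65 47 92 15 02")
print(parse_ident(ident))
# → owner 'cdap/r43433532-1000.dev.continuuity.net@CONTINUUITY.NET', renewer 'yarn'
```

      (The trailing bytes are the vlong issueDate/maxDate and vint sequenceNumber/masterKeyId, which this sketch skips.)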

      Logs from a release/5.0 cluster (commit 0c31afc2f0c6075bf23e4313d9500292e1f0a7a5, Wed Jul 25 16:13:14 2018 -0700), where only the RM and HDFS tokens are present:

      22:57:18.222 [ApplicationMasterService] INFO  o.a.t.i.a.ApplicationMasterService - Starting runnable messaging.service in Container: [ContainerId: container_1532645576469_0001_01_000002, NodeId: <HOSTNAME>:42543, NodeHttpAddress: <HOSTNAME>:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.250.0.9:42543 }, ]
      22:57:19.058 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: RM_DELEGATION_TOKEN, Service: <IP>:8032, Ident: (RM_DELEGATION_TOKEN owner=cdap/<HOSTNAME>@<DOMAIN>, renewer=cdap, realUser=, issueDate=1532645820878, maxDate=1533250620878, sequenceNumber=2, masterKeyId=2)
      22:57:19.059 [ApplicationMasterService] INFO  o.a.t.i.y.AbstractYarnProcessLauncher - Launch with delegation token Kind: HDFS_DELEGATION_TOKEN, Service: 10.250.0.9:8020, Ident: (token for cdap: HDFS_DELEGATION_TOKEN owner=cdap/<HOSTNAME>@<DOMAIN>, renewer=yarn, realUser=, issueDate=1532645820657, maxDate=1533250620657, sequenceNumber=41, masterKeyId=2)
      22:57:19.060 [ApplicationMasterService] INFO  o.a.t.i.a.RunnableProcessLauncher - Launching in container container_1532645576469_0001_01_000002 at <HOSTNAME>:42543, [$JAVA_HOME/bin/java -Djava.io.tmpdir=tmp -Dyarn.container=$YARN_CONTAINER_ID -Dtwill.runnable=$TWILL_APP_NAME.$TWILL_RUNNABLE_NAME -cp launcher.jar:$HADOOP_CONF_DIR -Xmx615m -XX:+UseG1GC -verbose:gc -Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dtwill.container.class.loader=co.cask.cdap.data.runtime.main.MasterServiceMainClassLoader org.apache.twill.launcher.TwillLauncher org.apache.twill.internal.container.TwillContainerMain true 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr]
      22:57:19.157 [ApplicationMasterService] INFO  o.a.t.i.a.ApplicationMasterService - Runnable messaging.service fully provisioned with 1 instances. 
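
      The regression is visible simply by diffing the `Launch with delegation token Kind:` lines between the two logs. A quick illustration (log lines abbreviated to the token kinds; not CDAP code):

```python
import re

def token_kinds(log):
    """Collect the delegation-token kinds from Twill launcher log lines."""
    return set(re.findall(r"delegation token Kind: (\w+)", log))

# Abbreviated from the v4.3.4 launch log above
v434_log = """
Launch with delegation token Kind: MR_DELEGATION_TOKEN, Service: ...
Launch with delegation token Kind: HBASE_AUTH_TOKEN, Service: ...
Launch with delegation token Kind: HIVE_DELEGATION_TOKEN, Service: ...
Launch with delegation token Kind: HDFS_DELEGATION_TOKEN, Service: ...
Launch with delegation token Kind: RM_DELEGATION_TOKEN, Service: ...
"""
# Abbreviated from the release/5.0 launch log above
v50_log = """
Launch with delegation token Kind: RM_DELEGATION_TOKEN, Service: ...
Launch with delegation token Kind: HDFS_DELEGATION_TOKEN, Service: ...
"""

missing = token_kinds(v434_log) - token_kinds(v50_log)
print(sorted(missing))
# ['HBASE_AUTH_TOKEN', 'HIVE_DELEGATION_TOKEN', 'MR_DELEGATION_TOKEN']
```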

      I believe the cause is the removal of this code in May:
      https://github.com/caskdata/cdap/commit/4fb922d734f0469da06814ab2712ebb2d6454765#diff-ef64c290689c73ddd269891673e5281fL921

      I'm unsure why this isn't causing issues in all Kerberos-enabled, multi-node environments.

            People

            • Assignee: Ali Anwar (ali.anwar)
            • Reporter: Ali Anwar (ali.anwar)
