  1. CDAP
  2. CDAP-12454

HBase coprocessors sometimes cannot access tables due to ZK auth failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3.0
    • Fix Version/s: 4.3.1
    • Component/s: Master
    • Release Notes:
      Fixes an issue where transaction coprocessors could sometimes not access their configuration.

      Description

      Sometimes, region servers have many messages in the logs of the form:

      2017-08-15 15:48:46,089 ERROR [tms-topic-metadata-cache-refresh] zookeeper.ZooKeeperWatcher: hconnection-0x234b6ae9-0x15b49966f34f9bb, quorum=<hostname censored>:2181,<censored>, baseZNode=/hbase-secure Received unexpected KeeperException, re-throwing exception
      org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase-secure/meta-region-server
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
              at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
              at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:622)
              at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionState(MetaTableLocator.java:491)
              at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionLocation(MetaTableLocator.java:172)
              at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:608)
              at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:589)
              at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:568)
              at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
              at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1192)
              at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1159)
              at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
              at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
              at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
              at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
              at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
              at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1256)
              at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162)
              at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1146)
              at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1103)
              at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:938)
              at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83)
              at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:79)
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
              at org.apache.hadoop.hbase.client.HTable.get(HTable.java:862)
              at org.apache.hadoop.hbase.client.HTable.get(HTable.java:828)
              at co.cask.cdap.data2.util.hbase.ConfigurationTable.read(ConfigurationTable.java:133)
              at co.cask.cdap.data2.transaction.queue.hbase.coprocessor.CConfigurationReader.read(CConfigurationReader.java:39)
              at co.cask.cdap.messaging.TopicMetadataCache.updateCache(TopicMetadataCache.java:123)
              at co.cask.cdap.messaging.TopicMetadataCache$2.run(TopicMetadataCache.java:183)
      

      When this happens, it affects the transaction state cache, the configuration table, and the prune state equally.

      The behavior is pretty bad: a CDAP coprocessor attempts to access a table; for that, it needs to locate the meta region, which fails due to the ZK auth failure (AuthFailed). Unfortunately, the HBase client does this in a blocking busy retry loop for 5 minutes, so it floods the logs for 5 minutes. Then the next coprocessor gets its turn and produces another 5 minutes of unthrottled retries and error messages.
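
      For contrast with the unthrottled loop described above, here is a minimal sketch of a bounded retry with exponential backoff. It is neither HBase client code nor the fix applied in CDAP; the class ThrottledRetry and its parameters (maxAttempts, baseMillis, maxMillis) are hypothetical names chosen for illustration.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of HBase or CDAP): bounded retries with capped exponential backoff. */
final class ThrottledRetry {

  private ThrottledRetry() { }

  /** Delay before the next try after the given 0-based attempt: baseMillis * 2^attempt, capped at maxMillis. */
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    long delay = baseMillis;
    for (int i = 0; i < attempt && delay < maxMillis; i++) {
      delay *= 2;
    }
    return Math.min(delay, maxMillis);
  }

  /**
   * Runs the action at most maxAttempts times, sleeping between failures.
   * Emits one log line per failed attempt instead of a continuous stream.
   */
  static <T> T call(Supplier<T> action, int maxAttempts, long baseMillis, long maxMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    RuntimeException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return action.get();
      } catch (RuntimeException e) {
        last = e;
        System.err.println("attempt " + (attempt + 1) + " of " + maxAttempts + " failed: " + e.getMessage());
      }
      if (attempt < maxAttempts - 1) {
        try {
          Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
        } catch (InterruptedException ie) {
          // Preserve the interrupt flag and stop retrying.
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    throw new IllegalStateException("giving up after at most " + maxAttempts + " attempts", last);
  }
}
```

      With maxAttempts = 5, baseMillis = 1000 and maxMillis = 30000, a persistently failing read would log five lines over roughly 15 seconds and then surface the error, rather than retrying without pause for 5 minutes.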

      The consequence is that coprocessors cannot read the transaction state or the configuration. Hence, for example, they cannot find out whether tx pruning is enabled, and they never record prune info.

      There is a way to impersonate the login user when accessing a table from a coprocessor. That appears to fix the problem.
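
      The impersonation pattern mentioned above could be sketched as follows. This is an illustration using only the JDK, not the actual CDAP change: Identity, LoginUserReads and FakeIdentity are hypothetical names. In a real coprocessor the Identity role would presumably be played by Hadoop's UserGroupInformation.getLoginUser(), whose doAs method takes a PrivilegedExceptionAction in the same way; that mapping is an assumption about what the description refers to.

```java
import java.security.PrivilegedExceptionAction;
import java.util.concurrent.Callable;

/** Hypothetical stand-in for an identity that can run an action as itself. */
interface Identity {
  <T> T doAs(PrivilegedExceptionAction<T> action) throws Exception;
}

/** Hypothetical wrapper: every table read goes through the login user's doAs. */
final class LoginUserReads {
  private final Identity loginUser;

  LoginUserReads(Identity loginUser) {
    this.loginUser = loginUser;
  }

  /** Runs the given read under the login identity instead of the caller's identity. */
  <T> T read(Callable<T> tableRead) throws Exception {
    return loginUser.doAs(() -> tableRead.call());
  }
}

/** In-memory fake used only to demonstrate the pattern; tracks the "current user" per thread. */
final class FakeIdentity implements Identity {
  static final ThreadLocal<String> CURRENT = ThreadLocal.withInitial(() -> "request-user");
  private final String name;

  FakeIdentity(String name) {
    this.name = name;
  }

  @Override
  public <T> T doAs(PrivilegedExceptionAction<T> action) throws Exception {
    String previous = CURRENT.get();
    CURRENT.set(name);
    try {
      return action.run();
    } finally {
      // Restore the caller's identity even if the action fails.
      CURRENT.set(previous);
    }
  }
}
```

      The point of routing reads through one wrapper is that every coprocessor cache (transaction state, configuration, prune state) picks up the same identity handling without duplicating it.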

      Or is there an even better way to access a table from a coprocessor than using an HBase client? Is it possible via the coprocessor environment?

      We need to investigate and fix this for all coprocessors.

              People

              • Assignee:
                Andreas Neumann
              • Reporter:
                Andreas Neumann