CDAP-12970

HBase RegionServers can get stuck starting up, waiting on ConfigurationTable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3.2, 4.3.1, 4.3.0
    • Fix Version/s: 5.0.0, 4.3.3
    • Component/s: None
    • Labels:
      None
    • Release Notes:
      Fixed a dependency bug that could cause HBase region servers to deadlock during a cold start

      Description

      The ConfigurationReader class used within CDAP coprocessors (such as TransactionStateCache) reads the Configuration HBase table and can block until that read succeeds. Because the region server may load the Configuration table after other tables, the regions of those other tables can hold onto the region server threads while they wait for it, and startup simply hangs.
      The HBase master then fails to become active and stops with:

      2017-12-07 02:37:43,076 FATAL org.apache.hadoop.hbase.master.HMaster: Failed to become active master
      java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
              at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:106)
              at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1009)
              at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:794)
              at org.apache.hadoop.hbase.master.HMaster.access$600(HMaster.java:188)
              at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1777)
              at java.lang.Thread.run(Thread.java:745)
      

      The attached 'regionserver-stdout.log' contains a thread dump of a RegionServer that shows the deadlock. Here is one of the hung threads:

      "DefaultTransactionStateCache STARTING" prio=10 tid=0x0000000001b57800 nid=0x53e0 waiting on condition [0x00007fe1242c5000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
      	at java.lang.Thread.sleep(Native Method)
      	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1482)
      	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1199)
      	at org.apache.hadoop.hbase.client.CoprocessorHConnection.locateRegion(CoprocessorHConnection.java:41)
      	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1179)
      	at org.apache.hadoop.hbase.client.CoprocessorHConnection.locateRegion(CoprocessorHConnection.java:41)
      	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1136)
      	at org.apache.hadoop.hbase.client.CoprocessorHConnection.locateRegion(CoprocessorHConnection.java:41)
      	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:971)
      	at org.apache.hadoop.hbase.client.CoprocessorHConnection.getRegionLocation(CoprocessorHConnection.java:41)
      	at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83)
      	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:79)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
      	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:935)
      	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:901)
      	at org.apache.hadoop.hbase.client.HTableWrapper.get(HTableWrapper.java:125)
      	at co.cask.cdap.data2.util.hbase.ConfigurationReader.read(ConfigurationReader.java:105)
      	at co.cask.cdap.data2.util.hbase.CoprocessorCConfigurationReader.read(CoprocessorCConfigurationReader.java:60)
      	at co.cask.cdap.data2.transaction.coprocessor.DefaultTransactionStateCache.getSnapshotConfiguration(DefaultTransactionStateCache.java:62)
      	at org.apache.tephra.coprocessor.TransactionStateCache.tryInit(TransactionStateCache.java:96)
      	at org.apache.tephra.coprocessor.TransactionStateCache.refreshState(TransactionStateCache.java:155)
      	at org.apache.tephra.coprocessor.TransactionStateCache.startUp(TransactionStateCache.java:71)
      	at com.google.common.util.concurrent.AbstractIdleService$1$1.run(AbstractIdleService.java:43)
      	at java.lang.Thread.run(Thread.java:745)
      

      As a workaround, increase hbase.regionserver.executor.openregion.threads in hbase-site.xml to 300 (the default is 3).
      You may need an even higher value if the number of regions per region server is high.
      I also increased hbase.master.namespace.init.timeout to 36000000 and hbase.bulk.assignment.perregion.open.time to 10000, though these two changes are probably not necessary.
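
The starvation described above, and why a larger open-region pool works around it, can be illustrated with a small self-contained model. This is only a sketch and not CDAP or HBase code: the class OpenRegionStarvation, the method allRegionsOpen, and the latch standing in for "the Configuration table is readable" are all hypothetical. In the real thread dump the blocked threads sleep and retry inside the HBase client rather than wait on a latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OpenRegionStarvation {

    /**
     * Hypothetical model of a region server opening regions on a fixed-size pool.
     * Returns true if every region finished opening within the timeout.
     */
    static boolean allRegionsOpen(int openRegionThreads, int dependentRegions, long timeoutMs)
            throws InterruptedException {
        ExecutorService openers = Executors.newFixedThreadPool(openRegionThreads);
        CountDownLatch configTableOnline = new CountDownLatch(1);
        CountDownLatch opened = new CountDownLatch(dependentRegions + 1);

        // Regions whose coprocessor blocks until the configuration table is readable.
        for (int i = 0; i < dependentRegions; i++) {
            openers.submit(() -> {
                try {
                    configTableOnline.await();
                    opened.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The configuration table's own region is queued last, behind the others.
        openers.submit(() -> {
            configTableOnline.countDown();
            opened.countDown();
        });

        boolean done = opened.await(timeoutMs, TimeUnit.MILLISECONDS);
        openers.shutdownNow(); // interrupt any workers that are still blocked
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        // 3 threads, 10 dependent regions: all workers block, the config region never runs.
        System.out.println(allRegionsOpen(3, 10, 1000));   // false
        // More threads than queued regions: the config region gets a worker.
        System.out.println(allRegionsOpen(300, 10, 1000)); // true
    }
}
```

In this model the pool only needs to be larger than the number of regions queued ahead of the configuration region, which matches the note that the value may have to grow with the number of regions per region server.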

              People

              • Assignee: Poorna Chandra (poorna)
              • Reporter: Ali Anwar (ali.anwar)