CDAP-7388: Jobs fail during CDAP Master service failover

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.5.1
    • Fix Version/s: 4.1.0
    • Component/s: CDAP Services, Master

      Description

      Experiment:

      • Deployed two instances of the CDAP Master service (and all other services) on two different nodes.
      • Started the PurchaseHistory app from the UI on the first node, where the main (active) master service was running.
      • Killed the main master service (kill -9 MASTER_SERVICE1_PID).
      • It took around 45 seconds for the second master service to become the new main master service.
      • The Purchase job that I had launched failed.

      After 3 retries, the map task fails. The reason for the failure in each attempt is that the dataset instance could not be created:

      2016-09-28 01:04:03,516 - WARN [AsyncDispatcher event handler:o.a.h.m.v.a.MRAppMaster@91] - Sep 28, 2016 1:04:03 AM org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$DiagnosticInformationUpdater transition
      INFO: Diagnostics report from attempt_1473728860156_5643_m_000003_2: Error: co.cask.cdap.api.data.DatasetInstantiationException: Could not instantiate dataset 'elkhan:purchases'
      Caused by: com.google.common.util.concurrent.UncheckedExecutionException: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
      Caused by: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
      Caused by: co.cask.cdap.api.dataset.DatasetManagementException: Failed to create dataset instance: dataset:elkhan.purchases
      Caused by: co.cask.cdap.common.ServiceUnavailableException: Service 'DatasetService' is not available. Please wait till it is up and running.

      ...

      2016-09-28 01:03:38,792 - WARN [main:o.a.h.m.YarnChild@91] - Sep 28, 2016 1:03:38 AM org.apache.hadoop.mapred.YarnChild main
      WARNING: Exception running child : co.cask.cdap.api.data.DatasetInstantiationException: Could not instantiate dataset 'elkhan:purchases'
      Caused by: com.google.common.util.concurrent.UncheckedExecutionException: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
      Caused by: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
      Caused by: co.cask.cdap.api.dataset.DatasetManagementException: Error during talking to Dataset Service at <URL_OF_KILLED_MASTER>:45764/v3/namespaces/elkhan/data/datasets/purchases?owner=program:elkhan.PurchaseHistory.mapreduce.PurchaseHistoryBuilder while doing GET with headers null and body null
      Caused by: java.net.ConnectException: Connection refused

      Problem/Bug:
      With the current implementation, the job retries only a limited number of times, and only against the current active master, which is now dead.
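
The behaviour described above can be illustrated with a small simulation. This is only a sketch, not CDAP code: `call_fixed_endpoint`, `make_cluster`, `ConnectionRefused`, and the endpoint names `master1` and `master2` are all hypothetical stand-ins, and the limit of 3 attempts mirrors the number of failed task attempts in the report.

```python
class ConnectionRefused(Exception):
    """Stands in for java.net.ConnectException in this sketch."""


def call_fixed_endpoint(send, endpoint, max_attempts=3):
    """Retry `send` against one endpoint only, a bounded number of times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(endpoint)
        except ConnectionRefused as err:
            last_error = err  # the same dead endpoint is tried again
    raise last_error


def make_cluster(active):
    """Return a hypothetical `send` that succeeds only on the active master."""
    def send(endpoint):
        if endpoint != active:
            raise ConnectionRefused(endpoint)
        return "dataset instance from " + endpoint
    return send


# master1 was killed and master2 has taken over, but the caller never asks master2.
send = make_cluster(active="master2")
try:
    call_fixed_endpoint(send, "master1")
    outcome = "succeeded"
except ConnectionRefused:
    outcome = "failed after 3 attempts"
print(outcome)
```

Even though a healthy master exists in this simulation, the caller exhausts its attempts against the dead one, which matches the "Connection refused" root cause in the second stack trace.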

      Solution:
      If the active master service is dead, one would expect the job (or any running or in-progress service) to keep retrying against both the dead active master and the follower master. After some time the follower becomes the new active master, and the job can continue successfully with it.

      In the particular failure scenario described above, the job should retry communicating with both masters (active and follower) until the follower becomes the new active master or the old master recovers, whichever happens first, and then use the new active master's endpoint to instantiate the dataset instance.

      Otherwise the jobs will still fail, because the dataset cannot be instantiated.
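
The proposed retry behaviour could look roughly like the following. This is a minimal sketch under stated assumptions, not the fix that was applied in CDAP: `call_with_failover`, `ServiceUnavailable`, `FakeCluster`, the endpoint names, the 120-second timeout, and the 1-second delay are all hypothetical. The 45-second takeover time in the simulated cluster is taken from the experiment above.

```python
import itertools
import time


class ServiceUnavailable(Exception):
    """Stands in for connection-refused and service-unavailable errors."""


def call_with_failover(send, endpoints, timeout_s=120.0, delay_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Cycle through all known masters until one answers or the deadline passes."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    deadline = clock() + timeout_s
    last_error = None
    for endpoint in itertools.cycle(endpoints):
        try:
            return send(endpoint)
        except ServiceUnavailable as err:
            last_error = err
        if clock() >= deadline:
            raise TimeoutError(f"no master became available: {last_error}")
        sleep(delay_s)  # back off before trying the next master


class FakeCluster:
    """Hypothetical cluster: master1 is dead, master2 becomes active at t=45s."""

    def __init__(self):
        self.now = 0.0

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds  # advance simulated time instead of waiting

    def send(self, endpoint):
        if endpoint == "master2" and self.now >= 45.0:
            return "dataset instance from master2"
        raise ServiceUnavailable(endpoint)


cluster = FakeCluster()
result = call_with_failover(cluster.send, ["master1", "master2"],
                            clock=cluster.clock, sleep=cluster.sleep)
print(result)
```

Bounding the retries by elapsed time rather than by attempt count lets the caller outlast the failover window, and injecting `clock` and `sleep` keeps the simulation instantaneous.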

      Discussion from CDAP-User group:
      https://groups.google.com/forum/#!topic/cdap-user/i_Cx0AXs3-A

              People

              • Assignee: Albert Shau (ashau)
              • Reporter: Elkhan Dadashov (elkhand)