CDAP-11937

MapReduce status is sometimes successful even though the job failed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.1.1
    • Fix Version/s: 4.2.1, 4.1.2
    • Component/s: MapReduce
    • Labels:
    • Release Notes:
      Fixes an issue where a failed MapReduce run was marked as successful.

      Description

      We have seen cases where a MapReduce job initially reports its status as failed, then reports it as successful when we query the status again.

      This has been observed with Hadoop 2.7.3.2.5.3.0-37 and jobs that have many tasks. We have not been able to reproduce it on our end, but it happens repeatedly on one cluster.

      In that case, we see in the logs:

      2017-06-08 04:19:21,846 - INFO  [MapReduceRunner-MapReduce:c.c.c.i.a.r.b.MapReduceRuntimeService@377] - MapReduce Job is complete, status: false, job: name=..., jobId=job_1495837593383_149903, namespaceId=..., applicationId=..., program=..., runid=ae5edc13-4c26-11e7-9884-246e962f29d8
      2017-06-08 04:19:24,360 - INFO  [MapReduceRunner-MapReduce:c.c.c.i.a.r.b.MapReduceRuntimeService@398] - Committing MapReduce Job transaction: name=..., jobId=job_1495837593383_149903, namespaceId=..., applicationId=..., program=..., runid=ae5edc13-4c26-11e7-9884-246e962f29d8
      

      and this is the code in question:

          LOG.info("MapReduce Job is complete, status: {}, job: {}", job.isSuccessful(), context);
          metricsWriter.reportStats();
          TimeUnit.SECONDS.sleep(2L);
          if (!stopRequested) {
            Preconditions.checkState(job.isSuccessful(), "MapReduce execution failure: %s", job.getStatus());
          }
      

      The first call to job.isSuccessful() returns false, but the next call returns true. This should never happen. It looks like the call to get the task report, which happens in between the two, changes the status. If so, this would be a bug in Hadoop.

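      We can protect against this by recording the result of the first job.isSuccessful() call and treating the run as failed if that call returned false, regardless of what later calls return.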
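
      A minimal sketch of that idea, not the actual fix that was committed: the class JobStatusGuard, the method captureBeforeReporting, and the simulated "flaky" status are all hypothetical names used for illustration, and the real code would pass job::isSuccessful and metricsWriter::reportStats instead. The point is that the status is read exactly once, before the reporting step, and the job is never asked again.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Sketch only: JobStatusGuard and its members are hypothetical, not CDAP or Hadoop APIs. */
public class JobStatusGuard {

  /**
   * Reads the job status exactly once, then runs the follow-up work
   * (metrics reporting in the snippet above) and returns the value
   * that was captured before that work ran.
   */
  public static boolean captureBeforeReporting(BooleanSupplier isSuccessful, Runnable reportStats) {
    boolean success = isSuccessful.getAsBoolean(); // single read, taken first
    reportStats.run();                             // may change what the job reports afterwards
    return success;                                // later answers from the job are never consulted
  }

  public static void main(String[] args) {
    // Simulate the observed behaviour: the first query says "failed",
    // every later query says "successful".
    AtomicInteger calls = new AtomicInteger();
    BooleanSupplier flakyStatus = () -> calls.getAndIncrement() > 0;

    boolean success = captureBeforeReporting(flakyStatus, () -> { });
    System.out.println("captured status: " + success);                          // false
    System.out.println("status if asked again: " + flakyStatus.getAsBoolean()); // true
  }
}
```

      With the captured value, the later check would use the local boolean rather than calling job.isSuccessful() a second time, so a status that flips after the task report is fetched can no longer turn a failed run into a successful one.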

            People

            • Assignee: Andreas Neumann
            • Reporter: Andreas Neumann
            • Votes: 1
            • Watchers: 3
