CDAP-13218

Stop workflow causes incorrect program states

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3.3
    • Fix Version/s: 4.3.4
    • Component/s: Workflow
    • Labels:
      None
    • Release Notes:
      Fixed the state of Workflow, MapReduce, and Spark programs so that it is correctly reported as KILLED when a user explicitly terminates a running program.

      Description

      To reproduce:

      1. Deploy a simple stream -> tpfsavro batch MapReduce pipeline.

      2. Start the pipeline.

      3. Stop the pipeline after the workflow is running in YARN, but before the MapReduce program is running in YARN.

      4. The workflow state transitions to 'failed'. This is incorrect; it should be 'killed'.

      5. The master will log an error like:

      2018-03-21 19:53:19,410 - ERROR [program-status-subscriber-task:c.c.c.i.a.s.AppMetadataStore@628] - No run record meta for program 'program:default.asdf.-SNAPSHOT.workflow.DataPipelineWorkflow' pid '4c2b7242-2d41-11e8-94b8-42010afa0012' exists. Skip recording program killed.
      

      6. The MapReduce run record stays in the 'starting' state until the run record corrector fixes it.

      There are several bugs in this scenario.

      The first is that the workflow run state ends up as 'failed' instead of 'killed'. I think the 'failed' event is being emitted by the program container.

      The second is that two state events are triggered, one for 'failed' and one for 'killed'. The 'killed' event comes from the CDAP master and is ignored because of CDAP-13191. Even without the bug in reading the run record, an even more confusing log message about an invalid state transition from 'failed' to 'killed' would be generated.
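To illustrate why the second event would be rejected, here is a minimal sketch of a run record that refuses transitions out of a terminal state. The names `RunStatus`, `RunRecord`, and `apply` are hypothetical and are not taken from the CDAP code base; the real transition rules may differ.

```java
// Hypothetical sketch; these names are illustrative and are not CDAP classes.
enum RunStatus {
    STARTING, RUNNING, COMPLETED, FAILED, KILLED;

    // COMPLETED, FAILED, and KILLED are treated as end states.
    boolean isTerminal() {
        return this == COMPLETED || this == FAILED || this == KILLED;
    }
}

final class RunRecord {
    private RunStatus status = RunStatus.STARTING;

    RunStatus getStatus() {
        return status;
    }

    // Applies a state event if it is legal. Returns false and leaves the
    // state unchanged when the event is rejected.
    boolean apply(RunStatus next) {
        if (status.isTerminal()) {
            return false; // e.g. 'failed' -> 'killed' is an invalid transition
        }
        if (next == RunStatus.STARTING) {
            return false; // a run never goes back to 'starting'
        }
        status = next;
        return true;
    }
}

class StateTransitionDemo {
    public static void main(String[] args) {
        RunRecord record = new RunRecord();
        record.apply(RunStatus.RUNNING);
        // The container reports 'failed' first, then the master reports 'killed'.
        boolean failedAccepted = record.apply(RunStatus.FAILED);
        boolean killedAccepted = record.apply(RunStatus.KILLED);
        System.out.println("failed accepted: " + failedAccepted);
        System.out.println("killed accepted: " + killedAccepted);
        System.out.println("final state: " + record.getStatus());
    }
}
```

With this ordering of events, whichever terminal event arrives first wins, which is why the run is left in 'failed' even though the user explicitly stopped it.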

      The third is that the workflow does not emit a 'killed' event for the MapReduce program, so the MapReduce run ends up in 'failed' instead of 'killed', and only after the run record corrector fixes it.
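The expected behavior for the third point can be sketched as follows: when the workflow is stopped, every child program that has not yet reached an end state is marked 'killed' along with the workflow itself. This is only an illustration of the desired outcome; `WorkflowStopSketch` and its methods are hypothetical and say nothing about how the actual fix is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the CDAP code base.
class WorkflowStopSketch {
    enum State { STARTING, RUNNING, COMPLETED, FAILED, KILLED }

    private State workflowState = State.RUNNING;
    private final Map<String, State> children = new LinkedHashMap<>();

    void childStarting(String name) {
        children.put(name, State.STARTING);
    }

    void childCompleted(String name) {
        children.put(name, State.COMPLETED);
    }

    // On an explicit stop, mark every child that is still starting or
    // running as KILLED, then mark the workflow itself as KILLED.
    void stop() {
        for (Map.Entry<String, State> entry : children.entrySet()) {
            State s = entry.getValue();
            if (s == State.STARTING || s == State.RUNNING) {
                entry.setValue(State.KILLED);
            }
        }
        workflowState = State.KILLED;
    }

    State workflowState() {
        return workflowState;
    }

    State childState(String name) {
        return children.get(name);
    }

    public static void main(String[] args) {
        WorkflowStopSketch workflow = new WorkflowStopSketch();
        // The MapReduce node has been launched but is not yet running in YARN.
        workflow.childStarting("mapreduce");
        workflow.stop();
        System.out.println("workflow: " + workflow.workflowState());
        System.out.println("mapreduce: " + workflow.childState("mapreduce"));
    }
}
```

Children that already completed keep their state; only runs still in 'starting' or 'running' are affected by the stop.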

              People

              • Assignee:
                Terence Yim (terence)
              • Reporter:
                Albert Shau (ashau)