Affects Version/s: 4.3.3
Fix Version/s: 4.3.4
Release Notes: Fixed the state of Workflow, MapReduce, and Spark programs to be correctly reflected as KILLED when a user explicitly terminates a running program.
1. Deploy a simple stream -> tpfsavro batch MapReduce pipeline.
2. Start the pipeline.
3. Stop the pipeline after the workflow is running in YARN, but before the MapReduce is running in YARN.
4. The workflow state will transition to 'failed'. This is incorrect; it should be 'killed'.
5. The master will log an error like:
6. The MapReduce run record will stay in the 'starting' state until the run record corrector fixes it.
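The start and stop in steps 2 and 3 can be scripted against CDAP's v3 program-lifecycle REST API. The sketch below only builds the request paths; the app and workflow names ('StreamToTPFS', 'DataPipelineWorkflow') are illustrative assumptions, not taken from this report, and the actual HTTP POST is left to whatever client is at hand.

```java
// Sketch: build CDAP v3 program-lifecycle REST paths for the repro steps.
// App and program names below are illustrative assumptions.
public class ProgramLifecyclePaths {

  /** Builds a v3 lifecycle path such as .../workflows/<name>/start or /stop. */
  static String lifecyclePath(String namespace, String app, String programType,
                              String program, String action) {
    return String.format("/v3/namespaces/%s/apps/%s/%s/%s/%s",
                         namespace, app, programType, program, action);
  }

  public static void main(String[] args) {
    // Step 2: start the pipeline's workflow.
    System.out.println(lifecyclePath("default", "StreamToTPFS",
                                     "workflows", "DataPipelineWorkflow", "start"));
    // Step 3: stop it while the workflow is running but before the MapReduce starts.
    System.out.println(lifecyclePath("default", "StreamToTPFS",
                                     "workflows", "DataPipelineWorkflow", "stop"));
  }
}
```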
There are several bugs in this scenario.
The first is that the workflow run state ends up as 'failed' instead of 'killed'. I think this event is being emitted by the program container.
The second is that two state events are triggered, one for 'failed' and one for 'killed'. The 'killed' event comes from the CDAP master and is ignored because of CDAP-13191. Even if there weren't a bug reading the run record, an even more confusing log about an invalid state transition from 'failed' to 'killed' would be generated.
The third is that the workflow does not emit a 'killed' event for the MapReduce program, so its run record ends up in 'failed' instead of 'killed', and only once the run record corrector fixes it.
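The fix the release note describes amounts to carrying the "user requested stop" signal through to the terminal state, for the workflow and its inner programs alike. A minimal sketch of that rule follows; the enum and method names are hypothetical, not CDAP internals.

```java
// Sketch of the intended terminal-state rule: when a run terminates because the
// user explicitly stopped it, record KILLED; otherwise keep the reported state.
// Names here are illustrative, not CDAP's internal API.
public class TerminalState {

  enum ProgramRunStatus { STARTING, RUNNING, COMPLETED, FAILED, KILLED }

  /**
   * Resolves the state to persist for a finished run. A user-initiated stop
   * wins over a FAILED report from the program container, so the workflow
   * and its inner MapReduce both end up KILLED rather than FAILED.
   */
  static ProgramRunStatus resolve(ProgramRunStatus reported, boolean userStopped) {
    if (userStopped && reported != ProgramRunStatus.COMPLETED) {
      return ProgramRunStatus.KILLED;
    }
    return reported;
  }

  public static void main(String[] args) {
    // Container reported FAILED, but the user stopped the run: persist KILLED.
    System.out.println(resolve(ProgramRunStatus.FAILED, true));   // KILLED
    // A genuine failure with no stop request stays FAILED.
    System.out.println(resolve(ProgramRunStatus.FAILED, false));  // FAILED
  }
}
```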