Uploaded image for project: 'CDAP'
  1. CDAP
  2. CDAP-7724

Stopping a Spark streaming program does not update its run record / status

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Spark
    • Labels:
      None
    • Release Notes:
      Fixed a bug that could cause Spark and MapReduce programs to stop improperly, resulting in a failed run record instead of a killed run record.
    • Rank:
      1|hzzpe7:

      Description

      Steps to reproduce:
      1. Launch a spark streaming program (such as one that consumes from Kafka).
      2. Stop it via the REST API. It will return a 200 status code after ~40 seconds.
      3. Notice that its run record is not updated. Hitting the /status endpoint also returns RUNNING.
      4. Within a few minutes, the run record corrector will update the run record, and the status will then also be STOPPED.

      Note that if the run record corrector runs very shortly after stopping the Spark program, then this bug may not even be noticed.

      Relevant logs upon making a REST call to stop the Spark program:

      2016-12-01 08:22:42,643 - INFO  [ STOPPING:c.c.c.i.a.r.d.AbstractTwillProgramController$2@74] - Twill program terminated: program:default.kafkaStreaming.-SNAPSHOT.spark
      .DataStreamsSparkStreaming ed2720a2-b137-4343-87ee-874525d2ec6e
      2016-12-01 08:25:27,894 - WARN  [pool-5-thread-1:c.c.c.i.a.s.ProgramLifecycleService@858] - Found 1 RunRecords with RUNNING status and the program not actually running 
      for program type Spark
      2016-12-01 08:25:27,909 - WARN  [pool-5-thread-1:c.c.c.i.a.s.ProgramLifecycleService@873] - Fixed RunRecord 036eb9ac-b79f-11e6-9eda-42010a8e000b for program program:def
      ault.kafkaStreaming.-SNAPSHOT.spark.DataStreamsSparkStreaming with RUNNING status because the program was not actually running
      2016-12-01 08:25:27,915 - INFO  [pool-5-thread-1:c.c.c.i.a.s.ProgramLifecycleService@790] - Corrected 1 of run records with RUNNING status but no actual program running.
      

      Logs from the yarn application we launch which launches the actual Spark Yarn app:

      2016-12-01 09:43:51,750 - INFO  [TwillContainerService:c.c.c.i.a.r.d.AbstractProgramTwillRunnable@153] - Initializing runnable: DataStreamsSparkStreaming
      2016-12-01 09:43:54,543 - INFO  [TwillContainerService:c.c.c.l.a.LogAppenderInitializer@64] - Initializing log appender KafkaLogAppender
      2016-12-01 09:43:54,684 - INFO  [TwillContainerService:c.c.c.c.g.LocationRuntimeModule@100] - HDFS namespace is /cdap
      2016-12-01 09:43:56,387 - INFO  [TwillContainerService:c.c.c.i.a.r.d.AbstractProgramTwillRunnable@197] - Runnable initialized: DataStreamsSparkStreaming
      2016-12-01 09:43:56,410 - INFO  [authorization-enforcement-service:c.c.c.s.a.AbstractAuthorizationService@102] - Started authorization enforcement service...
      2016-12-01 09:43:56,411 - INFO  [TwillContainerService:c.c.c.i.a.r.d.AbstractProgramTwillRunnable@278] - Starting runnable: DataStreamsSparkStreaming
      2016-12-01 09:43:56,536 - INFO  [TwillContainerService:c.c.c.a.r.s.SparkProgramRunner@196] - Starting Spark Job: SparkRuntimeContext{id=program:bar.kafkaStreaming.-SNAPSHOT.spark.DataStreamsSparkStreaming, runId=9fa67b50-b7aa-11e6-8f11-42010a8e000b}
      2016-12-01 09:43:56,580 - INFO  [SparkRunnerDataStreamsSparkStreaming:o.a.t.d.AbstractClientProvider@112] - Service discovered at cdap-itn-dstc56-480-16126-1000.dev.continuuity.net:46428
      2016-12-01 09:43:56,623 - INFO  [SparkRunnerDataStreamsSparkStreaming:o.a.t.d.AbstractClientProvider@120] - Attempting to connect to tx service at cdap-itn-dstc56-480-16126-1000.dev.continuuity.net:46428 with timeout 30000 ms.
      2016-12-01 09:43:56,640 - INFO  [SparkRunnerDataStreamsSparkStreaming:o.a.t.d.AbstractClientProvider@134] - Connected to tx service at cdap-itn-dstc56-480-16126-1000.dev.continuuity.net:46428
      2016-12-01 09:44:04,567 - INFO  [NettyHttpService STARTING:c.c.h.NettyHttpService@257] - Starting SparkExecutionService http service on address cdap-itn-dstc56-480-16126-1000.dev.continuuity.net/10.142.0.11:0...
      2016-12-01 09:44:04,661 - INFO  [NettyHttpService STARTING:c.c.h.NettyHttpService@262] - Started SparkExecutionService http service on address /10.142.0.11:42960
      2016-12-01 09:45:05,918 - INFO  [message-callback:c.c.c.i.a.r.d.AbstractProgramTwillRunnable@227] - Stopping runnable: DataStreamsSparkStreaming.
      2016-12-01 09:45:35,921 - WARN  [SparkExecutionService STOPPING:c.c.c.a.r.s.d.SparkExecutionService@116] - Timeout in waiting for Spark program to stop: program_run:bar.kafkaStreaming.-SNAPSHOT.spark.DataStreamsSparkStreaming.9fa67b50-b7aa-11e6-8f11-42010a8e000b
      2016-12-01 09:45:35,922 - INFO  [NettyHttpService STOPPING:c.c.h.NettyHttpService@274] - Stopping SparkExecutionService http service on address cdap-itn-dstc56-480-16126-1000.dev.continuuity.net/10.142.0.11:42960...
      2016-12-01 09:45:35,934 - INFO  [NettyHttpService STOPPING:c.c.h.NettyHttpService@288] - Stopped SparkExecutionService http service on address cdap-itn-dstc56-480-16126-1000.dev.continuuity.net/10.142.0.11:42960
      

      Note that the fix is not something as simple as increasing SparkExecutionService.SHUTDOWN_WAIT_SECONDS.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                ashau Albert Shau
                Reporter:
                ali.anwar Ali Anwar
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: