CDAP-3014

Failed Spark node in a Workflow causes the Workflow to restart infinitely

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: Spark, Workflow
    • Labels:
      None
    • Release Notes:
      Fixed a problem where the failure of a Spark node caused a workflow to restart infinitely.

      Description

      I believe this happens because the WorkflowDriver does not catch the exception from the ProgramController correctly. Instead, it lets the exception propagate to the top, which kills the container process (which the AM then restarts). Also, if this does not happen for MapReduce jobs, then either the MR program runner is wrong in how it reflects execution failure through the ProgramController, or the WorkflowDriver is implemented differently for MR and Spark.
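      The suspected pattern can be illustrated with a minimal sketch. It is not the CDAP code: `WorkflowFailureSketch`, `Outcome`, and `runNode` are hypothetical names, and a plain `ExecutorService` stands in for the ProgramController. The sketch only shows the general idea of turning a node failure into a terminal FAILED state instead of letting the exception escape and terminate the process.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a workflow driver; NOT the actual CDAP WorkflowDriver.
final class WorkflowFailureSketch {

  // Terminal states a node run can end in (names are assumptions for this sketch).
  enum Outcome { COMPLETED, FAILED }

  // Runs one node and converts any failure into a FAILED outcome,
  // rather than rethrowing and letting the exception reach the top of the thread.
  static Outcome runNode(Callable<Void> node) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Void> future = executor.submit(node);
      future.get();                       // throws ExecutionException if the node failed
      return Outcome.COMPLETED;
    } catch (ExecutionException e) {
      // The node's own exception is the cause; record it and stop here.
      System.err.println("Node failed: " + e.getCause());
      return Outcome.FAILED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return Outcome.FAILED;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    Outcome ok = runNode(() -> null);
    Outcome failed = runNode(() -> {
      throw new IllegalStateException("Output directory already exists");
    });
    System.out.println(ok + " " + failed); // prints "COMPLETED FAILED"
  }
}
```

      With this shape, a failed node yields a FAILED result that the caller can report, so nothing above it needs to restart the process.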

      As a side note, the exception appears to be logged multiple times (not shown below).

      This is the stack trace:

      2015-07-15 06:50:38,118 - ERROR [TwillContainerService:c.c.c.i.a.r.d.AbstractProgramTwillRunnable@320] - Program execution failed.
      java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /cdap/namespaces/default/data/ipCount/files/2015-07-15/03-31.1436931097000 already exists
      	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:294) ~[com.google.guava.guava-13.0.1.jar:na]
      	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:281) ~[com.google.guava.guava-13.0.1.jar:na]
      	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[com.google.guava.guava-13.0.1.jar:na]
      	at co.cask.cdap.internal.app.runtime.distributed.AbstractProgramTwillRunnable.run(AbstractProgramTwillRunnable.java:315) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at org.apache.twill.internal.container.TwillContainerService.doRun(TwillContainerService.java:130) [org.apache.twill.twill-yarn-0.6.0-incubating-SNAPSHOT.jar:0.6.0-incubating-SNAPSHOT]
      	at org.apache.twill.internal.AbstractTwillService.run(AbstractTwillService.java:179) [org.apache.twill.twill-core-0.6.0-incubating-SNAPSHOT.jar:0.6.0-incubating-SNAPSHOT]
      	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52) [com.google.guava.guava-13.0.1.jar:na]
      	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
      Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /cdap/namespaces/default/data/ipCount/files/2015-07-15/03-31.1436931097000 already exists
      	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeAll(WorkflowDriver.java:469) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.run(WorkflowDriver.java:450) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	... 2 common frames omitted
      Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /cdap/namespaces/default/data/ipCount/files/2015-07-15/03-31.1436931097000 already exists
      	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.7.0_75]
      	at java.util.concurrent.FutureTask.get(FutureTask.java:188) ~[na:1.7.0_75]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeFork(WorkflowDriver.java:376) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeNode(WorkflowDriver.java:410) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeAll(WorkflowDriver.java:462) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	... 3 common frames omitted
      Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /cdap/namespaces/default/data/ipCount/files/2015-07-15/03-31.1436931097000 already exists
      	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeAll(WorkflowDriver.java:469) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.access$300(WorkflowDriver.java:102) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver$2.call(WorkflowDriver.java:367) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver$2.call(WorkflowDriver.java:363) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_75]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_75]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_75]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_75]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_75]
      	... 1 common frames omitted
      Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /cdap/namespaces/default/data/ipCount/files/2015-07-15/03-31.1436931097000 already exists
      	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.7.0_75]
      	at java.util.concurrent.FutureTask.get(FutureTask.java:188) ~[na:1.7.0_75]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeAction(WorkflowDriver.java:265) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeNode(WorkflowDriver.java:407) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.executeAll(WorkflowDriver.java:462) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	... 9 common frames omitted
      Caused by: java.lang.RuntimeException: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /cdap/namespaces/default/data/ipCount/files/2015-07-15/03-31.1436931097000 already exists
      	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.AbstractProgramWorkflowRunner$1.run(AbstractProgramWorkflowRunner.java:119) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.workflow.ProgramWorkflowAction.run(ProgramWorkflowAction.java:81) ~[co.cask.cdap.cdap-api-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver$1.run(WorkflowDriver.java:254) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	... 5 common frames omitted
      Caused by: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /cdap/namespaces/default/data/ipCount/files/2015-07-15/03-31.1436931097000 already exists
      	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1011) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:998) ~[spark-assembly.jar:1.4.0]
      	at co.cask.cdap.internal.app.runtime.spark.ScalaSparkFacade.saveAsDataset(ScalaSparkFacade.java:65) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.spark.ExecutionSparkContext.writeToDataset(ExecutionSparkContext.java:207) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.examples.loganalysis.ResponseCounterProgram.run(ResponseCounterProgram.scala:54) ~[na:na]
      	at co.cask.cdap.internal.app.runtime.spark.SparkProgramWrapper.run(SparkProgramWrapper.java:60) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at co.cask.cdap.internal.app.runtime.spark.SparkProgramWrapper.main(SparkProgramWrapper.java:40) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_75]
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_75]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_75]
      	at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_75]
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) ~[spark-assembly.jar:1.4.0]
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-assembly.jar:1.4.0]
      	at co.cask.cdap.internal.app.runtime.spark.SparkRuntimeService.run(SparkRuntimeService.java:163) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52) [com.google.guava.guava-13.0.1.jar:na]
      	at co.cask.cdap.internal.app.runtime.spark.SparkRuntimeService$2$1.run(SparkRuntimeService.java:231) ~[co.cask.cdap.cdap-app-fabric-3.1.0-SNAPSHOT.jar:na]
      	... 1 common frames omitted
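      The trace above wraps the original failure in several layers of ExecutionException and RuntimeException. As a reading aid, the following minimal sketch walks a cause chain down to its innermost exception. `RootCauseSketch` and `rootCause` are hypothetical names, and `java.io.IOException` is used only as a stand-in for the Hadoop FileAlreadyExistsException so that the example has no external dependencies. Guava, which appears in the trace, offers `Throwables.getRootCause` for the same purpose.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

// Hypothetical helper for illustration; not part of CDAP.
final class RootCauseSketch {

  // Follows getCause() links until the innermost exception is reached.
  static Throwable rootCause(Throwable t) {
    Throwable current = t;
    while (current.getCause() != null) {
      current = current.getCause();
    }
    return current;
  }

  public static void main(String[] args) {
    // Stand-in for the original failure (assumption: simplified message, no path).
    Throwable original = new IOException("Output directory already exists");

    // Rebuild a wrapper chain similar in shape to the one in the trace.
    Throwable wrapped = new ExecutionException(
        new RuntimeException(
            new ExecutionException(
                new RuntimeException(original))));

    Throwable root = rootCause(wrapped);
    System.out.println(root.getClass().getSimpleName() + ": " + root.getMessage());
    // prints "IOException: Output directory already exists"
  }
}
```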
      

              People

              • Assignee: Terence Yim
              • Reporter: Terence Yim
              • Votes: 0
              • Watchers: 3
