CDAP / CDAP-15799

Program stop can fail at the start of a run

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: App Fabric
    • Labels:
      None

      Description

      If a remote run is stopped early enough during its startup, the stop request fails with:

      2019-08-27 09:50:20,947 - ERROR [appfabric-executor-67:i.c.c.c.HttpExceptionHandler@70] - Unexpected error: request=POST /v3/namespaces/default/apps/gcs_source_test_v3/workflows/DataPipelineWorkflow/runs/66532beb-c8ea-11e9-9901-02423c33507f/stop user=<null>:
      java.util.concurrent.ExecutionException: 1 out of 1 runs of the program program:default.gcs_source_test_v3.-SNAPSHOT.workflow.DataPipelineWorkflow failed to stop
      	at io.cdap.cdap.internal.app.services.ProgramLifecycleService.stop(ProgramLifecycleService.java:619) ~[na:na]
      	at io.cdap.cdap.gateway.handlers.ProgramLifecycleHttpHandler.performRunLevelStop(ProgramLifecycleHttpHandler.java:302) ~[na:na]
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_181-google-v7]
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_181-google-v7]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_181-google-v7]
      	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_181-google-v7]
      	at io.cdap.http.internal.HttpMethodInfo.invoke(HttpMethodInfo.java:82) ~[io.cdap.http.netty-http-1.2.0.jar:na]
      	at io.cdap.http.internal.HttpDispatcher.channelRead(HttpDispatcher.java:45) [io.cdap.http.netty-http-1.2.0.jar:na]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.cdap.http.internal.NonStickyEventExecutorGroup$NonStickyOrderedEventExecutor.run(NonStickyEventExecutorGroup.java:254) [io.cdap.http.netty-http-1.2.0.jar:na]
      	at io.netty.util.concurrent.UnorderedThreadPoolEventExecutor$NonNotifyRunnable.run(UnorderedThreadPoolEventExecutor.java:277) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_181-google-v7]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_181-google-v7]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_181-google-v7]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:295) [na:1.8.0_181-google-v7]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_181-google-v7]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181-google-v7]
      	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181-google-v7]
      Caused by: java.lang.RuntimeException: java.io.IOException: No runtime monitor server address
      	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:190) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.AbstractTwillProgramController.doStop(AbstractTwillProgramController.java:147) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.AbstractProgramController.lambda$stop$3(AbstractProgramController.java:161) ~[na:na]
      	... 3 common frames omitted
      Caused by: java.io.IOException: No runtime monitor server address
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.resolveURL(RuntimeMonitorClient.java:253) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.connect(RuntimeMonitorClient.java:264) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.postCall(RuntimeMonitorClient.java:139) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.requestStop(RuntimeMonitorClient.java:132) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitor.requestStop(RuntimeMonitor.java:165) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:188) ~[na:na]
      	... 5 common frames omitted
      

      This happens after the jars have been copied to the remote host but before the remote RuntimeMonitorServer has started. A second stop request issued after that point eventually succeeds, but it appears to trigger a separate bug that produces repeated error logs:

      2019-08-27 09:54:12,853 - ERROR [appfabric-executor-82:i.c.c.a.r.AbstractProgramRuntimeService@604] - Unable to create ProgramController for program program:default.gcs_source_test_v3.-SNAPSHOT.workflow.DataPipelineWorkflow for twill application 'unknown twill application'. It is likely caused by invalid CDAP program runtime extension.
      2019-08-27 09:54:13,959 - ERROR [appfabric-executor-82:i.c.c.a.r.AbstractProgramRuntimeService@604] - Unable to create ProgramController for program program:default.gcs_source_test_v3.-SNAPSHOT.workflow.DataPipelineWorkflow for twill application 'unknown twill application'. It is likely caused by invalid CDAP program runtime extension.
      2019-08-27 09:54:15,180 - ERROR [appfabric-executor-82:i.c.c.a.r.AbstractProgramRuntimeService@604] - Unable to create ProgramController for program program:default.gcs_source_test_v3.-SNAPSHOT.workflow.DataPipelineWorkflow for twill application 'unknown twill application'. It is likely caused by invalid CDAP program runtime extension.
      

      The repeated errors occur because the ProgramRunner is a DefaultProgramRunnerFactory.LocalProgramRunner, which is not a ProgramControllerCreator.
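
      To illustrate the kind of type check that would produce this error, here is a minimal sketch. The interfaces and classes below are simplified, hypothetical stand-ins that only borrow the names from this issue; they are not the real CDAP signatures, and the actual check in AbstractProgramRuntimeService may differ.

```java
import java.util.Optional;

// Hypothetical, simplified stand-ins for the CDAP types named in this issue.
interface ProgramRunner { }

interface ProgramControllerCreator {
  String createProgramController(String runId);
}

// Plays the role of DefaultProgramRunnerFactory.LocalProgramRunner:
// a runner that does NOT implement ProgramControllerCreator.
final class LocalRunner implements ProgramRunner { }

// Plays the role of a runner that can recreate a controller for a run.
final class DistributedRunner implements ProgramRunner, ProgramControllerCreator {
  @Override
  public String createProgramController(String runId) {
    return "controller-for-" + runId;
  }
}

final class ControllerLookupSketch {
  // Returns a controller only if the runner is also a ProgramControllerCreator;
  // otherwise returns empty, which is the case that would be logged as
  // "Unable to create ProgramController".
  static Optional<String> tryCreateController(ProgramRunner runner, String runId) {
    if (!(runner instanceof ProgramControllerCreator)) {
      return Optional.empty();
    }
    return Optional.of(((ProgramControllerCreator) runner).createProgramController(runId));
  }

  public static void main(String[] args) {
    System.out.println(tryCreateController(new LocalRunner(), "run-1").isPresent());
    System.out.println(tryCreateController(new DistributedRunner(), "run-1").isPresent());
  }
}
```

      If the lookup is retried on a timer while the runner type never changes, the same failure would be logged on every pass, which is consistent with the roughly one-per-second error lines above.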

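      For the first failure (a stop request arriving before the remote RuntimeMonitorServer is up), the description notes that stopping again later eventually succeeds. A bounded retry around the stop request is one possible caller-side workaround, sketched below. StopRequest and stopWithRetry are hypothetical names for illustration, not CDAP APIs, and this is not a proposed fix for the underlying bug.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

final class StopRetrySketch {
  // Hypothetical stand-in for the call that asks the remote runtime to stop.
  interface StopRequest {
    void send() throws IOException;
  }

  // Sends the stop request, retrying on IOException up to maxAttempts times
  // and sleeping delayMillis between attempts. Returns the number of attempts
  // used; rethrows the last IOException if every attempt fails.
  static int stopWithRetry(StopRequest request, int maxAttempts, long delayMillis)
      throws IOException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        request.send();
        return attempt;
      } catch (IOException e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(delayMillis);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Simulates a monitor server whose address becomes available on the third call.
    AtomicInteger calls = new AtomicInteger();
    StopRequest notReadyTwice = () -> {
      if (calls.incrementAndGet() < 3) {
        throw new IOException("No runtime monitor server address");
      }
    };
    System.out.println("stopped after " + stopWithRetry(notReadyTwice, 5, 10L) + " attempts");
  }
}
```

      The retry is deliberately bounded so that a run whose monitor server never starts still surfaces the original IOException instead of blocking the stop call indefinitely.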
            People

            • Assignee:
              trishka Trishka
              Reporter:
              ashau Albert Shau
