Uploaded image for project: 'CDAP'
  1. CDAP
  2. CDAP-16519

Pipeline stops fail with weird messages

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.3.0
    • Labels:
      None
    • Rank:
      1|i00vb3:

      Description

      When stopping a pipeline, it is possible to get many different types of errors, depending on the state of the run. When the RuntimeMonitor is not yet available, it can fail with:

      2020-03-30 08:51:08,070 - ERROR [appfabric-executor-360:i.c.c.c.HttpExceptionHandler@70] - Unexpected error: request=POST /v3/namespaces/default/apps/Wiki_Deletions_v1/workflows/DataPipelineWorkflow/runs/9960e2d7-725d-11ea-b34c-22fcfb490df5/stop user=<null>:
      java.util.concurrent.ExecutionException: 1 out of 1 runs of the program program:default.Wiki_Deletions_v1.-SNAPSHOT.workflow.DataPipelineWorkflow failed to stop
      	at io.cdap.cdap.internal.app.services.ProgramLifecycleService.stop(ProgramLifecycleService.java:665) ~[na:na]
      	at io.cdap.cdap.gateway.handlers.ProgramLifecycleHttpHandler.performRunLevelStop(ProgramLifecycleHttpHandler.java:304) ~[na:na]
      	at sun.reflect.GeneratedMethodAccessor170.invoke(Unknown Source) ~[na:na]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_242]
      	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_242]
      	at io.cdap.http.internal.HttpMethodInfo.invoke(HttpMethodInfo.java:82) ~[io.cdap.http.netty-http-1.3.1.jar:na]
      	at io.cdap.http.internal.HttpDispatcher.channelRead(HttpDispatcher.java:45) [io.cdap.http.netty-http-1.3.1.jar:na]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.cdap.http.internal.NonStickyEventExecutorGroup$NonStickyOrderedEventExecutor.run(NonStickyEventExecutorGroup.java:254) [io.cdap.http.netty-http-1.3.1.jar:na]
      	at io.netty.util.concurrent.UnorderedThreadPoolEventExecutor$NonNotifyRunnable.run(UnorderedThreadPoolEventExecutor.java:277) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_242]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_242]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_242]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_242]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_242]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_242]
      	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_242]
      Caused by: java.lang.RuntimeException: java.io.IOException: Failed to talk to runtime monitor. Response code is 500. Error is Stopping not allowed for program:default.Wiki_Deletions_v1.-SNAPSHOT.workflow.DataPipelineWorkflow-9960e2d7-725d-11ea-b34c-22fcfb490df5 in STOPPING
      	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:190) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.AbstractTwillProgramController.doStop(AbstractTwillProgramController.java:147) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.AbstractProgramController.lambda$stop$3(AbstractProgramController.java:161) ~[na:na]
      	... 3 common frames omitted
      Caused by: java.io.IOException: Failed to talk to runtime monitor. Response code is 500. Error is Stopping not allowed for program:default.Wiki_Deletions_v1.-SNAPSHOT.workflow.DataPipelineWorkflow-9960e2d7-725d-11ea-b34c-22fcfb490df5 in STOPPING
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.throwIfNotOK(RuntimeMonitorClient.java:242) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.postCall(RuntimeMonitorClient.java:142) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.requestStop(RuntimeMonitorClient.java:132) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitor.requestStop(RuntimeMonitor.java:165) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:188) ~[na:na]
      	... 5 common frames omitted
      

      r

      Other times, if stopping takes a while, it can fail with:

      2020-03-30 18:55:46,648 - ERROR [appfabric-executor-1812:i.c.c.c.HttpExceptionHandler@70] - Unexpected error: request=POST /v3/namespaces/default/apps/Wiki_Deletions_v1/workflows/DataPipelineWorkflow/runs/7a091a74-72b7-11ea-bb20-22fcfb490df5/stop user=<null>:
      java.util.concurrent.ExecutionException: 1 out of 1 runs of the program program:default.Wiki_Deletions_v1.-SNAPSHOT.workflow.DataPipelineWorkflow failed to stop
      	at io.cdap.cdap.internal.app.services.ProgramLifecycleService.stop(ProgramLifecycleService.java:665) ~[na:na]
      	at io.cdap.cdap.gateway.handlers.ProgramLifecycleHttpHandler.performRunLevelStop(ProgramLifecycleHttpHandler.java:304) ~[na:na]
      	at sun.reflect.GeneratedMethodAccessor170.invoke(Unknown Source) ~[na:na]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_242]
      	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_242]
      	at io.cdap.http.internal.HttpMethodInfo.invoke(HttpMethodInfo.java:82) ~[io.cdap.http.netty-http-1.3.1.jar:na]
      	at io.cdap.http.internal.HttpDispatcher.channelRead(HttpDispatcher.java:45) [io.cdap.http.netty-http-1.3.1.jar:na]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at io.cdap.http.internal.NonStickyEventExecutorGroup$NonStickyOrderedEventExecutor.run(NonStickyEventExecutorGroup.java:254) [io.cdap.http.netty-http-1.3.1.jar:na]
      	at io.netty.util.concurrent.UnorderedThreadPoolEventExecutor$NonNotifyRunnable.run(UnorderedThreadPoolEventExecutor.java:277) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_242]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_242]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_242]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_242]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_242]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_242]
      	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_242]
      Caused by: java.lang.RuntimeException: java.net.SocketTimeoutException: Read timed out
      	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:190) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.AbstractTwillProgramController.doStop(AbstractTwillProgramController.java:147) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.AbstractProgramController.lambda$stop$3(AbstractProgramController.java:161) ~[na:na]
      	... 3 common frames omitted
      Caused by: java.net.SocketTimeoutException: Read timed out
      	at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0_242]
      	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[na:1.8.0_242]
      	at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[na:1.8.0_242]
      	at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_242]
      	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[na:1.8.0_242]
      	at sun.security.ssl.InputRecord.read(InputRecord.java:503) ~[na:1.8.0_242]
      	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) ~[na:1.8.0_242]
      	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) ~[na:1.8.0_242]
      	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) ~[na:1.8.0_242]
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[na:1.8.0_242]
      	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[na:1.8.0_242]
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_242]
      	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) ~[na:1.8.0_242]
      	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) ~[na:1.8.0_242]
      	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) ~[na:1.8.0_242]
      	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[na:1.8.0_242]
      	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) ~[na:1.8.0_242]
      	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352) ~[na:1.8.0_242]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.postCall(RuntimeMonitorClient.java:142) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.requestStop(RuntimeMonitorClient.java:132) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitor.requestStop(RuntimeMonitor.java:165) ~[na:na]
      	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:188) ~[na:na]
      	... 5 common frames omitted
      

      This is pretty common when the underlying Spark YARN application is larger, or is in the process of dying anyway due to OOM errors.

        Attachments

          Activity

            People

            • Assignee:
              mikkin Mikkin Patel
              Reporter:
              ashau Albert Shau
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: