• Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1.0
    • Component/s: Cloud Provisioner
    • Labels:
    • Release Notes:
      Added safeguards to prevent cloud pipeline runs from getting stuck in certain edge cases.


      We've seen cloud pipelines get stuck for hours in a single state, either starting or running. Safeguards should be in place to prevent this.

      One case is the RuntimeMonitorServer getting stuck in its shutdown() method, waiting for a shutdown signal that the master, for some reason, never sends. The server should instead wait with a timeout, since the shutdown signal is expected to arrive soon after the server enters that state.
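
      The bounded wait described above could look roughly like the following. This is a minimal sketch in Python for brevity, not the actual CDAP code (which is Java); the class name MonitorServerSketch, the signal_shutdown() method, and the timeout value are all hypothetical.

```python
import threading


class MonitorServerSketch:
    """Hypothetical stand-in for a server that waits for a shutdown signal."""

    def __init__(self, shutdown_timeout_secs: float):
        # Assumed configuration value; the real timeout is not given in this issue.
        self._shutdown_timeout_secs = shutdown_timeout_secs
        self._shutdown_signal = threading.Event()
        self.stopped = False

    def signal_shutdown(self) -> None:
        """Called when the master sends the shutdown signal."""
        self._shutdown_signal.set()

    def shutdown(self) -> bool:
        """Wait for the signal, but never longer than the timeout.

        Returns True if the signal arrived, False if the wait timed out.
        Either way the server stops instead of blocking forever.
        """
        received = self._shutdown_signal.wait(self._shutdown_timeout_secs)
        self.stopped = True
        return received
```

      With this shape, a missing signal costs at most the timeout: shutdown() returns False and the server still stops, and the caller can log the timeout as a warning.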

      Another case is a pipeline stuck in the running state, where the logs end abruptly with:

      2019-08-01 12:02:49,700 - INFO  [runtime-startup-59:i.c.c.i.a.r.d.r.RemoteExecutionTwillPreparer@456] - Starting runnable DataPipelineWorkflow for runId xxx with SSH
      2019-08-01 12:02:49,721 - INFO  [runtime-startup-59:i.c.c.i.a.r.d.AbstractTwillProgramController@69] - Twill program running: xxx, twill runId: e3cb088e-b453-11e9-9d14-ca841a0c7ad3
      2019-08-01 12:02:49,722 - DEBUG [runtime-startup-59:i.c.c.i.a.r.d.DistributedProgramRunner@610] - Cleanup tmp files for xxx: /var/tmp/cdap/data/tmp/1564660952343-0
      2019-08-01 12:02:49,741 - DEBUG [runtime-startup-59:i.c.c.i.a.r.d.r.RemoteExecutionTwillRunnerService@541] - Startup task completed for program run xxx

      This seems to indicate that the master cannot pull any information from the RuntimeMonitorServer. In this situation, if the master is no longer receiving heartbeats, it should kill the remote program and transition the run to the failed state.
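
      The heartbeat safeguard proposed above can be sketched as a small watchdog on the master side. This is an illustration in Python under stated assumptions, not the fix that was actually committed: HeartbeatWatchdog, RunState, the kill_remote callback, and the timeout are hypothetical names and values.

```python
import time
from enum import Enum
from typing import Callable


class RunState(Enum):
    RUNNING = "RUNNING"
    FAILED = "FAILED"


class HeartbeatWatchdog:
    """Hypothetical master-side check that fails a run when heartbeats stop."""

    def __init__(
        self,
        timeout_secs: float,
        kill_remote: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ):
        self._timeout_secs = timeout_secs  # assumed value, not from the issue
        self._kill_remote = kill_remote    # assumed hook that stops the remote program
        self._clock = clock                # injectable so the logic can be tested
        self._last_heartbeat = clock()
        self.state = RunState.RUNNING

    def on_heartbeat(self) -> None:
        """Record a heartbeat received from the remote runtime."""
        if self.state is RunState.RUNNING:
            self._last_heartbeat = self._clock()

    def check(self) -> RunState:
        """Kill the remote program and fail the run once heartbeats are overdue."""
        overdue = self._clock() - self._last_heartbeat > self._timeout_secs
        if self.state is RunState.RUNNING and overdue:
            self._kill_remote()
            self.state = RunState.FAILED
        return self.state
```

      check() would be called periodically by the master. Once the run is marked failed, further calls do not kill the remote program again.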




            • Assignee:
              ashau Albert Shau


              • Created: