We've seen cloud pipelines get stuck for hours in either the starting or the running state. There should be safeguards in place to prevent this.
One situation we've seen is the RuntimeMonitorServer getting stuck in its shutdown() method, waiting for a shutdown signal that the master, for some reason, never sends. In this case the server should wait with a timeout rather than indefinitely, since the shutdown signal is expected to arrive soon after the server enters that state.
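A minimal sketch of the bounded wait, written in Python purely for illustration. The class name, method names, and timeout parameter are all hypothetical and are not taken from the actual RuntimeMonitorServer code; the only point is that the wait returns after a fixed time whether or not the signal arrives.

```python
import threading


class MonitorServerSketch:
    """Hypothetical stand-in for the server; not the real RuntimeMonitorServer."""

    def __init__(self, shutdown_timeout_seconds: float):
        # Assumed configuration value: how long to wait for the master's signal.
        self._shutdown_timeout = shutdown_timeout_seconds
        self._shutdown_signal = threading.Event()

    def on_shutdown_signal(self) -> None:
        # Would be called when the master's shutdown signal arrives.
        self._shutdown_signal.set()

    def shutdown(self) -> bool:
        # Wait for the signal, but never longer than the timeout.
        # Event.wait returns True if the signal arrived, False on timeout.
        received = self._shutdown_signal.wait(timeout=self._shutdown_timeout)
        # In either case, cleanup would continue here instead of blocking forever.
        return received
```

Returning whether the signal was received lets the caller log a timed-out shutdown differently from a clean one.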
Another situation is a pipeline stuck in running state, where the logs abruptly end with:
This seems to indicate that the master cannot pull any information from the RuntimeMonitorServer. In this situation, if the master is no longer getting heartbeats, it should kill the remote program and transition to the failed state.
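The heartbeat check on the master side could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the class, the state strings, the kill_remote callback, and the timeout value do not come from the actual code base.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Hypothetical master-side check; names and states are illustrative only."""

    def __init__(self, timeout_seconds: float,
                 kill_remote: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic):
        self._timeout = timeout_seconds
        self._kill_remote = kill_remote  # assumed hook that kills the remote program
        self._clock = clock
        self._last_heartbeat = clock()
        self.state = "RUNNING"

    def record_heartbeat(self) -> None:
        # Called whenever the master successfully hears from the monitor server.
        if self.state == "RUNNING":
            self._last_heartbeat = self._clock()

    def check(self) -> str:
        # Called periodically; fails the run once heartbeats have stopped for too long.
        silent_for = self._clock() - self._last_heartbeat
        if self.state == "RUNNING" and silent_for > self._timeout:
            self._kill_remote()
            self.state = "FAILED"
        return self.state
```

The clock is injectable so the timeout logic can be exercised without sleeping, and the kill hook runs at most once because the check is skipped after the transition to the failed state.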