In remote execution mode (when a non-native compute profile is used), a program run occasionally gets stuck in the "Starting" phase, even though the actual program execution has already completed in the target cluster.
It is caused by the RuntimeMonitor being unable to fetch metadata from the RemoteRuntimeServer. The underlying failure is an SSLHandshakeException, which unfortunately did not get logged due to
The SSL exception occurs because the remote runtime server occasionally binds to a port that is already in use by the NFS kernel server. Running the netstat command confirms it:
This can happen because NFS binds with the IPV6_V6ONLY socket option. When the remote runtime monitor server starts up, the OS is therefore allowed to give it the same port for binding to the IPv4 127.0.0.1 address. When this happens, HTTPS calls to "localhost:port" are handled by the NFS server, which closes the connection, resulting in the SSL handshake error.
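The port sharing described above can be reproduced outside of the product with a few lines of Python. This is only a minimal sketch of the OS behaviour, not code from the RuntimeMonitor or the RemoteRuntimeServer; the function name bind_v6only_then_v4_loopback is hypothetical, and the sketch assumes the host has IPv6 enabled (it returns None otherwise).

```python
import socket


def bind_v6only_then_v4_loopback():
    """Bind [::]:port with IPV6_V6ONLY, then try 127.0.0.1 on the same port.

    Returns (port, shared), where shared is True when the IPv4 bind also
    succeeded, or None when IPv6 is not usable on this host.
    """
    try:
        v6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        return None  # IPv6 address family not supported
    v4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        try:
            # Stand-in for the NFS listener: IPv6 only, ephemeral port.
            v6.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
            v6.bind(("::", 0))
            v6.listen(1)
        except OSError:
            return None  # IPv6 present but not bindable (e.g. disabled)
        port = v6.getsockname()[1]
        try:
            # Stand-in for the runtime server: IPv4 loopback, same port.
            v4.bind(("127.0.0.1", port))
            v4.listen(1)
            shared = True
        except OSError:
            shared = False
        return port, shared
    finally:
        v6.close()
        v4.close()


if __name__ == "__main__":
    print(bind_v6only_then_v4_loopback())
```

If the second element of the result is True, two unrelated listeners hold the same port number at once, one per address family, which is the situation the netstat output points to.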
I first searched for why the netstat output shows no process ID and found many references pointing to the NFS kernel server. After I shut down the NFS kernel server, the RuntimeMonitor was able to fetch metadata and proceed.