When people see issues with CDAP stability, the cause is often memory-related: frequent GC or stop-the-world GC pauses make one or more services unresponsive. Often these GC pauses are short enough to survive the ZK timeout, so no failover happens, but long enough to cause other, seemingly random failures. This can happen in the master, in CDAP services, or even in app containers.
However, the GC pauses do not show up in the master or application logs, and users don't necessarily check the GC logs, which are in a separate file written directly by the JVM and thus bypass log collection.
It would be good if CDAP had a way to get notified by the JVM when GC happens (through JMX). Then we could log full GCs in the system and application logs, and emit metrics about the frequency and duration of GC events. This would greatly improve problem diagnostics for users.
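
A minimal sketch of how the JMX notification could be consumed, not a proposed CDAP implementation. The class name GcNotificationLogger and its methods are hypothetical. It assumes a HotSpot-style JVM where the garbage collector MXBeans implement NotificationEmitter and where the GC action string contains "major" for full collections; other collectors may report different action strings, so that check would need adjusting. The sink is a stand-in for whatever logger or metrics emitter CDAP would use.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.function.Consumer;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

// Hypothetical helper: forwards full-GC notifications from the JVM to a sink.
class GcNotificationLogger {

  // Assumption: HotSpot reports "end of major GC" / "end of minor GC" as the action.
  static boolean isMajor(String gcAction) {
    return gcAction != null && gcAction.contains("major");
  }

  // Builds the message that would go to the system or application log.
  static String describe(String gcName, String gcAction, String gcCause, long durationMillis) {
    return String.format("GC %s (%s), cause: %s, duration: %d ms",
                         gcName, gcAction, gcCause, durationMillis);
  }

  // Registers one listener on every GC MXBean that can emit notifications.
  // Returns the number of beans the listener was attached to.
  static int install(Consumer<String> sink) {
    NotificationListener listener = (Notification n, Object handback) -> {
      if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(n.getType())) {
        return;
      }
      GarbageCollectionNotificationInfo info =
          GarbageCollectionNotificationInfo.from((CompositeData) n.getUserData());
      // Only full GCs are forwarded here; a metrics emitter might record every event instead.
      if (isMajor(info.getGcAction())) {
        sink.accept(describe(info.getGcName(), info.getGcAction(), info.getGcCause(),
                             info.getGcInfo().getDuration()));
      }
    };

    int registered = 0;
    for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
      if (bean instanceof NotificationEmitter) {
        ((NotificationEmitter) bean).addNotificationListener(listener, null, null);
        registered++;
      }
    }
    return registered;
  }
}
```

A process would call GcNotificationLogger.install(...) once at startup, for example with System.out::println or a logger method reference as the sink. The duration reported by GcInfo is the elapsed time of the collection in milliseconds; for concurrent collectors this is not necessarily the length of the stop-the-world pause.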