We have seen program runs fail because YARN kills the container. In the particular instance I was looking at, the Workflow driver container was getting killed after 10 minutes, but the YARN logs had already disappeared by the time I got to the cluster, so I could not determine the root cause of the failure.
In any case, when this happens, the Twill jars are not cleaned up correctly. You can see a bunch of them left behind under /cdap/twill. This can quickly fill up HDFS if you have many scheduled workflows and they all fail in this way for a period of time.
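
To get a rough idea of how much has been left behind, something like the following could be run over a directory listing. This is only a sketch: it assumes the usual eight-column `hdfs dfs -ls` output (permissions, replication, owner, group, size, date, time, path), the function names `parse_ls_line` and `stale_paths` are made up for illustration, and the one-day threshold is arbitrary. It only reports candidates and does not delete anything.

```python
from datetime import datetime, timedelta


def parse_ls_line(line):
    """Return (path, modified) for one `hdfs dfs -ls` entry, or None otherwise.

    Assumes the eight-column listing format; header lines such as
    "Found N items" do not match and are skipped.
    """
    parts = line.strip().split(None, 7)
    if len(parts) != 8:
        return None
    try:
        modified = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
    except ValueError:
        return None
    return parts[7], modified


def stale_paths(ls_lines, now, max_age=timedelta(days=1)):
    """List paths whose modification time is older than max_age (arbitrary default)."""
    stale = []
    for line in ls_lines:
        entry = parse_ls_line(line)
        if entry is not None and now - entry[1] > max_age:
            stale.append(entry[0])
    return stale
```

The input would be the lines printed by `hdfs dfs -ls /cdap/twill`, with `now` in the same time zone as the listing. An old entry is only a candidate: a long-running program may legitimately still be using its directory, so the result should be checked against the applications currently running in YARN before anything is removed.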