RDDs are cached to prevent recomputation. Spark's default storage level (MEMORY_ONLY) is used, which means RDD partitions are evicted if they do not fit in memory. Evicted partitions must be recomputed, which can produce confusing metrics, since records are processed more than once. It can also lead to longer runtimes when transformation time is significant compared to I/O.
Users should be able to configure the cache level for Spark programs. It should also be possible to override the setting through preferences/runtime arguments, since the problem is often only discovered after deployment.
We should also consider whether to change the default to a more forgiving level, such as MEMORY_AND_DISK_SER, which spills serialized partitions to disk instead of evicting them.
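A minimal sketch of how a configurable cache level could be applied. The property name `spark.cdap.rdd.cacheLevel` is hypothetical; `StorageLevel.fromString` is a real Spark API that accepts names like "MEMORY_ONLY" and "MEMORY_AND_DISK_SER".

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CacheLevelSketch {
  def cacheWithConfiguredLevel(sc: SparkContext): Unit = {
    // Hypothetical property; would be populated from preferences/runtime arguments.
    // Falls back to Spark's default of MEMORY_ONLY when unset.
    val levelName = sc.getConf.get("spark.cdap.rdd.cacheLevel", "MEMORY_ONLY")

    val rdd = sc.parallelize(1 to 1000)
    // fromString throws IllegalArgumentException on an unknown level name,
    // so invalid user input fails fast at job setup rather than mid-run.
    rdd.persist(StorageLevel.fromString(levelName))
  }
}
```

With this approach, switching to MEMORY_AND_DISK_SER after deployment would only require setting a runtime argument, with no code change or redeploy.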
For advanced users, it may even make sense to allow specifying the cache level per pipeline stage, since stages can differ widely in partition size and recomputation cost.