CDAP-13031: Add setting for cache level in Spark pipelines

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Pipelines

      Description

      RDDs are cached in pipelines to prevent recomputation. Spark's default cache level (MEMORY_ONLY) is used, so RDD partitions that do not fit in memory are evicted and must be recomputed from lineage. This produces confusing metrics, since records are processed more than once, and can lengthen runtimes when transformation time is significant compared to IO.
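
      As context, a minimal standalone sketch (Spark's Java API, not CDAP pipeline code) of the difference between the default level and a serialized, disk-spilling one:

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.storage.StorageLevel;

      import java.util.Arrays;

      public class CacheLevelDemo {
        public static void main(String[] args) {
          JavaSparkContext sc = new JavaSparkContext(
              new SparkConf().setAppName("cache-level-demo").setMaster("local[*]"));
          JavaRDD<String> records = sc.parallelize(Arrays.asList("a", "b", "c"));

          // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY()):
          // partitions that do not fit in memory are dropped and silently
          // recomputed from lineage the next time they are needed.
          records.cache();

          // A serialized, disk-spilling level avoids recomputation: partitions
          // that do not fit in memory are spilled to local disk instead.
          records.unpersist();
          records.persist(StorageLevel.MEMORY_AND_DISK_SER());

          System.out.println(records.count());
          sc.stop();
        }
      }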

      Users should be able to configure the cache level for Spark programs. It should also be possible to override the setting through preferences/runtime arguments, since the right cache level is often only discovered after the pipeline is deployed.

      We should also consider whether to use a different default, such as MEMORY_AND_DISK_SER.
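
      A sketch of what the override could look like, covering both the runtime-argument mechanism and the proposed default. The argument key name is hypothetical (the real one would be chosen during implementation); StorageLevel.fromString accepts the standard level names:

      import org.apache.spark.storage.StorageLevel;

      import java.util.Map;

      public class CacheLevelResolver {
        // Hypothetical runtime argument / preference key; the real name is TBD.
        static final String CACHE_LEVEL_ARG = "spark.pipeline.cache.level";

        // Resolve the cache level from runtime arguments, falling back to the
        // proposed default. StorageLevel.fromString throws
        // IllegalArgumentException for unrecognized names, surfacing typos early.
        static StorageLevel resolve(Map<String, String> runtimeArgs) {
          String level = runtimeArgs.getOrDefault(CACHE_LEVEL_ARG, "MEMORY_AND_DISK_SER");
          return StorageLevel.fromString(level);
        }
      }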

      For very advanced users, it may even make sense to allow specifying the cache level per pipeline stage, as in the sketch below.
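
      Per-stage levels could reuse the same mechanism with a stage-suffixed key; the key scheme below is purely illustrative:

      import org.apache.spark.storage.StorageLevel;

      import java.util.Map;

      public class StageCacheLevels {
        // Hypothetical scheme: a global key plus per-stage overrides such as
        // "spark.pipeline.cache.level.joinProducts". All names are illustrative.
        static StorageLevel forStage(Map<String, String> args, String stageName) {
          String global = args.getOrDefault("spark.pipeline.cache.level",
                                            "MEMORY_AND_DISK_SER");
          String level = args.getOrDefault("spark.pipeline.cache.level." + stageName,
                                           global);
          return StorageLevel.fromString(level);
        }
      }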

    People

    • Assignee: Bhooshan Mogal (bhooshan)
    • Reporter: Albert Shau (ashau)
    • Votes: 0
    • Watchers: 1
