  1. CDAP
  2. CDAP-14196

SparkCompute output is not properly cached

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Pipelines
    • Labels:
      None

      Description

      A pipeline stage that has multiple outputs is normally cached, which prevents reprocessing and misleading metrics. At some point, the code was refactored to support different types of outputs (errors, splitter output, etc.). It looks like that refactoring missed the caching logic for sparkcompute plugin types.

      To reproduce, create a simple pipeline with a sparkcompute stage that has two outputs. The records-out metric for that stage will be double what you would expect.

      As a workaround, users can place a no-op transform (for example, a projection that isn't configured to do anything) after the sparkcompute stage, and the output will then be cached properly.
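
The behavior described above can be illustrated with a toy model of lazy evaluation. This is only a sketch and not CDAP or Spark code: the `Stage` class, the `run` function, and the `records_out` counter are hypothetical names invented for illustration. It assumes that each output re-evaluates the full lineage unless an intermediate stage is cached, which is how uncached Spark RDDs behave.

```python
class Stage:
    """A lazily evaluated stage that counts every record it emits (toy model)."""

    def __init__(self, name, fn, parent=None, cached=False):
        self.name = name
        self.fn = fn
        self.parent = parent
        self.cached = cached
        self.records_out = 0  # stands in for the "records out" metric
        self._memo = None

    def compute(self, source):
        # A cached stage returns its stored output instead of recomputing.
        if self.cached and self._memo is not None:
            return self._memo
        records = source if self.parent is None else self.parent.compute(source)
        out = [self.fn(r) for r in records]
        self.records_out += len(out)
        if self.cached:
            self._memo = out
        return out


def run(noop_workaround):
    source = [1, 2, 3]
    # Hypothetical stand-in for a sparkcompute stage whose output is not cached.
    compute = Stage("sparkcompute", lambda r: r * 10, cached=False)
    upstream = compute
    if noop_workaround:
        # No-op transform placed after the sparkcompute stage; it is cached.
        upstream = Stage("noop", lambda r: r, parent=compute, cached=True)
    # Two outputs, each triggering its own evaluation of the lineage.
    sink_a = upstream.compute(source)
    sink_b = upstream.compute(source)
    return compute.records_out, sink_a, sink_b


if __name__ == "__main__":
    print(run(noop_workaround=False)[0])  # metric without the workaround
    print(run(noop_workaround=True)[0])   # metric with the no-op transform
```

In this model, three input records produce a count of six for the uncached stage with two outputs, and a count of three once the cached no-op stage sits between it and the outputs.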

            People

            • Assignee:
              ashau Albert Shau
              Reporter:
              ashau Albert Shau
