CDAP / CDAP-13119

Reduce connector datasets during planning for mapreduce pipelines

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3.4
    • Component/s: Pipelines
    • Labels: None
    • Release Notes:
      Optimized the planner to reduce the amount of temporary data required in certain types of mapreduce pipelines.

      Description

      Consider a pipeline that has a single source, then has 10 branches from that source, with each branch containing a transform, then an aggregation, then a sink.

      The current planner places a connector dataset in front of each aggregator. This means the first mapreduce job will read from the source, perform the transform on each branch, and write to 10 different temporary directories. There will then be 10 parallel mapreduce jobs that read from each temporary directory, perform an aggregation, then write to a sink.

      However, transforms are almost always much cheaper than I/O, so a more efficient plan is to place a single connector dataset right after the source, before the pipeline branches. The first mapreduce job would then consist of reading from the source and writing to a single temporary directory. There would then be 10 parallel mapreduce jobs that each read from that one temporary directory, perform a transform, then an aggregation, then write to a sink.
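      The difference between the two plans can be sketched as a connector-placement rule on the pipeline DAG. The following is a minimal illustrative sketch, not actual CDAP planner code: the node names, helper functions, and graph representation are all hypothetical. It contrasts the current strategy (one connector dataset in front of every aggregator) with the optimized one (a single connector at a fan-out point whose branches reach aggregators through transforms only).

      ```python
      from collections import defaultdict

      def build_pipeline(num_branches):
          """Build the example pipeline: one source fanning out to
          num_branches chains of transform -> aggregator -> sink."""
          edges = defaultdict(list)
          types = {"source": "source"}
          for i in range(num_branches):
              t, a, s = f"transform{i}", f"agg{i}", f"sink{i}"
              edges["source"].append(t)
              edges[t].append(a)
              edges[a].append(s)
              types[t], types[a], types[s] = "transform", "aggregator", "sink"
          return dict(edges), types

      def current_plan_connectors(edges, types):
          """Current planner: a connector dataset in front of each aggregator,
          so N branches produce N temporary directories."""
          return {n for n, kind in types.items() if kind == "aggregator"}

      def optimized_plan_connectors(edges, types):
          """Optimized planner: when a node fans out to multiple branches that
          each reach an aggregator through transforms only, place one connector
          at the fan-out point instead of one per aggregator."""
          connectors = set()
          for node, children in edges.items():
              if len(children) > 1 and all(
                      _reaches_agg_via_transforms(c, edges, types)
                      for c in children):
                  connectors.add(node)  # single temp dataset after the fan-out
          return connectors

      def _reaches_agg_via_transforms(node, edges, types):
          # Follow the (linear) chain of transforms on this branch.
          while types[node] == "transform":
              node = edges[node][0]
          return types[node] == "aggregator"

      edges, types = build_pipeline(10)
      print(len(current_plan_connectors(edges, types)))    # 10 temp datasets
      print(len(optimized_plan_connectors(edges, types)))  # 1 temp dataset
      ```

      With 10 branches, the current strategy materializes 10 temporary datasets while the optimized one materializes a single dataset after the source, which is the reduction this issue describes.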


            People

            • Assignee: Albert Shau (ashau)
            • Reporter: Albert Shau (ashau)
            • Votes: 0
            • Watchers: 1
