  CDAP / CDAP-13119

Reduce connector datasets during planning for mapreduce pipelines


    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3.4
    • Component/s: Pipelines
    • Labels:
    • Release Notes:
      Optimized the planner to reduce the amount of temporary data required in certain types of mapreduce pipelines.
    • Rank:


      Consider a pipeline with a single source that fans out into 10 branches, each branch containing a transform, then an aggregation, then a sink.

      The current planner places a connector dataset in front of each aggregator. This means the first mapreduce job will read from the source, perform the transform on each branch, and write to 10 different temporary directories. There will then be 10 parallel mapreduce jobs that read from each temporary directory, perform an aggregation, then write to a sink.

      However, almost all of the time, transforms are much cheaper than I/O, so a more efficient plan is to place a single connector dataset right after the source, before everything branches. The first mapreduce job would then consist of reading from the source and writing to a single temporary directory. There would then be 10 parallel mapreduce jobs that all read from that one temporary directory, perform a transform, then an aggregation, then write to a sink.
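The placement difference above can be sketched as a small graph walk. The node model, names, and functions below are a hypothetical illustration of the idea, not the actual CDAP planner code: the naive placement puts a connector on every edge entering an aggregator, while the hoisted placement walks upstream past cheap transforms and deduplicates connectors at the shared branch point.

```python
# Illustrative sketch only -- node names, types, and the planner model here
# are hypothetical, not CDAP's actual planner implementation.

SOURCE, TRANSFORM, AGGREGATOR, SINK = "source", "transform", "aggregator", "sink"

def naive_placement(edges, types):
    """Current behavior: one connector dataset on each edge entering an aggregator."""
    return sorted((u, v) for u, v in edges if types[v] == AGGREGATOR)

def hoisted_placement(edges, types):
    """Proposed behavior: for each aggregator, walk upstream past cheap
    transforms and place the connector right after the first non-transform
    node, deduplicating placements shared by multiple branches."""
    parent = {v: u for u, v in edges}  # assumes each node has a single parent
    placements = set()
    for u, v in edges:
        if types[v] != AGGREGATOR:
            continue
        node = u
        while types[node] == TRANSFORM:
            node = parent[node]
        placements.add(node)  # one connector dataset written right after `node`
    return sorted(placements)

# The 10-branch pipeline from the description:
# src -> transform_i -> aggregator_i -> sink_i, for i in 0..9.
edges, types = [], {"src": SOURCE}
for i in range(10):
    t, a, k = f"t{i}", f"a{i}", f"k{i}"
    types.update({t: TRANSFORM, a: AGGREGATOR, k: SINK})
    edges += [("src", t), (t, a), (a, k)]

print(len(naive_placement(edges, types)))  # 10 temporary datasets written by job 1
print(hoisted_placement(edges, types))     # ['src'] -- a single temporary dataset
```

For this shape of pipeline the first job's output shrinks from 10 temporary copies of the (transformed) source data to 1, at the cost of re-running the cheap transforms in each downstream job.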




            • Assignee:
              ashau Albert Shau
            • Votes: 0
            • Watchers: 1


              • Created: