CDAP-12054

CombineFileInputFormat does not work with PFS as input

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2.0, 4.1.1, 4.1.0, 4.0.0, 3.6.1, 3.5.2
    • Fix Version/s: 4.3.0, 4.2.1
    • Component/s: Datasets, MapReduce
    • Release Notes:
      Fixed PartitionedFileSet to work with CombineFileInputFormat, as input to a batch job.

      Description

      A PartitionedFileSet whose InputFormat class is configured as a CombineFileInputFormat cannot be used as input to a batch job (such as a MapReduce job). The job fails with the following error:

      2017-07-06 21:01:25,938 - ERROR [Thread-1460:o.a.h.m.LocalJobRunnerWithFix$Job@562] - Job <jobid> failed 
      java.lang.Exception: java.lang.IllegalArgumentException: Expected a 'org.apache.hadoop.mapreduce.lib.input.FileSplit', but got 'org.apache.hadoop.mapreduce.lib.input.CombineFileSplit'. 
      at org.apache.hadoop.mapred.LocalJobRunnerWithFix$Job.runTasks(LocalJobRunnerWithFix.java:465) ~[co.cask.cdap.cdap-app-fabric-4.1.1.jar:na] 
      at org.apache.hadoop.mapred.LocalJobRunnerWithFix$Job.run(LocalJobRunnerWithFix.java:524) ~[co.cask.cdap.cdap-app-fabric-4.1.1.jar:na] 
      java.lang.IllegalArgumentException: Expected a 'org.apache.hadoop.mapreduce.lib.input.FileSplit', but got 'org.apache.hadoop.mapreduce.lib.input.CombineFileSplit'. 
      at co.cask.cdap.internal.app.runtime.batch.dataset.input.BasicPartitionedFileSetInputContext.<init>(BasicPartitionedFileSetInputContext.java:58) ~[na:na] 
      at co.cask.cdap.internal.app.runtime.batch.dataset.input.InputContexts.create(InputContexts.java:36) ~[na:na] 
      at co.cask.cdap.internal.app.runtime.batch.MapperWrapper.run(MapperWrapper.java:87) ~[na:na] 
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na] 
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na] 
      at org.apache.hadoop.mapred.LocalJobRunnerWithFix$Job$MapTaskRunnable.run(LocalJobRunnerWithFix.java:243) ~[co.cask.cdap.cdap-app-fabric-4.1.1.jar:na] 
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121] 
      at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_121] 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_121] 
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_121] 
      at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_121]
      

      There is functionality (CDAP-4322) that computes the PartitionKey of a given FileSplit. Because a CombineFileSplit is not a FileSplit, that code throws the IllegalArgumentException shown above instead of computing the PartitionKey of the input split.
      When CombineFileInputFormat is used, a single InputSplit can be associated with multiple PartitionKeys, so a getInputPartitionKey() method alone is no longer sufficient. A getInputPartitionKeys() method that returns a list of PartitionKeys is also needed.
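
      The idea of handling both split types and returning several keys per split can be sketched as follows. This is a minimal illustration, not the actual CDAP fix. The types FileSplit and CombineFileSplit below are simplified stand-ins that only mirror the path accessors of the Hadoop classes (getPath() and getPaths(), which in Hadoop return Path objects rather than strings). The class SplitPartitions, its methods, and the rule that a partition key is the file's parent directory relative to the fileset base path are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SplitPartitions {

  // Stand-ins for the Hadoop split types (assumption: only path accessors matter here).
  interface InputSplit { }

  static final class FileSplit implements InputSplit {
    private final String path;
    FileSplit(String path) { this.path = path; }
    String getPath() { return path; }
  }

  static final class CombineFileSplit implements InputSplit {
    private final String[] paths;
    CombineFileSplit(String... paths) { this.paths = paths.clone(); }
    String[] getPaths() { return paths.clone(); }
  }

  // Collects every file path covered by the split, whichever kind it is,
  // instead of rejecting anything that is not a FileSplit.
  static List<String> pathsOf(InputSplit split) {
    if (split instanceof FileSplit) {
      return Collections.singletonList(((FileSplit) split).getPath());
    }
    if (split instanceof CombineFileSplit) {
      return Arrays.asList(((CombineFileSplit) split).getPaths());
    }
    throw new IllegalArgumentException("Unsupported split type: " + split.getClass().getName());
  }

  // Hypothetical key derivation: the partition is the file's parent directory,
  // expressed relative to the fileset base path.
  static String partitionOf(String basePath, String filePath) {
    String prefix = basePath.endsWith("/") ? basePath : basePath + "/";
    if (!filePath.startsWith(prefix)) {
      throw new IllegalArgumentException(filePath + " is not under " + basePath);
    }
    String relative = filePath.substring(prefix.length());
    int lastSlash = relative.lastIndexOf('/');
    if (lastSlash < 0) {
      throw new IllegalArgumentException(filePath + " is not inside a partition directory");
    }
    return relative.substring(0, lastSlash);
  }

  // Returns the distinct partition keys of a split, in first-seen order.
  // A FileSplit yields exactly one key; a CombineFileSplit may yield several.
  static List<String> partitionKeysOf(String basePath, InputSplit split) {
    Set<String> keys = new LinkedHashSet<>();
    for (String path : pathsOf(split)) {
      keys.add(partitionOf(basePath, path));
    }
    return new ArrayList<>(keys);
  }
}
```

      With this shape, a single-key accessor can remain valid for plain FileSplits, while callers that use a combining input format read the list-returning variant.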


              People

              • Assignee: Ali Anwar (ali.anwar)
              • Reporter: Ali Anwar (ali.anwar)