  1. CDAP
  2. CDAP-12085

Ability to use Hive "concatenate" on the partitions of a PartitionedFileSet (PFS)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3.0
    • Component/s: Datasets
    • Labels:
    • Release Notes:
      Expose a programmatic API to leverage Hive's functionality to concatenate a partition of a PartitionedFileSet.

      Description

      Dynamic partitioning especially tends to create very small files on the file system. It would be useful, and would reduce the load on the NameNode, if the files of a partition could be combined into a single larger file.

      In particular, the ORC file format supports a very efficient "concatenate" operation that does not require running a MapReduce (MR) job. Hive surfaces this as the ALTER TABLE ... PARTITION ... CONCATENATE statement.
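      As a rough illustration of the Hive statement mentioned above, the following sketch builds an ALTER TABLE ... PARTITION ... CONCATENATE string for one partition. The class name, the table name dataset_logs, and the partition keys year and month are hypothetical, and the sketch assumes string-typed partition values. It only assembles the statement text; it is not the CDAP API that this issue introduced.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: assembles a Hive CONCATENATE statement for one partition.
public class ConcatenateStatement {

  // Renders each partition value as a single-quoted Hive string literal.
  // Assumption: all partition columns are strings; numeric columns would
  // need unquoted values.
  static String build(String table, Map<String, String> partitionSpec) {
    if (partitionSpec.isEmpty()) {
      throw new IllegalArgumentException("partition spec must not be empty");
    }
    StringJoiner spec = new StringJoiner(", ", "(", ")");
    for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
      // Escape backslashes first, then single quotes.
      String value = e.getValue().replace("\\", "\\\\").replace("'", "\\'");
      spec.add(e.getKey() + " = '" + value + "'");
    }
    return "ALTER TABLE " + table + " PARTITION " + spec + " CONCATENATE";
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the partition keys in insertion order.
    Map<String, String> spec = new LinkedHashMap<>();
    spec.put("year", "2017");
    spec.put("month", "6");
    System.out.println(build("dataset_logs", spec));
  }
}
```

      The resulting string could then be submitted through whatever Hive client is available, for example a JDBC Statement. How that execution interacts with concurrently running jobs is exactly the open question this issue raises.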

      One tricky question is whether the concatenation is atomic, and in particular what that means for jobs that run concurrently with the concatenate operation.


            People

            • Assignee: Ali Anwar (ali.anwar)
            • Reporter: Andreas Neumann (andreas)
