CDAP / CDAP-12823

MultiWriter fails when writing to multiple partitions in Parquet format

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: CDAP
    • Labels:
      None

      Description

      MultiWriter tries to create a new RecordWriter for each new partition key. However, DynamicPartitionerWriterWrapper uses the same fileOutputFormat object to create every RecordWriter. If fileOutputFormat is of class ParquetOutputFormat, the different RecordWriter instances share the same underlying WriteSupport and other objects, so ParquetOutputFormat fails when writing to multiple partitions. The bug can be reproduced by running the attached pipeline and injecting data with interleaved partition keys into the stream, such as:

      a,1
      a,2
      b,3
      a,4
      

      If this case does not fail, try with more data, since the failure depends on the order in which the mapper reads the input data.
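
      The sharing problem described above can be illustrated with a toy model. This is a sketch only: ToyWriteSupport, ToyRecordWriter, ToyOutputFormat and MultiWriterSketch are hypothetical stand-ins and are not the Parquet, Hadoop or CDAP classes. The caching inside ToyOutputFormat is an assumption modeled on the description, and the real failure in Parquet may surface differently (for example as an exception rather than mixed records). Using a fresh output format per partition key is shown only as one possible way to avoid the sharing, not necessarily the fix that was applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a per-file helper such as WriteSupport: holds state meant for ONE output file. */
class ToyWriteSupport {
    final List<String> buffered = new ArrayList<>();
}

/** Hypothetical stand-in for a RecordWriter bound to one partition. */
class ToyRecordWriter {
    private final ToyWriteSupport support;

    ToyRecordWriter(ToyWriteSupport support) {
        this.support = support;
    }

    void write(String value) {
        support.buffered.add(value);
    }

    List<String> contents() {
        return new ArrayList<>(support.buffered);
    }
}

/** Hypothetical output format that caches its write support (assumption), so every writer it creates shares it. */
class ToyOutputFormat {
    private ToyWriteSupport cached;

    ToyRecordWriter getRecordWriter() {
        if (cached == null) {
            cached = new ToyWriteSupport();
        }
        return new ToyRecordWriter(cached);
    }
}

class MultiWriterSketch {
    /** Routes "key,value" lines to one writer per key and returns what each writer ended up holding. */
    static Map<String, List<String>> run(List<String> lines, boolean freshFormatPerPartition) {
        ToyOutputFormat sharedFormat = new ToyOutputFormat();
        Map<String, ToyRecordWriter> writers = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            // A writer is created the first time a partition key is seen.
            ToyRecordWriter writer = writers.computeIfAbsent(parts[0], key ->
                (freshFormatPerPartition ? new ToyOutputFormat() : sharedFormat).getRecordWriter());
            writer.write(parts[1]);
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        writers.forEach((key, writer) -> result.put(key, writer.contents()));
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a,1", "a,2", "b,3", "a,4");
        // One format for all partitions: both writers see the same state.
        System.out.println("shared format: " + run(input, false)); // {a=[1, 2, 3, 4], b=[1, 2, 3, 4]}
        // One format per partition key: each writer keeps its own state.
        System.out.println("fresh format:  " + run(input, true));  // {a=[1, 2, 4], b=[3]}
    }
}
```

      In this model the interleaved input from the description is enough to show the effect: with a single shared format, the writers for partitions a and b end up operating on the same state, while one format per partition key keeps them isolated.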

        Attachments

            People

            • Assignee: Chengfeng Mao
            • Reporter: Chengfeng Mao
