CDAP-9922

TPFS parquet sink fails on byte[]

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1.0
    • Component/s: Pipeline Plugins, Pipelines
    • Labels: None
    • Sprint: App Eng Sprint 4, App Eng Sprint 5
    • Release Notes: Fixed an issue where the File Sink plugin was failing when writing byte array records.

      Description

      If you try writing a byte[] to a TPFSParquet sink (at least in Spark), you get an exception when the parquet-avro library tries to cast the byte[] to a ByteBuffer. To reproduce, create a realtime pipeline that reads from Kafka and writes to TPFSParquet. The source will read some data, but the pipeline will not be able to write anything, and there will be an exception like:

      java.lang.ClassCastException: [B cannot be cast to java.nio.ByteBuffer
          at parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:208) ~[parquet-avro-1.6.0.jar:1.6.0]
          at parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:112) ~[parquet-avro-1.6.0.jar:1.6.0]
          at parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:87) ~[parquet-avro-1.6.0.jar:1.6.0]
          at parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:44) ~[parquet-avro-1.6.0.jar:1.6.0]
          at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[org.apache.hive.hive-exec-1.2.1.jar:1.2.1]
          at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123) ~[org.apache.hive.hive-exec-1.2.1.jar:1.2.1]
          at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42) ~[org.apache.hive.hive-exec-1.2.1.jar:1.2.1]
          at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.scheduler.Task.run(Task.scala:89) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ~[co.cask.cdap.spark-assembly-1.6.1.jar:na]
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
          at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
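
      The cast fails because Avro maps the "bytes" schema type to java.nio.ByteBuffer, and parquet-avro's AvroWriteSupport relies on that mapping. GenericRecord.put accepts any Object, so a raw byte[] is only rejected at write time. A minimal illustration (not from the ticket; the record and field names are made up):

          import java.nio.ByteBuffer;
          import org.apache.avro.Schema;
          import org.apache.avro.SchemaBuilder;
          import org.apache.avro.generic.GenericData;
          import org.apache.avro.generic.GenericRecord;

          public class BytesFieldExample {
            public static void main(String[] args) {
              // Avro schema with a single "bytes" field; its in-memory
              // representation is expected to be a ByteBuffer.
              Schema schema = SchemaBuilder.record("rec").fields()
                  .requiredBytes("body")
                  .endRecord();
              GenericRecord record = new GenericData.Record(schema);

              // Compiles, but a writer that casts the value to ByteBuffer
              // (as AvroWriteSupport.writeValue does) will throw
              // "[B cannot be cast to java.nio.ByteBuffer" at write time.
              record.put("body", new byte[] {1, 2, 3});

              // Correct: wrap the array in a ByteBuffer.
              record.put("body", ByteBuffer.wrap(new byte[] {1, 2, 3}));
            }
          }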
      

      We still need to investigate whether the same failure occurs for Avro, and whether it is specific to Spark or also affects MapReduce. In any case, the fix is probably to convert any byte[] to a ByteBuffer when a StructuredRecord is converted to a GenericRecord (see the sketch below).
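
      A minimal sketch of the proposed conversion, assuming a helper in the StructuredRecord-to-GenericRecord translation path (the class and method names here are hypothetical; the real change belongs wherever field values are translated):

          import java.nio.ByteBuffer;

          // Hypothetical helper illustrating the proposed fix: wrap raw byte[]
          // values in a ByteBuffer so that Avro "bytes" fields carry the type
          // parquet-avro expects; all other values pass through unchanged.
          public final class BytesToByteBuffer {
            static Object convertField(Object value) {
              return value instanceof byte[] ? ByteBuffer.wrap((byte[]) value) : value;
            }
          }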

            People

            • Assignee: Shashank (shashank)
            • Reporter: Albert Shau (ashau)
            • Votes: 0
            • Watchers: 3
