CDAP / CDAP-11109

TPFS created by TPFSAvro sink cannot be used in MapReduce or Spark

    Details

    • Release Notes:
      Fixed an issue that prevented TPFS sources and sinks created by Hydrator pipelines from being used as input or output for MapReduce or Spark.

      Description

      When the sink creates the dataset, it configures the explore schema, but it does not configure the schema for AvroKeyInputFormat and AvroKeyOutputFormat. As a result, this TPFS (TimePartitionedFileSet) cannot be used as input or output in MapReduce or Spark.

      We need to set the schema as properties for the input and output formats in the file set properties:

      setInputProperty("avro.schema.input.key", schemaString)
      setOutputProperty("avro.schema.output.key", schemaString)
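
A minimal, self-contained sketch of the two properties involved. It uses only plain Java maps rather than the CDAP API; the class name `AvroSchemaProps`, its helper methods, and the sample `Event` schema are hypothetical and exist only for illustration. The property keys are the ones quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of CDAP: collects the Avro schema properties
// that the sink would need to add to the file set properties.
public class AvroSchemaProps {
    // Property keys as quoted in the issue description.
    static final String INPUT_KEY = "avro.schema.input.key";
    static final String OUTPUT_KEY = "avro.schema.output.key";

    // Property for the input format (reader side).
    static Map<String, String> inputProperties(String schemaString) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(INPUT_KEY, schemaString);
        return props;
    }

    // Property for the output format (writer side).
    static Map<String, String> outputProperties(String schemaString) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(OUTPUT_KEY, schemaString);
        return props;
    }

    public static void main(String[] args) {
        // Sample Avro record schema, assumed for illustration only.
        String schemaString =
            "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}";

        // Each entry would be handed to setInputProperty / setOutputProperty
        // when the sink builds the file set properties.
        inputProperties(schemaString)
            .forEach((k, v) -> System.out.println("input  " + k + " = " + v));
        outputProperties(schemaString)
            .forEach((k, v) -> System.out.println("output " + k + " = " + v));
    }
}
```

The same schema string is used on both sides so that readers and writers of the dataset agree on the record layout.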
      

      Without that, for example when reading in Spark, we get warnings like:

      2016-06-01 17:02:20,888 - WARN [Executor task launch worker-0:o.a.a.m.AvroKeyInputFormat@49] - Reader schema was not set. Use AvroJob.setInputKeySchema() if desired.
      2016-06-01 17:02:20,889 - INFO [Executor task launch worker-1:o.a.a.m.AvroKeyInputFormat@50] - Using a reader schema equal to the writer schema.
      

              People

              • Assignee:
                Andreas Neumann
                Reporter:
                Andreas Neumann