Details

    • Sprint:
      App Eng Sprint 8
    • Release Notes:
Removed dataset usage in the Hive source and sink, which allows them to work in Spark and fixes a race condition that could cause pipelines to fail with a transaction conflict exception.

      Description

      The Hive plugin dies in Spark with:

      
      SchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601) ~[spark-assembly.jar:na]
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590) ~[spark-assembly.jar:na]
      	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[spark-assembly.jar:na]
      	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) ~[spark-assembly.jar:na]
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1843) ~[na:na]
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856) ~[na:na]
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1933) ~[na:na]
      	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1146) ~[spark-assembly.jar:na]
      	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074) ~[spark-assembly.jar:na]
      	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074) ~[spark-assembly.jar:na]
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) ~[spark-assembly.jar:na]
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) ~[spark-assembly.jar:na]
      	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) ~[spark-assembly.jar:na]
      	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074) ~[spark-assembly.jar:na]
      	at co.cask.cdap.app.runtime.spark.DefaultSparkExecutionContext$$anon$2.run(DefaultSparkExecutionContext.scala:264) ~[co.cask.cdap.cdap-spark-core-3.6.0.jar:na]
      	at co.cask.cdap.app.runtime.spark.SparkTransactional.execute(SparkTransactional.java:198) ~[co.cask.cdap.cdap-spark-core-3.6.0.jar:na]
      	... 20 common frames omitted
      java.lang.UnsupportedOperationException: Not supported
      	at co.cask.cdap.etl.spark.batch.SparkBatchRuntimeContext.getDataset(SparkBatchRuntimeContext.java:61) ~[hydrator-spark-core-3.6.0.jar:na]
      	at co.cask.hydrator.plugin.batch.commons.HiveSchemaStore.readHiveSchema(HiveSchemaStore.java:47) ~[1482719772555-0/:na]
      	at co.cask.hydrator.plugin.batch.source.HiveBatchSource.initialize(HiveBatchSource.java:117) ~[1482719772555-0/:na]
      	at co.cask.cdap.etl.spark.function.BatchSourceFunction.call(BatchSourceFunction.java:43) ~[hydrator-spark-core-3.6.0.jar:na]
      	at co.cask.cdap.etl.spark.function.BatchSourceFunction.call(BatchSourceFunction.java:30) ~[hydrator-spark-core-3.6.0.jar:na]
      	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:129) ~[spark-assembly.jar:na]
      	at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:129) ~[spark-assembly.jar:na]
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[spark-assembly.jar:na]
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[spark-assembly.jar:na]
      	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) ~[spark-assembly.jar:na]
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) ~[spark-assembly.jar:na]
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[spark-assembly.jar:na]
      	at org.apache.spark.scheduler.Task.run(Task.scala:89) ~[spark-assembly.jar:na]
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ~[spark-assembly.jar:na]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_67]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_67]
      	at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_67]
      

      The reason is that the plugin is trying to call getDataset() within a Spark closure, which is not supported.
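
      The failure mode can be reproduced outside CDAP with plain Spark. In the sketch below (the class and dataset names are hypothetical, not CDAP's actual classes), the context object captured by the closure is serialized out to the executors, where its dataset accessor throws, just like SparkBatchRuntimeContext.getDataset() in the trace above:

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaSparkContext;

      import java.io.Serializable;
      import java.util.Arrays;

      public class ClosureLookupSketch {

        // Stand-in for the runtime context that gets shipped to executors.
        // Its dataset accessor is unsupported there, mirroring
        // SparkBatchRuntimeContext.getDataset().
        static class ExecutorSideContext implements Serializable {
          Object getDataset(String name) {
            throw new UnsupportedOperationException("Not supported");
          }
        }

        public static void main(String[] args) {
          SparkConf conf = new SparkConf().setAppName("closure-lookup").setMaster("local[1]");
          try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            ExecutorSideContext context = new ExecutorSideContext();
            // The map closure is serialized and run on the executors, so the
            // lookup below fails the task -- the same failure HiveBatchSource
            // hits when initialize() runs inside a Spark function.
            sc.parallelize(Arrays.asList(1, 2, 3))
              .map(x -> context.getDataset("hive-schema-store"))
              .collect();
          }
        }
      }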

      The deeper root cause is that there is no way for a plugin to set properties at configure time that it can then access at runtime, which is why the Hive plugin is doing this dataset lookup in the first place.
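
      One way around this (a sketch of the general approach, not the actual patch; the key name and method shapes here are hypothetical) is to capture the value once on the driver at prepare time and ship it to the tasks through the Hadoop Configuration, which reaches every executor, so initialize() never needs a dataset lookup or a transaction:

      import org.apache.hadoop.conf.Configuration;

      public class SchemaPassThroughSketch {

        // Hypothetical configuration key for the serialized Hive schema.
        static final String SCHEMA_KEY = "hive.source.schema";

        // Driver side, at prepare time: store the schema in the job conf.
        static void prepareRun(Configuration hConf, String schemaJson) {
          hConf.set(SCHEMA_KEY, schemaJson);
        }

        // Executor side, at initialize time: read it straight back. No
        // dataset and no transaction are involved, so neither the
        // UnsupportedOperationException nor the transaction-conflict race
        // can occur.
        static String initialize(Configuration hConf) {
          return hConf.get(SCHEMA_KEY);
        }

        public static void main(String[] args) {
          Configuration conf = new Configuration(false);
          prepareRun(conf, "{\"type\":\"record\",\"name\":\"row\",\"fields\":[]}");
          System.out.println(initialize(conf)); // prints the schema unchanged
        }
      }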

              People

              • Assignee:
                Albert Shau (ashau)
              • Reporter:
                Albert Shau (ashau)
              • Votes:
                0
              • Watchers:
                3
