CDAP-7651: Having too many variables in Hadoop configuration can lead to Explore job failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.2
    • Fix Version/s: 4.3.1
    • Component/s: Explore
    • Labels:
      None
    • Release Notes:
      Fixed an issue where Hive queries could fail if the configuration contained too many variable substitutions.

      Description

      If there are many variables for expansion in the Hadoop Configuration (see org.apache.hadoop.conf.Configuration#VariableExpansion), then a job launched by the Explore service can fail.
      I have witnessed consistent failures for Hive on Spark jobs, whereas the MapReduce execution engine succeeded; I am unsure about the cause of this inconsistency.
      Basically, this failure happens because the Explore service (see BaseHiveExploreService#startSession) serializes hConf and cConf into the sessionConf, so fetching this serialized data back causes Hadoop's Configuration class to substitute many of the variables embedded in the serialized hConf.
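
      For illustration, this limit can be reproduced directly against Hadoop's Configuration class: on the Hadoop versions involved here, get() performs at most 20 variable substitutions per lookup and throws the IllegalStateException shown in the stack trace below once a value still contains unresolved ${...} references after that. This is a minimal standalone sketch (the property names are made up, and it is not CDAP code):

          import org.apache.hadoop.conf.Configuration;

          public class SubstitutionDepthRepro {
            public static void main(String[] args) {
              Configuration conf = new Configuration(false);

              // A single value referencing more than 20 variables, similar to a
              // serialized hConf XML blob that is full of ${...} expressions.
              StringBuilder value = new StringBuilder();
              for (int i = 0; i < 25; i++) {
                conf.set("var." + i, "v" + i);
                value.append("${var.").append(i).append("} ");
              }
              conf.set("serialized.hconf", value.toString());

              // getRaw() returns the stored string without any substitution.
              System.out.println(conf.getRaw("serialized.hconf"));

              // get() substitutes one variable at a time and gives up after 20,
              // throwing IllegalStateException: "Variable substitution depth too large: 20".
              System.out.println(conf.get("serialized.hconf"));
            }
          }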

      This is the stack trace from the AM of the Spark YARN application:

      Caused by: java.lang.IllegalStateException: Variable substitution depth too large: 20 ...
      ... </configuration>
              at org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:962)
              at org.apache.hadoop.conf.Configuration.get(Configuration.java:982)
              at co.cask.cdap.common.conf.ConfigurationUtil.get(ConfigurationUtil.java:45)
              at co.cask.cdap.hive.context.ContextManager.createContext(ContextManager.java:168)
              at co.cask.cdap.hive.context.ContextManager.getContext(ContextManager.java:126)
              at co.cask.cdap.hive.stream.HiveStreamInputFormat.getSplitFinder(HiveStreamInputFormat.java:91)
              at co.cask.cdap.hive.stream.HiveStreamInputFormat.getSplits(HiveStreamInputFormat.java:72)
              at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
              at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
              at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:363)
              at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:573)
              at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
              at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
              at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
              at scala.Option.getOrElse(Option.scala:120)
              at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
              at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
              at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
              at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
              at scala.Option.getOrElse(Option.scala:120)
              at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
              at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
              at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
              at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
              at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
              at scala.Option.getOrElse(Option.scala:120)
              at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
              at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:386)
              at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:398)
              at org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:299)
              at org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:334)
              at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:837)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1607)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
              at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
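
      The round trip that triggers this (embedding an entire serialized Configuration as a single property of the session conf and later reading it back with get()) can in principle be sidestepped by reading the blob with getRaw(), which never substitutes, and re-parsing it. The following is only an illustrative sketch of that pattern with made-up method and key names; it is not CDAP's ConfigurationUtil API and not necessarily the fix that shipped in 4.3.1:

          import java.io.ByteArrayInputStream;
          import java.io.IOException;
          import java.io.StringWriter;
          import java.nio.charset.StandardCharsets;
          import org.apache.hadoop.conf.Configuration;

          public class SerializedConfRoundTrip {

            // Embed an entire Configuration as one property of a carrier conf,
            // roughly how the Explore session conf carries hConf/cConf.
            static void embed(Configuration carrier, String key, Configuration embedded) throws IOException {
              StringWriter xml = new StringWriter();
              embedded.writeXml(xml);            // writes raw values, no substitution
              carrier.set(key, xml.toString());
            }

            // Reading the blob back with carrier.get(key) would run variable substitution
            // over the whole XML string and can hit the 20-substitution limit;
            // getRaw() hands it back verbatim, and addResource() re-parses the XML.
            static Configuration extract(Configuration carrier, String key) {
              Configuration result = new Configuration(false);
              result.addResource(new ByteArrayInputStream(
                  carrier.getRaw(key).getBytes(StandardCharsets.UTF_8)));
              return result;
            }
          }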
      


    People

    • Assignee: Terence Yim (terence)
    • Reporter: Ali Anwar (ali.anwar)
    • Votes: 0
    • Watchers: 2
