CDAP-7256

Hadoop classes from hive-exec are leaked to programs in the SDK

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0
    • Component/s: Applications, Standalone
    • Release Notes:
      Avoid leaking Hive classes to programs in the CDAP SDK.

      Description

      When we create the classloader used for programs, we filter out all classes except those in cdap-api and those whose names start with org.apache.hadoop (excluding everything under org.apache.hadoop.hbase). This is done to isolate programs from CDAP's own dependencies.
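
      For reference, the filtering behaves roughly like the sketch below. This is a minimal illustration of the package-based filter described above, not CDAP's actual filter classloader; the class name and the cdap-api package prefix are assumptions:

      import java.util.function.Predicate;

      public class ProgramFilterClassLoader extends ClassLoader {

        // Classes visible to programs: cdap-api classes and org.apache.hadoop.*,
        // minus everything under org.apache.hadoop.hbase.
        private static final Predicate<String> VISIBLE =
            name -> name.startsWith("co.cask.cdap.api.")
                || (name.startsWith("org.apache.hadoop.")
                    && !name.startsWith("org.apache.hadoop.hbase."));

        public ProgramFilterClassLoader(ClassLoader parent) {
          super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
          // Delegate to the CDAP system classloader only for visible classes.
          // For filtered names we throw, so a child classloader holding the
          // program jar falls back to loading the class from the jar itself.
          if (VISIBLE.test(name)) {
            return super.loadClass(name, resolve);
          }
          throw new ClassNotFoundException(name + " is filtered from program classloaders");
        }
      }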

      However, in the SDK, hive-exec is included on the classpath because Explore runs in the same JVM as everything else. As a result, Hive classes are exposed to user programs in the SDK, which causes problems when a program depends on a different version of those classes. For example, if somebody wants to use the ORC file format and has orc-mapreduce 1.2.0 as a dependency in their pom, they will get an error:

      java.lang.Exception: java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch.getDataColumnCount()I 
      at org.apache.hadoop.mapred.LocalJobRunnerWithFix$Job.runTasks(LocalJobRunnerWithFix.java:465) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na] 
      at org.apache.hadoop.mapred.LocalJobRunnerWithFix$Job.run(LocalJobRunnerWithFix.java:524) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na] 
      Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch.getDataColumnCount()I 
      at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1073) ~[orc-core-1.2.0.jar:1.2.0] 
      at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:69) ~[orc-mapreduce-1.2.0.jar:1.2.0] 
      at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:87) ~[orc-mapreduce-1.2.0.jar:1.2.0] 
      at co.cask.cdap.internal.app.runtime.batch.dataset.input.DelegatingRecordReader.nextKeyValue(DelegatingRecordReader.java:84) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na] 
      at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na] 
      at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na] 
      at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na] 
      at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na]
      

      This is because OrcMapreduceRecordReader uses VectorizedRowBatch. A class with that fully qualified name (org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch) exists in both hive-storage-api and hive-exec, and the copy from hive-exec gets through the program filter because its name starts with org.apache.hadoop, so it is used instead of the one from the user's jar.
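
      Per the release note above, the fix is to stop exposing Hive classes to programs in the SDK. One way to express that, continuing the illustrative sketch above (the org.apache.hadoop.hive prefix check is the only change; the names remain assumptions):

      import java.util.function.Predicate;

      public final class SdkProgramVisibility {
        // In SDK/standalone mode, additionally hide org.apache.hadoop.hive.* so
        // programs resolve Hive-namespaced classes (e.g. VectorizedRowBatch)
        // from their own dependencies, such as hive-storage-api.
        public static final Predicate<String> VISIBLE =
            name -> name.startsWith("co.cask.cdap.api.")
                || (name.startsWith("org.apache.hadoop.")
                    && !name.startsWith("org.apache.hadoop.hbase.")
                    && !name.startsWith("org.apache.hadoop.hive."));
      }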


    People

    • Assignee: Ali Anwar (ali.anwar)
    • Reporter: Albert Shau (ashau)
    • Votes: 0
    • Watchers: 5
