  1. CDAP
  2. CDAP-7500

MapReduce classloader is closed prematurely in MRAppMaster

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.5.2
    • Fix Version/s: 4.0.0, 3.6.1, 3.5.2
    • Component/s: MapReduce
    • Labels:
      None
    • Release Notes:
      Fixed cases where MapReduce classloader gets closed prematurely.

      Description

      For MapReduce jobs, the ApplicationMaster used to close the MapReduceClassLoader in a shutdown hook.
      However, https://github.com/caskdata/cdap/pull/6873 changed this so that the classloader is closed in a finally block right after the main() method returns.
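
      The difference between the two close strategies can be sketched as follows. This is only an illustration: FakeClassLoader, openWithShutdownHook, and openAndCloseInFinally are hypothetical names, not CDAP classes, and the real launcher code may look quite different.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownHookCloseSketch {

  /** Hypothetical stand-in for a closeable classloader (not a CDAP class). */
  static final class FakeClassLoader implements Closeable {
    final AtomicBoolean closed = new AtomicBoolean();

    @Override
    public void close() {
      closed.set(true);
    }
  }

  /** Defers close() to JVM shutdown, so the loader stays usable after main() returns. */
  static FakeClassLoader openWithShutdownHook() {
    FakeClassLoader loader = new FakeClassLoader();
    Runtime.getRuntime().addShutdownHook(new Thread(loader::close, "close-classloader"));
    return loader;
  }

  /** Closes the loader as soon as the delegated main body returns. */
  static FakeClassLoader openAndCloseInFinally(Runnable mainBody) {
    FakeClassLoader loader = new FakeClassLoader();
    try {
      mainBody.run();
    } finally {
      loader.close();
    }
    return loader;
  }

  public static void main(String[] args) {
    FakeClassLoader hooked = openWithShutdownHook();
    FakeClassLoader eager = openAndCloseInFinally(() -> { });
    System.out.println("shutdown-hook loader closed after main body: " + hooked.closed.get());
    System.out.println("finally-block loader closed after main body: " + eager.closed.get());
  }
}
```

      With the shutdown hook, anything still running after the main body returns can keep using the loader until the JVM exits. With the finally block, it cannot.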

      For mapper and reducer tasks, this seems to be fine. However, the main method of the MRAppMaster class returns before its work is complete. It appears to be asynchronous: a Service (MRAppMaster) keeps running and operating even after the main method returns.
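
      A minimal sketch of this failure mode, assuming a main method that only starts background work and then returns. FakeContext, asyncMain, and commitSucceeds are hypothetical names used for illustration; they do not reproduce the MRAppMaster or CDAP code.

```java
import java.io.Closeable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class PrematureCloseSketch {

  /** Hypothetical stand-in for a task context that rejects use after close(). */
  static final class FakeContext implements Closeable {
    private final AtomicBoolean closed = new AtomicBoolean();

    String getDataset(String name) {
      if (closed.get()) {
        throw new IllegalStateException("context closed before '" + name + "' was requested");
      }
      return "dataset:" + name;
    }

    @Override
    public void close() {
      closed.set(true);
    }
  }

  /** Mimics an asynchronous main(): it schedules the commit and returns immediately. */
  static Future<String> asyncMain(ExecutorService executor, FakeContext context,
                                  CountDownLatch mainReturned) {
    return executor.submit(() -> {
      mainReturned.await(); // the commit runs only after main() has returned
      return context.getDataset("cleanRecords");
    });
  }

  /** Returns true if the background commit could still obtain its dataset. */
  static boolean commitSucceeds(boolean closeWhenMainReturns) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    FakeContext context = new FakeContext();
    CountDownLatch mainReturned = new CountDownLatch(1);
    try {
      Future<String> commit;
      try {
        commit = asyncMain(executor, context, mainReturned);
      } finally {
        if (closeWhenMainReturns) {
          context.close(); // too early: the commit has not run yet
        }
      }
      mainReturned.countDown();
      commit.get();
      return true;
    } catch (ExecutionException e) {
      return false; // the commit failed because the context was already closed
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      context.close(); // safe here: the commit has finished either way
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("close in finally after main(): commit ok = " + commitSucceeds(true));
    System.out.println("close after commit completes:  commit ok = " + commitSucceeds(false));
  }
}
```

      In this sketch, closing in the finally block is harmless only when main() returns after all the work is done, which matches the observation that mapper and reducer tasks are unaffected.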

      Because of this, operations in the MRAppMaster that depend on MapReduceTaskContext will fail.
      For instance, DynamicPartitioner instantiates a Dataset in the OutputCommitter to create the partitions written by all the mapper tasks.

      The logs show the DistributedMapReduceTaskContextProvider shutting down before the following error:

      2016-10-21 05:31:19,716 - ERROR [CommitterEvent Processor #1:o.a.h.m.v.a.c.CommitterEventHandler@290] - Could not commit job
      co.cask.cdap.api.data.DatasetInstantiationException: Could not instantiate dataset 'aa:cleanRecords'
              at co.cask.cdap.data2.dataset2.SingleThreadDatasetCache.getDataset(SingleThreadDatasetCache.java:177) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.MultiThreadDatasetCache.getDataset(MultiThreadDatasetCache.java:103) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.DynamicDatasetCache.getDataset(DynamicDatasetCache.java:202) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.DynamicDatasetCache.getDataset(DynamicDatasetCache.java:176) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.DynamicDatasetCache.getDataset(DynamicDatasetCache.java:147) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.internal.app.runtime.AbstractContext.getDataset(AbstractContext.java:242) ~[cdap-app-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.internal.app.runtime.batch.BasicMapReduceTaskContext.getDataset(BasicMapReduceTaskContext.java:258) ~[cdap-app-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.internal.app.runtime.AbstractContext.getDataset(AbstractContext.java:221) ~[cdap-app-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.internal.app.runtime.AbstractContext.getDataset(AbstractContext.java:210) ~[cdap-app-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.internal.app.runtime.batch.dataset.partitioned.DynamicPartitioningOutputCommitter.commitJob(DynamicPartitioningOutputCommitter.java:81) ~[cdap-app-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.internal.app.runtime.batch.dataset.output.MultipleOutputsCommitter.commitJob(MultipleOutputsCommitter.java:89) ~[cdap-app-fabric-3.5.2-SNAPSHOT.jar:na]
              at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:285) [hadoop-mapreduce-client-app-2.7.1.2.3.4.7-4.jar:na]
              at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237) [hadoop-mapreduce-client-app-2.7.1.2.3.4.7-4.jar:na]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
              at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
      Caused by: com.google.common.util.concurrent.UncheckedExecutionException: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:aa.cleanRecords
              at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache.get(LocalCache.java:3990) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.ForwardingLoadingCache.get(ForwardingLoadingCache.java:48) ~[guava-13.0.1.jar:na]
              at co.cask.cdap.data2.dataset2.SingleThreadDatasetCache$LineageRecordingDatasetCache.get(SingleThreadDatasetCache.java:145) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.SingleThreadDatasetCache$LineageRecordingDatasetCache.get(SingleThreadDatasetCache.java:128) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.SingleThreadDatasetCache.getDataset(SingleThreadDatasetCache.java:171) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              ... 15 common frames omitted
      Caused by: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:aa.cleanRecords
              at co.cask.cdap.data.dataset.SystemDatasetInstantiator.getDataset(SystemDatasetInstantiator.java:103) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.SingleThreadDatasetCache$1.load(SingleThreadDatasetCache.java:93) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.SingleThreadDatasetCache$1.load(SingleThreadDatasetCache.java:89) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252) ~[guava-13.0.1.jar:na]
              ... 22 common frames omitted
      Caused by: co.cask.cdap.api.dataset.DatasetManagementException: Failed to create dataset instance: dataset:aa.cleanRecords
              at co.cask.cdap.data2.metadata.writer.LineageWriterDatasetFramework.getDataset(LineageWriterDatasetFramework.java:179) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data.dataset.SystemDatasetInstantiator.getDataset(SystemDatasetInstantiator.java:96) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              ... 28 common frames omitted
      Caused by: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: Not connected to zooKeeper.
              at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache.get(LocalCache.java:3990) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4884) ~[guava-13.0.1.jar:na]
              at org.apache.twill.discovery.ZKDiscoveryService.discover(ZKDiscoveryService.java:186) ~[org.apache.twill.twill-discovery-core-0.7.0-incubating.jar:0.7.0-incubating]
              at co.cask.cdap.common.guice.ProgramDiscoveryServiceClient.discover(ProgramDiscoveryServiceClient.java:87) ~[co.cask.cdap.cdap-common-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.DatasetServiceClient$4.get(DatasetServiceClient.java:100) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.DatasetServiceClient$4.get(DatasetServiceClient.java:97) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at com.google.common.base.Suppliers$MemoizingSupplier.get(Suppliers.java:116) ~[guava-13.0.1.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.DatasetServiceClient.resolve(DatasetServiceClient.java:392) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.DatasetServiceClient.doRequest(DatasetServiceClient.java:333) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.DatasetServiceClient.doRequest(DatasetServiceClient.java:388) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.DatasetServiceClient.doGet(DatasetServiceClient.java:295) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.DatasetServiceClient.getInstance(DatasetServiceClient.java:134) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.datafabric.dataset.RemoteDatasetFramework.getDataset(RemoteDatasetFramework.java:239) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.ForwardingDatasetFramework.getDataset(ForwardingDatasetFramework.java:160) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.metadata.writer.LineageWriterDatasetFramework.access$101(LineageWriterDatasetFramework.java:55) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.metadata.writer.LineageWriterDatasetFramework$2.call(LineageWriterDatasetFramework.java:172) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.metadata.writer.LineageWriterDatasetFramework$2.call(LineageWriterDatasetFramework.java:169) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.dataset2.DefaultDatasetRuntimeContext.execute(DefaultDatasetRuntimeContext.java:121) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              at co.cask.cdap.data2.metadata.writer.LineageWriterDatasetFramework.getDataset(LineageWriterDatasetFramework.java:167) ~[co.cask.cdap.cdap-data-fabric-3.5.2-SNAPSHOT.jar:na]
              ... 29 common frames omitted
      Caused by: java.lang.IllegalArgumentException: Not connected to zooKeeper.
              at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92) ~[guava-13.0.1.jar:na]
              at org.apache.twill.internal.zookeeper.DefaultZKClientService.getZooKeeper(DefaultZKClientService.java:344) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.internal.zookeeper.DefaultZKClientService.getChildren(DefaultZKClientService.java:254) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.zookeeper.ForwardingZKClient.getChildren(ForwardingZKClient.java:71) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.internal.zookeeper.FailureRetryZKClient.getChildren(FailureRetryZKClient.java:99) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.zookeeper.ForwardingZKClient.getChildren(ForwardingZKClient.java:71) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.internal.zookeeper.RewatchOnExpireZKClient.getChildren(RewatchOnExpireZKClient.java:68) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.zookeeper.ZKClientServices$1.getChildren(ZKClientServices.java:94) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.internal.zookeeper.NamespaceZKClient.getChildren(NamespaceZKClient.java:84) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.zookeeper.ZKOperations$5.exec(ZKOperations.java:160) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.zookeeper.ZKOperations.watchChanges(ZKOperations.java:314) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.zookeeper.ZKOperations.watchChildren(ZKOperations.java:151) ~[org.apache.twill.twill-zookeeper-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.discovery.ZKDiscoveryService$5.load(ZKDiscoveryService.java:330) ~[org.apache.twill.twill-discovery-core-0.7.0-incubating.jar:0.7.0-incubating]
              at org.apache.twill.discovery.ZKDiscoveryService$5.load(ZKDiscoveryService.java:323) ~[org.apache.twill.twill-discovery-core-0.7.0-incubating.jar:0.7.0-incubating]
              at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337) ~[guava-13.0.1.jar:na]
              at com.google.common.cache.Lo
      

      TL;DR: DynamicPartitioner is broken, as is any other custom OutputCommitter that needs dataset access.

              People

              • Assignee: Terence Yim (terence)
              • Reporter: Ali Anwar (ali.anwar)