CDAP / CDAP-9031

Database Sink gets out of memory condition

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.5.3
    • Fix Version/s: 5.0.0
    • Component/s: Pipelines

      Description

      We wrote a test batch Hydrator Pipeline to read a plain text CSV file and load it into a Teradata DB.
      [File]->[CSVParser]->[Database Sink]

      There was one file approximately 70MB in size with 780K records.
      All of the records were emitted out of CSVParser, but we got the following out-of-memory exception:

      Error running child : java.lang.OutOfMemoryError: Java heap space
          at com.teradata.jdbc.jdbc_4.io.TDPacketStream.<init>(TDPacketStream.java:84)
          at com.teradata.jdbc.jdbc_4.io.TDPacket.<init>(TDPacket.java:55)
          at com.teradata.jdbc.jdbc.GenericTeradataConnection.createPacket(GenericTeradataConnection.java:357)
          at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2161)
          at co.cask.hydrator.plugin.db.batch.sink.ETLDBOutputFormat$1.close(ETLDBOutputFormat.java:81)
          at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:670)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:497)
          at co.cask.cdap.internal.app.runtime.batch.distributed.MapReduceContainerLauncher.launch(MapReduceContainerLauncher.java:109)
          at org.apache.hadoop.mapred.YarnChild.main(Unknown Source)

      We configured the DB plugin to use the Teradata JDBC driver.
      One of my developers looked at your Database plugin, and he said the DB sink plugin does not execute the prepared statement until ALL of the data has been read from the incoming stream.

      While increasing the memory allocated to the worker from 512MB to 4096MB allowed the data to be loaded without error, I foresee issues in the future.
      At some point, the size of the incoming data will be too large to fit in the 4096MB worker allocation.

      The DB plugin shouldn't be buffering all the data in memory before attempting to load it to the database.
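      As a hedged illustration of the requested fix (this is not the actual ETLDBOutputFormat code, and the 10,000-record batch size is an assumption for the example), the sink's record writer could flush the JDBC batch periodically instead of buffering every record until close():

      ```java
      import java.sql.PreparedStatement;
      import java.sql.SQLException;

      // Sketch of a record writer that flushes the JDBC batch every
      // BATCH_SIZE records, so memory use stays bounded regardless of
      // how many records flow through the sink.
      public class ChunkedDbWriter {
          // Illustrative chunk size; a real plugin would make this configurable.
          private static final int BATCH_SIZE = 10_000;

          private final PreparedStatement statement;
          private int pending = 0;

          public ChunkedDbWriter(PreparedStatement statement) {
              this.statement = statement;
          }

          /** Add one record to the batch; flush once BATCH_SIZE records accumulate. */
          public void write(Object[] record) throws SQLException {
              for (int i = 0; i < record.length; i++) {
                  statement.setObject(i + 1, record[i]);
              }
              statement.addBatch();
              if (++pending >= BATCH_SIZE) {
                  statement.executeBatch(); // send this chunk to the database now
                  pending = 0;              // buffered rows can be garbage-collected
              }
          }

          /** Flush any remaining records; called from close() instead of the whole load. */
          public void close() throws SQLException {
              if (pending > 0) {
                  statement.executeBatch();
              }
              statement.close();
          }
      }
      ```

      With this pattern the 780K-record file would be sent in many small executeBatch() calls rather than one 70MB batch at close(), so the worker no longer needs its heap sized to the input.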

            People

            • Assignee:
              Vinisha Shah (vinisha)
            • Reporter:
              Leonid Fedotov (leonid)
            • Votes:
              0
            • Watchers:
              4
