  1. CDAP
  2. CDAP-6109

SparkContext.saveAsDataset failed on PartitionedFileSet after reduceByKey call


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.4.1, 3.4.0
    • Fix Version/s: 3.5.0, 3.4.2
    • Component/s: Spark
    • Labels:
      None
    • Release Notes:
      Fixed a NullPointerException in Spark when saving an RDD to a PartitionedFileSet dataset.

      Description

      There is a bug in reusing an implicit transaction that uses async commit and changing it to a non-async commit one. Specifically, it fails with a usage pattern like this:

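      // RDD backed by the splits of the underlying Dataset
      val rdd = sc.fromDataset("somedataset").values()
      // reduceByKey triggers RDD.partitions, which needs a transaction
      rdd.reduceByKey(...)
         .saveAsDataset("partitionFileSet")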
      

      When RDD.reduceByKey is called on an RDD created from a Dataset, the RDD.partitions method is triggered, which requires a transaction in order to get the splits from the underlying Dataset. If there is no explicit transaction, a new implicit transaction is started and left open. That transaction is committed when the Spark job involving the reduceByKey RDD completes with an action. However, when RDD.saveAsDataset is called, it detects that a transaction is already open and reuses it. The correct logic here is to change the current transaction so that it is committed when the saveAsDataset method completes, rather than when the Spark job completes, because the onSuccess method on the PartitionedFileSet must be executed transactionally.
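
      The intended behavior described above can be sketched with a small toy model. This is not CDAP code: TxContext, Tx, CommitPolicy and Demo are hypothetical names invented for illustration, and the real transaction handling inside CDAP may be structured quite differently. The sketch only shows the ordering that the description asks for: the implicit transaction opened by RDD.partitions is reused by saveAsDataset, its commit is moved from "when the job ends" to "when the call ends", and onSuccess runs while the transaction is still open.

```scala
// Toy model only: none of these types are CDAP classes.
sealed trait CommitPolicy
case object CommitOnJobEnd extends CommitPolicy   // async commit when the Spark job finishes
case object CommitOnCallEnd extends CommitPolicy  // commit when the calling method returns

final class Tx(var policy: CommitPolicy) {
  var committed = false
}

final class TxContext {
  private var current: Option[Tx] = None
  val log = scala.collection.mutable.ArrayBuffer.empty[String]

  // Models RDD.partitions: start an implicit transaction if none is open.
  def getOrStartImplicit(): Tx = current.getOrElse {
    val tx = new Tx(CommitOnJobEnd)
    current = Some(tx)
    log += "start implicit tx"
    tx
  }

  def commit(tx: Tx): Unit = {
    tx.committed = true
    current = None
    log += "commit"
  }

  // Models the job-end hook: it commits only transactions it still owns.
  def jobEnded(tx: Tx): Unit =
    if (tx.policy == CommitOnJobEnd && !tx.committed) commit(tx)

  // Models saveAsDataset: reuse the open transaction, but take over its commit.
  def saveAsDataset(runJob: () => Unit, onSuccess: Tx => Unit): Unit = {
    val tx = getOrStartImplicit()
    tx.policy = CommitOnCallEnd // the job-end hook must no longer commit
    runJob()
    jobEnded(tx)                // no-op here because of the policy switch
    onSuccess(tx)               // runs while the transaction is still open
    commit(tx)
  }
}

object Demo {
  def run(): List[String] = {
    val ctx = new TxContext
    ctx.getOrStartImplicit() // what reduceByKey triggers via RDD.partitions
    ctx.saveAsDataset(
      runJob = () => ctx.log += "spark job",
      onSuccess = tx => ctx.log += s"onSuccess in open tx: ${!tx.committed}"
    )
    ctx.log.toList
  }
}
```

      In this model, leaving the policy at CommitOnJobEnd would let the job-end hook commit before onSuccess runs, which is the ordering problem the description points at.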

            People

            • Assignee:
              Terence Yim (terence)
            • Reporter:
              Terence Yim (terence)