CDAP / CDAP-16055

Spark pipeline lacks error message when there are output directory conflicts


    • Release Notes:
      Fixed a bug where the failure error message emitted by the Spark driver was not being collected.

      When I run multiple pipelines concurrently, each writing to the same directory, using the MapReduce engine, a subset may fail because the others are writing to the same output directory. Note that the pipeline writes to a subdirectory partitioned by minute, so more than one run in the set can succeed if they complete in different minutes.
      I can appropriately see the error message in the logs (see failedMR.txt for full pipeline logs):

      Caused by: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory gs://test-new-df-folder/tmp/2019-10-21-14-16 already exists
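The minute-granularity partitioning makes the collision window concrete. Below is a minimal sketch in plain Java (a hypothetical stand-in for Hadoop's FileOutputFormat.checkOutputSpecs, which is what actually raises org.apache.hadoop.mapred.FileAlreadyExistsException; here java.nio.file.FileAlreadyExistsException is used instead) showing why two runs that reach the output check in the same minute conflict, while runs completing in different minutes do not:

```java
import java.nio.file.*;
import java.time.*;
import java.time.format.DateTimeFormatter;

public class OutputDirCheck {
    // Mirrors the sink's minute-granularity partitioning,
    // e.g. tmp/2019-10-21-14-16 (hypothetical path scheme from the logs).
    static String partitionPath(Instant now) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm")
                .withZone(ZoneOffset.UTC)
                .format(now);
    }

    // Stand-in for FileOutputFormat.checkOutputSpecs: fail if the output
    // directory already exists, otherwise claim it by creating it.
    static void checkAndCreate(Path base, Instant now) throws Exception {
        Path out = base.resolve(partitionPath(now));
        if (Files.exists(out)) {
            throw new FileAlreadyExistsException("Output directory " + out + " already exists");
        }
        Files.createDirectories(out);
    }

    public static void main(String[] args) throws Exception {
        Path base = Files.createTempDirectory("tmp-out");
        Instant t = Instant.parse("2019-10-21T14:16:00Z");

        checkAndCreate(base, t); // first pipeline claims the minute directory
        try {
            checkAndCreate(base, t); // second pipeline, same minute: conflict
        } catch (FileAlreadyExistsException e) {
            System.out.println("conflict: " + e.getMessage());
        }

        // A run completing one minute later targets a different directory.
        checkAndCreate(base, t.plus(Duration.ofMinutes(1)));
        System.out.println("next-minute run succeeded");
    }
}
```

With MapReduce this check fires and the FileAlreadyExistsException above reaches the pipeline logs; the bug reported here is that the equivalent failure under the Spark engine is usually not surfaced.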

      However, when I run the same pipelines with the Spark execution engine, I do not see such an error message, even though the pipelines still fail. See failedSpark[1-5].txt for logs from such pipeline runs.

      I did encounter one run where the error message was appropriately logged. See failedSpark6.txt.

      I have attached the pipeline as q-cdap-data-pipeline.json.


        1. failedMR.txt
          62 kB
        2. failedSpark1.txt
          109 kB
        3. failedSpark2.txt
          93 kB
        4. failedSpark3.txt
          89 kB
        5. failedSpark5.txt
          94 kB
        6. failedSpark6.txt
          96 kB
        7. q-cdap-data-pipeline.json
          7 kB



            • Assignee:
              Terence Yim (terence), Ali Anwar (ali.anwar)
            • Votes:
              0
            • Watchers:
              3


              • Created: