The only thing that is not implemented for Spark is calling the onFailure() method of the DatasetOutputCommitter when the Spark job fails. For MapReduce the semantics are obvious: the callback applies to exactly one dataset, at the end of the job. Spark, however, can write to multiple datasets over the course of a job, and the semantics are less clear. If a write to one dataset succeeds, we call its onSuccess(); but if a subsequent write to another dataset fails (or the processing in between fails), do we also have to call onFailure() on the dataset that already succeeded, or only on the one that failed?
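To make the question concrete, here is a minimal sketch of one possible policy: fire onSuccess() as each write completes, and on failure call onFailure() only for the dataset whose write failed, leaving earlier, already-committed datasets untouched. The TrackingCommitter class, the dataset names, and the policy itself are hypothetical illustrations, not the actual DatasetOutputCommitter implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SparkCommitSketch {

  // Hypothetical stand-in for the two callbacks discussed above.
  interface DatasetOutputCommitter {
    void onSuccess();
    void onFailure();
  }

  // Records which callback was invoked for which dataset, so the
  // chosen policy is visible in the log.
  static class TrackingCommitter implements DatasetOutputCommitter {
    private final String name;
    private final List<String> log;

    TrackingCommitter(String name, List<String> log) {
      this.name = name;
      this.log = log;
    }

    public void onSuccess() { log.add(name + ":onSuccess"); }
    public void onFailure() { log.add(name + ":onFailure"); }
  }

  // One possible policy: onSuccess() per dataset as each write completes,
  // onFailure() only for the dataset whose write failed.
  static void runJob(List<String> log) {
    TrackingCommitter a = new TrackingCommitter("datasetA", log);
    TrackingCommitter b = new TrackingCommitter("datasetB", log);
    try {
      // First write succeeds; its committer is notified immediately.
      a.onSuccess();
      // Second write fails mid-job.
      throw new RuntimeException("write to datasetB failed");
    } catch (RuntimeException e) {
      // Only the failed dataset sees onFailure(); datasetA's
      // onSuccess() has already fired and is not rolled back.
      b.onFailure();
    }
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    runJob(log);
    System.out.println(log); // [datasetA:onSuccess, datasetB:onFailure]
  }
}
```

The alternative policy would call onFailure() on every dataset written so far, which matters if onSuccess() and onFailure() are not idempotent or if a dataset's commit can be undone.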
Since none of our datasets currently implement onFailure(), it is acceptable to leave this out for now, but we should address it soon.