We have observed high-throughput streaming supervisors occasionally throwing the following error:
No such previous checkpoint [...] found
This KB article describes how this error can arise from a race condition between checkpointing driven by the Overlord/supervisor and checkpointing requested by a task.
These errors are transient and can be ignored. There should be no impact on ingestion throughput.
Note: this error is not to be confused with the following error, which requires a supervisor reset:
Previous sequenceNumber [...] is no longer available for partition[...]. You can clear the previous sequenceNumber and start reading from a valid message by using the supervisor's reset API.
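For completeness: if you see the offset-unavailability error above (not the checkpoint race this article covers), the supervisor is reset through the Overlord API. The snippet below is a minimal sketch, assuming a hypothetical Overlord at http://localhost:8090 and a hypothetical supervisor id of my-datasource; a reset clears the stored offsets, so only use it for that error.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SupervisorReset
{
  public static void main(String[] args) throws Exception
  {
    // Hypothetical Overlord address and supervisor id; adjust for your cluster.
    String overlord = "http://localhost:8090";
    String supervisorId = "my-datasource";

    // POST to the supervisor reset endpoint. This clears the stored offsets,
    // so use it only for the "Previous sequenceNumber ... is no longer available" error.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(overlord + "/druid/indexer/v1/supervisor/" + supervisorId + "/reset"))
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}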
Symptoms:
- the "No such previous checkpoint found" message is seen in the "recentErrors" section of the Supervisor status in the web console. This error is also present in overlord logs.
- some task failures may be observed in the web console if the overlord sends a task shutdown request. The task logs themselves will show Success as long as the task shutdown is successful.
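Where the web console is not convenient, the same status payload can be retrieved from the Overlord's supervisor API. Below is a minimal sketch, assuming a hypothetical Overlord at http://localhost:8090 and a hypothetical supervisor id of my-datasource; it fetches the status and does a crude string check to tell the transient checkpoint race apart from the offset-unavailability error that needs a reset.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SupervisorStatusCheck
{
  public static void main(String[] args) throws Exception
  {
    // Hypothetical Overlord address and supervisor id; adjust for your cluster.
    String overlord = "http://localhost:8090";
    String supervisorId = "my-datasource";

    // GET the supervisor status; the detailed payload includes recent errors.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(overlord + "/druid/indexer/v1/supervisor/" + supervisorId + "/status"))
        .GET()
        .build();

    String body = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body();

    // Crude check: the transient race shows up as "No such previous checkpoint",
    // while offset unavailability shows up as "is no longer available".
    if (body.contains("No such previous checkpoint")) {
      System.out.println("Transient checkpoint race observed; no action needed.");
    }
    System.out.println(body);
  }
}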
Cause:
The error results from a race condition between checkpointing driven by the Overlord/supervisor and checkpointing requested by a task. It is thrown from the checkpoint notice handler in the Druid code: the handler walks the task group's in-memory checkpoint sequences looking for the checkpoint named in the request and, if none matches, throws the ISE below.
@Override
public void handle() throws ExecutionException, InterruptedException
{
  // check for consistency
  // if already received request for this sequenceName and dataSourceMetadata combination then return
  final TaskGroup taskGroup = activelyReadingTaskGroups.get(taskGroupId);

  if (isValidTaskGroup(taskGroupId, taskGroup)) {
    final TreeMap<Integer, Map<PartitionIdType, SequenceOffsetType>> checkpoints = taskGroup.checkpointSequences;

    // check validity of previousCheckpoint
    int index = checkpoints.size();
    for (int sequenceId : checkpoints.descendingKeySet()) {
      Map<PartitionIdType, SequenceOffsetType> checkpoint = checkpoints.get(sequenceId);
      // We have already verified the stream of the current checkpoint is same with that in ioConfig.
      // See checkpoint().
      if (checkpoint.equals(checkpointMetadata.getSeekableStreamSequenceNumbers()
                                              .getPartitionSequenceNumberMap())) {
        break;
      }
      index--;
    }

    if (index == 0) {
      throw new ISE("No such previous checkpoint [%s] found", checkpointMetadata);
    } else if (index < checkpoints.size()) {
      // if the found checkpoint is not the latest one then already checkpointed by a replica
      Preconditions.checkState(index == checkpoints.size() - 1, "checkpoint consistency failure");
      log.info("Already checkpointed with sequences [%s]", checkpoints.lastEntry().getValue());
      return;
    }

    final Map<PartitionIdType, SequenceOffsetType> newCheckpoint = checkpointTaskGroup(taskGroup, false).get();
    taskGroup.addNewCheckpoint(newCheckpoint);
    log.info("Handled checkpoint notice, new checkpoint is [%s] for taskGroup [%s]", newCheckpoint, taskGroupId);
  }
}
Here is an example sequence of events (a simplified sketch of the failing checkpoint lookup follows the list):
1. A checkpointing request was sent from the taskGroup reading partition A at offset = a.
2. A few seconds later, the taskGroup that includes partition A reached its taskDuration limit, so the supervisor told the taskGroup to publish its segments.
3. Because the supervisor had begun finalizing ingestion, it immediately paused the taskGroup, collected currentOffsets to find the maximum offsets, and resumed the taskGroup.
4. The supervisor updated the in-memory offsets for partition A to offset = b, corresponding to the offset of the latest checkpoint.
5. The supervisor then came back to the checkpointing request received in step 1 and tried to process it. The request failed because the offset metadata in memory had already been updated. This is harmless because the segments were already handled in steps 2-4.
6. The supervisor spawned another task for partition A, and that task successfully published new segments with offset = b in the metadata store. This happened a few minutes after step 3 because publishing segments (and updating the metadata store) is done by the task in step 3.
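To make the failure concrete, the following is a minimal, self-contained sketch (hypothetical classes, not the actual Druid code; partition ids and offsets are plain integers and longs) of the lookup loop from the handle() method shown above. By the time the stale request from step 1 is processed, the taskGroup's in-memory checkpoint sequences contain only the newer checkpoint at offset = b, so the loop never finds offset = a, index reaches 0, and the "No such previous checkpoint found" error is raised.

import java.util.Map;
import java.util.TreeMap;

// Simplified, hypothetical re-creation of the checkpoint lookup in the
// supervisor's checkpoint notice handler, for illustration only.
public class CheckpointRaceSketch
{
  public static void main(String[] args)
  {
    // In-memory checkpoint sequences of the taskGroup, keyed by sequence id.
    // After the rollover in steps 2-4, only the newer checkpoint
    // (partition A at offset = b, here 200) remains.
    TreeMap<Integer, Map<Integer, Long>> checkpointSequences = new TreeMap<>();
    checkpointSequences.put(1, Map.of(0, 200L));

    // The stale request from step 1 still refers to partition A at offset = a (here 100).
    Map<Integer, Long> requestedCheckpoint = Map.of(0, 100L);

    // Same walk as in handle(): scan checkpoints from newest to oldest.
    int index = checkpointSequences.size();
    for (int sequenceId : checkpointSequences.descendingKeySet()) {
      if (checkpointSequences.get(sequenceId).equals(requestedCheckpoint)) {
        break;
      }
      index--;
    }

    if (index == 0) {
      // Corresponds to the ISE thrown in the real handler.
      System.out.println("No such previous checkpoint " + requestedCheckpoint + " found");
    }
  }
}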
Impact:
This should not impact data availability and is not a cause of increased lag.
Remediation:
Because the error is transient and harmless, no action is usually required. Reducing taskCount in the supervisor spec's ioConfig may decrease the likelihood of encountering the race condition.