Explanation of Supervisor Exceptions: LOST_CONTACT_WITH_STREAM, UNABLE_TO_CONNECT_TO_STREAM, UNHEALTHY_TASKS

QUESTION:

What are the streaming exceptions for supervisors, and how do they differ? Sometimes the supervisor reports LOST_CONTACT_WITH_STREAM and at other times UNABLE_TO_CONNECT_TO_STREAM. Both are reported when the supervisor is unhealthy, in place of the generic UNHEALTHY_SUPERVISOR state. UNHEALTHY_TASKS is different: it means that one or more of the most recent tasks have failed. That status clears once one round of tasks completes successfully.

ANSWER:

UNHEALTHY_SUPERVISOR:

Below is the code block responsible for determining the exception for the supervisor:

github link to code

protected State getSpecificUnhealthySupervisorState()
{
  ExceptionEvent event = getRecentEventsQueue().getLast();
  if (event instanceof SeekableStreamExceptionEvent && ((SeekableStreamExceptionEvent) event).isStreamException()) {
    return isAtLeastOneSuccessfulRun()
           ? SeekableStreamState.LOST_CONTACT_WITH_STREAM
           : SeekableStreamState.UNABLE_TO_CONNECT_TO_STREAM;
  }

  return BasicState.UNHEALTHY_SUPERVISOR;
}

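This code looks at the most recent entry in the recent events queue and checks whether it is a stream exception. If it is, and isAtLeastOneSuccessfulRun() returns true (there has been at least one successful run), the supervisor shows LOST_CONTACT_WITH_STREAM. If it is a stream exception but there has not yet been a successful run, the supervisor shows UNABLE_TO_CONNECT_TO_STREAM. If the most recent event is not a stream exception, the supervisor shows the generic UNHEALTHY_SUPERVISOR.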
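
The decision described above can be modeled in isolation. The following is a minimal sketch, not Druid's actual classes: SupervisorStateSketch, ErrorEvent, and specificUnhealthyState are hypothetical names introduced for illustration, and the empty-queue check is an addition that the original snippet does not have.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of the state decision; not part of Druid.
final class SupervisorStateSketch {
    enum State { UNHEALTHY_SUPERVISOR, LOST_CONTACT_WITH_STREAM, UNABLE_TO_CONNECT_TO_STREAM }

    // Stand-in for Druid's exception event: only the fields this sketch needs.
    static final class ErrorEvent {
        final String exceptionClass;
        final boolean streamException;

        ErrorEvent(String exceptionClass, boolean streamException) {
            this.exceptionClass = exceptionClass;
            this.streamException = streamException;
        }
    }

    // Only the most recent event decides which specific state is reported.
    static State specificUnhealthyState(Deque<ErrorEvent> recentEvents, boolean atLeastOneSuccessfulRun) {
        ErrorEvent last = recentEvents.peekLast(); // null when the queue is empty (assumption: treat as generic)
        if (last != null && last.streamException) {
            return atLeastOneSuccessfulRun
                    ? State.LOST_CONTACT_WITH_STREAM
                    : State.UNABLE_TO_CONNECT_TO_STREAM;
        }
        return State.UNHEALTHY_SUPERVISOR;
    }

    public static void main(String[] args) {
        Deque<ErrorEvent> events = new ArrayDeque<>();
        events.add(new ErrorEvent("java.lang.IllegalStateException", false));
        events.add(new ErrorEvent("com.amazonaws.services.kinesis.model.LimitExceededException", true));

        System.out.println(specificUnhealthyState(events, true));  // prints LOST_CONTACT_WITH_STREAM
        System.out.println(specificUnhealthyState(events, false)); // prints UNABLE_TO_CONNECT_TO_STREAM
    }
}
```

Note that an older stream exception followed by a newer non-stream exception yields UNHEALTHY_SUPERVISOR in this model, because only the last event is inspected.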

The recentErrors list in the supervisor status shows whether each error is considered a stream exception: every entry ends with a streamException field, which is true for stream exceptions.

"recentErrors": [
  {
    "timestamp": "2022-06-22T19:00:00.624Z",
    "exceptionClass": "com.amazonaws.services.kinesis.model.LimitExceededException",
    "message": "com.amazonaws.services.kinesis.model.LimitExceededException: Rate exceeded for stream live-ga-annotated-events under account 118928031713. (Service: AmazonKinesis; Status Code: 400; Error Code: LimitExceededException; Request ID: ca2d05f1-12dc-2ef0-9100-6d0148561ccd)",
    "streamException": true
  },
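
To check these entries programmatically, one option is to fetch the supervisor status over HTTP and count the stream exceptions. The sketch below assumes the standard Druid supervisor status endpoint (/druid/indexer/v1/supervisor/<supervisorId>/status); RecentErrorsSketch, statusUri, and countStreamExceptions are hypothetical names, the base URL and supervisor ID are supplied by the caller, and the regular-expression match is a deliberate shortcut rather than real JSON parsing.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Druid.
final class RecentErrorsSketch {
    // Matches "streamException": true with any amount of whitespace around the colon.
    private static final Pattern STREAM_EXCEPTION_TRUE =
            Pattern.compile("\"streamException\"\\s*:\\s*true");

    // Builds the status URL. Assumes the supervisor ID needs no URL encoding.
    static URI statusUri(String baseUrl, String supervisorId) {
        return URI.create(baseUrl + "/druid/indexer/v1/supervisor/" + supervisorId + "/status");
    }

    // Counts entries flagged as stream exceptions in the raw status JSON text.
    static int countStreamExceptions(String statusJson) {
        Matcher matcher = STREAM_EXCEPTION_TRUE.matcher(statusJson);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: base URL of the Overlord or Router; args[1]: supervisor ID.
        HttpRequest request = HttpRequest.newBuilder(statusUri(args[0], args[1])).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("stream exceptions in status: " + countStreamExceptions(response.body()));
    }
}
```

For anything beyond a quick check, a JSON library would be a more robust way to read the recentErrors array than a text match.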

 

UNHEALTHY_TASKS:

Below is the code block responsible for determining whether or not to display UNHEALTHY_TASKS.

github link to code

if (consecutiveFailedTasks >= supervisorStateManagerConfig.getTaskUnhealthinessThreshold()) {
  hasHitTaskUnhealthinessThreshold = true;
  supervisorState = BasicState.UNHEALTHY_TASKS;
  return;
}
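
The threshold check above can be illustrated with a small counter. This is a simplified sketch under stated assumptions: TaskHealthSketch and its methods are hypothetical names, the threshold of 3 in the usage is only an illustrative value, and the recovery rule (clearing the flag on the first success) follows this article's description rather than the full logic in SupervisorStateManager.

```java
// Hypothetical, simplified model of the consecutive-failure check; not part of Druid.
final class TaskHealthSketch {
    private final int taskUnhealthinessThreshold;
    private int consecutiveFailedTasks = 0;
    private boolean unhealthyTasks = false;

    TaskHealthSketch(int taskUnhealthinessThreshold) {
        this.taskUnhealthinessThreshold = taskUnhealthinessThreshold;
    }

    // Records one finished task and updates the health flag.
    void recordTaskResult(boolean succeeded) {
        if (succeeded) {
            // Assumption: a single success resets the streak and clears the flag.
            consecutiveFailedTasks = 0;
            unhealthyTasks = false;
            return;
        }
        consecutiveFailedTasks++;
        if (consecutiveFailedTasks >= taskUnhealthinessThreshold) {
            unhealthyTasks = true;
        }
    }

    boolean isUnhealthyTasks() {
        return unhealthyTasks;
    }

    public static void main(String[] args) {
        TaskHealthSketch health = new TaskHealthSketch(3); // illustrative threshold
        health.recordTaskResult(false);
        health.recordTaskResult(false);
        System.out.println(health.isUnhealthyTasks()); // prints false
        health.recordTaskResult(false);
        System.out.println(health.isUnhealthyTasks()); // prints true
        health.recordTaskResult(true);
        System.out.println(health.isUnhealthyTasks()); // prints false
    }
}
```

The key point is that failures must be consecutive: a success in between resets the counter, so isolated task failures do not trigger UNHEALTHY_TASKS in this model.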

 

REFERENCES:

The code for unhealthy tasks is in SupervisorStateManager.java, linked below. It is a useful reference point for understanding how the health of supervisors and tasks is determined:

https://github.com/implydata/druid/blob/7a3da28185741658217b5648ace14ba1b27d4a1d/server/src/main/java/org/apache/druid/indexing/overlord/supervisor/SupervisorStateManager.java

 
