SYMPTOM : Ingestion from AWS S3 fails with the following error messages in the task log:
2018-02-15T17:18:05,600 INFO [<DATASOURCE_NAME>] io.druid.segment.realtime.appenderator.AppenderatorImpl - Committing metadata[AppenderatorDriverMetadata{activeSegments={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1]}, publishPendingSegments={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1]}, lastSegmentIds={<SEGMENT_NAME>_2018-02-02T23:00:00.000Z=<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1}, callerMetadata=null}] for sinks[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1:1].
2018-02-15T17:18:05,600 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorDriver - Persisted pending data in 368ms.
2018-02-15T17:18:05,601 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down...
2018-02-15T17:18:05,601 INFO [appenderator_persist_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Removing sink for segment[<DATASOURCE_NAME>_2018-02-02T23:00:00.000Z_2018-02-03T00:00:00.000Z_2018-02-15T16:40:08.642Z_1].
2018-02-15T17:18:05,605 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=<SEGMENT_NAME>_2018-02-02T23:00:00.000Z, type=index, dataSource=<DATASOURCE_NAME>}]
java.lang.IllegalStateException: java.net.SocketException: Connection reset
    at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:106) ~[commons-io-2.5.jar:2.5]
    at io.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:66) ~[druid-api-0.11.0-iap4.jar:0.11.0-iap4]
    at io.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:628) ~[druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
    at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:233) ~[druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.11.0-iap4.jar:0.11.0-iap4]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_152]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_152]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_152]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_152]
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_152]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_152]
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[?:1.8.0_152]
    at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593) ~[?:1.8.0_152]
    at sun.security.ssl.InputRecord.read(InputRecord.java:532) ~[?:1.8.0_152]
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983) ~[?:1.8.0_152]
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940) ~[?:1.8.0_152]
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) ~[?:1.8.0_152]
    at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198) ~[httpcore-4.4.3.jar:4.4.3]
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[httpcore-4.4.3.jar:4.4.3]
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) ~[httpclient-4.5.1.jar:4.5.1]
    at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78) ~[jets3t-0.9.4.jar:0.9.4]
    at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:146) ~[jets3t-0.9.4.jar:0.9.4]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_152]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_152]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_152]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_152]
    at java.io.BufferedReader.fill(BufferedReader.java:161) ~[?:1.8.0_152]
    at java.io.BufferedReader.readLine(BufferedReader.java:324) ~[?:1.8.0_152]
    at java.io.BufferedReader.readLine(BufferedReader.java:389) ~[?:1.8.0_152]
    at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:95) ~[commons-io-2.5.jar:2.5]
    ... 9 more
2018-02-15T17:18:05,610 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [<SEGMENT_NAME>_2018-02-02T23:00:00.000Z] status changed to [FAILED].
ROOT CAUSE : The error means the connection was reset while reading input data from S3. This is a common but transient S3 error, and Druid can normally retry such transient failures. However, in the current release, retrying is only enabled when prefetching is also enabled. Prefetching is a feature that downloads input files from S3 to local disk ahead of time to speed up ingestion.
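To illustrate why retrying resolves this class of failure, here is a minimal sketch (not Druid's actual code) of reading with retries and exponential backoff: a transient connection reset on one attempt does not fail the whole task, because the read is simply attempted again. The function and parameter names are hypothetical.

```python
import random
import time

def read_with_retry(read_fn, max_retries=3, base_delay=0.1):
    """Call read_fn, retrying transient connection resets up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return read_fn()
        except ConnectionResetError:
            if attempt == max_retries:
                raise  # give up after the configured number of retries
            # exponential backoff with jitter before the next attempt
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Simulated flaky source: fails twice with "Connection reset", then succeeds.
attempts = {"n": 0}
def flaky_read():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionResetError("Connection reset")
    return "line of input data"

print(read_with_retry(flaky_read))  # succeeds on the third attempt
```

In Druid itself, this retry logic applies to the prefetched local copy of the S3 object, which is why prefetching must be enabled for it to take effect.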
RESOLUTION : To enable prefetching, set the following configurations in your ioConfig, or leave them null to use their default values. See http://druid.io/docs/latest/development/extensions-core/s3.html#statics3firehose for more details.
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "static-s3",
"prefixes" : [ "s3://path/to/data/2018/02/02/23/" ],
"maxCacheCapacityBytes" : 1073741824,
"maxFetchCapacityBytes" : 1073741824,
"prefetchTriggerBytes" : 536870912,
"fetchTimeout" : 200000,
"maxFetchRetry" : 3
},
"appendToExisting" : false
},
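The byte values in the example are round binary sizes: the cache and fetch capacities are 1 GiB each, and the prefetch trigger is 512 MiB, i.e. half the fetch capacity, so (roughly speaking) a new prefetch starts once the buffered data drops to about half of capacity. A quick arithmetic check:

```python
GiB = 1024 ** 3  # 1 gibibyte

# maxCacheCapacityBytes and maxFetchCapacityBytes from the ioConfig above
assert 1073741824 == GiB
# prefetchTriggerBytes is half the fetch capacity (512 MiB)
assert 536870912 == GiB // 2
print("capacity values check out")
```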
In our next release, retrying will be available even if prefetching is disabled.
Comments
Connection reset simply means that a TCP RST was received. This happens when your peer receives data that it can't process, and there can be various reasons for that. The simplest is when you close the socket, and then write more data on the output stream. By closing the socket, you told your peer that you are done talking, and it can forget about your connection. When you send more data on that stream anyway, the peer rejects it with an RST to let you know it isn't listening.
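The RST behavior described in the comment can be demonstrated with a small, self-contained sketch (not Druid code): setting SO_LINGER to zero makes close() abort the connection with a TCP RST instead of a normal FIN, and the peer then observes ConnectionResetError. This is an illustration only; RST-on-close with zero linger is verified on Linux and may differ on other platforms.

```python
import socket
import struct
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def abort_first_connection():
    conn, _ = server.accept()
    # l_onoff=1, l_linger=0 -> close() sends RST rather than FIN
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()

t = threading.Thread(target=abort_first_connection)
t.start()

client = socket.create_connection(("127.0.0.1", port))
t.join()  # ensure the server side has already aborted the connection

reset_seen = False
try:
    client.sendall(b"hello")  # may already fail, depending on timing
    client.recv(1024)         # reading after the RST raises the error
except ConnectionResetError as exc:
    reset_seen = True
    print("peer observed:", exc)
finally:
    client.close()
    server.close()
```

In the Druid failure above, the RST comes from the S3 side (or an intermediary) mid-transfer, which is exactly the transient condition that retrying with prefetching enabled is designed to absorb.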