This article collects a set of options for hardening a Druid system against demanding scenarios, such as a GROUP BY on a large dataset that may return a very large query response. It also provides guidelines for setting these parameters.
Querying a large dataset
Druid provides many tunable parameters at the broker and query level. The following advanced configurations are currently supported by Druid.
Property | Description | Default | Recommendation |
druid.server.http.defaultQueryTimeout | Query timeout in milliseconds, beyond which unfinished queries will be cancelled. A timeout of 0 means no timeout. To set the default timeout, see Broker configuration. | 300000 ms | |
druid.broker.http.readTimeout | The timeout for data reads from Historical servers and real-time tasks. | PT15M | |
druid.router.http.readTimeout | Timeout of in-flight query responses. The Router will terminate running queries if they run longer than this duration. | PT8M | |
druid.broker.http.maxQueuedBytes | Maximum number of bytes queued per query before exerting backpressure on the channel to the data server. Similar to | 0 (disabled) | |
druid.processing.buffer.sizeBytes | This specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime processes will use a scratch buffer of this size to do all of their intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data while smaller values can require more passes depending on the query that is being executed. | auto (max 1GB) | |
druid.query.groupBy.maxMergingDictionarySize | Maximum amount of heap space (approximately) to use for the string dictionary during merging. When the dictionary exceeds this size, a spill to disk will be triggered. | 100000000 | |
druid.query.groupBy.maxOnDiskStorage | Amount of space on disk used for aggregation, per query, in bytes. By default, this is 0, which means aggregation will not use disk. | 0 (disabled) | |
Note: Please check your Druid version for support of these advanced options.
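For the query-level side of these settings, limits can be sent with an individual request in the query context instead of in server configuration. The following is a minimal sketch in Python, assuming your Druid version accepts the context keys timeout and maxOnDiskStorage and that a per-query maxOnDiskStorage may only lower the server-side limit. The function name and the values shown are illustrative, not recommendations.

```python
import json


def build_sql_payload(sql, timeout_ms=300000, max_on_disk_bytes=0, result_format="array"):
    """Build the JSON body for a Druid SQL request (POST to /druid/v2/sql).

    The context keys below are assumptions about your Druid version;
    check the query context documentation before relying on them.
    """
    context = {"timeout": timeout_ms}  # per-query timeout in milliseconds
    if max_on_disk_bytes > 0:
        # Only include the disk-spill limit when the caller asks for one.
        context["maxOnDiskStorage"] = max_on_disk_bytes
    return {"query": sql, "resultFormat": result_format, "context": context}


if __name__ == "__main__":
    # "my_datasource" is a placeholder table name.
    payload = build_sql_payload(
        "SELECT channel, COUNT(*) FROM my_datasource GROUP BY channel",
        timeout_ms=600000,
        max_on_disk_bytes=1_000_000_000,
    )
    print(json.dumps(payload, indent=2))
```

Keeping the payload construction in one function makes it easier to apply the same timeout and result format to every query a client sends.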
Downloading a large dataset
Apply HTTP Compression
Apache Druid (incubating) supports HTTP request decompression and response compression. To use this, set the Accept-Encoding: gzip header on your client requests.
This does not require server configuration.
Recommendation: Consider this when the network is more of a bottleneck than CPU, since compression trades CPU for network bandwidth. It is not recommended for smaller result sets, because it increases CPU usage on both the broker and the client.
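A client-side sketch of the header described above, using only the Python standard library. The helper names are hypothetical, and the URL in the usage comment is a placeholder for your own Router or Broker address. Note that urllib does not decompress responses automatically, so the sketch checks the Content-Encoding response header and decompresses the body itself.

```python
import gzip
import urllib.request


def build_gzip_request(url, body_bytes):
    """Create a POST request that asks the server for a gzip-compressed response."""
    req = urllib.request.Request(url, data=body_bytes, method="POST")
    req.add_header("Content-Type", "application/json")
    req.add_header("Accept-Encoding", "gzip")
    return req


def decode_body(raw, content_encoding):
    """Return the response bytes, decompressing them if the server used gzip."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


# Usage (placeholder URL, not executed here):
# req = build_gzip_request("http://localhost:8888/druid/v2/sql", payload_bytes)
# with urllib.request.urlopen(req) as resp:
#     data = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

A response cut off mid-stream will typically fail to decompress, so the decompression step can surface some truncation errors before the body is parsed.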
Detect Truncated Responses
A truncated response can be detected by validating the returned data. Each format type provides a way to detect whether the data is truncated. The table below shows how each response type behaves in this scenario. Some response formats are better suited than others, but the choice depends on what works with your external systems.
Always set resultFormat when using SQL queries.
Query Type | Parameter | Value | Content Type | Format Type | Detection | Recommendation |
SQL | resultFormat | array | application/json | JSON array of JSON arrays | Validate Response JSON | |
SQL | resultFormat | arrayLines | text/plain | Like "array", but the JSON arrays are separated by newlines instead of being wrapped in a JSON array. This can make it easier to parse the entire response set as a stream, if you do not have ready access to a streaming JSON parser. | Presence of one blank line at the end | |
Native Druid | resultAsArray | true | application/json | groupBy v2 queries now use an array-based representation of result rows, rather than the map-based representation used by prior versions of Druid. This provides faster generation and processing of result sets. Out of the box, this change is backwards-compatible. | Validate Response JSON | |
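The two detection methods in the table can be sketched in Python as follows. The function names are hypothetical. The arrayLines check assumes that a complete response ends with one blank line after the last row, so the body ends in two newline characters, or is a single newline when there are no rows. Verify this against the behavior of your Druid version.

```python
import json


def is_complete_array(body):
    """resultFormat=array: complete only if the whole body parses as a JSON array."""
    try:
        return isinstance(json.loads(body), list)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError; a cut-off body lands here.
        return False


def is_complete_array_lines(body):
    """resultFormat=arrayLines: complete only if the body ends with a blank line.

    Assumption: each row is followed by a newline and the response is closed
    by one extra blank line.
    """
    return body == "\n" or body.endswith("\n\n")
```

The JSON check also applies to native groupBy responses with resultAsArray set to true, since those are returned as a single JSON document.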
Other Formats
Druid supports additional formats, which are excluded from this KB article because they are less relevant to this scenario:
- resultFormat=[object, objectLines, csv] for SQL queries.
- resultFormat=[list, compactedList] for native Druid queries, which apply only to scan queries and not to GROUP BY.
Additional Recommendations
- Avoid "ORDER BY" wherever possible.
- Colocate clients with the Druid cluster to reduce network latency.
- Avoid server downtime.
- Implement retries for HTTP 500 errors and truncated responses.
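The retry recommendation can be sketched as a small wrapper. This is one possible shape, not a prescribed implementation: fetch is a hypothetical caller-supplied function that returns a (status, body) pair, is_complete is whichever truncation check matches the chosen result format, and the attempt count and backoff are illustrative defaults.

```python
import time


class TruncatedResponse(Exception):
    """Raised when every attempt returned HTTP 200 with an incomplete body."""


def query_with_retry(fetch, is_complete, max_attempts=3, backoff_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it returns a complete HTTP 200 response.

    Retries on HTTP 500 and on bodies that fail is_complete(); any other
    status is treated as non-retryable and raised immediately.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200 and is_complete(body):
            return body
        if status == 200:
            last_error = TruncatedResponse("response body failed validation")
        elif status == 500:
            last_error = RuntimeError("server returned HTTP 500")
        else:
            raise RuntimeError(f"non-retryable HTTP status {status}")
        if attempt < max_attempts:
            # Linear backoff: wait longer after each failed attempt.
            sleep(backoff_seconds * attempt)
    raise last_error
```

Because a retried GROUP BY over a large dataset is expensive, a small attempt limit with a growing delay is usually safer than retrying immediately.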
Further Reading
- https://druid.apache.org/docs/latest/operations/http-compression.html
- https://druid.apache.org/docs/latest/configuration/realtime.html
- https://druid.apache.org/docs/latest/querying/groupbyquery.html#query-context
- https://druid.apache.org/docs/latest/querying/sql
- https://docs.imply.io/3.1/on-prem/misc/release