Using single-dimension partitioning to improve read performance

Ben

Updated October 26, 2020 22:34

OBJECTIVE:

Use single-dimension partitioning to improve performance for queries that filter on the partition dimension.

ADDITIONAL INFORMATION:

In some cases, it can be helpful to have data segments sorted by a particular dimension. If queries usually filter on one dimension, sorting by that dimension might reduce seek times and improve query performance. A typical example is a multi-tenant application that stores all tenants in one datasource and distinguishes them with a tenantId dimension.

PROCEDURE:

This sorting can be achieved with single-dimension partitioning. (See 'single_dim' on the partitionsSpec doc page.)
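As a rough illustration, the sketch below builds the tuningConfig portion of a native batch ingestion spec that requests single-dimension partitioning. The dimension name tenantId and the row target of 5,000,000 are assumptions chosen for the example, and the helper function is hypothetical. Check the partitionsSpec doc page for the fields your Druid version supports.

```python
import json


def single_dim_tuning_config(partition_dimension, target_rows_per_segment):
    """Build a tuningConfig dict that asks for single_dim partitioning.

    Hypothetical helper for illustration; the values passed in are assumptions.
    """
    return {
        "type": "index_parallel",
        # Parallel batch ingestion needs guaranteed rollup for single_dim.
        "forceGuaranteedRollup": True,
        "partitionsSpec": {
            "type": "single_dim",
            "partitionDimension": partition_dimension,
            "targetRowsPerSegment": target_rows_per_segment,
        },
    }


# Assumed example values: partition on tenantId, about 5 million rows per segment.
tuning_config = single_dim_tuning_config("tenantId", 5_000_000)
print(json.dumps(tuning_config, indent=2))
```

The printed JSON can be pasted into the tuningConfig section of an ingestion spec and adjusted from there.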

Data is always sorted first by timestamp. With single-dimension partitioning, it is sorted secondarily by the specified partition dimension. Normally, this would mean that the different values (eg, tenantId values) are still scattered throughout the data, because rows are ordered by their full-resolution timestamps first. To avoid this, create a secondary timestamp column from the same input timestamp data used for the primary one, then set (eg) queryGranularity=segmentGranularity=1 hour. Queries then filter on the desired dimension and the secondary timestamp. This way, the primary timestamps are all truncated to (eg) 1 hour buckets, and within each bucket, data is sorted by the partition dimension.
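The effect of truncating the primary timestamp can be seen in a small simulation. This is only a sketch of the sort order described above, not Druid code: the rows, the tenant names, and the helper functions are all made up for the example. It counts how many contiguous runs of each tenantId appear when rows are sorted by (timestamp, tenantId), first with full-resolution timestamps and then with timestamps truncated to the hour.

```python
from datetime import datetime, timedelta
from itertools import groupby


def truncate_to_hour(ts):
    """Mimic an hourly queryGranularity by dropping minutes and seconds."""
    return ts.replace(minute=0, second=0, microsecond=0)


def tenant_runs(rows, time_key):
    """Sort by (time, tenantId) and return contiguous runs of each tenant."""
    ordered = sorted(rows, key=lambda r: (time_key(r), r["tenantId"]))
    return [
        (tenant, len(list(group)))
        for tenant, group in groupby(ordered, key=lambda r: r["tenantId"])
    ]


# Made-up data: 12 rows, one per minute, cycling through three tenants.
base = datetime(2020, 10, 26, 22, 0, 0)
tenants = ["a", "b", "c"]
rows = [
    {"ts": base + timedelta(minutes=i), "tenantId": tenants[i % 3]}
    for i in range(12)
]

fine_grained = tenant_runs(rows, lambda r: r["ts"])
hourly = tenant_runs(rows, lambda r: truncate_to_hour(r["ts"]))

print(len(fine_grained))  # 12 runs: tenants interleave minute by minute
print(hourly)             # [('a', 4), ('b', 4), ('c', 4)]
```

With full-resolution timestamps, each tenant's rows are split into many short runs. With hourly truncation, every tenant's rows within the hour sit together, which is what lets a filter on tenantId read one contiguous range.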

Note that single-dimension partitioning is the slowest ingestion method, because data has to be sorted before it is written to segments. (With hash-based partitioning, ingestion only needs to calculate the hash and put the row into the appropriate segment.) Slower writes are the trade-off for improved read performance in this case.
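The difference in ingestion cost can be sketched as follows. This is a conceptual illustration only, not how Druid implements either method: the function names are hypothetical, MD5 stands in for whatever hash function the ingestion engine actually uses, and the tenant values are invented. The point is that a hash partition can be computed from a single row in isolation, while range boundaries can only be chosen after looking at (and ordering) the values of the whole dataset.

```python
import hashlib
from bisect import bisect_right


def hash_partition(value, num_partitions):
    """Assign a partition from the row's value alone (no global knowledge)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()  # stand-in hash
    return int.from_bytes(digest[:4], "big") % num_partitions


def range_boundaries(values, num_partitions):
    """Pick split points; this needs every value, sorted, before any write."""
    ordered = sorted(values)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(value, boundaries):
    """Assign a partition by locating the value among the split points."""
    return bisect_right(boundaries, value)


# Invented tenant IDs t01 .. t12, split into three partitions.
tenant_ids = ["t%02d" % i for i in range(1, 13)]
boundaries = range_boundaries(tenant_ids, 3)

print(boundaries)                          # ['t05', 't09']
print(range_partition("t03", boundaries))  # 0
print(range_partition("t07", boundaries))  # 1
print(range_partition("t11", boundaries))  # 2
print(hash_partition("t03", 3))            # some value in 0..2
```

Range assignment keeps neighbouring tenant values in the same partition, which is what benefits filtered reads, but it requires the extra pass over the data that makes ingestion slower.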