Introduction
Compaction helps manage the segments of a given datasource. In many cases, keeping the number of segments per segmentGranularity interval small improves query performance and eases management. To enable automatic compaction, refer to the KB article in the References section below.
Compaction runs as a task and therefore needs a Middle Manager worker capacity, or in other words a slot, to run.
If enough slots are not available for compaction, it may never catch up with the newly generated segments, which can lead to query performance issues.
Determining the number of slots
Compaction Slot Ratio
Compaction gets its slots based on compactionTaskSlotRatio, which is set at the global level. By default this is 0.1, so at any given point in time compaction can use a maximum of 10% of the available slots.
Depending on the number of datasources and the segmentGranularity to be compacted, the default value might need to be increased.
To get the current compaction slot ratio, run an API command similar to the following example:
curl -ivL -X GET -H 'Content-Type: application/json' -u admin:Password -k https://coord-ip:8281/druid/coordinator/v1/config/compaction/
{
  "compactionConfigs": [
    {
      "dataSource": "parking-citations",
      "taskPriority": 25,
      "inputSegmentSizeBytes": 419430400,
      "maxRowsPerSegment": null,
      "skipOffsetFromLatest": "P1D",
      "tuningConfig": null,
      "taskContext": null
    }
  ],
  "compactionTaskSlotRatio": 0.1,
  "maxCompactionTaskSlots": 2147483647
}
To increase the slot ratio, a command similar to the following could be used:
curl -ivL -X POST -H 'Content-Type: application/json' -u admin:Password -k https://coord-ip:8281/druid/coordinator/v1/config/compaction/taskslots?ratio=0.4
The above command sets compactionTaskSlotRatio to 0.4 (40%).
maxCompactionTaskSlots caps the number of slots derived from compactionTaskSlotRatio. For example, suppose the cluster has a total of 100 slots and the following are set:
compactionTaskSlotRatio=0.2
maxCompactionTaskSlots=15
The ratio would allow 20 slots, but maxCompactionTaskSlots limits compaction to a maximum of 15 slots at any time.
Compaction is always granted at least 1 slot, even when the ratio works out to less than one slot (for example, when the cluster has fewer than 10 total slots at the default ratio of 0.1).
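Putting these rules together, the effective number of compaction slots can be sketched as follows. This is an approximation of the coordinator's behavior for sizing purposes, not its actual code:

```python
import math

def effective_compaction_slots(total_slots: int,
                               ratio: float = 0.1,
                               max_slots: int = 2**31 - 1) -> int:
    """Approximate the slots granted to compaction: apply the slot
    ratio to total capacity, cap by maxCompactionTaskSlots, and
    guarantee a floor of 1 slot."""
    by_ratio = math.floor(total_slots * ratio)
    return max(1, min(by_ratio, max_slots))

# 100 total slots, ratio 0.2, cap 15 -> the cap wins
print(effective_compaction_slots(100, ratio=0.2, max_slots=15))  # 15
# Small cluster: 8 total slots at the default ratio still yields 1
print(effective_compaction_slots(8))  # 1
```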
Determining the Slots Required
Start by tabulating each datasource and its segmentGranularity, for example:
Datasource Name | Segment Granularity
Part            | HOUR
Suppliers       | HOUR
Partsupp        | HOUR
Supplier        | HOUR
Nation          | DAY
Lineitem        | HOUR
Orders          | HOUR
In the table above, six datasources have segmentGranularity HOUR and one has segmentGranularity DAY.
So there are (6 * 24) + 1 = 145 intervals to be compacted every day.
By default, the coordinator wakes up every 30 minutes to check for and schedule compaction. This is controlled by the coordinator runtime property druid.coordinator.period.indexingPeriod.
To keep up with a day's intervals, the cluster must be able to run at least 145 compaction tasks per day, which works out to a minimum of about 3 slots (145 tasks / 48 compaction periods ≈ 3 tasks per period). Allowing for delays in data generation and other issues, plan for enough slots to compact 1.5 to 2 days of intervals.
This evaluates to a minimum of 3 * 1.5 ≈ 4, or 3 * 2 = 6, slots.
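The arithmetic above can be reproduced with a short sketch; the datasource list and the 30-minute indexing period are taken from this example:

```python
# segmentGranularity per datasource, as in the table above
granularities = ["HOUR"] * 6 + ["DAY"] * 1
intervals_per_day = {"HOUR": 24, "DAY": 1}

# total intervals produced per day: 6*24 + 1 = 145
daily_intervals = sum(intervals_per_day[g] for g in granularities)

# coordinator compaction runs per day with a 30-minute indexingPeriod
runs_per_day = 24 * 60 // 30  # 48

# tasks per run, rounded as in the text above
min_slots = round(daily_intervals / runs_per_day)  # ~3

# headroom for late data: plan for 1.5 to 2 days of intervals
slots_with_headroom = (int(min_slots * 1.5), min_slots * 2)  # (4, 6)
print(daily_intervals, min_slots, slots_with_headroom)
```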
Now say the cluster has 5 data nodes with 5 slots each, for a total of 25 available Middle Manager slots.
With the default compactionTaskSlotRatio=0.1, this translates to 25 * 10% ≈ 2 slots available for compaction.
As seen above, at least 4 slots are needed but only 2 are available. Hence, for compaction to catch up, compactionTaskSlotRatio needs to be raised to at least 0.2.
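Working backwards from the required slot count to the ratio can be sketched with a small hypothetical helper (`ratio_needed` is not a Druid API, just illustrative arithmetic rounding the ratio up to the nearest tenth):

```python
def ratio_needed(required_slots: int, total_slots: int) -> float:
    """Smallest compactionTaskSlotRatio, rounded up to a tenth,
    that yields at least required_slots out of total_slots.
    Uses integer ceiling division to avoid float rounding issues."""
    tenths = -(-(required_slots * 10) // total_slots)
    return tenths / 10

# Example above: 4 slots needed out of 25 -> raw ratio 0.16, round up to 0.2
print(ratio_needed(4, 25))  # 0.2
```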
Other Considerations:
1. Running compaction tasks in parallel
Parallel compaction is enabled by setting maxNumConcurrentSubTasks to a value higher than 1. Continuing the example, if the Lineitem and Partsupp datasources are compacted with maxNumConcurrentSubTasks set to 4, they need 6 additional slots (beyond the 4 calculated so far) whenever a compaction task runs for these datasources. The overall minimum number of slots required for compaction in the example then becomes 4 + 6 = 10, so compactionTaskSlotRatio has to be increased further.
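The extra slot demand from parallelism can be sketched as below, following the counting in the example above (each parallel datasource consumes maxNumConcurrentSubTasks slots instead of 1 while its compaction runs; the supervisor task's own slot is not modeled separately here):

```python
def extra_slots_for_parallelism(parallel_datasources: int,
                                max_num_concurrent_sub_tasks: int) -> int:
    """Additional slots needed when some datasources compact in
    parallel: each goes from 1 slot to max_num_concurrent_sub_tasks
    slots while its compaction task runs."""
    return parallel_datasources * (max_num_concurrent_sub_tasks - 1)

# Lineitem and Partsupp with maxNumConcurrentSubTasks=4 -> 6 extra slots
print(extra_slots_for_parallelism(2, 4))  # 6
```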
2. Time taken by the compaction tasks to complete
If compaction tasks complete within druid.coordinator.period.indexingPeriod, the coordinator has all of its compaction slots available again when it schedules the next set of tasks; otherwise, fewer than the calculated number are free. Compaction task run times therefore also need to be analyzed to arrive at the optimal number of slots for compaction.
If the compaction jobs are quick (for example, 5 minutes), there is no harm in setting the slot ratio higher than calculated; the slots are used only when needed.
3. Slots for Supervisors and other tasks
Also carefully evaluate the minimum number of slots required by the tasks started by supervisors (Kafka or Kinesis) and by other batch ingestion. Although a compaction task loses its locks when an ingestion task needs one, ingestion tasks cannot preempt slots already held by a compaction task that holds an appropriate lock.
Conclusion
One of the first options to consider is whether the segments can be generated optimally at ingestion time. If that isn't possible, compaction is required. In a cluster with several datasources and supervisor jobs, a careful analysis of the number of slots required for compaction, as discussed above, is necessary.
References
KB Article: Everything you need to know about Auto Compaction
Apache Druid Documentation: Compaction & Re-indexing