Inconsistent number of rows on the Druid console after a re-indexing job

Hemanth

Updated February 20, 2023 22:31

OBJECTIVE
When we run a re-indexing task for a data source and replace the existing data, the number of rows (Total rows) shown on the Druid console -> Data source page can be inconsistent with the number of rows the data source actually contains.

PROCEDURE
We encountered this issue in the following scenario, and the steps below can be used to reproduce it.

For the Wikipedia data set, run a batch ingestion without roll-up and with segmentGranularity set to MONTH.
After this ingestion, the Druid console shows the total rows and the number of segments for the data source. In this example, it shows 24,433 rows.
Next, run a re-indexing job for this data source that replaces the data, with segmentGranularity set to HOUR.
The re-indexing task can be started from Load data in the console, or by submitting an ingestion spec.
In this spec, set "appendToExisting": false. The task then tries to drop the existing segment and load the new segments, which have HOUR granularity.
After the re-indexing task completes, the console shows double the total number of rows: 48,866.
When we query the Druid data source for the total number of rows, we still see 24,433.
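
As a rough sketch of the re-indexing step, the relevant parts of a native batch spec might look like the fragment below. The data source name wikipedia and the interval 2015-09-01/2015-10-01 are assumptions for illustration, and the fragment is abbreviated (timestampSpec, dimensionsSpec, and tuningConfig are omitted), so it is not submittable as written.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "rollup": false,
        "intervals": ["2015-09-01/2015-10-01"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "wikipedia",
        "interval": "2015-09-01/2015-10-01"
      },
      "appendToExisting": false
    }
  }
}
```

The "druid" input source reads the existing segments of the data source, and "appendToExisting": false asks the task to replace data rather than add to it.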

So the number of rows shown on the console is inconsistent with the result of count(*) for the same data set.
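
To check the two figures side by side, the queries can be sent as JSON payloads in a POST request to the Druid SQL API endpoint /druid/v2/sql. The payloads below are a minimal sketch: the data source name wikipedia is an assumption, and the second query is only one possible way to see which segments Druid still knows about for the data source.

```json
{
  "query": "SELECT COUNT(*) AS total_rows FROM \"wikipedia\""
}
```

```json
{
  "query": "SELECT \"start\", \"end\", \"num_rows\", \"is_overshadowed\" FROM sys.segments WHERE \"datasource\" = 'wikipedia' ORDER BY \"start\""
}
```

If the second query still lists the MONTH-granularity segment alongside the HOUR-granularity segments, that would be consistent with the doubled total on the console.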

To resolve this issue, add the parameter "dropExisting": true to the ingestion spec along with "appendToExisting": false. The MONTH granularity segment is then overshadowed, and only the HOUR granularity segments are shown.
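
A hedged sketch of the adjusted fragment is shown below. As before, the data source name and interval are assumptions for illustration and the rest of the spec is omitted. Note that, per the Druid native batch documentation, dropExisting only takes effect when appendToExisting is false and the granularitySpec specifies intervals.

```json
{
  "dataSchema": {
    "dataSource": "wikipedia",
    "granularitySpec": {
      "segmentGranularity": "HOUR",
      "rollup": false,
      "intervals": ["2015-09-01/2015-10-01"]
    }
  },
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "druid",
      "dataSource": "wikipedia",
      "interval": "2015-09-01/2015-10-01"
    },
    "appendToExisting": false,
    "dropExisting": true
  }
}
```

With this setting, existing segments fully contained in the specified interval are replaced when the task publishes the new segments, so choose the interval carefully to avoid dropping data that the task does not rewrite.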