When we run a re-indexing task for a data source and replace the existing data, the number of rows (Total rows) shown on the Druid console -> Data source page can be incorrect.
We have encountered this issue in the following scenario, and the steps below can be used to reproduce it.
- For the Wikipedia data set, run a batch ingestion without roll-up and set segmentGranularity to MONTH.
- After this ingestion, the Druid console shows the total rows and the number of segments. The total is 24,433 rows.
- Now run a re-indexing job for this data source and replace the data with segmentGranularity set to HOUR.
- The re-indexing task can be run from Load data in the console, or by submitting the ingestion spec directly.
- In this spec, set "appendToExisting": false. This tries to replace the existing segment with the new segments, which have HOUR granularity.
- After the re-indexing task completes, the console shows double the total number of rows: 48,866.
- When we query the Druid data source for the total number of rows, we get 24,433.
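So the number of rows shown on the console (48,866) is inconsistent with the count(*) result (24,433) for the same data set. The console figure is exactly twice the real row count, which suggests that the rows of the old MONTH granularity segment are still being counted along with the new HOUR granularity segments.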
To resolve this issue, add the parameter "dropExisting": true to the ingestion spec along with "appendToExisting": false. With both set, the MONTH granularity segment is replaced, and the console shows only the HOUR granularity segments.
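
As a rough illustration of the fix, the sketch below builds a re-indexing spec as a Python dict and prints it as JSON. It is not the exact spec used in this scenario. The helper name `build_reindex_spec`, the interval `2016-06-01/2016-07-01`, and the dimension list are assumptions made for the example, so replace them with the values for your own data source. It also assumes a native batch (`index_parallel`) task that reads from the existing data source through the `druid` input source, and it sets the interval in `granularitySpec`, which `dropExisting` is understood to need in order to know which segments to replace.

```python
import json


def build_reindex_spec(datasource, interval, dimensions, drop_existing=True):
    """Return a native batch re-indexing spec as a plain dict.

    Hypothetical helper for illustration only; it is not part of Druid.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Re-indexing reads the stored __time column of the source segments.
                "timestampSpec": {"column": "__time", "format": "millis"},
                # Assumption: list the dimensions of your own data source here.
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,
                    # The interval whose existing segments should be replaced.
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
                # Replace rather than append, and drop the old MONTH segment.
                "appendToExisting": False,
                "dropExisting": drop_existing,
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


# Assumed values: adjust the data source name, interval, and dimensions.
spec = build_reindex_spec(
    "wikipedia",
    "2016-06-01/2016-07-01",
    ["channel", "page", "user"],
)
print(json.dumps(spec, indent=2))
```

The printed JSON can be pasted into the console's task submission dialog or posted to the Overlord task endpoint. After the task finishes, compare Total rows on the console with the count(*) result again to confirm that they match.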