Getting exact count(distinct) results, using druid SQL

Ben

Updated February 01, 2021 20:02

OBJECTIVE:

To get exact count(distinct) results using druid SQL.

ADDITIONAL INFORMATION:

It's very difficult to get exact count(distinct) results in distributed databases, including druid (and usually not the best idea, anyway). But, it is possible (with performance overhead, of course). It can be most easily done using druid SQL, rather than native druid queries.

PROCEDURE:

If you are using Druid Native queries you can not easily compute exact count distinct in a straightforward manner - you would have to use subquery tricks. In Druid SQL you can have exact count distinct (under the hood the SQL planner does the tricks for you). If you are using Druid SQL you need to set the `useApproximateCountDistinct: false` property in the Druid context for (every) query to force the SQL planner to not use the approximate shortcut.

Pivot can query Druid via the native queries or via SQL. (If you are for some reason unsure about how Pivot is querying Druid you can go to "Monitor queries" on your data cube and see if the queries look like SQL or if they look like JOSN objects - native).

If Pivot is using native mode then you can set `"exactResultsOnly": true` in the data cube options. This will prevent approximate results from ever being shown. In the case of count distinct it will effectively cripple it. This is because it is not possible (without elaborate tricks) to get count distinct out of native queries and Pivot will not do those "tricks" for you.

If Pivot is using SQL mode then `"exactResultsOnly": true` would be ignored, but you can set `useApproximateCountDistinct: false` and/or `useApproximateTopN: false` in the Druid context itself. So you can put `{ "druidContext": { "useApproximateCountDistinct": false`, "useApproximateTopN": false }}` in the data cube options. (Note this does not go in the top level). If in the future `"exactResultsOnly": true` would be supported in SQL mode it would equate to setting those to properties for you.

Lastly, Pivot will use the native mode by default. To get Pivot to use SQL mode you can either switch the connection type to be `druidsql` in the settings or to use the new Pivot SQL (alpha) functionality that lets you also define dimension and measures with SQL also. If choosing to use Pivot SQL make sure to use a minimum version of 3.4.3, as there have been a number of usability fixes. (Pivot SQL is an alpha feature and is undergoing rapid development.)