Note: this method is based on running Druid queries across your data, and requires that you have ingested your data at least once already.
Rollup is a feature that can dramatically reduce the size of data stored in Druid by summarizing raw data at ingestion time. It can potentially reduce the number of rows stored, and therefore cluster cost, by a factor of ten or more. This article shows you how to estimate the benefit of rollup for your dataset without needing to re-ingest it.
To estimate the cluster size you will need, first estimate the number of rows your tables will contain after rollup, using a cardinality aggregation query in "byRow" mode. For example:
(Replace the datasource, interval, granularity, and fields with values appropriate for your data.)

```json
{
  "queryType": "timeseries",
  "dataSource": "your_datasource",
  "intervals": "2020-01-01/2020-02-01",
  "granularity": "hour",
  "aggregations": [
    { "type": "count", "name": "count" },
    {
      "type": "cardinality",
      "name": "estimatedNewCount",
      "fields": ["channel", "page"],
      "byRow": true
    }
  ]
}
```
With this query, you can compare the ratio between "count" (the number of rows currently in the datasource) and "estimatedNewCount" (the estimated number of rows if you retain only certain fields at a certain time granularity). Set "granularity" to the time granularity you want to retain, and "fields" to the dimensions you want to retain. Do not list metrics in this query.
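To make the comparison concrete, here is a minimal Python sketch of computing the rollup ratio from the timeseries response. The bucket values below are hypothetical; with "granularity" set to the candidate time granularity, summing across buckets gives totals for the whole interval.

```python
# Minimal sketch: estimate the rollup ratio from a Druid timeseries
# response. The bucket values below are hypothetical illustrations.
response = [
    {"timestamp": "2020-01-01T00:00:00.000Z",
     "result": {"count": 4_100_000, "estimatedNewCount": 310_000.0}},
    {"timestamp": "2020-01-01T01:00:00.000Z",
     "result": {"count": 3_900_000, "estimatedNewCount": 290_000.0}},
]

# Sum across time buckets to get totals for the whole interval.
current_rows = sum(b["result"]["count"] for b in response)
estimated_rows = sum(b["result"]["estimatedNewCount"] for b in response)
ratio = current_rows / estimated_rows

print(f"current rows:   {current_rows}")
print(f"estimated rows: {estimated_rows:.0f}")
print(f"rollup ratio:   {ratio:.1f}x")
```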
The result gives you a clear comparison between the current row count and the estimated new row count with fewer retained fields, at the proposed time storage granularity.
You may find significant savings from avoiding high-cardinality fields and fields that are uncorrelated with the rest of your data. Try various combinations of "fields" to get a sense of the potential savings you can achieve. When you have chosen a set you want to go with, you can reload your data with a matching rollup configuration.
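One way to try various combinations is to generate an estimation query per candidate dimension subset and submit each to the Broker. This is a sketch under assumptions: the helper name and the datasource, interval, and granularity values are illustrative, not part of Druid's API.

```python
from itertools import combinations

# Sketch: build one rollup-estimation query per candidate dimension
# subset. Helper name and the example values are illustrative only.
def build_rollup_estimate_query(datasource, interval, granularity, fields):
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "intervals": interval,
        "granularity": granularity,
        "aggregations": [
            {"type": "count", "name": "count"},
            {"type": "cardinality", "name": "estimatedNewCount",
             "fields": list(fields), "byRow": True},
        ],
    }

candidates = ["channel", "page", "user"]
queries = [
    build_rollup_estimate_query("your_datasource",
                                "2020-01-01/2020-02-01", "hour", combo)
    for n in range(1, len(candidates) + 1)
    for combo in combinations(candidates, n)
]
print(f"built {len(queries)} candidate queries")
```

Each generated query body can then be POSTed to the Broker, and the resulting ratios compared to pick the dimension set worth keeping.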
One common pattern here is to store two copies of your dataset in Druid: one with all fields, and one with a commonly-used subset of fields designed to take advantage of rollup. You can then query the appropriate datasource for each use case. If a use case can be satisfied by the rolled-up dataset, you can achieve a substantial improvement in price/performance by keeping the rolled-up datasource in memory and the full dataset on disk.
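For the rolled-up copy, the ingestion spec must match the combination you settled on. A sketch of the relevant spec fragment follows; the granularities and dimension list are illustrative and should mirror the "granularity" and "fields" you chose in the estimation query.

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "day",
    "queryGranularity": "hour",
    "rollup": true
  },
  "dimensionsSpec": {
    "dimensions": ["channel", "page"]
  }
}
```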