OBJECTIVE:
Hadoop components are rack-aware, and HDFS has a default replication factor of 3. For example, HDFS block placement uses rack awareness for fault tolerance, placing block replicas on different racks so that data remains available in the event of a network switch failure or partition within the cluster.
Is the same possible in Druid?
ANSWER:
There is no built-in rack awareness in a default Imply cluster setup, but an equivalent can be achieved using Coordinator retention (load) rules and Historical tiers.
This would involve splitting the Historical nodes across the two racks and defining each rack as a separate tier, for example __rack1_tier and __rack2_tier. (Druid's default replication factor is 2.)
Then, when defining the replicas for a given datasource in its retention rules, ensure that one replica goes to __rack1_tier and the other to __rack2_tier, as sketched below.
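As a rough sketch: once the Historicals in each rack have been assigned to their tier via druid.server.tier in their runtime.properties, a load rule with one replicant per tier can be posted to the Coordinator's rules API. The Coordinator URL and datasource name below are hypothetical placeholders.

```python
import json
import requests

# Hypothetical values; substitute your Coordinator address and datasource name.
COORDINATOR = "http://coordinator.example.com:8081"
DATASOURCE = "my_datasource"

# Keep one replica of every segment in each rack's tier, so either rack
# can still serve all data if the other rack becomes unreachable.
rules = [
    {
        "type": "loadForever",
        "tieredReplicants": {
            "__rack1_tier": 1,
            "__rack2_tier": 1,
        },
    }
]

response = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
    data=json.dumps(rules),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
print(f"Updated load rules for {DATASOURCE}")
```

The same per-datasource retention rules can also be edited interactively from the Druid web console.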
In general, though, rack awareness is less critical for Druid than for Hadoop, since segment data is persisted in deep storage: data (Historical) nodes can be recreated from deep storage and the metadata database if they are lost.
If Hadoop (HDFS) is used as deep storage, rack awareness can additionally be enabled at the HDFS level.
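For reference, HDFS rack awareness is typically enabled by pointing the net.topology.script.file.name property in core-site.xml at a script that maps DataNode hosts or IPs to rack IDs. A minimal sketch of such a topology script (the address-to-rack mapping below is made up for illustration) might look like:

```python
#!/usr/bin/env python3
"""Toy HDFS topology script: prints one rack ID per host/IP argument.

Referenced from core-site.xml via net.topology.script.file.name; the
address-to-rack mapping below is illustrative only.
"""
import sys

RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
    "10.0.2.12": "/rack2",
}

for host in sys.argv[1:]:
    # Hosts not in the map fall back to a default rack.
    print(RACKS.get(host, "/default-rack"))
```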
REFERENCES:
Imply KB Article: Setting up Druid tiering
Apache Hadoop Documentation: Hadoop Rack Awareness