Integration with Hadoop

Gian

Updated May 29, 2018 22:24

Druid can interact with Hadoop in two ways:

Use HDFS for deep storage using the druid-hdfs-storage extension.
Batch-load data from Hadoop using Map/Reduce jobs.

These are not necessarily linked together; you can load data with Hadoop jobs into a non-HDFS deep storage (like S3), and you can use HDFS for deep storage even if you're loading data from streams rather than using Hadoop jobs.

If you use Hadoop Map/Reduce jobs to load data, then these jobs will scan through your raw data and produce optimized Druid data segments in your configured deep storage. The data will then be loaded by Druid Historical Nodes. Once loading is complete, Hadoop and YARN are not involved in the query path of Druid in any way.

The main advantage of loading data using Hadoop is that it automatically parallelizes the batch data loading process, it uses YARN resources instead of using your Druid machines (leaving your Druid machines free to handle queries), and it can leverage data that resides in your existing Hadoop cluster.

For more information, see our documentation at: https://docs.imply.io/on-premise/manage-data/ingestion-files#hadoop-based-ingestion

Working with specific distributions

Sometimes, specific configurations are needed to integrate Druid with specific Hadoop distributions. These are generally needed when the versions of common Java dependencies (like Jackson, Guava, and so on) differ between your version of Druid and your version of Hadoop.

The Druid documentation has some general tips that apply to many distributions. See here for details: http://druid.io/docs/latest/operations/other-hadoop.html. In particular, three go-to tips include:

Place Hadoop XMLs on Druid's classpath.
Classloader modification on Hadoop, using "mapreduce.job.classloader" or "mapreduce.job.user.classpath.first". Additionally, "mapreduce.job.classloader.system.classes" can be useful to customize class loading even further. The purpose of all of these settings is to control which versions of which dependencies are used for which code paths.
Use specific versions of Hadoop libraries. Druid is bundled with Apache Hadoop, but you can install libraries specific to your distribution and have Druid use those instead.

We have collected some additional tips here for working with specific Hadoop distributions.

Amazon EMR (Elastic MapReduce)

Imply Cloud supports integration with Amazon EMR. For a setup guide, see: https://docs.imply.io/cloud/manage-data/emr

Apache Hadoop

Place your Hadoop XMLs on Druid's classpath.
Set "mapreduce.job.classloader": true in your Druid jobProperties.

Replacing libraries is generally not necessary, since Druid is built against Apache Hadoop by default.