This article outlines the steps involved in ingesting data in batch mode from Qubole Hadoop into Imply Cloud. Integrating Qubole with Imply and Druid takes just a few steps.
Instructions:
1. All Imply clusters use Java 8 by default at the time of authoring this document. Please make sure your Qubole cluster uses Java 8 as well. If it is on Java 7, you can switch to Java 8 using the following bootstrap script:
Bootstrap script:
#!/bin/bash
source /usr/lib/hustler/bin/qubole-bash-lib.sh

function restart_master_services() {
    monit unmonitor namenode
    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh stop timelineserver' yarn
    sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh stop   # as root user
    monit monitor namenode
}

function restart_worker_services() {
    # Truncated in the original; add restarts for any worker-side services here.
    :
}

function use_java8() {
    is_master=$(nodeinfo is_master)
    # Dispatch reconstructed from context: restart the services matching this node's role.
    if [[ "$is_master" == "1" ]]; then
        restart_master_services
    else
        restart_worker_services
    fi
}

use_java8
Restart the cluster for the Java 8 change to take effect.
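To confirm the switch, you can run a quick check on a cluster node. A minimal sketch, assuming the java binary on the PATH is the one the cluster services use:

java -version 2>&1 | head -n 1   # should report a 1.8.x version
echo "$JAVA_HOME"                # should point at a Java 8 install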
2. Get the master node IP address/hostname from your Qubole cluster. You can find it at: https://us.qubole.com/clusters#/details/<cluster_id>/instances?tab=Nodes (replace <cluster_id> with your cluster ID).
You can use the private IP, but make sure that traffic can flow from the Qubole cluster to the Imply cluster. Please contact your network administrator for help if needed.
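Before moving on, it helps to confirm that the Imply cluster can actually reach the Qubole master. A minimal sketch, run from a host on the Imply side, assuming the default NameNode port 9000 (used in step 3) and the default YARN ResourceManager port 8032:

MASTER_IP=<Qubole_cluster_master_node_IP>   # the IP from the Nodes tab above

nc -zv "$MASTER_IP" 9000   # NameNode RPC port (matches hadoop.fs.defaultFS below)
nc -zv "$MASTER_IP" 8032   # YARN ResourceManager port; adjust if your cluster differs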
3. Next, configure the Imply Middle Manager to talk to the Hadoop master for data processing. Go to the Imply clusters page, select the desired cluster, click Manage, click Setup, scroll down, and click Advanced Config.
Find the Middle Manager advanced configuration box and add the following entries for your Qubole cluster:
hadoop.fs.defaultFS=hdfs://<Qubole_cluster_master_node_IP>:9000
hadoop.yarn.resourcemanager.hostname=<Qubole_cluster_master_node_IP>
hadoop.yarn.application.classpath=/usr/lib/hadoop2/etc/hadoop,/usr/lib/hadoop2/*,/usr/lib/hadoop2/lib/*,/usr/lib/hadoop2/share/hadoop/common/*,/usr/lib/hadoop2/share/hadoop/common/lib/*,/usr/lib/hadoop2/share/hadoop/hdfs/*
hadoop.fs.s3n.awsAccessKeyId=<access_key>
hadoop.fs.s3n.awsSecretAccessKey=<secret_key>
Restart the Imply cluster for the added entries to take effect.
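If ingestion tasks later fail to connect, a quick sanity check on the Qubole master node confirms the NameNode is listening on the configured port. A sketch, assuming netstat and the hdfs CLI are available on the node:

sudo netstat -tlnp | grep ':9000'     # NameNode RPC port should be listening
hdfs dfsadmin -report | head -n 20    # basic HDFS health summary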
4. Include the following job properties in your ingestion specification to avoid block and multipart size errors:
Tuning Config:
"tuningConfig": { "type": "hadoop", "jobProperties": { "mapreduce.job.user.classpath.first": "true", "fs.s3a.readahead.range": "65536", "fs.s3a.multipart.size": "104857600", "fs.s3a.block.size": "33554432" } } |
These four steps establish communication between the Qubole Hadoop and Imply clusters; after aggregation, ingested data is written directly to deep storage.