This article outlines the steps involved in ingesting data in batch mode from Qubole Hadoop into Imply Cloud. Integrating Qubole with Imply and Druid takes just a few steps.
Instructions:
1. All Imply clusters use Java 8 by default at the time of authoring this document. Please make sure your Qubole cluster uses Java 8 as well. If it is on Java 7, you can switch to Java 8 using the following bootstrap script:
Bootstrap script:
#!/bin/bash
source /usr/lib/hustler/bin/qubole-bash-lib.sh

function restart_master_services() {
    monit unmonitor namenode
    /bin/su -s /bin/bash -c '/usr/lib/hadoop2/sbin/yarn-daemon.sh stop timelineserver' yarn
    sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh stop   # as root user
    monit monitor namenode
}

function restart_worker_services() {
    # Truncated in the original; add restarts for any worker-side services here.
    :
}

function use_java8() {
    is_master=$(nodeinfo is_master)
    # Dispatch reconstructed from context: restart the services matching this node's role.
    if [[ "$is_master" == "1" ]]; then
        restart_master_services
    else
        restart_worker_services
    fi
}

use_java8
Restart the cluster for the Java 8 change to take effect.
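To confirm the switch, you can run a quick check on a cluster node. A minimal sketch, assuming the java binary on the PATH is the one the cluster services use:

java -version 2>&1 | head -n 1   # should report a 1.8.x version
echo "$JAVA_HOME"                # should point at a Java 8 install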
2. Get the master node IP address/hostname from your Qubole cluster. You can find it at: https://us.qubole.com/clusters#/details/<cluster_id>/instances?tab=Nodes (replace <cluster_id> with your cluster ID).
You can use the private IP, but make sure that traffic can flow from the Qubole cluster to the Imply cluster. Please contact your network administrator for help if needed.
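Before moving on, it helps to confirm that the Imply cluster can actually reach the Qubole master. A minimal sketch, run from a host on the Imply side, assuming the default NameNode port 9000 (used in step 3) and the default YARN ResourceManager port 8032:

MASTER_IP=<Qubole_cluster_master_node_IP>   # the IP from the Nodes tab above

nc -zv "$MASTER_IP" 9000   # NameNode RPC port (matches hadoop.fs.defaultFS below)
nc -zv "$MASTER_IP" 8032   # YARN ResourceManager port; adjust if your cluster differs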
3. Next, configure the Imply Middle Manager to talk to the Hadoop master for data processing. Go to the Imply clusters page, select the desired cluster, click Manage, click Setup, scroll down, and click Advanced Config.
Find the Middle Manager advanced configuration box and add the following entries for your Qubole cluster:
hadoop.fs.defaultFS=hdfs://<Qubole_cluster_master_node_IP>:9000
hadoop.yarn.resourcemanager.hostname=<Qubole_cluster_master_node_IP>
hadoop.yarn.application.classpath=/usr/lib/hadoop2/etc/hadoop,/usr/lib/hadoop2/*,/usr/lib/hadoop2/lib/*,/usr/lib/hadoop2/share/hadoop/common/*,/usr/lib/hadoop2/share/hadoop/common/lib/*,/usr/lib/hadoop2/share/hadoop/hdfs/*
hadoop.fs.s3n.awsAccessKeyId=<access_key>
hadoop.fs.s3n.awsSecretAccessKey=<secret_key>
Restart the Imply cluster for the added entries to take effect.
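If ingestion tasks later fail to connect, a quick sanity check on the Qubole master node confirms the NameNode is listening on the configured port. A sketch, assuming netstat and the hdfs CLI are available on the node:

sudo netstat -tlnp | grep ':9000'     # NameNode RPC port should be listening
hdfs dfsadmin -report | head -n 20    # basic HDFS health summary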
4. Include the following job properties in your ingestion specification to avoid block and multipart size errors:
Tuning Config:
"tuningConfig": { "type": "hadoop", "jobProperties": { "mapreduce.job.user.classpath.first": "true", "fs.s3a.readahead.range": "65536", "fs.s3a.multipart.size": "104857600", "fs.s3a.block.size": "33554432" } } |
These four steps establish communication between the Qubole Hadoop and Imply clusters; after aggregation, ingested data is written directly to deep storage.