Amazon EMR (Elastic MapReduce) provides a managed Hadoop framework that makes it easy to process vast amounts of data across dynamically scalable Amazon EC2 instances. Often, you may need to perform certain tasks across all data nodes in your EMR cluster, such as installing software or configuring settings. This guide will walk you through the steps to run a script on all data nodes in an EMR cluster after Hadoop has been installed.
Understanding the Difference: Init Script vs. Running Scripts Post-Setup
Before we dive into the steps, it’s important to understand the difference between using an initialization (init) script and running scripts post-setup:
Init Script: An init script (called a bootstrap action on EMR) is specified when launching the cluster and runs before Hadoop and the other applications are installed and configured. This is useful for tasks that must be completed before the Hadoop ecosystem is set up, such as installing system-level packages or changing kernel parameters (see the CLI sketch after this list).
Running Scripts Post-Setup: Running scripts post-setup, as described in this guide, occurs after the cluster and Hadoop have been initialized. This approach is useful for tasks that depend on Hadoop or other applications being fully installed and configured. For example, installing user-level applications or performing configuration tasks that require Hadoop services to be up and running.
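For comparison, here is a minimal sketch of launching a cluster with a bootstrap action via the AWS CLI. The cluster name, release label, instance settings, key pair, and S3 paths are all placeholders you would replace with your own values:
aws emr create-cluster \
    --name "my-cluster" \
    --release-label emr-6.15.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=your-key-pair \
    --bootstrap-actions Path=s3://your-bucket/bootstrap.sh,Name="PreHadoopSetup"
EMR runs bootstrap.sh on every node as it boots, before any application is installed; everything after this point in the guide happens on the other side of that line.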
Step 1: Create the Script
The first step is to create the script you want to execute on all data nodes. It can be written in any language available on the nodes, such as Bash, Python, or Perl. For this example, we will use a Bash script that waits for its node to finish provisioning and then installs the nano text editor.
Here is an example script:
#!/bin/bash
# Wait until the cluster is ready
while [[ $(sed '/localInstance {/{:1; /}/!{N; b1}; /nodeProvision/p}; d' /emr/instance-controller/lib/info/job-flow-state.txt | sed '/nodeProvisionCheckinRecord {/{:1; /}/!{N; b1}; /status/p}; d' | awk '/SUCCESSFUL/' | xargs) != "status: SUCCESSFUL" ]]; do
    sleep 1
done
echo "Cluster is Ready!!"
# Install nano text editor
sudo yum install -y nano
exit 0
This script polls the instance controller's job-flow state file until the node's provisioning status reads SUCCESSFUL, and only then installs nano.
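Before shipping the script anywhere, you can optionally ask Bash to check its syntax without executing anything:
bash -n install-nano.sh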
Step 2: Copy the Script to the Master Node
After creating your script, the next step is to copy it to the master node of your EMR cluster. You can use the scp (secure copy) command to transfer the file securely. Replace your-key-pair.pem with your actual key pair file and master-node-dns with the public DNS name of your master node:
scp -i your-key-pair.pem install-nano.sh hadoop@master-node-dns:/home/hadoop/
Alternatively, you can upload the script to an S3 bucket and download it from there to the master node:
aws s3 cp install-nano.sh s3://your-bucket/install-nano.sh
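If you take the S3 route, download the script once you are on the master node (see the next step for connecting); the bucket name is again a placeholder:
aws s3 cp s3://your-bucket/install-nano.sh /home/hadoop/install-nano.sh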
Step 3: Connect to the Master Node
To connect to the master node, use the SSH command. This requires your key pair file and the DNS of your master node:
ssh -i your-key-pair.pem hadoop@master-node-dns
Once connected, navigate to the home directory where you copied the script:
cd /home/hadoop/
Step 4: Copy the Script to HDFS
Now, copy the script from the master node to the Hadoop Distributed File System (HDFS). This makes the script accessible to all nodes in the cluster:
hadoop fs -put install-nano.sh /user/hadoop/
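You can confirm the upload by listing the file in HDFS:
hadoop fs -ls /user/hadoop/install-nano.sh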
Step 5: Run the Script on All Data Nodes
To execute the script across all data nodes, we will submit a Hadoop Streaming job to YARN (Yet Another Resource Negotiator). We create a small wrapper script that downloads and runs our main script on each node. Here’s how to do it:
First, create a wrapper script, run-on-nodes.sh:
#!/bin/bash
# Consume the map input so the streaming framework does not see a broken pipe
cat > /dev/null
# Download the main script from HDFS into the task's working directory
hadoop fs -copyToLocal /user/hadoop/install-nano.sh .
# Make the script executable
chmod +x install-nano.sh
# Run the script in the foreground so YARN keeps the container alive until it finishes
./install-nano.sh
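Hadoop Streaming needs an input path, and its output path must not already exist. Here is a minimal sketch for preparing the input, assuming a cluster with three data nodes (adjust the line count to match yours); each line of the dummy file will become one map task once paired with NLineInputFormat below:
# One line per data node; each line becomes one map task
seq 1 3 > dummy-input.txt
hadoop fs -mkdir -p /user/hadoop/input
hadoop fs -put dummy-input.txt /user/hadoop/input/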
Next, submit this wrapper script as a YARN job:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files run-on-nodes.sh \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper "bash run-on-nodes.sh" \
    -numReduceTasks 0
This command uses Hadoop Streaming to run the wrapper as the mapper: NLineInputFormat turns each line of the dummy input into its own map task, and YARN spreads those tasks across the data nodes. Note that YARN does not strictly guarantee one task per node, so use at least as many input lines as you have nodes.
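If you need to re-run the job, delete the output directory first, since Hadoop Streaming refuses to overwrite an existing one:
hadoop fs -rm -r /user/hadoop/output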
Conclusion
By following these steps, you can efficiently run a script on all data nodes in your Amazon EMR cluster. This method is useful for automating various administrative tasks, such as software installation and configuration. Unlike init scripts, which run before the cluster is fully set up, this approach allows you to perform tasks that depend on the Hadoop ecosystem being operational.