Creating a Raspberry Pi cluster

First, let us get the why out of the way.

Why would you want to do it?

I can’t speak for you, but I like tinkering with things. Lately, a lot of the Kaggle competitions appear to involve relatively large datasets, and I was starting to find them difficult to handle with Pandas. I initially tried generating a database from the data and then going through it in chunks. It wasn’t pretty, and it wasn’t efficient. This convinced me that I should look into this Big Data business. Maybe Big Data had something going for it after all.

So I started learning Hadoop and Spark on Coursera. Although a virtual machine on a single node was the easy way out, I was still interested in performance. I wanted to see where the bottlenecks would be if we were to run things on a cluster, so I wanted to create my own. Raspberry Pi’s just happen to be the cheapest option for this.

The Hardware

  • Raspberry Pi 3 model B × 3
  • microSD cards, 64 GB each
  • A multi-port USB charger (so you don’t need a separate charger for every Pi)
  • USB cables to connect the charger to each Pi
  • A microSD card reader/writer for installing the operating system
  • A wireless router (I used the one in the office)
  • Keyboard, mouse, and a monitor for the initial configuration of the Pi’s (I found some lying around in the office and commandeered them for a few hours)

Install the OS and Some Initial Configuration

This part is going to show you how to install the operating system and set up secure shell access. Once that is done, there is no need for the monitor, mouse, and keyboard, because you can simply ssh into the system afterwards.

Prepare the media and install the OS

1. Downloading the OS files

The OS files may be downloaded from the Raspberry Pi site (reference 1 below). The Raspberry Pi Foundation has an excellent set of documentation for installing the OS on the SD cards (reference 2), and Mac-specific instructions are also available (reference 3). What follows is what worked for me.

2. Finding the name of the microSD card

To determine the name of the microSD card, use the df -h command before and after the memory card has been plugged in, as shown below:

$ df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 48Gi 417Gi 11% 858677 4294108602 0% /
devfs 220Ki 220Ki 0Bi 100% 760 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home

After the memory card is plugged in,

$ df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 48Gi 417Gi 11% 858681 4294108598 0% /
devfs 222Ki 222Ki 0Bi 100% 768 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
/dev/disk2s1 60Gi 8.9Mi 60Gi 1% 71 488121 0% /Volumes/Untitled

the extra disk /dev/disk2s1 appears. Now unmount the disk and make sure that it is really unmounted. Do not disconnect the microSD at this point: you still want to write the image to it.

$ sudo diskutil umount /dev/disk2s1
Password:
Volume (null) on disk2s1 unmounted
$ df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 48Gi 417Gi 11% 858748 4294108531 0% /
devfs 222Ki 222Ki 0Bi 100% 768 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
$

3. Write the disk image to the microSD card

Now we are going to write the disk image to the card. We need to address the whole disk rather than the mounted partition, and we use the raw device (rdisk) rather than the buffered one, which makes the write considerably faster. The nomenclature is as follows:

  • The mounted partition is shown as: /dev/disk<disk #>s<partition #> (e.g. /dev/disk2s1)
  • Change the name to /dev/rdisk<disk #> (e.g. /dev/rdisk2)

Now, copy the disk image into the disk using the dd command.

$ sudo dd bs=1m if=2017-01-11-raspbian-jessie.img of=/dev/rdisk2
Password:
4169+0 records in
4169+0 records out
4371513344 bytes transferred in 184.578985 secs (23683700 bytes/sec)
$ df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 49Gi 416Gi 11% 858777 4294108502 0% /
devfs 223Ki 223Ki 0Bi 100% 772 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
/dev/disk2s1 62Mi 20Mi 42Mi 33% 0 0 100% /Volumes/boot

As you can see, the card now mounts as a boot disk. Notice, however, that the reported size has shrunk dramatically, from 60Gi to 62Mi: only the small boot partition is visible for now. We shall take care of this later by expanding the filesystem on the Pi itself.

4. Eject the microSD card

sudo diskutil eject /dev/rdisk2

Repeat the process for all the cards that you have bought.

5. Summary

The commands required for this are the following:

df -h                              # before inserting the microSD | these two commands just check
df -h                              # after inserting the microSD  | the disk name; it should be the same for all cards
sudo diskutil umount /dev/disk2s1  # unmount the disk
sudo dd bs=1m if=2017-01-11-raspbian-jessie.img of=/dev/rdisk2  # write the disk image
sudo diskutil eject /dev/rdisk2    # eject the card

Turn on your Raspberry Pi for basic configuration

1. Fill the entire space of the microSD card

To utilize all of the disk space, run the following command on the Pi:

sudo raspi-config

This is going to bring up the following screen:


The first option is to expand the filesystem. Select this option and reboot. The result is the expanded filesystem shown below:

pi@raspberrypi:~ $ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 59G 3.6G 53G 7% /
devtmpfs 459M 0 459M 0% /dev
tmpfs 463M 0 463M 0% /dev/shm
tmpfs 463M 6.3M 457M 2% /run
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 463M 0 463M 0% /sys/fs/cgroup
/dev/mmcblk0p1 63M 21M 42M 33% /boot
tmpfs 93M 0 93M 0% /run/user/1000

You can see that /dev/root is now 59G, approximately equal to the 60G size of the microSD card.

2. Install Java

Update all packages before installing Java. Then install Java.

pi@raspberrypi:~ $ sudo apt-get update
pi@raspberrypi:~ $ sudo apt-get install oracle-java8-jdk

As it turns out, Java was already installed in the OS image that I used.
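You can verify what is already on the card before pulling anything down. A quick check (the exact version reported will depend on your image):

pi@raspberrypi:~ $ java -version                # check whether a JDK is already present
pi@raspberrypi:~ $ readlink -f /usr/bin/java    # shows where Java lives; useful for JAVA_HOME later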

3. Use static network addresses

Before Hadoop can be installed, the Raspberry Pi has to be set up to use static network addresses. The information on the site in reference 4 was really useful: I followed the instructions exactly and things worked. In summary, the following files were changed:

  1. /etc/wpa_supplicant/wpa_supplicant.conf
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=GB

network={
ssid="network name I am connecting to"
psk="network password"
key_mgmt=WPA-PSK
}

to

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=GB

network={
ssid="network name I am connecting to"
psk="network password"
proto=RSN
key_mgmt=WPA-PSK
pairwise=CCMP
auth_alg=OPEN
}

Reboot. After these changes you should still be able to connect to the router without a problem. The next step is to configure the Raspberry Pi so that it connects via a static IP.

For this, another file is changed: /etc/network/interfaces

From

## interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

iface eth0 inet manual

allow-hotplug wlan0
iface wlan0 inet manual
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

allow-hotplug wlan1
iface wlan1 inet manual
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

to

## interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto wlan0

iface lo inet loopback
iface eth0 inet dhcp

allow-hotplug wlan0
iface wlan0 inet static
address 192.168.1.155
netmask 255.255.255.0
gateway 192.168.1.1
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf
iface default inet dhcp

allow-hotplug wlan1
iface wlan1 inet manual
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

Note that we are using the static IP address 192.168.1.155 for this Raspberry Pi. It is important to change the address for the other Pi’s, so that they can individually be addressed.

Note that you can get your router IP address from your network preferences on a Mac.


Also note that the line address 192.168.1.155 will have to be different on each of your Pi’s. I have used the addresses

  • 192.168.1.155,
  • 192.168.1.156, and
  • 192.168.1.157

for my three Pi’s. You can use whatever you want.
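Once a Pi is back up, it is worth confirming that it actually took its static address. A quick check (assuming node1’s address from above):

pi@raspberrypi:~ $ hostname -I           # should print 192.168.1.155 on the first Pi
pi@raspberrypi:~ $ ip addr show wlan0    # the inet line should read 192.168.1.155/24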

4. Change the hostname (Advanced Configuration)

The advanced configuration allows you to change the hostname of the Pi. This is extremely useful, as you will then no longer need to remember the Pi addresses; you can simply use the names. What is more, when you configure Hadoop, it will be relatively easy to use these node names.

hostname    IP address
--------    -------------
node1       192.168.1.155
node2       192.168.1.156
node3       192.168.1.157
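One simple way to make these names resolve everywhere is to add them to /etc/hosts on each Pi (and, optionally, on your own machine). A minimal sketch, assuming the addresses from the table above:

# append to /etc/hosts on every Pi
192.168.1.155   node1
192.168.1.156   node2
192.168.1.157   node3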

5. Enable ssh (Advanced Configuration)

Now you should be able to access the Raspberry Pi from your computer. You no longer need to have a screen connected to the Pi; you can directly ssh in at will.

The default username is pi, and the default password is raspberry.

$ ssh pi@192.168.1.155 # ssh pi@node1 should also work
password: raspberry

Now that we can ssh into the different Pi’s, we don’t have to have a mouse, keyboard, and monitor attached to them anymore. We can set up all of the remaining configuration through the secure shell.

6. Summary

For basic configuration of the Raspberry Pi, we want to complete the following tasks:

  1. sudo raspi-config:
    1. Expand the filesystem to fill the entire microSD card
    2. Change the hostname
    3. Enable ssh
  2. Change the wireless options so as to use static IP addresses

Install Hadoop

Now we will go through the installation of Hadoop. For a multi-node cluster, it is preferable to have a separate user/usergroup allocated to Hadoop. The username will be hduser, and the usergroup will be hadoop. This user/usergroup will be present on all three Pi’s so that YARN can easily log into any of the machines and read/write files in a consistent manner.

1. Generate a usergroup and user for Hadoop

Before installing Hadoop, create the usergroup and the user:

pi@node1:~ $ sudo addgroup hadoop
[..................................messages truncated ..................................]
Done.
pi@node1:~ $ sudo adduser --ingroup hadoop hduser
[..................................messages truncated ..................................]
Enter new UNIX password:
Retype new UNIX password:
[..................................messages truncated ..................................]
Enter the new value, or press ENTER for the default
Full Name []:
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] Y
pi@node1:~ $

The passwords that I have entered are the following:

hostname    IP address       password
--------    -------------    --------
node1       192.168.1.155    pi1
node2       192.168.1.156    pi2
node3       192.168.1.157    pi3

Again, you can choose whatever you want …
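One extra step worth doing here: for the Hadoop start-up scripts to log into the nodes non-interactively later on, hduser usually needs key-based ssh between the machines. A minimal sketch, run as hduser on node1 once the user exists on all three Pi’s (the key setup itself is an assumption on my part, not something covered above):

pi@node1:~ $ su - hduser
hduser@node1:~ $ ssh-keygen -t rsa -b 2048 -N "" -f ~/.ssh/id_rsa     # key with no passphrase
hduser@node1:~ $ for node in node1 node2 node3; do ssh-copy-id hduser@$node; done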

2. Install Hadoop

This basically consists of downloading the compressed binaries, uncompressing them, and then copying them to a convenient location. Here are the steps …

# 1. Go to the home directory
cd ~/
# 2. Download the compressed binary
wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
# 3. Uncompress it in /opt/ <- a typical place to keep libraries
sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/
cd /opt
# 4. Name it something appropriate and easy to remember
sudo mv hadoop-2.7.3 hadoop
# 5. Make hduser (in group hadoop) the owner of the folder
sudo chown -R hduser:hadoop hadoop

Repeat this for all the three Pi’s.

Configure One of the Pi’s as a Single Node Cluster

Before moving ahead, we shall configure one of the Pi’s as a single-node cluster. Here are the main reasons:

  1. A single-node cluster requires practically all of the configuration that is required for the multi-node cluster.
  2. Once you have configured a single-node cluster, you can test it immediately.
  3. Finally, when the single node is working properly, you can simply copy entire configuration folders to the other nodes, and save yourself a ton of time in the configuration process.

So, let’s begin the configuration:

1. Update /etc/bash.bashrc

Insert the following into /etc/bash.bashrc:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
export HADOOP_HOME=$HADOOP_INSTALL

And make sure that the changes take effect:

source /etc/bash.bashrc; source ~/.bashrc
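A quick sanity check that the environment took effect (hadoop version is a standard Hadoop command):

pi@node1:~ $ hadoop version    # should report Hadoop 2.7.3 if PATH and HADOOP_INSTALL are right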

2. Update paths in /opt/hadoop/etc/hadoop/hadoop-env.sh

Update the Hadoop environment variables in: /opt/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native -Djava.net.preferIPv4Stack=true"
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

3. Modify the file: /opt/hadoop/etc/hadoop/yarn-site.xml

All the tutorials had things a little different. After some experimenting, information from two of the sites listed in the references worked best.

The properties that we inserted into the YARN configuration file are shown below. The last two, yarn.nodemanager.aux-services and yarn.nodemanager.aux-services.mapreduce_shuffle.class, are not mentioned in the tutorials. Those were obtained from Stack Overflow answers, and are needed for basic MapReduce programs to work: they enable the auxiliary shuffle service that moves map output to the reducers.

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>768</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>64</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>256</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
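Note that, as in the mapred-site.xml example below, these properties all live inside the file’s <configuration> … </configuration> element.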

4. Modify the file /opt/hadoop/etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx204M</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>2</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>102</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx102M</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>2</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>128</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx102M</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.job.maps</name>
    <value>2</value>
  </property>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>2</value>
  </property>
</configuration>
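The numbers above are sized for the Pi’s roughly 1 GB of RAM: yarn-site.xml hands 768 MB to the NodeManager, and every container requested here (256 MB per map, 102 MB per reduce, 128 MB for the ApplicationMaster) has to fit inside the 64–256 MB allocation window it defines. If you change one side of this budget, the other side has to follow.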

5. Modify the file /opt/hadoop/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
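Note that fs.defaultFS points at localhost for now; when this configuration is later copied out for the multi-node cluster, this value would point at the master node instead (e.g. hdfs://node1:54310).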

6. Modify the file /opt/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>5242880</value>
  </property>
</configuration>
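Two things to note here: dfs.replication is 1 because this is, for the moment, a single-node cluster, and the 5242880-byte (5 MB) block size is far below Hadoop’s default of 128 MB, which suits the small files and limited memory of a Pi.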

7. Create HDFS folders
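Creating the HDFS folders amounts to creating the hadoop.tmp.dir from core-site.xml, handing it to hduser, and formatting the namenode. A sketch, assuming the /hdfs/tmp path configured above:

pi@node1:~ $ sudo mkdir -p /hdfs/tmp               # matches hadoop.tmp.dir in core-site.xml
pi@node1:~ $ sudo chown hduser:hadoop /hdfs/tmp    # hand the folder to the Hadoop user
pi@node1:~ $ sudo chmod 750 /hdfs/tmp
pi@node1:~ $ su - hduser
hduser@node1:~ $ hdfs namenode -format             # initialize the filesystem metadata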

References

  1. Raspberry Pi OS Download
  2. Raspberry Pi OS Installation Instructions
  3. Raspberry Pi OS Installation Instructions on a Mac
  4. Configuring the Raspberry Pi for using Static Addresses
  5. Building a Pi Hadoop Cluster – 1
  6. Building a 7 node Hadoop Cluster
  7. Building a Pi Cluster with Apache Spark and YARN

 
