Cheedge Lee

Posted on • Originally published at notes-renovation.hashnode.dev

Configuring Your Home Cluster Network: A Complete Guide

It’s a renovated note.

Nowadays many of us prefer a cloud cluster over a self-managed one: less management, high availability, better security, pay-as-you-go, and all the other advantages you can think of in cloud computing. However, if you happen to own several old computers that you don’t want to sell or give away and don’t know what to do with, a home-managed cluster is a good choice. There is a lot of fun in running a self-managed cluster.

Architecture

Let’s define our architecture here: 4 nodes in total, 1 master (which also acts as a worker) and 3 workers.

[Architecture diagram: the master node bridges the institute network (eth0) and the private cluster network (eth1, 192.168.10.0/24) with the three worker nodes]

1. master node

  • eth0: connected to the institute network (cable)

    • public interface, using DHCP
    • internet downloads and updates
    • users can access it via ssh/scp
  • eth1: connected to the worker nodes

    • private interface, using static IP
    • communicates with the other nodes:

      • ssh/scp
      • data transfer
      • parallel communication

Configure the network interfaces eth0 and eth1

# Configure public interface (assumes DHCP from institute network)
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 << EOF
TYPE=Ethernet
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
# Request static IP from our institute's DHCP server if possible
# This makes routing more reliable
DHCP_CLIENT_ID=cluster-master
EOF

# Configure private cluster network
cat > /etc/sysconfig/network-scripts/ifcfg-eth1 << EOF
TYPE=Ethernet
DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.10.1
NETMASK=255.255.255.0
ONBOOT=yes
EOF

# Apply new network configuration
systemctl restart network
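
To confirm both interfaces came up as expected, here is a quick sanity check (assuming the usual iproute2 tools on CentOS/RHEL):

# Check that eth0 got a DHCP lease and eth1 has the static address
ip addr show eth0
ip addr show eth1

# Check the resulting routing table
ip route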

NAT

# Enable IP forwarding persistently
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf

# Enable connection tracking timeout optimization for HPC workloads
echo "net.netfilter.nf_conntrack_tcp_timeout_established = 86400" >> /etc/sysctl.conf
echo "net.netfilter.nf_conntrack_max = 131072" >> /etc/sysctl.conf

# Apply sysctl changes
sysctl -p

# Set up NAT (masquerade the private cluster network behind eth0)
iptables -t nat -A POSTROUTING -o eth0 -s 192.168.10.0/24 -j MASQUERADE

The last command is important: the MASQUERADE rule is what allows the return traffic to find its way back to the worker nodes. We will explain this later.
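
A quick way to confirm that forwarding and masquerading are actually in effect (the packet counters on the MASQUERADE rule increase once worker traffic starts flowing):

# Confirm IP forwarding is enabled and the MASQUERADE rule is installed
sysctl net.ipv4.ip_forward
iptables -t nat -L POSTROUTING -n -v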

Set up the firewall

# Clear existing rules
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X

# Set default policies
iptables -P INPUT DROP
iptables -P FORWARD DROP  # We'll explain this choice
iptables -P OUTPUT ACCEPT

# Allow loopback
iptables -A INPUT -i lo -j ACCEPT

# Allow established and related connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow SSH from institute network
iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT

# Allow all traffic from the cluster's private network
iptables -A INPUT -i eth1 -s 192.168.10.0/24 -j ACCEPT

# Allow forwarding from cluster to internet
iptables -A FORWARD -i eth1 -s 192.168.10.0/24 -o eth0 -j ACCEPT

# Allow inbound HTTP/HTTPS on the public interface
# (outbound package downloads are already covered by OUTPUT ACCEPT and the
#  ESTABLISHED rule; keep these only if the master should serve web traffic)
iptables -A INPUT -i eth0 -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -i eth0 -p tcp --dport 443 -j ACCEPT

# Set up local package caching repository later (optional)
iptables -A INPUT -i eth1 -p tcp --dport 80 -j ACCEPT

# Save iptables rules
iptables-save > /etc/sysconfig/iptables

I set the default FORWARD policy to DROP for security reasons:

  • It prevents unauthorized traffic from traversing the master node

  • It creates a default-deny stance, where only explicitly allowed traffic passes

  • It prevents potential lateral movement if one node is compromised
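
A quick way to verify the default-deny stance after applying the rules above (both INPUT and FORWARD should report policy DROP):

# Show each chain with its default policy and packet counters
iptables -L -n -v | grep '^Chain'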

2. worker nodes

config

# Worker node 1 (192.168.10.2)
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 << EOF
TYPE=Ethernet
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.10.2
NETMASK=255.255.255.0
GATEWAY=192.168.10.1
DNS1=8.8.8.8
ONBOOT=yes
EOF

# here the GATEWAY setting automatically creates the default route in the routing table

# Worker node 2 (192.168.10.3)
# Change IPADDR=192.168.10.3 on the second worker node

# Worker node 3 (192.168.10.4)
# Change IPADDR=192.168.10.4 on the third worker node

# Restart network service on each worker node
systemctl restart network

Routing Configuration for Worker Nodes

As we use the master node’s eth1 (192.168.10.1) as the gateway for the worker nodes (GATEWAY=192.168.10.1), the setting above creates a default route on each worker node that sends all traffic not destined for the local network (192.168.10.0/24) to the master node (192.168.10.1).

$ route -n
# see the result: (send all traffic (0.0.0.0/0) to gateway 192.168.10.1)
0.0.0.0         192.168.10.1      0.0.0.0         UG    0      0        0 eth0

Manual Command

Running the command manually can achieve the same result.

route add -net 0.0.0.0 gw 192.168.10.1

This manually adds a default route to the current routing table. It has the same immediate effect as the configuration file setting, but it's temporary and will be lost after a reboot or network service restart.

The difference is primarily in persistence and when the configuration happens. Using the network configuration file is the standard way to set up permanent routes in CentOS/RHEL systems.
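
For reference, the iproute2 equivalent of the same temporary default route (also non-persistent) would be:

# Same effect as the route command above, lost on reboot as well
ip route add default via 192.168.10.1 dev eth0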

Firewall

# Clear existing rules
iptables -F
iptables -X

# Set default policies
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT

# Allow loopback
iptables -A INPUT -i lo -j ACCEPT

# Allow established connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow SSH from cluster nodes only
iptables -A INPUT -p tcp --dport 22 -s 192.168.10.0/24 -j ACCEPT

# HPC/MPI Communication - comprehensive approach
# Allow all TCP/UDP between cluster nodes for parallel computing
# this is optional; skip it if you don't plan to do parallel computing
iptables -A INPUT -s 192.168.10.0/24 -p tcp -j ACCEPT
iptables -A INPUT -s 192.168.10.0/24 -p udp -j ACCEPT

# Save iptables rules
iptables-save > /etc/sysconfig/iptables
systemctl enable iptables

3. Package management

3.1 Local yum Repo

  1. The worker nodes do not have direct access to the internet, since we keep them inside the private network; therefore we need a way to handle package installation and updates for them.

  2. We want to control which packages end up in our yum repo.

So here we build a local yum repository on the master node.

# On master node
# Install required packages
yum install -y createrepo nginx

# Create repository directory
mkdir -p /var/www/html/centos-repo

# Configure Nginx
cat > /etc/nginx/conf.d/repo.conf << EOF
server {
    listen 80;
    server_name _;
    root /var/www/html;

    location / {
        autoindex on;
    }
}
EOF

# Start and enable Nginx
systemctl enable nginx
systemctl start nginx

# Download packages to repository
yum install -y yum-utils
repotrack -p /var/www/html/centos-repo <package-name> 
# Repeat for packages we need

# Create repository metadata
createrepo /var/www/html/centos-repo

# On each worker node: configure it to use this repository
cat > /etc/yum.repos.d/cluster-local.repo << EOF
[cluster-local]
name=Cluster Local Repository
baseurl=http://192.168.10.1/centos-repo
enabled=1
gpgcheck=0
EOF
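
On a worker node, a quick way to confirm the local repository is reachable and usable (cluster-local is the repo id defined in the file above):

# Refresh metadata and confirm the repo shows up
yum clean all
yum repolist enabled

# List only what the local repository serves
yum --disablerepo='*' --enablerepo=cluster-local list available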

3.2 Optional for package management

In order to support

  1. worker node package updates

  2. self-managed packages

we can also use scp to transfer packages from the master node.

# On master node, download and transfer RPM
yum install -y yum-utils
yumdownloader <package-name>
scp <package-name>.rpm 192.168.10.2:/tmp/   # copy to the target worker node (adjust the IP per node)

# On worker node
sudo rpm -ivh /tmp/<package-name>.rpm

3.3 directly route worker node to internet

If we don’t need high security, we can also open the private cluster to the public internet. That involves configuring the routing table, which we don’t discuss here.

4. Other Installations

ssh configuration

# On master node, generate SSH key
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""

# Copy the key to all nodes (including itself)
for i in {1..4}; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub 192.168.10.$i
done

# Do the same on each worker node to allow any-to-any communication
# (Run similar commands on each worker node)
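
After the keys are distributed, a quick loop confirms passwordless login works from the master to every node (BatchMode makes ssh fail instead of prompting if a key is missing):

# Should print each node's hostname without asking for a password
for i in {1..4}; do
  ssh -o BatchMode=yes 192.168.10.$i hostname
done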

Network Monitor & iptables Log

# Install tools
yum install -y tcpdump nmap iftop

# Set up automatic monitoring with fail2ban to prevent brute force attacks
yum install -y fail2ban
cat > /etc/fail2ban/jail.local << EOF
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/secure
maxretry = 5
bantime = 3600
EOF

# Start and enable fail2ban
systemctl enable fail2ban
systemctl start fail2ban

# Add logging rules before the final DROP rules
iptables -A INPUT -j LOG --log-prefix "IPTables-Input-Dropped: " --log-level 4
iptables -A FORWARD -j LOG --log-prefix "IPTables-Forward-Dropped: " --log-level 4

# Save iptables rules
iptables-save > /etc/sysconfig/iptables
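
Dropped packets logged by the rules above end up in the kernel log, and fail2ban can report its own state; two quick checks (assuming the default rsyslog paths on CentOS/RHEL):

# Show recently dropped and logged packets
grep 'IPTables-Input-Dropped' /var/log/messages | tail

# Show the SSH jail status and any currently banned IPs
fail2ban-client status sshd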

Optional Parallel Computing Configuration

# On all nodes (master and workers)
# Install OpenMPI
yum install -y openmpi openmpi-devel

# Configure environment in /etc/profile.d/
cat > /etc/profile.d/mpi.sh << EOF
export PATH=\$PATH:/usr/lib64/openmpi/bin
export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
EOF

# Source the new environment
source /etc/profile.d/mpi.sh

# Test MPI communication
# Create a hostfile
cat > /home/username/hostfile << EOF
192.168.10.1 slots=128
192.168.10.2 slots=128
192.168.10.3 slots=128
192.168.10.4 slots=128
EOF

# Run a simple MPI test
mpirun -np 4 --hostfile /home/username/hostfile hostname

Torque/PBS

# Install Torque on master node
yum install -y torque-server torque-scheduler torque-client

# Configure server nodes file
cat > /var/torque/server_priv/nodes << EOF
192.168.10.1 np=128
192.168.10.2 np=128
192.168.10.3 np=128
192.168.10.4 np=128
EOF

# Start Torque server
systemctl enable pbs_server
systemctl start pbs_server

# Install Torque on worker nodes
for i in {2..4}; do
  ssh 192.168.10.$i "yum install -y torque-mom torque-client; systemctl enable pbs_mom; systemctl start pbs_mom"
done
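
Torque also needs at least one queue before jobs can be submitted; a minimal sketch (the queue name batch is an arbitrary choice here):

# On the master node: create a default execution queue and enable scheduling
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch started=true"
qmgr -c "set server default_queue=batch"
qmgr -c "set server scheduling=true"

# Submit a trivial test job and check its state
echo "sleep 30" | qsub
qstat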

5. Traffic Flow:

Forward Chain Traffic Flow in Both Directions

When we create the forward rule iptables -A FORWARD -i eth1 -s 192.168.10.0/24 -o eth0 -j ACCEPT, traffic coming in on eth1 can be forwarded out through eth0. In other words, traffic from the worker nodes reaches the master node, and the master forwards it to the institute network. Here comes the question: where is the backward flow?

iptables -A FORWARD -i eth1 -s 192.168.10.0/24 -o eth0 -j ACCEPT

Let’s see how the traffic flows first.

Outbound Traffic (Worker → Internet)

The rule above allows packets to travel from the worker nodes (coming in on eth1) to be forwarded out to the institute network (through eth0). This handles the first half of any connection, which is the outbound request.

Return Traffic (Internet → Worker)

For the return traffic, we might expect to need a rule like:

iptables -A FORWARD -i eth0 -d 192.168.10.0/24 -o eth1 -j ACCEPT

However, if we look closely at the original configuration, here is the command:

iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT

This rule is handling the return traffic, because:

  1. When a worker node initiates a connection, the outbound packet creates an entry in the connection tracking table (an ESTABLISHED connection)

  2. Any returning packets associated with that connection are marked as ESTABLISHED

  3. The rule above allows all ESTABLISHED connections through, regardless of interface

This is more secure than explicitly allowing all traffic from eth0 to eth1, because it only permits return traffic for connections that were initiated from inside our cluster.

If this state tracking rule weren't present, we would absolutely need the explicit backward-traffic rule above. Without either approach, connections would work one way only: worker nodes could send requests but never receive responses.
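
The connection tracking table that makes the ESTABLISHED match work can be inspected directly (assuming the conntrack tool from conntrack-tools is installed):

# Show tracked connections involving the private cluster network
conntrack -L | grep 192.168.10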

Connection Flow

Let's trace a web request from a worker node:

  1. Worker (192.168.10.2) tries to access google.com
  2. Packet travels: Worker → Master's eth1
  3. Master checks FORWARD chain, matches -i eth1 -s 192.168.10.0/24 -o eth0 rule
  4. Master performs NAT, changing source IP to its own public IP
  5. Packet leaves through eth0 to institute network
  6. Google responds to master's public IP
  7. Packet arrives at master's eth0
  8. Master checks connection tracking table, sees this is a response
  9. Packet is marked as ESTABLISHED
  10. Master checks FORWARD chain, matches the ESTABLISHED rule
  11. Master performs reverse NAT, changing destination to worker's IP
  12. Packet leaves through eth1 to worker
  13. Worker receives response
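
This translation can be watched live with tcpdump (installed earlier) on the master; a minimal sketch, using worker 192.168.10.2 and HTTPS traffic as an example (run the two captures in separate terminals):

# Request as it arrives from the worker on the private interface
tcpdump -ni eth1 host 192.168.10.2 and tcp port 443

# Same traffic leaving eth0 with the master's public source address
tcpdump -ni eth0 tcp port 443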

More details

For more details, check the earlier post here.
