What Are Containers, Anyway?
Containers are lightweight, portable, and efficient, making them a popular choice for deploying and running applications. In this tutorial, we’ll guide you through the process of creating a minimal container using Go. The example code provided focuses on essential containerization concepts, including namespaces, chroot, and control groups (cgroups).
Before getting started, ensure you have the following installed:
Go programming language: Install Go
Basic understanding of Linux namespaces and control groups
Introduction
So what are Linux namespaces and control groups?
Namespaces have been part of the Linux kernel since about 2002, and over time more tooling and namespace types have been added. Real container support was added to the Linux kernel only in 2013, however. This is what made namespaces really useful and brought them to the masses.
But what are namespaces exactly? Here’s a wordy definition from Wikipedia:
“Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources.”
In other words, the key feature of namespaces is that they isolate processes from each other. On a server where you are running many different services, isolating each service and its associated processes from other services means that there is a smaller blast radius for changes, as well as a smaller footprint for security‑related concerns. Mostly though, isolating services meets the architectural style of microservices as described by Martin Fowler.
Types of Namespaces
Within the Linux kernel, there are different types of namespaces. Each namespace has its own unique properties:
A user namespace has its own set of user IDs and group IDs for assignment to processes. In particular, this means that a process can have root privilege within its user namespace without having it in other user namespaces.
A process ID (PID) namespace assigns a set of PIDs to processes that are independent from the set of PIDs in other namespaces. The first process created in a new namespace has PID 1 and child processes are assigned subsequent PIDs. If a child process is created with its own PID namespace, it has PID 1 in that namespace as well as its PID in the parent process’ namespace. See below for an example.
A network namespace has an independent network stack: its own private routing table, set of IP addresses, socket listing, connection tracking table, firewall, and other network‑related resources.
A mount namespace has an independent list of mount points seen by the processes in the namespace. This means that you can mount and unmount filesystems in a mount namespace without affecting the host filesystem.
An interprocess communication (IPC) namespace has its own IPC resources, for example POSIX message queues.
A UNIX Time‑Sharing (UTS) namespace allows a single system to appear to have different host and domain names to different processes.
Containers are fast, isolated environments. Many things are involved in building one; we will focus on the isolation part, and my main goal is to demystify containers.
Assuming you are on a Linux machine (if you are on Windows, try an Ubuntu image under WSL from PowerShell :-), run the id command:
host-machine $ id
uid=1000(mohamed) gid=1000(mohamed) groups=1000(mohamed) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
unshare command
Now I run the following unshare command to create a new namespace with its own user and PID namespaces. I map the root user to the new namespace (in other words, I have root privilege within the new namespace), mount a new proc filesystem, and fork my process (in this case, bash) in the newly created namespace.
unshare --user --pid --map-root-user --mount-proc --fork bash
Congratulations, you are now in an isolated namespace. Your process tree (PID namespace) is isolated, but you still share the host's filesystem and network, and your entry point is /bin/bash.
The ps -ef command shows there are two processes running – bash and the ps command itself – and the id command confirms that I’m root in the new namespace (which is also indicated by the changed command prompt):
root # ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 14:46 pts/0 00:00:00 bash
root 15 1 0 14:46 pts/0 00:00:00 ps -ef
root # id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
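The same experiment can be reproduced from Go. The sketch below (Linux only) starts bash in new user and PID namespaces and maps your UID and GID to root inside them. This is roughly what unshare --user --pid --map-root-user --fork bash does. It does not implement --mount-proc, so ps inside will still list host processes, although echo $$ reports 1. It assumes your distribution allows unprivileged user namespaces. rootMapping is an illustrative helper name.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

// rootMapping maps ID 0 inside the new user namespace to a single host ID
// (the invoking user), similar to unshare --map-root-user.
func rootMapping(hostID int) []syscall.SysProcIDMap {
	return []syscall.SysProcIDMap{{ContainerID: 0, HostID: hostID, Size: 1}}
}

func main() {
	cmd := exec.Command("/bin/bash")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// The child is created directly in the new namespaces, so no
		// extra fork (unshare's --fork) is needed: bash becomes PID 1.
		Cloneflags:  syscall.CLONE_NEWUSER | syscall.CLONE_NEWPID,
		UidMappings: rootMapping(os.Getuid()),
		GidMappings: rootMapping(os.Getgid()),
	}
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```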
Namespaces and Containers
Namespaces are one of the technologies that containers are built on, used to enforce segregation of resources. We’ve shown how to create namespaces manually, but container runtimes like Docker, rkt, Podman, runC, containerd, and many other container technologies create them for you.
One of the most unusual projects is Kata Containers (https://katacontainers.io/), which describes itself as a mix between containers and virtual machines.
What Are cgroups?
cgroups, or control groups, are a Linux kernel feature that enables the management and limitation of system resources like CPU, memory, and network bandwidth, among others. We can use cgroups to set limits on these resources and distribute them among different groups of processes.
cgroups are organized hierarchically, with a root cgroup and child cgroups, and resource limits are enforced by controllers: for example, a CPU controller for CPU time or a memory controller for memory.
We can use cgroups for various purposes, such as controlling resource usage in a multi-tenant environment, providing Quality of Service (QoS) guarantees, and running containers.
Cgroups provide the following features:
Resource limits — You can configure a cgroup to limit how much of a particular resource (memory or CPU, for example) a process can use.
Prioritization — You can control how much of a resource (CPU, disk, or network) a process can use compared to processes in another cgroup when there is resource contention.
Accounting — Resource limits are monitored and reported at the cgroup level.
Control — You can change the status (frozen, stopped, or restarted) of all processes in a cgroup with a single command.
Creating a cgroup
The following commands create a v1 cgroup (you can tell by the pathname format) called foo and set its memory limit to 50,000,000 bytes (50 MB).
root # mkdir -p /sys/fs/cgroup/memory/foo
root # echo 50000000 > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
Now I can assign a process to the cgroup, thus imposing the cgroup’s memory limit on it. I’ve written a shell script called test.sh, which prints "cgroup testing tool" to the screen and then waits, doing nothing. For my purposes, it is a process that continues to run until I stop it.
I start test.sh in the background and its PID is reported as 2428. The script produces its output, and then I assign the process to the cgroup by writing its PID into the cgroup file /sys/fs/cgroup/memory/foo/cgroup.procs.
root # ./test.sh &
[1] 2428
root # cgroup testing tool
root # echo 2428 > /sys/fs/cgroup/memory/foo/cgroup.procs
To validate that my process is in fact subject to the memory limits that I defined for cgroup foo, I run the following ps command. The -o cgroup flag displays the cgroups to which the specified process (2428) belongs. The output confirms that its memory cgroup is foo.
root # ps -o cgroup 2428
CGROUP
12:pids:/user.slice/user-0.slice/\
session-13.scope,10:devices:/user.slice,6:memory:/foo,...
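ps -o cgroup shows the contents of /proc/<pid>/cgroup. The Go sketch below reads that file directly and prints the cgroup of each controller. It assumes the usual hierarchy-ID:controller-list:path line format; parseCgroupFile is an illustrative name.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

// parseCgroupFile parses the contents of /proc/<pid>/cgroup. Each line has
// the form "hierarchy-ID:controller-list:path"; the result maps every
// controller to its cgroup path. On cgroup v2 the single line "0::/path"
// has an empty controller list and is stored under the key "".
func parseCgroupFile(data string) map[string]string {
	result := make(map[string]string)
	for _, line := range strings.Split(strings.TrimSpace(data), "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) != 3 {
			continue // skip blank or malformed lines
		}
		for _, controller := range strings.Split(parts[1], ",") {
			result[controller] = parts[2]
		}
	}
	return result
}

func main() {
	path := "/proc/self/cgroup"
	if len(os.Args) > 1 {
		path = "/proc/" + os.Args[1] + "/cgroup" // e.g. pass 2428
	}
	data, err := os.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}
	for controller, cgPath := range parseCgroupFile(string(data)) {
		fmt.Printf("%q -> %s\n", controller, cgPath)
	}
}
```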
By default, when a process in the cgroup exceeds its memory limit and the kernel cannot reclaim enough memory, the kernel's OOM killer terminates it.
That is a fair amount of information about namespaces and cgroups.
You can read the full article about them by Scott van Kalken of F5, as well as the post "Demystifying Containers 101" and, for a focus on the Docker ecosystem, "A Beginner-Friendly Introduction to Containers, VMs and Docker".
Part 1: Chroot
I will not use namespaces in this part.
This may surprise you, but I will still achieve isolation, using chroot, a simple UNIX tool.
chroot, short for "change root," is a Unix system call that changes the root directory of a process to a specified path, effectively creating a new root filesystem for the process and its children. This can be a powerful tool for creating isolated environments or "chroot jails."
How Chroot Works:
Setting a New Root Directory: When you execute the chroot system call or the chroot command in the shell, it changes the root directory for the process and its children. The new root directory becomes the / (root) directory for that process, isolating it from the actual root directory of the host system.
Isolation: After the chroot operation, the process and its children can only access files and directories within the new root directory. They cannot access files outside this new root, providing a level of isolation and containment.
Use Cases:
System Recovery: chroot is commonly used in system recovery scenarios. If your system becomes unbootable or experiences issues, you can boot from a live CD/USB, chroot into the broken system, and make necessary repairs without affecting the rest of the host system.
Environment Isolation: Developers and system administrators may use chroot to create isolated environments for testing or building software. This is especially common in scenarios where different versions of libraries or dependencies are required.
Security: Although chroot provides some level of isolation, it's not foolproof in terms of security. It was not designed as a security feature and should not be solely relied upon for containing malicious processes. Modern containerization technologies, like Docker, utilize more advanced mechanisms, such as Linux namespaces and cgroups, to provide stronger isolation.
Example:
Consider the following example:
sudo mkdir /mychroot
sudo cp -r /bin /lib /lib64 /usr /mychroot/
sudo chroot /mychroot /bin/bash
In this example:
We create a directory called /mychroot and copy essential binaries and libraries into it.
We use chroot to change the root directory to /mychroot.
After the chroot command, executing /bin/bash will run a Bash shell within the isolated environment.
Keep in mind that chroot by itself does not provide complete isolation; it is often used in conjunction with other tools and techniques to create more secure and robust containerized environments.
Prepare the Ubuntu Root Filesystem
Now, the final thing you will need before you start is a root filesystem.
We will use Docker to download the Ubuntu filesystem.
Docker is only needed for this download step. Run the following in your project root:
$ docker run -d --rm --name ubuntu_fs ubuntu:20.04 sleep 1000
$ mkdir -p ./ubuntu_fs
$ docker export ubuntu_fs | sudo tar -xf - -C ./ubuntu_fs
$ docker stop ubuntu_fs
Now we have ubuntu_fs inside our project. Next, create main.go in the main package:
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

func main() {
	if len(os.Args) < 3 {
		log.Fatal("Usage: run <command_name>, like `run /bin/bash` or `run echo hello`")
	}
	switch os.Args[1] {
	case "run":
		run(os.Args[2:]...)
	case "child":
		child(os.Args[2:]...)
	default:
		log.Fatal("Unknown command. Use run <command_name>, like `run /bin/bash` or `run echo hello`")
	}
}

func run(command ...string) {
	log.Println("Executing", command, "from run")
	// Re-execute this same binary with "child" as the first argument
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, command[0:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// Cloneflags is only available in Linux
	// CLONE_NEWUTS namespace isolates hostname
	// CLONE_NEWPID namespace isolates processes
	// CLONE_NEWNS namespace isolates mounts
	// Unshareflags: CLONE_NEWNS also makes Go remount / as private, so mounts
	// do not propagate back to the host; CLONE_NEWNET gives the child an
	// empty network namespace
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS | syscall.CLONE_NEWNET,
	}
	// Run child using namespaces. The command provided will be executed inside that.
	must(cmd.Run())
}

func child(command ...string) {
	// Create cgroup (uses the host's /sys/fs/cgroup, so it must run before chroot)
	cg()
	must(syscall.Sethostname([]byte("container")))
	must(syscall.Chroot("./ubuntu_fs"))
	// Change directory after chroot
	must(os.Chdir("/"))
	// Mount /proc inside container so that `ps` command works
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	// Mount a temporary filesystem
	if _, err := os.Stat("mytemp"); os.IsNotExist(err) {
		must(os.Mkdir("mytemp", os.ModePerm))
	}
	must(syscall.Mount("something", "mytemp", "tmpfs", 0, ""))
	// Build the command after chroot so a bare name like "echo" is looked up
	// in the container's filesystem, not the host's
	cmd := exec.Command(command[0], command[1:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	must(cmd.Run())
	// Cleanup mount
	must(syscall.Unmount("proc", 0))
	must(syscall.Unmount("mytemp", 0))
}

func cg() {
	// cgroup v1 location in Ubuntu (e.g. 20.04); cgroup-v2-only systems differ
	cgroups := "/sys/fs/cgroup/"
	pids := filepath.Join(cgroups, "pids")
	containers_mini := filepath.Join(pids, "containers_mini")
	_ = os.Mkdir(containers_mini, 0755) // ignore "already exists"
	// Limit to max 20 pids
	must(os.WriteFile(filepath.Join(containers_mini, "pids.max"), []byte("20"), 0700))
	// Cleanup cgroup when it is not being used
	must(os.WriteFile(filepath.Join(containers_mini, "notify_on_release"), []byte("1"), 0700))
	pid := strconv.Itoa(os.Getpid())
	// Apply this and any child process in this cgroup
	must(os.WriteFile(filepath.Join(containers_mini, "cgroup.procs"), []byte(pid), 0700))
}

func must(err error) {
	if err != nil {
		log.Printf("Error: %v\n", err)
		panic(err)
	}
}
This code is based on the example Liz Rice presents in this talk:
https://youtu.be/Utf-A4rODH8?si=ULuzE8E5N7N17dH9
Understanding the Code
1. Main Function
The main function serves as the entry point of the program. It uses command-line arguments to determine whether to run a new container or act as a child process within an existing container.
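func main() {
	if len(os.Args) < 3 {
		log.Fatal("Usage: run <command_name>, like `run /bin/bash` or `run echo hello`")
	}
	switch os.Args[1] {
	case "run":
		run(os.Args[2:]...)
	case "child":
		child(os.Args[2:]...)
	default:
		log.Fatal("Unknown command. Use run <command_name>, like `run /bin/bash` or `run echo hello`")
	}
}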
2. Run Function
The run function sets up the container environment and executes a specified command inside it.
func run(command ...string) {
	log.Println("Executing", command, "from run")
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, command[0:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS | syscall.CLONE_NEWNET,
	}
	must(cmd.Run())
}
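The line cmd := exec.Command("/proc/self/exe", append([]string{"child"}, command[0:]...)...) makes the program re-execute itself: /proc/self/exe points to the binary of the running process, and the new process receives child followed by the user's command as its arguments.
The new process starts inside the namespaces configured below, and its main function dispatches to child.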
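The Cloneflags specify the namespaces created for the child process (UTS, PID, and mount namespaces).
The Unshareflags additionally give the child its own, initially empty, network namespace (CLONE_NEWNET); including CLONE_NEWNS there also makes Go remount / as private, so mounts made inside the container do not propagate back to the host.
The cmd.Run() method starts this child copy of the program and waits for it to exit.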
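The /proc/self/exe trick can be seen in isolation in the sketch below. The program re-runs its own binary with "child" prepended to the arguments, and each copy prints its PID. childArgs is an illustrative helper. No namespaces are involved, so it runs without sudo.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// childArgs builds the argument list for re-running this binary in "child"
// mode: ["child", <user command>...].
func childArgs(command []string) []string {
	return append([]string{"child"}, command...)
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		fmt.Println("child pid", os.Getpid(), "would run", os.Args[2:])
		return
	}
	fmt.Println("parent pid", os.Getpid())
	// /proc/self/exe is a symlink to the binary of the running process.
	cmd := exec.Command("/proc/self/exe", childArgs(os.Args[1:])...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Built as, say, reexec and run as ./reexec echo hello, it prints the parent PID and then a different child PID together with [echo hello]. In the container program, the only difference is that the child is started with namespace flags.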
3. Child Function
The child function is responsible for setting up the container filesystem and executing the specified command inside it.
func child(command ...string) {
	// ...
	cg()
	must(syscall.Sethostname([]byte("container")))
	must(syscall.Chroot("./ubuntu_fs"))
	must(os.Chdir("/"))
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	must(syscall.Mount("something", "mytemp", "tmpfs", 0, ""))
	must(cmd.Run())
	must(syscall.Unmount("proc", 0))
	must(syscall.Unmount("mytemp", 0))
}
The cg function sets up a control group (cgroup) to limit resource usage for the container.
Sethostname sets the hostname inside the container.
Chroot changes the root directory for the container.
Mount is used to mount essential filesystems like /proc and a temporary filesystem.
Finally, the command is executed within the container.
4. Control Groups (Cgroups)
The cg function creates and configures a cgroup for the container, limiting the number of processes.
func cg() {
	// cgroup v1 location in Ubuntu (e.g. 20.04)
	cgroups := "/sys/fs/cgroup/"
	pids := filepath.Join(cgroups, "pids")
	containers_mini := filepath.Join(pids, "containers_mini")
	_ = os.Mkdir(containers_mini, 0755) // ignore "already exists"
	// Limit to max 20 pids
	must(os.WriteFile(filepath.Join(containers_mini, "pids.max"), []byte("20"), 0700))
	// Cleanup cgroup when it is not being used
	must(os.WriteFile(filepath.Join(containers_mini, "notify_on_release"), []byte("1"), 0700))
	pid := strconv.Itoa(os.Getpid())
	// Apply this and any child process in this cgroup
	must(os.WriteFile(filepath.Join(containers_mini, "cgroup.procs"), []byte(pid), 0700))
}
Cgroups are used to control and limit resource usage for processes.
In this example, the cgroup limits the maximum number of processes in the container to 20, using the cgroup v1 pids controller.
5. Error Handling
The must function is a simple utility function for handling errors.
func must(err error) {
	if err != nil {
		log.Printf("Error: %v\n", err)
		panic(err)
	}
}
If an error occurs, it is logged, and the program is terminated.
Building and Running the Container
To run the minimal container, follow these steps:
Build the executable: go build -o mycontainer main.go
Create a filesystem directory with an Ubuntu root filesystem, e.g., ubuntu_fs.
Run the container from the project directory, since the code chroots into the relative path ./ubuntu_fs: sudo ./mycontainer run /bin/bash
Remember that you need to run it with sudo.
Your entry point is /bin/bash.
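Without sudo, the program only fails when its first privileged call (starting the namespaced child) returns a permission error. To fail fast with a clearer message, a check like the one sketched below could run at the top of main. requireRoot is a hypothetical helper, not part of the original code.

```go
package main

import (
	"fmt"
	"os"
)

// requireRoot returns an error unless euid is 0. Creating UTS/PID/mount
// namespaces, chroot, sethostname and mount all need root privileges when
// no user namespace is involved.
func requireRoot(euid int) error {
	if euid != 0 {
		return fmt.Errorf("must run as root (try: sudo ./mycontainer run /bin/bash), euid is %d", euid)
	}
	return nil
}

func main() {
	if err := requireRoot(os.Geteuid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("running as root")
}
```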
Now you are in your own minimal container, and hopefully you have a deeper understanding of how containers work. If I have more time in the future, I may add a network isolation layer, or you can do it yourself. Thank you for your time; I hope this helped.
Further reading:
Namespaces & Go: a series of articles explaining namespaces with Go examples
“Creating Network Stacks and Connecting with the Internet” by “Shrikanta Mazumder”
https://songrgg.github.io/programming/linux-namespace-part01-uts-pid/
In the next part, we will create a network layer that gives our container a virtual Ethernet interface in an isolated subnet, using a host bridge as the gateway. See you soon!
You can find me on LinkedIn:
https://www.linkedin.com/in/mohamed-elkerwash/