
Deploy Your LLM on AWS EC2

Ever been excited to deploy your own Large Language Model (LLM) but hit a wall because your laptop isn't up to the task? I know exactly how that feels. LLMs are powerful, but they demand a lot of computing power, something most of us just don't have on hand.

I ran into this problem myself. I wanted to create a cool application using LLMs that my friends could use, but the idea of buying an expensive GPU just to make it work was out of the question. So, I started looking for a workaround. That's when I stumbled upon AWS.

Using AWS turned out to be a lifesaver. I didn't need to invest in any fancy hardware. Instead, AWS lets you pay only for what you use. Now, I've got three different LLM applications running in the cloud, and I didn't have to spend a ton of money on equipment.

In this post, I'll walk you through how to set up your own LLM on the cloud step by step, and I'll share some tips to keep your costs down as well. If I can do it, you can too!

Getting Started

Before deploying on AWS, we need to understand the compute requirements of LLMs and which instance type is a good fit for us.

I will be using an LLM-based RAG application built with Streamlit, powered by the Llama 2 7B model from Meta.

1. Understanding the Right Resources

Since each Large Language Model (LLM) has a different number of parameters and may use different numerical precisions, they require varying GPU capabilities for inference and fine-tuning.

A simple rule of thumb to estimate how much GPU memory is needed to store the model parameters during inference of any open-source LLM is as follows:

GPU Memory (in bytes) = Number of model parameters × Bits per parameter ÷ 8 (bits per byte)

For example, for a 7-billion-parameter model using 32-bit floating-point numbers:

GPU Memory = 7,000,000,000 parameters × 32 bits per parameter ÷ 8 bits per byte
GPU Memory = 28,000,000,000 bytes
GPU Memory = 28 GB

However, this requirement can be quite high. To reduce the memory footprint, we can use quantization techniques. For instance, using 4-bit quantization:

GPU Memory = 7,000,000,000 parameters × 4 bits per parameter ÷ 8 bits per byte
GPU Memory = 3,500,000,000 bytes
GPU Memory = 3.5 GB

Therefore, quantizing the model to 4 bits reduces the GPU memory requirement to approximately 3.5 GB. Keep in mind that this estimate only covers the model weights; in practice you should leave some headroom for activations and the KV cache, so plan for a bit more than the estimate.
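If you prefer to compute this instead of doing the math by hand, here is a tiny Python sketch of the same rule of thumb; the parameter counts and precisions are just the examples from above:

def estimate_gpu_memory_gb(num_parameters: int, bits_per_parameter: int) -> float:
    """Rule of thumb: memory needed just to hold the weights during inference."""
    bytes_needed = num_parameters * bits_per_parameter / 8  # bits -> bytes
    return bytes_needed / 1_000_000_000                     # bytes -> GB (decimal, as above)

# 7B parameters in full 32-bit precision vs. 4-bit quantized
print(estimate_gpu_memory_gb(7_000_000_000, 32))  # 28.0 GB
print(estimate_gpu_memory_gb(7_000_000_000, 4))   # 3.5 GB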

2. Which Compute To Choose?

AWS GPU instance families such as G4, G5, P3, and P4 are built for deep learning and high-performance computing (HPC) workloads on Amazon EC2, and they cover a wide range of GPU types and price points.

Some examples of compute and cost are as follows:

| Instance Type | GPU Type | GPU Memory (GB) | vCPUs | On-Demand Price (per hour) |
| --- | --- | --- | --- | --- |
| g4dn.xlarge | NVIDIA T4 | 16 | 4 | $0.526 |
| g5.xlarge | NVIDIA A10G | 24 | 4 | $1.006 |
| p3.2xlarge | NVIDIA V100 | 16 | 8 | $3.06 |
| g4dn.12xlarge | NVIDIA T4 | 192 | 48 | $4.344 |
| g5.8xlarge | NVIDIA A10G | 96 | 32 | $8.288 |
| p3.8xlarge | NVIDIA V100 | 64 | 32 | $12.24 |
| p4d.24xlarge | NVIDIA A100 | 320 | 96 | $32.77 |
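As a rough way to tie the memory estimate to this table, the sketch below filters a few of the instances listed above by GPU memory and picks the cheapest one. The numbers are simply copied from the table; on-demand prices change over time and by region, so treat them as illustrative only:

# GPU memory and prices copied from the table above (illustrative only).
instances = [
    {"name": "g4dn.xlarge",  "gpu_memory_gb": 16,  "price_per_hour": 0.526},
    {"name": "g5.xlarge",    "gpu_memory_gb": 24,  "price_per_hour": 1.006},
    {"name": "p3.2xlarge",   "gpu_memory_gb": 16,  "price_per_hour": 3.06},
    {"name": "p4d.24xlarge", "gpu_memory_gb": 320, "price_per_hour": 32.77},
]

def cheapest_instance(required_gb: float):
    """Return the cheapest listed instance with enough GPU memory for the model."""
    candidates = [i for i in instances if i["gpu_memory_gb"] >= required_gb]
    return min(candidates, key=lambda i: i["price_per_hour"], default=None)

# A 4-bit quantized 7B model (~3.5 GB) fits comfortably on a g4dn.xlarge.
print(cheapest_instance(3.5))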

Step-by-Step Guide

Now that we have the math and the required resources figured out, let's deploy the LLM application on AWS.

Step - 1

Search for EC2 in your AWS Console. You will see a page similar to the one below.

AWS EC2

Step - 2

Click on Instances in the sidebar and then click on Launch Instance.

EC2 Instance

Step - 3

Configure the EC2 instance using the following settings and launch a new instance (a programmatic equivalent is sketched after the screenshots below).

Name: Your-Application-Name
Amazon Machine Image (AMI): Ubuntu-Latest
Instance Type: g4dn.xlarge
Key Pair: Create a New Key Pair and use that
Network Setting: Use default with the addition of "Allow HTTPS traffic from the internet" & "Allow HTTP traffic from the internet"
Storage: 16 GB (increase this if you plan to store large model weights on the instance)

Note: a stock Ubuntu AMI does not come with NVIDIA drivers. For GPU inference you will need to install the drivers (and CUDA) yourself, or pick an AWS Deep Learning AMI that already includes them.

Configure-EC2-1
Configure-EC2-2
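If you would rather launch the instance programmatically than through the console, here is a rough boto3 equivalent of the settings above. The AMI ID, key pair name, and region are placeholders you would need to replace with your own:

# Hypothetical sketch of launching the same instance with boto3.
# The AMI ID, key name, and region below are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # latest Ubuntu AMI for your region
    InstanceType="g4dn.xlarge",
    KeyName="your-key-pair",           # the key pair created in the console
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 16}}  # 16 GB root volume
    ],
    TagSpecifications=[
        {"ResourceType": "instance",
         "Tags": [{"Key": "Name", "Value": "Your-Application-Name"}]}
    ],
)
print(response["Instances"][0]["InstanceId"])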

Step - 4 (Only For Streamlit Application)

Go to the newly launched instance's security group and define an inbound rule. Since we are running a Streamlit application, which listens on port 8501 by default, we have to open that port.

Port Changing

Here, we will click on "Edit inbound rules", add a Custom TCP rule for port 8501, and click Save rules. (If you prefer to script this step, see the boto3 sketch below.)

Port Changing 2
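The same rule can also be added from code. Here is a minimal boto3 sketch, assuming you replace the security group ID with the one attached to your instance:

# Hypothetical sketch: open Streamlit's default port 8501 with boto3.
# The security group ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

ec2.authorize_security_group_ingress(
    GroupId="sg-xxxxxxxxxxxxxxxxx",  # your instance's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8501,
        "ToPort": 8501,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Streamlit"}],
    }],
)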

Step - 5

Go back to the EC2 instances page, click the Connect button, and connect to your newly created instance.

EC2 Conf

EC2 Conf

Step - 6

Once connected you will see a terminal where you have to run the following commands to install the required dependencies and updates.

Commands:

sudo apt update
sudo apt upgrade -y
sudo apt install git curl unzip tar make vim wget -y
sudo apt install python3-pip -y

Dependencies

Step - 7

Once all the dependencies are installed, you need to clone (download) your Streamlit application and its requirements from your GitHub repo.

You can clone your repo using the following command:

git clone "Your-repository"

Once cloned, you can enter your repo using the following command:

cd "Your-repository"

GitHub Repo Clone

Step - 8

First, we need to set up a virtual environment to ensure all the dependencies for the Streamlit application are installed in an isolated environment. This helps avoid conflicts with system-wide Python packages.

sudo apt install python3-venv
python3 -m venv venv
source venv/bin/activate

Once the virtual environment is created and activated, we need to install the requirements for the Streamlit application. For this, we use pip to read the requirements.txt file and install all the libraries listed in it.

pip install -r requirements.txt

Requirements

Step - 9

Once everything is installed and ready, it's time to run your LLM-based application.

For this, you will run the following command, replacing <your-application-name> with your application's entry-point file (app.py in our case):

python3 -m streamlit run <your-application-name>
python3 -m streamlit run app.py

Running LLM

Once you click on the external link, it will redirect you to your Streamlit application. This is the public link that you can share with your friends so they can enjoy your application as well.

Voila!!!

Streamlit Application
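For context, the application in this walkthrough is a RAG app, but even a minimal Streamlit wrapper around a local model follows the same shape. The sketch below is only an illustration, assuming a 4-bit quantized Llama 2 7B GGUF file has been downloaded to the instance and llama-cpp-python is listed in requirements.txt; it is not the exact application shown in the screenshots:

# Minimal illustrative app.py: a Streamlit front end over a local quantized model.
# The model path is a placeholder; the real app in this post also adds RAG on top.
import streamlit as st
from llama_cpp import Llama  # assumes llama-cpp-python is installed

@st.cache_resource  # load the model once per server process, not on every rerun
def load_model():
    return Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

st.title("My LLM App")
prompt = st.text_area("Ask something:")

if st.button("Generate") and prompt:
    llm = load_model()
    output = llm(prompt, max_tokens=256)
    st.write(output["choices"][0]["text"])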

The streamlit run command above keeps the application running only while your terminal session is connected. Once the terminal is closed, the application will stop as well.

To make sure the application keeps running after the terminal is disconnected, run the following command instead.

nohup python3 -m streamlit run app.py &

nohup ensures the application keeps running even if you log out or lose your terminal session, and the trailing & puts it in the background so you get your prompt back; by default, any output is written to nohup.out.

Conclusion

Deploying LLMs doesn’t have to be limited by your hardware. AWS offers a flexible, cost-effective way to harness the power of LLMs without investing in expensive GPUs. With the right instance and some optimization techniques like quantization, you can run sophisticated models in the cloud efficiently. By following this guide, you’re now equipped to deploy your own LLM-based application, making it accessible and scalable. So, dive into cloud deployment, share your innovations effortlessly, and let your application reach its full potential!

References:

  1. https://aws.amazon.com/ec2/instance-types/
  2. https://docs.aws.amazon.com/ec2/index.html
  3. https://docs.streamlit.io/
