Introduction
There are numerous reasons why you might want to run an LLM locally, isolated from the internet, instead of using the public OpenAI, Meta, or DeepSeek APIs.
For me, the most important are the following:
- Data privacy
  - Some industries (healthcare, finance, legal) require sensitive or proprietary data to remain on-premises or within specific geographic regions.
  - Stringent regulations (e.g., HIPAA, GDPR) are easier to satisfy by avoiding data transfers to external third-party services.
- Security
  - The content, generated or processed, must remain confidential. A local solution prevents sending queries to an external API.
  - You have end-to-end control (network, physical access, encryption at rest/in transit) when models are self-hosted.
Anything that includes your clients' PII or confidential business information should never be uploaded to public services.
To meet these requirements while still using LLMs to boost your organization's performance, a local setup in your own cloud account can be the silver bullet.
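One practical upside of this approach: servers like vLLM and Ollama expose OpenAI-compatible endpoints, so client code barely changes when you swap a public API for a self-hosted one. Here is a minimal sketch; the base URL, API key, model name, and prompt are placeholder assumptions for your own deployment:

```python
# Minimal sketch: call a self-hosted, OpenAI-compatible endpoint inside
# your own network instead of a public API. The base_url, model name,
# and prompt are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://10.0.0.5:8000/v1",  # private address of your LLM host
    api_key="not-needed-locally",        # many local servers ignore the key
)

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # whatever name your server registers
    messages=[{"role": "user", "content": "Summarize this contract clause."}],
)
print(response.choices[0].message.content)
```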
What about the financial side of this setup?
I prepared an approximate forecast for running the Llama 3.2 models on AWS, based on their hardware requirements.
In my calculations I covered the following two cases:
- The LLM is online only during working hours (40 h/week); a minimal scheduling sketch follows this list.
- The LLM is available 24/7 (168 h/week).

No Savings Plans, Reserved Instances, or upfront payments are included.
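The 40 h/week case assumes the instance is stopped outside working hours. A minimal sketch of the start/stop calls with boto3 (the instance ID and region are placeholders; in practice you would trigger these functions from scheduled EventBridge rules or a Lambda):

```python
# Minimal sketch: start/stop the LLM host so it only runs during
# working hours. The instance ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

def start_llm_host() -> None:
    """Call at the start of the working day (e.g., via EventBridge)."""
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

def stop_llm_host() -> None:
    """Call at the end of the working day."""
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])
```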
| Llama Name | Possible EC2 Instance | Instance Details | Monthly Price (40 hrs/week) | Monthly Price (168 hrs/week) |
| --- | --- | --- | --- | --- |
| Llama 3.2 1B Instruct | g4dn.xlarge | 16 GB RAM, 4 vCPUs, 1× NVIDIA T4 GPU | $91.42 | $383.98 |
| Llama 3.2 3B Instruct | g4dn.2xlarge | 32 GB RAM, 8 vCPUs, 1× NVIDIA T4 GPU | $130.70 | $548.96 |
| Llama 3.2 11B Vision | g5.8xlarge | 128 GB RAM, 32 vCPUs, 1× NVIDIA A10G GPU (24 GB VRAM) | $429.33 | $1,803.04 |
| Llama 3.2 90B Vision | g5.48xlarge | 768 GB RAM, 192 vCPUs, 8× NVIDIA A10G GPUs (192 GB total VRAM) | $2,834.85 | $11,906.24 |
Notes on the Table
- Possible EC2 Instance was selected based on the Llama 3.2 model requirements:
  - For the smaller Instruct models (1B, 3B), a single g4dn (NVIDIA T4) or g5 (NVIDIA A10G) instance should be enough.
  - For 11B Vision, the g5.8xlarge meets the minimum ~22 GB VRAM requirement (its A10G has 24 GB VRAM).
  - For 90B Vision, you typically need multiple GPUs. The g5.48xlarge offers 8× A10G GPUs (24 GB each = 192 GB total VRAM) plus sufficient CPU and RAM.
- Monthly Price was calculated from approximate On-Demand hourly rates in us-west-2 (Oregon), for 160 hours/month (40 hrs/week) and 720 hours/month (24×7 usage). Actual AWS rates may vary slightly by region and can change over time; a quick sanity-check sketch follows these notes.
- Storage: The selected instances typically come with local NVMe SSD volumes. In production, you’ll often attach an EBS volume to meet or exceed the required disk space. EBS costs are not included in the prices above.
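The monthly figures are just an hourly rate multiplied by hours of use, so they are easy to reproduce. A minimal sketch follows; the hourly rates are approximate us-west-2 Linux On-Demand figures at the time of writing, so the output will not match the table to the cent:

```python
# Rough monthly cost from an On-Demand hourly rate.
# These hourly rates are approximate us-west-2 Linux On-Demand figures
# at the time of writing; always verify against current AWS pricing.
HOURLY_RATES_USD = {
    "g4dn.xlarge": 0.526,
    "g4dn.2xlarge": 0.752,
    "g5.8xlarge": 2.448,
    "g5.48xlarge": 16.288,
}

def monthly_cost(instance: str, hours_per_month: int) -> float:
    return HOURLY_RATES_USD[instance] * hours_per_month

for name in HOURLY_RATES_USD:
    print(
        f"{name}: ${monthly_cost(name, 160):,.2f} (40 h/week) / "
        f"${monthly_cost(name, 720):,.2f} (24x7)"
    )
```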
Here is a link to the pricing calculator. You can use it as a baseline in your cost forecasts.
Optimization Options
- Reserved Instances or Savings Plans can drastically reduce hourly rates.
- Spot Instances offer lower prices but can be interrupted.
- For large models, you might also explore distributed training/inference techniques to scale across multiple smaller GPUs, as sketched below.
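For example, the 90B model can be sharded across the eight GPUs of a g5.48xlarge with tensor parallelism. Here is a minimal text-only sketch using vLLM; the model ID, sampling settings, and prompt are assumptions to adjust for your deployment:

```python
# Minimal text-only sketch of tensor-parallel inference with vLLM.
# tensor_parallel_size=8 shards the model across the 8 GPUs of a
# g5.48xlarge; the model ID is the Hugging Face repo name and assumes
# you have access to the gated Llama weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    tensor_parallel_size=8,
)
params = SamplingParams(max_tokens=256, temperature=0.2)
outputs = llm.generate(["Explain our data-retention policy in plain terms."], params)
print(outputs[0].outputs[0].text)
```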
Always confirm instance pricing using the official AWS Pricing Calculator or up-to-date AWS documentation for G4 instances and G5 instances.