Introduction
Llama 2 is a large language model that can generate text, translate languages, and answer questions in an informative way. In this blog post, I will show you how to run Llama 2 on your local computer.
Prerequisites:
- Install Anaconda
- Install Python 3.11
Steps
Step 1: Get access to the Llama 2 model
1.1: Visit huggingface.co
Model Link: https://huggingface.co/meta-llama/Llama-2-7b-hf
1.2: Create an account on HuggingFace
1.3: Request access to the Llama model
It may take a day to get access.
1.4: Go to the link below and request Llama access
Link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
1.5: Since Llama 2 is a gated repo, log in to Hugging Face and generate a token.
Link: https://huggingface.co/settings/tokens
pip install huggingface_hub
huggingface-cli login
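Alternatively, you can log in from Python with the huggingface_hub library (a sketch; the token string below is a placeholder for the one you generated):

from huggingface_hub import login

# Paste the access token generated on the Hugging Face settings page.
login(token="hf_...")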
Step 2: Create a conda environment and activate it
conda create -n py_3_11_llama2_run python=3.11 -y
conda activate py_3_11_llama2_run
Step 3: Install the libraries
pip install transformers torch accelerate
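As an optional sanity check, this one-liner confirms the libraries import correctly and prints their versions:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"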
Step 4: Create a file run.py (e.g. with "touch run.py") containing the following code
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

timeStart = time.time()

# Load the tokenizer and model (the weights are downloaded on the first run).
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf"
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

print("Load model time: ", time.time() - timeStart)

while True:
    input_str = input('Enter: ')
    if input_str == 'exit':
        break
    input_token_length = input('Enter length: ')

    timeStart = time.time()

    # Tokenize the prompt and generate up to the requested number of new tokens.
    inputs = tokenizer.encode(
        input_str,
        return_tensors="pt"
    )
    outputs = model.generate(
        inputs,
        max_new_tokens=int(input_token_length),
    )
    output_str = tokenizer.decode(outputs[0])

    print(output_str)
    print("Time taken: ", time.time() - timeStart)
Step 5: Run the Python file
python run.py
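An illustrative session might look like this (the prompt and length are examples; the generated text will vary from run to run):

Enter: What is the capital of France?
Enter length: 50

The script then prints the generated text and the time taken; typing exit quits the loop.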
Performance:
I am using a CPU with 20 GB of RAM (4 GB + 16 GB).
It took 51 seconds to load the model and 227 seconds to generate a 250-token response.
If you use a GPU, it will take significantly less time.
On Google Colab, I got a response in about 16 seconds.
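For reference, a sketch of loading the model onto a CUDA GPU (this reuses the Step 4 script; device_map="auto" relies on the accelerate package installed in Step 3):

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on available GPU(s) automatically
)
# Move the tokenized prompt to the same device as the model.
inputs = tokenizer.encode(input_str, return_tensors="pt").to(model.device)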
Comments
Hi Nitin, thanks for sharing this. When I follow the above steps, I don't get any output after I give the input string and token length; the code just hangs. I tried this on my MacBook Pro (M2) and also on an AMD-powered machine with 48 GB RAM, a 6-core processor, and an Nvidia GPU.
Can you please advise?
Please try a smaller input_token_length, i.e. 1 (a single token), and check whether it produces an output.
If that step produces an output, then try with 10 tokens and check again.
Also watch Task Manager to see whether there is any fluctuation in RAM and SSD usage.
Please try these and share whether it worked or not.
Can you please share a screenshot?
Also try changing
torch_dtype=torch.bfloat16,
to
torch_dtype=torch.float16,
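In context, the suggested change (a sketch of the same from_pretrained call from Step 4) would look like:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,  # float16 instead of bfloat16
    low_cpu_mem_usage=True,
)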
The article "How to Run Llama 2 on Your Local Computer" by Nithin I. offers a clear and concise guide, simplifying the process for beginners. The step-by-step instructions are incredibly helpful and easy to follow. For further information on tech-related topics like this, visit How to Run Llama 2 Locally