Introduction
Llama 2 is a large language model that can generate text, translate languages, and answer questions in an informative way. In this blog post, I will show you how to run Llama 2 on your local computer.
Prerequisites:
- Install Anaconda
- Install Python 3.11
Steps
Step 1: Get access to the Llama 2 model
1.1: Visit huggingface.co
Model Link: https://huggingface.co/meta-llama/Llama-2-7b-hf
1.2: Create an account on HuggingFace
1.3: Request access to the Llama model
It may take a day to get access.
1.4: Go to the link below and request Llama access
Link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
1.5: Since Llama 2 is a gated repo, log in to Hugging Face and generate a token.
Link: https://huggingface.co/settings/tokens
pip install huggingface_hub
huggingface-cli login
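Alternatively, you can log in from Python with the huggingface_hub library (a sketch; the token string below is a placeholder for the one you generated):

from huggingface_hub import login

# Paste the access token generated on the Hugging Face settings page.
login(token="hf_...")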
Step 2: Create a conda environment and activate it
conda create -n py_3_11_llama2_run python=3.11 -y
conda activate py_3_11_llama2_run
Step 3: Install the libraries
pip install transformers torch accelerate
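As an optional sanity check, this one-liner confirms the libraries import correctly and prints their versions:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"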
Step 4: Create a file run.py (e.g. with "touch run.py") containing the following code
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

timeStart = time.time()

# Load the tokenizer and model (the weights are downloaded on the first run).
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf"
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

print("Load model time: ", time.time() - timeStart)

while True:
    input_str = input('Enter: ')
    if input_str == 'exit':
        break
    input_token_length = input('Enter length: ')

    timeStart = time.time()

    # Tokenize the prompt and generate up to the requested number of new tokens.
    inputs = tokenizer.encode(
        input_str,
        return_tensors="pt"
    )
    outputs = model.generate(
        inputs,
        max_new_tokens=int(input_token_length),
    )
    output_str = tokenizer.decode(outputs[0])

    print(output_str)
    print("Time taken: ", time.time() - timeStart)
Step 5: Run the Python file
python run.py
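An illustrative session might look like this (the prompt and length are examples; the generated text will vary from run to run):

Enter: What is the capital of France?
Enter length: 50

The script then prints the generated text and the time taken; typing exit quits the loop.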
Performance:
I am using a CPU with 20 GB of RAM (4 GB + 16 GB).
It took 51 seconds to load the model and 227 seconds to generate a 250-token response.
If you use a GPU, it will take significantly less time.
On Google Colab, I got a response in about 16 seconds.
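For reference, a sketch of loading the model onto a CUDA GPU (this reuses the Step 4 script; device_map="auto" relies on the accelerate package installed in Step 3):

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on available GPU(s) automatically
)
# Move the tokenized prompt to the same device as the model.
inputs = tokenizer.encode(input_str, return_tensors="pt").to(model.device)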
Comments
Hi Nitin, thanks for sharing this. When I follow the above steps, I don't get any output after I give the input string and token length; the code just hangs. I tried this on my MacBook Pro (M2) and also on an AMD-powered machine with 48 GB RAM, a 6-core processor, and an Nvidia GPU.
Can you please advise?
Please try a smaller input_token_length, i.e. 1 (a single token), and check whether it produces an output.
If that step produces an output, then try with 10 tokens and check again.
Also watch Task Manager to see whether there is any fluctuation in RAM and SSD usage.
Please try these and share whether it worked or not.
Can you please share a screenshot?
Also try changing
torch_dtype=torch.bfloat16,
to
torch_dtype=torch.float16,
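In context, the suggested change (a sketch of the same from_pretrained call from Step 4) would look like:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,  # float16 instead of bfloat16
    low_cpu_mem_usage=True,
)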
The article "How to Run Llama 2 on Your Local Computer" by Nithin I. offers a clear and concise guide, simplifying the process for beginners. The step-by-step instructions are incredibly helpful and easy to follow. For further information on tech-related topics like this, visit How to Run Llama 2 Locally