Running multiple LLMs on a single GPU

Shannon Lal

In recent weeks I have been working on projects that rely on GPUs, and I have been exploring ways to optimize their usage. To get a picture of GPU utilization, I started by analyzing memory consumption and usage patterns with the nvidia-smi tool, which gives a breakdown of GPU memory and utilization for each process.
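If you want to script that kind of monitoring rather than reading nvidia-smi output by hand, here is a minimal sketch of polling it from Python. The gpu_stats helper is just for illustration, and it assumes nvidia-smi is on your PATH and that you are targeting GPU 0:

import subprocess

def gpu_stats(gpu_index: int = 0) -> str:
    # Ask nvidia-smi for the current memory usage and utilization of one GPU
    output = subprocess.check_output([
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=memory.used,memory.total,utilization.gpu",
        "--format=csv,noheader",
    ])
    # Returns a single CSV line such as "<used> MiB, <total> MiB, <util> %"
    return output.decode().strip()

if __name__ == "__main__":
    print(gpu_stats())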
One of the areas I have been focusing on is deploying our own LLMs. I noticed that a smaller model, such as a 7B-parameter model loaded in 8-bit, was only consuming about 8 GB of memory on an A100 and utilizing around 20% of the GPU during inference. This observation led me to investigate running multiple LLM processes in parallel on a single GPU to make better use of the hardware.
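You can also sanity-check the footprint from inside the process once the model is loaded. Here is a quick sketch using PyTorch's memory counters; note that they only cover what PyTorch itself has allocated, so the numbers will sit a little below what nvidia-smi reports:

import torch

def print_gpu_memory(device: int = 0):
    # memory_allocated counts tensors PyTorch currently holds on the device;
    # memory_reserved also includes the caching allocator's pool
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    reserved = torch.cuda.memory_reserved(device) / 1024**3
    print(f"allocated: {allocated:.1f} GiB, reserved: {reserved:.1f} GiB")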
To run several models in parallel, I used Python's multiprocessing module with the spawn start method (CUDA cannot be re-initialized in forked child processes, so spawn is required) to launch one worker process per model, each loading its own copy onto the same GPU. The following code demonstrates the approach I used.

import multiprocessing
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_MODELS = 3

def load_model(model_name: str, device: str):
    # Load the model in 8-bit so a 7B model fits in roughly 8 GB of GPU memory
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer

def inference(model, tokenizer, prompt: str):
    # Tokenize the prompt, generate up to 200 new tokens, and decode the output
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=200, do_sample=True, temperature=1.0
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def process_task(task_queue, result_queue):
    # Each worker process loads its own copy of the model onto the same GPU
    model, tokenizer = load_model("tiiuae/falcon-7b-instruct", device="cuda:0")
    while True:
        task = task_queue.get()
        if task is None:  # sentinel value tells the worker to shut down
            break
        prompt = task
        start = time.time()
        summary = inference(model, tokenizer, prompt)
        print(f"Completed inference in {time.time() - start}")
        result_queue.put(summary)

def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    prompt = ""  # The prompt you want to execute

    processes = []
    for _ in range(MAX_MODELS):
        process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
        process.start()
        processes.append(process)

    start = time.time()

    # Queue up 3 prompts for each of the models
    for _ in range(MAX_MODELS * 3):
        task_queue.put(prompt)

    results = []
    for _ in range(MAX_MODELS * 3):
        result = result_queue.get()
        results.append(result)
    end = time.time()
    print(f"Completed {MAX_MODELS * 3} inferences in {end - start}")

    # Tell each worker to shut down and wait for the processes to exit
    for _ in range(MAX_MODELS):
        task_queue.put(None)
    for process in processes:
        process.join()

if __name__ == "__main__":
    # CUDA cannot be re-initialized in forked processes, so use spawn
    multiprocessing.set_start_method("spawn")
    main()


The following is a quick summary of some of the tests that I ran.

GPU              # of LLMs   GPU Memory   GPU Usage   Average Inference Time
A100 with 40GB   1           8 GB         20%         12.8 seconds
A100 with 40GB   2           16 GB        95%         16 seconds
A100 with 40GB   3           32 GB        100%        23.2 seconds

Running multiple LLM instances on a single GPU can significantly reduce costs and increase availability by using capacity that would otherwise sit idle. The trade-off is per-request latency: average inference time rose from 12.8 seconds with one model to 23.2 seconds with three, although overall throughput still improved because requests were served in parallel. If you have other ways of optimizing GPU usage, or questions about how this works, feel free to reach out.
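To put those numbers in context, here is a rough back-of-the-envelope calculation of aggregate throughput from the table above, assuming each model serves one request at a time:

# Rough throughput from the table: each configuration serves num_models
# requests concurrently, each taking the measured average inference time
for num_models, avg_seconds in [(1, 12.8), (2, 16.0), (3, 23.2)]:
    throughput = num_models / avg_seconds
    print(f"{num_models} model(s): {throughput:.3f} requests/second")

# 1 model(s): 0.078 requests/second
# 2 model(s): 0.125 requests/second
# 3 model(s): 0.129 requests/second

Each additional model raises the aggregate throughput of the GPU even though individual requests get slower, which is where the cost savings come from.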

Thanks
