Stephan Janssen

Posted on Oct 21

LLM Inference using 100% Modern Java ☕️🔥

#java #llm #llama3

In the rapidly evolving world of (Gen)AI, Java developers now have powerful new (LLM Inference) tools at their disposal: Llama3.java and JLama.

These projects bring the capabilities of large language models (LLMs) to the Java ecosystem, giving developers a way to integrate advanced language processing into their applications.

Here's an example of Llama3.java providing inference for the DevoxxGenie IDEA plugin.

The JLama Project

JLama (a 100% Java inference engine) is developed by Jake Luciani and supports a wide range of LLMs:

Gemma & Gemma 2 Models
Llama & Llama2 & Llama3 Models
Mistral & Mixtral Models
Qwen2 Models
GPT-2 Models
BERT Models
BPE Tokenizers
WordPiece Tokenizers

Here's his Devoxx Belgium 2024 presentation with more information and demos.

From a features perspective this is the most advanced Java implementation currently available. It even supports sharding an LLM at the layer and attention-head level 🤩
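To make layer-level sharding a bit more concrete, here is a minimal sketch of how a model's layers could be divided over a number of workers. This is not JLama's actual code: the class and method names are made up for illustration, and a real engine also has to move activations between workers. The same arithmetic could be applied to attention heads instead of layers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not taken from JLama.
class LayerShardingSketch {

    // Returns, for each worker, the half-open range [start, end) of layer indices it would own.
    // Layers are spread as evenly as possible; the first (layerCount % workerCount) workers get one extra.
    static List<int[]> splitLayers(int layerCount, int workerCount) {
        if (layerCount < 0 || workerCount <= 0) {
            throw new IllegalArgumentException("layerCount must be >= 0 and workerCount > 0");
        }
        List<int[]> ranges = new ArrayList<>();
        int base = layerCount / workerCount;
        int extra = layerCount % workerCount;
        int start = 0;
        for (int w = 0; w < workerCount; w++) {
            int size = base + (w < extra ? 1 : 0);
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: a 32-layer model spread over 3 workers.
        List<int[]> ranges = splitLayers(32, 3);
        for (int w = 0; w < ranges.size(); w++) {
            System.out.println("worker " + w + " -> layers [" + ranges.get(w)[0] + ", " + ranges.get(w)[1] + ")");
        }
    }
}
```

Contiguous ranges keep the number of hand-offs between workers small, because each worker only passes its output to the next one.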

Features include:

Paged Attention
Mixture of Experts
Tool Calling
Generate Embeddings
Classifier Support
Huggingface SafeTensors model and tokenizer format
Support for F32, F16, BF16 types
Support for Q8, Q4 model quantization
Fast GEMM operations
Distributed Inference!

JLama requires Java 20 or later and utilises the new Vector API for faster inference.

You can easily run JLama on your own computer. On Apple Silicon, make sure you have an ARM-based Java SDK.

export JAVA_HOME=/Library/Java/JavaVirtualMachines/liberica-jdk-21.jdk/Contents/Home

Now you can start the inference service by running JLama with the restapi param and the optional --auto-download flag.

jlama restapi tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 --auto-download

This will download the model if you haven't already.

The Llama3.java Project

Llama3.java is also a 100% Java implementation, developed by Alfonso² Peterssen and inspired by Andrej Karpathy's work.

Features include:

Single file, no dependencies
GGUF format parser
Llama 3 tokenizer based on minbpe
Llama 3 inference with Grouped-Query Attention
Support Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tie word embeddings)
Support for Q8_0 and Q4_0 quantizations
Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
Simple CLI with --chat and --instruct modes.
GraalVM's Native Image support (EA builds here)
AOT model pre-loading for instant time-to-first-token
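To give a feel for what a quantized matrix-vector routine has to do, here is a minimal scalar sketch of a Q8_0-style dot product. It is not the Llama3.java implementation: the class and method names are hypothetical, the per-block scale is kept as a plain float (GGUF stores it as a 16-bit float), and the inner loop is ordinary Java where the real code would use the Vector API.

```java
// Hypothetical illustration of the Q8_0 idea: blocks of 32 signed 8-bit weights sharing one scale.
class Q8DotSketch {
    static final int BLOCK_SIZE = 32;

    // Quantizes one block of x starting at offset into q and returns the block scale.
    // scale = max|x| / 127, q[i] = round(x[i] / scale).
    static float quantizeBlock(float[] x, int offset, byte[] q) {
        float maxAbs = 0f;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            maxAbs = Math.max(maxAbs, Math.abs(x[offset + i]));
        }
        float scale = maxAbs / 127f;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            q[offset + i] = (scale == 0f) ? (byte) 0 : (byte) Math.round(x[offset + i] / scale);
        }
        return scale;
    }

    // Dot product of a quantized row with a float vector, processed one block at a time.
    static float dot(byte[] q, float[] scales, float[] v) {
        float total = 0f;
        for (int b = 0; b < scales.length; b++) {
            float blockSum = 0f;
            for (int i = 0; i < BLOCK_SIZE; i++) {
                int idx = b * BLOCK_SIZE + i;
                blockSum += q[idx] * v[idx];
            }
            total += scales[b] * blockSum; // the shared scale is applied once per block
        }
        return total;
    }

    public static void main(String[] args) {
        float[] weights = new float[2 * BLOCK_SIZE];
        float[] input = new float[2 * BLOCK_SIZE];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = (float) Math.sin(i);
            input[i] = (float) Math.cos(i);
        }
        byte[] q = new byte[weights.length];
        float[] scales = new float[weights.length / BLOCK_SIZE];
        for (int b = 0; b < scales.length; b++) {
            scales[b] = quantizeBlock(weights, b * BLOCK_SIZE, q);
        }
        float exact = 0f;
        for (int i = 0; i < weights.length; i++) {
            exact += weights[i] * input[i];
        }
        System.out.println("float dot = " + exact + ", quantized dot = " + dot(q, scales, input));
    }
}
```

Storing one scale per 32 weights is what shrinks the model: each weight needs only one byte plus a small shared overhead, at the cost of a small rounding error in the result.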

Here's the Devoxx Belgium 2024 presentation by Alfonso and Alina.

Llama3.java + (OpenAI) REST API

Llama3.java doesn't have a REST interface, so I decided to contribute that part ❤️

I've added a Spring Boot wrapper around the core Llama3.java library, allowing developers to easily set up and run an OpenAI-compatible REST API for text generation and chat completions. The goal is to use this as the 100% Java inference engine for the DevoxxGenie IDEA plugin, enabling local inference with a complete Java solution.

Code is available on GitHub

For the time being I've copied the Llama3.java source code into my project, but ideally it should be pulled in as a Maven dependency.

Key Features

OpenAI-compatible API: The project implements an API that mimics OpenAI's chat completions endpoint, making it easy to integrate with existing applications.
Support for GGUF Models: Llama3.java can work with GGUF (GPT-Generated Unified Format) models, which are optimised for efficiency and performance.
Vector API Utilization: The project leverages Java's incubator Vector API for improved performance on matrix operations.
Cross-Platform Compatibility: While optimized for Apple Silicon (M1/M2/M3), the project can run on various platforms with the appropriate Java SDK.
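Because the API mimics OpenAI's chat completions endpoint, any HTTP client can talk to it. Below is a minimal sketch using the JDK's built-in java.net.http classes. The base URL, port, model name and the /v1/chat/completions path are assumptions for illustration (the path is simply the one OpenAI itself uses), so check the project's README for the real values. The class and method names are hypothetical as well.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch only: URL, port, path and model name are placeholders, not values taken from the project.
class ChatCompletionRequestSketch {

    // Escapes the characters that must be escaped inside a JSON string.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') sb.append("\\\"");
            else if (c == '\\') sb.append("\\\\");
            else if (c == '\n') sb.append("\\n");
            else if (c == '\r') sb.append("\\r");
            else if (c == '\t') sb.append("\\t");
            else if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.toString();
    }

    // Builds a minimal body in the shape of an OpenAI chat completions request.
    static String chatBody(String model, String userMessage) {
        return "{\"model\":\"" + jsonEscape(model) + "\","
             + "\"messages\":[{\"role\":\"user\",\"content\":\"" + jsonEscape(userMessage) + "\"}]}";
    }

    // Wraps the body in a POST request; nothing is sent until an HttpClient executes it.
    static HttpRequest chatRequest(String baseUrl, String model, String userMessage) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chatBody(model, userMessage), StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = chatRequest("http://localhost:8080", "llama3", "Say hello from Java");
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running server:
        // java.net.http.HttpClient.newHttpClient()
        //         .send(request, java.net.http.HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The JSON is built by hand only to keep the example free of dependencies. In a real application a JSON library such as Jackson would be the safer choice.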

Getting Started

To get started with Llama3.java, follow these steps:

Setup: Ensure you have a compatible Java SDK installed. For Apple Silicon users, an ARM-compliant SDK is recommended.
Build: Use Maven to build the project with "mvn clean package".
Download a Model: Obtain a GGUF model from the Hugging Face model hub and place it in the 'models' directory.
Configure: Update the application.properties file with your model details and server settings.
Run: Start the Spring Boot application using the provided Java command.
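After the download step it can be handy to verify that the file really is a GGUF model before pointing the application at it. The sketch below reads the fixed-size start of the file. It assumes the GGUF version 2 or later header layout in little-endian byte order (the 4-byte magic "GGUF", a 32-bit version, then 64-bit tensor and metadata key/value counts). The class name is hypothetical and this is not the parser used by Llama3.java.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: checks the GGUF magic and reads the counts from the file header.
class GgufHeaderCheck {
    static final byte[] MAGIC = {'G', 'G', 'U', 'F'};
    static final int HEADER_BYTES = 24; // 4 (magic) + 4 (version) + 8 (tensors) + 8 (metadata entries)

    static final class Header {
        final int version;
        final long tensorCount;
        final long metadataKvCount;

        Header(int version, long tensorCount, long metadataKvCount) {
            this.version = version;
            this.tensorCount = tensorCount;
            this.metadataKvCount = metadataKvCount;
        }
    }

    // Parses the header from a buffer positioned at the start of the file contents.
    static Header parse(ByteBuffer source) {
        ByteBuffer buf = source.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        if (buf.remaining() < HEADER_BYTES) {
            throw new IllegalArgumentException("Too short to contain a GGUF header");
        }
        for (byte m : MAGIC) {
            if (buf.get() != m) {
                throw new IllegalArgumentException("Missing GGUF magic bytes");
            }
        }
        int version = buf.getInt();
        if (version < 2) {
            throw new IllegalArgumentException("This sketch assumes the GGUF v2+ header layout, got v" + version);
        }
        long tensorCount = buf.getLong();
        long metadataKvCount = buf.getLong();
        return new Header(version, tensorCount, metadataKvCount);
    }

    // Reads just the first HEADER_BYTES of the file and parses them.
    static Header read(Path file) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (buf.hasRemaining() && channel.read(buf) != -1) {
                // keep reading until the header is complete or the file ends
            }
        }
        buf.flip();
        return parse(buf);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java GgufHeaderCheck <path-to-model.gguf>");
            return;
        }
        Header h = read(Path.of(args[0]));
        System.out.println("GGUF v" + h.version + ", tensors=" + h.tensorCount
                + ", metadata entries=" + h.metadataKvCount);
    }
}
```

Only the first 24 bytes are read, so the check is instant even for multi-gigabyte model files.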

DevoxxGenie

When the Llama3.java Spring Boot application is running, you can use DevoxxGenie for local inference 🤩

Future Directions

The next step is to move the MatMul bottleneck to the GPU using TornadoVM. Also once GraalVM supports

Externalise Llama3.java as a Maven dependency (if/when available)
Add GPU support using TornadoVM
GraalVM native versions 🍏
LLM sharding capabilities
Support for different models: BitNets & Ternary Models

Conclusion

Llama3.java and JLama represent a significant step forward in bringing large language model capabilities to the Java ecosystem. By providing an easy-to-use, OpenAI-compatible API and leveraging Java's latest performance features, these projects open up new possibilities for AI-driven applications in Java.

Whether you're building a chatbot, a content generation tool, or any application that could benefit from advanced language processing, Llama3.java and JLama offer a promising solution.

As these projects continue to evolve and optimise, they are well worth keeping an eye on for Java developers interested in the cutting edge of AI technology.

Exciting times for Java Developers! ☕️🔥❤️

~ Stephan Janssen
