DEV Community

Stephan Janssen
Stephan Janssen

Posted on

LLM Inference using 100% Modern Java β˜•οΈπŸ”₯

In the rapidly evolving world of (Gen)AI, Java developers now have powerful new (LLM Inference) tools at their disposal: Llama3.java and JLama.

These projects brings the capabilities of large language models (LLMs) to the Java ecosystem, offering an exciting opportunity for developers to integrate advanced language processing into their applications.

Here's an example of Llama3.java providing inference for the DevoxxGenie IDEA plugin.

The JLama Project

JLama (a 100% Java inference engine) is developed by Jake Luciani and supports a whole range of LLM's :

  • Gemma & Gemma 2 Models
  • Llama & Llama2 & Llama3 Models
  • Mistral & Mixtral Models
  • Qwen2 Models
  • GPT-2 Models
  • BERT Models
  • BPE Tokenizers
  • WordPiece Tokenizers

Here's his Devoxx Belgium 2024 presentation with more information and demo's.

From a features perspective this is the most advanced Java implementation currently available. He even supports LLM sharding on layers and head attention level 🀩

Features includes:

  • Paged Attention
  • Mixture of Experts
  • Tool Calling
  • Generate Embeddings
  • Classifier Support
  • Huggingface SafeTensors model and tokenizer format
  • Support for F32, F16, BF16 types
  • Support for Q8, Q4 model quantization
  • Fast GEMM operations
  • Distributed Inference!

JLama requires Java 20 or later and utilises the new Vector API for faster inference.

You can easily run JLama on your computer, on Apple Silicon make sure you have an ARM based SDK.

export JAVA_HOME=/Library/Java/JavaVirtualMachines/liberica-jdk-21.jdk/Contents/Home
Enter fullscreen mode Exit fullscreen mode

Now you can start JLama with the restapi param and the optional auto-download to start the inference service.

jlama restapi tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 --auto-download
Enter fullscreen mode Exit fullscreen mode

This will download the model if you haven't already.

Experimental JLama and DevoxxGenie integration

Alina and Alfonso at Devoxx Belgium 2024

The JLama3.java Project

The Llama3.java is also a 100% Java implementation developed by AlfonsoΒ² Peterssen and inspired by Andrej Karpathy.

Features includes:

  • Single file, no dependencies
  • GGUF format parser
  • Llama 3 tokenizer based on minbpe
  • Llama 3 inference with Grouped-Query Attention
  • Support Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tie word embeddings)
  • Support for Q8_0 and Q4_0 quantizations
  • Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
  • Simple CLI with --chat and --instruct modes.
  • GraalVM's Native Image support (EA builds here)
  • AOT model pre-loading for instant time-to-first-token

Here's the Devoxx Belgium 2024 presentation by Alfonso and Alina.

Llama3.java + (OpenAI) REST API

The Llama3.java doesn't have any REST interface so I decided to contribute that part ❀️

I've added a Spring Boot wrapper around the core Llama3.java library, allowing developers to easily set up and run an OpenAI-compatible REST API for text generation and chat completions. The goal is to use this as the 100% Java inference engine for the IDEA DevoxxGenie plugin. Allowing local inference using a complete Java solution.

Code is available on GitHub

For the time being I've copied the Llama3.java source code into my project but ideally this should be integrated as a Maven dependency.

Key Features

  1. OpenAI-compatible API: The project implements an API that mimics OpenAI's chat completions endpoint, making it easy to integrate with existing applications.
  2. Support for GGUF Models: Llama3.java can work with GGUF (GPT-Generated Unified Format) models, which are optimised for efficiency and performance.
  3. Vector API Utilization: The project leverages Java's incubator Vector API for improved performance on matrix operations.
  4. Cross-Platform Compatibility: While optimized for Apple Silicon (M1/M2/M3), the project can run on various platforms with the appropriate Java SDK.

Getting Started

To get started with Llama3.java, follow these steps:

  1. Setup: Ensure you have a compatible Java SDK installed. For Apple Silicon users, an ARM-compliant SDK is recommended.
  2. Build: Use Maven to build the project with "mvn clean package".
  3. Download a Model: Obtain a GGUF model from the Hugging Face model hub and place it in the 'models' directory.
  4. Configure: Update the application.properties file with your model details and server settings.
  5. Run: Start the Spring Boot application using the provided Java command.

DevoxxGenie

When the Llama3.java Spring Boot application is running, you can use DevoxxGenie for local inference 🀩

DevoxxGenie

Future Directions

The next step is to move the MatMul bottleneck to the GPU using TornadoVM. Also once GraalVM supports

  • Externalise Llama3.java as a maven service dependency (if/when available)
  • Add GPU support using TornadoVM
  • GraalVM native versions 🍏
  • LLM sharding capabilities
  • Support for different models: BitNets & Ternary Models

Conclusion

Llama3.java and JLama represents a significant step forward in bringing large language model capabilities to the Java ecosystem. By providing an easy-to-use, OpenAI-compatible API and leveraging Java's latest performance features, this project opens up new possibilities for AI-driven applications in Java.

Whether you're building a chatbot, a content generation tool, or any application that could benefit from advanced language processing, Llama3.java and JLama offers a promising solution.

As these projects continues to evolve and optimise, it's well worth keeping an eye on for Java developers interested in the cutting edge of AI technology.

Exciting times for Java Developers! β˜•οΈπŸ”₯❀️

~ Stephan Janssen

Top comments (0)