llama.cpp

License: MIT

Roadmap / Project status / Manifesto / ggml

Inference of Meta's LLaMA model (and others) in pure C/C++

Important: new llama.cpp package location: ggml-org/llama.cpp

Update your container URLs to: ghcr.io/ggml-org/llama.cpp

More info: https://github.com/ggml-org/llama.cpp/discussions/11801

Recent API changes

Hot topics


Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

The llama.cpp project is the main playground for developing new features for the ggml library.

Models

Typically, finetunes of the base models below are supported as well.

Instructions for adding support for new models: HOWTO-add-model.md

Text-only

Multimodal

Bindings
UIs

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Tools
Infrastructure
Games

Supported backends

Backend  Target devices
Metal    Apple Silicon
BLAS     All
BLIS     All
SYCL     Intel and Nvidia GPU
MUSA     Moore Threads MTT GPU
CUDA     Nvidia GPU
HIP      AMD GPU
Vulkan   GPU
CANN     Ascend NPU
OpenCL   Adreno GPU

Building the project

The main product of this project is the llama library. Its C-style interface can be found in include/llama.h. The project also includes many example programs and tools using the llama library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
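One common route is building from source with CMake. A minimal sketch of the standard flow (see docs/build.md in the repository for backend-specific options; the -j value below is only an example):

$ # configure and build the library, tools and examples
$ cmake -B build
$ cmake --build build --config Release -j 8
$ # the resulting binaries (llama-cli, llama-server, ...) end up in build/bin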

Obtaining and quantizing models

The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:

You can either manually download a GGUF file or use any llama.cpp-compatible model from Hugging Face directly via the CLI argument: -hf <user>/<model>[:quant]
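For example, a hedged sketch (the repository name below is only illustrative; substitute any llama.cpp-compatible GGUF repository, optionally with a quantization tag):

$ # download (and cache) the model from Hugging Face, then start an interactive session
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF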

After downloading a model, use the CLI tools to run it locally - see below.

llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.
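A hedged sketch of converting a local Hugging Face model directory with the bundled script (paths and the --outtype value are placeholders; check the script's --help for current options):

$ # convert a Hugging Face checkpoint to GGUF (F16 weights in this example)
$ python convert_hf_to_gguf.py path/to/hf-model --outfile model-f16.gguf --outtype f16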

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:

To learn more about model quantization, read this documentation.
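Locally, quantization can be done with the llama-quantize tool. A hedged sketch (file names are placeholders; Q4_K_M is just one of the available quantization types):

$ # quantize an F16 GGUF down to Q4_K_M
$ llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M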

llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.
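A hedged usage sketch (model path, prompt and token count are placeholders; run llama-cli --help for the full set of options):

$ # generate up to 64 tokens from a prompt with a local GGUF model
$ llama-cli -m model.gguf -p "Explain quantization in one sentence." -n 64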

llama-server

A lightweight, OpenAI-API-compatible HTTP server for serving LLMs.
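A hedged sketch of serving a local model and querying its OpenAI-compatible chat endpoint (model path and port are placeholders):

$ # start the server on port 8080
$ llama-server -m model.gguf --port 8080
$ # query the OpenAI-compatible chat completions endpoint
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello!"}]}'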

llama-perplexity

A tool for measuring the perplexity ^1 (and other quality metrics) of a model over a given text.
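A hedged usage sketch (the text file is a placeholder; a WikiText-2 style test set is commonly used for this):

$ # compute the perplexity of the model over a text file
$ llama-perplexity -m model.gguf -f wiki.test.raw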

llama-bench

Benchmarks the inference performance for various parameters.
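A hedged usage sketch (the prompt and generation lengths below are illustrative; llama-bench accepts comma-separated lists to sweep multiple values):

$ # benchmark prompt processing (512 tokens) and text generation (128 tokens)
$ llama-bench -m model.gguf -p 512 -n 128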

llama-run

A comprehensive example for running llama.cpp models. Useful for inference. Used with RamaLama ^3.
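A hedged invocation sketch (llama-run accepts a model reference such as a local file plus an optional prompt; exact URI schemes and options may vary by version):

$ # run a prompt against a local GGUF file
$ llama-run file://model.gguf "Explain GGUF in one sentence"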

llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.
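A hedged usage sketch (based on the example's usage pattern of a model flag followed by an optional prompt; check the example source for the exact arguments in your version):

$ # minimal generation loop: model, number of tokens to predict, prompt
$ llama-simple -m model.gguf -n 32 "Hello my name is"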

Contributing

Other documentation

Development documentation

Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

Completions

Command-line completion is available for some environments.

Bash Completion

$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash

Optionally, this can be added to your .bashrc or .bash_profile so it is loaded automatically. For example:

$ echo "source ~/.llama-completion.bash" >> ~/.bashrc

References
