This is a port of Karpathy's llama2.c to idiomatic Common Lisp.
Why? Two reasons:
- Because Common Lisp is a fantastic language for experimentation, and this makes it easy to explore LLM techniques
- To serve as a reference implementation for the Common Lisp community
More than anything else, it's the ease of AI experimentation: being able to mix in expert systems, graphs, and non-deterministic programming easily.
We assume you have a working emacs, lisp and slime/sly setup. Most of the systems LLAMA requires are in Quicklisp; however, binary-types is not, and you'll need to download it from its repository. Put it in a location accessible to Quicklisp, like `~/common-lisp`.
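Once binary-types is in place, a quick way to confirm that Quicklisp/ASDF can see it is to ask for the system from the REPL; a minimal sketch:

```lisp
;; With binary-types under ~/common-lisp/, ASDF's default source registry
;; should pick it up; FIND-SYSTEM signals an error if the system is not visible.
(asdf:find-system "binary-types")
(ql:quickload "binary-types")   ; loads it, along with any dependencies
```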
- Get the models, pretrained on the TinyStories dataset, from Karpathy's repo (original instructions):

      wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
      wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
      wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin

- Load the file `run.lisp` into an emacs buffer
- Load slime with `M-x slime`
- Load LLA with `(ql:quickload :lla)` (optional - requires setup)
- Load LLAMA with `(ql:quickload :llama)` from the REPL
- Move into the package with `(in-package :llama)`
- Initialise the system with `(init #P"stories15M.bin" #P"tokenizer.bin" 32000)` (adjust paths if necessary)
- Generate a story with `(generate *model* *tokenizer*)`
You can experiment with temperature, prompts and various samplers; see the code for all the options. The port is also tested and working with llama-2-7B, but you probably don't want to try anything larger unless you implement the CUDA kernels.
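Putting the steps together, a typical REPL session looks roughly like the sketch below; the commented-out call at the end is hypothetical, since the keyword names for temperature and prompt are assumptions for illustration (check the `generate` definition for the real argument list):

```lisp
(ql:quickload :lla)     ; optional: load LLA first for BLAS-backed matrix multiplication
(ql:quickload :llama)
(in-package :llama)

;; Load the weights and tokenizer (adjust the paths to wherever you saved them)
(init #P"stories15M.bin" #P"tokenizer.bin" 32000)

;; Generate a story with the default settings
(generate *model* *tokenizer*)

;; Hypothetical: the keyword names below are illustrative only; see the
;; definition of GENERATE for the actual sampling and prompt options.
;; (generate *model* *tokenizer* :temperature 0.8 :prompt "Once upon a time")
```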
My machine runs a 3.5 GHz 6-core Intel i7 5930 (256K/15MB cache) with 64GB of DDR4 RAM; with the stories15M model I get about 2.5 tok/sec with CCL and 3.7 tok/sec with SBCL.
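To compare numbers on your own machine, a rough way is to wrap a call in `time` and divide the tokens produced by the elapsed seconds (a sketch; how tokens are counted depends on your settings):

```lisp
;; TIME reports real/run time and consing for the whole generation;
;; tokens produced / elapsed seconds gives tok/sec.
(time (generate *model* *tokenizer*))
```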
If you use BLAS for matrix multiplication, you'll get about a 10X speed-up. Make sure that LLA is loaded before you load LLAMA; if it is, LLAMA will automatically use the BLAS library. Using LLA, the numbers are 14.4 tok/sec for CCL and 34.4 tok/sec for SBCL.
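In other words, the load order matters; starting your session like this is enough for LLAMA to pick up BLAS via LLA:

```lisp
(ql:quickload :lla)    ; load LLA first so LLAMA uses the BLAS library
(ql:quickload :llama)  ; loaded afterwards, matrix multiplication goes through BLAS
```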
Interestingly, the parallel version (see the `forward` function) is slower on the stories15M model, likely because the parallelisation overhead outweighs the benefits at this size. I got the best results with an lparallel kernel equal to the number of physical cores on the machine.
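To reproduce that setup, create the lparallel kernel before calling `generate`; a sketch, assuming six physical cores as on the machine above and that the parallel `forward` path uses the standard `lparallel:*kernel*`:

```lisp
(ql:quickload :lparallel)
;; One worker per physical core gave the best results here; adjust for your machine.
(setf lparallel:*kernel* (lparallel:make-kernel 6))
```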
For instructions on converting to/from the .bin format, training, and other background, see the original repo.