This is a port of Karpathy's llama2.c to idiomatic Common Lisp.
Why? Two reasons:
- Because Common Lisp is a fantastic language for experimentation, and this makes it easy to explore LLM techniques
- To serve as a reference implementation for the Common Lisp community
More than anything else, it's the ease of AI experimentation: being able to mix in expert systems, graphs, and non-deterministic programming easily.
We assume you have a working emacs, lisp and slime/sly setup. Most of the systems LLAMA requires are in Quicklisp; however, binary-types is not, and you'll need to download it from its repository. Put it in a location accessible to Quicklisp, like `~/common-lisp`.
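Once binary-types is in place, a quick way to confirm that Quicklisp/ASDF can see it is to ask for the system from the REPL; a minimal sketch:

```lisp
;; With binary-types under ~/common-lisp/, ASDF's default source registry
;; should pick it up; FIND-SYSTEM signals an error if the system is not visible.
(asdf:find-system "binary-types")
(ql:quickload "binary-types")   ; loads it, along with any dependencies
```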
- Get the models, pretrained on the TinyStories dataset, from Karpathy's repo (original instructions):

      wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
      wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
      wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin

- Load the file `run.lisp` into an emacs buffer
- Load slime with `M-x slime`
- Load LLA with `(ql:quickload :lla)` (optional - requires setup)
- Load LLAMA with `(ql:quickload :llama)` from the REPL
- Move into the package with `(in-package :llama)`
- Initialise the system with `(init #P"stories15M.bin" #P"tokenizer.bin" 32000)` (adjust paths if necessary)
- Generate a story with `(generate *model* *tokenizer*)`
You can experiment with temperature, prompts and various samplers; see the code for all the options. The port is also tested and working with llama-2-7B, but you probably don't want to try anything larger unless you implement the CUDA kernels.
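Putting the steps together, a typical REPL session looks roughly like the sketch below; the commented-out call at the end is hypothetical, since the keyword names for temperature and prompt are assumptions for illustration (check the `generate` definition for the real argument list):

```lisp
(ql:quickload :lla)     ; optional: load LLA first for BLAS-backed matrix multiplication
(ql:quickload :llama)
(in-package :llama)

;; Load the weights and tokenizer (adjust the paths to wherever you saved them)
(init #P"stories15M.bin" #P"tokenizer.bin" 32000)

;; Generate a story with the default settings
(generate *model* *tokenizer*)

;; Hypothetical: the keyword names below are illustrative only; see the
;; definition of GENERATE for the actual sampling and prompt options.
;; (generate *model* *tokenizer* :temperature 0.8 :prompt "Once upon a time")
```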
My machine runs a 3.5 GHz 6-core Intel i7 5930 (256K/15MB cache) with 64GB of DDR4 RAM; with the stories15M model I get about 2.5 tok/sec with CCL and 3.7 tok/sec with SBCL.
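To compare numbers on your own machine, a rough way is to wrap a call in `time` and divide the tokens produced by the elapsed seconds (a sketch; how tokens are counted depends on your settings):

```lisp
;; TIME reports real/run time and consing for the whole generation;
;; tokens produced / elapsed seconds gives tok/sec.
(time (generate *model* *tokenizer*))
```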
If you use BLAS for matrix multiplication, you'll get about a 10X speed-up. Make sure that LLA is loaded before you load LLAMA; if it is, LLAMA will automatically use the BLAS library. Using LLA, the numbers are 14.4 tok/sec for CCL and 34.4 tok/sec for SBCL.
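In other words, the load order matters; starting your session like this is enough for LLAMA to pick up BLAS via LLA:

```lisp
(ql:quickload :lla)    ; load LLA first so LLAMA uses the BLAS library
(ql:quickload :llama)  ; loaded afterwards, matrix multiplication goes through BLAS
```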
Interestingly, the parallel version (see the `forward` function) is slower on the stories15M model, likely because the parallelisation overhead outweighs the benefits at this size. I got the best results with an lparallel kernel equal to the number of physical cores on the machine.
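To reproduce that setup, create the lparallel kernel before calling `generate`; a sketch, assuming six physical cores as on the machine above and that the parallel `forward` path uses the standard `lparallel:*kernel*`:

```lisp
(ql:quickload :lparallel)
;; One worker per physical core gave the best results here; adjust for your machine.
(setf lparallel:*kernel* (lparallel:make-kernel 6))
```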
For instructions on converting to/from the .bin format, training, and other background, see the original repo.