Add LLaDA 8b Diffusion model #14771

am17an · 2025-07-19T09:50:27Z

Continuing on #14644, this PR adds another diffusion model, https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct, which has different semantics from the dream-7b model and overall seems to perform better.

There are very few similarities in how the two models generate tokens, so for now I've created two separate examples: llama-diffusion-dream-cli (for the earlier model) and llama-diffusion-llada-cli (for the new LLaDA model). I've added a README as well.

I've uploaded a GGUF.
Edit on 30-07-2025: Re-uploaded another GGUF with a config change

Example command
./build/bin/llama-diffusion-cli -m llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0 -ub 128 --diffusion-visual --diffusion-block-length 32

Also, I would like to add this to the server, but I'm not sure what API would be acceptable, so I'm hoping to have a discussion on that as well.

ggerganov · 2025-07-21T05:05:23Z

I would like to avoid adding a second diffusion example - we are increasing the maintenance effort for no significant benefit. The diffusion architecture is not yet well established.

We can think about extending the llama_sampler functionality to support these use cases and since it is already modular it would make more sense to implement the sampling logic there. Ideally the diffusion CLI example would be just one for all diffusion models, with different samplers attached.

am17an · 2025-07-21T05:37:17Z

> I would like to avoid adding a second diffusion example - we are increasing the maintenance efforts for not significant benefit. The diffusion architecture is not yet well established.
>
> We can think about extending the llama_sampler functionality to support these use cases and since it is already modular it would make more sense to implement the sampling logic there. Ideally the diffusion CLI example would be just one for all diffusion models, with different samplers attached.

Yeah, agreed. I initially wrote them as one example. However, passing arguments via the CLI for two separate sets of sampling parameters/algorithms was quite confusing to me and would be even more so for the end user, so for the sake of clarity I wrote them separately.
diffusion_generate_dream and diffusion_generate_llada are two different functions with the same outline, decode => sample => unmask, so there is an abstraction to be made. The only thing to clarify is how we pass separate sets of parameters to the example without overloading the same flag (e.g. --diffusion-algorithm being supported in dream but not llada, and vice versa). llama_sampler could also be used, but I don't see how it would solve this particular problem.

am17an · 2025-07-23T03:01:37Z

@ggerganov would having them in the same example and having extra CLI args for models be acceptable?

ggerganov · 2025-07-25T11:50:08Z

Yes, merging the examples into a single example would be better.

am17an · 2025-07-26T07:14:05Z

> Yes, merging the examples into a single example would be better.

Made everything into a single example, please have another look when you have the time

ggerganov

I think the example can be improved by not branching between "llada" and "dream" and instead having common logic for any diffusion model. This would make it much easier to scale to more diffusion models in the future. Otherwise, the way you've implemented it now, you have to add new structs, sampling types, generation functions, etc. for each new architecture, and this seems a bit unnecessary.

am17an · 2025-07-28T06:43:16Z

@ggerganov you're right, we can combine the sampling methods. I was under the assumption that the only sampling methods that would work are their respective paper implementations. I have since tried various sampling methods on both models and they seem to produce coherent outputs, though I did not do any deep correctness checks.

Refactored to have a concept called schedule which is either timestep based (like dream) or block-based (like LLaDA). Both work for both models. Also refactored the sampling methods to be the same across the models.
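To make the block-based idea more concrete, here is a minimal sketch of one thing such a schedule might compute: how many masked tokens to reveal at each denoising step of a block. This is not the PR's code. The helper name `unmask_counts`, its parameters, and the even-split rule are assumptions made for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): split the masked tokens of one block
// as evenly as possible across the denoising steps assigned to that block.
static std::vector<int32_t> unmask_counts(int32_t n_masked, int32_t n_steps) {
    std::vector<int32_t> counts;
    if (n_masked < 0 || n_steps <= 0) {
        return counts; // invalid input: return an empty schedule
    }
    const int32_t base      = n_masked / n_steps;
    const int32_t remainder = n_masked % n_steps;
    counts.reserve(n_steps);
    for (int32_t step = 0; step < n_steps; ++step) {
        // the first `remainder` steps take one extra token so the total adds up
        counts.push_back(base + (step < remainder ? 1 : 0));
    }
    return counts;
}
```

Under these assumptions, `unmask_counts(32, 12)` gives eight steps that reveal 3 tokens followed by four steps that reveal 2, for a total of 32.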

The issues that do remain, however:

1. Shifted logits - logits in Dream are shifted by -1 after a pp path, which is not the case in LLaDA. Ideally this should be a part of the GGUF, but I'm not sure.
2. The BOS token in LLaDA - add_bos_token is false in tokenizer_config.json, I think because the chat_template contains the bos_token.

"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",

However, this code removes this BOS

llama.cpp/common/chat.cpp

Lines 746 to 755 in c35f9ea

    
    minja::chat_template_options tmpl_opts;
    // To avoid double BOS / EOS tokens, we're manually removing begining / trailing tokens
    // instead of using `chat_template_options.use_bos_token = false`, since these tokens
    // may be needed inside the template / between messages too.
    auto result = tmpl.apply(tmpl_inputs, tmpl_opts);
    if (string_starts_with(result, tmpl.bos_token())) {
        result = result.substr(tmpl.bos_token().size());
    }
    if (string_ends_with(result, tmpl.eos_token())) {
        result = result.substr(0, result.size() - tmpl.eos_token().size());

I'm not familiar with the chat-template code, and I was not able to work around this without adding a BOS token.
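The interaction described above can be reproduced in isolation. The sketch below is a simplified stand-in, not the llama.cpp implementation: `starts_with`, `strip_leading_bos`, `effective_prompt`, and the `<BOS>` placeholder string are all hypothetical. It only illustrates that once the leading BOS is stripped from the rendered template, the prompt has no BOS unless the tokenizer is told to add one.

```cpp
#include <string>

// Hypothetical stand-in for the prefix check used in the quoted snippet.
static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.size() >= prefix.size() && s.compare(0, prefix.size(), prefix) == 0;
}

// Mirrors the quoted logic: drop one leading BOS from the rendered template.
static std::string strip_leading_bos(std::string rendered, const std::string & bos) {
    if (!bos.empty() && starts_with(rendered, bos)) {
        rendered = rendered.substr(bos.size());
    }
    return rendered;
}

// Text the model would effectively see: BOS is present again only if it is
// re-added at tokenization time (add_bos == true).
static std::string effective_prompt(const std::string & rendered,
                                    const std::string & bos,
                                    bool add_bos) {
    const std::string stripped = strip_leading_bos(rendered, bos);
    return add_bos ? bos + stripped : stripped;
}
```

With `add_bos` set to false, as in the LLaDA tokenizer config, a template output of `<BOS><|start_header_id|>user` becomes `<|start_header_id|>user`, so the BOS that the template inserted is lost.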

CISC · 2025-07-28T07:32:05Z

> 2. The BOS token in LLaDA - `add_bos_token` is false in `tokenizer_config.json`, I think because the chat_template contains the `bos_token`.

No, add_bos_token only applies to untemplated generation; it seems like a mistake. ~~It was removed in LLaDA 1.5 chat template BTW~~.

Edit: Nvm, I'm blind, it's still there.

> However, this code removes this BOS
>
> llama.cpp/common/chat.cpp
>
> Lines 746 to 755 in c35f9ea
>
>     minja::chat_template_options tmpl_opts;
>     // To avoid double BOS / EOS tokens, we're manually removing begining / trailing tokens
>     // instead of using `chat_template_options.use_bos_token = false`, since these tokens
>     // may be needed inside the template / between messages too.
>     auto result = tmpl.apply(tmpl_inputs, tmpl_opts);
>     if (string_starts_with(result, tmpl.bos_token())) {
>         result = result.substr(tmpl.bos_token().size());
>     }
>     if (string_ends_with(result, tmpl.eos_token())) {
>         result = result.substr(0, result.size() - tmpl.eos_token().size());

This probably needs to be improved.

> I'm not familiar with chat-template code and I was not able to work around this without adding a bos token

Setting add_bos_token to True on conversion should fix that, ~~but only applies to pre-1.5 models~~.

am17an · 2025-07-28T07:49:34Z

> Setting add_bos_token to True on conversion should fix that, but only applies to pre-1.5 models.

Yep, this fixes it for the regenerated GGUF. It might still be a problem downstream if people use the HF repo to create quants (unless this is patched in the HF repo).

CISC · 2025-07-31T08:58:29Z

I just tested LLaDA-1.5 btw, works great! :)

* Add support for Llada-8b: diffusion model
* Add README
* Fix README and convert_hf_to_gguf
* convert_hf_to_gguf.py: address review comments
* Make everything in a single example
* Remove model-specific sampling
* Remove unused argmax
* Remove braced initializers, improve README.md a bit
* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps
* Remove adding the mask token
* Move add_add_bos_token to set_vocab
* use add_bool in gguf_writer.py

wtarreau · 2025-08-03T04:20:11Z

I just tested today and got a crash. As I understand this is still in early development, I'm not going to bother you with another issue, as the problem might already be known or expected. In short I've built llama.cpp at commit 5c0eb5e ("opencl: fix adreno compiler detection logic (#15029)") like I always do:

$ cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_OPENMP=OFF  && cmake --build build --config Release -j $(nproc)

and tried the exact command above with the model linked above:

 ./build/bin/llama-diffusion-cli -m /mnt/models/llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0  -ub 128 --diffusion-visual

and got that one:

...
llama_context:        CPU  output buffer size =     0.48 MiB
/home/willy/ai/llama.cpp/examples/diffusion/diffusion-cli.cpp:616: GGML_ASSERT((params.diffusion.eps == 0) ^ (params.diffusion.block_length == 0)) failed
[New LWP 7410]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000fa000fe66800 in __GI___wait4 (pid=<optimized out>, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x0000fa000fe66800 in __GI___wait4 (pid=<optimized out>, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000bbeeac7290b4 in ggml_print_backtrace ()
#2  0x0000bbeeac729244 in ggml_abort ()
#3  0x0000bbeeac480970 in main ()
[Inferior 1 (process 7409) detached]
Aborted (core dumped)

It wasn't built with debug symbols so the backtrace is not more precise. If this is news to you and you'd prefer getting an issue, let me know, and I'll do it. But logically it is trivially reproducible ;-)

am17an · 2025-08-03T05:14:56Z

@wtarreau you need to supply either --diffusion-block-length or --diffusion-eps; I recommend --diffusion-block-length 32 for LLaDA models. A default value might be useful here, which I'll look into.

wtarreau · 2025-08-03T05:29:12Z

Confirmed indeed, I knew it was not necessary to create an issue ;-) A default value would indeed be useful. In general, a program shouldn't crash due to missing command-line arguments; at worst it should complain that they are missing.
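One way to get that behavior is to check the two options before generation starts and report a message instead of hitting the assertion. The sketch below only mirrors the condition shown in the backtrace, `(eps == 0) ^ (block_length == 0)`. The struct `diffusion_schedule_args`, the function `check_schedule_args`, and the message wording are hypothetical, as is the assumption that a non-zero eps selects the timestep-based schedule.

```cpp
#include <cstdint>
#include <string>

// Hypothetical mirror of the two fields named in the failed assertion.
struct diffusion_schedule_args {
    float   eps          = 0.0f; // assumed: timestep-based schedule when non-zero
    int32_t block_length = 0;    // assumed: block-based schedule when non-zero
};

// Returns an empty string when exactly one schedule is selected, otherwise a
// message that can be printed before exiting with a non-zero status.
static std::string check_schedule_args(const diffusion_schedule_args & args) {
    const bool has_eps   = args.eps != 0.0f;
    const bool has_block = args.block_length != 0;
    if (has_eps == has_block) {
        return has_eps
            ? "error: --diffusion-eps and --diffusion-block-length are mutually exclusive"
            : "error: specify either --diffusion-eps or --diffusion-block-length";
    }
    return "";
}
```

A caller would print the returned message and exit early when it is non-empty, which turns the abort above into an ordinary usage error.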

IMHO you should edit your first comment above to add this missing argument. Your short howto is super useful to try the feature given that you've provided the gguf file as well! Thanks!

I find the output a bit slow, but I couldn't time it because the (remote) machine hung. I'll have to time it on Monday after I can reboot it :-)

github-actions bot added the examples and python (python script changes) labels Jul 19, 2025

am17an force-pushed the add_llada_8b branch from e362c14 to 87b3235 Compare July 19, 2025 10:05

am17an requested a review from ggerganov July 19, 2025 10:06

am17an force-pushed the add_llada_8b branch from 87b3235 to d27740c Compare July 19, 2025 10:17

am17an requested a review from CISC July 19, 2025 11:05

am17an force-pushed the add_llada_8b branch 3 times, most recently from e4b7346 to 5644f2f Compare July 19, 2025 14:59

CISC reviewed Jul 19, 2025

am17an force-pushed the add_llada_8b branch from 7a5747d to 6317827 Compare July 20, 2025 02:45

CISC reviewed Jul 20, 2025

am17an force-pushed the add_llada_8b branch from e6d91d5 to 6b0ea9f Compare July 22, 2025 09:47

am17an force-pushed the add_llada_8b branch 2 times, most recently from 0f66ad4 to 1439fbe Compare July 26, 2025 07:08

ggerganov reviewed Jul 26, 2025

am17an force-pushed the add_llada_8b branch 2 times, most recently from cb015b4 to cf10ebf Compare July 28, 2025 07:27

am17an force-pushed the add_llada_8b branch from cf10ebf to baaf2db Compare July 28, 2025 07:35

am17an force-pushed the add_llada_8b branch from baaf2db to e318469 Compare July 28, 2025 07:54

CISC reviewed Jul 28, 2025

Add support for Llada-8b: diffusion model

bef6c2d

llama: fix llama-model fixup working

am17an added 3 commits July 30, 2025 11:07

Fix README and convert_hf_to_gguf

267a09d

convert_hf_to_gguf.py: address review comments

812bc38

Make everything in a single example

6bb0093

am17an force-pushed the add_llada_8b branch 2 times, most recently from 6ec2264 to 05f99c7 Compare July 30, 2025 04:08

Remove model-specific sampling

3e7efcb

am17an force-pushed the add_llada_8b branch from 05f99c7 to 3e7efcb Compare July 30, 2025 04:16

am17an requested a review from ggerganov July 30, 2025 04:19

Remove unused argmax

a50547c

ggerganov approved these changes Jul 31, 2025

Remove braced initializers, improve README.md a bit

e864a49

am17an force-pushed the add_llada_8b branch from fcc4180 to e864a49 Compare July 31, 2025 08:27

CISC approved these changes Jul 31, 2025

Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps

9691f4e

am17an force-pushed the add_llada_8b branch from 6f791ca to 9691f4e Compare July 31, 2025 08:54

CISC reviewed Jul 31, 2025

CISC reviewed Jul 31, 2025

Remove adding the mask token

57201cc

CISC reviewed Jul 31, 2025

Move add_add_bos_token to set_vocab

a326b13

CISC reviewed Jul 31, 2025

use add_bool in gguf_writer.py

ac3f91f

am17an force-pushed the add_llada_8b branch from 3a2a3d9 to ac3f91f Compare July 31, 2025 10:16

am17an merged commit 8a4a856 into ggml-org:master Jul 31, 2025
50 checks passed

am17an deleted the add_llada_8b branch August 1, 2025 07:15
