Add LLaDA 8b Diffusion model #14771

Merged: 12 commits merged into ggml-org:master on Jul 31, 2025

Conversation

am17an
Collaborator

@am17an am17an commented Jul 19, 2025

Continuing on #14644, this PR adds another diffusion model, https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct, which has different semantics compared to the Dream-7B model and overall seems to have better performance.

There are very few similarities in how the two models generate tokens, so for now I've created two different examples: llama-diffusion-dream-cli (for the earlier model) and llama-diffusion-llada-cli (for running the new LLaDA model). Added a README as well.

I've uploaded a GGUF.
Edit on 30-07-2025: Re-uploaded another GGUF with a config change

Example command
./build/bin/llama-diffusion-cli -m llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0 -ub 128 --diffusion-visual --diffusion-block-length 32

Also, I would like to add this to the server, but I'm not sure what API would be acceptable, so I'm hoping to have a discussion on that as well.

@github-actions github-actions bot added examples python python script changes labels Jul 19, 2025
@am17an am17an requested a review from ggerganov July 19, 2025 10:06
@am17an am17an requested a review from CISC July 19, 2025 11:05
@am17an am17an force-pushed the add_llada_8b branch 3 times, most recently from e4b7346 to 5644f2f on July 19, 2025 14:59
@ggerganov
Member

I would like to avoid adding a second diffusion example - we are increasing the maintenance effort for no significant benefit. The diffusion architecture is not yet well established.

We can think about extending the llama_sampler functionality to support these use cases, and since it is already modular it would make more sense to implement the sampling logic there. Ideally there would be just one diffusion CLI example for all diffusion models, with different samplers attached.
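
A minimal sketch of what attaching samplers via the existing llama_sampler chain API could look like; this is illustrative only (the function make_diffusion_sampler and the particular chain are assumptions, not what the PR implements):

#include "llama.h"

// Illustrative only: build a sampler chain that a single diffusion CLI could
// attach per model, instead of hard-coding model-specific sampling.
static llama_sampler * make_diffusion_sampler(float temp, int32_t top_k) {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    if (temp <= 0.0f) {
        // greedy decoding, as in the --temp 0 example above
        llama_sampler_chain_add(chain, llama_sampler_init_greedy());
    } else {
        llama_sampler_chain_add(chain, llama_sampler_init_top_k(top_k));
        llama_sampler_chain_add(chain, llama_sampler_init_temp(temp));
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    }
    return chain; // caller releases it with llama_sampler_free()
}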

@am17an
Collaborator Author

am17an commented Jul 21, 2025

I would like to avoid adding a second diffusion example - we are increasing the maintenance effort for no significant benefit. The diffusion architecture is not yet well established.

We can think about extending the llama_sampler functionality to support these use cases, and since it is already modular it would make more sense to implement the sampling logic there. Ideally there would be just one diffusion CLI example for all diffusion models, with different samplers attached.

Yeah, agree. I initially wrote them as one example. However, passing arguments via the CLI for two separate sets of sampling parameters/algorithms was quite confusing to me and would be even more so for the end user, so for the sake of clarity I wrote them separately.
diffusion_generate_dream and diffusion_generate_llada are two different functions with the same outline, decode => sample => unmask, so there is an abstraction to be made. The only thing to clarify is how we pass separate sets of parameters to the example without overloading the same flag (e.g. --diffusion-algorithm being supported in Dream but not LLaDA and vice versa). llama_sampler could be used as well, but I don't see how it would solve this particular problem.
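
As a point of reference for that shared outline, here is a minimal sketch of what one decode => sample => unmask step could look like, assuming a caller-provided llama_sampler and a known mask token id; diffusion_step and the "unmask every masked position" policy are simplifications for illustration, not the PR's implementation:

#include <vector>
#include "llama.h"

// Illustrative sketch: one diffusion step over a partially masked sequence.
// Each step re-evaluates the full sequence (no KV cache reuse); a real
// implementation would only commit the most confident positions per step,
// according to the chosen schedule, rather than unmasking everything at once.
static void diffusion_step(llama_context * ctx, llama_sampler * smpl,
                           std::vector<llama_token> & tokens, llama_token mask_token_id) {
    // decode: evaluate the whole sequence and request logits at every position
    llama_batch batch = llama_batch_init((int32_t) tokens.size(), 0, 1);
    batch.n_tokens = (int32_t) tokens.size();
    for (size_t i = 0; i < tokens.size(); ++i) {
        batch.token[i]     = tokens[i];
        batch.pos[i]       = (llama_pos) i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = true;
    }
    llama_decode(ctx, batch); // error handling omitted for brevity

    // sample + unmask: replace masked positions with sampled tokens
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == mask_token_id) {
            tokens[i] = llama_sampler_sample(smpl, ctx, (int32_t) i);
        }
    }
    llama_batch_free(batch);
}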

@am17an
Collaborator Author

am17an commented Jul 23, 2025

@ggerganov would having them in the same example and having extra CLI args for models be acceptable?

@ggerganov
Member

Yes, merging the examples into a single example would be better.

@am17an am17an force-pushed the add_llada_8b branch 2 times, most recently from 0f66ad4 to 1439fbe on July 26, 2025 07:08
@am17an
Collaborator Author

am17an commented Jul 26, 2025

Yes, merging the examples into a single example would be better.

Made everything into a single example, please have another look when you have the time

Member

@ggerganov ggerganov left a comment

I think the example can be improved by not branching between "llada" and "dream" and instead having common logic for any diffusion model. This would make it much easier to scale to more diffusion models in the future. Otherwise, the way you've implemented it now, you have to add new structs, sampling types, generation functions, etc. for each new architecture, which seems a bit unnecessary.

@am17an
Collaborator Author

am17an commented Jul 28, 2025

@ggerganov you're right, we can combine the sampling methods. I was under the assumption that the only sampling methods that would work were their respective paper implementations, but I tried various sampling methods on both models and they produce coherent outputs, though I did not do any deep correctness checks.

Refactored to have a concept called a schedule, which is either timestep-based (like Dream) or block-based (like LLaDA). Both schedules work for both models. Also refactored the sampling methods to be shared across the models.
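
A minimal sketch of the schedule idea, for illustration only (the type and field names are assumptions, not the PR's exact code; the flags mirror the ones used elsewhere in this thread):

// Two schedules driving the same decode => sample => unmask loop.
enum diffusion_schedule {
    DIFFUSION_SCHEDULE_TIMESTEP, // Dream-style: eps-spaced timesteps over the whole sequence
    DIFFUSION_SCHEDULE_BLOCK,    // LLaDA-style: unmask one block of positions at a time
};

struct diffusion_params {
    diffusion_schedule schedule;
    int32_t steps;        // --diffusion-steps
    float   eps;          // timestep schedule only (--diffusion-eps)
    int32_t block_length; // block schedule only (--diffusion-block-length)
};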

The issues that do remain, however:

  1. Shifted logits - logits in Dream are shifted by -1 after a prompt-processing (pp) pass, which is not the case in LLaDA. Ideally this should be part of the GGUF metadata, but I'm not sure.
  2. The BOS token in LLaDA - add_bos_token is false in tokenizer_config.json, I think because the chat_template contains the bos_token.
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",

However, this code removes this BOS

llama.cpp/common/chat.cpp

Lines 746 to 755 in c35f9ea

minja::chat_template_options tmpl_opts;
// To avoid double BOS / EOS tokens, we're manually removing begining / trailing tokens
// instead of using `chat_template_options.use_bos_token = false`, since these tokens
// may be needed inside the template / between messages too.
auto result = tmpl.apply(tmpl_inputs, tmpl_opts);
if (string_starts_with(result, tmpl.bos_token())) {
result = result.substr(tmpl.bos_token().size());
}
if (string_ends_with(result, tmpl.eos_token())) {
result = result.substr(0, result.size() - tmpl.eos_token().size());

I'm not familiar with the chat-template code and I was not able to work around this without adding a BOS token.

@am17an am17an force-pushed the add_llada_8b branch 2 times, most recently from cb015b4 to cf10ebf on July 28, 2025 07:27
@CISC
Collaborator

CISC commented Jul 28, 2025

2. The BOS token in LLaDA - `add_bos_token` is false in `tokenizer_config.json`, I think because the chat_template contains the `bos_token`.

No, add_bos_token only applies to untemplated generation; it seems like a mistake. It was removed in the LLaDA 1.5 chat template BTW.

Edit: Nvm, I'm blind, it's still there.

However, this code removes this BOS

llama.cpp/common/chat.cpp

Lines 746 to 755 in c35f9ea

minja::chat_template_options tmpl_opts;
// To avoid double BOS / EOS tokens, we're manually removing begining / trailing tokens
// instead of using `chat_template_options.use_bos_token = false`, since these tokens
// may be needed inside the template / between messages too.
auto result = tmpl.apply(tmpl_inputs, tmpl_opts);
if (string_starts_with(result, tmpl.bos_token())) {
result = result.substr(tmpl.bos_token().size());
}
if (string_ends_with(result, tmpl.eos_token())) {
result = result.substr(0, result.size() - tmpl.eos_token().size());

This probably needs to be improved.

I'm not familiar with the chat-template code and I was not able to work around this without adding a BOS token.

Setting add_bos_token to True on conversion should fix that, but only applies to pre-1.5 models.

@am17an
Collaborator Author

am17an commented Jul 28, 2025

Setting add_bos_token to True on conversion should fix that, but only applies to pre-1.5 models.

Yep, this fixes it for a regenerated GGUF, though it might be a problem downstream if people use the HF repo to create quants (unless they patch this in the HF repo).

Commits added to the branch: "llama: fix llama-model", "fixup", "working"
@CISC
Collaborator

CISC commented Jul 31, 2025

I just tested LLaDA-1.5 btw, works great! :)

@am17an am17an merged commit 8a4a856 into ggml-org:master Jul 31, 2025
50 checks passed
@am17an am17an deleted the add_llada_8b branch August 1, 2025 07:15
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 1, 2025
* Add support for Llada-8b: diffusion model

* Add README

* Fix README and convert_hf_to_gguf

* convert_hf_to_gguf.py: address review comments

* Make everything in a single example

* Remove model-specific sampling

* Remove unused argmax

* Remove braced initializers, improve README.md a bit

* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps

* Remove adding the mask token

* Move add_add_bos_token to set_vocab

* use add_bool in gguf_writer.py
@wtarreau
Contributor

wtarreau commented Aug 3, 2025

I just tested today and got a crash. As I understand it, this is still in early development, so I'm not going to bother you with another issue, as the problem might already be known or expected. In short, I built llama.cpp at commit 5c0eb5e ("opencl: fix adreno compiler detection logic (#15029)") like I always do:

$ cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_OPENMP=OFF  && cmake --build build --config Release -j $(nproc)

and tried the exact command above with the model linked above:

 ./build/bin/llama-diffusion-cli -m /mnt/models/llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0  -ub 128 --diffusion-visual

and got that one:

...
llama_context:        CPU  output buffer size =     0.48 MiB
/home/willy/ai/llama.cpp/examples/diffusion/diffusion-cli.cpp:616: GGML_ASSERT((params.diffusion.eps == 0) ^ (params.diffusion.block_length == 0)) failed
[New LWP 7410]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000fa000fe66800 in __GI___wait4 (pid=<optimized out>, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x0000fa000fe66800 in __GI___wait4 (pid=<optimized out>, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000bbeeac7290b4 in ggml_print_backtrace ()
#2  0x0000bbeeac729244 in ggml_abort ()
#3  0x0000bbeeac480970 in main ()
[Inferior 1 (process 7409) detached]
Aborted (core dumped)

It wasn't built with debug symbols, so the backtrace is not more precise. If this is news to you and you'd prefer an issue, let me know and I'll open one. In any case it should be trivially reproducible ;-)

@am17an
Collaborator Author

am17an commented Aug 3, 2025

@wtarreau you need to supply either --diffusion-block-length or --diffusion-eps; I recommend --diffusion-block-length 32 for LLaDA models. It might be useful to have a default value here, which I'll look into.
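
For example, the failing command above works once the block length is added (same flags as the example in the PR description):

./build/bin/llama-diffusion-cli -m /mnt/models/llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0 -ub 128 --diffusion-visual --diffusion-block-length 32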

@wtarreau
Contributor

wtarreau commented Aug 3, 2025

Confirmed, indeed, I knew it was not necessary to create an issue ;-) A default value would indeed be useful; in general a program shouldn't crash due to missing command-line arguments, at worst it should complain that they are missing.

IMHO you should edit your first comment above to add this missing argument. Your short how-to is super useful for trying the feature, given that you've provided the GGUF file as well! Thanks!

I find the output a bit slow, but I couldn't time it since the (remote) machine hung; I'll have to time it on Monday after I can reboot it :-)

@claws61821

Is this available in the binaries somehow, or only in source builds? I tried running the example command on the latest Windows Vulkan binary, replacing llama-diffusion-cli with llama-cli in the proper directory and using a different prompt, and I received an error stating that --diffusion-steps is an invalid argument. I was using the LLaDA model from your link above.

load_backend: loaded RPC backend from ...\Documents\AIGen\LlamaCPP\llama-b6183-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from ...\Documents\AIGen\LlamaCPP\llama-b6183-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from ...\Documents\AIGen\LlamaCPP\llama-b6183-bin-win-vulkan-x64\ggml-cpu-icelake.dll
error: invalid argument: --diffusion-steps

@CISC
Collaborator

CISC commented Aug 17, 2025

Is this available in the binaries somehow, or only in source builds? I tried running the example command on the latest Windows Vulkan binary, replacing llama-diffusion-cli with llama-cli in the proper directory and using a different prompt, and I received an error stating that --diffusion-steps is an invalid argument.

You're supposed to use llama-diffusion-cli, not llama-cli. :)

@wtarreau
Contributor

I re-ran a test with it, this time on a local machine. It's incredibly slow. I'm well aware that the F16 quantization contributes to this, but I suspect that for the same effort it requires a lot more memory accesses. Here on this machine (Radxa Orion O6, ARMv9 with 8xA720 and 128-bit DDR5), it took almost 11 minutes:

$ time ./build/bin/llama-diffusion-cli -m models/llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0 -ub 128 --diffusion-visual --diffusion-block-length 32

In the first 4 hours, Lily runs 12 kilometers per hour x 4 hours = 48 kilometers.
In the next 4 hours, she runs 6 kilometers per hour x 4 hours = 24 kilometers.
Therefore, Lily runs 48 kilometers + 24 kilometers = 72 kilometers.
Therefore: 72

real    10m53.747s
user    86m26.593s
sys     0m1.322s

This test (same prompt) on Llama-3.1-8B-Q5_K_M gives the result in 27 seconds, or 24 times faster:

time taskset -c 0,5-11 ./build/bin/llama-cli -t 8 -m models/Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --temp 0.6 -st
(...)
## Step 4: Calculate the total distance Lily can run in 8 hours.
Total distance = distance run at 12 km/h + distance run at 6 km/h = 48 km + 24 km = 72 km.

The final answer is: $\boxed{72}$ [end of text]

llama_perf_sampler_print:    sampling time =      13.91 ms /   227 runs   (    0.06 ms per token, 16325.06 tokens per second)
llama_perf_context_print:        load time =     886.37 ms
llama_perf_context_print: prompt eval time =    3619.70 ms /    46 tokens (   78.69 ms per token,    12.71 tokens per second)
llama_perf_context_print:        eval time =   22018.74 ms /   180 runs   (  122.33 ms per token,     8.17 tokens per second)
llama_perf_context_print:       total time =   25708.06 ms /   226 tokens
llama_perf_context_print:    graphs reused =        173

real    0m26.911s
user    3m26.480s
sys     0m0.662s

The same test on llama-3.2-3B also gives the correct result but this time in 10 seconds:

$ time taskset -c 0,5-11 ./build/bin/llama-cli -t 8 -m models/Llama-3.2-3B-Instruct-Q4_0.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" -st
(...)
## Step 4: Calculate the total distance Lily can run in 8 hours.
Add the distance from the first 4 hours and the distance from the remaining 4 hours to find the total distance. total_distance = 48 + 24 = 72 kilometers.

The final answer is: $\boxed{72}$ [end of text]

llama_perf_sampler_print:    sampling time =      14.49 ms /   235 runs   (    0.06 ms per token, 16215.84 tokens per second)
llama_perf_context_print:        load time =    1122.89 ms
llama_perf_context_print: prompt eval time =     320.13 ms /    46 tokens (    6.96 ms per token,   143.69 tokens per second)
llama_perf_context_print:        eval time =    8722.87 ms /   188 runs   (   46.40 ms per token,    21.55 tokens per second)
llama_perf_context_print:       total time =    9107.97 ms /   234 tokens
llama_perf_context_print:    graphs reused =        181

real    0m10.427s
user    1m13.711s
sys     0m0.503s

I've checked using "perf top" where the CPU time was spent with LLaDA:

  95.03%  llama-diffusion-cli                 [.] ggml_vec_dot_f16
   1.08%  llama-diffusion-cli                 [.] ggml_vec_dot_f32
   0.86%  llama-diffusion-cli                 [.] ggml_compute_forward_mul_mat
   0.25%  llama-diffusion-cli                 [.] ggml_cpu_fp32_to_fp16
   0.19%  llama-diffusion-cli                 [.] ggml_barrier
   0.14%  llama-diffusion-cli                 [.] ggml_vec_swiglu_f32

OK, let's zoom into this one:

ggml_vec_dot_f16:
   0.01 |      mov    v28.16b, v30.16b
        |      mov    v29.16b, v30.16b
   0.02 |      mov    v31.16b, v30.16b                                                                                                     
   0.01 |      nop
   2.29 |38:   ldp    q24, q25, [x2]
  15.35 |      ldp    q26, q27, [x4]
   7.97 |      fmla   v31.8h, v24.8h, v26.8h
  31.35 |      ldp    q24, q26, [x2, #32]
   0.00 |      add    x2, x2, #0x40
  22.39 |      fmla   v29.8h, v25.8h, v27.8h
   2.29 |      ldp    q25, q27, [x4, #32]
   0.00 |      add    x4, x4, #0x40
  14.86 |      fmla   v28.8h, v24.8h, v25.8h
   0.89 |      fmla   v30.8h, v26.8h, v27.8h
   2.18 |      cmp    x7, x2
        |      b.ne   38
   0.02 |      fadd   v31.8h, v31.8h, v28.8h
        |      fadd   v29.8h, v29.8h, v30.8h
   0.02 |      fadd   v31.8h, v31.8h, v29.8h
   0.00 |74:   fcvtl  v29.4s, v31.4h

We're quite clearly waiting on data. But even if we could reduce the performance ratio from 24x to only 6x by quantizing LLaDA to 4 bits, it's still a huge ratio, and I suspect that the diffusion architecture is much more memory-intensive and might be limited to certain types of hardware only. I'm currently re-running at Q4_K_M; it's indeed a bit faster, as expected, but still far too slow to be useful for anything beyond research for now.

Running with 32 steps only (it finished in 24), with Q4_K_M gives me:

In the first 4 hours can run 12 kilometers per hour * 4 = 48 kilometers.
In the next 4 hours she run 6 kilometers per hour * 4 = 24 kilometers.
In total she can run 48 + 24 = 72 kilometers in 8 hours.
The answer is 72

real    2m27.864s
user    19m34.702s
sys     0m0.424s

OK, better, but still quite slow. Maybe it will hold up better at lower quantizations (it seems to produce the correct output at Q2_K, but isn't faster); time will tell...

@am17an
Collaborator Author

am17an commented Aug 17, 2025

@wtarreau yes, it's expected to be slow at the moment because it does a pp2048 per diffusion step and doesn't use the KV cache. I'm working on adding some KV cache support, which should make it 3-4x faster.
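
Rough intuition, not a precise model: with --diffusion_steps 128 and no KV cache, every step re-runs prompt processing over the entire working context, so the total work grows roughly with steps x context length, whereas the autoregressive runs above process the 46-token prompt once and then evaluate a single token per step against the cache. Add the F16 weights, which mean considerably more bytes to stream per pass than the Q4/Q5 models on a ~45 GB/s memory system, and that goes a long way toward explaining the observed gap.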

@wtarreau
Contributor

OK great to know, thanks for explaining!

@am17an
Collaborator Author

am17an commented Aug 17, 2025

Although reading your comment it seems like something is off; for me the same command runs in under a minute on an RTX 3090. I'm AFK but will debug soon.

@wtarreau
Contributor

Please note that I'm on a CPU with limited DRAM bandwidth (~45 GB/s, still quite nice for a CPU but far from a 3090). It's possible that in your case the available DRAM bandwidth masks something.

@claws61821

You're supposed to use llama-diffusion-cli, not llama-cli. :)

I did not see a file named llama-diffusion-cli in the binary package and assumed that llama-diffusion-cli might have been a placeholder name for the subproject prior to the merge. On the slim chance that one of the other files might still answer to that name despite not presenting it, I re-ran the command using that name.

./llama-diffusion-cli: The term `./llama-diffusion-cli` is not recognized as the name of a cmdlet, function, script file, or operable program.

So I ask again, politely: Is this merge included in some accessible way in the released binaries, or is it only available in source builds at this time?

@wtarreau
Contributor

So I ask again, politely: Is this merge included in some accessible way in the released binaries, or is it only available in source builds at this time?

I'm not aware of any binary builds; normally it's just the boring git pull ; cmake -B build ... and it usually works fine. Just out of curiosity, where did you get your binaries if you didn't build them yourself? If they weren't packaged with your distro, generally speaking it's not a good idea to install random binaries found on the net: you don't always know whether you can trust the people providing them. And even if you have good reasons to trust them, they will not necessarily be up to date. Since the project moves fast, I really encourage you to keep an up-to-date copy of the source repo that you periodically update and rebuild. I keep a copy of known-good binaries (I just rename build to build-$version before rebuilding) so that I can compare in case something looks odd, but that's all. Hoping this helps.

@wtarreau
Contributor

Right after saying that, I discovered that there are actually binary releases; I wasn't aware. So presumably your binary should be available for your OS here: https://github.com/ggml-org/llama.cpp/releases/tag/b6188.

@claws61821

I'm not aware of any binary builds; normally it's just the boring git pull ; cmake -B build ... and it usually works fine. Just out of curiosity, where did you get your binaries if you didn't build them yourself?

Binary 6183 from this repository, for Vulkan, for Windows x64, downloaded yesterday.

@wtarreau
Contributor

OK. As said above, I only just discovered that there were binaries, so I never tried them. The last time I had to use Windows was more than 12 years ago; back then llama.cpp didn't exist :-) It's possible that some binaries are not built for some OSes, I don't know.

@CISC
Collaborator

CISC commented Aug 17, 2025

So I ask again, politely: Is this merge included in some accessible way in the released binaries, or is it only available in source builds at this time?

Only binaries from tools have been included in releases for quite a while now; examples are excluded, so you have to build from source.

@claws61821

Only binaries from tools have been included in releases for quite a while now; examples are excluded, so you have to build from source.

Thank you very much, CISC. I'm new around here and wasn't able to find a quick generic answer in a couple of hours of scrolling before I asked.
