Add LLaDA 8b Diffusion model #14771
Conversation
Force-pushed from e4b7346 to 5644f2f
I would like to avoid adding a second diffusion example - we are increasing the maintenance effort for no significant benefit. The diffusion architecture is not yet well established. We can think about extending the existing example instead.
Yeah, agreed - I initially wrote them as one example. However, passing arguments via the CLI for two separate sets of sampling parameters/algorithms was quite confusing to me and would be even more so for the end user, so for the sake of clarity I wrote them separately.
@ggerganov would having them in the same example, with extra CLI args per model, be acceptable?
Yes, merging the examples into a single example would be better. |
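If it helps to picture the merged example, here is a minimal sketch (assumed, not the PR's actual code) of how a single binary can branch per model: read the architecture string from the loaded GGUF via the public `llama_model_meta_val_str` accessor and pick model-specific defaults from that, so no second executable is needed.

```cpp
// Sketch only: llama_model_meta_val_str is the real llama.cpp accessor,
// everything else here is illustrative.
#include <string>
#include "llama.h"

static std::string get_arch(const llama_model * model) {
    char buf[128] = {0};
    // "general.architecture" is the standard GGUF metadata key, e.g. "llada" or "dream"
    if (llama_model_meta_val_str(model, "general.architecture", buf, sizeof(buf)) < 0) {
        return "";
    }
    return buf;
}

// usage inside the example's main():
//   const std::string arch = get_arch(model);
//   if (arch == "llada") { /* LLaDA-style defaults */ } else { /* Dream-style defaults */ }
```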
Force-pushed from 0f66ad4 to 1439fbe
Made everything into a single example; please have another look when you have the time.
I think the example can be improved by not branching between "llada" and "dream" and instead having common logic for any diffusion model. This would make it much easier to scale with more diffusion models in the future. Otherwise, the way you've implemented it now, you have to add new structs, sampling types, generation functions, etc. for each new architecture, and this seems a bit unnecessary.
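To illustrate the suggestion (a sketch under assumed names, not the example's actual API): every diffusion architecture can be reduced to a step count, a mask token and a per-step token-selection rule, so one generation loop can serve Dream, LLaDA and future models alike.

```cpp
// Minimal sketch of a model-agnostic diffusion generation loop.
#include <cstdint>
#include <functional>
#include <vector>

// Decides which masked positions to commit at this step (confidence-, entropy-,
// margin-based, etc.). Hypothetical signature for illustration.
using select_fn = std::function<void(std::vector<int32_t> & tokens, const float * logits, float t)>;

struct diffusion_generate_params {
    int       steps      = 64;
    int32_t   mask_token = -1;   // token id used for masked positions
    select_fn select;            // the only per-model piece that has to vary
};

inline void diffusion_generate(std::vector<int32_t> & tokens,
                               const diffusion_generate_params & p,
                               const std::function<const float *(const std::vector<int32_t> &)> & forward) {
    for (int s = 0; s < p.steps; ++s) {
        const float t = 1.0f - (float) s / (float) p.steps; // timestep in (0, 1]
        const float * logits = forward(tokens);             // full-sequence forward pass
        p.select(tokens, logits, t);                        // unmask some positions
    }
}
```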
@ggerganov you're right, we can combine the sampling methods. I was under the assumption that the only sampling methods that would work were their respective paper implementations, but I tried various sampling methods on both models and they seem to produce coherent output, though I did not do any deep correctness checks. Refactored to have a shared sampling concept. Some issues do remain, however:
This code removes the BOS token (lines 746 to 755 in c35f9ea).
I'm not familiar with the chat-template code, and I was not able to work around this without adding a BOS token.
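One possible workaround (a sketch, assuming the current llama.cpp tokenization API; not necessarily what the PR ended up doing) is to tokenize the templated prompt with `add_special = true`, so the vocab's own add-BOS flag decides whether a BOS token is prepended instead of the example adding it by hand:

```cpp
// Sketch: let the tokenizer handle BOS via the vocab's add_bos flag.
#include <string>
#include <vector>
#include "llama.h"

static std::vector<llama_token> tokenize_prompt(const llama_model * model, const std::string & prompt) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // first call with a null buffer returns the negated required token count
    const int n = -llama_tokenize(vocab, prompt.c_str(), (int32_t) prompt.size(),
                                  nullptr, 0, /*add_special=*/ true, /*parse_special=*/ true);

    std::vector<llama_token> tokens(n);
    llama_tokenize(vocab, prompt.c_str(), (int32_t) prompt.size(),
                   tokens.data(), (int32_t) tokens.size(),
                   /*add_special=*/ true, /*parse_special=*/ true);
    return tokens;
}
```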
Force-pushed from cb015b4 to cf10ebf
No, ... Edit: Nvm, I'm blind, it's still there.
This probably needs to be improved.
Setting add_bos_token should fix this.
Yep, this fixes it for regenerated GGUFs. Though it might be a problem downstream if people use the HF repo to create quants (unless they patch this in the HF repo).
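For anyone regenerating the GGUF, a quick way to confirm the flag actually made it into the file is to read the standard `tokenizer.ggml.add_bos_token` metadata key with ggml's gguf API (a sketch; the key name follows the usual GGUF convention, the rest is illustrative):

```cpp
// Sketch: print whether a GGUF file carries tokenizer.ggml.add_bos_token.
#include <cstdio>
#include "gguf.h"

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to load %s\n", argv[1]); return 1; }

    const int64_t kid = gguf_find_key(ctx, "tokenizer.ggml.add_bos_token");
    if (kid < 0) {
        printf("add_bos_token: not present\n");
    } else {
        printf("add_bos_token: %s\n", gguf_get_val_bool(ctx, kid) ? "true" : "false");
    }

    gguf_free(ctx);
    return 0;
}
```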
llama: fix llama-model fixup working
I just tested LLaDA-1.5 btw, works great! :)
* Add support for Llada-8b: diffusion model
* Add README
* Fix README and convert_hf_to_gguf
* convert_hf_to_gguf.py: address review comments
* Make everything in a single example
* Remove model-specific sampling
* Remove unused argmax
* Remove braced initializers, improve README.md a bit
* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps
* Remove adding the mask token
* Move add_add_bos_token to set_vocab
* use add_bool in gguf_writer.py
I just tested today and got a crash. As I understand it, this is still in early development, so I'm not going to bother you with another issue, as the problem might already be known or expected. In short, I built llama.cpp at commit 5c0eb5e ("opencl: fix adreno compiler detection logic (#15029)") like I always do:
and tried the exact command above with the model linked above:
and got that one:
It wasn't built with debug symbols, so the backtrace is not more precise. If this is news to you and you'd prefer an issue, let me know and I'll file one. But logically it is trivially reproducible ;-)
@wtarreau you need to supply the missing argument; a sensible default should probably be added.
Confirmed indeed, I knew it was not necessary to create an issue ;-) A default value would be useful; in general a program shouldn't crash due to missing cmdline args, at worst it should complain about their absence. IMHO you should edit your first comment above to add this missing argument. Your short howto is super useful for trying the feature, given that you've provided the gguf file as well - thanks! I find the output a bit slow, but I couldn't time it; the (remote) machine hung, so I'll have to time it on Monday after I can reboot it :-)
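A minimal sketch of the "complain, don't crash" behaviour being asked for, validating the diffusion arguments after parsing (the flag names are taken from the example command; the parameter struct is illustrative):

```cpp
// Sketch: fall back to a default or warn clearly instead of crashing.
#include <cstdio>

struct diffusion_cli_args {
    int steps        = 0;   // 0 = --diffusion_steps not supplied
    int block_length = 0;   // 0 = --diffusion-block-length not supplied
};

static void validate_or_default(diffusion_cli_args & a) {
    if (a.steps <= 0) {
        fprintf(stderr, "warning: --diffusion_steps not set, defaulting to 64\n");
        a.steps = 64;
    }
    if (a.block_length <= 0) {
        fprintf(stderr, "warning: --diffusion-block-length not set, defaulting to 32\n");
        a.block_length = 32;
    }
}
```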
Is this available in the binaries somehow, or only in source builds? I tried running the example command on the latest Windows Vulkan binaries, replacing the binary name.
You're supposed to use llama-diffusion-cli.
I re-ran a test with it, this time on a local machine. It's incredibly slow. I'm well aware that the F16 quantization contributes to this, but I suspect that for the same effort it requires a lot more memory accesses. Here on this machine (Radxa Orion O6, ARMv9 with 8xA720 and 128-bit DDR5), it took almost 11 minutes:
This test (same prompt) on Llama-3.1-8B-Q5_K_M gives the result in 27 seconds, or 24 times faster:
The same test on llama-3.2-3B also gives the correct result but this time in 10 seconds:
I've checked using "perf top" where the CPU time was spent for llada:
OK, let's zoom into this one:
We're quite clearly waiting on data. But even if we could reduce the performance ratio from 24x to only 6x by quantizing llada to 4 bits, it's still a huge gap, and I suspect that the diffusion architecture could be much more memory intensive and might be limited to certain types of hardware only. I'm currently re-running at Q4_K_M; it's indeed a bit faster as expected, but still far too slow to be useful for anything beyond research for now. Running with only 32 steps (it finished in 24), with Q4_K_M, gives me:
OK, better, but still quite slow. Maybe it will hold up better at lower quantizations (it seems to produce the correct output at Q2_K, but no faster); time will tell...
@wtarreau yes, it's expected to be slow at the moment because it does a pp2048 per diffusion step and doesn't use the KV cache. I'm working on adding some KV-cache support which should make it 3-4x faster.
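A rough back-of-the-envelope view of why that hurts so much on a bandwidth-limited CPU (numbers taken from the example command; this counts token-forward passes only and overstates the wall-clock gap, since batched prompt processing is much cheaper per token than single-token decoding):

```cpp
// Sketch: compare the work done per answer with and without a KV cache.
#include <cstdio>

int main() {
    const long n_ctx   = 2048;  // tokens reprocessed every diffusion step (pp2048)
    const long n_steps = 128;   // --diffusion_steps in the example command
    const long n_gen   = 128;   // tokens an autoregressive model would decode instead

    const long diffusion_passes = n_ctx * n_steps;  // 262144 token-forward passes
    const long ar_passes        = n_ctx + n_gen;    // one prompt pass + one pass per token

    printf("diffusion      : %ld token-forward passes\n", diffusion_passes);
    printf("autoregressive : %ld token-forward passes\n", ar_passes);
    printf("raw ratio      : ~%.0fx\n", (double) diffusion_passes / (double) ar_passes);
    return 0;
}
```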
OK, great to know, thanks for explaining!
Although reading your comment it seems like something is off: for me the same command runs in under a minute on an RTX 3090. I'm AFK but will debug soon.
Please note that I'm on a CPU with limited DRAM speed (~45 GB/s, still quite nice for a CPU but far from a 3090). It's possible that in your case the available DRAM bandwidth masks something.
I did not see a file named llama-diffusion-cli.
So I ask again, politely: Is this merge included in some accessible way in the released binaries, or is it only available in source builds at this time?
I'm not aware of any binary builds; normally it's just the boring build from source.
Just after saying that, I've discovered that there are actually binary releases. I wasn't aware. So presumably the binary for your OS should be available there: https://github.com/ggml-org/llama.cpp/releases/tag/b6188.
Binary 6183 from this repository, for Vulkan, for Windows x64, downloaded yesterday.
OK. As said above, I've only just discovered that there were binaries, so I never tried them. And the last time I had to use Windows was more than 12 years ago; back then llama.cpp didn't exist :-) It could be that some binaries are not built for some OSes, I don't know.
Only binaries from releases made after this was merged will include it.
Thank you very much, CISC. I'm new around here, and wasn't able to find a quick generic answer in a couple of hours of scrolling before I asked.
Continuing on #14644, this PR adds another diffusion model, https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct, which has different semantics compared to the dream-7b model and overall seems to have better performance.

There are very few similarities between how they seem to generate tokens, so for now I've just created two different examples: `llama-diffusion-dream-cli` (for the earlier version) and `llama-diffusion-llada-cli`, for running the new LLaDA model. Added a README as well.

I've uploaded a GGUF.
Edit on 30-07-2025: Re-uploaded another GGUF with a config change
Example command
./build/bin/llama-diffusion-cli -m llada-8b.gguf -p "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?" --diffusion_steps 128 -ngl 99 --temp 0 -ub 128 --diffusion-visual --diffusion-block-length 32
Also, I would like to add this to the server, but I'm not sure what API would be acceptable, so I'm hoping to have a discussion on that as well.