Name and Version
version: 4248 (3b4f2e3) built with clang version 19.1.4 for aarch64-unknown-linux-android24
Operating systems
Linux
GGML backends
CPU
Hardware
Device: Zenfone 9 (Qualcomm® Snapdragon® 8+ Gen 1 Mobile Platform)
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | AARCH64_REPACK = 1 |
lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: Cortex-A510
Model: 3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r0p3
CPU(s) scaling MHz: 77%
CPU max MHz: 2016.0000
CPU min MHz: 307.2000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint i8mm bf16 bti
Models
bartowski/Llama-3.2-3B-Instruct-GGUF
https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/tree/main
Problem description & steps to reproduce
Performance in the Old Version:
For Q4_0 and IQ4_NL, performance was normal and as expected, given that repacking was not applied in these cases.
The Q4_0_4_4 prompt processing performance was exceptional in the old version, significantly surpassing other formats.
Performance in the New Version:
The Q4_0_4_4 format now shows drastically reduced performance, falling below even Q4_0 and IQ4_NL. This is a clear regression from the old version's behavior.
Runtime repacking for Q4_0 and IQ4_NL appears to be ineffective in the new version: instead of improving performance, both formats are slightly slower than in the old version. This contradicts the expectation that repacking should deliver at least the performance of the previous Q4_0_4_4 implementation.
i8mm Support Issue:
Even though lscpu reports i8mm support, the new version of llama.cpp does not detect or use this feature (note that the system_info line above lists no int8 matrix-multiply support).
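For reference, hardware capability can be cross-checked against what the kernel reports. This is a minimal sketch assuming Linux, where aarch64 CPU features appear under the "Features" lines of /proc/cpuinfo (the same set lscpu prints as "Flags"); it only shows what the CPU offers, not which features the llama.cpp build actually detects and enables — the system_info line printed at startup reflects the latter.

```shell
# Report whether the kernel exposes the i8mm CPU feature (Linux only).
# On non-aarch64 machines the flag is simply absent, so this prints
# "not reported" there as well.
if grep -qw i8mm /proc/cpuinfo 2>/dev/null; then
    echo "i8mm: reported by kernel"
else
    echo "i8mm: not reported"
fi
```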
First Bad Commit
I could not pinpoint the first bad commit, but before the NEON changes [f2f5c3b (4105)] performance was still as expected.
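git bisect can automate narrowing down the first bad commit between the known-good and known-bad builds. Below is a self-contained toy sketch of the workflow (the toy repo and its `speed` file are purely illustrative); in a real session the test script passed to `git bisect run` would rebuild llama.cpp at each step and exit nonzero when the measured llama-bench t/s falls below a chosen threshold, using exit code 125 to skip commits that fail to build.

```shell
# Toy git bisect demo: four commits, one of which introduces a "regression"
# (the file content changes from "fast" to "slow"); bisect finds it.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q . && git config user.email t@t.t && git config user.name t
echo fast > speed && git add speed && git commit -qm good1
git commit -qm good2 --allow-empty
echo slow > speed && git commit -qam regression
git commit -qm after --allow-empty

good=$(git rev-list --max-parents=0 HEAD)     # root commit, known good
git bisect start HEAD "$good" >/dev/null 2>&1 # HEAD is known bad
# the test command: exit 0 = good commit, nonzero = bad commit
git bisect run sh -c 'grep -q fast speed' >/dev/null 2>&1
first_bad=$(git rev-parse refs/bisect/bad)
git log -1 --format=%s "$first_bad"           # -> regression
```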
Relevant log output
Previous version (3 weeks ago) - build: f2f5c3b6 (4105)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0 | 1.98 GiB | 3.61 B | CPU | 4 | pp512 | 8.60 ± 0.95 |
| llama 3B Q4_0 | 1.98 GiB | 3.61 B | CPU | 4 | tg128 | 4.56 ± 0.79 |
| llama 3B IQ4_NL - 4.5 bpw | 1.98 GiB | 3.61 B | CPU | 4 | pp512 | 9.21 ± 0.97 |
| llama 3B IQ4_NL - 4.5 bpw | 1.98 GiB | 3.61 B | CPU | 4 | tg128 | 6.73 ± 1.10 |
| llama 3B Q4_0_4_4 | 1.98 GiB | 3.61 B | CPU | 4 | pp512 | 16.71 ± 0.96 |
| llama 3B Q4_0_4_4 | 1.98 GiB | 3.61 B | CPU | 4 | tg128 | 5.67 ± 0.27 |
Current version (main yesterday) - build: 3b4f2e33 (4248)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0 | 1.98 GiB | 3.61 B | CPU | 4 | pp512 | 6.42 ± 1.73 |
| llama 3B Q4_0 | 1.98 GiB | 3.61 B | CPU | 4 | tg128 | 2.59 ± 0.10 |
| llama 3B IQ4_NL - 4.5 bpw | 1.98 GiB | 3.61 B | CPU | 4 | pp512 | 7.47 ± 0.88 |
| llama 3B IQ4_NL - 4.5 bpw | 1.98 GiB | 3.61 B | CPU | 4 | tg128 | 4.12 ± 0.57 |
| llama 3B Q4_0_4_4 | 1.98 GiB | 3.61 B | CPU | 4 | pp512 | 2.28 ± 0.32 |
| llama 3B Q4_0_4_4 | 1.98 GiB | 3.61 B | CPU | 4 | tg128 | 1.12 ± 0.33 |