
Performance bug: Android aarch64 NEON Performance Regression and i8mm Detection Issues in New Version of llama.cpp #10662

@gustrd

Description

Name and Version

version: 4248 (3b4f2e3) built with clang version 19.1.4 for aarch64-unknown-linux-android24

Operating systems

Linux

GGML backends

CPU

Hardware

Device: Zenfone 9 (Qualcomm® Snapdragon® 8+ Gen 1 Mobile Platform)

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | AARCH64_REPACK = 1 |
lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              ARM
  Model name:           Cortex-A510
    Model:              3
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p3
    CPU(s) scaling MHz: 77%
    CPU max MHz:        2016.0000
    CPU min MHz:        307.2000
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint i8mm bf16 bti

Models

bartowski/Llama-3.2-3B-Instruct-GGUF
https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/tree/main

Problem description & steps to reproduce

Performance in the Old Version:
For Q4_0 and IQ4_NL, performance was normal and as expected, given that repacking was not applied in these cases.
The Q4_0_4_4 prompt processing performance was exceptional in the old version, significantly surpassing other formats.

Performance in the New Version:
The Q4_0_4_4 format now shows drastically reduced performance, falling below the levels of Q4_0 and IQ4_NL. This is a notable regression from the old version's behavior.
Runtime repacking for Q4_0 and IQ4_NL appears to be ineffective in the new version: instead of improving performance, these formats are slightly slower than in the old version. This contradicts the expectation that repacking should deliver at least the performance of the previous Q4_0_4_4 implementation (see the sketch below for what the repacking is supposed to buy).
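For readers unfamiliar with the format names, here is a minimal sketch of the row-interleaving idea behind Q4_0_4_4, assuming a Q4_0-style 18-byte block (fp16 scale plus 32 packed 4-bit weights). This is not ggml's actual repacking code; the real path also interleaves the quantized nibbles within each group of blocks, which is omitted here.

```cpp
// Schematic illustration only, not ggml's implementation.
// "4x4" repacking interleaves the Q4_0 blocks of 4 consecutive rows so a
// NEON GEMM kernel can stream all 4 rows in one pass.
#include <cstddef>
#include <vector>

struct BlockQ4_0 {
    unsigned char bytes[18]; // fp16 scale + 32 x 4-bit quantized weights
};

// src is rows x blocks_per_row, row-major; rows is assumed divisible by 4
std::vector<BlockQ4_0> repack_4rows(const std::vector<BlockQ4_0> & src,
                                    std::size_t rows, std::size_t blocks_per_row) {
    std::vector<BlockQ4_0> dst(src.size());
    std::size_t out = 0;
    for (std::size_t r0 = 0; r0 < rows; r0 += 4) {          // each group of 4 rows
        for (std::size_t b = 0; b < blocks_per_row; ++b) {  // walk the 4 rows in lockstep
            for (std::size_t r = r0; r < r0 + 4; ++r) {
                dst[out++] = src[r * blocks_per_row + b];   // emit one block per row
            }
        }
    }
    return dst;
}
```

If the repacked layout is produced but a matching multi-row kernel is never selected (for example because a required CPU feature is not detected), the rearrangement is pure overhead, which would be consistent with the regression reported here.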

i8mm Support Issue:
Even though lscpu reports the i8mm flag, llama.cpp does not detect or leverage this feature in the new version; note that the system_info line above lists no int8-matmul capability. A standalone check is sketched below.
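To separate build-time from runtime detection, here is a minimal standalone check (not llama.cpp's own detection path), assuming an aarch64 Linux/Android toolchain where sys/auxv.h and asm/hwcap.h are available:

```cpp
// Minimal sketch: does the compiler target i8mm, and does the kernel report it?
#include <cstdio>
#include <sys/auxv.h>   // getauxval(AT_HWCAP2)
#include <asm/hwcap.h>  // HWCAP2_I8MM on aarch64 Linux/Android

int main() {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    std::printf("compile-time i8mm: yes (__ARM_FEATURE_MATMUL_INT8 is defined)\n");
#else
    std::printf("compile-time i8mm: no (build flags likely lack +i8mm)\n");
#endif
#if defined(HWCAP2_I8MM)
    const unsigned long hwcap2 = getauxval(AT_HWCAP2);
    std::printf("runtime i8mm:      %s (AT_HWCAP2)\n",
                (hwcap2 & HWCAP2_I8MM) ? "yes" : "no");
#else
    std::printf("runtime i8mm:      HWCAP2_I8MM not defined by these headers\n");
#endif
    return 0;
}
```

If the compile-time check fails while the runtime check passes, the binary was simply not built with i8mm enabled (for example, +i8mm missing from -march), so any kernels gated on __ARM_FEATURE_MATMUL_INT8 are compiled out regardless of what the hardware supports.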

First Bad Commit

I could not pinpoint the first bad commit, but I found that before the NEON changes [f2f5c3b (4105)] I still had the expected performance.

Relevant log output

Previous version (3 weeks ago) - build: f2f5c3b6 (4105)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          8.60 ± 0.95 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          4.56 ± 0.79 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          9.21 ± 0.97 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          6.73 ± 1.10 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |         16.71 ± 0.96 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          5.67 ± 0.27 |

Current version (main yesterday) - build: 3b4f2e33 (4248)
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          6.42 ± 1.73 |
| llama 3B Q4_0                  |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          2.59 ± 0.10 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          7.47 ± 0.88 |
| llama 3B IQ4_NL - 4.5 bpw      |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          4.12 ± 0.57 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         pp512 |          2.28 ± 0.32 |
| llama 3B Q4_0_4_4              |   1.98 GiB |     3.61 B | CPU        |       4 |         tg128 |          1.12 ± 0.33 |
