
vulkan: Use coopmat2 for conv2d #14982


Open. Wants to merge 1 commit into master.

jeffbolznv
Collaborator

Stacked on #14933; this stays a draft until that one is merged.

I haven't done any perf tuning on this yet, so there may still be more performance to gain.

Directed perf tests:

5090 before:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    307 runs -  3262.35 us/run - 137.42 GFLOP/run -  42.12 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              137632 runs -     7.28 us/run - 133.69 MFLOP/run -  18.38 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               99495 runs -    10.11 us/run - 135.78 MFLOP/run -  13.43 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                483328 runs -     2.08 us/run - 642.82 kFLOP/run - 308.66 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               181868 runs -     5.63 us/run -  20.90 MFLOP/run -   3.71 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               188416 runs -     5.51 us/run -   2.78 MFLOP/run - 505.39 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                58357 runs -    18.41 us/run -  22.28 MFLOP/run -   1.21 TFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              100572 runs -     9.95 us/run - 115.40 MFLOP/run -  11.60 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26269 runs -    38.22 us/run - 923.24 MFLOP/run -  24.16 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    10670 runs -    93.73 us/run -   1.85 GFLOP/run -  19.73 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    316 runs -  3174.30 us/run - 137.42 GFLOP/run -  43.29 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              135388 runs -     7.41 us/run - 133.69 MFLOP/run -  18.04 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               98021 runs -    10.24 us/run - 135.78 MFLOP/run -  13.26 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                507904 runs -     1.98 us/run - 642.82 kFLOP/run - 325.29 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               177082 runs -     5.67 us/run -  20.90 MFLOP/run -   3.68 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               180224 runs -     5.57 us/run -   2.78 MFLOP/run - 499.80 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                58357 runs -    18.41 us/run -  22.28 MFLOP/run -   1.21 TFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              102306 runs -     9.79 us/run - 115.40 MFLOP/run -  11.78 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27250 runs -    36.84 us/run - 923.24 MFLOP/run -  25.06 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    11605 runs -    86.26 us/run -   1.85 GFLOP/run -  21.43 TFLOPS

5090 after:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    952 runs -  1051.23 us/run - 137.42 GFLOP/run - 130.72 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              203456 runs -     4.93 us/run - 133.69 MFLOP/run -  27.14 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              152559 runs -     6.57 us/run - 135.78 MFLOP/run -  20.67 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                466944 runs -     2.15 us/run - 642.82 kFLOP/run - 299.08 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               196226 runs -     5.20 us/run -  20.90 MFLOP/run -   4.02 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               196608 runs -     5.18 us/run -   2.78 MFLOP/run - 537.48 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                62846 runs -    15.92 us/run -  22.28 MFLOP/run -   1.40 TFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              140454 runs -     7.14 us/run - 115.40 MFLOP/run -  16.17 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               51666 runs -    19.40 us/run - 923.24 MFLOP/run -  47.59 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    25025 runs -    40.00 us/run -   1.85 GFLOP/run -  46.22 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    961 runs -  1040.79 us/run - 137.42 GFLOP/run - 132.04 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              201212 runs -     4.98 us/run - 133.69 MFLOP/run -  26.85 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              154770 runs -     6.47 us/run - 135.78 MFLOP/run -  21.00 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                499712 runs -     2.00 us/run - 642.82 kFLOP/run - 320.75 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               196226 runs -     5.20 us/run -  20.90 MFLOP/run -   4.02 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               204800 runs -     5.03 us/run -   2.78 MFLOP/run - 553.85 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                62846 runs -    16.02 us/run -  22.28 MFLOP/run -   1.39 TFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              148257 runs -     6.75 us/run - 115.40 MFLOP/run -  17.11 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               52865 runs -    18.95 us/run - 923.24 MFLOP/run -  48.72 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    23155 runs -    43.27 us/run -   1.85 GFLOP/run -  42.73 TFLOPS
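For readers comparing the two listings by hand, here is a minimal sketch of how a per-case speedup can be read off the `us/run` column. The helper names (`us_per_run`, `speedup`) are made up for illustration and are not part of ggml or test-backend-ops; the two sample strings are the tails of the first f32 case above.

```python
import re

# Matches the timing field of a perf line, e.g. "3262.35 us/run".
US_PER_RUN = re.compile(r"([\d.]+)\s+us/run")

def us_per_run(line: str) -> float:
    """Extract the microseconds-per-run figure from one perf line."""
    match = US_PER_RUN.search(line)
    if match is None:
        raise ValueError(f"no us/run field in: {line!r}")
    return float(match.group(1))

def speedup(before_line: str, after_line: str) -> float:
    """Ratio of before/after time; values above 1.0 mean 'after' is faster."""
    return us_per_run(before_line) / us_per_run(after_line)

# Tails of the first f32 case from the 5090 "before" and "after" listings.
before = "307 runs -  3262.35 us/run - 137.42 GFLOP/run -  42.12 TFLOPS"
after = "952 runs -  1051.23 us/run - 137.42 GFLOP/run - 130.72 TFLOPS"

print(f"{speedup(before, after):.2f}x")  # prints "3.10x"
```

The same ratio falls out of the TFLOPS column (130.72 / 42.12), since the GFLOP/run figure is identical in both listings.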

stable-diffusion:

5090 before:

Vulkan Timings:
ADD: 85 x 68.008 us
CONT: 3 x 23.626 us
CONV_2D M=Cout=128, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 5 x 1785.04 us (43290.9 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 3306.5 us (46752 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=256, N=N*OW*OH=262144: 1 x 653.312 us (26245.2 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 6177.79 us (50045.5 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=65536: 5 x 1771.92 us (43620.9 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 3477.5 us (44457.8 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=512, N=N*OW*OH=65536: 1 x 557.056 us (30810.4 GFLOPS/s)
CONV_2D M=Cout=3, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 1 x 483.328 us (3747.25 GFLOPS/s)
CONV_2D M=Cout=4, K=Cin*KW*KH=4, N=N*OW*OH=4096: 1 x 4.352 us (26.3529 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=36, N=N*OW*OH=4096: 1 x 12.672 us (11750.1 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=16384: 7 x 1892.63 us (40843.2 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=4096: 10 x 692.732 us (27897.1 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 6333.79 us (48818.2 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=512, N=N*OW*OH=4096: 4 x 90.832 us (23619.3 GFLOPS/s)
GROUP_NORM: 30 x 842.821 us
MUL: 30 x 59.145 us
MUL_MAT m=4096 n=4096 k=512: 1 x 223.072 us (76939.7 GFLOPS/s)
MUL_MAT m=512 n=4096 k=4096: 1 x 393.632 us (43639.2 GFLOPS/s)
SCALE: 1 x 26.592 us
SILU: 29 x 64.174 us
SOFT_MAX: 1 x 45.056 us
UPSCALE: 3 x 159.05 us
Total time: 95267.3 us.

5090 after:

Vulkan Timings:
ADD: 85 x 64.818 us
CONT: 3 x 23.946 us
CONV_2D M=Cout=128, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 5 x 549.702 us (140578 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 1081.38 us (142952 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=256, N=N*OW*OH=262144: 1 x 296.928 us (57745.7 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 1934.34 us (159833 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=65536: 5 x 565.254 us (136740 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 1142.78 us (135285 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=512, N=N*OW*OH=65536: 1 x 215.072 us (79801.6 GFLOPS/s)
CONV_2D M=Cout=3, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 1 x 335.872 us (5392.39 GFLOPS/s)
CONV_2D M=Cout=4, K=Cin*KW*KH=4, N=N*OW*OH=4096: 1 x 5.6 us (20.48 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=36, N=N*OW*OH=4096: 1 x 10.656 us (13973.1 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=16384: 7 x 626.902 us (123306 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=4096: 10 x 328.428 us (58841.5 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 1936.38 us (159681 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=512, N=N*OW*OH=4096: 4 x 37.312 us (57498.6 GFLOPS/s)
GROUP_NORM: 30 x 828.746 us
MUL: 30 x 59.382 us
MUL_MAT m=4096 n=4096 k=512: 1 x 93.44 us (183680 GFLOPS/s)
MUL_MAT m=512 n=4096 k=4096: 1 x 100.704 us (170577 GFLOPS/s)
SCALE: 1 x 26.656 us
SILU: 29 x 63.516 us
SOFT_MAX: 1 x 40.928 us
UPSCALE: 3 x 165.888 us
Total time: 55182.3 us.
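The timing lines above appear to have the form `count x average us`; summing `count * average` over every line of the "before" listing reproduces its "Total time" to within rounding. A small sketch of that aggregation, with hypothetical helper names (`op_total_us`, `totals_by_op`) and only three of the lines pasted in as sample input:

```python
import re

# "OP[ shape details]: COUNT x AVG us", optionally followed by "(... GFLOPS/s)".
TIMING = re.compile(r"^(?P<op>[A-Z0-9_]+)[^:]*: (?P<count>\d+) x (?P<avg>[\d.]+) us")

def op_total_us(line: str) -> tuple[str, float]:
    """Return (op name, count * average microseconds) for one timing line."""
    m = TIMING.match(line.strip())
    if m is None:
        raise ValueError(f"not a timing line: {line!r}")
    return m["op"], int(m["count"]) * float(m["avg"])

def totals_by_op(lines):
    """Sum the total microseconds per op name across all given lines."""
    totals = {}
    for line in lines:
        op, us = op_total_us(line)
        totals[op] = totals.get(op, 0.0) + us
    return totals

# A few lines copied from the 5090 "before" listing (not the full log).
sample_before = [
    "CONV_2D M=Cout=128, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 5 x 1785.04 us (43290.9 GFLOPS/s)",
    "CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=16384: 7 x 1892.63 us (40843.2 GFLOPS/s)",
    "GROUP_NORM: 30 x 842.821 us",
]
print(totals_by_op(sample_before))

# End-to-end ratio taken from the two "Total time" lines.
print(f"end-to-end: {95267.3 / 55182.3:.2f}x")  # prints "end-to-end: 1.73x"
```

Summing all CONV_2D lines this way gives roughly 59.3 ms before and 20.4 ms after, which accounts for most of the drop in total time.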

@jeffbolznv jeffbolznv requested a review from 0cc4m July 31, 2025 04:00
@jeffbolznv jeffbolznv marked this pull request as draft July 31, 2025 04:00
@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Jul 31, 2025
@Green-Sky
Collaborator

Green-Sky commented Jul 31, 2025

I upgraded the NVIDIA driver and the shader compiler and did a quick test.
This patch gives sd.cpp VAE decoding a ~2.48x speedup.

sd2, 512x512.

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

before (with the previous PR):
taking 0.72s

after:
taking 0.29s

I also noticed that the first run of a specific pipeline seems to take longer.

e.g. fresh after compilation:

[INFO ] stable-diffusion.cpp:1826 - decoding 9 latents
[INFO ] stable-diffusion.cpp:1836 - latent 1 decoded, taking 0.88s
[INFO ] stable-diffusion.cpp:1836 - latent 2 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 3 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 4 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 5 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 6 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 7 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 8 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 9 decoded, taking 0.29s

Subsequent runs don't show this slow first latent.
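When quoting a steady-state number from a log like the one above, it helps to set the first iteration aside. A minimal sketch, assuming the log format shown above; `decode_times` and `split_warmup` are hypothetical helpers, and the thread does not say what causes the slower first latent.

```python
import re
from statistics import median

# Matches e.g. "latent 1 decoded, taking 0.88s".
TAKING = re.compile(r"latent (\d+) decoded, taking ([\d.]+)s")

def decode_times(log: str) -> list[float]:
    """Per-latent decode times in seconds, in log order."""
    return [float(seconds) for _, seconds in TAKING.findall(log)]

def split_warmup(times, warmup=1):
    """Separate the first `warmup` measurements from the rest."""
    return times[:warmup], times[warmup:]

# First four lines of the log above (the remaining latents also report 0.29s).
log = """
[INFO ] stable-diffusion.cpp:1836 - latent 1 decoded, taking 0.88s
[INFO ] stable-diffusion.cpp:1836 - latent 2 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 3 decoded, taking 0.29s
[INFO ] stable-diffusion.cpp:1836 - latent 4 decoded, taking 0.29s
"""

first, steady = split_warmup(decode_times(log))
print("first run:", first)
print("steady-state median:", median(steady))
```

On this sample the first latent costs about 0.59 s more than the steady-state 0.29 s.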


Edit: sampling speed is now also faster with conv2d_direct used in the diffusion model.
Flash attention was enabled for the diffusion model.

enabled: 7.50it/s
disabled: 7.21it/s

@Green-Sky
Collaborator

perf:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

before:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     31 runs - 32451.00 us/run - 137.42 GFLOP/run -   4.23 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               18700 runs -    54.37 us/run - 133.69 MFLOP/run -   2.46 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14003 runs -    74.79 us/run - 135.78 MFLOP/run -   1.82 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                204800 runs -     5.00 us/run - 642.82 kFLOP/run - 128.55 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                43074 runs -    24.64 us/run -  20.90 MFLOP/run - 847.95 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                73728 runs -    15.06 us/run -   2.78 MFLOP/run - 184.89 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -   106.79 us/run -  22.28 MFLOP/run - 208.63 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               20808 runs -    49.14 us/run - 115.40 MFLOP/run -   2.35 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3052 runs -   339.33 us/run - 923.24 MFLOP/run -   2.72 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1540 runs -   665.94 us/run -   1.85 GFLOP/run -   2.78 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     33 runs - 31061.55 us/run - 137.42 GFLOP/run -   4.42 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               18700 runs -    54.72 us/run - 133.69 MFLOP/run -   2.44 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               13266 runs -    75.64 us/run - 135.78 MFLOP/run -   1.80 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                196608 runs -     5.10 us/run - 642.82 kFLOP/run - 126.12 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                43074 runs -    25.25 us/run -  20.90 MFLOP/run - 827.65 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                65536 runs -    15.34 us/run -   2.78 MFLOP/run - 181.57 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -   109.01 us/run -  22.28 MFLOP/run - 204.37 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               20808 runs -    48.75 us/run - 115.40 MFLOP/run -   2.37 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3052 runs -   338.26 us/run - 923.24 MFLOP/run -   2.73 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1540 runs -   661.42 us/run -   1.85 GFLOP/run -   2.80 TFLOPS

after:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     93 runs - 10853.65 us/run - 137.42 GFLOP/run -  12.66 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               35156 runs -    29.01 us/run - 133.69 MFLOP/run -   4.61 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26532 runs -    38.19 us/run - 135.78 MFLOP/run -   3.56 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                180224 runs -     5.81 us/run - 642.82 kFLOP/run - 110.72 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                47860 runs -    22.64 us/run -  20.90 MFLOP/run - 922.98 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                65536 runs -    15.81 us/run -   2.78 MFLOP/run - 176.13 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -   111.26 us/run -  22.28 MFLOP/run - 200.24 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               47685 runs -    21.00 us/run - 115.40 MFLOP/run -   5.49 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6867 runs -   147.92 us/run - 923.24 MFLOP/run -   6.24 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3740 runs -   267.41 us/run -   1.85 GFLOP/run -   6.91 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    112 runs -  8989.54 us/run - 137.42 GFLOP/run -  15.29 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               34408 runs -    29.24 us/run - 133.69 MFLOP/run -   4.57 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26532 runs -    38.36 us/run - 135.78 MFLOP/run -   3.54 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                172032 runs -     5.86 us/run - 642.82 kFLOP/run - 109.67 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                47860 runs -    22.67 us/run -  20.90 MFLOP/run - 921.58 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                65536 runs -    15.74 us/run -   2.78 MFLOP/run - 176.89 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -   111.05 us/run -  22.28 MFLOP/run - 200.62 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               49419 runs -    20.54 us/run - 115.40 MFLOP/run -   5.62 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                7085 runs -   143.28 us/run - 923.24 MFLOP/run -   6.44 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3795 runs -   264.19 us/run -   1.85 GFLOP/run -   7.00 TFLOPS
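To summarise the RTX 2070 listings in one place, here is a sketch that computes per-case speedups for the f32 rows, their geometric mean, and which cases got slower. The `us/run` values are copied from the two listings above; the short labels and helper names are made up for illustration.

```python
from math import prod

# (label, before us/run, after us/run) for the ten f32 cases, in listing order.
CASES = [
    ("kernel=[4,4,256,4096]", 32451.00, 10853.65),
    ("kernel=[4,4,8,128]", 54.37, 29.01),
    ("kernel=[4,4,8,130]", 74.79, 38.19),
    ("kernel=[2,2,4,4]", 5.00, 5.81),
    ("kernel=[3,3,3,8]", 24.64, 22.64),
    ("kernel=[2,2,1,8], batch 1", 15.06, 15.81),
    ("kernel=[2,2,1,8], batch 8", 106.79, 111.26),
    ("kernel=[3,3,32,64], batch 1", 49.14, 21.00),
    ("kernel=[3,3,32,64], batch 8", 339.33, 147.92),
    ("kernel=[3,3,128,512]", 665.94, 267.41),
]

def speedups(cases):
    """Map each label to before/after; values above 1.0 mean 'after' is faster."""
    return {label: before / after for label, before, after in cases}

def geomean(values):
    """Geometric mean, the usual way to average ratios."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))

ratios = speedups(CASES)
slower = [label for label, ratio in ratios.items() if ratio < 1.0]

for label, ratio in ratios.items():
    print(f"{label:30s} {ratio:.2f}x")
print(f"geometric mean: {geomean(ratios.values()):.2f}x")
print("slower after:", slower)
```

Three of the small cases come out slightly below 1.0x here; the differences are small, and nothing in the thread says whether they exceed run-to-run noise.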

@0cc4m
Collaborator

0cc4m commented Aug 1, 2025

Looks good:

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

PR 14933:
CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    118 runs -  8525.57 us/run - 137.42 GFLOP/run -  16.12 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               52360 runs -    19.35 us/run - 133.69 MFLOP/run -   6.91 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               42009 runs -    23.88 us/run - 135.78 MFLOP/run -   5.69 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                262144 runs -     3.90 us/run - 642.82 kFLOP/run - 164.64 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                67004 runs -    15.44 us/run -  20.90 MFLOP/run -   1.35 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                98304 runs -    10.80 us/run -   2.78 MFLOP/run - 257.81 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                22445 runs -    47.54 us/run -  22.28 MFLOP/run - 468.62 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               48552 runs -    20.96 us/run - 115.40 MFLOP/run -   5.51 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10028 runs -   100.70 us/run - 923.24 MFLOP/run -   9.17 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     5115 runs -   197.22 us/run -   1.85 GFLOP/run -   9.37 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    120 runs -  8375.70 us/run - 137.42 GFLOP/run -  16.41 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               52360 runs -    19.36 us/run - 133.69 MFLOP/run -   6.91 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               42009 runs -    23.90 us/run - 135.78 MFLOP/run -   5.68 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                262144 runs -     3.84 us/run - 642.82 kFLOP/run - 167.32 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                86148 runs -    12.05 us/run -  20.90 MFLOP/run -   1.73 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               131072 runs -     7.99 us/run -   2.78 MFLOP/run - 348.52 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                26934 runs -    40.80 us/run -  22.28 MFLOP/run - 546.07 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               48552 runs -    20.84 us/run - 115.40 MFLOP/run -   5.54 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10028 runs -   100.65 us/run - 923.24 MFLOP/run -   9.17 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     5115 runs -   196.76 us/run -   1.85 GFLOP/run -   9.40 TFLOPS


This PR:
CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    325 runs -  3083.60 us/run - 137.42 GFLOP/run -  44.57 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               80036 runs -    12.50 us/run - 133.69 MFLOP/run -  10.69 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               68541 runs -    14.65 us/run - 135.78 MFLOP/run -   9.27 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                270336 runs -     3.75 us/run - 642.82 kFLOP/run - 171.41 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                90934 runs -    11.04 us/run -  20.90 MFLOP/run -   1.89 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               122880 runs -     8.35 us/run -   2.78 MFLOP/run - 333.37 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                26934 runs -    43.77 us/run -  22.28 MFLOP/run - 508.98 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               78030 runs -    12.83 us/run - 115.40 MFLOP/run -   9.00 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               18748 runs -    53.52 us/run - 923.24 MFLOP/run -  17.25 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    10285 runs -    97.25 us/run -   1.85 GFLOP/run -  19.01 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    335 runs -  2989.67 us/run - 137.42 GFLOP/run -  45.97 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               80036 runs -    12.56 us/run - 133.69 MFLOP/run -  10.64 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               67804 runs -    14.75 us/run - 135.78 MFLOP/run -   9.20 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                278528 runs -     3.60 us/run - 642.82 kFLOP/run - 178.81 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                90934 runs -    11.00 us/run -  20.90 MFLOP/run -   1.90 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               122880 runs -     8.41 us/run -   2.78 MFLOP/run - 331.00 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                26934 runs -    43.74 us/run -  22.28 MFLOP/run - 509.32 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               79764 runs -    12.59 us/run - 115.40 MFLOP/run -   9.16 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19184 runs -    52.28 us/run - 923.24 MFLOP/run -  17.66 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    10505 runs -    95.39 us/run -   1.85 GFLOP/run -  19.38 TFLOPS

@jeffbolznv jeffbolznv marked this pull request as ready for review August 2, 2025 13:36
@jeffbolznv jeffbolznv changed the title Draft: vulkan: Use coopmat2 for conv2d vulkan: Use coopmat2 for conv2d Aug 2, 2025
@etasnadi
Contributor

etasnadi commented Aug 2, 2025

The 4096 by 4096 case is unfortunately somewhat slower than im2col; however, that is a synthetic test, so it's not a high priority.

From #14933:

CONV_2D_IM2COL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            1192 runs -   839.13 us/run - 137.42 GFLOP/run - 163.77 TFLOPS
CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    952 runs -  1051.23 us/run - 137.42 GFLOP/run - 130.72 TFLOPS
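For context on this case, here is a sketch of how the shapes map to the M/K/N sizes printed in the stable-diffusion timing log (M=Cout, K=Cin*KW*KH, N=N*OW*OH). Two things are assumptions on my part: that `ne_input` is ordered [W, H, Cin, N] and `ne_kernel` is [KW, KH, Cin, Cout], and that the FLOP count is M*N*(2K-1). The latter happens to reproduce the reported 137.42 GFLOP/run, but it is not taken from the ggml source. The function names are hypothetical.

```python
def conv2d_gemm_dims(ne_input, ne_kernel, stride=1, padding=0, dilation=1):
    """Map CONV_2D shapes to implicit-GEMM sizes (M, K, N).

    Assumes ne_input = [W, H, Cin, N] and ne_kernel = [KW, KH, Cin, Cout].
    """
    iw, ih, cin, batch = ne_input
    kw, kh, kernel_cin, cout = ne_kernel
    if cin != kernel_cin:
        raise ValueError("input and kernel channel counts differ")
    # Standard convolution output-size formula.
    ow = (iw + 2 * padding - dilation * (kw - 1) - 1) // stride + 1
    oh = (ih + 2 * padding - dilation * (kh - 1) - 1) // stride + 1
    return cout, cin * kw * kh, batch * ow * oh

def conv2d_flops(m, k, n):
    """K multiplies plus K-1 adds per output element (assumed counting rule)."""
    return m * n * (2 * k - 1)

m, k, n = conv2d_gemm_dims([19, 19, 256, 16], [4, 4, 256, 4096])
flops = conv2d_flops(m, k, n)
print(m, k, n)                      # prints "4096 4096 4096"
print(f"{flops / 1e9:.2f} GFLOP")   # prints "137.42 GFLOP"

# Throughput and relative time for the two lines quoted above.
for name, us in [("im2col", 839.13), ("direct", 1051.23)]:
    print(f"{name}: {flops / (us * 1e-6) / 1e12:.2f} TFLOPS")
print(f"direct / im2col time: {1051.23 / 839.13:.2f}")
```

Under these assumptions M, K and N are all 4096 for this shape, which is presumably what "4096 by 4096" refers to, and the direct path takes about 25% more time per run than im2col.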

@jeffbolznv
Collaborator Author

I have a couple more small changes that get another 10% or so, but haven't matched im2col for that case yet. I'll put those in a separate PR after this merges.
