Testing IQ5_K
Hardware: W790E Sage + QYFS + 512 GB RAM + RTX 5090
Computed blk.46.attn_kv_b.weight as 512 x 8960 and stored in buffer CUDA0
=====================================
llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 100096
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 2745.81 MiB
llama_new_context_with_model: KV self size = 2745.78 MiB, c^KV (q8_0): 2745.78 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 4277.20 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 814.05 MiB
llama_new_context_with_model: graph nodes = 4171
llama_new_context_with_model: graph splits = 2
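Summing the two CUDA0 allocations reported above gives the per-context VRAM overhead on the 5090, on top of whatever model weights live on the GPU. A quick check using the MiB figures straight from the log:

```python
# CUDA0 allocations from the log above, in MiB.
kv_buffer = 2745.81       # llama_kv_cache_init: CUDA0 KV buffer size
compute_buffer = 4277.20  # llama_new_context_with_model: CUDA0 compute buffer size

total_mib = kv_buffer + compute_buffer
print(f"context overhead on CUDA0: {total_mib:.2f} MiB (~{total_mib / 1024:.1f} GiB)")
# → context overhead on CUDA0: 7023.01 MiB (~6.9 GiB)
```

So roughly 7 GiB of the 32 GiB card goes to the q8_0 KV cache plus the 4096-token ubatch compute buffer before any weights are counted.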
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
main: n_kv_max = 100096, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 0.623 | 6579.74 | 6.467 | 158.35 |
| 4096 | 1024 | 4096 | 0.644 | 6362.65 | 6.940 | 147.55 |
| 4096 | 1024 | 8192 | 0.785 | 5218.57 | 8.210 | 124.73 |
| 4096 | 1024 | 12288 | 0.867 | 4724.69 | 8.661 | 118.24 |
| 4096 | 1024 | 16384 | 0.995 | 4118.57 | 9.505 | 107.74 |
| 4096 | 1024 | 20480 | 1.163 | 3521.51 | 10.317 | 99.25 |
| 4096 | 1024 | 24576 | 1.363 | 3005.77 | 10.811 | 94.72 |
| 4096 | 1024 | 28672 | 1.337 | 3062.61 | 11.633 | 88.03 |
| 4096 | 1024 | 32768 | 1.467 | 2792.33 | 12.991 | 78.82 |
| 4096 | 1024 | 36864 | 1.617 | 2532.51 | 13.217 | 77.48 |
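As a sanity check on the table above (assuming the usual llama-sweep-bench column semantics, where S_PP = PP / T_PP and S_TG = TG / T_TG), the throughput columns can be recomputed from the timing columns:

```python
# Recompute throughput from the timings in the sweep-bench table above.
# Columns: PP, TG, N_KV, T_PP (s), S_PP (t/s), T_TG (s), S_TG (t/s).
rows = [
    (4096, 1024,     0, 0.623, 6579.74,  6.467, 158.35),
    (4096, 1024,  4096, 0.644, 6362.65,  6.940, 147.55),
    (4096, 1024, 36864, 1.617, 2532.51, 13.217,  77.48),
]

for pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg in rows:
    # Throughput = tokens / seconds; the printed timings are rounded,
    # so allow ~1% relative error.
    assert abs(pp / t_pp - s_pp) / s_pp < 0.01
    assert abs(tg / t_tg - s_tg) / s_tg < 0.01
```

The columns are self-consistent; the visible trend is the expected one, with both PP and TG throughput falling as N_KV grows and attention over the cache dominates.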

