Pruning for Large Language Models — From SparseGPT to KV-Cache Pruning31 March 2026AI Accelerator Pruning LLM SparseGPT Wanda Model Compression Sparsity KV-Cache Transformer Inference Optimization Structured Pruning Unstructured Pruning 2:4 Sparsity SliceGPT Attention Head Pruning Dynamic Sparsity
Extreme and Mixed-Precision Quantization: From FP8 to Binary Neural Networks31 March 2026AI Accelerator Quantization FP8 INT4 Binary Neural Networks BitNet QuIP AQLM HQQ Mixed Precision LLM Optimization Model Compression GGUF KV-Cache Vision Transformer Diffusion Models Inference Optimization