# InferenceArena

## SmolLM2-135M

Benchmark config: seq_len=128, float32, input=[0,1,…,127].
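The timing columns can be reproduced with a harness along these lines. This is a minimal sketch, not the repo's actual code: the warmup count, run count, and use of the median are assumptions, and `forward` here is a stand-in for a real framework call.

```python
import statistics
import time

def bench(fn, warmup=3, runs=10):
    """Median wall-clock time of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Token IDs 0..127, as in the benchmark config (seq_len=128).
tokens = list(range(128))

# Stand-in for a real forward pass; a framework inference call goes here.
def forward():
    return sum(t * t for t in tokens)

print(f"inference: {bench(forward):.3f} ms")
```

Note that timing GPU backends this way requires synchronizing the device before reading the clock, which is one reason the Latency and Inference columns can diverge.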

| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 135.63 | 188 | 18 | 486 | 10.98 |
| | ONNX Runtime 1.24.4 (CPU) | 65.50 | 118 | 20 | — | 10.98 |
| | JAX 0.9.2 (CPU) | 6.79 | 194 | 31 | 2107 | 10.98 |
| | Candle (CPU) | 0.31 | 453 | 61 | — | 11.11 |
| | Luminal (CPU) | 3.37 | 17006 | 14459 | — | 10.81 |
| | Burn (wgpu/Lavapipe) | 0.00 | 2369 | 320 | 5700 | 11.73 |
| | Meganeura (Vulkan/Lavapipe) | 7.29 | 3933 | 852 | 3651 | 10.99 |
| | llama.cpp (CPU) | 0.10 | 221 | 24 | — | 10.98 |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 53.15 | 65 | 21 | 125 | 8.35 |
| | Candle (CPU) | 0.00 | 255 | 9 | — | 10.80 |
| | Burn (wgpu/vulkan) | 0.00 | 178 | 37 | 211 | 11.51 |
| | Luminal (CPU) | 1.53 | 7719 | 7591 | — | 10.81 |
| | Meganeura (Vulkan) | 1.06 | 79 | 24 | 67 | 8.64 |
| | GGML (CPU) | 0.04 | 862 | 15 | — | 8.69 |
| | ONNX Runtime (CPUExecutionProvider) | 8.73 | 77 | 14 | — | 6.01 |
| | JAX (CPU) | 4.35 | 173 | 31 | 359 | 5.79 |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 216 | — | 354 | 8.35 |
| | MLX (MLX) | 0.00 | 82 | — | 188 | 8.64 |
| | Candle (CPU) | 0.05 | 254 | — | — | 10.80 |
| | Burn (wgpu) | 0.00 | 925 | — | 797 | 11.37 |
| | Luminal (CPU) | 1.95 | 14650 | 14704 | — | 10.81 |
| | Meganeura (Metal) | 1.60 | 99 | — | 80 | 9.46 |
| Intel Graphics (RPL-U) | PyTorch | — | — | — | — | — |
| | Candle (CPU) | 0.00 | 605 | — | — | 10.80 |
| | Burn (wgpu) | 0.00 | 1502 | — | 5248 | 11.82 |
| | Luminal (CPU) | 3.70 | 15959 | 15948 | — | 10.81 |
| | Meganeura (Vulkan) | 2.34 | 227 | — | 190 | 8.64 |
| | llama.cpp (CPU) | — | — | — | — | — |

Correctness:
- PyTorch vs ONNX Runtime: PASS (loss diff 3.2e-3)
- PyTorch vs JAX: PASS (loss diff 3.2e-3)
- PyTorch vs Meganeura: PASS (max error 1.7e-6, loss diff 5.3e-3)
- PyTorch vs llama.cpp: PASS (loss diff 4.5e-3)
- Candle, Luminal: CLOSE

Struck-through values are from frameworks running a different (simplified) model.
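The PASS/CLOSE verdicts above amount to a tolerance check on element-wise output error and on the final loss. A sketch of that shape of check (the threshold values here are illustrative assumptions, not the harness's actual tolerances):

```python
def compare(ref, test, loss_ref, loss_test, max_err_tol=1e-4, loss_tol=1e-2):
    """Classify agreement between two frameworks' outputs and losses."""
    max_err = max(abs(a - b) for a, b in zip(ref, test))
    loss_diff = abs(loss_ref - loss_test)
    if max_err <= max_err_tol and loss_diff <= loss_tol:
        return "PASS", max_err, loss_diff
    if loss_diff <= 10 * loss_tol:
        return "CLOSE", max_err, loss_diff
    return "FAIL", max_err, loss_diff

# Toy logits agreeing to ~1.7e-6, in the spirit of PyTorch vs Meganeura above.
ref = [0.1, 0.2, 0.3]
out = [0.1 + 1.7e-6, 0.2, 0.3 - 1e-6]
print(compare(ref, out, 10.98, 10.985))
```

Comparing both a max element error and a scalar loss is useful because tiny per-element differences can still accumulate into a visible loss gap, and vice versa.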

Caveats:
- PyTorch and Meganeura load real model weights and run the full architecture; their outputs match.

Run it yourself: `git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolLM2-135M`

## SmolVLA

Benchmark config: chunk_size=50, vlm_seq_len=16, float32, random weights, MSE loss.
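Here chunk_size=50 is presumably the action chunk the policy regresses, and the Loss column is the MSE over that chunk. A toy version of the loss computation (the action dimension of 6 is an assumption for illustration, not taken from the config):

```python
def mse(pred, target):
    """Mean squared error over a flattened action chunk."""
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / n

chunk_size, action_dim = 50, 6           # action_dim=6 is an assumed example
pred = [0.0] * (chunk_size * action_dim)
target = [0.1] * (chunk_size * action_dim)
print(f"MSE: {mse(pred, target):.2f}")
```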

| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 51.63 | 40 | 11 | 116 | 0.00 |
| | Meganeura (Vulkan/Lavapipe) | 2.75 | 696 | — | 3850 | 0.01 |
| | ONNX Runtime (CPU) | — | — | — | — | — |
| | JAX (CPU) | — | — | — | — | — |
| | Candle (CPU) | — | — | — | — | — |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 19.71 | 26 | 14 | 47 | 0.00 |
| | Candle (CPU) | — | — | — | — | — |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Vulkan) | 0.46 | 30 | 20 | 25 | 0.01 |
| | GGML (CPU) | — | — | — | — | — |
| | ONNX Runtime (CPU) | — | — | — | — | — |
| | JAX (CPU) | — | — | — | — | — |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 199 | — | 136 | 0.00 |
| | MLX (MLX) | 0.00 | 14 | — | 105 | 0.00 |
| | Candle (CPU) | — | — | — | — | — |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Metal) | 0.59 | 53 | — | 32 | 0.01 |
| Intel Graphics (RPL-U) | PyTorch | — | — | — | — | — |
| | Candle (CPU) | — | — | — | — | — |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Vulkan) | 2.62 | 114 | — | 188 | 0.01 |
| | llama.cpp (CPU) | — | — | — | — | — |

Correctness: PyTorch vs Meganeura: CLOSE (loss diff 1e-5, max error 4.6e-3).

Caveats:
- PyTorch and Meganeura implement the full action expert architecture and should produce matching outputs.

Run it yourself: `git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolVLA`

## StableDiffusion

Two benchmark configurations are used depending on framework capabilities:

Simplified U-Net (PyTorch, Meganeura): Conv-only U-Net without cross-attention or timestep embedding. Batch 2, 32×32×4 latent, base_channels=64, 3 levels (~2M params). Enables apples-to-apples comparison of Conv2D + GroupNorm + skip connection performance.

Full SD 1.5 U-Net (Candle): Complete architecture with cross-attention, timestep embedding, ~860M params, 64×64×4 latent. Shown struck through in the table as a different model.
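The ~2M figure for the simplified U-Net can be sanity-checked with back-of-the-envelope arithmetic. The exact layer layout below is a guess (two 3×3 convs per level on each path, channel doubling, skip concatenation), not the repo's actual model; only base_channels=64, 3 levels, and the 32×32×4 latent come from the description above.

```python
def conv2d_params(c_in, c_out, k=3):
    """Weights plus bias of a single k x k Conv2D layer."""
    return c_in * c_out * k * k + c_out

channels = [64, 128, 256]           # base_channels=64, doubling over 3 levels
total = 0

# Encoder: one channel-changing conv plus one refinement conv per level.
c_prev = 4                          # 32x32x4 latent input
for c in channels:
    total += conv2d_params(c_prev, c) + conv2d_params(c, c)
    c_prev = c

# Decoder: input is upsampled features concatenated with the skip connection.
for c, skip in [(128, 128), (64, 64)]:
    total += conv2d_params(c_prev + skip, c) + conv2d_params(c, c)
    c_prev = c

total += conv2d_params(64, 4)       # project back to 4 latent channels
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the count lands just under 2M, consistent with the "~2M params" quoted above; GroupNorm affine parameters would add only a few thousand more.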

| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 53.02 | 14 | 11 | 28 | 0.57 |
| | Meganeura (Vulkan/Lavapipe) | 2.75 | 379 | — | 666 | 0.57 |
| | Candle (CPU) | ~~0.00~~ | ~~10777~~ | — | — | ~~0.00~~ |
| | ONNX Runtime (CPU) | — | — | — | — | — |
| | JAX (CPU) | — | — | — | — | — |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 12.51 | 3 | 2 | 6 | 0.57 |
| | Candle (CPU) | ~~0.00~~ | ~~5529~~ | — | — | ~~0.00~~ |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Vulkan) | 1.32 | 9 | 10 | 19 | 0.57 |
| | GGML (CPU) | — | — | — | — | — |
| | ONNX Runtime (CPU) | — | — | — | — | — |
| | JAX (CPU) | — | — | — | — | — |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 504 | — | 192 | 0.57 |
| | MLX (MLX) | 0.00 | 6 | — | 8 | 0.51 |
| | Candle (CPU) | ~~0.08~~ | ~~9962~~ | — | — | ~~0.00~~ |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Metal) | 3.29 | 21 | — | 490 | 0.57 |
| Intel Graphics (RPL-U) | PyTorch | — | — | — | — | — |
| | Candle (CPU) | ~~0.00~~ | ~~16876~~ | — | — | ~~0.00~~ |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Vulkan) | 3.56 | 44 | — | 38 | 0.57 |
| | llama.cpp (CPU) | — | — | — | — | — |

Run `./run.sh -m StableDiffusion` to populate this table.

Caveats:
- Only the U-Net is benchmarked (not VAE encode/decode or text encoding).

Run it yourself: `git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m StableDiffusion`

## ResNet-50

Benchmark config: batch=4, 3×224×224, float32, random weights, cross-entropy loss.
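The 6.91–6.92 losses clustered in this table are what cross-entropy should give an untrained classifier: with random weights the softmax output is roughly uniform, so the expected loss is ln(num_classes). The 1000-class figure is an assumption based on the standard ImageNet-style ResNet-50 head; the config above doesn't state it.

```python
import math

num_classes = 1000                     # standard ImageNet head (assumed)
uniform_ce = math.log(num_classes)     # cross-entropy of a uniform prediction
print(f"ln({num_classes}) = {uniform_ce:.2f}")
```

Losses well away from this value (such as the 10.10 for PyTorch on the Xeon) suggest the run used a different initialization or output scale rather than a near-uniform softmax.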

| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 60.61 | 141 | 40 | 284 | 10.10 |
| | ONNX Runtime 1.24.4 (CPU) | 0.28 | 76 | 18 | — | 10.37 |
| | Candle (CPU) | 0.00 | 782 | 311 | — | 6.91 |
| | Meganeura (Vulkan/Lavapipe) | 0.98 | 3906 | 1192 | — | — |
| | Burn (wgpu) | — | — | — | — | — |
| | JAX (CPU) | — | — | — | — | — |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 36.24 | 48 | 15 | 89 | 6.92 |
| | Candle (CPU) | 0.00 | 487 | 158 | — | 6.91 |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Vulkan) | 0.22 | 84 | 25 | — | 6.92 |
| | GGML (CPU) | — | — | — | — | — |
| | ONNX Runtime (CPUExecutionProvider) | 2.30 | 44 | 13 | — | 6.92 |
| | JAX (CPU) | 1.43 | 103 | 43 | 282 | 6.92 |

Correctness: PyTorch vs ONNX Runtime: CLOSE (loss diff 0.27, rel error 8.8%).

Run it yourself: `git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m ResNet-50`

## Whisper-tiny

Benchmark config: 30s mel spectrogram (80×3000), 4-token decoder input, float32, random weights.
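The 80×3000 shape follows from Whisper's standard audio front end: 30 seconds of 16 kHz audio with an STFT hop of 160 samples gives 3000 frames of 80 mel bins. The sample rate and hop length are Whisper's published defaults, not values stated in the config above.

```python
sample_rate = 16_000   # Hz, Whisper's expected input rate
hop_length = 160       # samples between STFT frames (10 ms)
n_mels = 80            # mel filterbank bins
seconds = 30           # fixed Whisper context window

frames = seconds * sample_rate // hop_length
print(f"mel spectrogram: {n_mels} x {frames}")
```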

| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 39.88 | 150 | — | 371 | 11.80 |
| | ONNX Runtime 1.24.4 (CPU) | 0.84 | 212 | — | — | 11.80 |
| | Candle (CPU) | 0.01 | 616 | — | — | 0.00 |
| | Meganeura (Vulkan/Lavapipe) | 7.84 | 53467 | — | — | 0.01 |
| | Burn (wgpu) | — | — | — | — | — |
| | JAX (CPU) | — | — | — | — | — |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 16.57 | 82 | 63 | 207 | 0.01 |
| | Candle (CPU) | 0.00 | 345 | — | — | 0.00 |
| | Burn (wgpu) | — | — | — | — | — |
| | Luminal (CPU) | — | — | — | — | — |
| | Meganeura (Vulkan) | 0.14 | 128 | 115 | — | 0.01 |
| | GGML (faster-whisper (CTranslate2)) | 24.36 | 250 | 229 | — | 0.00 |
| | ONNX Runtime (CPUExecutionProvider) | 3.36 | 72 | — | — | 0.01 |
| | JAX (CPU) | 1.82 | 294 | 236 | 526 | 0.01 |

Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 0.0).

Caveats:
- Uses a custom tiny config (4+4 layers, d=384), not the full whisper-tiny from OpenAI.

Run it yourself: `git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m Whisper-tiny`

Legend:
- **Bold** = best among matching frameworks
- ~~Struck through~~ = different / simplified model
- — (empty cell) = not supported
- Framework names link to the tested revision