Benchmark config: seq_len=128, float32, input=[0,1,…,127].
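For orientation, the timing columns can be reproduced with a harness along these lines. This is a minimal sketch with a dummy model: the actual measurement protocol lives in the repo's `run.sh`, and `steps=10` plus the exact definitions of "Inference" vs "Latency" here are assumptions for illustration.

```python
import time

def bench(model_fn, compile_fn=None, steps=10):
    """Toy harness mirroring the table columns (assumed protocol)."""
    t0 = time.perf_counter()
    if compile_fn is not None:
        compile_fn()                      # "Compile (s)"
    compile_s = time.perf_counter() - t0

    tokens = list(range(128))             # input = [0, 1, ..., 127]
    model_fn(tokens)                      # warm-up run

    t0 = time.perf_counter()
    model_fn(tokens)                      # "Latency (ms)": one warm call
    latency_ms = (time.perf_counter() - t0) * 1e3

    t0 = time.perf_counter()
    for _ in range(steps):                # "Inference (ms)": steady-state average
        model_fn(tokens)
    inference_ms = (time.perf_counter() - t0) / steps * 1e3
    return compile_s, inference_ms, latency_ms
```

Any callable taking the token list works as `model_fn`, which is how the same harness shape can wrap frameworks with very different APIs.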
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 135.63 | 188 | 18 | 486 | 10.98 |
| | ONNX Runtime 1.24.4 (CPU) | 65.50 | 118 | 20 | — | 10.98 |
| | JAX 0.9.2 (CPU) | 6.79 | 194 | 31 | 2107 | 10.98 |
| | Candle (CPU) | 0.31 | 453 | 61 | — | 11.11 |
| | Luminal (CPU) | 3.37 | 17006 | — | 14459 | 10.81 |
| | Burn (wgpu/Lavapipe) | | | | | |
| | Meganeura (Vulkan/Lavapipe) | 7.29 | 3933 | 852 | 3651 | 10.99 |
| | llama.cpp (CPU) | 0.10 | 221 | 24 | — | 10.98 |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 53.15 | 65 | 21 | 125 | 8.35 |
| | Candle (CPU) | | | | | |
| | Burn (wgpu/vulkan) | | | | | |
| | Luminal (CPU) | | | | | |
| | Meganeura (Vulkan) | 1.06 | 79 | 24 | 67 | 8.64 |
| | GGML (CPU) | 0.04 | 862 | 15 | — | 8.69 |
| | ONNX Runtime (CPUExecutionProvider) | | | | | |
| | JAX (CPU) | | | | | |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 216 | — | 354 | 8.35 |
| | MLX (MLX) | — | | | | |
| | Candle (CPU) | — | | | | |
| | Burn (wgpu) | — | | | | |
| | Luminal (CPU) | — | | | | |
| | Meganeura (Metal) | — | | | | |
| Intel Graphics (RPL-U) | PyTorch | ✗ | ✗ | ✗ | ✗ | |
| | Candle (CPU) | 0.00 | 605 | — | — | 10.80 |
| | Burn (wgpu) | — | | | | |
| | Luminal (CPU) | 3.70 | 15959 | — | 15948 | 10.81 |
| | Meganeura (Vulkan) | — | | | | |
| | llama.cpp (CPU) | ✗ | ✗ | ✗ | ✗ | |
Correctness:
- PyTorch vs ONNX Runtime: PASS (loss diff 3.2e-3).
- PyTorch vs JAX: PASS (loss diff 3.2e-3).
- PyTorch vs Meganeura: PASS (max error 1.7e-6, loss diff 5.3e-3).
- PyTorch vs llama.cpp: PASS (loss diff 4.5e-3).
- Candle, Luminal: CLOSE.

Struck-through values are from frameworks running a different (simplified) model.
Caveats:
- PyTorch and Meganeura load real model weights and run the full architecture; their outputs match.
- Candle runs the real LLaMA architecture with the same safetensors weights, but its `forward()` returns last-position logits only (private fields prevent retrieving all-position logits). Its loss is therefore computed over 1 position vs 128 for the others, hence the DIFFERENT MODEL flag in the correctness check. Timing is still valid. Backward is not yet wired.
- Burn and Luminal use a simplified model (single-head attention, no RoPE/RMSNorm) with random weights.
- Luminal backward is estimated as a second forward pass.
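To see why a loss computed over 1 position is not comparable to one computed over all 128, here is a pure-Python sketch with toy vocabulary and sequence sizes (random logits, not the real model). As a rough scale check: a uniform distribution over SmolLM2's ~49k-token vocabulary would give a cross-entropy of about ln 49152 ≈ 10.80, the same ballpark as the Loss column above.

```python
import math, random

def cross_entropy(logits, target):
    """CE for one position: -log softmax(logits)[target]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

random.seed(0)
vocab, seq = 16, 8                       # toy sizes, not SmolLM2's
logits = [[random.gauss(0, 1) for _ in range(vocab)] for _ in range(seq)]
targets = [random.randrange(vocab) for _ in range(seq)]

# Mean loss over all positions vs loss at just the last position:
loss_all = sum(cross_entropy(l, t) for l, t in zip(logits, targets)) / seq
loss_last = cross_entropy(logits[-1], targets[-1])
# The two generally differ, which is why Candle's single-position
# loss is flagged DIFFERENT MODEL in the correctness check.
```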
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolLM2-135M
Benchmark config: chunk_size=50, vlm_seq_len=16, float32, random weights, MSE loss.
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 51.63 | 40 | 11 | 116 | 0.00 |
| | Meganeura (Vulkan/Lavapipe) | 2.75 | 696 | — | 3850 | 0.01 |
| | ONNX Runtime (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Candle (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 19.71 | 26 | 14 | 47 | 0.00 |
| | Candle (CPU) | — | — | — | — | |
| | Burn (wgpu) | — | — | — | — | |
| | Luminal (CPU) | — | — | — | — | |
| | Meganeura (Vulkan) | 0.46 | 30 | 20 | 25 | 0.01 |
| | GGML (CPU) | — | — | — | — | |
| | ONNX Runtime (CPU) | — | — | — | — | |
| | JAX (CPU) | — | — | — | — | |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 199 | — | 136 | 0.00 |
| | MLX (MLX) | 0.00 | 14 | — | 105 | 0.00 |
| | Candle (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Metal) | 0.59 | 53 | — | 32 | 0.01 |
| Intel Graphics (RPL-U) | PyTorch | ✗ | ✗ | ✗ | ✗ | |
| | Candle (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Vulkan) | 2.62 | 114 | — | 188 | 0.01 |
| | llama.cpp (CPU) | ✗ | ✗ | ✗ | ✗ | |
Correctness: PyTorch vs Meganeura: CLOSE (loss diff 1e-5, max error 4.6e-3).
Caveats:
- PyTorch and Meganeura implement the full action expert architecture and should produce matching outputs.
- Burn and Luminal do not implement this architecture yet (reported as ✗).
- Inputs are synthetic: random noisy actions, sinusoidal timestep, random VLM context.
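The synthetic inputs described above can be sketched as follows. Only `chunk_size=50` and `vlm_seq_len=16` come from the config line; `action_dim=6`, `vlm_dim=768`, and the 64-wide sinusoidal embedding are assumed values for illustration.

```python
import math, random

random.seed(0)
chunk_size, action_dim = 50, 6     # action_dim is an assumption
vlm_seq_len, vlm_dim = 16, 768     # vlm_dim is an assumption

# Random noisy action chunk: chunk_size x action_dim.
actions = [[random.gauss(0, 1) for _ in range(action_dim)]
           for _ in range(chunk_size)]

def timestep_embedding(t, dim):
    """Diffusion-style sinusoidal encoding of a scalar timestep."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

t_emb = timestep_embedding(0.5, 64)

# Random VLM context tokens: vlm_seq_len x vlm_dim.
vlm_ctx = [[random.gauss(0, 1) for _ in range(vlm_dim)]
           for _ in range(vlm_seq_len)]
```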
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolVLA
Two benchmark configurations are used depending on framework capabilities:
Simplified U-Net (PyTorch, Meganeura): Conv-only U-Net without cross-attention or timestep embedding. Batch 2, 32×32×4 latent, base_channels=64, 3 levels (~2M params). Enables apples-to-apples comparison of Conv2D + GroupNorm + skip connection performance.
Full SD 1.5 U-Net (Candle): Complete architecture with cross-attention, timestep embedding, ~860M params, 64×64×4 latent. Marked DIFFERENT MODEL.
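As a sanity check on the simplified U-Net's parameter budget, per-layer Conv2D parameter counts can be computed directly. The channel progression 64→128→256 assumes `base_channels=64` doubled per level; the exact block layout is not specified here, so this is a sketch, not the benchmarked architecture.

```python
def conv2d_params(c_in, c_out, k=3):
    """Weights (c_out x c_in x k x k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out

channels = [64, 128, 256]             # assumed doubling per level
stem = conv2d_params(4, channels[0])  # first conv from the 4-channel latent
print(stem)                           # 4*64*9 + 64 = 2368
```

Summing such terms over the down/up path of a 3-level U-Net is how the quoted ~2M total comes about; the full SD 1.5 U-Net is ~430× larger.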
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 53.02 | 14 | 11 | 28 | 0.57 |
| | Meganeura (Vulkan/Lavapipe) | 2.75 | 379 | — | 666 | 0.57 |
| | Candle (CPU) | | | | | |
| | ONNX Runtime (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 12.51 | 3 | 2 | 6 | 0.57 |
| | Candle (CPU) | | | | | |
| | Burn (wgpu) | — | — | — | — | |
| | Luminal (CPU) | — | — | — | — | |
| | Meganeura (Vulkan) | 1.32 | 9 | 10 | 19 | 0.57 |
| | GGML (CPU) | — | — | — | — | |
| | ONNX Runtime (CPU) | — | — | — | — | |
| | JAX (CPU) | — | — | — | — | |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 504 | — | 192 | 0.57 |
| | MLX (MLX) | — | | | | |
| | Candle (CPU) | — | | | | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Metal) | 3.29 | 21 | — | 490 | 0.57 |
| Intel Graphics (RPL-U) | PyTorch | ✗ | ✗ | ✗ | ✗ | |
| | Candle (CPU) | 0.00 | 16876 | — | — | 0.00 |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Vulkan) | — | | | | |
| | llama.cpp (CPU) | ✗ | ✗ | ✗ | ✗ | |
Run `./run.sh -m StableDiffusion` to populate the empty rows in this table.
Caveats:
- Only the U-Net is benchmarked (not VAE encode/decode or text encoding).
- Input is deterministic synthetic data; no actual image generation.
- PyTorch and Meganeura use a simplified architecture for fair comparison.
- Candle runs the full SD 1.5 U-Net but on CPU only (DIFFERENT MODEL vs others).
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m StableDiffusion
Benchmark config: batch=4, 3x224x224, float32, random weights, cross-entropy loss.
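A quick scale check on the Loss column below: with random weights and an ImageNet-style 1000-way head (the head size is an assumption), near-uniform logits give a cross-entropy close to ln 1000 ≈ 6.91, which matches several of the reported values; larger losses suggest non-uniform initial logits.

```python
import math, random

random.seed(0)
num_classes = 1000   # assuming an ImageNet-style 1000-way classifier

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably via log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

# Tiny random logits => a nearly uniform softmax distribution.
logits = [random.gauss(0, 0.01) for _ in range(num_classes)]
loss = cross_entropy(logits, 0)
print(round(loss, 2))   # ≈ ln(1000) ≈ 6.91
```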
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 60.61 | 141 | 40 | 284 | 10.10 |
| | ONNX Runtime 1.24.4 (CPU) | 0.28 | 76 | 18 | — | 10.37 |
| | Candle (CPU) | | | | | |
| | Meganeura (Vulkan/Lavapipe) | | | | | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | ✗ | ✗ | ✗ | ✗ | |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 36.24 | 48 | 15 | 89 | 6.92 |
| | Candle (CPU) | 0.00 | 487 | 158 | — | 6.91 |
| | Burn (wgpu) | — | — | — | — | |
| | Luminal (CPU) | — | — | — | — | |
| | Meganeura (Vulkan) | 0.22 | 84 | 25 | — | 6.92 |
| | GGML (CPU) | — | — | — | — | |
| | ONNX Runtime (CPUExecutionProvider) | 2.30 | 44 | 13 | — | 6.92 |
| | JAX (CPU) | 1.43 | 103 | 43 | 282 | 6.92 |
Correctness: PyTorch vs ONNX Runtime: CLOSE (loss diff 0.27, rel error 8.8%).
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m ResNet-50
Benchmark config: 30s mel spectrogram (80x3000), 4-token decoder input, float32, random weights.
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 39.88 | 150 | — | 371 | 11.80 |
| | ONNX Runtime 1.24.4 (CPU) | 0.84 | 212 | — | — | 11.80 |
| | Candle (CPU) | | | | | |
| | Meganeura (Vulkan/Lavapipe) | | | | | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | ✗ | ✗ | ✗ | ✗ | |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 16.57 | 82 | 63 | 207 | 0.01 |
| | Candle (CPU) | 0.00 | 345 | — | — | 0.00 |
| | Burn (wgpu) | — | — | — | — | |
| | Luminal (CPU) | — | — | — | — | |
| | Meganeura (Vulkan) | 0.14 | 128 | 115 | — | 0.01 |
| | GGML (faster-whisper (CTranslate2)) | 24.36 | 250 | 229 | — | 0.00 |
| | ONNX Runtime (CPUExecutionProvider) | 3.36 | 72 | — | — | 0.01 |
| | JAX (CPU) | 1.82 | 294 | 236 | 526 | 0.01 |
Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 0.0).
Caveats:
- Uses a custom tiny config (4+4 layers, d=384), not the full whisper-tiny from OpenAI.
- Input is a synthetic mel spectrogram, not real audio.
- Decoder runs with a 4-token input (language/task tokens), not full transcription.
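The 80×3000 spectrogram shape follows from Whisper's standard audio front end: 16 kHz audio with a hop length of 160 samples and 80 mel bins (standard Whisper values, assumed to carry over to this custom tiny config):

```python
# 30 s of 16 kHz audio, one spectrogram frame every 160 samples.
sample_rate, hop_length, n_mels, seconds = 16000, 160, 80, 30
frames = seconds * sample_rate // hop_length
print(n_mels, frames)   # 80 3000 — the 80x3000 input in the config line
```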
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m Whisper-tiny