Lab 07 — GPGPU Backends: CoreML · DirectML · Vulkan
Goal: Understand and validate the three major GPGPU backends not covered in earlier labs.
| Backend | Vendor | OS | Primary Use |
|---|---|---|---|
| CoreML | Apple | macOS / iOS / iPadOS | On-device inference via ANE, GPU, CPU |
| DirectML | Microsoft | Windows | DirectX 12 GPU acceleration for ML |
| Vulkan | Khronos | Cross-platform | Low-level GPU compute (llama.cpp, MLC-LLM) |
Prerequisites
# CoreML
pip install coremltools torch torchvision
# DirectML (Windows only for execution; ONNX export works everywhere)
pip install onnx onnxruntime
# On Windows: pip install onnxruntime-directml
# Vulkan Python bindings
pip install vulkan
# System: install Vulkan SDK from https://vulkan.lunarg.com/
import platform
import sys
import json
import time
import subprocess
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
print(f"Python : {sys.version.split()[0]}")
print(f"PyTorch : {torch.__version__}")
print(f"Platform : {platform.system()} {platform.machine()}")
print(f"MPS avail: {torch.backends.mps.is_available()}")
print(f"CUDA avail: {torch.cuda.is_available()}")
SYSTEM = platform.system() # 'Darwin', 'Windows', 'Linux'
ARCH = platform.machine() # 'arm64', 'x86_64', 'AMD64'
DEVICE = (
"cuda" if torch.cuda.is_available()
else "mps" if torch.backends.mps.is_available()
else "cpu"
)
print(f"DEVICE : {DEVICE}")
Part 1 — CoreML (Apple)
1.1 What is CoreML?
CoreML is Apple's on-device ML framework. It dispatches workloads across three compute units:
| Unit | Hardware | Best For |
|---|---|---|
| ANE (Apple Neural Engine) | Dedicated ML chip | Fixed-shape inference, transformers |
| GPU | Metal GPU | Flexible ops, conv, matmul |
| CPU | ARM cores | Fallback, dynamic shapes |
Compute Unit Selection
import coremltools as ct
ct.ComputeUnit.ALL # ANE → GPU → CPU (default, best performance)
ct.ComputeUnit.CPU_AND_NE # Skip GPU; good for power efficiency
ct.ComputeUnit.CPU_AND_GPU # Skip ANE; good for dynamic shapes
ct.ComputeUnit.CPU_ONLY # Deterministic, debug mode
CoreML Model Formats
| Format | Extension | Notes |
|---|---|---|
| ML Program | .mlpackage | Modern (CoreML 5+), ANE-optimized |
| NeuralNetwork | .mlmodel | Legacy, still widely used |
Conversion path: PyTorch → TorchScript → coremltools.convert() → .mlpackage
# ── Shared model used across all three backends ──────────────────────────────
class MiniClassifier(nn.Module):
"""Small CNN for CIFAR-10-style 3x32x32 input β 10 classes."""
def __init__(self):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
nn.AdaptiveAvgPool2d(4),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
nn.Linear(128, 10),
)
def forward(self, x):
return self.classifier(self.features(x))
model = MiniClassifier().to(DEVICE).eval()
dummy_input = torch.randn(1, 3, 32, 32, device=DEVICE)
with torch.no_grad():
pt_out = model(dummy_input)
print(f"PyTorch device : {DEVICE}")
print(f"PyTorch output shape : {pt_out.shape}")
print(f"PyTorch output (logits[:5]): {pt_out[0, :5].cpu().numpy().round(4)}")
# ── 1.2 Convert PyTorch → CoreML ─────────────────────────────────────────────
if SYSTEM != "Darwin":
print("CoreML requires macOS. Skipping conversion β showing code only.")
else:
try:
import coremltools as ct
print(f"coremltools version: {ct.__version__}")
# Step 1: TorchScript trace
traced = torch.jit.trace(model, dummy_input)
# Step 2: Convert to CoreML ML Program (.mlpackage)
mlmodel = ct.convert(
traced,
inputs=[ct.TensorType(name="image", shape=dummy_input.shape)],
outputs=[ct.TensorType(name="logits")],
convert_to="mlprogram", # modern format (ANE-compatible)
compute_precision=ct.precision.FLOAT16, # ANE prefers FP16
minimum_deployment_target=ct.target.macOS13,
)
save_path = Path("mini_classifier.mlpackage")
mlmodel.save(str(save_path))
print(f"Saved: {save_path}")
print(f"Package size: {sum(f.stat().st_size for f in save_path.rglob('*') if f.is_file()) / 1024:.1f} KB")
except ImportError:
print("coremltools not installed. Run: pip install coremltools")
# ── 1.3 CoreML Inference & Parity Check ──────────────────────────────────────
if SYSTEM != "Darwin":
print("Skipping β macOS only.")
else:
try:
import coremltools as ct
mlmodel = ct.models.MLModel("mini_classifier.mlpackage",
compute_units=ct.ComputeUnit.ALL)
        # CoreML expects NumPy input; move tensors off GPU/MPS before converting
        cml_input = {"image": dummy_input.cpu().numpy().astype(np.float32)}
        cml_out = mlmodel.predict(cml_input)["logits"]
        # Compare with PyTorch FP32 reference
        pt_ref = pt_out.detach().cpu().numpy()
abs_diff = np.abs(cml_out - pt_ref).max()
print(f"CoreML logits[:5] : {cml_out[0, :5].round(4)}")
print(f"PyTorch logits[:5]: {pt_ref[0, :5].round(4)}")
print(f"Max abs difference : {abs_diff:.6f}")
# FP16 conversion introduces ~1e-2 tolerance
tol = 1e-2
status = "PASS" if abs_diff < tol else "FAIL"
print(f"Parity check (tol={tol}): {status}")
except Exception as e:
print(f"Error: {e}")
# ── 1.4 Benchmark: CPU_ONLY vs CPU_AND_GPU vs ALL (ANE) ──────────────────────
if SYSTEM != "Darwin":
print("Skipping β macOS only.")
else:
try:
import coremltools as ct
compute_units = {
"CPU_ONLY" : ct.ComputeUnit.CPU_ONLY,
"CPU_AND_GPU" : ct.ComputeUnit.CPU_AND_GPU,
"ALL (ANE)" : ct.ComputeUnit.ALL,
}
cml_input = {"image": dummy_input.numpy().astype(np.float32)}
WARMUP, RUNS = 5, 50
print(f"{'Compute Unit':<20} {'Latency (ms)':<15} {'Throughput (inf/s)':<20}")
print("-" * 55)
for name, cu in compute_units.items():
m = ct.models.MLModel("mini_classifier.mlpackage", compute_units=cu)
for _ in range(WARMUP):
m.predict(cml_input)
t0 = time.perf_counter()
for _ in range(RUNS):
m.predict(cml_input)
elapsed = time.perf_counter() - t0
lat_ms = elapsed / RUNS * 1000
tput = RUNS / elapsed
print(f"{name:<20} {lat_ms:<15.2f} {tput:<20.1f}")
except Exception as e:
print(f"Error: {e}")
# ── 1.5 Inspect CoreML Model Spec ────────────────────────────────────────────
if SYSTEM != "Darwin":
print("Skipping β macOS only.")
else:
try:
import coremltools as ct
mlmodel = ct.models.MLModel("mini_classifier.mlpackage")
spec = mlmodel.get_spec()
print("=== Model Description ===")
print(f"Type : {spec.WhichOneof('Type')}")
print(f"CoreML version : {spec.specificationVersion}")
print("\n=== Inputs ===")
for inp in mlmodel.input_description._fd_spec:
print(f" {inp.name}")
print("\n=== Outputs ===")
for out in mlmodel.output_description._fd_spec:
print(f" {out.name}")
# Operator count (ML Program)
try:
prog = spec.mlProgram
ops = []
for fn in prog.functions.values():
for blk in fn.block_specializations.values():
ops.extend(op.type for op in blk.operations)
from collections import Counter
print("\n=== Top Operators ===")
for op, cnt in Counter(ops).most_common(10):
print(f" {op:<30} {cnt}")
except Exception:
print("(operator inspection not available for this format)")
except Exception as e:
print(f"Error: {e}")
Part 2 — DirectML (Microsoft / Windows)
2.1 What is DirectML?
DirectML is a low-level DirectX 12 API for GPU-accelerated ML on Windows. It is the backend used by:
- ONNX Runtime (via the DmlExecutionProvider execution provider)
- Windows ML (WinML) — UWP/desktop app inference
- PyTorch-DirectML (the torch_directml package, experimental)
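As a quick illustration of the last entry point, here is a hedged sketch of torch_directml usage. The package only installs on Windows, so the guard mirrors the import checks used throughout this lab; everything inside the `try` is a sketch, not a tested path on this machine:

```python
try:
    import torch
    import torch_directml  # Windows-only; pip install torch-directml

    dml = torch_directml.device()      # torch.device backed by the DirectML GPU
    x = torch.randn(2, 3, device=dml)  # tensor allocated on the DML device
    y = (x @ x.T).cpu()                # matmul dispatched through DirectML
    status = f"DirectML tensor result shape: {tuple(y.shape)}"
except ImportError:
    status = "torch_directml not installed — expected outside Windows."
print(status)
```

On non-Windows platforms this simply reports that the package is missing, matching the fallback style of the other cells.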
Why DirectML Matters
| Feature | DirectML | CUDA |
|---|---|---|
| GPU vendor | Any (NVIDIA, AMD, Intel) | NVIDIA only |
| OS | Windows only | Linux / Windows |
| Runtime | DirectX 12 driver | CUDA driver + toolkit |
| API level | Low-level (operator graph) | High-level (cuBLAS, cuDNN) |
| Python entry point | ONNX Runtime DML EP | torch.cuda |
Architecture
Python (PyTorch / ONNX Runtime)
        │
        ▼
DmlExecutionProvider   ←─ onnxruntime-directml
        │
        ▼
DirectML.dll           ←─ ships with Windows 10 1903+
        │
        ▼
DirectX 12 / WDDM 2.x
        │
        ▼
GPU Hardware (any vendor)
2.2 ONNX Export (cross-platform)
ONNX export works on any OS. DirectML execution requires Windows.
# ── 2.2 Export model to ONNX ─────────────────────────────────────────────────
onnx_available = False
try:
import onnx
import onnxruntime as ort
print(f"onnx version : {onnx.__version__}")
print(f"onnxruntime version : {ort.__version__}")
onnx_available = True
except ImportError as e:
print(f"Missing: {e}. Run: pip install onnx onnxruntime")
print("Skipping ONNX export β dependency not installed.")
if onnx_available:
onnx_path = Path("mini_classifier.onnx")
    torch.onnx.export(
        model.cpu(),        # export from CPU for a device-agnostic graph
        dummy_input.cpu(),
        str(onnx_path),
input_names=["image"],
output_names=["logits"],
dynamic_axes={"image": {0: "batch_size"}, "logits": {0: "batch_size"}},
opset_version=17,
verbose=False,
)
# Verify ONNX model
onnx_model = onnx.load(str(onnx_path))
onnx.checker.check_model(onnx_model)
print(f"ONNX model saved : {onnx_path} ({onnx_path.stat().st_size / 1024:.1f} KB)")
print(f"ONNX opset : {onnx_model.opset_import[0].version}")
print(f"Nodes in graph : {len(onnx_model.graph.node)}")
# ── 2.3 ONNX Runtime Inference — CPU Execution Provider ──────────────────────
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
cpu_sess = ort.InferenceSession(
str(onnx_path),
sess_options=sess_options,
providers=["CPUExecutionProvider"],
)
ort_input = {"image": dummy_input.numpy().astype(np.float32)}
ort_cpu_out = cpu_sess.run(["logits"], ort_input)[0]
pt_ref = pt_out.detach().numpy()
abs_diff = np.abs(ort_cpu_out - pt_ref).max()
print(f"ORT CPU logits[:5] : {ort_cpu_out[0, :5].round(4)}")
print(f"PyTorch logits[:5] : {pt_ref[0, :5].round(4)}")
print(f"Max abs diff : {abs_diff:.8f}")
print(f"Parity (tol=1e-4) : {'PASS' if abs_diff < 1e-4 else 'FAIL'}")
# ── 2.4 DirectML Execution Provider (Windows only) ───────────────────────────
available_eps = ort.get_available_providers()
print(f"Available ONNX Runtime EPs: {available_eps}")
if "DmlExecutionProvider" in available_eps:
dml_sess = ort.InferenceSession(
str(onnx_path),
providers=[("DmlExecutionProvider", {"device_id": 0})],
)
dml_out = dml_sess.run(["logits"], ort_input)[0]
abs_diff = np.abs(dml_out - pt_ref).max()
print(f"DirectML logits[:5]: {dml_out[0, :5].round(4)}")
print(f"Max abs diff : {abs_diff:.8f}")
print(f"Parity (tol=1e-2) : {'PASS' if abs_diff < 1e-2 else 'FAIL'}")
# Latency benchmark
WARMUP, RUNS = 10, 200
for _ in range(WARMUP):
dml_sess.run(["logits"], ort_input)
t0 = time.perf_counter()
for _ in range(RUNS):
dml_sess.run(["logits"], ort_input)
dml_lat = (time.perf_counter() - t0) / RUNS * 1000
for _ in range(WARMUP):
cpu_sess.run(["logits"], ort_input)
t0 = time.perf_counter()
for _ in range(RUNS):
cpu_sess.run(["logits"], ort_input)
cpu_lat = (time.perf_counter() - t0) / RUNS * 1000
print(f"\nLatency β CPU : {cpu_lat:.2f} ms")
print(f"Latency β DirectML: {dml_lat:.2f} ms")
print(f"Speedup : {cpu_lat / dml_lat:.2f}x")
else:
print("DmlExecutionProvider not available on this platform.")
print("On Windows, install: pip install onnxruntime-directml")
print("\nββ Simulating DirectML EP result using CPU EP for reference ββ")
# Cross-platform: benchmark CPU as stand-in, show what DML would report
WARMUP, RUNS = 10, 200
for _ in range(WARMUP):
cpu_sess.run(["logits"], ort_input)
t0 = time.perf_counter()
for _ in range(RUNS):
cpu_sess.run(["logits"], ort_input)
cpu_lat = (time.perf_counter() - t0) / RUNS * 1000
print(f"CPU EP latency : {cpu_lat:.2f} ms (baseline)")
print("Expected DML speedup on discrete GPU: 2β10x for small models, ")
print(" 10β50x for large transformer inference.")
# ── 2.5 ONNX Runtime CUDA EP (when available) ────────────────────────────────
# On Linux/Windows with NVIDIA GPU: use CUDAExecutionProvider instead of DML
if "CUDAExecutionProvider" in available_eps:
cuda_sess = ort.InferenceSession(
str(onnx_path),
providers=[("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"],
)
cuda_out = cuda_sess.run(["logits"], ort_input)[0]
abs_diff = np.abs(cuda_out - pt_ref).max()
print(f"ORT CUDA logits[:5]: {cuda_out[0, :5].round(4)}")
print(f"Max abs diff : {abs_diff:.8f}")
print(f"Parity (tol=1e-3) : {'PASS' if abs_diff < 1e-3 else 'FAIL'}")
else:
print("CUDAExecutionProvider not available β skipping.")
print("On Linux/Windows with NVIDIA GPU, install: pip install onnxruntime-gpu")
# ── 2.6 ONNX Runtime EP Compatibility Matrix ─────────────────────────────────
EP_INFO = {
"CPUExecutionProvider" : {"OS": "All", "GPU vendor": "N/A", "install": "onnxruntime"},
"CUDAExecutionProvider" : {"OS": "Lin/Win", "GPU vendor": "NVIDIA","install": "onnxruntime-gpu"},
"DmlExecutionProvider" : {"OS": "Windows", "GPU vendor": "Any", "install": "onnxruntime-directml"},
"ROCMExecutionProvider" : {"OS": "Linux", "GPU vendor": "AMD", "install": "onnxruntime-rocm"},
"CoreMLExecutionProvider" : {"OS": "macOS/iOS","GPU vendor": "Apple","install": "onnxruntime-silicon"},
"TensorrtExecutionProvider": {"OS": "Lin/Win","GPU vendor": "NVIDIA","install": "onnxruntime-gpu"},
"OpenVINOExecutionProvider": {"OS": "All", "GPU vendor": "Intel", "install": "onnxruntime-openvino"},
}
print(f"{'Execution Provider':<35} {'OS':<12} {'GPU':<10} {'Package'}")
print("-" * 80)
for ep, info in EP_INFO.items():
available = "β" if ep in available_eps else " "
print(f"[{available}] {ep:<33} {info['OS']:<12} {info['GPU vendor']:<10} {info['install']}")
Part 3 — Vulkan (Cross-Platform)
3.1 What is Vulkan Compute?
Vulkan is a low-level, cross-platform GPU API from the Khronos Group. Unlike CUDA or Metal, it:
- Works on any Vulkan-capable GPU (NVIDIA, AMD, Intel, Apple via MoltenVK, mobile)
- Runs on Linux, Windows, and macOS (macOS via the MoltenVK translation layer)
- Requires explicit memory management and compute shaders (GLSL compiled to SPIR-V)
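To make "compute shaders (GLSL/SPIR-V)" concrete, here is a minimal GLSL compute shader for an element-wise SAXPY (y = a·x + y), shown as a Python string in the spirit of the build-instructions cell later in this section. Compiling it to SPIR-V (e.g. with glslangValidator) and dispatching it requires the full Vulkan boilerplate (instance, device, pipeline, descriptor sets), which is omitted here:

```python
# Minimal GLSL compute shader (compile with: glslangValidator -V saxpy.comp -o saxpy.spv)
SAXPY_COMP = """
#version 450
layout(local_size_x = 64) in;                         // 64 invocations per workgroup

layout(std430, set = 0, binding = 0) buffer X { float x[]; };
layout(std430, set = 0, binding = 1) buffer Y { float y[]; };
layout(push_constant) uniform Params { float a; uint n; } p;

void main() {
    uint i = gl_GlobalInvocationID.x;                 // global thread index
    if (i < p.n) {
        y[i] = p.a * x[i] + y[i];                     // SAXPY: y = a*x + y
    }
}
"""
print(SAXPY_COMP)
```

This is the same structure (workgroup size, storage buffers, push constants) that runtimes like llama.cpp and Kompute generate or hand-write for their Vulkan kernels.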
ML Runtimes that Use Vulkan
| Runtime | Vulkan Use Case |
|---|---|
| llama.cpp | LLM inference on any Vulkan GPU |
| MLC-LLM | Compiled LLM deployment |
| NCNN | Mobile/embedded inference |
| Kompute | General Vulkan ML compute |
| whisper.cpp | Speech recognition on Vulkan |
Vulkan vs CUDA vs Metal
| Feature | Vulkan | CUDA | Metal |
|---|---|---|---|
| Portability | ★★★★★ | ★☆☆☆☆ | ★★☆☆☆ |
| Performance ceiling | ★★★★★ | ★★★★★ | ★★★★★ |
| Ease of use | ★★☆☆☆ | ★★★★★ | ★★★☆☆ |
| Ecosystem (ML) | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| macOS support | Via MoltenVK | No | Native |
3.2 Vulkan Device Enumeration
# ── 3.2 Vulkan Device Detection via vulkan Python bindings ───────────────────
try:
import vulkan as vk
VULKAN_AVAILABLE = True
print("vulkan Python bindings loaded.")
except ImportError:
VULKAN_AVAILABLE = False
print("vulkan not installed. Run: pip install vulkan")
print("Also install Vulkan SDK: https://vulkan.lunarg.com/")
# ── 3.3 Enumerate Vulkan Physical Devices ────────────────────────────────────
def enumerate_vulkan_devices():
"""List all Vulkan-capable GPUs and their properties."""
if not VULKAN_AVAILABLE:
print("Vulkan not available.")
return []
app_info = vk.VkApplicationInfo(
sType=vk.VK_STRUCTURE_TYPE_APPLICATION_INFO,
pApplicationName="zero-to-ai",
applicationVersion=vk.VK_MAKE_VERSION(1, 0, 0),
pEngineName="No Engine",
engineVersion=vk.VK_MAKE_VERSION(1, 0, 0),
apiVersion=vk.VK_API_VERSION_1_0,
)
create_info = vk.VkInstanceCreateInfo(
sType=vk.VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
pApplicationInfo=app_info,
)
instance = vk.vkCreateInstance(create_info, None)
physical_devices = vk.vkEnumeratePhysicalDevices(instance)
devices = []
DEVICE_TYPES = {
vk.VK_PHYSICAL_DEVICE_TYPE_OTHER : "Other",
vk.VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU : "Integrated GPU",
vk.VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU : "Discrete GPU",
vk.VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU : "Virtual GPU",
vk.VK_PHYSICAL_DEVICE_TYPE_CPU : "CPU",
}
print(f"Found {len(physical_devices)} Vulkan device(s):\n")
for i, dev in enumerate(physical_devices):
props = vk.vkGetPhysicalDeviceProperties(dev)
mem = vk.vkGetPhysicalDeviceMemoryProperties(dev)
limits = props.limits
vram_mb = sum(
mem.memoryHeaps[j].size
for j in range(mem.memoryHeapCount)
if mem.memoryHeaps[j].flags & vk.VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
) // (1024 * 1024)
major = vk.VK_VERSION_MAJOR(props.apiVersion)
minor = vk.VK_VERSION_MINOR(props.apiVersion)
patch = vk.VK_VERSION_PATCH(props.apiVersion)
info = {
"index" : i,
"name" : props.deviceName,
"type" : DEVICE_TYPES.get(props.deviceType, "Unknown"),
"api_version" : f"{major}.{minor}.{patch}",
"driver_version" : props.driverVersion,
"vendor_id" : hex(props.vendorID),
"device_id" : hex(props.deviceID),
"vram_mb" : vram_mb,
"max_workgroup_x": limits.maxComputeWorkGroupSize[0],
"max_workgroup_y": limits.maxComputeWorkGroupSize[1],
"max_workgroup_z": limits.maxComputeWorkGroupSize[2],
"max_shared_mem" : limits.maxComputeSharedMemorySize,
}
devices.append(info)
print(f"[{i}] {info['name']}")
print(f" Type : {info['type']}")
print(f" Vulkan API : {info['api_version']}")
print(f" Vendor ID : {info['vendor_id']}")
print(f" VRAM : {vram_mb} MB")
print(f" Max workgroup : {info['max_workgroup_x']} x {info['max_workgroup_y']} x {info['max_workgroup_z']}")
print(f" Shared mem : {info['max_shared_mem'] // 1024} KB\n")
vk.vkDestroyInstance(instance, None)
return devices
vulkan_devices = enumerate_vulkan_devices()
# ── 3.4 llama.cpp Vulkan Backend ─────────────────────────────────────────────
# llama.cpp supports Vulkan as a backend for cross-platform LLM inference.
# This cell shows how to build and run it.
LLAMACPP_VULKAN_BUILD = """
# Clone and build llama.cpp with Vulkan support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with Vulkan (requires Vulkan SDK installed)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
# Run inference on any Vulkan GPU
./build/bin/llama-cli \\
--model models/llama-3.2-1b-q4_k_m.gguf \\
--n-gpu-layers 33 \\
--vulkan-device 0 \\
--prompt "Hello, world!"
# List Vulkan devices
./build/bin/llama-cli --list-devices
"""
print("=== llama.cpp Vulkan Build Instructions ===")
print(LLAMACPP_VULKAN_BUILD)
# Check if llama-cli is installed via pip (llama-cpp-python with Vulkan)
try:
from llama_cpp import Llama
print("llama-cpp-python is installed.")
print("To use Vulkan backend, rebuild with:")
print(' CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --force-reinstall')
except ImportError:
print("llama-cpp-python not installed.")
print("Install with Vulkan: CMAKE_ARGS=\"-DGGML_VULKAN=ON\" pip install llama-cpp-python")
# ── 3.5 Detect Vulkan via System Tools (fallback when Python bindings absent) ─
def detect_vulkan_system():
"""Use system tools to detect Vulkan support without Python bindings."""
results = {}
# vulkaninfo (from Vulkan SDK)
for tool in ["vulkaninfo", "vulkaninfo --summary"]:
try:
out = subprocess.run(
tool.split(),
capture_output=True, text=True, timeout=10
)
if out.returncode == 0:
results["vulkaninfo"] = out.stdout[:500]
print("vulkaninfo output (first 500 chars):")
print(out.stdout[:500])
return results
except (FileNotFoundError, subprocess.TimeoutExpired):
continue
# macOS: check MoltenVK
if SYSTEM == "Darwin":
moltenvk_paths = [
"/usr/local/lib/libMoltenVK.dylib",
"/opt/homebrew/lib/libMoltenVK.dylib",
Path.home() / ".local/lib/libMoltenVK.dylib",
]
for p in moltenvk_paths:
if Path(p).exists():
results["moltenvk"] = str(p)
print(f"MoltenVK found: {p}")
# Homebrew Vulkan loader
try:
out = subprocess.run(["brew", "list", "vulkan-loader"],
capture_output=True, text=True, timeout=5)
if out.returncode == 0:
results["vulkan_loader"] = "homebrew"
print("Vulkan loader installed via Homebrew")
except FileNotFoundError:
pass
# Linux: check for libvulkan
if SYSTEM == "Linux":
try:
out = subprocess.run(["ldconfig", "-p"],
capture_output=True, text=True, timeout=5)
vulkan_libs = [l for l in out.stdout.splitlines() if "vulkan" in l.lower()]
if vulkan_libs:
results["vulkan_libs"] = vulkan_libs
print("Vulkan libraries found:")
for lib in vulkan_libs:
print(f" {lib.strip()}")
except FileNotFoundError:
pass
if not results:
print("No Vulkan installation detected.")
print("Install Vulkan SDK: https://vulkan.lunarg.com/sdk/home")
if SYSTEM == "Darwin":
print("Or via Homebrew: brew install vulkan-loader molten-vk")
elif SYSTEM == "Linux":
print("Or via apt: sudo apt install libvulkan1 vulkan-tools")
return results
if not VULKAN_AVAILABLE:
detect_vulkan_system()
# ── 3.6 PyTorch Backend Selection: CUDA vs ROCm vs MPS vs Vulkan ─────────────
# PyTorch itself does not expose a Vulkan compute path for general tensors;
# Vulkan is used by llama.cpp / MLC-LLM / NCNN as their own runtime.
# This cell compares PyTorch's available backends for context.
def select_best_torch_device():
"""Return the best available PyTorch device with fallback chain."""
if torch.cuda.is_available():
device = torch.device("cuda")
props = torch.cuda.get_device_properties(0)
backend = f"CUDA (NVIDIA) β {props.name}, {props.total_memory // 1024**3} GB"
elif torch.backends.mps.is_available():
device = torch.device("mps")
backend = "Metal (Apple MPS)"
else:
device = torch.device("cpu")
backend = f"CPU ({platform.processor() or 'unknown'})"
return device, backend
device, backend = select_best_torch_device()
print(f"Selected backend: {backend}")
print(f"torch.device : {device}")
# Show all backends and their status
print("\n=== PyTorch Backend Status ===")
print(f" CUDA (NVIDIA) : {'available' if torch.cuda.is_available() else 'not available'}")
if torch.cuda.is_available():
print(f" Device count : {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
print(f" [{i}] {torch.cuda.get_device_name(i)}")
print(f" ROCm (AMD) : {'available' if torch.cuda.is_available() and 'AMD' in torch.cuda.get_device_name(0) else 'not available (or CUDA not AMD)'}")
print(f" Metal/MPS (Apple): {'available' if torch.backends.mps.is_available() else 'not available'}")
print(f" Vulkan (via PT): not supported β use llama.cpp/MLC-LLM for Vulkan inference")
Part 4 — Cross-Backend Parity Validation
When deploying the same model across multiple backends, you must verify:
- Numerical parity — outputs agree within a dtype-appropriate tolerance
- Performance ranking — understand the expected speedup per backend
- Operator coverage — not all ops are supported on all backends
Expected Tolerances Across Backends
| Backend | Precision | Typical atol |
|---|---|---|
| PyTorch FP32 (reference) | FP32 | — |
| ONNX Runtime CPU | FP32 | 1e-5 |
| ONNX Runtime CUDA | FP32 | 1e-4 |
| CoreML FP16 | FP16 | 1e-2 |
| CoreML FP32 | FP32 | 1e-4 |
| DirectML | FP32/FP16 | 1e-3 |
| Vulkan (llama.cpp Q4_K_M) | INT4 quant | 0.1–0.3 (perplexity metric) |
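The tolerance table can be turned into a tiny reusable check. A minimal sketch (helper name is illustrative; the FP16 backend is simulated here with a NumPy round-trip rather than a real runtime):

```python
import numpy as np

def parity_check(reference, output, atol):
    """Return (max abs diff, pass flag) against an FP32 reference."""
    diff = float(np.abs(reference - output).max())
    return diff, diff < atol

rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 10)).astype(np.float32)   # stand-in FP32 reference logits
fp16_out = ref.astype(np.float16).astype(np.float32)    # simulate an FP16 backend
diff, ok = parity_check(ref, fp16_out, atol=1e-2)       # CoreML-FP16-style tolerance
print(f"max abs diff = {diff:.6f} -> {'PASS' if ok else 'FAIL'}")
```

The same helper applies to any of the rows above; only the `atol` changes per backend/precision pair.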
# ── 4.1 Cross-Backend Parity Report ──────────────────────────────────────────
import dataclasses
from typing import Optional
@dataclasses.dataclass
class BackendResult:
name : str
output : Optional[np.ndarray]
latency_ms : Optional[float]
error : Optional[str] = None
def run_ort_backend(ep_name, ep_options=None):
"""Run ONNX Runtime with a given execution provider."""
providers = [(ep_name, ep_options or {})] if ep_options else [ep_name]
if ep_name not in ort.get_available_providers():
return BackendResult(ep_name, None, None, error="EP not available")
try:
sess = ort.InferenceSession(str(onnx_path), providers=providers)
inp = {"image": dummy_input.numpy().astype(np.float32)}
for _ in range(5): # warmup
sess.run(["logits"], inp)
t0 = time.perf_counter()
for _ in range(100):
out = sess.run(["logits"], inp)[0]
lat = (time.perf_counter() - t0) / 100 * 1000
return BackendResult(ep_name, out, lat)
except Exception as e:
return BackendResult(ep_name, None, None, error=str(e)[:80])
# Collect results
reference = pt_out.detach().cpu().numpy()
results = [
run_ort_backend("CPUExecutionProvider"),
run_ort_backend("CUDAExecutionProvider"),
run_ort_backend("DmlExecutionProvider"),
run_ort_backend("ROCMExecutionProvider"),
run_ort_backend("CoreMLExecutionProvider"),
]
print(f"{'Backend':<35} {'Status':<10} {'Max diff':<12} {'Latency (ms)'}")
print("-" * 75)
for r in results:
if r.error:
print(f"{r.name:<35} {'SKIP':<10} {'β':<12} β ({r.error})")
else:
diff = np.abs(r.output - reference).max()
ok = "PASS" if diff < 1e-2 else "FAIL"
print(f"{r.name:<35} {ok:<10} {diff:<12.6f} {r.latency_ms:.2f} ms")
Part 5 — Exercises
1. CoreML Compute Units (macOS): Convert a ResNet-18 to CoreML. Benchmark CPU_ONLY vs ALL. What speedup does the ANE provide? At what batch size does the GPU become faster than the ANE?
2. ONNX FP16 optimization: Export the model with FP16 weights using onnxconverter_common. Compare parity and latency vs FP32 on the CPU EP.
3. DirectML (Windows): Install onnxruntime-directml. Run the parity matrix again. Which EP is fastest on your GPU?
4. Vulkan device enumeration: Install the Vulkan SDK and pip install vulkan. Run the enumeration cell. How many devices does your machine expose? Is the integrated GPU also listed?
5. llama.cpp Vulkan backend: Clone llama.cpp, build with -DGGML_VULKAN=ON, and download a GGUF model (e.g. Llama-3.2-1B-Q4_K_M.gguf). Compare tokens/sec on CPU vs the Vulkan GPU.
6. Cross-backend parity: Export a model with a LayerNorm op. Compare ORT CPU vs CoreML FP16 numerically. Does FP16 cause any predicted labels to flip in a text classification task?
Key Takeaways
| | CoreML | DirectML | Vulkan |
|---|---|---|---|
| Best for | Apple on-device | Windows any-GPU | Cross-platform LLM inference |
| Entry point | coremltools | ONNX Runtime DML EP | llama.cpp / vulkan bindings |
| Precision | FP16/FP32/INT8 | FP32/FP16 | Quantized (Q4–Q8) |
| Dynamic shapes | Limited (ANE) | Yes | Yes |
| Debugging | Instruments / Xcode | PIX / DX Capture | RenderDoc / validation layers |
- Always validate numerical parity when switching backends — FP16 and quantization introduce non-trivial differences
- CoreML ANE requires static input shapes — use ct.RangeDim for bounded dynamic shapes
- DirectML and Vulkan both skip the vendor-specific driver stack — great for portability, though sometimes slower than native CUDA/ROCm
- Vulkan's SPIR-V compute shaders are how llama.cpp achieves GPU inference on Intel, AMD, and Apple GPUs without CUDA
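The ct.RangeDim point above can be sketched as follows (a minimal sketch assuming coremltools is installed; the bounds and input name are illustrative, and the guard mirrors the import checks used throughout this lab):

```python
try:
    import coremltools as ct

    # Bounded dynamic spatial dims: batch fixed at 1, H/W may vary within [32, 64].
    # Pass this as the `inputs=[...]` spec to ct.convert() instead of a fixed shape.
    input_shape = ct.Shape(shape=(1, 3, ct.RangeDim(32, 64), ct.RangeDim(32, 64)))
    flexible_input = ct.TensorType(name="image", shape=input_shape)
    msg = f"Flexible input spec created: {flexible_input}"
except Exception:
    msg = "coremltools not installed — sketch only."
print(msg)
```

Bounded ranges keep the model eligible for ANE scheduling in a way fully unbounded shapes are not.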
Previous: lab_06_framework_validation.ipynb
Next: lab_08_benchmarking.ipynb — Benchmarking: TTFT, throughput, SLOs
Back to Overview: README.md