Lab 07 — GPGPU Backends: CoreML · DirectML · Vulkan¶

Goal: Understand and validate the three major GPGPU backends not covered in earlier labs.

| Backend  | Vendor    | OS                   | Primary Use                                |
|----------|-----------|----------------------|--------------------------------------------|
| CoreML   | Apple     | macOS / iOS / iPadOS | On-device inference via ANE, GPU, CPU      |
| DirectML | Microsoft | Windows              | DirectX 12 GPU acceleration for ML         |
| Vulkan   | Khronos   | Cross-platform       | Low-level GPU compute (llama.cpp, MLC-LLM) |

Prerequisites¶

# CoreML
pip install coremltools torch torchvision

# DirectML (Windows only for execution; ONNX export works everywhere)
pip install onnx onnxruntime
# On Windows: pip install onnxruntime-directml

# Vulkan Python bindings
pip install vulkan
# System: install Vulkan SDK from https://vulkan.lunarg.com/
import platform
import sys
import json
import time
import subprocess
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn

print(f"Python   : {sys.version.split()[0]}")
print(f"PyTorch  : {torch.__version__}")
print(f"Platform : {platform.system()} {platform.machine()}")
print(f"MPS avail: {torch.backends.mps.is_available()}")
print(f"CUDA avail: {torch.cuda.is_available()}")

SYSTEM = platform.system()   # 'Darwin', 'Windows', 'Linux'
ARCH   = platform.machine()  # 'arm64', 'x86_64', 'AMD64'

DEVICE = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"DEVICE   : {DEVICE}")

Part 1 — CoreML (Apple)¶

1.1 What is CoreML?¶

CoreML is Apple's on-device ML framework. It dispatches workloads across three compute units:

| Unit                      | Hardware          | Best For                            |
|---------------------------|-------------------|-------------------------------------|
| ANE (Apple Neural Engine) | Dedicated ML chip | Fixed-shape inference, transformers |
| GPU                       | Metal GPU         | Flexible ops, conv, matmul          |
| CPU                       | ARM cores         | Fallback, dynamic shapes            |

Compute Unit Selection¶

import coremltools as ct

ct.ComputeUnit.ALL          # ANE → GPU → CPU (default, best performance)
ct.ComputeUnit.CPU_AND_NE   # Skip GPU; good for power efficiency
ct.ComputeUnit.CPU_AND_GPU  # Skip ANE; good for dynamic shapes
ct.ComputeUnit.CPU_ONLY     # Deterministic, debug mode

CoreML Model Formats¶

| Format        | Extension  | Notes                             |
|---------------|------------|-----------------------------------|
| ML Program    | .mlpackage | Modern (CoreML 5+), ANE-optimized |
| NeuralNetwork | .mlmodel   | Legacy, still widely used         |

Conversion path: PyTorch → TorchScript → coremltools.convert() → .mlpackage

# ── Shared model used across all three backends ──────────────────────────────
class MiniClassifier(nn.Module):
    """Small CNN for CIFAR-10-style 3x32x32 input β†’ 10 classes."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = MiniClassifier().to(DEVICE).eval()
dummy_input = torch.randn(1, 3, 32, 32, device=DEVICE)
with torch.no_grad():
    pt_out = model(dummy_input)
print(f"PyTorch device       : {DEVICE}")
print(f"PyTorch output shape : {pt_out.shape}")
print(f"PyTorch output (logits[:5]): {pt_out[0, :5].cpu().numpy().round(4)}")
# ── 1.2 Convert PyTorch → CoreML ─────────────────────────────────────────────
if SYSTEM != "Darwin":
    print("CoreML requires macOS. Skipping conversion β€” showing code only.")
else:
    try:
        import coremltools as ct
        print(f"coremltools version: {ct.__version__}")

        # Step 1: TorchScript trace
        traced = torch.jit.trace(model, dummy_input)

        # Step 2: Convert to CoreML ML Program (.mlpackage)
        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(name="image", shape=dummy_input.shape)],
            outputs=[ct.TensorType(name="logits")],
            convert_to="mlprogram",          # modern format (ANE-compatible)
            compute_precision=ct.precision.FLOAT16,  # ANE prefers FP16
            minimum_deployment_target=ct.target.macOS13,
        )

        save_path = Path("mini_classifier.mlpackage")
        mlmodel.save(str(save_path))
        print(f"Saved: {save_path}")
        print(f"Package size: {sum(f.stat().st_size for f in save_path.rglob('*') if f.is_file()) / 1024:.1f} KB")

    except ImportError:
        print("coremltools not installed. Run: pip install coremltools")
# ── 1.3 CoreML Inference & Parity Check ──────────────────────────────────────
if SYSTEM != "Darwin":
    print("Skipping β€” macOS only.")
else:
    try:
        import coremltools as ct
        mlmodel = ct.models.MLModel("mini_classifier.mlpackage",
                                    compute_units=ct.ComputeUnit.ALL)

        # CoreML expects numpy input
        cml_input = {"image": dummy_input.cpu().numpy().astype(np.float32)}
        cml_out   = mlmodel.predict(cml_input)["logits"]

        # Compare with PyTorch FP32 reference
        pt_ref = pt_out.detach().cpu().numpy()
        abs_diff = np.abs(cml_out - pt_ref).max()
        print(f"CoreML logits[:5] : {cml_out[0, :5].round(4)}")
        print(f"PyTorch logits[:5]: {pt_ref[0, :5].round(4)}")
        print(f"Max abs difference : {abs_diff:.6f}")

        # FP16 conversion introduces ~1e-2 tolerance
        tol = 1e-2
        status = "PASS" if abs_diff < tol else "FAIL"
        print(f"Parity check (tol={tol}): {status}")

    except Exception as e:
        print(f"Error: {e}")
# ── 1.4 Benchmark: CPU_ONLY vs CPU_AND_GPU vs ALL (ANE) ──────────────────────
if SYSTEM != "Darwin":
    print("Skipping β€” macOS only.")
else:
    try:
        import coremltools as ct

        compute_units = {
            "CPU_ONLY"    : ct.ComputeUnit.CPU_ONLY,
            "CPU_AND_GPU" : ct.ComputeUnit.CPU_AND_GPU,
            "ALL (ANE)"   : ct.ComputeUnit.ALL,
        }
        cml_input = {"image": dummy_input.cpu().numpy().astype(np.float32)}
        WARMUP, RUNS = 5, 50

        print(f"{'Compute Unit':<20} {'Latency (ms)':<15} {'Throughput (inf/s)':<20}")
        print("-" * 55)

        for name, cu in compute_units.items():
            m = ct.models.MLModel("mini_classifier.mlpackage", compute_units=cu)
            for _ in range(WARMUP):
                m.predict(cml_input)
            t0 = time.perf_counter()
            for _ in range(RUNS):
                m.predict(cml_input)
            elapsed = time.perf_counter() - t0
            lat_ms  = elapsed / RUNS * 1000
            tput    = RUNS / elapsed
            print(f"{name:<20} {lat_ms:<15.2f} {tput:<20.1f}")

    except Exception as e:
        print(f"Error: {e}")
# ── 1.5 Inspect CoreML Model Spec ────────────────────────────────────────────
if SYSTEM != "Darwin":
    print("Skipping β€” macOS only.")
else:
    try:
        import coremltools as ct
        mlmodel = ct.models.MLModel("mini_classifier.mlpackage")
        spec    = mlmodel.get_spec()

        print("=== Model Description ===")
        print(f"Type             : {spec.WhichOneof('Type')}")
        print(f"CoreML version   : {spec.specificationVersion}")

        print("\n=== Inputs ===")
        for inp in mlmodel.input_description._fd_spec:
            print(f"  {inp.name}")

        print("\n=== Outputs ===")
        for out in mlmodel.output_description._fd_spec:
            print(f"  {out.name}")

        # Operator count (ML Program)
        try:
            prog = spec.mlProgram
            ops  = []
            for fn in prog.functions.values():
                for blk in fn.block_specializations.values():
                    ops.extend(op.type for op in blk.operations)
            from collections import Counter
            print("\n=== Top Operators ===")
            for op, cnt in Counter(ops).most_common(10):
                print(f"  {op:<30} {cnt}")
        except Exception:
            print("(operator inspection not available for this format)")

    except Exception as e:
        print(f"Error: {e}")

Part 2 — DirectML (Microsoft / Windows)¶

2.1 What is DirectML?¶

DirectML is a low-level DirectX 12 API for GPU-accelerated ML on Windows. It is the backend used by:

  • ONNX Runtime (ExecutionProvider = 'DmlExecutionProvider')

  • Windows ML (WinML) — for UWP/desktop app inference

  • PyTorch-DirectML (torch_directml package, experimental)
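The third entry point above, the experimental torch_directml package, can be probed with a short sketch (guarded so it runs anywhere; on non-Windows machines it just reports the fallback):

```python
import importlib.util

# Sketch: torch_directml exposes a DirectML-backed torch.device on Windows.
dml_device = None
if importlib.util.find_spec("torch_directml") is not None:
    import torch_directml
    dml_device = torch_directml.device()   # usable as a torch tensor device
    print(f"torch_directml device: {dml_device}")
else:
    print("torch_directml not installed (Windows only): pip install torch-directml")
```

Tensors created with `device=dml_device` then execute on the DirectML stack shown in the architecture diagram below.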

Why DirectML Matters¶

| Feature            | DirectML                   | CUDA                       |
|--------------------|----------------------------|----------------------------|
| GPU vendor         | Any (NVIDIA, AMD, Intel)   | NVIDIA only                |
| OS                 | Windows only               | Linux / Windows            |
| Runtime            | DirectX 12 driver          | CUDA driver + toolkit      |
| API level          | Low-level (operator graph) | High-level (cuBLAS, cuDNN) |
| Python entry point | ONNX Runtime DML EP        | torch.cuda                 |

Architecture¶

Python (PyTorch / ONNX Runtime)
        │
        ▼
  DmlExecutionProvider  ◄── onnxruntime-directml
        │
        ▼
   DirectML.dll  ◄── ships with Windows 10 1903+
        │
        ▼
   DirectX 12 / WDDM 2.x
        │
        ▼
   GPU Hardware (any vendor)
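The DirectML.dll layer in the stack above can be probed directly with ctypes. A small sketch, assuming only what the diagram states (the DLL ships in-box with Windows 10 1903+); on non-Windows platforms the check simply skips:

```python
import ctypes
import platform

# Sketch: probe for the in-box DirectML.dll runtime named in the diagram above.
dml_dll_present = False
if platform.system() == "Windows":
    try:
        ctypes.WinDLL("DirectML.dll")   # load the system DirectML runtime
        dml_dll_present = True
        print("DirectML.dll loaded: DirectML runtime is present")
    except OSError:
        print("DirectML.dll not found; this Windows build may predate 1903")
else:
    print(f"DirectML.dll check skipped on {platform.system()} (Windows only)")
```

On Windows this confirms the runtime exists before installing onnxruntime-directml on top of it.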

2.2 ONNX Export (cross-platform)¶

ONNX export works on any OS. DirectML execution requires Windows.

# ── 2.2 Export model to ONNX ─────────────────────────────────────────────────
onnx_available = False
try:
    import onnx
    import onnxruntime as ort
    print(f"onnx version         : {onnx.__version__}")
    print(f"onnxruntime version  : {ort.__version__}")
    onnx_available = True
except ImportError as e:
    print(f"Missing: {e}. Run: pip install onnx onnxruntime")
    print("Skipping ONNX export β€” dependency not installed.")

if onnx_available:
    onnx_path = Path("mini_classifier.onnx")

    torch.onnx.export(
        model,
        dummy_input,
        str(onnx_path),
        input_names=["image"],
        output_names=["logits"],
        dynamic_axes={"image": {0: "batch_size"}, "logits": {0: "batch_size"}},
        opset_version=17,
        verbose=False,
    )

    # Verify ONNX model
    onnx_model = onnx.load(str(onnx_path))
    onnx.checker.check_model(onnx_model)
    print(f"ONNX model saved   : {onnx_path} ({onnx_path.stat().st_size / 1024:.1f} KB)")
    print(f"ONNX opset         : {onnx_model.opset_import[0].version}")
    print(f"Nodes in graph     : {len(onnx_model.graph.node)}")
# ── 2.3 ONNX Runtime Inference — CPU Execution Provider ──────────────────────
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

cpu_sess = ort.InferenceSession(
    str(onnx_path),
    sess_options=sess_options,
    providers=["CPUExecutionProvider"],
)

ort_input  = {"image": dummy_input.cpu().numpy().astype(np.float32)}
ort_cpu_out = cpu_sess.run(["logits"], ort_input)[0]

pt_ref   = pt_out.detach().cpu().numpy()
abs_diff = np.abs(ort_cpu_out - pt_ref).max()
print(f"ORT CPU logits[:5] : {ort_cpu_out[0, :5].round(4)}")
print(f"PyTorch logits[:5] : {pt_ref[0, :5].round(4)}")
print(f"Max abs diff       : {abs_diff:.8f}")
print(f"Parity (tol=1e-4)  : {'PASS' if abs_diff < 1e-4 else 'FAIL'}")
# ── 2.4 DirectML Execution Provider (Windows only) ───────────────────────────
available_eps = ort.get_available_providers()
print(f"Available ONNX Runtime EPs: {available_eps}")

if "DmlExecutionProvider" in available_eps:
    dml_sess = ort.InferenceSession(
        str(onnx_path),
        providers=[("DmlExecutionProvider", {"device_id": 0})],
    )
    dml_out  = dml_sess.run(["logits"], ort_input)[0]
    abs_diff = np.abs(dml_out - pt_ref).max()
    print(f"DirectML logits[:5]: {dml_out[0, :5].round(4)}")
    print(f"Max abs diff       : {abs_diff:.8f}")
    print(f"Parity (tol=1e-2)  : {'PASS' if abs_diff < 1e-2 else 'FAIL'}")

    # Latency benchmark
    WARMUP, RUNS = 10, 200
    for _ in range(WARMUP):
        dml_sess.run(["logits"], ort_input)
    t0 = time.perf_counter()
    for _ in range(RUNS):
        dml_sess.run(["logits"], ort_input)
    dml_lat = (time.perf_counter() - t0) / RUNS * 1000

    for _ in range(WARMUP):
        cpu_sess.run(["logits"], ort_input)
    t0 = time.perf_counter()
    for _ in range(RUNS):
        cpu_sess.run(["logits"], ort_input)
    cpu_lat = (time.perf_counter() - t0) / RUNS * 1000

    print(f"\nLatency  β€” CPU     : {cpu_lat:.2f} ms")
    print(f"Latency  β€” DirectML: {dml_lat:.2f} ms")
    print(f"Speedup            : {cpu_lat / dml_lat:.2f}x")
else:
    print("DmlExecutionProvider not available on this platform.")
    print("On Windows, install: pip install onnxruntime-directml")
    print("\n── Simulating DirectML EP result using CPU EP for reference ──")

    # Cross-platform: benchmark CPU as stand-in, show what DML would report
    WARMUP, RUNS = 10, 200
    for _ in range(WARMUP):
        cpu_sess.run(["logits"], ort_input)
    t0 = time.perf_counter()
    for _ in range(RUNS):
        cpu_sess.run(["logits"], ort_input)
    cpu_lat = (time.perf_counter() - t0) / RUNS * 1000
    print(f"CPU EP latency     : {cpu_lat:.2f} ms (baseline)")
    print("Expected DML speedup on discrete GPU: 2–10x for small models, ")
    print("                                      10–50x for large transformer inference.")
# ── 2.5 ONNX Runtime CUDA EP (when available) ────────────────────────────────
# On Linux/Windows with NVIDIA GPU: use CUDAExecutionProvider instead of DML

if "CUDAExecutionProvider" in available_eps:
    cuda_sess = ort.InferenceSession(
        str(onnx_path),
        providers=[("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"],
    )
    cuda_out  = cuda_sess.run(["logits"], ort_input)[0]
    abs_diff  = np.abs(cuda_out - pt_ref).max()
    print(f"ORT CUDA logits[:5]: {cuda_out[0, :5].round(4)}")
    print(f"Max abs diff       : {abs_diff:.8f}")
    print(f"Parity (tol=1e-3)  : {'PASS' if abs_diff < 1e-3 else 'FAIL'}")
else:
    print("CUDAExecutionProvider not available β€” skipping.")
    print("On Linux/Windows with NVIDIA GPU, install: pip install onnxruntime-gpu")
# ── 2.6 ONNX Runtime EP Compatibility Matrix ─────────────────────────────────
EP_INFO = {
    "CPUExecutionProvider"    : {"OS": "All",     "GPU vendor": "N/A",   "install": "onnxruntime"},
    "CUDAExecutionProvider"   : {"OS": "Lin/Win", "GPU vendor": "NVIDIA","install": "onnxruntime-gpu"},
    "DmlExecutionProvider"    : {"OS": "Windows", "GPU vendor": "Any",   "install": "onnxruntime-directml"},
    "ROCMExecutionProvider"   : {"OS": "Linux",   "GPU vendor": "AMD",   "install": "onnxruntime-rocm"},
    "CoreMLExecutionProvider" : {"OS": "macOS/iOS","GPU vendor": "Apple","install": "onnxruntime-silicon"},
    "TensorrtExecutionProvider": {"OS": "Lin/Win","GPU vendor": "NVIDIA","install": "onnxruntime-gpu"},
    "OpenVINOExecutionProvider": {"OS": "All",    "GPU vendor": "Intel", "install": "onnxruntime-openvino"},
}

print(f"{'Execution Provider':<35} {'OS':<12} {'GPU':<10} {'Package'}")
print("-" * 80)
for ep, info in EP_INFO.items():
    available = "βœ“" if ep in available_eps else " "
    print(f"[{available}] {ep:<33} {info['OS']:<12} {info['GPU vendor']:<10} {info['install']}")

Part 3 — Vulkan (Cross-Platform)¶

3.1 What is Vulkan Compute?¶

Vulkan is a low-level, cross-platform GPU API by the Khronos Group. Unlike CUDA or Metal, it:

  • Works on any Vulkan-capable GPU (NVIDIA, AMD, Intel, Apple via MoltenVK, mobile)

  • Runs on Linux, Windows, macOS (macOS via MoltenVK translation layer)

  • Requires explicit memory management and compute shaders (GLSL/SPIR-V)

ML Runtimes that Use Vulkan¶

| Runtime     | Vulkan Use Case                 |
|-------------|---------------------------------|
| llama.cpp   | LLM inference on any Vulkan GPU |
| MLC-LLM     | Compiled LLM deployment         |
| NCNN        | Mobile/embedded inference       |
| Kompute     | General Vulkan ML compute       |
| whisper.cpp | Speech recognition on Vulkan    |

Vulkan vs CUDA vs Metal¶

| Feature             | Vulkan       | CUDA  | Metal  |
|---------------------|--------------|-------|--------|
| Portability         | ★★★★★        | ★☆☆☆☆ | ★★☆☆☆  |
| Performance ceiling | ★★★★☆        | ★★★★★ | ★★★★☆  |
| Ease of use         | ★★☆☆☆        | ★★★★☆ | ★★★☆☆  |
| Ecosystem (ML)      | ★★★☆☆        | ★★★★★ | ★★★☆☆  |
| macOS support       | Via MoltenVK | No    | Native |

3.2 Vulkan Device Enumeration¶

# ── 3.2 Vulkan Device Detection via vulkan Python bindings ───────────────────
try:
    import vulkan as vk
    VULKAN_AVAILABLE = True
    print("vulkan Python bindings loaded.")
except ImportError:
    VULKAN_AVAILABLE = False
    print("vulkan not installed. Run: pip install vulkan")
    print("Also install Vulkan SDK: https://vulkan.lunarg.com/")
# ── 3.3 Enumerate Vulkan Physical Devices ────────────────────────────────────
def enumerate_vulkan_devices():
    """List all Vulkan-capable GPUs and their properties."""
    if not VULKAN_AVAILABLE:
        print("Vulkan not available.")
        return []

    app_info = vk.VkApplicationInfo(
        sType=vk.VK_STRUCTURE_TYPE_APPLICATION_INFO,
        pApplicationName="zero-to-ai",
        applicationVersion=vk.VK_MAKE_VERSION(1, 0, 0),
        pEngineName="No Engine",
        engineVersion=vk.VK_MAKE_VERSION(1, 0, 0),
        apiVersion=vk.VK_API_VERSION_1_0,
    )
    create_info = vk.VkInstanceCreateInfo(
        sType=vk.VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        pApplicationInfo=app_info,
    )
    instance = vk.vkCreateInstance(create_info, None)

    physical_devices = vk.vkEnumeratePhysicalDevices(instance)
    devices = []

    DEVICE_TYPES = {
        vk.VK_PHYSICAL_DEVICE_TYPE_OTHER          : "Other",
        vk.VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU : "Integrated GPU",
        vk.VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU   : "Discrete GPU",
        vk.VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU    : "Virtual GPU",
        vk.VK_PHYSICAL_DEVICE_TYPE_CPU            : "CPU",
    }

    print(f"Found {len(physical_devices)} Vulkan device(s):\n")
    for i, dev in enumerate(physical_devices):
        props  = vk.vkGetPhysicalDeviceProperties(dev)
        mem    = vk.vkGetPhysicalDeviceMemoryProperties(dev)
        limits = props.limits

        vram_mb = sum(
            mem.memoryHeaps[j].size
            for j in range(mem.memoryHeapCount)
            if mem.memoryHeaps[j].flags & vk.VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
        ) // (1024 * 1024)

        major = vk.VK_VERSION_MAJOR(props.apiVersion)
        minor = vk.VK_VERSION_MINOR(props.apiVersion)
        patch = vk.VK_VERSION_PATCH(props.apiVersion)

        info = {
            "index"          : i,
            "name"           : props.deviceName,
            "type"           : DEVICE_TYPES.get(props.deviceType, "Unknown"),
            "api_version"    : f"{major}.{minor}.{patch}",
            "driver_version" : props.driverVersion,
            "vendor_id"      : hex(props.vendorID),
            "device_id"      : hex(props.deviceID),
            "vram_mb"        : vram_mb,
            "max_workgroup_x": limits.maxComputeWorkGroupSize[0],
            "max_workgroup_y": limits.maxComputeWorkGroupSize[1],
            "max_workgroup_z": limits.maxComputeWorkGroupSize[2],
            "max_shared_mem" : limits.maxComputeSharedMemorySize,
        }
        devices.append(info)

        print(f"[{i}] {info['name']}")
        print(f"    Type          : {info['type']}")
        print(f"    Vulkan API    : {info['api_version']}")
        print(f"    Vendor ID     : {info['vendor_id']}")
        print(f"    VRAM          : {vram_mb} MB")
        print(f"    Max workgroup : {info['max_workgroup_x']} x {info['max_workgroup_y']} x {info['max_workgroup_z']}")
        print(f"    Shared mem    : {info['max_shared_mem'] // 1024} KB\n")

    vk.vkDestroyInstance(instance, None)
    return devices


vulkan_devices = enumerate_vulkan_devices()
# ── 3.4 llama.cpp Vulkan Backend ─────────────────────────────────────────────
# llama.cpp supports Vulkan as a backend for cross-platform LLM inference.
# This cell shows how to build and run it.

LLAMACPP_VULKAN_BUILD = """
# Clone and build llama.cpp with Vulkan support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with Vulkan (requires Vulkan SDK installed)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run inference on any Vulkan GPU
# (select a specific device via the GGML_VK_VISIBLE_DEVICES env variable)
./build/bin/llama-cli \\
    --model models/llama-3.2-1b-q4_k_m.gguf \\
    --n-gpu-layers 33 \\
    --prompt "Hello, world!"

# List Vulkan devices
./build/bin/llama-cli --list-devices
"""

print("=== llama.cpp Vulkan Build Instructions ===")
print(LLAMACPP_VULKAN_BUILD)

# Check if llama-cli is installed via pip (llama-cpp-python with Vulkan)
try:
    from llama_cpp import Llama
    print("llama-cpp-python is installed.")
    print("To use Vulkan backend, rebuild with:")
    print('  CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --force-reinstall')
except ImportError:
    print("llama-cpp-python not installed.")
    print("Install with Vulkan: CMAKE_ARGS=\"-DGGML_VULKAN=ON\" pip install llama-cpp-python")
# ── 3.5 Detect Vulkan via System Tools (fallback when Python bindings absent) ─
def detect_vulkan_system():
    """Use system tools to detect Vulkan support without Python bindings."""
    results = {}

    # vulkaninfo (from the Vulkan SDK); try the terse --summary form first
    for tool in ["vulkaninfo --summary", "vulkaninfo"]:
        try:
            out = subprocess.run(
                tool.split(),
                capture_output=True, text=True, timeout=10
            )
            if out.returncode == 0:
                results["vulkaninfo"] = out.stdout[:500]
                print("vulkaninfo output (first 500 chars):")
                print(out.stdout[:500])
                return results
        except (FileNotFoundError, subprocess.TimeoutExpired):
            continue

    # macOS: check MoltenVK
    if SYSTEM == "Darwin":
        moltenvk_paths = [
            "/usr/local/lib/libMoltenVK.dylib",
            "/opt/homebrew/lib/libMoltenVK.dylib",
            Path.home() / ".local/lib/libMoltenVK.dylib",
        ]
        for p in moltenvk_paths:
            if Path(p).exists():
                results["moltenvk"] = str(p)
                print(f"MoltenVK found: {p}")

        # Homebrew Vulkan loader
        try:
            out = subprocess.run(["brew", "list", "vulkan-loader"],
                                 capture_output=True, text=True, timeout=5)
            if out.returncode == 0:
                results["vulkan_loader"] = "homebrew"
                print("Vulkan loader installed via Homebrew")
        except FileNotFoundError:
            pass

    # Linux: check for libvulkan
    if SYSTEM == "Linux":
        try:
            out = subprocess.run(["ldconfig", "-p"],
                                 capture_output=True, text=True, timeout=5)
            vulkan_libs = [l for l in out.stdout.splitlines() if "vulkan" in l.lower()]
            if vulkan_libs:
                results["vulkan_libs"] = vulkan_libs
                print("Vulkan libraries found:")
                for lib in vulkan_libs:
                    print(f"  {lib.strip()}")
        except FileNotFoundError:
            pass

    if not results:
        print("No Vulkan installation detected.")
        print("Install Vulkan SDK: https://vulkan.lunarg.com/sdk/home")
        if SYSTEM == "Darwin":
            print("Or via Homebrew: brew install vulkan-loader molten-vk")
        elif SYSTEM == "Linux":
            print("Or via apt: sudo apt install libvulkan1 vulkan-tools")

    return results


if not VULKAN_AVAILABLE:
    detect_vulkan_system()
# ── 3.6 PyTorch Backend Selection: CUDA vs ROCm vs MPS vs Vulkan ─────────────
# PyTorch itself does not expose a Vulkan compute path for general tensors;
# Vulkan is used by llama.cpp / MLC-LLM / NCNN as their own runtime.
# This cell compares PyTorch's available backends for context.

def select_best_torch_device():
    """Return the best available PyTorch device with fallback chain."""
    if torch.cuda.is_available():
        device = torch.device("cuda")
        props  = torch.cuda.get_device_properties(0)
        backend = f"CUDA (NVIDIA) — {props.name}, {props.total_memory // 1024**3} GB"
    elif torch.backends.mps.is_available():
        device  = torch.device("mps")
        backend = "Metal (Apple MPS)"
    else:
        device  = torch.device("cpu")
        backend = f"CPU ({platform.processor() or 'unknown'})"
    return device, backend


device, backend = select_best_torch_device()
print(f"Selected backend: {backend}")
print(f"torch.device    : {device}")

# Show all backends and their status
print("\n=== PyTorch Backend Status ===")
print(f"  CUDA (NVIDIA)  : {'available' if torch.cuda.is_available() else 'not available'}")
if torch.cuda.is_available():
    print(f"    Device count : {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"    [{i}] {torch.cuda.get_device_name(i)}")
print(f"  ROCm (AMD)     : {'available' if torch.cuda.is_available() and 'AMD' in torch.cuda.get_device_name(0) else 'not available (or CUDA not AMD)'}")
print(f"  Metal/MPS (Apple): {'available' if torch.backends.mps.is_available() else 'not available'}")
print(f"  Vulkan (via PT): not supported β€” use llama.cpp/MLC-LLM for Vulkan inference")

Part 4 — Cross-Backend Parity Validation¶

When deploying the same model across multiple backends, you must verify:

  1. Numerical parity — outputs agree within dtype-appropriate tolerance

  2. Performance ranking — understand expected speedup per backend

  3. Operator coverage — not all ops are supported on all backends
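Point 3 above, operator coverage, can be triaged with a plain set difference before ever touching the target backend. A sketch with illustrative, hand-written op lists (these sets are placeholders for this lab's model, not an authoritative support matrix):

```python
# Ops the MiniClassifier uses after ONNX export (illustrative list, not queried).
model_ops = {"Conv", "Relu", "GlobalAveragePool", "Flatten", "Gemm"}

# Hypothetical supported-op set for a target backend (placeholder data).
backend_ops = {"Conv", "Relu", "MatMul", "Gemm", "Flatten", "GlobalAveragePool"}

missing = sorted(model_ops - backend_ops)
if missing:
    print(f"Unsupported ops (expect CPU fallback or export failure): {missing}")
else:
    print("All model ops covered by the target backend")
```

In practice the model-side set comes from iterating `onnx_model.graph.node` (as in cell 2.2) and the backend-side set from that backend's documentation.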

Expected Tolerances Across Backends¶

| Backend                   | Precision  | Typical atol                |
|---------------------------|------------|-----------------------------|
| PyTorch FP32 (reference)  | FP32       | —                           |
| ONNX Runtime CPU          | FP32       | 1e-5                        |
| ONNX Runtime CUDA         | FP32       | 1e-4                        |
| CoreML FP16               | FP16       | 1e-2                        |
| CoreML FP32               | FP32       | 1e-4                        |
| DirectML                  | FP32/FP16  | 1e-3                        |
| Vulkan (llama.cpp Q4_K_M) | INT4 quant | 0.1–0.3 (perplexity metric) |
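The table above folds naturally into a small helper that picks the tolerance from the backend's precision. A sketch (the atol values are copied from the table; `parity_check` is our own name, not a library function):

```python
import numpy as np

# atol values taken from the tolerance table above.
ATOL = {"fp32": 1e-4, "fp16": 1e-2, "ort_cpu": 1e-5, "dml": 1e-3}

def parity_check(reference, candidate, precision="fp32"):
    """Return (max_abs_diff, passed) for two output arrays."""
    tol = ATOL[precision]
    diff = float(np.abs(np.asarray(candidate, dtype=np.float64)
                        - np.asarray(reference, dtype=np.float64)).max())
    return diff, diff < tol

# A ~1e-3 round-off passes at the FP16 tolerance but would fail at FP32's.
d, ok = parity_check([1.0, -2.0], [1.001, -2.001], precision="fp16")
print(f"max diff {d:.4f}: {'PASS' if ok else 'FAIL'}")
```

The same helper slots into the Part 4.1 report below in place of the flat 1e-2 threshold.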

# ── 4.1 Cross-Backend Parity Report ──────────────────────────────────────────
import dataclasses
from typing import Optional

@dataclasses.dataclass
class BackendResult:
    name       : str
    output     : Optional[np.ndarray]
    latency_ms : Optional[float]
    error      : Optional[str] = None


def run_ort_backend(ep_name, ep_options=None):
    """Run ONNX Runtime with a given execution provider."""
    providers = [(ep_name, ep_options or {})] if ep_options else [ep_name]
    if ep_name not in ort.get_available_providers():
        return BackendResult(ep_name, None, None, error="EP not available")
    try:
        sess = ort.InferenceSession(str(onnx_path), providers=providers)
        inp  = {"image": dummy_input.numpy().astype(np.float32)}
        for _ in range(5):  # warmup
            sess.run(["logits"], inp)
        t0  = time.perf_counter()
        for _ in range(100):
            out = sess.run(["logits"], inp)[0]
        lat = (time.perf_counter() - t0) / 100 * 1000
        return BackendResult(ep_name, out, lat)
    except Exception as e:
        return BackendResult(ep_name, None, None, error=str(e)[:80])


# Collect results
reference = pt_out.detach().cpu().numpy()
results   = [
    run_ort_backend("CPUExecutionProvider"),
    run_ort_backend("CUDAExecutionProvider"),
    run_ort_backend("DmlExecutionProvider"),
    run_ort_backend("ROCMExecutionProvider"),
    run_ort_backend("CoreMLExecutionProvider"),
]

print(f"{'Backend':<35} {'Status':<10} {'Max diff':<12} {'Latency (ms)'}")
print("-" * 75)
for r in results:
    if r.error:
        print(f"{r.name:<35} {'SKIP':<10} {'β€”':<12} β€” ({r.error})")
    else:
        diff = np.abs(r.output - reference).max()
        ok   = "PASS" if diff < 1e-2 else "FAIL"
        print(f"{r.name:<35} {ok:<10} {diff:<12.6f} {r.latency_ms:.2f} ms")

Part 5 — Exercises¶

  1. CoreML Compute Units (macOS): Convert a ResNet-18 to CoreML. Benchmark CPU_ONLY vs ALL. What speedup does the ANE provide? At what batch size does the GPU become faster than ANE?

  2. ONNX FP16 optimization: Export the model with FP16 weights using onnxconverter_common. Compare parity and latency vs FP32 on CPU EP.

  3. DirectML simulation (Windows): Install onnxruntime-directml. Run the parity matrix again. Which EP is fastest on your GPU?

  4. Vulkan device enumeration: Install the Vulkan SDK and pip install vulkan. Run the enumeration cell. How many devices does your machine expose? Is the integrated GPU also listed?

  5. llama.cpp Vulkan backend: Clone llama.cpp, build with -DGGML_VULKAN=ON, download a GGUF model (e.g. Llama-3.2-1B-Q4_K_M.gguf). Compare tokens/sec on CPU vs Vulkan GPU.

  6. Cross-backend parity: Export a model with a LayerNorm op. Compare ORT CPU vs CoreML FP16 numerically. Does FP16 cause any tokens to flip in a text classification task?

Key Takeaways¶

|                | CoreML                          | DirectML            | Vulkan                       |
|----------------|---------------------------------|---------------------|------------------------------|
| Best for       | Apple on-device                 | Windows any-GPU     | Cross-platform LLM inference |
| Entry point    | coremltools                     | ONNX Runtime DML EP | llama.cpp, vulkan bindings   |
| Precision      | FP16/FP32/INT8                  | FP32/FP16           | Quantized (Q4–Q8)            |
| Dynamic shapes | Limited (ANE)                   | Yes                 | Yes                          |
| Debugging      | Instruments / coremltools.utils | PIX / DX Capture    | RenderDoc / validation layers |

  • Always validate numerical parity when switching backends — FP16 and quantization introduce non-trivial differences

  • CoreML ANE requires static input shapes — use ct.RangeDim for bounded dynamic shapes

  • DirectML and Vulkan both skip the vendor-specific driver stack — great for portability, sometimes slower than native CUDA/ROCm

  • Vulkan's SPIR-V compute shaders are how llama.cpp achieves GPU inference on Intel, AMD, and Apple GPUs without CUDA
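The ct.RangeDim note above can be sketched as follows (a hedged example: it assumes coremltools is installed and degrades gracefully elsewhere; the 1..8 batch bounds are arbitrary illustration):

```python
# Sketch: bound the batch dimension so CoreML can plan for a small shape range
# instead of one static shape (the ANE itself still prefers fully static shapes).
flexible_input = None
try:
    import coremltools as ct
    flexible_input = ct.TensorType(
        name="image",
        shape=ct.Shape(shape=(ct.RangeDim(lower_bound=1, upper_bound=8, default=1),
                              3, 32, 32)),
    )
    print("Flexible input created:", flexible_input)
    # Pass inputs=[flexible_input] to ct.convert() in place of the static shape.
except ImportError:
    print("coremltools not installed: pip install coremltools")
```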

Previous: lab_06_framework_validation.ipynb
Next: lab_08_benchmarking.ipynb — Benchmarking: TTFT, throughput, SLOs
Back to Overview: README.md