Lab 07 — GPGPU Backends: CoreML · DirectML · Vulkan¶

Goal: Understand and validate the three major GPGPU backends not covered in earlier labs.

| Backend  | Vendor    | OS                   | Primary Use                                |
|----------|-----------|----------------------|--------------------------------------------|
| CoreML   | Apple     | macOS / iOS / iPadOS | On-device inference via ANE, GPU, CPU      |
| DirectML | Microsoft | Windows              | DirectX 12 GPU acceleration for ML         |
| Vulkan   | Khronos   | Cross-platform       | Low-level GPU compute (llama.cpp, MLC-LLM) |

Prerequisites¶

# CoreML
pip install coremltools torch torchvision

# DirectML (Windows only for execution; ONNX export works everywhere)
pip install onnx onnxruntime
# On Windows: pip install onnxruntime-directml

# Vulkan Python bindings
pip install vulkan
# System: install Vulkan SDK from https://vulkan.lunarg.com/
import platform
import sys
import json
import time
import subprocess
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn

print(f"Python   : {sys.version.split()[0]}")
print(f"PyTorch  : {torch.__version__}")
print(f"Platform : {platform.system()} {platform.machine()}")
print(f"MPS avail: {torch.backends.mps.is_available()}")
print(f"CUDA avail: {torch.cuda.is_available()}")

SYSTEM = platform.system()   # 'Darwin', 'Windows', 'Linux'
ARCH   = platform.machine()  # 'arm64', 'x86_64', 'AMD64'

DEVICE = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"DEVICE   : {DEVICE}")

Part 1 — CoreML (Apple)¶

1.1 What is CoreML?¶

CoreML is Apple's on-device ML framework. It dispatches workloads across three compute units:

| Unit                      | Hardware          | Best For                            |
|---------------------------|-------------------|-------------------------------------|
| ANE (Apple Neural Engine) | Dedicated ML chip | Fixed-shape inference, transformers |
| GPU                       | Metal GPU         | Flexible ops, conv, matmul          |
| CPU                       | ARM cores         | Fallback, dynamic shapes            |

Compute Unit Selection¶

import coremltools as ct

ct.ComputeUnit.ALL          # ANE → GPU → CPU (default, best performance)
ct.ComputeUnit.CPU_AND_NE   # Skip GPU; good for power efficiency
ct.ComputeUnit.CPU_AND_GPU  # Skip ANE; good for dynamic shapes
ct.ComputeUnit.CPU_ONLY     # Deterministic, debug mode

CoreML Model Formats¶

| Format        | Extension  | Notes                             |
|---------------|------------|-----------------------------------|
| ML Program    | .mlpackage | Modern (CoreML 5+), ANE-optimized |
| NeuralNetwork | .mlmodel   | Legacy, still widely used         |

Conversion path: PyTorch → TorchScript → coremltools.convert() → .mlpackage

# ── Shared model used across all three backends ──────────────────────────────
class MiniClassifier(nn.Module):
    """Small CNN for CIFAR-10-style 3x32x32 input β†’ 10 classes."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = MiniClassifier().to(DEVICE).eval()
dummy_input = torch.randn(1, 3, 32, 32, device=DEVICE)
with torch.no_grad():
    pt_out = model(dummy_input)
print(f"PyTorch device       : {DEVICE}")
print(f"PyTorch output shape : {pt_out.shape}")
print(f"PyTorch output (logits[:5]): {pt_out[0, :5].cpu().numpy().round(4)}")
# ── 1.2 Convert PyTorch → CoreML ─────────────────────────────────────────────
if SYSTEM != "Darwin":
    print("CoreML requires macOS. Skipping conversion β€” showing code only.")
else:
    try:
        import coremltools as ct
        print(f"coremltools version: {ct.__version__}")

        # Step 1: TorchScript trace
        traced = torch.jit.trace(model, dummy_input)

        # Step 2: Convert to CoreML ML Program (.mlpackage)
        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(name="image", shape=dummy_input.shape)],
            outputs=[ct.TensorType(name="logits")],
            convert_to="mlprogram",          # modern format (ANE-compatible)
            compute_precision=ct.precision.FLOAT16,  # ANE prefers FP16
            minimum_deployment_target=ct.target.macOS13,
        )

        save_path = Path("mini_classifier.mlpackage")
        mlmodel.save(str(save_path))
        print(f"Saved: {save_path}")
        print(f"Package size: {sum(f.stat().st_size for f in save_path.rglob('*') if f.is_file()) / 1024:.1f} KB")

    except ImportError:
        print("coremltools not installed. Run: pip install coremltools")
# ── 1.3 CoreML Inference & Parity Check ──────────────────────────────────────
if SYSTEM != "Darwin":
    print("Skipping β€” macOS only.")
else:
    try:
        import coremltools as ct
        mlmodel = ct.models.MLModel("mini_classifier.mlpackage",
                                    compute_units=ct.ComputeUnit.ALL)

        # CoreML expects numpy input
        cml_input = {"image": dummy_input.cpu().numpy().astype(np.float32)}
        cml_out   = mlmodel.predict(cml_input)["logits"]

        # Compare with PyTorch FP32 reference
        pt_ref = pt_out.detach().cpu().numpy()
        abs_diff = np.abs(cml_out - pt_ref).max()
        print(f"CoreML logits[:5] : {cml_out[0, :5].round(4)}")
        print(f"PyTorch logits[:5]: {pt_ref[0, :5].round(4)}")
        print(f"Max abs difference : {abs_diff:.6f}")

        # FP16 conversion introduces ~1e-2 tolerance
        tol = 1e-2
        status = "PASS" if abs_diff < tol else "FAIL"
        print(f"Parity check (tol={tol}): {status}")

    except Exception as e:
        print(f"Error: {e}")
# ── 1.4 Benchmark: CPU_ONLY vs CPU_AND_GPU vs ALL (ANE) ──────────────────────
if SYSTEM != "Darwin":
    print("Skipping β€” macOS only.")
else:
    try:
        import coremltools as ct

        compute_units = {
            "CPU_ONLY"    : ct.ComputeUnit.CPU_ONLY,
            "CPU_AND_GPU" : ct.ComputeUnit.CPU_AND_GPU,
            "ALL (ANE)"   : ct.ComputeUnit.ALL,
        }
        cml_input = {"image": dummy_input.cpu().numpy().astype(np.float32)}
        WARMUP, RUNS = 5, 50

        print(f"{'Compute Unit':<20} {'Latency (ms)':<15} {'Throughput (inf/s)':<20}")
        print("-" * 55)

        for name, cu in compute_units.items():
            m = ct.models.MLModel("mini_classifier.mlpackage", compute_units=cu)
            for _ in range(WARMUP):
                m.predict(cml_input)
            t0 = time.perf_counter()
            for _ in range(RUNS):
                m.predict(cml_input)
            elapsed = time.perf_counter() - t0
            lat_ms  = elapsed / RUNS * 1000
            tput    = RUNS / elapsed
            print(f"{name:<20} {lat_ms:<15.2f} {tput:<20.1f}")

    except Exception as e:
        print(f"Error: {e}")
# ── 1.5 Inspect CoreML Model Spec ────────────────────────────────────────────
if SYSTEM != "Darwin":
    print("Skipping β€” macOS only.")
else:
    try:
        import coremltools as ct
        mlmodel = ct.models.MLModel("mini_classifier.mlpackage")
        spec    = mlmodel.get_spec()

        print("=== Model Description ===")
        print(f"Type             : {spec.WhichOneof('Type')}")
        print(f"CoreML version   : {spec.specificationVersion}")

        print("\n=== Inputs ===")
        for inp in mlmodel.input_description._fd_spec:
            print(f"  {inp.name}")

        print("\n=== Outputs ===")
        for out in mlmodel.output_description._fd_spec:
            print(f"  {out.name}")

        # Operator count (ML Program)
        try:
            prog = spec.mlProgram
            ops  = []
            for fn in prog.functions.values():
                for blk in fn.block_specializations.values():
                    ops.extend(op.type for op in blk.operations)
            from collections import Counter
            print("\n=== Top Operators ===")
            for op, cnt in Counter(ops).most_common(10):
                print(f"  {op:<30} {cnt}")
        except Exception:
            print("(operator inspection not available for this format)")

    except Exception as e:
        print(f"Error: {e}")

Part 2 — DirectML (Microsoft / Windows)¶

2.1 What is DirectML?¶

DirectML is a low-level DirectX 12 API for GPU-accelerated ML on Windows. It is the backend used by:

  • ONNX Runtime (ExecutionProvider = 'DmlExecutionProvider')

  • Windows ML (WinML) — for UWP/desktop app inference

  • PyTorch-DirectML (torch_directml package, experimental)
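The third entry point above, the experimental torch_directml package, can be probed with a short sketch (guarded so it runs anywhere; on non-Windows machines it just reports the fallback):

```python
import importlib.util

# Sketch: torch_directml exposes a DirectML-backed torch.device on Windows.
dml_device = None
if importlib.util.find_spec("torch_directml") is not None:
    import torch_directml
    dml_device = torch_directml.device()   # usable as a torch tensor device
    print(f"torch_directml device: {dml_device}")
else:
    print("torch_directml not installed (Windows only): pip install torch-directml")
```

Tensors created with `device=dml_device` then execute on the DirectML stack shown in the architecture diagram below.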

Why DirectML Matters¶

| Feature            | DirectML                   | CUDA                       |
|--------------------|----------------------------|----------------------------|
| GPU vendor         | Any (NVIDIA, AMD, Intel)   | NVIDIA only                |
| OS                 | Windows only               | Linux / Windows            |
| Runtime            | DirectX 12 driver          | CUDA driver + toolkit      |
| API level          | Low-level (operator graph) | High-level (cuBLAS, cuDNN) |
| Python entry point | ONNX Runtime DML EP        | torch.cuda                 |

Architecture¶

Python (PyTorch / ONNX Runtime)
        │
        ▼
  DmlExecutionProvider  ◄── onnxruntime-directml
        │
        ▼
   DirectML.dll  ◄── ships with Windows 10 1903+
        │
        ▼
   DirectX 12 / WDDM 2.x
        │
        ▼
   GPU Hardware (any vendor)
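The DirectML.dll layer in the stack above can be probed directly with ctypes. A small sketch, assuming only what the diagram states (the DLL ships in-box with Windows 10 1903+); on non-Windows platforms the check simply skips:

```python
import ctypes
import platform

# Sketch: probe for the in-box DirectML.dll runtime named in the diagram above.
dml_dll_present = False
if platform.system() == "Windows":
    try:
        ctypes.WinDLL("DirectML.dll")   # load the system DirectML runtime
        dml_dll_present = True
        print("DirectML.dll loaded: DirectML runtime is present")
    except OSError:
        print("DirectML.dll not found; this Windows build may predate 1903")
else:
    print(f"DirectML.dll check skipped on {platform.system()} (Windows only)")
```

On Windows this confirms the runtime exists before installing onnxruntime-directml on top of it.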

2.2 ONNX Export (cross-platform)¶

ONNX export works on any OS. DirectML execution requires Windows.

# ── 2.2 Export model to ONNX ─────────────────────────────────────────────────
onnx_available = False
try:
    import onnx
    import onnxruntime as ort
    print(f"onnx version         : {onnx.__version__}")
    print(f"onnxruntime version  : {ort.__version__}")
    onnx_available = True
except ImportError as e:
    print(f"Missing: {e}. Run: pip install onnx onnxruntime")
    print("Skipping ONNX export β€” dependency not installed.")

if onnx_available:
    onnx_path = Path("mini_classifier.onnx")

    torch.onnx.export(
        model,
        dummy_input,
        str(onnx_path),
        input_names=["image"],
        output_names=["logits"],
        dynamic_axes={"image": {0: "batch_size"}, "logits": {0: "batch_size"}},
        opset_version=17,
        verbose=False,
    )

    # Verify ONNX model
    onnx_model = onnx.load(str(onnx_path))
    onnx.checker.check_model(onnx_model)
    print(f"ONNX model saved   : {onnx_path} ({onnx_path.stat().st_size / 1024:.1f} KB)")
    print(f"ONNX opset         : {onnx_model.opset_import[0].version}")
    print(f"Nodes in graph     : {len(onnx_model.graph.node)}")
# ── 2.3 ONNX Runtime Inference — CPU Execution Provider ──────────────────────
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

cpu_sess = ort.InferenceSession(
    str(onnx_path),
    sess_options=sess_options,
    providers=["CPUExecutionProvider"],
)

ort_input  = {"image": dummy_input.cpu().numpy().astype(np.float32)}
ort_cpu_out = cpu_sess.run(["logits"], ort_input)[0]

pt_ref   = pt_out.detach().cpu().numpy()
abs_diff = np.abs(ort_cpu_out - pt_ref).max()
print(f"ORT CPU logits[:5] : {ort_cpu_out[0, :5].round(4)}")
print(f"PyTorch logits[:5] : {pt_ref[0, :5].round(4)}")
print(f"Max abs diff       : {abs_diff:.8f}")
print(f"Parity (tol=1e-4)  : {'PASS' if abs_diff < 1e-4 else 'FAIL'}")
# ── 2.4 DirectML Execution Provider (Windows only) ───────────────────────────
available_eps = ort.get_available_providers()
print(f"Available ONNX Runtime EPs: {available_eps}")

if "DmlExecutionProvider" in available_eps:
    dml_sess = ort.InferenceSession(
        str(onnx_path),
        providers=[("DmlExecutionProvider", {"device_id": 0})],
    )
    dml_out  = dml_sess.run(["logits"], ort_input)[0]
    abs_diff = np.abs(dml_out - pt_ref).max()
    print(f"DirectML logits[:5]: {dml_out[0, :5].round(4)}")
    print(f"Max abs diff       : {abs_diff:.8f}")
    print(f"Parity (tol=1e-2)  : {'PASS' if abs_diff < 1e-2 else 'FAIL'}")

    # Latency benchmark
    WARMUP, RUNS = 10, 200
    for _ in range(WARMUP):
        dml_sess.run(["logits"], ort_input)
    t0 = time.perf_counter()
    for _ in range(RUNS):
        dml_sess.run(["logits"], ort_input)
    dml_lat = (time.perf_counter() - t0) / RUNS * 1000

    for _ in range(WARMUP):
        cpu_sess.run(["logits"], ort_input)
    t0 = time.perf_counter()
    for _ in range(RUNS):
        cpu_sess.run(["logits"], ort_input)
    cpu_lat = (time.perf_counter() - t0) / RUNS * 1000

    print(f"\nLatency  β€” CPU     : {cpu_lat:.2f} ms")
    print(f"Latency  β€” DirectML: {dml_lat:.2f} ms")
    print(f"Speedup            : {cpu_lat / dml_lat:.2f}x")
else:
    print("DmlExecutionProvider not available on this platform.")
    print("On Windows, install: pip install onnxruntime-directml")
    print("\n── Simulating DirectML EP result using CPU EP for reference ──")

    # Cross-platform: benchmark CPU as stand-in, show what DML would report
    WARMUP, RUNS = 10, 200
    for _ in range(WARMUP):
        cpu_sess.run(["logits"], ort_input)
    t0 = time.perf_counter()
    for _ in range(RUNS):
        cpu_sess.run(["logits"], ort_input)
    cpu_lat = (time.perf_counter() - t0) / RUNS * 1000
    print(f"CPU EP latency     : {cpu_lat:.2f} ms (baseline)")
    print("Expected DML speedup on discrete GPU: 2–10x for small models, ")
    print("                                      10–50x for large transformer inference.")
# ── 2.5 ONNX Runtime CUDA EP (when available) ────────────────────────────────
# On Linux/Windows with NVIDIA GPU: use CUDAExecutionProvider instead of DML

if "CUDAExecutionProvider" in available_eps:
    cuda_sess = ort.InferenceSession(
        str(onnx_path),
        providers=[("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"],
    )
    cuda_out  = cuda_sess.run(["logits"], ort_input)[0]
    abs_diff  = np.abs(cuda_out - pt_ref).max()
    print(f"ORT CUDA logits[:5]: {cuda_out[0, :5].round(4)}")
    print(f"Max abs diff       : {abs_diff:.8f}")
    print(f"Parity (tol=1e-3)  : {'PASS' if abs_diff < 1e-3 else 'FAIL'}")
else:
    print("CUDAExecutionProvider not available β€” skipping.")
    print("On Linux/Windows with NVIDIA GPU, install: pip install onnxruntime-gpu")
# ── 2.6 ONNX Runtime EP Compatibility Matrix ─────────────────────────────────
EP_INFO = {
    "CPUExecutionProvider"    : {"OS": "All",     "GPU vendor": "N/A",   "install": "onnxruntime"},
    "CUDAExecutionProvider"   : {"OS": "Lin/Win", "GPU vendor": "NVIDIA","install": "onnxruntime-gpu"},
    "DmlExecutionProvider"    : {"OS": "Windows", "GPU vendor": "Any",   "install": "onnxruntime-directml"},
    "ROCMExecutionProvider"   : {"OS": "Linux",   "GPU vendor": "AMD",   "install": "onnxruntime-rocm"},
    "CoreMLExecutionProvider" : {"OS": "macOS/iOS","GPU vendor": "Apple","install": "onnxruntime-silicon"},
    "TensorrtExecutionProvider": {"OS": "Lin/Win","GPU vendor": "NVIDIA","install": "onnxruntime-gpu"},
    "OpenVINOExecutionProvider": {"OS": "All",    "GPU vendor": "Intel", "install": "onnxruntime-openvino"},
}

print(f"{'Execution Provider':<35} {'OS':<12} {'GPU':<10} {'Package'}")
print("-" * 80)
for ep, info in EP_INFO.items():
    available = "βœ“" if ep in available_eps else " "
    print(f"[{available}] {ep:<33} {info['OS']:<12} {info['GPU vendor']:<10} {info['install']}")

Part 3 — Vulkan (Cross-Platform)¶

3.1 What is Vulkan Compute?¶

Vulkan is a low-level, cross-platform GPU API by the Khronos Group. Unlike CUDA or Metal, it:

  • Works on any Vulkan-capable GPU (NVIDIA, AMD, Intel, Apple via MoltenVK, mobile)

  • Runs on Linux, Windows, macOS (macOS via MoltenVK translation layer)

  • Requires explicit memory management and compute shaders (GLSL/SPIR-V)

ML Runtimes that Use Vulkan¶

| Runtime     | Vulkan Use Case                 |
|-------------|---------------------------------|
| llama.cpp   | LLM inference on any Vulkan GPU |
| MLC-LLM     | Compiled LLM deployment         |
| NCNN        | Mobile/embedded inference       |
| Kompute     | General Vulkan ML compute       |
| whisper.cpp | Speech recognition on Vulkan    |

Vulkan vs CUDA vs Metal¶

| Feature             | Vulkan       | CUDA  | Metal  |
|---------------------|--------------|-------|--------|
| Portability         | ★★★★★        | ★☆☆☆☆ | ★★☆☆☆  |
| Performance ceiling | ★★★★☆        | ★★★★★ | ★★★★☆  |
| Ease of use         | ★★☆☆☆        | ★★★★☆ | ★★★☆☆  |
| Ecosystem (ML)      | ★★★☆☆        | ★★★★★ | ★★★☆☆  |
| macOS support       | Via MoltenVK | No    | Native |

3.2 Vulkan Device Enumeration¶

# ── 3.2 Vulkan Device Detection via vulkan Python bindings ───────────────────
try:
    import vulkan as vk
    VULKAN_AVAILABLE = True
    print("vulkan Python bindings loaded.")
except ImportError:
    VULKAN_AVAILABLE = False
    print("vulkan not installed. Run: pip install vulkan")
    print("Also install Vulkan SDK: https://vulkan.lunarg.com/")
# ── 3.3 Enumerate Vulkan Physical Devices ────────────────────────────────────
def enumerate_vulkan_devices():
    """List all Vulkan-capable GPUs and their properties."""
    if not VULKAN_AVAILABLE:
        print("Vulkan not available.")
        return []

    app_info = vk.VkApplicationInfo(
        sType=vk.VK_STRUCTURE_TYPE_APPLICATION_INFO,
        pApplicationName="zero-to-ai",
        applicationVersion=vk.VK_MAKE_VERSION(1, 0, 0),
        pEngineName="No Engine",
        engineVersion=vk.VK_MAKE_VERSION(1, 0, 0),
        apiVersion=vk.VK_API_VERSION_1_0,
    )
    create_info = vk.VkInstanceCreateInfo(
        sType=vk.VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        pApplicationInfo=app_info,
    )
    instance = vk.vkCreateInstance(create_info, None)

    physical_devices = vk.vkEnumeratePhysicalDevices(instance)
    devices = []

    DEVICE_TYPES = {
        vk.VK_PHYSICAL_DEVICE_TYPE_OTHER          : "Other",
        vk.VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU : "Integrated GPU",
        vk.VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU   : "Discrete GPU",
        vk.VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU    : "Virtual GPU",
        vk.VK_PHYSICAL_DEVICE_TYPE_CPU            : "CPU",
    }

    print(f"Found {len(physical_devices)} Vulkan device(s):\n")
    for i, dev in enumerate(physical_devices):
        props  = vk.vkGetPhysicalDeviceProperties(dev)
        mem    = vk.vkGetPhysicalDeviceMemoryProperties(dev)
        limits = props.limits

        vram_mb = sum(
            mem.memoryHeaps[j].size
            for j in range(mem.memoryHeapCount)
            if mem.memoryHeaps[j].flags & vk.VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
        ) // (1024 * 1024)

        major = vk.VK_VERSION_MAJOR(props.apiVersion)
        minor = vk.VK_VERSION_MINOR(props.apiVersion)
        patch = vk.VK_VERSION_PATCH(props.apiVersion)

        info = {
            "index"          : i,
            "name"           : props.deviceName,
            "type"           : DEVICE_TYPES.get(props.deviceType, "Unknown"),
            "api_version"    : f"{major}.{minor}.{patch}",
            "driver_version" : props.driverVersion,
            "vendor_id"      : hex(props.vendorID),
            "device_id"      : hex(props.deviceID),
            "vram_mb"        : vram_mb,
            "max_workgroup_x": limits.maxComputeWorkGroupSize[0],
            "max_workgroup_y": limits.maxComputeWorkGroupSize[1],
            "max_workgroup_z": limits.maxComputeWorkGroupSize[2],
            "max_shared_mem" : limits.maxComputeSharedMemorySize,
        }
        devices.append(info)

        print(f"[{i}] {info['name']}")
        print(f"    Type          : {info['type']}")
        print(f"    Vulkan API    : {info['api_version']}")
        print(f"    Vendor ID     : {info['vendor_id']}")
        print(f"    VRAM          : {vram_mb} MB")
        print(f"    Max workgroup : {info['max_workgroup_x']} x {info['max_workgroup_y']} x {info['max_workgroup_z']}")
        print(f"    Shared mem    : {info['max_shared_mem'] // 1024} KB\n")

    vk.vkDestroyInstance(instance, None)
    return devices


vulkan_devices = enumerate_vulkan_devices()
# ── 3.4 llama.cpp Vulkan Backend ─────────────────────────────────────────────
# llama.cpp supports Vulkan as a backend for cross-platform LLM inference.
# This cell shows how to build and run it.

LLAMACPP_VULKAN_BUILD = """
# Clone and build llama.cpp with Vulkan support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with Vulkan (requires Vulkan SDK installed)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run inference on any Vulkan GPU
# (select a specific device via the GGML_VK_VISIBLE_DEVICES env variable)
./build/bin/llama-cli \\
    --model models/llama-3.2-1b-q4_k_m.gguf \\
    --n-gpu-layers 33 \\
    --prompt "Hello, world!"

# List Vulkan devices
./build/bin/llama-cli --list-devices
"""

print("=== llama.cpp Vulkan Build Instructions ===")
print(LLAMACPP_VULKAN_BUILD)

# Check if llama-cli is installed via pip (llama-cpp-python with Vulkan)
try:
    from llama_cpp import Llama
    print("llama-cpp-python is installed.")
    print("To use Vulkan backend, rebuild with:")
    print('  CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --force-reinstall')
except ImportError:
    print("llama-cpp-python not installed.")
    print("Install with Vulkan: CMAKE_ARGS=\"-DGGML_VULKAN=ON\" pip install llama-cpp-python")
# ── 3.5 Detect Vulkan via System Tools (fallback when Python bindings absent) ─
def detect_vulkan_system():
    """Use system tools to detect Vulkan support without Python bindings."""
    results = {}

    # vulkaninfo (from the Vulkan SDK); try the terse --summary form first
    for tool in ["vulkaninfo --summary", "vulkaninfo"]:
        try:
            out = subprocess.run(
                tool.split(),
                capture_output=True, text=True, timeout=10
            )
            if out.returncode == 0:
                results["vulkaninfo"] = out.stdout[:500]
                print("vulkaninfo output (first 500 chars):")
                print(out.stdout[:500])
                return results
        except (FileNotFoundError, subprocess.TimeoutExpired):
            continue

    # macOS: check MoltenVK
    if SYSTEM == "Darwin":
        moltenvk_paths = [
            "/usr/local/lib/libMoltenVK.dylib",
            "/opt/homebrew/lib/libMoltenVK.dylib",
            Path.home() / ".local/lib/libMoltenVK.dylib",
        ]
        for p in moltenvk_paths:
            if Path(p).exists():
                results["moltenvk"] = str(p)
                print(f"MoltenVK found: {p}")

        # Homebrew Vulkan loader
        try:
            out = subprocess.run(["brew", "list", "vulkan-loader"],
                                 capture_output=True, text=True, timeout=5)
            if out.returncode == 0:
                results["vulkan_loader"] = "homebrew"
                print("Vulkan loader installed via Homebrew")
        except FileNotFoundError:
            pass

    # Linux: check for libvulkan
    if SYSTEM == "Linux":
        try:
            out = subprocess.run(["ldconfig", "-p"],
                                 capture_output=True, text=True, timeout=5)
            vulkan_libs = [l for l in out.stdout.splitlines() if "vulkan" in l.lower()]
            if vulkan_libs:
                results["vulkan_libs"] = vulkan_libs
                print("Vulkan libraries found:")
                for lib in vulkan_libs:
                    print(f"  {lib.strip()}")
        except FileNotFoundError:
            pass

    if not results:
        print("No Vulkan installation detected.")
        print("Install Vulkan SDK: https://vulkan.lunarg.com/sdk/home")
        if SYSTEM == "Darwin":
            print("Or via Homebrew: brew install vulkan-loader molten-vk")
        elif SYSTEM == "Linux":
            print("Or via apt: sudo apt install libvulkan1 vulkan-tools")

    return results


if not VULKAN_AVAILABLE:
    detect_vulkan_system()
# ── 3.6 PyTorch Backend Selection: CUDA vs ROCm vs MPS vs Vulkan ─────────────
# PyTorch itself does not expose a Vulkan compute path for general tensors;
# Vulkan is used by llama.cpp / MLC-LLM / NCNN as their own runtime.
# This cell compares PyTorch's available backends for context.

def select_best_torch_device():
    """Return the best available PyTorch device with fallback chain."""
    if torch.cuda.is_available():
        device = torch.device("cuda")
        props  = torch.cuda.get_device_properties(0)
        backend = f"CUDA (NVIDIA) — {props.name}, {props.total_memory // 1024**3} GB"
    elif torch.backends.mps.is_available():
        device  = torch.device("mps")
        backend = "Metal (Apple MPS)"
    else:
        device  = torch.device("cpu")
        backend = f"CPU ({platform.processor() or 'unknown'})"
    return device, backend


device, backend = select_best_torch_device()
print(f"Selected backend: {backend}")
print(f"torch.device    : {device}")

# Show all backends and their status
print("\n=== PyTorch Backend Status ===")
print(f"  CUDA (NVIDIA)  : {'available' if torch.cuda.is_available() else 'not available'}")
if torch.cuda.is_available():
    print(f"    Device count : {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"    [{i}] {torch.cuda.get_device_name(i)}")
print(f"  ROCm (AMD)     : {'available' if torch.cuda.is_available() and 'AMD' in torch.cuda.get_device_name(0) else 'not available (or CUDA not AMD)'}")
print(f"  Metal/MPS (Apple): {'available' if torch.backends.mps.is_available() else 'not available'}")
print(f"  Vulkan (via PT): not supported β€” use llama.cpp/MLC-LLM for Vulkan inference")

Part 4 — Cross-Backend Parity Validation¶

When deploying the same model across multiple backends, you must verify:

  1. Numerical parity — outputs agree within dtype-appropriate tolerance

  2. Performance ranking — understand expected speedup per backend

  3. Operator coverage — not all ops are supported on all backends
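Point 3 above, operator coverage, can be triaged with a plain set difference before ever touching the target backend. A sketch with illustrative, hand-written op lists (these sets are placeholders for this lab's model, not an authoritative support matrix):

```python
# Ops the MiniClassifier uses after ONNX export (illustrative list, not queried).
model_ops = {"Conv", "Relu", "GlobalAveragePool", "Flatten", "Gemm"}

# Hypothetical supported-op set for a target backend (placeholder data).
backend_ops = {"Conv", "Relu", "MatMul", "Gemm", "Flatten", "GlobalAveragePool"}

missing = sorted(model_ops - backend_ops)
if missing:
    print(f"Unsupported ops (expect CPU fallback or export failure): {missing}")
else:
    print("All model ops covered by the target backend")
```

In practice the model-side set comes from iterating `onnx_model.graph.node` (as in cell 2.2) and the backend-side set from that backend's documentation.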

Expected Tolerances Across Backends¶

| Backend                   | Precision  | Typical atol                |
|---------------------------|------------|-----------------------------|
| PyTorch FP32 (reference)  | FP32       | —                           |
| ONNX Runtime CPU          | FP32       | 1e-5                        |
| ONNX Runtime CUDA         | FP32       | 1e-4                        |
| CoreML FP16               | FP16       | 1e-2                        |
| CoreML FP32               | FP32       | 1e-4                        |
| DirectML                  | FP32/FP16  | 1e-3                        |
| Vulkan (llama.cpp Q4_K_M) | INT4 quant | 0.1–0.3 (perplexity metric) |
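The table above folds naturally into a small helper that picks the tolerance from the backend's precision. A sketch (the atol values are copied from the table; `parity_check` is our own name, not a library function):

```python
import numpy as np

# atol values taken from the tolerance table above.
ATOL = {"fp32": 1e-4, "fp16": 1e-2, "ort_cpu": 1e-5, "dml": 1e-3}

def parity_check(reference, candidate, precision="fp32"):
    """Return (max_abs_diff, passed) for two output arrays."""
    tol = ATOL[precision]
    diff = float(np.abs(np.asarray(candidate, dtype=np.float64)
                        - np.asarray(reference, dtype=np.float64)).max())
    return diff, diff < tol

# A ~1e-3 round-off passes at the FP16 tolerance but would fail at FP32's.
d, ok = parity_check([1.0, -2.0], [1.001, -2.001], precision="fp16")
print(f"max diff {d:.4f}: {'PASS' if ok else 'FAIL'}")
```

The same helper slots into the Part 4.1 report below in place of the flat 1e-2 threshold.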

# ── 4.1 Cross-Backend Parity Report ──────────────────────────────────────────
import dataclasses
from typing import Optional

@dataclasses.dataclass
class BackendResult:
    name       : str
    output     : Optional[np.ndarray]
    latency_ms : Optional[float]
    error      : Optional[str] = None


def run_ort_backend(ep_name, ep_options=None):
    """Run ONNX Runtime with a given execution provider."""
    providers = [(ep_name, ep_options or {})] if ep_options else [ep_name]
    if ep_name not in ort.get_available_providers():
        return BackendResult(ep_name, None, None, error="EP not available")
    try:
        sess = ort.InferenceSession(str(onnx_path), providers=providers)
        inp  = {"image": dummy_input.numpy().astype(np.float32)}
        for _ in range(5):  # warmup
            sess.run(["logits"], inp)
        t0  = time.perf_counter()
        for _ in range(100):
            out = sess.run(["logits"], inp)[0]
        lat = (time.perf_counter() - t0) / 100 * 1000
        return BackendResult(ep_name, out, lat)
    except Exception as e:
        return BackendResult(ep_name, None, None, error=str(e)[:80])


# Collect results
reference = pt_out.detach().cpu().numpy()
results   = [
    run_ort_backend("CPUExecutionProvider"),
    run_ort_backend("CUDAExecutionProvider"),
    run_ort_backend("DmlExecutionProvider"),
    run_ort_backend("ROCMExecutionProvider"),
    run_ort_backend("CoreMLExecutionProvider"),
]

print(f"{'Backend':<35} {'Status':<10} {'Max diff':<12} {'Latency (ms)'}")
print("-" * 75)
for r in results:
    if r.error:
        print(f"{r.name:<35} {'SKIP':<10} {'β€”':<12} β€” ({r.error})")
    else:
        diff = np.abs(r.output - reference).max()
        ok   = "PASS" if diff < 1e-2 else "FAIL"
        print(f"{r.name:<35} {ok:<10} {diff:<12.6f} {r.latency_ms:.2f} ms")

Part 5 — Exercises¶

  1. CoreML Compute Units (macOS): Convert a ResNet-18 to CoreML. Benchmark CPU_ONLY vs ALL. What speedup does the ANE provide? At what batch size does the GPU become faster than ANE?

  2. ONNX FP16 optimization: Export the model with FP16 weights using onnxconverter_common. Compare parity and latency vs FP32 on CPU EP.

  3. DirectML simulation (Windows): Install onnxruntime-directml. Run the parity matrix again. Which EP is fastest on your GPU?

  4. Vulkan device enumeration: Install the Vulkan SDK and pip install vulkan. Run the enumeration cell. How many devices does your machine expose? Is the integrated GPU also listed?

  5. llama.cpp Vulkan backend: Clone llama.cpp, build with -DGGML_VULKAN=ON, download a GGUF model (e.g. Llama-3.2-1B-Q4_K_M.gguf). Compare tokens/sec on CPU vs Vulkan GPU.

  6. Cross-backend parity: Export a model with a LayerNorm op. Compare ORT CPU vs CoreML FP16 numerically. Does FP16 cause any tokens to flip in a text classification task?

Key Takeaways¶

|                | CoreML                          | DirectML            | Vulkan                       |
|----------------|---------------------------------|---------------------|------------------------------|
| Best for       | Apple on-device                 | Windows any-GPU     | Cross-platform LLM inference |
| Entry point    | coremltools                     | ONNX Runtime DML EP | llama.cpp, vulkan bindings   |
| Precision      | FP16/FP32/INT8                  | FP32/FP16           | Quantized (Q4–Q8)            |
| Dynamic shapes | Limited (ANE)                   | Yes                 | Yes                          |
| Debugging      | Instruments / coremltools.utils | PIX / DX Capture    | RenderDoc / validation layers |

  • Always validate numerical parity when switching backends — FP16 and quantization introduce non-trivial differences

  • CoreML ANE requires static input shapes — use ct.RangeDim for bounded dynamic shapes

  • DirectML and Vulkan both skip the vendor-specific driver stack — great for portability, sometimes slower than native CUDA/ROCm

  • Vulkan's SPIR-V compute shaders are how llama.cpp achieves GPU inference on Intel, AMD, and Apple GPUs without CUDA
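The ct.RangeDim note above can be sketched as follows (a hedged example: it assumes coremltools is installed and degrades gracefully elsewhere; the 1..8 batch bounds are arbitrary illustration):

```python
# Sketch: bound the batch dimension so CoreML can plan for a small shape range
# instead of one static shape (the ANE itself still prefers fully static shapes).
flexible_input = None
try:
    import coremltools as ct
    flexible_input = ct.TensorType(
        name="image",
        shape=ct.Shape(shape=(ct.RangeDim(lower_bound=1, upper_bound=8, default=1),
                              3, 32, 32)),
    )
    print("Flexible input created:", flexible_input)
    # Pass inputs=[flexible_input] to ct.convert() in place of the static shape.
except ImportError:
    print("coremltools not installed: pip install coremltools")
```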

Previous: lab_06_framework_validation.ipynb
Next: lab_08_benchmarking.ipynb — Benchmarking: TTFT, throughput, SLOs
Back to Overview: README.md