Basic Multi-GPU Computation in TensorFlow

Credits: Forked from TensorFlow-Examples by Aymeric Damien

Training deep neural networks is computationally expensive, and distributing work across multiple GPUs can dramatically reduce training time. TensorFlow’s tf.device() context manager lets you explicitly place operations on specific hardware devices – CPU or any available GPU. This notebook demonstrates the fundamental pattern: split independent computations across GPUs, then combine results on the CPU.

The example computes A^n + B^n for large 10,000 x 10,000 matrices, first using a single GPU and then splitting the two matrix power operations across two GPUs. The timing comparison shows the speedup from parallelism. This same principle scales to data-parallel training of neural networks, where each GPU processes a different mini-batch and gradients are averaged before updating weights.
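The gradient-averaging step mentioned above can be sketched in a few lines of NumPy. This is an illustration only: the gradient values, learning rate, and variable names below are made up for the sketch, not taken from the example.

```python
import numpy as np

# Hypothetical per-GPU gradients for one weight tensor: in data-parallel
# training, each replica computes gradients on its own mini-batch.
grad_gpu0 = np.array([0.2, -0.4, 0.6], dtype=np.float32)
grad_gpu1 = np.array([0.4, -0.2, 0.2], dtype=np.float32)

# Average the replica gradients, then take one SGD step with the result.
avg_grad = (grad_gpu0 + grad_gpu1) / 2.0
weights = np.array([1.0, 1.0, 1.0], dtype=np.float32)
lr = 0.1
weights -= lr * avg_grad
print(avg_grad)  # [ 0.3 -0.3  0.4]
```

The averaging makes the update equivalent to a single step over the combined mini-batch, which is why the replicas stay in sync.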

Setup: Refer to the setup instructions

This tutorial requires a machine with two GPUs.

  • "/cpu:0": The CPU of your machine.

  • "/gpu:0": The first GPU of your machine.

  • "/gpu:1": The second GPU of your machine.

  • For this example, we are using two GTX 980s.

import numpy as np
import tensorflow as tf
import datetime
# Log which device each operation is placed on
log_device_placement = True

# Matrix power to compute
n = 10
# Example: compute A^n + B^n on 2 GPUs

# Create random large matrix
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')
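A quick back-of-the-envelope check on why these matrices are a meaningful GPU workload: each one holds 10,000 × 10,000 float32 elements at 4 bytes apiece, so every `tf.constant` copied onto a device costs roughly 400 MB.

```python
import numpy as np

# Each 10,000 x 10,000 float32 matrix holds 1e8 elements at 4 bytes each.
# (A small stand-in array is used here just to read the float32 itemsize.)
A_small = np.random.rand(4, 4).astype('float32')
bytes_per_matrix = 10000 * 10000 * A_small.itemsize
print(bytes_per_matrix)  # 400000000 bytes, i.e. ~400 MB per matrix
```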

# Creates a graph to store results
c1 = []
c2 = []

# Define matrix power: M^n via repeated tf.matmul
def matpow(M, n):
    if n <= 1:  # base case: M^1 is M itself
        return M
    else:
        return tf.matmul(M, matpow(M, n - 1))
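To sanity-check the recursion, here is a NumPy sketch of the same idea, with `np.matmul` standing in for `tf.matmul` and the base case returning `M` at `n <= 1` so the result is exactly M^n. The small diagonal matrix makes the answer easy to verify by hand.

```python
import numpy as np

# NumPy analogue of the recursive matrix power above.
def np_matpow(M, n):
    if n <= 1:  # base case: M^1 is M itself
        return M
    return np.matmul(M, np_matpow(M, n - 1))

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(np_matpow(M, 3))  # diagonal entries 2**3 = 8 and 3**3 = 27
```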
# Single GPU computing

with tf.device('/gpu:0'):
    a = tf.constant(A)
    b = tf.constant(B)
    #compute A^n and B^n and store results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

with tf.device('/cpu:0'):
    sum_op = tf.add_n(c1)  # add all elements of c1, i.e. A^n + B^n

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Run the op.
    sess.run(sum_op)
t2_1 = datetime.datetime.now()
# Multi GPU computing
# GPU:0 computes A^n
with tf.device('/gpu:0'):
    #compute A^n and store result in c2
    a = tf.constant(A)
    c2.append(matpow(a, n))

#GPU:1 computes B^n
with tf.device('/gpu:1'):
    #compute B^n and store result in c2
    b = tf.constant(B)
    c2.append(matpow(b, n))

with tf.device('/cpu:0'):
    sum_op = tf.add_n(c2)  # add all elements of c2, i.e. A^n + B^n

t1_2 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Run the op.
    sess.run(sum_op)
t2_2 = datetime.datetime.now()
print("Single GPU computation time: " + str(t2_1 - t1_1))
print("Multi GPU computation time: " + str(t2_2 - t1_2))
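To turn the two printed durations into a speedup factor, the `timedelta` objects can be divided directly (Python 3.2+). The timings below are illustrative placeholders, not measured results from this example.

```python
import datetime

# Illustrative-only timings showing how to compare the two measurements:
# dividing one timedelta by another yields the speedup factor as a float.
t_single = datetime.timedelta(seconds=20)
t_multi = datetime.timedelta(seconds=11)
speedup = t_single / t_multi
print("Speedup: %.2fx" % speedup)  # Speedup: 1.82x
```

Note that the speedup is typically below 2x even with two GPUs, since the final addition runs on the CPU and the host-device transfers are not parallelized.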