are lucky enough to have access to a system with an Nvidia Graphics Processing Unit (GPU). Did you know there is an absurdly easy way to use your GPU’s capabilities with a Python library intended and predominantly used for machine learning (ML) applications?
Don’t worry if you’re not up to speed on the ins and outs of ML, since we won’t be using it in this article. Instead, I’ll show you how to use the PyTorch library to access and use the capabilities of your GPU. We’ll compare the run times of Python programs using the popular numerical library NumPy, running on the CPU, with equivalent code using PyTorch on the GPU.
Before continuing, let’s quickly recap what a GPU and PyTorch are.
What’s a GPU?
A GPU is a specialised electronic chip initially designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Its utility as a rapid image manipulation device was based on its ability to perform many calculations simultaneously, and it’s still used for that purpose.
However, GPUs have recently become invaluable in machine learning and in large language model training and development. Their inherent ability to perform highly parallelizable computations makes them ideal workhorses in these fields, which rely on complex mathematical models and simulations.
What’s PyTorch?
PyTorch is an open-source machine learning library developed by Facebook’s AI Research Lab (FAIR). It’s widely used for natural language processing and computer vision applications. Two of the main reasons that PyTorch can be used for GPU operations are:
- One of PyTorch’s core data structures is the Tensor. Tensors are similar to arrays and matrices in other programming languages, but are optimised for running on a GPU.
- PyTorch has CUDA support. PyTorch seamlessly integrates with CUDA, a parallel computing platform and programming model developed by NVIDIA for general computing on its GPUs. This allows PyTorch to access the GPU hardware directly, accelerating numerical computations. CUDA enables developers to use PyTorch to write software that fully utilises GPU acceleration.
In summary, PyTorch’s support for GPU operations through CUDA and its efficient tensor manipulation capabilities make it an excellent tool for developing GPU-accelerated Python functions with high computational demands.
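To make the two points above concrete, here is a minimal sketch, assuming PyTorch and NumPy are both installed. The variable names are illustrative, and the fallback to the CPU when no CUDA device is found is an assumption added so that the snippet runs on any machine.

```python
import numpy as np
import torch

# Fall back to the CPU when CUDA is unavailable (an assumption made so
# that this sketch runs anywhere, not something the article requires).
device = "cuda" if torch.cuda.is_available() else "cpu"

# A NumPy array and a tensor holding the same values.
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
ten = torch.tensor(arr, device=device)

# Elementwise arithmetic reads the same in both libraries.
arr_result = arr * 2 + 1
ten_result = ten * 2 + 1

# Bring the tensor back to CPU memory before comparing it with NumPy.
same = np.allclose(arr_result, ten_result.cpu().numpy())
print(ten_result.device, same)
```

The only GPU-specific parts are the `device` argument and the `.cpu()` call; the arithmetic itself is written exactly as it would be for NumPy.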
As we’ll show later on, you don’t have to be developing machine learning models or training large language models to make use of PyTorch.
In the rest of this article, we’ll set up our development environment, install PyTorch and run through a few examples where we’ll compare some computationally heavy PyTorch implementations with the equivalent NumPy implementation and see what, if any, performance differences we find.
Pre-requisites
You need an Nvidia GPU on your system. To check your GPU, issue the following command at your system prompt. I’m using the Windows Subsystem for Linux (WSL).
$ nvidia-smi
>>
(base) PS C:\Users\thoma> nvidia-smi
Fri Mar 22 11:41:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti WDDM | 00000000:01:00.0 On | N/A |
| 32% 24C P8 9W / 285W | 843MiB / 12282MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1268 C+G ...tilityHPSystemEventUtilityHost.exe N/A |
| 0 N/A N/A 2204 C+G ...ekyb3d8bbwePhoneExperienceHost.exe N/A |
| 0 N/A N/A 3904 C+G ...calMicrosoftOneDriveOneDrive.exe N/A |
| 0 N/A N/A 7068 C+G ...CBS_cw5n
etc ..
If that command isn’t recognised and you’re sure you have a GPU, it probably means you’re missing an NVIDIA driver. Just follow the rest of the instructions in this article, and it should be installed as part of that process.
While PyTorch installation packages can include CUDA libraries, your system must still have the appropriate NVIDIA GPU drivers installed. These drivers are necessary for your operating system to communicate with the graphics processing unit (GPU) hardware. The CUDA toolkit includes drivers, but if you’re using PyTorch’s bundled CUDA, you only need to ensure that your GPU drivers are current.
Click this link to go to the NVIDIA website and install the latest drivers compatible with your system and GPU specifications.
Setting up our development environment
As a best practice, we should set up a separate development environment for each project. I use conda, but use whatever method suits you.
If you want to go down the conda route and don’t already have it, you must install Miniconda (recommended) or Anaconda first.
Please note that, at the time of writing, PyTorch only officially supports Python versions 3.8 to 3.11.
#create our test environment
(base) $ conda create -n pytorch_test python=3.11 -y
Now activate your new environment.
(base) $ conda activate pytorch_test
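As an optional sanity check, the short sketch below confirms that the interpreter in the newly activated environment falls inside the version range quoted above. The function name and the two constants are hypothetical helpers, and the range itself is simply the one stated in this article at the time of writing, so it may well have moved on.

```python
import sys

# Range quoted in this article at the time of writing (an assumption
# here; check the PyTorch website for the currently supported range).
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 11)

def python_version_supported(version=None):
    """Return True if (major, minor) lies within the quoted range."""
    if version is None:
        version = sys.version_info[:2]
    return MIN_VERSION <= tuple(version[:2]) <= MAX_VERSION

print(sys.version.split()[0], python_version_supported())
```

Run it with the `python` executable of the activated environment; a `False` result suggests recreating the environment with a different `python=` value.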
We now need to get the appropriate conda installation command for PyTorch. This will depend on your operating system, chosen programming language, preferred package manager, and CUDA version.
Luckily, PyTorch provides a useful web interface that makes this easy to set up. So, to get started, head over to the PyTorch website at…
Click on the Get Started link near the top of the screen. From there, scroll down a little until you see this,

Click on each box in the appropriate position for your system and specs. As you do, you’ll see that the command in the Run this Command output field changes dynamically. When you’re done making your choices, copy the final command text shown and type it into your command window prompt.
For me, this was:
(pytorch_test) $ conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
We’ll install Jupyter, Pandas, and Matplotlib to enable us to run our Python code in a notebook with our example code.
(pytorch_test) $ conda install pandas matplotlib jupyter -y
Now type jupyter notebook into your command prompt. You should see a Jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command.
Near the bottom, there will be a URL that you should copy and paste into your browser to open the Jupyter Notebook.
Your URL will be different to mine, but it should look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da
Testing our setup
The first thing we’ll do is test our setup. Please enter the following into a Jupyter cell and run it.
import torch
x = torch.rand(5, 3)
print(x)
You should see a similar output to the following.
tensor([[0.3715, 0.5503, 0.5783],
[0.8638, 0.5206, 0.8439],
[0.4664, 0.0557, 0.6280],
[0.5704, 0.0322, 0.6053],
[0.3416, 0.4090, 0.6366]])
Additionally, to check if your GPU driver and CUDA are enabled and accessible by PyTorch, run the following commands:
import torch
torch.cuda.is_available()
This should output True if all is OK.
If everything is okay, we can proceed to our examples. If not, go back and check your installation processes.
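If the check comes back False, a slightly fuller report can help narrow down which part of the installation is at fault. The following is a sketch only: `describe_setup` is a hypothetical helper name, and it deliberately guards the import so that it still runs when PyTorch is missing altogether.

```python
import importlib.util

def describe_setup():
    """Collect basic facts about the PyTorch/CUDA installation."""
    info = {"torch_installed": False, "torch_version": None,
            "cuda_build": None, "cuda_available": False, "gpu_name": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not importable in this environment
    import torch
    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds)
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
    return info

for key, value in describe_setup().items():
    print(f"{key}: {value}")
```

As a rough guide, a `cuda_build` of None points to a CPU-only PyTorch package, whereas a populated `cuda_build` alongside `cuda_available: False` more often points to a driver problem.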
NB In the timings below, I ran each of the NumPy and PyTorch processes several times in succession and took the best time for each. This does favour the PyTorch runs somewhat as there is a small overhead on the very first invocation of each PyTorch run but, overall, I think it’s a fairer comparison.
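The "run several times and keep the best" approach can be sketched with a small helper like the one below. This is not the code used to produce the timings in this article; `best_time` and its `sync` parameter are hypothetical names, and the workload in the usage line is a CPU-only stand-in.

```python
from timeit import default_timer as timer

def best_time(func, repeats=5, sync=None):
    """Run func() `repeats` times and return the fastest wall-clock time.

    `sync` is an optional callable invoked before the clock is stopped,
    for example torch.cuda.synchronize when func launches GPU work.
    """
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        func()
        if sync is not None:
            sync()
        best = min(best, timer() - start)
    return best

# Usage with a CPU-only stand-in workload
elapsed = best_time(lambda: sum(range(100_000)), repeats=3)
print(f"best of 3: {elapsed:.6f} s")
```

Taking the minimum rather than the mean discards one-off costs such as the first-call overhead mentioned above, which is why it tends to flatter whichever library has the larger start-up cost.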
Example 1: A simple array math operation.
In this example, we set up two large, identical one-dimensional arrays and perform a simple addition to each array element.
import numpy as np
import torch as pt
from timeit import default_timer as timer
# func1 will run on the CPU
def func1(a):
    a += 1
# func2 will run on the GPU
def func2(a):
    a += 2
if __name__ == "__main__":
    n1 = 300000000
    a1 = np.ones(n1, dtype=np.float64)
    # same number of elements as the NumPy array
    n2 = 300000000
    a2 = pt.ones(n2, dtype=pt.float64)
    start = timer()
    func1(a1)
    print("Timing with CPU:numpy", timer()-start)
    start = timer()
    func2(a2)
    # wait for all calcs on the GPU to complete
    pt.cuda.synchronize()
    print("Timing with GPU:pytorch", timer()-start)
    print()
    print("a1 = ",a1)
    print("a2 = ",a2)
Timing with CPU:numpy 0.1334826999955112
Timing with GPU:pytorch 0.10177790001034737
a1 = [2. 2. 2. ... 2. 2. 2.]
a2 = tensor([3., 3., 3., ..., 3., 3., 3.], dtype=torch.float64)
We see a slight improvement when using PyTorch over NumPy, but we missed one crucial point. We haven’t used the GPU because our PyTorch tensor data is still in CPU memory.
To move the data to the GPU memory, we need to add the device='cuda' directive when creating the tensor. Let’s do that and see if it makes a difference.
# Same code as above except
# to get the array data onto the GPU memory
# we changed
a2 = pt.ones(n2,dtype=pt.float64)
# to
a2 = pt.ones(n2,dtype=pt.float64,device='cuda')
After re-running with the changes we get,
Timing with CPU:numpy 0.12852740001108032
Timing with GPU:pytorch 0.011292399998637848
a1 = [2. 2. 2. ... 2. 2. 2.]
a2 = tensor([3., 3., 3., ..., 3., 3., 3.], device='cuda:0', dtype=torch.float64)
That’s more like it, a greater than 10x speed up.
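Because it is easy to leave a tensor in CPU memory by accident, as happened in the first run, it can help to check where the data actually lives. The sketch below shows two common ways of placing a tensor on a device and how to inspect the result. The small sizes and the CPU fallback are assumptions made so that the snippet runs on any machine.

```python
import torch

# Use the GPU when present, otherwise fall back to the CPU (assumption).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Option 1: allocate directly on the target device.
a = torch.ones(1_000, dtype=torch.float64, device=device)

# Option 2: allocate in CPU memory, then copy across with .to().
b = torch.ones(1_000, dtype=torch.float64).to(device)

# Every tensor records where its data lives.
print(a.device, b.device, a.is_cuda)

# The operands of an operation are expected to live on the same device.
total = (a + b).sum().item()
print(total)
```

Option 1 avoids the intermediate CPU allocation and the host-to-device copy, which matters for arrays as large as the ones used in this example.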
Example 2: A slightly more complex array operation.
For this example, we’ll multiply multi-dimensional matrices using the built-in matmul operations available in the PyTorch and NumPy libraries. Each array will be 10000 x 10000 and contain random floating-point numbers between 1 and 100.
# NUMPY first
import numpy as np
from timeit import default_timer as timer
# Set the seed for reproducibility
np.random.seed(0)
# Generate two 10000x10000 arrays of random floating point numbers between 1 and 100
A = np.random.uniform(low=1.0, high=100.0, size=(10000, 10000)).astype(np.float32)
B = np.random.uniform(low=1.0, high=100.0, size=(10000, 10000)).astype(np.float32)
# Perform matrix multiplication
start = timer()
C = np.matmul(A, B)
# Due to the large size of the matrices, it's not practical to print them entirely.
# Instead, we print a small portion to verify.
print("A small portion of the result matrix:\n", C[:5, :5])
print("Without GPU:", timer()-start)
A small portion of the result matrix:
[[25461280. 25168352. 25212526. 25303304. 25277884.]
[25114760. 25197558. 25340074. 25341850. 25373122.]
[25381820. 25326522. 25438612. 25596932. 25538602.]
[25317282. 25223540. 25272242. 25551428. 25467986.]
[25327290. 25527838. 25499606. 25657218. 25527856.]]
Without GPU: 1.4450852000009036
Now for the PyTorch version.
import torch
from timeit import default_timer as timer
# Set the seed for reproducibility
torch.manual_seed(0)
# Use the GPU
device = 'cuda'
# Generate two 10000x10000 tensors of random floating point
# numbers between 1 and 100 and move them to the GPU
#
A = torch.FloatTensor(10000, 10000).uniform_(1, 100).to(device)
B = torch.FloatTensor(10000, 10000).uniform_(1, 100).to(device)
# Perform matrix multiplication
start = timer()
C = torch.matmul(A, B)
# Wait for all current GPU operations to complete (synchronize)
torch.cuda.synchronize()
# Due to the large size of the matrices, it's not practical to print them entirely.
# Instead, we print a small portion to verify.
print("A small portion of the result matrix:\n", C[:5, :5])
print("With GPU:", timer() - start)
A small portion of the result matrix:
[[25145748. 25495480. 25376196. 25446946. 25646938.]
[25357524. 25678558. 25675806. 25459324. 25619908.]
[25533988. 25632858. 25657696. 25616978. 25901294.]
[25159630. 25230138. 25450480. 25221246. 25589418.]
[24800246. 25145700. 25103040. 25012414. 25465890.]]
With GPU: 0.07081239999388345
The PyTorch run was about 20 times faster than the NumPy run this time. Great stuff.
Example 3: Combining CPU and GPU code.
Sometimes, not all of your processing can be done on a GPU. An everyday use case for this is graphing data. Sure, you can manipulate your data using the GPU, but often the next step is to see what your final dataset looks like using a plot.
You can’t plot data if it resides in the GPU memory, so you must move it back to CPU memory before calling your plotting functions. Is it worth the overhead of moving large chunks of data from the GPU to the CPU? Let’s find out.
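The round trip itself can be sketched in isolation before it is buried inside a larger example. The snippet below times only the copy back to host memory; the tensor size is illustrative and much smaller than the one used in this example, and the CPU fallback is an assumption so that the code runs without a GPU (in which case no real transfer takes place).

```python
from timeit import default_timer as timer
import torch

# CPU fallback is an assumption so the sketch runs on any machine.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A modest tensor standing in for computed results (size is illustrative).
x = torch.linspace(0, 1, 1_000_000, device=device)

if device == "cuda":
    # Finish any pending GPU work so that only the copy is timed.
    torch.cuda.synchronize()

start = timer()
x_np = x.cpu().numpy()  # copy to host memory, then expose as a NumPy array
transfer_seconds = timer() - start

print(type(x_np).__name__, x_np.shape, f"{transfer_seconds:.4f} s")
```

Once the data is a NumPy array, it can be passed to plotting functions in the usual way.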
In this example, we will solve the polar equation r = 1 + (3/4)·sin(3θ) for values of θ between 0 and 2π in (x, y) coordinate terms and then plot out the resulting graph.

Don’t get too hung up on the math. It’s just an equation that, when converted to use the x, y coordinate system and solved, looks nice when plotted.
For even a few million values of x and y, NumPy can solve this in milliseconds, so to make it a bit more interesting, we’ll use 100 million (x, y) coordinates.
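As a quick hand check of the conversion, the sketch below evaluates a few individual angles using only Python's math module. It assumes the polar formula r = 1 + 3/4 · sin(3θ), read off from the code in this example, together with the standard conversion x = r·cos(θ), y = r·sin(θ). The function name `polar_to_xy` is a hypothetical helper.

```python
import math

def polar_to_xy(theta):
    """Convert one angle on r = 1 + 3/4 * sin(3*theta) to (x, y)."""
    r = 1 + 3 / 4 * math.sin(3 * theta)
    return r * math.cos(theta), r * math.sin(theta)

# A few angles that are easy to verify by hand
for theta in (0.0, math.pi / 6, math.pi / 2):
    x, y = polar_to_xy(theta)
    print(f"theta={theta:.4f}  x={x:.4f}  y={y:.4f}")
```

At θ = 0 the sine term vanishes, so r = 1 and the point sits at (1, 0). At θ = π/2 the sine term is -1, so r shrinks to 0.25 and the point lies on the y axis.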
Here is the NumPy code first.
%%time
import numpy as np
import matplotlib.pyplot as plt
from time import time as timer
start = timer()
# create an array of 100M thetas between 0 and 2pi
theta = np.linspace(0, 2*np.pi, 100000000)
# our original polar formula
r = 1 + 3/4 * np.sin(3*theta)
# calculate the equivalent x and y coordinates
# for each theta
x = r * np.cos(theta)
y = r * np.sin(theta)
# see how long the calc part took
print("Finished with calcs ", timer()-start)
# Now plot out the data
start = timer()
plt.plot(x,y)
# see how long the plotting part took
print("Finished with plot ", timer()-start)
Here is the output. Would you have guessed beforehand that it would look like this? I sure wouldn’t have!

Now, let’s see what the equivalent PyTorch implementation looks like and how much of a speed-up we get.
%%time
import torch as pt
import matplotlib.pyplot as plt
from time import time as timer
# Make sure that PyTorch is using the GPU
device = 'cuda'
# Start the timer
start = timer()
# Creating the theta tensor on the GPU
theta = pt.linspace(0, 2 * pt.pi, 100000000, device=device)
# Calculating r, x, and y using PyTorch operations on the GPU
r = 1 + 3/4 * pt.sin(3 * theta)
x = r * pt.cos(theta)
y = r * pt.sin(theta)
# Moving the result back to CPU for plotting
x_cpu = x.cpu().numpy()
y_cpu = y.cpu().numpy()
pt.cuda.synchronize()
print("Finished with calcs", timer() - start)
# Plotting
start = timer()
plt.plot(x_cpu, y_cpu)
plt.show()
print("Finished with plot", timer() - start)
And our output again.

The calculation part was about 10 times faster than the NumPy calculation. The data plotting took around the same time using both the PyTorch and NumPy versions, which was expected since the data was back in CPU memory by then, and the GPU played no further part in the processing.
But, overall, we shaved about 40% off the total run-time, which is excellent.
Summary
This article has demonstrated how to leverage an NVIDIA GPU using PyTorch, a machine learning library typically used for AI applications, to accelerate non-ML numerical Python code. It compares standard NumPy (CPU-based) implementations with GPU-accelerated PyTorch equivalents to show the performance benefits of running tensor-based operations on a GPU.
You don’t need to be doing machine learning to benefit from PyTorch. If you can access an NVIDIA GPU, PyTorch provides a simple and effective way to significantly speed up computationally intensive numerical operations, even in general-purpose Python code.