
Imagine this: you wrote a script to process some data on your laptop, went out for coffee, and when you return fifteen minutes later, barely 10% of the script has completed.
Why is the script running so slowly? Which part is slowing it down? Is it reading the data, processing it, or saving it? How can you speed up execution? Is the script really that slow?
A tool called a profiler can help answer all these questions.
What is a profiler?
A profiler is a tool that runs code and collects information about the time required to call each function, the number of calls, and the hierarchy of function calls.
By analyzing the tool's output, you can understand which part of the code takes the most time to execute (called a "bottleneck") and perhaps even figure out how to solve the problem. Finding and eliminating bottlenecks significantly improves program performance.
Example of a problem
Let's say we have a large text file and we need to find multiple occurrences of a specific pattern. First, let's generate a large file of random alphanumeric characters (line by line):
import random
import string
def generate_random_string(length):
"""Generate a random string of lowercase letters and digits."""
letters_and_digits = string.ascii_letters + string.digits
return ''.join(random.choice(letters_and_digits) for _ in range(length))
def generate_random_file(filename, num_lines, line_length):
"""Generate a file of random lines with the given number and length."""
with open(filename, 'w') as f:
for i in range(num_lines):
random_line = generate_random_string(line_length) + '\n'
f.write(random_line)
if __name__ == '__main__':
# Генерируем 1000000 строк по 1000 символов каждая
generate_random_file('random.txt', 1_000_000, 1000)
Now let's define our initial function to work from - it reads the file line by line, and then counts the number of occurrences of the words boband rob, preceded by a number:
import re
def baseline():
num_total_matches = 0
pattern1 = r"[0-9]{1}bob"
pattern2 = r"[0-9]{1}rob"
with open('random.txt', 'rt') as f:
for line in f:
line = line.lower() # преобразуем весь текст в нижний регистр
for pattern in [pattern1, pattern2]:
# Находим все вхождения паттерна в строке
matches = re.findall(pattern, line)
# Считаем количество совпадений
num_matches = len(matches)
num_total_matches += num_matches
return num_total_matches
For example, abc1robdef02bobthere are two occurrences of [in] a string.
We execute the function baseline, calculate the number of occurrences (in my case, 10861), and measure the execution time—on my machine, it was 32 seconds.
How can I speed up the code?
Potential targets for improvement
There are four parts of the code that could potentially slow it down. They are:
- Using raw strings instead of compiled regular expression objects .
- Using two separate search operations instead of one combined regular expression.
- Convert each string to lowercase instead of using the case-insensitivity flag .
- Reading a file line by line, rather than in large blocks.
1) Slow raw patterns? 2) Slow I/O? 3) Slow lowercase conversion? 4) Slow double lookups?
How can I tell which of these is the bottleneck? Use a profiler.
Python profilers
You can rejoice – you don't have to write anything yourself. Python already has two built-in profiling modules – cProfileand profile. They perform the same task, but one cProfileis written in C, while the profileother is written in pure Python, but in our case, it makes no difference. We can use them as is, but we'll make our lives much easier if we leverage a few external tools.
We need a quick and convenient way to profile a section of code (for example, a function) and save the results to a file. The module profilehooksprovides a simple decorator that we can use to wrap the function we're interested in:
from profilehooks import profile
# stdout=False -> не выводить ничего в терминал
# filename -> путь к файлу с результатами профилирования
@profile(stdout=False, filename='baseline.prof')
def baseline():
...
It can be installed with a simple command pip install profilehooks.
And we need to visualize this file in a human-readable way. For this, I use two tools.
▍ SnakeViz — simple and fast
Snakeviz is a browser-based interactive visualizer for Python profiling results. It's incredibly easy to install ( pip install snakeviz) and use ( snakeviz <path-to-profiling-output>). Let's take a look at the profiling results for our function baseline.
Snakeviz's interactive results are represented by icicle (left) and sunburst (right, less readable) graphs. You can hover and click on any function to view details. From top to bottom, the nested call hierarchy is shown, and the line length represents the relative time spent by the code executing them.
Snakeviz also displays an interactive table with function execution time statistics.
We see that the largest amount of execution time is spent inside the functionfindall, that is, performing regular expression searches. This means that if we want to speed up our code, we should focus on speeding up this function, as it is the bottleneck, not other parts of the code.
Let's check the results in the second tool.
▍ gprof2dot — readable and flexible
Gprof2dot provides a more readable flowchart visualization and saves to image files, making it easy to share results (and, if necessary, automate them). However, it is not interactive and requires Graphviz to be installed on the system .
To install gprof2dot, simply run the command pip install gprof2dot.
To generate an output image with the profiling results, use the following command:
python -m gprof2dot -f pstats <profiling-results-file> | dot -Tpng -o output.png
First, we represent the function call hierarchy as a graph in the format dot, and then generate an image—a visualization of this graph. The command dotsupports various output formats, including , .jpgand .svgthe results gprof2dotare also highly customizable.
Let's take a look at what they look like.
A graph representation of the function execution hierarchy in our code. The percentage shows the total amount of execution time spent within a function, as well as (%), but only the function's own code is included. The latter value is the number of function calls in the code.
Now the picture looks much more readable; the graph shows that regular expression searches account for 88% of the total execution time, so we need to make them faster.
If you don't have access to Graphviz (commanddot), you can use the corresponding Python package for Graphviz (pip install graphviz), as well as a simple Python script to generate the results.
- We will save the results gprof2dotto a file .dot:
python -m gprof2dot -f pstats file.prof > file.dot
- Let's generate an image from this file .dotusing the following code:
import graphviz
def make_png(input_file_name, output_file_name):
dot = graphviz.Source.from_file(input_file_name)
dot.render(outfile=output_file_name)
if __name__ == '__main__':
make_png('file.dot', 'file.svg') # также поддерживает .png, .jpg и так далее.
Code improvements
Let's speed up the code by combining two regular expressions into one:
def single_pattern():
num_total_matches = 0
pattern = r"[0-9][rb]ob" # Теперь у нас только одна операция поиска вместо двух
with open('random.txt', 'rt') as f:
for line in tqdm(f, total=1_000_000):
line = line.lower()
# Находим все вхождения паттерна в строке
matches = re.findall(pattern, line)
# Считаем количество совпадений
num_matches = len(matches)
num_total_matches += num_matches
return num_total_matches
Let's run it and get 15 seconds - the speed has increased approximately twofold!
Profiling results for the new code. Note that the % of time spent by findall has decreased (88% -> 81.5%).
Profiling the new code confirmed that we made the right choice – the percentage of time spent by the main functionfindallhas decreased, while for all other functions it has increased (also approximately twofold).
If we implement all three other ideas instead of focusing on the bottleneck, the execution time will still remain 32 seconds, the same as in the original code! We would have wasted so much time and effort without achieving any improvement! This is why it's important to focus on bottlenecks.
def wasted_efforts():
num_total_matches = 0
# Используем скомпилированные регулярные выражения
pattern1 = re.compile(r"[0-9]{1}rob", flags=re.IGNORECASE)
# Также используем нечувствительность к регистру вместо изменения строк
pattern2 = re.compile(r"[0-9]{1}bob", flags=re.IGNORECASE)
with open('random.txt', 'rt') as f:
# Загружаем строки в блок и обрабатываем их за один проход
chunk = []
for line in tqdm(f, total=1_000_000):
chunk.append(line)
if len(chunk) == 1000:
chunk_str = ''.join(chunk)
for pattern in [pattern1, pattern2]:
# Находим все вхождения паттерна в строке
matches = re.findall(pattern, chunk_str)
# Считаем количество совпадений
num_matches = len(matches)
num_total_matches += num_matches
chunk = []
# Но несмотря на наши усилия, код всё равно такой же медленный, как и оригинальный
return num_total_matches
▍ Multiprocessing
The function findallstill remains the bottleneck. However, in our contrived example, there's only one simple thing we can do to further improve it: parallelize the code.
Matches in different strings are independent of each other, so we can search them in parallel. How do we do this?
The easiest way is to create a Process Pool —an object that manages multiple parallel Python processes running simultaneously. If we have a list of values and a function to apply to each of them, we can simply call the pool method map, and it will execute the function in parallel for all values.
from multiprocessing import Pool
def calc_num_matches(string):
# Создаём отдельную функцию для объекта Pool
# Она будет параллельно применяться к каждой строки в файле
pattern = r"[0-9]{1}[rb]ob"
return len(re.findall(pattern, string))
def chunks_single_pool():
num_total_matches = 0
pool = Pool(8) # количество процессов должно быть <= количества ядер вашего CPU
with open('random.txt', 'rt') as f:
chunk = []
# Загружаем строки в список
for line in tqdm(f, total=1_000_000):
line = line.lower()
chunk.append(line)
if len(chunk) == 1000:
# Затем мы параллельно применяем к каждой строке функцию `calc_num_matches`
num_ind_matches = pool.map(calc_num_matches, chunk)
# И складываем все независимые совпадения
num_matches = sum(num_ind_matches)
num_total_matches += num_matches
chunk = []
return num_total_matches
Let's see... 6.1 seconds! That's a new record. 2.2 times faster than the optimized solution, and about 5 times faster than the original code! Let's take a look at the profiling results:
Multiprocessor code profiling results. Note that the share of regex searches continues to decrease (green blocks).
Snakeviz shows the same picture under multiprocessing.
Both results confirm that regex search now takes about 2/3 of the total time - much less than in the original version.
Subtleties
▍ Multiprocessing
An important feature is that the profiler no longer knows what's happening in the subprocesses performing the regex search. It simply shows that 4 seconds were spent executing them, but it has no access to the code executed in the subprocesses, because they are essentially separate programs, loosely coupled to our main Python process.
If you want to profile what's happening in your program's subprocess, you need to place a decorator inside the function/code executed by the subprocess @profile.
▍ Multithreading
We typically don't need to write multithreaded code directly in Python because its use is limited by the Global Interpreter Lock (GIL) , which allows only one thread of Python code to execute at a time (in a single Python process).
However, if you work with many non-Python libraries (e.g., NumPy, PyTorch, scipy) or with heavy I/O (e.g., web communications), most of your time will be spent outside the Python interpreter executing code written in C, C++, Fortran, and so on.
In these two cases, using multiple threads can be practical, so keep in mind that both Python's built-in profiling modules— profile and cProfile —profile only the main application thread. If you want to profile code executed in other threads, you can try running the profiler in the function the threads will execute, or use third-party profiling tools like Yappi or VizTracer .
GPU computing
If you're using a GPU for computation, keep in mind that it's a separate device in the system, operating asynchronously with the CPU. Therefore, when executing code on the GPU, you need to be very careful when measuring execution time, as your Python code doesn't know what's happening on the GPU. It can only command something to be done and wait for the GPU to complete it.
Let's consider the following code:
import torch
from profilehooks import profile
@profile(filename='gpu_test.prof', stdout=False)
def compute_big_sum(tensor: torch.Tensor):
# выполняем много затратных вычислений, результатом которых становится одно число
a = tensor.sum()
b = tensor.pow(2).sum()
c = tensor.sqrt().sum()
sum_all = a + b + c
sum_all_cpu = sum_all.cpu() # передаём значение в память CPU
# возвращаем сумму всех операций (скалярную)
return sum_all_cpu
if __name__ == '__main__':
# создаём большую матрицу случайных чисел, уже находящуюся в памяти GPU
X = torch.rand((10000, 10000), dtype=torch.float64, device='cuda:0')
res = compute_big_sum(X)
print(float(res))
We create a large 10000x10000 matrix on PyTorch (on the GPU) and then profile a function that computes the following:
- The sum of all elements.
- The sum of the squares of all elements.
- The sum of the square roots of all elements.
- The sum of all previous sums.
This function returns a single number as output.
Which line of code do you think will take the longest?
Profiling results for the compute_big_sum
function. Surprisingly, this is the method.cpu()that transfers data from GPU memory to CPU memory! But that doesn't make any sense. We're only transferring 8 bytes of data! Is memory transfer really that slow?
No.
Since PyTorch internally runs CUDA calculations on the GPU, which is a separate device from the CPU, when using a function like.sum()ortorch.pow(), Python simply tells the GPU to start the calculations, but doesn't wait for them to complete! That is, the Python interpreter starts the calculations and immediately moves on to the next line of code.
The wait occurs in the method.cpu()that transfers data from GPU memory to CPU memory. It has to wait until the results of all previous calculations are available in GPU memory, meaning that all the functions we applied to our large matrix are finished executing.
But how do we profile this?
- Use the built-in PyTorch profiler , which has support for profiling operations across multiple devices.
- If you don't care how long each individual GPU operation takes (or just one), you can add a timeout to your code right after all GPU computations start torch.cuda.synchronize(), which will force PyTorch to wait until all GPU operations are completed.
Let's see what the profiling results would look like with this line:
Profiling results when using torch.cuda.synchronize(). It now .cpu()takes less than 1% of the total execution time, and almost all of the execution time is spent waiting for the GPU to complete its calculations, as expected.
The code for our modified functioncompute_big_sumwill look like this:
def compute_big_sum(tensor: torch.Tensor):
a = tensor.sum()
b = tensor.pow(2).sum()
c = tensor.sqrt().sum()
sum_all = a + b + c
torch.cuda.synchronize() # добавляем это после всех операций с GPU
# чтобы обеспечить корректность измерений времени
sum_all_cpu = sum_all.cpu()
return sum_all_cpu