Multiprocessing#
multiprocessing is a Python package that allows spawning processes from Python. Check the official documentation.
import time
import random
import numpy as np
import multiprocessing
from multiprocessing import Process
Single process#
To create a process from scratch, use the multiprocessing.Process
class.
The
start()
method starts the execution; the
join()
method is used to synchronise the parent process with the child processes: it allows the parent process to wait for the child processes to finish before continuing.
The following example runs the same function in two different processes. The first process performs twice as many iterations as the second.
def count(N: int, process_name: str):
    st_time = time.time()
    for i in range(N):
        ((i + 10) / 25) ** (1 / 2)
    en_time = time.time()
    print(f"{process_name} is finished {en_time - st_time}")
iterations = 10**8
p1 = Process(target=count, args=(iterations, "first"))
p2 = Process(target=count, args=(iterations // 2, "second"))
p1.start()
p2.start()
print("Processes were started")
p1.join()
p2.join()
print("Processes were joined")
Processes were started
second is finished 2.0598533153533936
first is finished 4.119459390640259
Processes were joined
So, although we started the first
process earlier, it finished later, confirming that we achieved parallel computation.
print("Processes were started")
was executed immediately after the processes were started, but print("Processes were joined")
was executed only when both processes were finished - this shows that the main process was blocked by the join
method of the child processes.
Pool#
multiprocessing.Pool
is more commonly used.
The following cell defines a function that creates an array of 1,000,000 random floats and then calculates the minimum, maximum and average values over them. In the next cells we will run it multiple times: first in a simple loop and then in multiple processes.
from multiprocessing import Pool
def gen_random(_):
    my_array = [random.random() for _ in range(1_000_000)]
    return (min(my_array), max(my_array), np.mean(my_array))
First, the classical option: a simple loop that calls the function 10 times.
%%timeit
[gen_random(None) for i in range(10)]
1.08 s ± 54.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now let's run it in a pool of 10 processes.
%%timeit
pool = Pool(processes=10)
results = pool.map(gen_random, [None] * 10)
pool.close()
pool.join()
511 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The speedup is obvious.
But let's make sure that the solution via multiprocessing.Pool
leads to the expected results.
pool = Pool(processes=10)
results = pool.map(gen_random, [None] * 10)
results
[(4.888834341798542e-07, 0.9999993695614751, 0.500253121167094),
(1.5193695950266317e-06, 0.9999995713653611, 0.5004648527204844),
(8.338447909927993e-08, 0.9999982350763111, 0.49987126646962265),
(3.533316917048168e-06, 0.9999988394522739, 0.5003074522538739),
(6.854293412850154e-07, 0.9999992469319836, 0.5001294510917905),
(7.25034506876554e-08, 0.9999993853360523, 0.5000197205058422),
(2.1432668373400077e-07, 0.9999985629946924, 0.500037866675235),
(3.241340156279193e-07, 0.9999995020233481, 0.4998192310331728),
(3.97997433898567e-08, 0.9999970648221497, 0.49967546355687814),
(1.1574031938410556e-06, 0.999999502338194, 0.4999944256310727)]
Start methods#
Processes can be started in different ways - there are 3 start methods: spawn
, fork
and forkserver
. In practice, what matters is what the child process inherits from the parent. For more details, check the official documentation.
The following cell defines a Python module that prints the file descriptors created by the process; on Linux, these generally represent the resources held by the process.
Note: it must be defined in a separate module because, for example, the spawn
method must be able to import it.
%%writefile multiprocessing_files/show_resources.py
import os

pid = os.getpid()

def show_resources():
    pid = os.getpid()
    print(os.listdir(f"/proc/{pid}/fd"))
Writing multiprocessing_files/show_resources.py
The resources available in the main process are listed below.
from multiprocessing_files.show_resources import show_resources
show_resources()
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '69', '71', '73', '82', '83', '84', '86', '93', '94', '99', '103', '106', '107', '108']
The following cell creates a function that runs show_resources
with different start methods.
def start_process(start_method):
    context = multiprocessing.get_context(start_method)
    p = context.Process(target=show_resources)
    p.start()
    p.join()
The following cells show the descriptors in different contexts.
start_process("spawn")
['0', '1', '2', '3', '4', '5', '63', '64']
start_process('fork')
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '82', '83', '84', '86', '93', '94', '99', '103', '106', '107', '108']
start_process('forkserver')
['0', '1', '2', '3', '4', '5', '6', '10', '11']
As a result, each start method produces a different list of file descriptors, which in practice defines the behaviour of the process.
Command line#
It’s useful to know that in multiprocessing all child processes inherit sys.argv
from the parent process. sys.argv
is a list that contains the command line arguments passed to the program.
The only way to somehow deal with this is to reset sys.argv
at the beginning of the child process.
The following cell shows an application that prints the arguments passed to it and then launches a new process that also prints the values of the CLI arguments.
%%writefile /tmp/some_code.py
import sys
import multiprocessing

def show_args():
    print("Sys arg child", sys.argv)

print("Sys arg parent", sys.argv)
p = multiprocessing.Process(target=show_args)
p.start()
p.join()
Overwriting /tmp/some_code.py
Running this script with some parameters shows that the child process receives exactly the same argument list as the parent process.
!python3 /tmp/some_code.py --this_argument 10
Sys arg parent ['/tmp/some_code.py', '--this_argument', '10']
Sys arg child ['/tmp/some_code.py', '--this_argument', '10']
The following cell shows the trick of resetting sys.argv
at the start of the child process. It prints sys.argv
before and after the process runs, to make sure the parent's sys.argv
isn't changed.
import sys

print(sys.argv)

def inp():
    sys.argv = ["this is my argv"]
    print(sys.argv)

p = multiprocessing.Process(target=inp)
p.start()
p.join()
print(sys.argv)
['/home/fedor/.virtualenvs/python/lib/python3.12/site-packages/ipykernel_launcher.py', '--f=/run/user/1000/jupyter/runtime/kernel-v3f54d051c24f29b5db57337cf7bac857dfaa600e4.json']
['this is my argv']
['/home/fedor/.virtualenvs/python/lib/python3.12/site-packages/ipykernel_launcher.py', '--f=/run/user/1000/jupyter/runtime/kernel-v3f54d051c24f29b5db57337cf7bac857dfaa600e4.json']