Multiprocessing#

multiprocessing is a Python package that allows you to spawn processes from Python. Check the official documentation.

import time
import random
import numpy as np

import multiprocessing
from multiprocessing import Process

Single process#

To create a process from scratch, use the multiprocessing.Process class.

  • The start() method starts the execution;

  • The join() method synchronises the parent process with the child process: it makes the parent wait for the child to finish before continuing.


The following example runs the same function in different processes. The first one performs twice as many iterations as the second.

def count(N: int, process_name: str):
    st_time = time.time()
    for i in range(N):
        ((i + 10) / 25) ** (1 / 2)
    en_time = time.time()
    print(f"{process_name} is finished {en_time - st_time}")

iterations = 10**8
p1 = Process(target=count, args=(iterations, "first"))
p2 = Process(target=count, args=(iterations // 2, "second"))

p1.start()
p2.start()

print("Processes were started")

p1.join()
p2.join()

print("Processes were joined")
Processes were started
second is finished 2.0598533153533936
first is finished 4.119459390640259
Processes were joined

So, although we started the first process earlier, it finished later, confirming that we achieved parallel computation.

print("Processes were started") was executed immediately after the processes were started, but print("Processes were joined") was executed only when both processes had finished - this shows that the main process was blocked by the join method of the child processes.

Pool#

In practice, multiprocessing.Pool is more commonly used.

The following cell defines a function that creates an array of 1,000,000 floats and then calculates the minimum, maximum and average values over them. In the next cells we will run it multiple times: first in a plain loop and then with multiple processes.

from multiprocessing import Pool
def gen_random(_):
    my_array = [random.random() for _ in range(1_000_000)]
    return (min(my_array), max(my_array), np.mean(my_array))

First, the classical option - a plain loop that calls the function 10 times.

%%timeit
[gen_random(None) for i in range(10)]
1.08 s ± 54.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now let's run it in 10 processes.

%%timeit
pool = Pool(processes=10)
results = pool.map(gen_random, [None] * 10)
pool.close()
pool.join()
511 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The increase in speed is obvious.

But let’s make sure that the solution via multiprocessing.Pool leads to the expected results.

pool = Pool(processes=10)
results = pool.map(gen_random, [None] * 10)
results
[(4.888834341798542e-07, 0.9999993695614751, 0.500253121167094),
 (1.5193695950266317e-06, 0.9999995713653611, 0.5004648527204844),
 (8.338447909927993e-08, 0.9999982350763111, 0.49987126646962265),
 (3.533316917048168e-06, 0.9999988394522739, 0.5003074522538739),
 (6.854293412850154e-07, 0.9999992469319836, 0.5001294510917905),
 (7.25034506876554e-08, 0.9999993853360523, 0.5000197205058422),
 (2.1432668373400077e-07, 0.9999985629946924, 0.500037866675235),
 (3.241340156279193e-07, 0.9999995020233481, 0.4998192310331728),
 (3.97997433898567e-08, 0.9999970648221497, 0.49967546355687814),
 (1.1574031938410556e-06, 0.999999502338194, 0.4999944256310727)]

Start methods#

Processes can be started in different ways, called start methods. There are three of them: spawn, fork and forkserver. In practice, what matters is what the child process inherits from the parent. For more details, check the official documentation.
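A minimal sketch of how to inspect the default start method and obtain a context bound to a specific one (the default varies by platform and Python version):

```python
import multiprocessing

# The global default: typically "fork" on Linux, "spawn" on macOS/Windows.
print(multiprocessing.get_start_method())

# A context object is bound to one start method and exposes the same
# API (Process, Pool, Queue, ...) without changing the global default.
ctx = multiprocessing.get_context("spawn")
print(ctx.get_start_method())
```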


The following cell defines a Python module that prints the file descriptors opened by the process; on Linux, these generally represent the resources of the process.

Note: it must be defined in a separate module because, for example, the spawn method must be able to import it.

%%writefile multiprocessing_files/show_resources.py
import os
pid = os.getpid()

def show_resources():
    pid = os.getpid()
    print(os.listdir(f"/proc/{pid}/fd"))
Writing multiprocessing_files/show_resources.py

The resources available in the main process are listed below.

from multiprocessing_files.show_resources import show_resources
show_resources()
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '69', '71', '73', '82', '83', '84', '86', '93', '94', '99', '103', '106', '107', '108']

The following cell creates a function that runs show_resources with different startup methods.

def start_process(start_method):
    context = multiprocessing.get_context(start_method)
    p = context.Process(target=show_resources)
    p.start()
    p.join()

The following cells show the descriptors under the different contexts.

start_process("spawn")
['0', '1', '2', '3', '4', '5', '63', '64']
start_process('fork')
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '82', '83', '84', '86', '93', '94', '99', '103', '106', '107', '108']
start_process('forkserver')
['0', '1', '2', '3', '4', '5', '6', '10', '11']

As a result, each start method produces a different list of file descriptors, which in practice defines the behaviour of the process.

Command line#

It’s useful to know that in multiprocessing all child processes inherit sys.argv from the parent process. sys.argv is a list that contains the command line arguments passed to the program.

The only way to somehow deal with this is to reset sys.argv at the beginning of the child process.


The following cell shows an application that prints the arguments passed to it and then launches a new process that also prints the values of the CLI arguments.

%%writefile /tmp/some_code.py
import sys
import multiprocessing

def child():
    print("Sys arg child", sys.argv)

print("Sys arg parent", sys.argv)

p = multiprocessing.Process(target=child)
p.start()
p.join()
Overwriting /tmp/some_code.py

Running this script with some parameters shows that the child process receives the same argument list as the parent.

!python3 /tmp/some_code.py --this_argument 10
Sys arg parent ['/tmp/some_code.py', '--this_argument', '10']
Sys arg child ['/tmp/some_code.py', '--this_argument', '10']

The following cell shows the trick of resetting sys.argv at the start of the child process. It prints sys.argv before and after the process runs, to make sure the parent's sys.argv isn't changed.

import sys

print(sys.argv)

def inp():
    sys.argv = ["this is my argv"]
    print(sys.argv)

p = multiprocessing.Process(target=inp)
p.start()
p.join()

print(sys.argv)
['/home/fedor/.virtualenvs/python/lib/python3.12/site-packages/ipykernel_launcher.py', '--f=/run/user/1000/jupyter/runtime/kernel-v3f54d051c24f29b5db57337cf7bac857dfaa600e4.json']
['this is my argv']
['/home/fedor/.virtualenvs/python/lib/python3.12/site-packages/ipykernel_launcher.py', '--f=/run/user/1000/jupyter/runtime/kernel-v3f54d051c24f29b5db57337cf7bac857dfaa600e4.json']