Exploiting Multiprocessing and Multithreading in Python as a Data Scientist

Parthasarathy Subburaj
Published in Analytics Vidhya
11 min read · Jan 27, 2020


We are living in an age where every day about 500 million tweets are sent out, every hour about 4 million photos are uploaded to Instagram, every minute about 300 hours of video are uploaded to YouTube, every second about 750,000 messages are sent on WhatsApp, every millisecond about 2,900 emails are sent, and the list goes on. The amount of data collected per day by 2025 is estimated to be about 463 exabytes! Developing machine learning or deep learning algorithms that consume such voluminous data is going to be a challenge for data scientists. At this point, it becomes quite imperative to develop these algorithms efficiently so that they can process the data at a faster rate (of course we need better hardware to run these algorithms, but let's just leave that to the NVIDIA folks!).

And this is where parallel computing comes into the picture. As the name says, it is a type of computation in which calculations or processes are executed simultaneously, resulting in a significant boost in the performance of an algorithm. multiprocessing and threading are two inbuilt Python modules that allow us to perform parallel computing. In this article, we will explore how data scientists can make use of these modules to speed up their pipelines. Towards the end of the article, we will apply these techniques to two well-known areas of deep learning, namely Computer Vision and Natural Language Processing, and see the benefits of parallel computing.

Some Preliminaries

Before diving deep into parallel computing, let's get our basics straight. To begin with, let's see what programs, processes, and threads mean in the context of a computer.

Program: It is a collection of instructions, an executable file residing in the secondary storage of a computer, that performs a specific task when executed. Since programs reside in secondary storage, they persist even when the system is turned off. For example, on a Windows machine the Calculator is a program residing at C:\windows\system32\calc.exe; likewise, on a Linux system the ls command is a binary file stored at /bin/ls.

Process: It is a running instance of a program and always lives in RAM; each process has its own memory, data, and other resources needed to execute the program. There can be multiple processes associated with the same program: for example, each tab of Google Chrome is a process in its own right (you can verify this by opening the Task Manager on a Windows machine) of the program chrome.exe residing in your secondary storage. Since processes live in RAM, they are lost once the system is turned off. Also, a single CPU can execute only one process at any instant of time. But wait, then how did uni-processor computers, with just one processor, even work, given that a computer runs many processes (web browser, music player, text editor, etc.) seemingly simultaneously? Hold on to that question for a moment.

Thread: It is an entity that resides within a process; every process starts with a single thread called the primary thread. A thread is the basic unit to which an operating system assigns processor time. One or more additional threads can be spawned within a process, and all of them share the memory space allocated to the process and execute within its scope.
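To make this concrete, here is a minimal sketch of spawning a second thread from the primary thread using Python's built-in threading module (the name worker-1 is just illustrative):

import threading

def greet():
    # This runs in a spawned thread, inside the same process
    print("Hello from", threading.current_thread().name)

t = threading.Thread(target=greet, name="worker-1")
t.start()  # spawn a second thread alongside the primary (main) thread
t.join()   # the primary thread waits for the spawned thread to finish
print("Back in", threading.current_thread().name)  # MainThread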

To give a sense of what we have discussed so far, below are screenshots of my computer's Task Manager and Resource Monitor, showing the processes and the threads associated with each process that were running on my system when the screenshots were taken.

Screenshot of the Task Manager
Screenshot of the Resource Monitor, with the threads associated with each process highlighted

More about this can be found here and here.

Multiprocessing and Multithreading

Now that we have some understanding of the basics, we are ready to see what Multiprocessing and Multithreading are:

Multiprocessing is the ability of a system to support more than one process simultaneously. If you refer to the figure above, you can see that my PC was running 331 processes when the screenshot was taken. Here comes the interesting part: we already stated that a single CPU can run only one process at any instant of time, so how was my PC able to run 331 processes? My PC definitely does not have 331 CPUs, and this is where we need to understand context switching. It is the ability of the operating system to pause the currently running process, save its state, and switch to another process, so that it can resume the paused process at a later point. In reality, this switching between processes happens so quickly that no one can discern it, creating the illusion that the system is running multiple processes simultaneously, which is not actually the case (and that is how a computer with just one CPU can work, thanks to context switching). Every process has its own memory space and is quite independent, so even if one process gets corrupted, it has no impact on the other processes. But this comes at a cost: since every process has its own memory space, sharing objects between processes becomes difficult. This is why we have inter-process communication (IPC), which is quite complicated and involves more overhead.
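For illustration, here is a minimal sketch of IPC using multiprocessing.Queue; note that the object has to be serialized and shuttled between the two processes, which is exactly the overhead that threads sharing one memory space avoid:

import multiprocessing as mp

def produce(q):
    # The dict is pickled and sent to the parent process through the queue
    q.put({"answer": 42})

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=produce, args=(q,))
    p.start()
    print(q.get())  # {'answer': 42}
    p.join()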

Multithreading is the ability of an OS to spawn multiple threads within a process to execute it. All the threads of a process run in the same memory space, so sharing objects between threads is quite easy. Having said that, allowing multiple threads to access the same objects inside a process can result in race conditions: when multiple threads try to change the value of a single variable simultaneously, the outcome is unpredictable, depending on how the OS switches between the threads. An interesting example of the undesirable effects of a race condition can be found here. To address this pitfall, computers use mutual exclusion (mutex) locks, which prevent multiple threads from modifying the same variable concurrently. Unfortunately, CPython's memory management is not thread-safe, so to prevent race conditions, Python uses the Global Interpreter Lock (GIL) as a mutex. The GIL ensures that only one thread in a process is executing at any instant of time, so no two threads can modify a variable simultaneously. For any thread to begin execution, it has to acquire the GIL, and there is only one GIL per process. Once a thread finishes its execution, it releases the GIL so that other threads can acquire it. More discussion of the GIL can be found here.
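As a quick sketch of what a mutex looks like in user code, here four threads increment a shared counter; without the lock, the read-modify-write inside counter += 1 can interleave across threads and drop updates, GIL notwithstanding:

import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:  # remove this lock and the final count will often fall short
            counter += 1

threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock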

Having discussed multiprocessing and multithreading to some extent, we still need to answer one important question: how do we choose between the two? The answer really depends on what type of operation you will be doing. The GIL, which prevents race conditions in Python, has its own drawbacks, and its existence is perennially debated in the Python community. Since only one thread can hold the GIL at any instant of time, only the thread that has acquired the GIL is actually executing, while all remaining threads sit idle waiting to acquire it. This effectively makes CPU-bound multithreaded operations in Python single-threaded. The figure below illustrates this drawback.

Three threads fighting to acquire the GIL before they can execute

At first, thread one acquires the GIL and starts running while threads two and three idle; once thread one releases the GIL, thread two acquires it, leaving threads one and three idle. So essentially it is a single-threaded operation, just spread across multiple threads. However, the GIL is always released during I/O-bound operations such as downloading content from the network, waiting for a user's input, or writing to a database; in these scenarios, the CPU is simply waiting on resources from external agents before it can do any work, and multithreading is a very good way to speed up your pipeline. On the other hand, CPU-intensive tasks such as matrix multiplication, sorting, searching, and graph traversal demand a lot of attention from the CPU and are very good candidates for multiprocessing. It is also important to note that spawning a process comes with more overhead than spawning a thread.
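As a minimal sketch of an I/O-bound workload where threads shine (the URL list is just a placeholder), each thread releases the GIL while it waits on the network, so the downloads overlap:

import concurrent.futures
import urllib.request

urls = ["https://example.com"] * 5  # placeholder URLs

def fetch(url):
    # The GIL is released while this thread blocks on the network
    with urllib.request.urlopen(url, timeout=10) as response:
        return len(response.read())

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    sizes = list(executor.map(fetch, urls))
print(sizes)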

A funny illustration of the power of the GIL. That's the GIL for you!

Examples

Okay, enough theory. Let's write some code and put our theory into practice.

Note: For those of you who are interested in following along with the code in this article, kindly refer here.

To begin with, let's consider a few toy examples before jumping into real-world use cases. As usual, imports first.

import os
import time
import cv2
import dlib
import glob
import random
import nltk
import numpy as np
from functools import reduce
import concurrent.futures
import xml.etree.ElementTree as ET
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

Let’s write a simple function which just delays the execution for a given number of seconds.

def sleep_fun(seconds):
    print("Sleeping for {} second(s)".format(seconds))
    time.sleep(seconds)

Since this is not a CPU-intensive task, it is a good candidate for multithreading. Let's execute this code with and without multithreading to see the difference.

sleep_times = [1, 2, 3]

start = time.time()
for i in sleep_times:
    sleep_fun(i)
end = time.time()
print("Series computation: {} secs".format(end - start))

# Output:
# Sleeping for 1 second(s)
# Sleeping for 2 second(s)
# Sleeping for 3 second(s)
# Series computation: 6.0069239139556885 secs

start = time.time()
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(sleep_fun, sleep_times)
end = time.time()
print("Multithreading computation: {} secs".format(end - start))

# Output:
# Sleeping for 1 second(s)
# Sleeping for 2 second(s)
# Sleeping for 3 second(s)
# Multithreading computation: 3.0123729705810547 secs

We can very well see that with multithreading the code takes just about three seconds to complete, as against six seconds the regular way. Unlike the series execution, where each call to sleep_fun starts only after the previous call completes, in the multithreaded execution sleep_fun is called three times concurrently, so the total time taken is just about three seconds (sleep_fun(1) and sleep_fun(2) have already completed by the time sleep_fun(3) is done).
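One practical note: executor.map returns results in input order and hides the individual tasks. If you need a handle per task, for example to consume results as soon as each one finishes or to catch exceptions per task, executor.submit combined with as_completed is a common pattern. A sketch using the same sleep_fun:

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(sleep_fun, t) for t in sleep_times]
    for future in concurrent.futures.as_completed(futures):
        future.result()  # returns the task's result, re-raising any exception it raised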

Now let's move on to another toy example; consider the following function:

def calculation(number):
    random_list = random.sample(range(10000000), number)
    return reduce(lambda x, y: x * y, random_list)

This function essentially finds the product of all the elements in a list. Keep in mind that a lot of number crunching is involved, so it is CPU-intensive. Given the nature of the operation and the existence of the GIL, multithreading is definitely not a good option here. Let's see what happens when we execute this function in series and in multithreading mode.

numbers = [200000, 200000, 200000]

start = time.time()
for i in numbers:
    result = calculation(i)
end = time.time()
print("Series computation: {} sec".format(end - start))

# Output:
# Series computation: 39.342278718948364 sec

start = time.time()
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(calculation, numbers)
end = time.time()
print("Multithreading computation: {} secs".format(end - start))

# Output:
# Multithreading computation: 40.37465953826904 secs

As expected, multithreading was no better than series execution; in fact, it was slightly slower, because there is always some overhead involved in spawning threads and collating their results. Now let's use multiprocessing and see how it can help.

start = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(calculation, numbers)
end = time.time()
print("MultiProcessing computation: {} secs".format(end - start))

# Output:
# MultiProcessing computation: 13.569030284881592 secs

Ta-da! The execution is done in just about one-third of the time it took in the series and multithreaded runs, because three processes were spawned to execute the three calls to calculation. Since each of the three processes has its own memory space and its own interpreter (and hence its own GIL), there is no lock contention between them, and all three calls were executed in parallel.
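One caveat worth knowing: on platforms that spawn worker processes instead of forking them (Windows, and macOS on recent Python versions), each worker re-imports your script, so ProcessPoolExecutor code in a standalone script should sit behind the standard guard. A minimal sketch with the same calculation and numbers:

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(calculation, numbers))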

An Example from Computer Vision

Now let's move on to examples of real use to data scientists. One of the most common tasks in computer vision is face detection, where the objective is to detect whether there is a face in an image and, if so, output its bounding box; dlib is one of the most popular libraries for this purpose. For this exercise let's work with ten images, each containing one or more people. Let's begin by writing a function that takes the path of an image, detects all the faces in it, draws a bounding box around each face, and writes the result to disk.

def face_detection(image_path):
    image_name = os.path.basename(image_path)
    image = cv2.imread(image_path)
    face_rect = image.copy()
    faces = face_detector(image)  # uses the global detector created below
    if len(faces) != 0:
        for face in faces:
            x1 = face.left()
            y1 = face.top()
            x2 = face.right()
            y2 = face.bottom()
            face_rect = cv2.rectangle(face_rect, (x1, y1), (x2, y2), (255, 0, 0), 5)
    cv2.imwrite("./Datasets/face_processed/" + image_name, face_rect)
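One subtlety here: face_detection relies on a global face_detector, which we create in the next code block. If your platform spawns worker processes rather than forking them, a safer pattern is to build the detector once per worker via the executor's initializer argument (available since Python 3.7); a sketch under that assumption:

def init_worker():
    # Each worker process builds its own detector instance
    global face_detector
    face_detector = dlib.get_frontal_face_detector()

with concurrent.futures.ProcessPoolExecutor(initializer=init_worker) as executor:
    executor.map(face_detection, images)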

As usual, let's compare the results of executing this function in series, multithreading, and multiprocessing modes.

face_detector = dlib.get_frontal_face_detector()
images = list(glob.iglob("./Datasets/face_raw/*.jpg"))
images.sort()

start = time.time()
for i in images:
    face_detection(i)
end = time.time()
print("Series computation: {} seconds".format(end - start))

# Output:
# Series computation: 17.05126118659973 seconds

start = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(face_detection, images)
end = time.time()
print("Multiprocessing computation: {} sec".format(end - start))

# Output:
# Multiprocessing computation: 3.2741615772247314 sec

start = time.time()
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(face_detection, images)
end = time.time()
print("Multithreading computation: {} sec".format(end - start))

# Output:
# Multithreading computation: 14.531288146972656 sec

frontal_face_detector works by computing Histogram of Oriented Gradients (HOG) features, combined with a linear classifier, an image pyramid, and a sliding-window detection scheme; that is a lot of work for the CPU. So it is not surprising that multiprocessing performed the best of the three. Using the multiprocessing module resulted in a speed-up of about 5.2x over the naive series execution, and that's the power of multiprocessing!
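If you scale this up from ten images to thousands, one knob worth trying is the chunksize parameter of ProcessPoolExecutor.map, which batches several items per inter-process round-trip to cut IPC overhead (it has no effect on ThreadPoolExecutor); a sketch:

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(face_detection, images, chunksize=4)  # send 4 paths per IPC round-trip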

An Example from Natural Language Processing

For those who work in the NLP space, I have you covered as well. Here is a task that is quite common in Natural Language Processing: processing XML files. Below are two helper functions that parse an XML file, stem its text, and write the result to disk.

def stemSentence(sentence, stemmer):
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(stemmer.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)

def xml_process(xml_path):
    try:
        root = ET.parse(xml_path).getroot()
        posts = []
        file_name = os.path.basename(xml_path)[:-4] + ".txt"

        for i in root.iter("post"):
            posts.append(i.text)

        porter = PorterStemmer()
        sentences = map(lambda x: stemSentence(x, porter), posts)

        for i in sentences:
            with open("./Datasets/blog_processed/" + file_name, "a+") as file:
                file.write(i)
                file.write("\n")
    except:
        # silently skip files that fail to parse
        pass
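A small prerequisite: word_tokenize depends on NLTK's punkt tokenizer models, so if you are running this for the first time you may need a one-time download:

nltk.download("punkt")  # one-time download of the tokenizer models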

Now let's try our luck with the series, multithreading, and multiprocessing techniques.

xml_files = glob.glob("./Datasets/blog_xml/*.xml")

start = time.time()
for i in xml_files:
    xml_process(i)
end = time.time()
print("Series computation: {} seconds".format(end - start))

# Output:
# Series computation: 8.3497314453125 seconds

start = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(xml_process, xml_files)
end = time.time()
print("Multiprocessing computation: {} sec".format(end - start))

# Output:
# Multiprocessing computation: 1.6208791732788086 sec

start = time.time()
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(xml_process, xml_files)
end = time.time()
print("Multithreading computation: {} sec".format(end - start))

# Output:
# Multithreading computation: 9.766543626785278 sec

Again, the results come as no surprise, considering that the operation we performed was CPU-intensive.

Key takeaway: for CPU-bound work, when in doubt, use multiprocessing!

Hope this article helps some of you in speeding up your data pipelines.

May the force be with you!
