100 Times Faster Natural Language Processing in Python
So, how can we speed up these loops?
Fast Loops in Python with a bit of Cython
Let's work this out on a simple example. Say we have a large set of rectangles that we store as a list of Python objects, e.g. instances of a Rectangle class. The main job of our module is to iterate over this list in order to count how many rectangles have an area larger than a specific threshold.
Our Python module is quite simple and looks like this:
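A minimal sketch of such a module is given here. The attribute names w and h, the random test data, the default list size and the 0.25 threshold are assumptions chosen for illustration; only the Rectangle class, its area method and the check_rectangles function are named in the surrounding text.

```python
from random import random


class Rectangle:
    """A rectangle described by its width and height (assumed attribute names)."""

    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h


def check_rectangles(rectangles, threshold):
    """Count the rectangles whose area is strictly larger than threshold."""
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out


def main(n_rectangles=100_000):
    # Build a list of rectangles with random sides in [0, 1).
    rectangles = [Rectangle(random(), random()) for _ in range(n_rectangles)]
    return check_rectangles(rectangles, threshold=0.25)


if __name__ == "__main__":
    print(main())
```

The whole cost of the module sits in the for loop of check_rectangles, which touches one Python object per iteration.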
The check_rectangles function is our bottleneck! It loops over a large number of Python objects, which can be rather slow because the Python interpreter does a lot of work under the hood at each iteration (looking up the area method in the class, packing and unpacking arguments, calling the Python API...).
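One way to glimpse this per-iteration work is to list the bytecode instructions that CPython runs for the function, using the standard-library dis module. The sketch below redefines a small Rectangle class (with assumed attributes w and h) so that it runs on its own, and the opnames helper is a hypothetical name introduced for illustration. The exact instruction names differ between CPython versions, so no particular output is shown.

```python
import dis


class Rectangle:
    def __init__(self, w, h):  # w and h are assumed attribute names
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h


def check_rectangles(rectangles, threshold):
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out


def opnames(func):
    """Return the names of the bytecode instructions of func, in order."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


if __name__ == "__main__":
    # The instructions that follow FOR_ITER make up the loop body. The
    # interpreter goes through them again for every rectangle: a lookup of
    # area on the object, a call, a comparison, and so on.
    for name in opnames(check_rectangles):
        print(name)
```

Each of these instructions is dispatched by the interpreter and operates on Python objects, which is the overhead a loop over plain C data avoids.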
Here comes Cython to help us speed up our loop.
The Cython language is a superset of Python that contains two kinds of objects:
- Python objects are the objects we manipulate in regular Python, like numbers, strings, lists, class instances...
- Cython C objects are C or C++ objects like double, int, float, struct, vectors.
A fast loop is simply a loop in a Cython program within which we only access Cython C objects.
A straightforward approach to designing such a loop is to define C structures that will contain all the things we need during our computation: in our case, the lengths and widths of our rectangles.
We can then store our list of rectangles in a C array of such structures that we will pass to our check_rectangles function. This function now has to accept a C array as input and thus will be defined as a Cython function by using the cdef keyword instead of def (note that cdef is also used to define Cython C objects).
Here is what the fast Cython version of our Python module looks like:
Here we used a raw array of C pointers but you can also choose other options, in particular C++ structures like vectors, pairs, queues and the like. In this snippet, I also used the convenient Pool() memory management object of cymem to avoid having to free the allocated C array manually. When Pool is garbage collected by Python, it automatically frees the memory we allocated using it.
A good reference on the practical usage of Cython in NLP is the Cython Conventions page of spaCy's API.
Let's Try that Code!
There are many ways you can test, compile and distribute Cython code! Cython can even be used directly in a Jupyter Notebook like Python.
First install Cython with pip install cython
First Tests in Jupyter
Load the Cython extension in a Jupyter notebook with %load_ext Cython.
Now you can write Cython code like Python code by using the magic command %%cython.
If you get a compilation error when you execute a Cython cell, be sure to check the Jupyter terminal output to see the full message.
Most of the time you'll be missing a -+ tag after %%cython to compile to C++ (for example if you use the spaCy Cython API) or an import numpy if the compiler complains about NumPy.
As I mentioned in the beginning, check the Jupyter Notebook accompanying this post: it has all the examples we discuss, running in Jupyter.
Writing, Using and Distributing Cython Code
Cython code is written in .pyx files. These files are compiled to C or C++ files by the Cython compiler, and then to a native extension module (a shared library) by the system's C or C++ compiler. The Python interpreter can then import this extension module like any other module.
You can load a .pyx file directly in Python by using pyximport:
>>> import pyximport; pyximport.install()
>>> import my_cython_module
You can also build your Cython code as a Python package and import/distribute it as a regular Python package as detailed here. This can take some time to get working, in particular on all platforms. If you need a working example, spaCy's install script is a rather comprehensive one.
Before we move to some NLP, let's quickly talk about the def, cdef and cpdef keywords, because they are the main things you need to grasp to start using Cython.
You can use three types of functions in a Cython program:
- Python functions, which are defined with the usual keyword def. They take as input and output Python objects. Internally they can use both Python and C/C++ objects and can call both Cython and Python functions.
- Cython functions defined with the cdef keyword. They can take as input, use internally and output both Python and C/C++ objects. These functions are not accessible from the Python-space (i.e. the Python interpreter and other pure Python modules that would import your Cython module) but they can be imported by other Cython modules.
- Cython functions defined with the cpdef keyword are like the cdef Cython functions but they are also provided with a Python wrapper so they can be called from the Python-space (with Python objects as inputs and outputs) as well as from other Cython modules (with C/C++ or Python objects as inputs).
The cdef keyword has another use, which is to type Cython C/C++ objects in the code. Unless you type your objects with this keyword, they will be considered as Python objects (and thus slow to access).