Boosting NumPy: Why BLAS Matters & Building NumPy with MKL
The speed-up effect of OpenBLAS and MKL
I recently noticed that the same code on the same machine had vastly different run times in different virtual environments. This looked a little suspicious to me. Digging deeper, I realized that the reason was that numpy was linked against different BLAS libraries in the different virtual environments, which had a tremendous effect on performance. Taking a closer look at which implementation is actually used under the hood on your machine, and possibly compiling an implementation that is tailored to your very system, can yield significant speed-ups almost for free.
BLAS & LAPACK
BLAS is the low–level part of your system that is responsible for efficiently performing numerical linear algebra, i.e. all the heavy number crunching. More precisely:
Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits.
Together with something like LAPACK, it is the cornerstone of today’s scientific software packages.
LAPACK (Linear Algebra Package) is a standard software library for numerical linear algebra. It provides routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. It also includes routines to implement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition.
Now, as mentioned above, BLAS & LAPACK are merely defining a standard interface, but what really matters is the implementation you choose. There are many different implementations under different licenses, which all have their advantages and disadvantages. It can sometimes be difficult to choose the most beneficial implementation for your machine.
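To make this concrete: the routines these specifications prescribe have names like dgemm (double-precision general matrix-matrix multiply) or dpotrf (Cholesky factorization). The minimal sketch below calls two of them directly through SciPy's low-level wrappers; this assumes SciPy is installed, and note that numpy itself dispatches to the same routines internally.

import numpy as np
from scipy.linalg import blas, lapack

np.random.seed(0)
a = np.random.random((3, 3))
b = np.random.random((3, 3))

# dgemm: double-precision general matrix-matrix multiply (BLAS level 3)
c = blas.dgemm(alpha=1.0, a=a, b=b)
assert np.allclose(c, np.dot(a, b))

# dpotrf: LAPACK Cholesky factorization of a symmetric positive-definite matrix
spd = np.dot(a, a.T) + 3.0 * np.eye(3)
chol, info = lapack.dpotrf(spd, lower=1, clean=1)
assert info == 0
assert np.allclose(np.dot(chol, chol.T), spd)

Whichever BLAS library these wrappers are linked against does the actual number crunching, which is exactly why the choice of implementation matters.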
Different Implementations
Default BLAS & LAPACK
By default, you will most likely only have the standard BLAS & LAPACK implementation on your system (e.g. the Arch Linux packages blas/cblas and lapack). While this setup works out of the box and is certainly sufficient if you don’t care about performance, it has considerable shortcomings compared to more recent implementations, e.g. it does not fully support multi-core computations. We will see how its overall performance compares to the other implementations below.
Automatically Tuned Linear Algebra Software (ATLAS)
The main purpose of the ATLAS project is to provide a portable high–performance implementation for many different platforms. To achieve this, they use a complex mechanism to automatically determine the optimal parameters for the respective hardware during compilation. ATLAS has become quite popular in recent years, since it uses the BSD license and is platform independent. When compiled properly, it probably provides a reasonably good BLAS library for most performance–critical applications on most platforms.
On Arch Linux you can use the atlas-lapack package from the repository, which will download and compile ATLAS on your machine. However, before starting the compilation, make sure to disable CPU throttling in your OS, since it will mess up the timing when ATLAS benchmarks itself during compilation. To do this, I had to run sudo cpupower frequency-set -g performance on my machine. Additionally, I had to disable intel_pstate, which actually required me to reboot my system. After this, the compilation process ran smoothly, and ATLAS determined the optimal settings for my machine, which ended up taking a couple of hours.
Intel Math Kernel Library (Intel MKL)
The Intel Math Kernel Library (Intel MKL) is a proprietary library that is hand–optimized specifically for Intel processors. As of February 2017, it is free of charge for some use cases. You can read more about the Community Licensing here.
If you are using Arch Linux, you can obtain it by installing the intel-mkl package; otherwise, search your repository or download it directly from Intel’s developer page. The installation process is straightforward, just follow the instructions provided.
OpenBLAS
OpenBLAS is another popular open–source implementation that is based on a fork of GotoBLAS2. They claim in their FAQ that OpenBLAS achieves a performance comparable to Intel MKL for Intel’s Sandy Bridge CPUs. Furthermore, OpenBLAS is well–known for its multi-threading features and apparently scales very nicely with the number of cores.
Obtain a copy from openblas.net or GitHub and follow the installation instructions, which will walk you through the compilation and installation process.
Time for Benchmarks
A reliable comparison can only be based on hard numbers. For this purpose, I drafted a small Python script, based on an existing script that I found in this Stack Overflow thread. Here it is:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import numpy as np
from time import time
# Let's take the randomness out of random numbers (for reproducibility)
np.random.seed(0)
size = 4096
A, B = np.random.random((size, size)), np.random.random((size, size))
C, D = np.random.random((size * 128,)), np.random.random((size * 128,))
E = np.random.random((int(size / 2), int(size / 4)))
F = np.random.random((int(size / 2), int(size / 2)))
F = np.dot(F, F.T)
G = np.random.random((int(size / 2), int(size / 2)))
# Matrix multiplication
N = 20
t = time()
for i in range(N):
    np.dot(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B
# Vector multiplication
N = 5000
t = time()
for i in range(N):
    np.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D
# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    np.linalg.svd(E, full_matrices=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E
# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    np.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
# Eigendecomposition
t = time()
for i in range(N):
    np.linalg.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
This performs some matrix-matrix multiplication, vector-vector multiplication, singular value decomposition (SVD), Cholesky factorization, and eigendecomposition, and averages the timing results over multiple runs (the repetition counts are of course arbitrary). This is how you can find out which BLAS implementation numpy is using under the hood:
import numpy as np
np.__config__.show()
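For an MKL-linked build, for instance, the output contains sections along these lines (exact keys and paths vary with the NumPy version and build; compare the build log shown later in this post):

blas_mkl_info:
    library_dirs = ['/opt/intel/mkl/lib/intel64']
    include_dirs = ['/opt/intel/mkl/include']
    libraries = ['mkl_def', 'mkl_intel_lp64', 'mkl_gnu_thread', 'mkl_core', 'pthread']
    define_macros = [('SCIPY_MKL_H', None)]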
You can run the script successively using different BLAS implementations, either in different virtual environments or using a different BLAS configuration in site.cfg. For a fair comparison, you should set your CPU to a constant clock frequency and not perform any other heavy lifting while the scripts are running, especially when they make use of multiple cores on your machine.
Let’s find out how the different implementations compare in terms of performance on my laptop!
Default BLAS & LAPACK
Dotted two 4096x4096 matrices in 64.22 s.
Dotted two vectors of length 524288 in 0.80 ms.
SVD of a 2048x1024 matrix in 10.31 s.
Cholesky decomposition of a 2048x2048 matrix in 6.74 s.
Eigendecomposition of a 2048x2048 matrix in 53.77 s.
ATLAS
Dotted two 4096x4096 matrices in 3.46 s.
Dotted two vectors of length 524288 in 0.73 ms.
SVD of a 2048x1024 matrix in 2.02 s.
Cholesky decomposition of a 2048x2048 matrix in 0.51 s.
Eigendecomposition of a 2048x2048 matrix in 29.90 s.
Intel MKL
Dotted two 4096x4096 matrices in 2.44 s.
Dotted two vectors of length 524288 in 0.75 ms.
SVD of a 2048x1024 matrix in 1.34 s.
Cholesky decomposition of a 2048x2048 matrix in 0.40 s.
Eigendecomposition of a 2048x2048 matrix in 10.07 s.
OpenBLAS
Dotted two 4096x4096 matrices in 3.97 s.
Dotted two vectors of length 524288 in 0.74 ms.
SVD of a 2048x1024 matrix in 1.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.46 s.
Eigendecomposition of a 2048x2048 matrix in 32.95 s.
Today’s Lesson
We can see that there are tremendous differences in run time. Unsurprisingly, the default implementation is by far the slowest in most regards. Intel’s Math Kernel Library (Intel MKL) performs best, which is again not truly surprising, given that I use an Intel i5 processor on my machine. ATLAS and OpenBLAS are both in the same ballpark and do reasonably well. All implementations except the default one make use of all cores on my machine, which can explain some, but not all, of the performance gains. You can control the number of threads used by setting the respective environment variables (e.g. OMP_NUM_THREADS).
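For example, to restrict the benchmark to a single thread, set the variables before numpy is imported; which variable is honored depends on the BLAS implementation and how it was built (ATLAS fixes its thread count at compile time and ignores them):

import os

# These must be set before numpy is imported
os.environ['OMP_NUM_THREADS'] = '1'        # OpenMP-threaded builds
os.environ['OPENBLAS_NUM_THREADS'] = '1'   # OpenBLAS
os.environ['MKL_NUM_THREADS'] = '1'        # Intel MKL

import numpy as np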
The difference between the slowest and the fastest implementation is quite remarkable. If the low-level math computation is actually your bottleneck, and you are not already making use of all cores on your machine, you can expect speed-ups of up to a factor of 25 (in my case) or something in that order of magnitude. This is roughly equivalent to getting the run time down from a full nightly run, i.e. from leaving the office in the afternoon to the first cup of coffee in the morning, to only half an hour.
Further Thoughts
- Always keep an eye on which BLAS library numpy is actually using on your machine. Use np.__config__.show() to find out.
- By the way, it also matters which BLAS implementation scipy is linked against.
- If you care about performance, whatever BLAS library you decide to use in the end should be tailored to your machine’s very hardware. If you don’t mind the trouble, go ahead and compile it yourself.
- As a rule of thumb, when your machine has a reasonably modern Intel processor, you’re probably best off using Intel MKL.
- It seems that, once again, the Anaconda distribution saves your ass by providing sane defaults, i.e. shipping with Intel MKL built in (in my case). If you’ve obtained numpy through conda, you are probably doing fine.
- There are more libraries, such as Eigen or Boost, that are worth a look. If you have an AMD processor, take a look at ACML.
Resources
How to build numpy and scipy from source with MKL
Building NumPy and SciPy with Intel MKL
NumPy and SciPy are extremely valuable tools for numerical methods in Python. Under the hood, NumPy relies on highly optimized C and Fortran routines, including a set of standardized libraries known as BLAS and LAPACK. BLAS and LAPACK are really just a standardized interface to a number of matrix routines, and there are many implementations of BLAS and LAPACK available, both free and commercial. ATLAS is the default BLAS implementation used by many Linux distributions, including Fedora. ATLAS (Automatically Tuned Linear Algebra Software) is designed to be automatically tuned for specific computer hardware at build time, and can be quite fast when this is done. However, common Linux distributions want their software to run on as many hardware systems as possible and put performance second. As a result, the default versions of ATLAS (and therefore NumPy and SciPy) that ship with these distros are tuned for the general case and are quite slow.
This tutorial describes how to build NumPy and SciPy in a way that is optimized for Intel CPUs at least as new as the Core2 architecture (or whichever architecture you choose). It also utilizes Intel’s version of BLAS and LAPACK, included in the Intel Math Kernel Library (MKL). Although MKL is a commercial product, there are a number of options for obtaining MKL free of charge, as described on this page on the Intel web site. The CSU CS department also has an installed version that you may use in /s/parsons/l/sys/intel. Building NumPy and SciPy to use MKL should improve performance significantly and allow you to take advantage of multiple CPU cores when using NumPy and SciPy.
Note: We assume below that the Intel development software is installed in /opt/intel (the default location for a system-wide install). You may need to replace this with the path to your install, possibly in your home directory, or /s/parsons/l/sys/intel for a CS department install.
Building NumPy with MKL
First, let’s focus on NumPy; we’ll need it to build SciPy later anyway. Download the latest version of NumPy from this location. Then, find a suitable location to extract the files. Here is what I do:
mkdir -p ~/src/numpy
cp numpy-1.9.1.tar.gz ~/src/numpy
cd ~/src/numpy
tar -xvf numpy-1.9.1.tar.gz
cd numpy-1.9.1
Now, we need to modify site.cfg to point NumPy to our MKL installation. First, copy the example file and open it in your favorite editor (vim, of course!):
cp site.cfg.example site.cfg
vim site.cfg
Now change the following lines.
include_dirs = /opt/intel/mkl/include
library_dirs = /opt/intel/mkl/lib/intel64
mkl_libs = mkl_def, mkl_intel_lp64, mkl_gnu_thread, mkl_core, mkl_mc3
lapack_libs = mkl_def, mkl_intel_lp64, mkl_gnu_thread, mkl_core, mkl_mc3
Note: the order of the MKL libraries above is important. Use the Intel Link Line Advisor if you are having problems or would like to use a different compiler or OpenMP library.
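For reference, these settings live in the [mkl] section of site.cfg; a minimal sketch of the whole section, assuming the default /opt/intel install location:

[mkl]
include_dirs = /opt/intel/mkl/include
library_dirs = /opt/intel/mkl/lib/intel64
mkl_libs = mkl_def, mkl_intel_lp64, mkl_gnu_thread, mkl_core, mkl_mc3
lapack_libs = mkl_def, mkl_intel_lp64, mkl_gnu_thread, mkl_core, mkl_mc3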
We’ll also need to set our LD_LIBRARY_PATH so that the MKL libraries can be found at run time. This should be added to your .bashrc (or whichever bash init scripts you use).
export LD_LIBRARY_PATH=/opt/intel/mkl/lib/intel64:${LD_LIBRARY_PATH}
Next, let’s set up some flags for gcc that will make the C routines included with NumPy a bit faster.
export CFLAGS='-fopenmp -O2 -march=core2 -ftree-vectorize'
export LDFLAGS='-lm -lpthread -lgomp'
The -march=core2 flag tells gcc that it can use features included in Core2 and newer architectures, breaking compatibility with older CPUs and AMD CPUs. This means that this version of NumPy may not work correctly on old or AMD hardware; however, nearly everything in the CS department should work fine. The -ftree-vectorize flag tells gcc to perform automatic SSE vectorization for our Core2 architecture, and -fopenmp and -lgomp enable OpenMP functionality wherever it is available. Python distutils will install our software in ~/.local by default. If, like me, you prefer to install your software somewhere else, you will also need to set up the PYTHONPATH environment variable in your .bashrc.
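For example, something along these lines (the prefix here is purely illustrative; use whatever install location you chose):

export PYTHONPATH=${HOME}/opt/python/lib/python2.7/site-packages:${PYTHONPATH}

Next, let's build it!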
# for python2.7
python setup.py build 2>&1 | less
# for python3
#python3 setup.py build 2>&1 | less
Note: If you re-build for any reason (including for python3), you will need to clean up with python setup.py clean --all. In the output that is piped to less, you should see something like this, telling you that MKL was found:
blas_mkl_info:
  FOUND:
    library_dirs = ['/opt/intel/mkl/lib/intel64']
    include_dirs = ['/opt/intel/mkl/include']
    libraries = ['mkl_def', 'mkl_intel_lp64', 'mkl_gnu_thread', 'mkl_core', 'mkl_mc3', 'pthread']
    define_macros = [('SCIPY_MKL_H', None)]
  FOUND:
    library_dirs = ['/opt/intel/mkl/lib/intel64']
    include_dirs = ['/opt/intel/mkl/include']
    libraries = ['mkl_def', 'mkl_intel_lp64', 'mkl_gnu_thread', 'mkl_core', 'mkl_mc3', 'pthread']
    define_macros = [('SCIPY_MKL_H', None)]
Next, let’s install the software:
# for python2.7 default location
python setup.py install --user 2>&1 | less
# for python3 default location
#python3 setup.py install --user 2>&1 | less
In order to verify that everything worked, check the following two things. First, verify that you are able to load the correct version of NumPy from the IPython prompt. If you start ipython, run import numpy as np, and then ?np, you should see a path like this:
/s/chopin/b/grad/idfah/.local/lib/python3.3/site-packages/numpy/__init__.py
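Equivalently, a plain Python check of which NumPy installation is being imported:

import numpy
print(numpy.__file__)  # should point into ~/.local (or your chosen prefix), not the system site-packages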
Second, ldd should show that NumPy is using the MKL BLAS:
# for python 2.7
ldd ~/.local/lib/python2.7/site-packages/numpy/core/_dotblas.so
    linux-vdso.so.1 => (0x00007fff7e7fe000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fe15e6ed000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe15e4cf000)
    libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fe15e2b8000)
    libmkl_def.so => /opt/intel/mkl/lib/intel64/libmkl_def.so (0x00007fe15ccd4000)
    libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007fe15c5b0000)
    libmkl_gnu_thread.so => /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so (0x00007fe15ba54000)
    libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007fe15a526000)
    libmkl_mc3.so => /opt/intel/mkl/lib/intel64/libmkl_mc3.so (0x00007fe15887a000)
    libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00007fe1584b4000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fe1580f6000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe15ec54000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fe157ef1000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007fe157cee000)

# for python3
ldd ~/.local/lib/python3.3/site-packages/numpy/core/_dotblas.cpython-33m.so
    linux-vdso.so.1 => (0x00007fff02dfe000)
    libm.so.6 => /lib64/libm.so.6 (0x00007ffcfd574000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffcfd356000)
    libgomp.so.1 => /lib64/libgomp.so.1 (0x00007ffcfd13f000)
    libmkl_def.so => /opt/intel/mkl/lib/intel64/libmkl_def.so (0x00007ffcfbb5b000)
    libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007ffcfb437000)
    libmkl_gnu_thread.so => /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so (0x00007ffcfa8db000)
    libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007ffcf93ad000)
    libmkl_mc3.so => /opt/intel/mkl/lib/intel64/libmkl_mc3.so (0x00007ffcf7701000)
    libpython3.3m.so.1.0 => /lib64/libpython3.3m.so.1.0 (0x00007ffcf728a000)
    libc.so.6 => /lib64/libc.so.6 (0x00007ffcf6ecc000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ffcfdada000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007ffcf6cc7000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007ffcf6ac4000)
Note: By default, MKL does not always free allocated memory buffers. For long-running experiments, this can cause MKL to use all available memory and begin swapping, even for small problem sizes. In order to prevent this memory leakage, set export MKL_DISABLE_FAST_MM=true in your .bashrc or relevant startup script.
Building SciPy to use our optimized NumPy and MKL
Before building SciPy, it is important that you follow the steps above to build the optimized NumPy, and that you keep all of the environment variables from above! SciPy relies heavily on NumPy and should use our optimized build. Once you are ready, download the latest SciPy source code from here. Again, extract the files to a suitable location.
mkdir -p ~/src/scipy
cp scipy-0.14.0.tar.gz ~/src/scipy
cd ~/src/scipy
tar -xvf scipy-0.14.0.tar.gz
cd scipy-0.14.0
Add the -shared flag for gcc. This is a workaround for now; it might not be needed in the future.
export LDFLAGS="${LDFLAGS} -shared"
Now we can build SciPy:
# python2.7
python setup.py build 2>&1 | less
# python3
#python3 setup.py build 2>&1 | less

Again, it should let you know in less that MKL was found:

blas_mkl_info:
  FOUND:
    libraries = ['mkl_def', 'mkl_intel_lp64', 'mkl_gnu_thread', 'mkl_core', 'mkl_mc3', 'pthread']
    library_dirs = ['/opt/intel/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/opt/intel/mkl/include']

Finally, install SciPy with:

# for python2.7
python setup.py install --user 2>&1 | less
# for python3
#python3 setup.py install --user 2>&1 | less
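As a final sanity check, you can inspect SciPy's build configuration from Python, just as we did for NumPy:

import scipy
scipy.__config__.show()

If the build picked up MKL, the mkl-related sections should appear in this output as well.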
Benchmarks
To demonstrate the potential performance improvement of using MKL, here are some quick IPython timings. These timings were done on Fedora 23 using an HP Z800 with 12 Xeon E5645 cores (2x6 cores at 2.4 GHz) and 96 GB of memory.
First, here are the timings for a 4096x4096 matrix multiply and a 2048x2048 SVD using the default install of numpy linked against ATLAS:
> %timeit np.random.random((4096,4096)).dot(np.random.random((4096,4096)))
1 loops, best of 3: 19.5 s per loop
> %timeit np.linalg.svd(np.random.random((2048,2048)))
1 loops, best of 3: 13 s per loop

And here are the same timings using numpy linked against MKL:

> %timeit np.random.random((4096,4096)).dot(np.random.random((4096,4096)))
1 loops, best of 3: 2.23 s per loop
> %timeit np.linalg.svd(np.random.random((2048,2048)))
1 loops, best of 3: 3.48 s per loop
Feel free to let me know if it is faster for your problems!