The Julia Challenge in C++
Or how to write a minimal template expression engine in C++.
Recently, some folks in the Julia community were boasting about the expressiveness of the Julia programming language compared to older languages. That’s how they started the Julia Challenge. The challenge is to create numerical functions with broadcasting for arrays, a key feature in Julia. These can be implemented very concisely in Julia thanks to the terseness of the language, while using types for maximum performance. You can read all about the challenge in this blog post by Simon Danisch:
Julia is indeed impressive work! However, as C++ developers, hearing about a strong type system, templates, and specific optimization based on the static knowledge of types hits pretty close to home, doesn’t it?
In this blog post we offer a concise solution to the Julia Challenge:
- A lazy evaluation engine for numerical algorithms
- N-dimensional containers and broadcasting, as implemented in NumPy and Julia
- The ability to seamlessly use custom types in the expression system
- SIMD-accelerated assembly output
Of course, we’ve got all these features (and much more) covered in the xtensor library.
The Ingredients
The first component for this challenge is the n-dimensional array. We will call this class qview (short for “quick view”, hehe). The qview will have three members:
- a storage member (either a pointer or a container, which makes it a view or an array, respectively)
- a shape of static size N
- the corresponding strides, also of static size N
These are the basic ingredients always found in N-dimensional containers. The shape represents the bounds of the array-view, and the strides contain the step sizes to jump from one element to the next, in each dimension.
We are using a std::array to hold the shape and the strides here, both to enable the optimizations that come from knowing the dimension at compile time and to avoid allocating memory on the heap for the shape & strides.
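To make this concrete, here is a minimal sketch of what the class skeleton could look like (the member names are illustrative, and compute_stride is sketched just below):

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Minimal sketch of the qview skeleton; member names are illustrative.
template <class Storage, std::size_t N>
class qview
{
public:
    qview(Storage storage, const std::array<std::ptrdiff_t, N>& shape)
        : m_storage(std::move(storage)), m_shape(shape), m_strides{}
    {
        // m_strides would be filled by compute_stride(shape), sketched below
    }

private:
    Storage m_storage;                        // container (array) or pointer (view)
    std::array<std::ptrdiff_t, N> m_shape;    // bounds, with dimension N fixed at compile time
    std::array<std::ptrdiff_t, N> m_strides;  // step sizes per dimension
};
```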
The compute_stride function computes row-major strides so that we can index into the array by computing the inner product of the strides with the index. The boolean expression (shape[i] != 1) is important to allow for broadcasting without adding much special-case code. We will look at this in more detail later.
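Here is a sketch of how compute_stride could be written; the (shape[i] != 1) test zeroes out the stride of broadcast dimensions:

```cpp
#include <array>
#include <cstddef>

// Row-major strides: the last dimension has stride 1, each earlier one the
// product of all later extents. A dimension of extent 1 gets stride 0, so
// indexing into it always lands on the same element (this is what makes
// broadcasting work without special cases).
template <std::size_t N>
std::array<std::ptrdiff_t, N> compute_stride(const std::array<std::ptrdiff_t, N>& shape)
{
    std::array<std::ptrdiff_t, N> strides{};
    std::ptrdiff_t data_size = 1;
    for (std::size_t i = N; i != 0; --i)
    {
        strides[i - 1] = (shape[i - 1] != 1) ? data_size : 0;
        data_size *= shape[i - 1];
    }
    return strides;
}
```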
The access operation of the qview class computes the offset into the linear memory block as the inner product of strides and index. If an index is too long, we special-case it by “shaving off” the first argument and recursively calling the compute_offset function until we hit the correct number of indices. This is, again, done to support broadcasting natively.
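A sketch of that recursive offset computation, using the stride array from above (the exact signatures in the original code may differ):

```cpp
#include <array>
#include <cstddef>

// Base case: no indices left, e.g. a 0-dimensional (scalar) view.
template <std::size_t N>
std::ptrdiff_t compute_offset(const std::array<std::ptrdiff_t, N>&)
{
    return 0;
}

// If more indices are passed than the view has dimensions, "shave off"
// the leftmost index and recurse; otherwise take the inner product of
// strides and indices. Broadcast dimensions have stride 0 and therefore
// drop out of the sum automatically.
template <std::size_t N, class Arg, class... Args>
std::ptrdiff_t compute_offset(const std::array<std::ptrdiff_t, N>& strides,
                              Arg arg, Args... args)
{
    if constexpr (sizeof...(Args) + 1 > N)
    {
        return compute_offset(strides, args...);
    }
    else
    {
        std::array<std::ptrdiff_t, sizeof...(Args) + 1> idx{
            static_cast<std::ptrdiff_t>(arg), static_cast<std::ptrdiff_t>(args)...};
        std::ptrdiff_t offset = 0;
        for (std::size_t i = 0; i < idx.size(); ++i)
        {
            offset += strides[i] * idx[i];
        }
        return offset;
    }
}
```

In this sketch, qview’s access operator can then simply return m_storage[compute_offset(m_strides, args...)].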
An example of how to use and index into the array is the following short main function:
```cpp
int main()
{
    std::vector<double> data = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    qview<std::vector<double>, 2> array(data, {3, 3});
    std::cout << array(0, 2) << std::endl;
    // ...
}
```
In order to do anything useful with this container in the numerical sense, functions are needed! And, since we’re aiming for speed, we’d like these functions to be lazy and have them perform loop fusion. To achieve that, we will need to write a simple template expression engine!
Our qfunction class consists of a functor, an argument tuple, and a computed shape. The shape is computed according to the NumPy broadcasting rules. For example, the following shapes would be broadcast results:
```
A: 4 1 5        A:   2 3
B: 4 5 1        B: 2 2 3
========        ========
   4 5 5           2 2 3
```
You can read more about the broadcasting rules in the NumPy documentation. The rules are fairly simple: Dimensions have to either match, or be one, and shapes are extended to the left with implicit ones.
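A sketch of how this shape computation could look for two shapes (broadcast_shape is an illustrative name; error handling is omitted):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// NumPy-style shape broadcasting: shapes are right-aligned, missing
// leading dimensions count as 1, and each pair of extents must either
// match or one of them must be 1.
template <std::size_t NL, std::size_t NR>
auto broadcast_shape(const std::array<std::ptrdiff_t, NL>& lhs,
                     const std::array<std::ptrdiff_t, NR>& rhs)
{
    constexpr std::size_t N = std::max(NL, NR);
    std::array<std::ptrdiff_t, N> result{};
    for (std::size_t i = 0; i < N; ++i)
    {
        // walk from the last dimension, treating missing ones as 1
        std::ptrdiff_t l = i < NL ? lhs[NL - 1 - i] : 1;
        std::ptrdiff_t r = i < NR ? rhs[NR - 1 - i] : 1;
        // a real implementation would assert(l == r || l == 1 || r == 1)
        result[N - 1 - i] = std::max(l, r);
    }
    return result;
}
```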
We need to use the std::make_index_sequence trick to be able to unpack the args tuple in the accessor. This applies the lambda to the arguments, each evaluated at the given index, and returns the result.
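Here is a sketch of that accessor, with qfunction trimmed down to the functor and the argument tuple (the computed broadcast shape member is left out):

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Sketch of the make_index_sequence unpacking trick: the accessor
// evaluates every argument in the tuple at the given index and applies
// the functor to the results.
template <class F, class... Args>
class qfunction
{
public:
    qfunction(F f, Args... args)
        : m_f(f), m_args(args...)
    {
    }

    template <class... Idxs>
    auto operator()(Idxs... idxs) const
    {
        return access_impl(std::make_index_sequence<sizeof...(Args)>{}, idxs...);
    }

private:
    template <std::size_t... I, class... Idxs>
    auto access_impl(std::index_sequence<I...>, Idxs... idxs) const
    {
        // expand over the tuple elements: m_f(arg0(idxs...), arg1(idxs...), ...)
        return m_f(std::get<I>(m_args)(idxs...)...);
    }

    F m_f;
    std::tuple<Args...> m_args;
};
```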
Now it’s possible to implement an overloaded operator that, instead of evaluating something, returns a lazy, unevaluated qfunction:
```cpp
template <class L, class R>
auto operator+(L& l, R& r)
{
    return qfunction([](const auto& lhs, const auto& rhs) { return lhs + rhs; }, l, r);
}
```
Now, if qview + qview is called, the return type is an unevaluated qfunction! Since we’ve defined a catch-all operator+ (note: this is not how you should do it in production code if you want to avoid getting bogus error messages ;) we can use this operation on our qview:
```cpp
int main()
{
    std::vector<double> data = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    qview<std::vector<double>, 2> array(data, {3, 3});
    std::cout << array(0, 2) << std::endl;

    auto func = array + array;
    std::cout << func(0, 2) << std::endl; // prints: 6
    // ...
}
```
Now we just need to find a way to evaluate this into another container, in order to assign the result to a new data structure. We use a crazy trick for that: The recursive template for loop!
Since we have static knowledge about the dimensions at compile time, we also know how many for loops we have to nest to iterate through all elements in the function.
When the recursive_for is running, it executes the lambda in the assign_func and assigns the values computed by array + array to the left-hand-side result array.
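A sketch of such a recursive for loop; the dimension index is a compile-time parameter, so the compiler unrolls the nesting (names are illustrative):

```cpp
#include <array>
#include <cstddef>

// One nested loop per dimension, with the nesting depth known at compile
// time. Once the innermost level is reached, the lambda receives the
// complete multi-index.
template <std::size_t Dim = 0, std::size_t N, class Fn>
void recursive_for(const std::array<std::ptrdiff_t, N>& shape,
                   std::array<std::ptrdiff_t, N>& idx, Fn&& fn)
{
    if constexpr (Dim == N)
    {
        fn(idx);  // all N indices are set
    }
    else
    {
        for (idx[Dim] = 0; idx[Dim] < shape[Dim]; ++idx[Dim])
        {
            recursive_for<Dim + 1>(shape, idx, fn);
        }
    }
}
```

The assignment operator can then drive this loop with a lambda that evaluates the qfunction at each multi-index and stores the result in the left-hand side.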
It’s possible to create macros for all other desired functors now, such as sin, cos, or the arithmetic operations. Additionally, it’s necessary to realize that qfunctions can be nested (because a qfunction “looks” the same as a qview), so that a + a + a returns a qfunction<plus, qfunction<plus, qview, qview>, qview>…
There are a bunch of other things necessary that I won’t mention in detail:
- Scalar arguments need to be detected and wrapped in a qview with an empty shape (a std::array<ptrdiff_t, 0>), as sketched after this list
- There is an operator=(qfunction) assignment operator that performs the recursive-for assignment explained above
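For the scalar case, a small helper could look like this, building on the qview sketch above (wrap_operand is an illustrative name, and it assumes qview can hold the scalar directly as its storage):

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

// Detect arithmetic operands and wrap them in a 0-dimensional qview with
// an empty shape, so they broadcast against any other operand; everything
// else (qview, qfunction) is passed through untouched.
template <class T>
decltype(auto) wrap_operand(T&& t)
{
    if constexpr (std::is_arithmetic_v<std::decay_t<T>>)
    {
        return qview<std::decay_t<T>, 0>(t, std::array<std::ptrdiff_t, 0>{});
    }
    else
    {
        return std::forward<T>(t);
    }
}
```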
These are all the basic ingredients, and we can “beat” the challenge now:
```cpp
qview<std::vector<double>, 2> a(d1, {1000, 1000});
qview<std::vector<double>, 1> b(d2, {1000});
double c = 0.4;
qview<std::vector<double>, 2> res(dres, {1000, 1000});

// the challenge code:
res = a + b - sin(c);
```
Benchmarking this function shows that the generated code runs 12x faster than the Julia reference implementation! (The slowness of Julia is due to a compiler bug that will hopefully be fixed soon: https://github.com/JuliaLang/julia/issues/29285#issuecomment-424011387)
Not bad a result for fairly short code. And the best part is that, thanks to operator overloading and auto lambdas (or template functors, whichever way you want to look at them), this code is entirely generic and works with user-defined types just as well!
For the challenge, we went a step further and implemented (simple) iterators so that STL algorithms can be used. To do this, we created an iterator class that holds a std::array index of the same size as the shape, increments it according to an iteration order (in our case row-major), and, upon dereferencing, unpacks the contents of the index container into the round-brackets access operator.
This allows the user to call functions such as std::accumulate(a.begin(), a.end(), 0.0), etc.
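A sketch of such an iterator (qiterator is an illustrative name, and it assumes the expression type E exposes its shape via a shape() accessor):

```cpp
#include <array>
#include <cstddef>
#include <tuple>

// Row-major iterator sketch: it carries a multi-index, bumps the last
// dimension first (carrying over like an odometer), and unpacks the
// index into the access operator on dereference.
template <class E, std::size_t N>
class qiterator
{
public:
    qiterator(const E& e, std::array<std::ptrdiff_t, N> idx)
        : m_e(&e), m_idx(idx)
    {
    }

    qiterator& operator++()
    {
        // odometer-style increment in row-major order
        for (std::size_t i = N; i != 0; --i)
        {
            if (++m_idx[i - 1] < m_e->shape()[i - 1] || i == 1)
            {
                break;  // no carry needed (or we reached the end state)
            }
            m_idx[i - 1] = 0;  // carry into the next dimension
        }
        return *this;
    }

    auto operator*() const
    {
        // unpack the multi-index into operator()
        return std::apply(*m_e, m_idx);
    }

    bool operator!=(const qiterator& rhs) const
    {
        return m_idx != rhs.m_idx;
    }

private:
    const E* m_e;
    std::array<std::ptrdiff_t, N> m_idx;
};
```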
The real deal: xtensor
Note that most of these ideas are nothing new (for me), because we’ve already implemented them in xtensor! This was just a fun exercise to see how concisely the main ingredients of xtensor can be written in the fancy new C++17 (note that xtensor is C++14, so we need to use polyfills for things like if constexpr).
But you should definitely check it out at https://github.com/QuantStack/xtensor if you’re interested in a library that’s serious about scientific computing and uses modern components such as xsimd for extremely fast math functions.
You can also follow me on twitter to stay up to date with xtensor and C++: http://twitter.com/wuoulf/