1. 程式人生 > >Memory Access Patterns Are Important

Memory Access Patterns Are Important

The random heap walk produced an interesting result.  This is a our worst case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively based on memory controller and memory module latencies.  For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array resulting in ~55ns on average per iteration.  The i7-2760QM has larger load buffers, TLB caches, and Linux is running with transparent huge pages which are all working to further hide the latency.  By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try PRIME_INC = 39916801 for Nehalem.  I’d like to test this on a much larger heap with Sandy Bridge.The main take away is the more predictable the pattern of access to memory, then the better the cache sub-systems are at hiding main-memory latency.  Let’s look at these cache sub-systems in a little detail to try and understand the observed results.

Hardware Components

We have many layers of cache plus the pre-fetchers to consider for how latency gets hidden.  In this section I’ll try and cover the major components used to hide latency that our hardware and systems software friends have put in place.  We will investigate these latency hiding components and use the Linux 

perf and Lightweight Performance Countersutilities to retrieve the performance counters from our CPUs which tell how effective these components are when we execute our programs.  Performance counters are CPU specific and what I’ve used here are specific to Sandy Bridge.

Data Caches
Processors typically have 2 or 3 layers of data cache.  Each layer as we move out is progressively larger with increasing latency.  The latest Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB, 256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a 3.0GHz CPU.

Data caches are effectively hardware hash tables with a fixed number of slots for each hash value.  These slots are known as “ways”.  An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location.  Within these slots the data caches do not store words, they store cache-lines of multiple words.  For an Intel processor these cache-lines are typically 64-bytes, that is 8 words on a 64-bit machine.  This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.

Data caches are typically evicted in a LRU manner.  Caches work by using a write-back algorithm were stores need only be propagated to main-memory when a modified cache-line is evicted.  This gives rise the the interesting phenomenon that a load can cause a write-back to the outer cache layers and eventually to main-memory.

perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,626,053 L1-dcache-loads                                            
       274,255,164 L1-dcache-misses
         #   18.32% of all L1-dcache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,537,057,965 L1-dcache-loads                                            
     1,570,105,933 L1-dcache-misses
         #  102.15% of all L1-dcache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,321,888,497 L1-dcache-loads                                           
     1,780,223,433 L1-dcache-misses
         #   41.19% of all L1-dcache hits  

likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
|     L2_RQSTS_MISS     | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
|     Metric      |  core 2   |
+-----------------+-----------+
|   Runtime [s]   |  2.15481  |
|       CPI       | 0.867293  |
| L2 request rate |  0.18028  |
|  L2 miss rate   | 0.0546988 |
|  L2 miss ratio  | 0.303409  |
+-----------------+-----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
|   L3_LAT_CACHE_MISS    | 2.59059e+07 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
|     L2_RQSTS_MISS     | 1.5547e+09  |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 6.87876  |
|       CPI       | 1.10714  |
| L2 request rate | 0.22925  |
|  L2 miss rate   | 0.104502 |
|  L2 miss ratio  | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
|   L3_LAT_CACHE_MISS    | 1.72918e+08 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
|     L2_RQSTS_MISS     | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   |  17.474  |
|       CPI       |  6.4418  |
| L2 request rate | 0.71973  |
|  L2 miss rate   | 0.220838 |
|  L2 miss ratio  | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
|   L3_LAT_CACHE_MISS    | 1.34832e+09 |
+------------------------+-------------+

Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.

Translation Lookaside Buffers (TLBs)
Our programs deal with virtual memory addresses that need to be translated to physical memory addresses.  Virtual memory systems do this by mapping pages.  We need to know the offset for a given page and its size for any memory operation.  Typically page sizes are 4KB and are gradually moving to 2MB and greater.  Linux introduced Transparent Huge Pages in the 2.6.38 kernel giving us 2MB pages.  The translation of virtual memory pages to physical pages is maintained by the page table.  This translation can require multiple accesses to the page table which is a huge performance penalty.  To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache.  A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache.  By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.

perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,128,634 dTLB-loads
           310,901 dTLB-misses
              #    0.02% of all dTLB cache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,551,585,263 dTLB-loads
           340,230 dTLB-misses
              #    0.02% of all dTLB cache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,031,344,537 dTLB-loads
     1,345,807,418 dTLB-misses
              #   33.38% of all dTLB cache hits

Note: We only incur significant TLB misses when randomly walking the whole heap when huge pages are employed.

Hardware Pre-Fetchers
Hardware will try and predict the next memory access our programs will make and speculatively load that memory into fill buffers.  This is done at it simplest level by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride based access patterns, typically less than 2KB in stride length.  The tests below we are measuring the number of loads that hit a fill buffer from a hardware pre-fetch.

likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+

java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+

Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.

Memory Controllers and Row Buffers
Beyond our last level cache (LLC) sits the memory controllers that manage access to the SDRAM banks.  Memory is organised into rows and columns.  To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word.  The row is typically a page in size and loaded into a row buffer.  Even at this stage the hardware is still helping hide the latency.  A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row if possible.

Non-Uniform Memory Access (NUMA)
Systems now have memory controllers on the CPU socket.  This move to on-socket memory controllers gave an ~50ns latency reduction over existing front side bus (FSB) and external Northbridge memory controllers.  Systems with multiple sockets employ memory interconnects, QPI from Intel, which are used when one CPU wants to access memory managed by another CPU socket.  The presence of these interconnects gives rise to the non-uniform nature of server memory access.  In a 2-socket system memory may be local or 1 hop away.  On a 8-socket system memory can be up to 3 hops away, were each hop adds 20ns latency in each direction.

What does this mean for algorithms?

The difference between an L1D cache-hit, and a full miss resulting in main-memory access, is 2 orders of magnitude; i.e. <1ns vs. 65-100ns.  If algorithms randomly walk around our ever increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.

Is there anything we can do about this when designing algorithms and data-structures?  Yes there is a lot we can do.  If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster.  For example rather than using bucket and chain hash tables, like in the JDK, we can employ hash tables using open-addressing with linear-probing.  Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.

Research is advancing on algorithmic approaches that work in harmony with cache sub-systems.  One area I find fascinating is Cache Oblivious Algorithms.  The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel.  This article is a great illustration of the performance benefits that can be gained.

Conclusion

To achieve great performance it is important to have sympathy for the cache sub-systems.  We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches.  When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm.  This is not what we were taught in algorithm theory when studying computer science.  The last decade has seen some fundamental changes in technology.  For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.

One thing is certain, if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.

Update: 06-August-2012
Trying to design a random walk algorithm for all processors and memory sizes is tricky.  If I use the algorithm below then my Sandy Bridge processor is slower but the Nehalem is faster.  The point is performance will be veryunpredictable when you walk around memory in a random fashion.  I’ve also included the L3 cache counters for more detail in all the tests.

private static final long LARGE_PRIME_INC = 70368760954879L; RANDOM_HEAP_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK; } };
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     9,444,928,682 dTLB-loads
     4,371,982,327 dTLB-misses
         #   46.29% of all dTLB cache hits 

     9,390,675,639 L1-dcache-loads
     1,471,647,016 L1-dcache-misses
         #   15.67% of all L1-dcache hits  

+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09  |
|     L2_RQSTS_MISS     | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 55.0094  |
|       CPI       | 17.0801  |
| L2 request rate | 1.10108  |
|  L2 miss rate   | 0.362611 |
|  L2 miss ratio  | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
|        Event           |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS      | 2.62346e+07 |
+------------------------+-------------+