A survey of data layout optimization for cache hierarchy
ZHANG Yi, ZHANG Yuling, YANG Xuecong
Integrated Circuits and Embedded Systems, 2025, Vol. 25, Issue 12: 40-51.
Memory access latency remains a major performance bottleneck for many applications on modern processors, so exploiting locality of reference is crucial to optimizing memory access performance. Data layout optimization techniques, which merge, split, and reorganize data structures, can significantly improve the locality of memory accesses. This paper first surveys the technological background of memory architecture and data organization on which layout optimization builds. It then introduces the key problems that data layout techniques aim to address, the core ideas behind these techniques, and the main technologies on which their implementations rely. Because storage and access patterns differ significantly across data types, the paper systematically summarizes and categorizes the relevant research, compares the strengths and weaknesses of different approaches, and analyzes promising directions for future work.
Keywords: cache / program locality / data layout / data structure splicing
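To make the splitting operation described in the abstract concrete, the following is a minimal C sketch (not drawn from the paper; the type and field names record_aos, record_soa, and find_key are hypothetical). It separates a frequently accessed "hot" key field from a rarely touched "cold" payload so that a scan streams only the dense key array through the cache:

    #include <stddef.h>
    #include <stdlib.h>

    /* Original array-of-structures (AoS) layout, shown for contrast:
     * with 64-byte cache lines, each 64-byte record (4 + 60 bytes)
     * occupies a whole line, so scanning keys drags every cold
     * payload through the cache as well. */
    struct record_aos {
        int  key;            /* hot: read on every lookup  */
        char payload[60];    /* cold: read only on a match */
    };

    /* Split layout: hot keys are packed contiguously, and the cold
     * payloads live in a parallel array indexed by the same i. */
    struct record_soa {
        int   *keys;          /* hot array                 */
        char (*payloads)[60]; /* cold array                */
        size_t n;
    };

    /* Linear search over the split layout; only the key array is
     * touched until a match is found. */
    static long find_key(const struct record_soa *t, int key)
    {
        for (size_t i = 0; i < t->n; i++)
            if (t->keys[i] == key)
                return (long)i; /* caller may then read payloads[i] */
        return -1;
    }

    int main(void)
    {
        enum { N = 1000 };
        struct record_soa t;
        t.keys = malloc(N * sizeof *t.keys);
        t.payloads = malloc(N * sizeof *t.payloads);
        t.n = N;
        if (!t.keys || !t.payloads)
            return 1;
        for (size_t i = 0; i < t.n; i++)
            t.keys[i] = (int)i;
        long hit = find_key(&t, 777); /* expected: index 777 */
        free(t.keys);
        free(t.payloads);
        return hit == 777 ? 0 : 1;
    }

With 64-byte cache lines, the split layout packs sixteen 4-byte keys per line, versus one key per line in the AoS layout, so a key scan touches roughly sixteen times fewer cache lines. The same parallel-array idea underlies the AoS-to-SoA transformations the survey covers.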