A survey of data layout optimization for cache hierarchy

ZHANG Yi, ZHANG Yuling, YANG Xuecong

Integrated Circuits and Embedded Systems ›› 2025, Vol. 25 ›› Issue (12): 40-51. DOI: 10.20193/j.ices2097-4191.2025.0089
Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application

Abstract

Memory access latency remains a major bottleneck for many applications on modern processors. Exploiting the locality of reference in memory accesses is therefore crucial for optimizing memory performance. Data layout optimization techniques, through operations such as merging, splitting, and reorganizing data structures, can significantly improve the locality of memory accesses. This paper first provides an overview of the memory-architecture and data-organization background underlying layout optimization techniques. It then introduces the key problems that data layout techniques aim to address, the core ideas behind these techniques, and the main technologies on which their implementations rely. Because storage and access patterns differ significantly across data types, this paper systematically summarizes and categorizes the relevant research, compares the strengths and weaknesses of different approaches, and analyzes promising directions for future work.
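As a concrete illustration of one of these reorganizations, the sketch below shows hot/cold structure splitting in C. The particle type, its field names, and the array size are hypothetical, invented for this exposition rather than taken from any cited work.

    #include <stddef.h>
    #include <stdio.h>

    #define N 4096

    /* Before splitting: hot position fields share 64-byte cache lines
       with cold metadata, so a position-only sweep wastes most of each
       line it fetches (sizeof(struct particle) == 64). */
    struct particle {
        double x, y, z;   /* hot: read on every timestep        */
        char   name[40];  /* cold: read only when printing logs */
    };

    /* After splitting: the hot fields of consecutive particles are
       densely packed, so every fetched line is fully used. */
    struct particle_hot  { double x, y, z; };
    struct particle_cold { char name[40]; };

    static struct particle_hot  hot[N];
    static struct particle_cold cold[N];  /* touched only on the cold path */

    /* The hot sweep now streams N*24 bytes instead of N*64. */
    double sum_x(void)
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            s += hot[i].x;
        return s;
    }

    int main(void)
    {
        printf("%f\n", sum_x());
        return 0;
    }

With 64-byte cache lines, the split layout lets the position sweep use every byte it fetches instead of a little over a third.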

Key words

cache / program locality / data layout / data structure splicing

Cite this article

ZHANG Yi, ZHANG Yuling, YANG Xuecong. A survey of data layout optimization for cache hierarchy[J]. Integrated Circuits and Embedded Systems, 2025, 25(12): 40-51. https://doi.org/10.20193/j.ices2097-4191.2025.0089

References

[1]
YE LOUIS, MIESZKO LIS, ALEXANDRA FEDOROVA. A unifying abstraction for data structure splicing[C]// Proceedings of the International Symposium on Memory Systems, 2019.
[2]
PETRANK EREZ, DROR RAWITZ. The hardness of cache conscious data placement[J]. ACM SIGPLAN Notices, 2002, 37(1):101-112.
[3]
CHILIMBI TRISHUL M, BOB DAVIDSON, JAMES R LARUS. Cache-conscious structure definition[C]// Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, 1999.
[4]
CHILIMBI TRISHUL M, MARK D HILL, JAMES R LARUS. Cache-conscious structure layout[C]// Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, 1999.
[5]
RABBAH RODRIC M, KRISHNA V PALEM. Data remapping for design space optimization of embedded memory systems[J]. ACM Transactions on Embedded Computing Systems (TECS), 2003, 2(2):186-218.
In this article, we present a novel linear time algorithm for data remapping that is (i) lightweight; (ii) fully automated; and (iii) applicable in the context of pointer-centric programming languages with dynamic memory allocation support. All previous work in this area lacks one or more of these features. We proceed to demonstrate a novel application of this algorithm as a key step in optimizing the design of an embedded memory system. Specifically, we show that by virtue of locality enhancements via data remapping, we may reduce the memory subsystem needs of an application by 50%, and hence concomitantly reduce the associated costs in terms of size, power, and dollar investment (61%). Such a reduction overcomes key hurdles in designing high-performance embedded computing solutions: memory subsystems are very desirable from a performance standpoint, but their costs have often limited their use in embedded systems. Thus, our approach offers the intriguing possibility of compilers playing a significant role in exploring and optimizing the design space of a memory subsystem for an embedded design. To this end, and in order to properly leverage the improvements afforded by a compiler optimization, we identify a range of measures for quantifying the cost impact of popular notions of locality, prefetching, regularity of memory access, and others. The proposed methodology will become increasingly important, especially as the need for application-specific embedded architectures becomes prevalent. In addition, we demonstrate the wide applicability of data remapping using several existing microprocessors, such as the Pentium and UltraSparc, where remapping achieves a performance improvement of 20% on average. Similarly, for a parametric research HPL-PD microprocessor, which characterizes the new Itanium machines, we achieve a performance improvement of 28% on average. All of our results are achieved using applications from the DIS, Olden and SPEC2000 suites of integer and floating point benchmarks.
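The field-wise layout that such remapping automates can be sketched by hand. The C fragment below is only an illustration of the resulting pooled layout, not the paper's algorithm; the list_node type, the POOL size, and the index-based links are assumptions made for the example.

    #define POOL 4096

    /* Conventional pointer-based list: each node's key sits next to a
       large payload, so a key-only traversal drags payload bytes into
       the cache for nothing. */
    struct list_node {
        int  key;
        int  next;          /* index of next node, -1 at end */
        char payload[56];
    };

    /* Remapped layout: the same field of *different* nodes is stored
       contiguously, giving a key-only traversal the spatial locality
       of an array scan. */
    static int  keys[POOL];
    static int  nexts[POOL];
    static char payloads[POOL][56];

    /* Traversal touches only keys[] and nexts[], never payloads[]. */
    int find(int head, int wanted)
    {
        for (int i = head; i != -1; i = nexts[i])
            if (keys[i] == wanted)
                return i;
        return -1;
    }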
[6]
TRUONG DAN N, FRANCOIS BODIN, ANDRÉ SEZNEC. Improving cache behavior of dynamically allocated data structures[C]// Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques. IEEE, 1998.
[7]
CHILIMBI TRISHUL M, JAMES R LARUS. Using generational garbage collection to implement cache-conscious data placement[J]. ACM SIGPLAN Notices, 1998, 34(3):37-48.
[8]
KISTLER THOMAS, MICHAEL FRANZ. Automated data-member layout of heap objects to improve memory-hierarchy performance[J]. ACM Transactions on Programming Languages and Systems (TOPLAS), 2000, 22(3):490-505.
We present and evaluate a simple yet efficient optimization technique that improves memory-hierarchy performance for pointer-centric applications by up to 24% and reduces cache misses by up to 35%. This is achieved by selecting an improved ordering for the data members of pointer-based data structures. Our optimization is applicable to all type-safe programming languages that completely abstract from physical storage layout; examples of such languages are Java and Oberon. Our technique does not involve programmers in the optimization process, but runs fully automatically, guided by dynamic profiling information that captures which paths through the program are taken and with what frequency. The algorithm first strives to cluster data members that are accessed closely after one another onto the same cache line, increasing spatial locality. Then, the data members that have been mapped to a particular cache line are ordered to minimize load latency in case of a cache miss.
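A minimal sketch of the member reordering this technique performs, using a hypothetical C struct (the field names and sizes are invented, and a 64-byte cache line is assumed):

    /* Before: x and flags are accessed together on the hot path, but
       the 64-byte buffer between them pushes flags onto a different
       cache line, costing a second miss. */
    struct obj_before {
        double x;
        char   buf[64];
        int    flags;
    };

    /* After profile-guided reordering: one line fill serves both hot
       members; the rarely used buffer follows. */
    struct obj_after {
        double x;
        int    flags;
        char   buf[64];
    };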
[9]
DOLBY JULIAN. Automatic inline allocation of objects[C]// Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation, 1997.
[10]
CALDER BRAD. Cache-conscious data placement[C]// Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[11]
GANNON DENNIS, WILLIAM JALBY, KYLE GALLIVAN. Strategies for cache and local memory management by global program transformation[C]// International Conference on Supercomputing. Berlin, Heidelberg: Springer Berlin Heidelberg, 1987.
[12]
WOLF MICHAEL E, MONICA S LAM. A data locality optimizing algorithm[C]// Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, 1991.
[13]
MCKINLEY KATHRYN S, STEVE CARR, CHAU-WEN TSENG. Improving data locality with loop transformations[J]. ACM Transactions on Programming Languages and Systems (TOPLAS), 1996, 18(4):424-453.
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this article, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments illustrate that for kernels our model and algorithm can select and achieve the best loop structure for a nest. For over 30 complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
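Loop permutation, the simplest of these compound transformations, can be illustrated with a small C kernel over a hypothetical row-major array (the array size and function names below are invented for illustration):

    #define N 1024
    static double a[N][N];

    /* Column-major traversal of a row-major array: consecutive inner
       iterations are N*8 bytes apart, so nearly every access touches
       a new cache line. */
    double sum_slow(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    /* After loop permutation: the inner loop is unit-stride, so each
       64-byte line yields eight useful elements. */
    double sum_fast(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }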
[14]
ABEL ANDREAS, JAN REINEKE. Measurement-based modeling of the cache replacement policy[C]// 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, 2013.
[15]
MAURICE CLÉMENTINE. Reverse engineering Intel last-level cache complex addressing using performance counters[C]// International Symposium on Recent Advances in Intrusion Detection. Cham: Springer International Publishing, 2015.
[16]
BASU ARKAPRAVA, MARK D HILL, MICHAEL M SWIFT. Reducing memory reference energy with opportunistic virtual caching[J]. ACM SIGARCH Computer Architecture News, 2012, 40(3):297-308.
Most modern cores perform a highly associative translation lookaside buffer (TLB) lookup on every memory access. These designs often hide the TLB lookup latency by overlapping it with L1 cache access, but this overlap does not hide the power dissipated by TLB lookups. It can even exacerbate the power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only after L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker.
[17]
BASU ARKAPRAVA. Efficient virtual memory for big memory servers[J]. ACM SIGARCH Computer Architecture News, 2013, 41(3):237-248.
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory.
[18]
ZHONG YUTAO. Array regrouping and structure splitting using whole-program reference affinity[J]. ACM SIGPLAN Notices, 2004, 39(6):255-266.
While the memory of most machines is organized as a hierarchy, program data are laid out in a uniform address space. This paper defines a model of reference affinity, which measures how close a group of data are accessed together in a reference trace. It proves that the model gives a hierarchical partition of program data. At the top is the set of all data with the weakest affinity. At the bottom is each data element with the strongest affinity. Based on the theoretical model, the paper presents k-distance analysis, a practical test for the hierarchical affinity of source-level data. When used for array regrouping and structure splitting, k-distance analysis consistently outperforms data organizations given by the programmer, compiler analysis, frequency profiling, statistical clustering, and all other methods we have tried.
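Array regrouping itself is easy to picture. In the hypothetical C sketch below, two arrays that a k-distance analysis might find to have strong reference affinity are interleaved into one; the names and sizes are invented for illustration.

    #define N 100000

    /* Separate arrays: each iteration touches two distinct cache
       lines, one in mass[] and one in vel[]. */
    static double mass[N], vel[N];

    /* Regrouped: fields with strong reference affinity now arrive on
       the same cache line, reducing the number of concurrent access
       streams and the chance of conflict misses between them. */
    struct body { double mass, vel; };
    static struct body bodies[N];

    double kinetic_energy(void)
    {
        double e = 0.0;
        for (int i = 0; i < N; i++)
            e += 0.5 * bodies[i].mass * bodies[i].vel * bodies[i].vel;
        return e;
    }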
[19]
ADHIANTO L. HPCToolkit: Tools for performance analysis of optimized parallel programs[J]. Concurrency and Computation: Practice and Experience, 2010, 22:685-701.
[20]
DRONGOWSKI P J. Instruction-based sampling: A new performance analysis technique for AMD family 10h processors[EB/OL]. [2025-11]. http://developer.amd.com/Assets/AMD_IBS_paper_EN.pdf.
[21]
INTEL Corporation. Intel 64 and IA-32 architectures software developer’s manual, Volume 3B: System programming guide, Part 2, Number 253669-032US[EB/OL]. [2025-11]. http://www.intel.com/Assets/PDF/manual/253669.pdf.
[22]
LIU XU, JOHN MELLOR-CRUMMEY. A data-centric profiler for parallel programs[C]// Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2013.
[23]
LIU XU, KAMAL SHARMA, JOHN MELLOR-CRUMMEY. ArrayTool: A lightweight profiler to guide array regrouping[C]// Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, 2014.
[24]
ROY PROBIR, XU LIU. StructSlim: A lightweight profiler to guide structure splitting[C]// 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2016.
[25]
YU CHAO. LWPTool: A lightweight profiler to guide data layout optimization[J]. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(11):2489-2502.
[26]
HUNDT ROBERT, SANDYA MANNARSWAMY, DHRUVA CHAKRABARTI. Practical structure layout optimization and advice[C]// International Symposium on Code Generation and Optimization (CGO'06). IEEE, 2006.
[27]
MOON SUNGDO. SYZYGY: A framework for scalable cross-module IPO[C]// International Symposium on Code Generation and Optimization (CGO 2004). IEEE, 2004.
[28]
CURIAL STEPHEN. MPADS: Memory-pooling-assisted data splitting[C]// Proceedings of the 7th International Symposium on Memory Management, 2008.
[29]
ZHONG YUTAO, WENTAO CHANG. Sampling-based program locality approximation[C]// Proceedings of the 7th International Symposium on Memory Management, 2008.
[30]
YAN JIANIAN, WENGUANG CHEN, WEIMIN ZHENG. PMU guided structure data-layout optimization[J]. Tsinghua Science and Technology, 2011, 16(2):145-150.
[31]
KHAN TANVIR AHMED. DMon: Efficient detection and correction of data locality problems using selective profiling[C]// 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021.
[32]
ROHWEDDER CAIO SALVADOR, JOAO P L DE CARVALHO, et al. Region-based data layout via data reuse analysis[C]// Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 2024.
[33]
MAMATHA ANANDA CHAITANYA. PreFix: Optimizing the performance of heap-intensive applications[C]// Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, 2025.
[34]
MCMICHEN TOMMY. Representing data collections in an SSA form[C]// 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2024.
[35]
YAN JIANIAN. ASLOP: A field-access affinity-based structure data layout optimizer[J]. Science China Information Sciences, 2011, 54(9):1769-1783.
[36]
EIMOURI TAEES. Using field access frequency to optimize layout of objects in the JVM[C]// Proceedings of the 31st Annual ACM Symposium on Applied Computing, 2016.
[37]
DOLBY JULIAN, ANDREW CHIEN. An evaluation of automatic object inline allocation techniques[J]. ACM SIGPLAN Notices, 1998, 33(10):1-20.
[38]
DOLBY JULIAN, ANDREW CHIEN. An automatic object inlining optimization and its evaluation[C]// Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, 2000.
[39]
CHILIMBI TRISHUL M, MARK D HILL, JAMES R LARUS. Making pointer-based data structures cache conscious[J]. Computer, 2000, 33(12):67-74.
[40]
WIMMER CHRISTIAN, HANSPETER MÖSSENBÖCK. Automatic array inlining in Java virtual machines[C]// Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2008.
[41]
LEUNG SHUN-TAK ALBERT. Array restructuring for cache locality[D]. Washington: University of Washington, 1996.
[42]
DING CHEN, KEN KENNEDY. Inter-array data regrouping[C]// International Workshop on Languages and Compilers for Parallel Computing. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999.
[43]
PUGH WILLIAM, EVAN ROSSER. Iteration space slicing for locality[C]// International Workshop on Languages and Compilers for Parallel Computing. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999.
[44]
KANDEMIR MAHMUT, J RAMANUJAM, ALOK CHOUDHARY. Improving cache locality by a combination of loop and data transformations[J]. IEEE Transactions on Computers, 1999, 48(2):159-167.
[45]
DE LA LUZ VICTOR, MAHMUT KANDEMIR. Array regrouping and its use in compiling data-intensive, embedded applications[J]. IEEE Transactions on Computers, 2004, 53(1):1-19.
[46]
DING CHEN, YUTAO ZHONG. Predicting whole-program locality through reuse distance analysis[C]// Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, 2003.
[47]
FARIA NUNO, RUI SILVA, JOAO L SOBRAL. Impact of data structure layout on performance[C]// 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. IEEE, 2013.
[48]
STRZODKA ROBERT. Abstraction for AoS and SoA layout in C++[M]// GPU Computing Gems Jade Edition. Morgan Kaufmann, 2012: 429-441.
[49]
HOMANN HOLGER, FRANCOIS LAENEN. SoAx: A generic C++ Structure of Arrays for handling particles in HPC codes[J]. Computer Physics Communications, 2018, 224:325-332.
[50]
PHARR MATT, WILLIAM R MARK. ispc: A SPMD compiler for high-performance CPU programming[C]// 2012 Innovative Parallel Computing (InPar). IEEE, 2012.
[51]
SPRINGER MATTHIAS, YAOZHU SUN, HIDEHIKO MASUHARA. Inner array inlining for structure of arrays layout[C]// Proceedings of the 5th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, 2018.
[52]
RADTKE PAWEL K, TOBIAS WEINZIERL. Annotation-guided AoS-to-SoA conversions and GPU offloading with data views in C++[J]. Concurrency and Computation: Practice and Experience, 2025, 37(21-22):e70199.
[53]
WEBER NICOLAS, MICHAEL GOESELE. Auto-Tuning Complex Array Layouts for GPUs[C]// EGPGV@EuroVis, 2014.
[54]
KOFLER KLAUS, BIAGIO COSENZA, THOMAS FAHRINGER. Automatic data layout optimizations for GPUs[C]// European Conference on Parallel Processing. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015.
[55]
LACHAIZE RENAUD, BAPTISTE LEPERS, VIVIEN QUÉMA. MemProf: A memory profiler for NUMA multicore systems[C]// 2012 USENIX Annual Technical Conference (USENIX ATC 12), 2012.