Accelerating the snooping and snooping response process of cache-coherent network-on-chip with multicast adaptive routing

HU Dongwei, BA Xiaohui, LIU Gengting, WANG Linan, LEI Yuejun

Integrated Circuits and Embedded Systems, 2025, Vol. 25, Issue 8: 81-90. DOI: 10.20193/j.ices2097-4191.2025.0044

Special Issue on Emerging Computing Chip Design Research


Abstract

In the cache-coherent network-on-chip (NoC) of a many-core CPU, the snooping and snooping response process (SNP process) incurs long latency. To address this, this paper proposes two techniques to accelerate the process: multicast routing and adaptive routing. Based on the requirements of these two techniques, NoC packet formats are designed for the snooping request channel (SNP REQ Ch) and the snooping response channel (SNP RESP Ch), and the NoC routers for both channels, together with an 8×8 network, are designed and implemented in VLSI. The implementation results show that, in a 22 nm process, the proposed routers occupy 85 940.3 μm² or 103 518.5 μm², and the 8×8 snooping request and snooping response network occupies 5.57 mm², a complexity that is acceptable for large-scale chips. Simulations compare the latency of the SNP process under four configurations: unicast deterministic routing, unicast adaptive routing, multicast deterministic routing, and multicast adaptive routing. The results show that, when a snooping request must snoop all 252 processor cores, multicast adaptive routing cuts the SNP process latency for a single snooping request by 45% compared with unicast deterministic routing, and the resulting latency is far shorter than a DDR/HBM access. If the outstanding technique is additionally employed at the Point of Coherency (PoC), the proposed techniques cut the SNP process latency for 32 consecutive snooping requests by 73%. The simulation results confirm the effectiveness of the proposed multicast and adaptive routing techniques.
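As a rough illustration of why multicast helps a snoop that must reach every node, the following Python sketch counts link traversals for one request under unicast and under multicast. It is not the paper's router or packet format. It assumes an 8×8 mesh topology, deterministic XY (dimension-order) routing, a source at corner node (0, 0), and one destination per remaining node; the names `xy_path` and `link_traversals` are hypothetical. It counts link usage only and does not model latency, congestion, or adaptive routing.

```python
from itertools import product

def xy_path(src, dst):
    """Directed links (node, next_node) on the XY route: X first, then Y."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def link_traversals(src, dests):
    """Return (unicast, multicast) link traversal counts for one request."""
    # Unicast: every destination gets its own packet, so shared links are
    # traversed once per destination.
    unicast = sum(len(xy_path(src, d)) for d in dests)
    # Multicast (assumed model): routers replicate the packet where routes
    # diverge, so each link in the union of the routes is traversed once.
    multicast = len({link for d in dests for link in xy_path(src, d)})
    return unicast, multicast

N = 8                       # assumed 8x8 mesh
src = (0, 0)                # assumed position of the requesting node
dests = [n for n in product(range(N), repeat=2) if n != src]
unicast, multicast = link_traversals(src, dests)
print(unicast, multicast)   # 448 63
```

Under these assumptions, unicast needs 448 link traversals while the multicast tree needs 63, one per destination node. The latency reductions reported in the abstract come from the authors' simulations and are not derived from this count.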

Key words

network-on-chip / cache coherency / adaptive routing / multicast routing

Cite this article

HU Dongwei, BA Xiaohui, LIU Gengting, et al. Accelerating the snooping & snooping response process of cache-coherent network-on-chip with multicast adaptive routing[J]. Integrated Circuits and Embedded Systems. 2025, 25(8): 81-90 https://doi.org/10.20193/j.ices2097-4191.2025.0044

CLC number: TN915

