FPGA static timing analysis algorithm accelerated by high-efficiency heterogeneous parallelization strategy
TIAN Chunsheng, ZHAO Xiangyu, WANG Shuo, WANG Zhuoli, CAO Yongzheng, ZHOU Jing, ZHANG Yaowei, CHEN Lei
Integrated Circuits and Embedded Systems, 2026, Vol. 26, Issue (4): 14-25.
The widespread use of Field Programmable Gate Arrays (FPGAs) in high-performance computing, AI inference, and 5G communications has driven a continuous increase in design scale and timing-constraint complexity, which in turn places higher demands on the runtime efficiency of Static Timing Analysis (STA). Existing FPGA STA tools mostly rely on single-core or multi-core Central Processing Unit (CPU) architectures. Despite ongoing algorithmic refinements, they still face computational bottlenecks and inefficient memory access when processing large-scale FPGA designs. In recent years, Graphics Processing Units (GPUs), with their massively parallel computing capability, have offered new opportunities for improving FPGA STA performance. However, open problems in memory access patterns on GPU architectures, in the optimization of timing-graph loop detection, and in heterogeneous parallel acceleration strategies limit the effectiveness of GPU-accelerated methods in FPGA STA scenarios. To address these issues, this paper proposes an FPGA STA algorithm accelerated by an efficient heterogeneous parallel strategy. First, to address the non-contiguous memory access and field interleaving that cause low bandwidth utilization when traditional object-oriented data structures are used on CPU-GPU heterogeneous architectures, a structure-of-arrays (SoA) data layout strategy is proposed. Combined with data reordering and related optimizations, it reduces memory access latency and improves bandwidth utilization, providing the data foundation for a high-performance FPGA STA engine. Second, to address the low efficiency and limited robustness of timing-graph loop detection, a parallel loop detection algorithm based on color propagation is designed, which accelerates the preprocessing stage of FPGA STA. Furthermore, a task decomposition and timing-graph traversal method tailored to CPU-GPU heterogeneous parallel architectures is proposed, which accelerates core STA operations such as delay calculation, levelization, and graph propagation. Finally, experimental results on OpenCores and industrial-grade FPGA benchmarks show that, compared with a traditional CPU implementation, the proposed method achieves a runtime speedup of 3.125× to 33.333×, with overall performance surpassing that of the OpenTimer tool. This work provides a feasible path and a practical reference for efficient timing verification of large-scale FPGA designs.
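To make the abstract's terms concrete, the following is a minimal CPU-side sketch in Python of a structure-of-arrays timing graph, levelization, and level-by-level arrival-time propagation. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_soa`, `levelize`, `propagate_arrival`), the CSR-style arrays, and the single fixed delay per edge are all hypothetical. Loops are flagged here with a plain Kahn-style leftover check, which stands in for, and is not, the color-propagation algorithm the paper proposes.

```python
def build_soa(num_nodes, edges):
    """Pack (src, dst, delay) tuples into CSR-style parallel arrays (SoA).

    Each field lives in its own contiguous array, so a kernel that reads
    only delays touches only delay memory (hypothetical layout).
    """
    out_start = [0] * (num_nodes + 1)
    for src, _, _ in edges:
        out_start[src + 1] += 1
    for v in range(num_nodes):
        out_start[v + 1] += out_start[v]  # prefix sum -> edge range per node
    edge_dst = [0] * len(edges)
    edge_delay = [0.0] * len(edges)
    cursor = out_start[:-1]  # slice copy: next free slot for each node
    for src, dst, delay in edges:
        slot = cursor[src]
        edge_dst[slot] = dst
        edge_delay[slot] = delay
        cursor[src] += 1
    return out_start, edge_dst, edge_delay


def levelize(num_nodes, out_start, edge_dst):
    """Kahn-style levelization; returns (levels, stuck).

    `stuck` lists nodes that never reach in-degree zero, i.e. nodes that
    lie on a loop or downstream of one.
    """
    indegree = [0] * num_nodes
    for dst in edge_dst:
        indegree[dst] += 1
    frontier = [v for v in range(num_nodes) if indegree[v] == 0]
    levels = []
    while frontier:
        levels.append(frontier)
        next_frontier = []
        for v in frontier:
            for e in range(out_start[v], out_start[v + 1]):
                d = edge_dst[e]
                indegree[d] -= 1
                if indegree[d] == 0:
                    next_frontier.append(d)
        frontier = next_frontier
    stuck = [v for v in range(num_nodes) if indegree[v] > 0]
    return levels, stuck


def propagate_arrival(num_nodes, levels, out_start, edge_dst, edge_delay):
    """Latest-arrival propagation, one level at a time.

    Nodes inside one level have no edges between them, so a GPU version
    could process a whole level in parallel (with an atomic max, or a
    pull-style loop over fan-in, for shared destinations).
    """
    arrival = [0.0] * num_nodes
    for level in levels:
        for v in level:
            for e in range(out_start[v], out_start[v + 1]):
                d = edge_dst[e]
                candidate = arrival[v] + edge_delay[e]
                if candidate > arrival[d]:
                    arrival[d] = candidate
    return arrival


if __name__ == "__main__":
    edges = [(0, 2, 1.0), (1, 2, 2.0), (2, 3, 0.5), (1, 4, 1.5), (3, 4, 1.0)]
    out_start, edge_dst, edge_delay = build_soa(5, edges)
    levels, stuck = levelize(5, out_start, edge_dst)
    print(levels, stuck)
    print(propagate_arrival(5, levels, out_start, edge_dst, edge_delay))
```

The design point the sketch tries to show is that once nodes are grouped into levels, each level becomes a batch of independent work items over flat arrays, which is the shape of computation that maps naturally onto GPU threads.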
FPGA / static timing analysis / heterogeneous computing / parallel acceleration / electronic design automation