Special Topic on IC Design Automation and High-reliability Design
TIAN Chunsheng, ZHAO Xiangyu, WANG Shuo, WANG Zhuoli, CAO Yongzheng, ZHOU Jing, ZHANG Yaowei, CHEN Lei
The widespread integration of Field Programmable Gate Arrays (FPGAs) in high-performance computing, AI inference, and 5G communications has led to an unprecedented escalation in design scale and timing constraint complexity. These trends impose stringent demands on the runtime efficiency of Static Timing Analysis (STA). Current FPGA STA tools, which primarily run on single-core or multi-core CPU architectures, are increasingly hitting a performance wall: despite persistent algorithmic refinements, they suffer from computational bottlenecks and suboptimal memory throughput when confronted with large-scale designs. In recent years, Graphics Processing Units (GPUs), with their massively parallel computing capabilities, have provided new opportunities for improving FPGA STA performance. However, irregular memory access patterns on heterogeneous CPU-GPU architectures, inefficient timing graph loop detection, and immature heterogeneous parallel acceleration strategies continue to limit the effectiveness of current GPU-accelerated methods in FPGA STA scenarios. To address these issues, we propose an FPGA STA algorithm accelerated by an efficient heterogeneous parallel strategy. First, to address the discontinuous memory access and field interleaving caused by traditional object-oriented data structures on CPU-GPU heterogeneous architectures, a structure-of-arrays (SoA)-based data layout strategy is presented. Combined with data reordering operations, this approach effectively reduces memory access latency and improves bandwidth utilization, providing a data foundation for high-performance FPGA STA computational engines. Second, to overcome the low efficiency and poor robustness of existing timing graph loop detection, a parallel loop detection optimization algorithm based on color propagation is designed, enabling efficient acceleration in the preprocessing stage of FPGA STA.
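The abstract does not spell out the color propagation scheme, so the following is only a minimal serial sketch of a generic coloring-based strongly connected component search, used here to flag loops in a timing graph. The function name `find_loops` and the edge-list input format are assumptions made for illustration; a GPU version would run the propagation sweep over all edges in parallel rather than in a Python loop.

```python
def find_loops(num_nodes, edges):
    """Return every strongly connected component that contains a cycle.

    Generic coloring scheme (not necessarily the paper's exact algorithm):
    1. Each live node starts with its own id as its color.
    2. The maximum color is pushed forward along edges until nothing changes.
    3. A node whose color equals its id is a root; walking backward from the
       root inside its color class collects exactly the root's component.
    4. Collected components are removed and the process repeats.
    """
    alive = set(range(num_nodes))
    self_loops = {u for u, v in edges if u == v}
    loops = []
    while alive:
        live_edges = [(u, v) for u, v in edges if u in alive and v in alive]

        # Forward propagation of the maximum color to a fixed point.
        color = {v: v for v in alive}
        changed = True
        while changed:
            changed = False
            for u, v in live_edges:  # on a GPU: one thread per edge
                if color[u] > color[v]:
                    color[v] = color[u]
                    changed = True

        # Predecessor lists for the backward walk.
        preds = {v: [] for v in alive}
        for u, v in live_edges:
            preds[v].append(u)

        removed = set()
        for root in [v for v in alive if color[v] == v]:
            component = {root}
            stack = [root]
            while stack:
                v = stack.pop()
                for u in preds[v]:
                    if u not in component and color[u] == root:
                        component.add(u)
                        stack.append(u)
            # A component is a loop if it has several nodes or a self-edge.
            if len(component) > 1 or root in self_loops:
                loops.append(sorted(component))
            removed |= component
        alive -= removed
    return sorted(loops)
```

For example, `find_loops(5, [(0, 1), (1, 2), (2, 1), (2, 3), (3, 4)])` reports the single loop formed by nodes 1 and 2, which an STA engine would then need to break before levelization.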
Third, a task decomposition and timing graph traversal method tailored for CPU-GPU heterogeneous architectures is proposed, achieving efficient acceleration of core STA operations such as delay calculation, levelization, and graph propagation. Finally, experimental results on both OpenCores and industrial-grade FPGA benchmarks demonstrate that, compared with traditional CPU implementations, the proposed method achieves a runtime speedup of 3.125× to 33.333×, with overall performance surpassing that of the OpenTimer tool. This research provides a practical approach to efficient timing verification of large-scale FPGA designs.
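To make the SoA layout, levelization, and graph propagation steps concrete, here is a minimal serial sketch under stated assumptions. The timing graph is held as parallel arrays (`edge_src`, `edge_dst`, `edge_delay`) instead of per-edge objects, and all function and array names are hypothetical rather than taken from the paper. The graph is assumed to be acyclic, that is, loops have already been broken. In a heterogeneous setting, each level could be handed to the GPU as one batch, since nodes within a level do not depend on one another.

```python
def build_csr(num_nodes, edge_src):
    """Group edge indices by source node (compressed sparse row layout)."""
    offsets = [0] * (num_nodes + 1)
    for s in edge_src:
        offsets[s + 1] += 1
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]
    cursor = offsets[:-1]  # slicing copies, so offsets is left intact
    order = [0] * len(edge_src)
    for e, s in enumerate(edge_src):
        order[cursor[s]] = e
        cursor[s] += 1
    return offsets, order


def levelize(num_nodes, edge_src, edge_dst):
    """Split nodes into levels so every edge goes from a lower to a higher level."""
    offsets, order = build_csr(num_nodes, edge_src)
    indegree = [0] * num_nodes
    for d in edge_dst:
        indegree[d] += 1
    levels = []
    frontier = [v for v in range(num_nodes) if indegree[v] == 0]
    while frontier:
        levels.append(frontier)
        next_frontier = []
        for u in frontier:
            for k in range(offsets[u], offsets[u + 1]):
                v = edge_dst[order[k]]
                indegree[v] -= 1
                if indegree[v] == 0:  # all fan-in edges have been visited
                    next_frontier.append(v)
        frontier = next_frontier
    return levels


def propagate_arrival(num_nodes, edge_src, edge_dst, edge_delay):
    """Latest arrival time at each node, with primary inputs arriving at 0."""
    offsets, order = build_csr(num_nodes, edge_src)
    arrival = [0.0] * num_nodes
    for level in levelize(num_nodes, edge_src, edge_dst):
        for u in level:  # nodes in one level are independent of each other
            for k in range(offsets[u], offsets[u + 1]):
                e = order[k]
                v = edge_dst[e]
                arrival[v] = max(arrival[v], arrival[u] + edge_delay[e])
    return arrival
```

With `edge_src = [0, 0, 1, 2]`, `edge_dst = [1, 2, 3, 3]`, and `edge_delay = [1.0, 2.0, 5.0, 1.0]`, the levels are `[[0], [1, 2], [3]]` and node 3 receives an arrival time of 6.0 through the path 0, 1, 3. Keeping each field in its own contiguous array is what lets neighboring GPU threads read neighboring memory locations, which is the motivation the abstract gives for the SoA layout.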