In recent years, with the widespread deployment of large models (such as GPT, LLaMA, and DeepSeek), the compute and energy-efficiency demands of the inference stage have become increasingly prominent. Although traditional GPU solutions provide high throughput, they face challenges in power consumption, real-time performance, and cost. With their customizable architecture, deterministic low latency, and high energy efficiency, FPGAs have become an important alternative for deploying large-model inference. This paper systematically reviews the network structures of large models and the techniques for implementing large-model inference on FPGAs, covering three major directions: hardware architecture adaptation, algorithm-hardware co-optimization, and system-level challenges. At the hardware level, the focus is on computing-unit design and memory-hierarchy optimization strategies; at the algorithm level, key techniques such as model compression, dynamic quantization, and compiler optimization are analyzed; at the system level, challenges such as multi-FPGA scaling, thermal management, and emerging compute-in-memory architectures are discussed. In addition, this paper summarizes the limitations of the current FPGA inference ecosystem (such as immature toolchains) and looks ahead to future trends, including chiplet-based heterogeneous integration, the fusion of photonic computing, and the establishment of a standardized evaluation system. The findings show that the architectural flexibility of FPGAs gives them a unique advantage in efficient large-model inference, but interdisciplinary collaboration is still needed to bring the technology into practice.
Compared with custom-designed chips, Field-Programmable Gate Arrays (FPGAs) support flexible hardware reconfiguration and offer shorter design cycles and lower development costs. They are widely used in fields such as communications, data centers, radar, and aerospace. FPGA architecture design aims to create highly programmable chips while minimizing the area and performance overhead introduced by reconfigurability. With the continuous evolution of application demands and process technology, FPGA architecture design is entering a new phase. This article briefly describes the basic FPGA architecture and its evaluation, summarizes the latest developments in novel FPGA architectures and circuit design techniques, and discusses their technical challenges and development trends.
With the rapid advancement of information technology and artificial intelligence, the increasingly complex functions of IoT terminal devices have led to significant security threats, given their limited hardware resources. To address this, this paper proposes a dual-mode configurable software PUF (Physical Unclonable Function) design based on the DSP IP core. The approach exploits the timing-violation behavior of sampling registers and the combinational-logic delay characteristics within the DSP IP cores of FPGAs. First, the internal structure of the DSP IP cores in the Xilinx Artix-7 FPGA is analyzed, and the clock-cycle range for normal data transmission is determined from their combinational-logic delay information and timing constraints. Next, two distinct operating modes are configured according to the required challenge bit length, and an overclocked clock is applied to induce abnormal computation results through timing violations in the sampling registers. Finally, a hash algorithm and a parity check compress the abnormal data of varying bit lengths into a 1-bit PUF response. This design eliminates the need for additional bias-extraction circuits and allows flexible configuration of two different challenge bit lengths for the software PUF without modifying the hardware structure. Experimental results demonstrate that both operating modes achieve a reliability above 98%, with excellent uniqueness and resistance to machine-learning attacks, validating the proposed scheme's feasibility and its advantages in both security and practicality.
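The final compression step (hash, then parity, to reach a single response bit) can be sketched as follows. This is a minimal illustration only: the paper does not specify which hash algorithm is used, so SHA-256 here is an assumption, and the function name is hypothetical.

```python
import hashlib

def compress_to_response(abnormal_data: bytes) -> int:
    """Compress abnormal sampled data of arbitrary length into a 1-bit
    PUF response: hash to a fixed width, then reduce by parity.
    (SHA-256 is an assumed stand-in for the paper's unspecified hash.)"""
    digest = hashlib.sha256(abnormal_data).digest()  # fixed 256-bit output
    parity = 0
    for byte in digest:
        parity ^= byte        # XOR-fold all digest bytes into one byte
    parity ^= parity >> 4     # fold the remaining 8 bits down to 1
    parity ^= parity >> 2
    parity ^= parity >> 1
    return parity & 1
```

Because both the hash and the parity fold are deterministic, the same abnormal data always yields the same response bit, which is the property the reliability measurements in the abstract depend on.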
Real-time simulation of new-type power systems places higher demands on CPU-FPGA heterogeneous computing and multi-FPGA distributed computing, in which communication efficiency can become a bottleneck. Given the limitations of Gigabit Ethernet in bandwidth and real-time performance, this paper proposes an FPGA-based lightweight 10 GbE high-bandwidth, low-latency interface. The PHY is built on GT (gigabit transceiver) primitives to achieve low latency and high reliability. In the UDP stack, alternating buffering and priority queuing are adopted to improve data throughput and smooth instantaneous load. On-board test results show that the design achieves low hardware resource consumption, a maximum transmission bandwidth of 9.70 Gb/s, an average transmission latency of 0.45 μs, and stable, interference-free interaction between protocol layers, providing efficient communication support for power-system simulation and other applications.
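The two queuing ideas in the UDP stack, alternating (ping-pong) buffers so that filling and draining can overlap, and priority insertion so urgent frames bypass bulk traffic, can be modeled in software. This is a hypothetical behavioral sketch, not the paper's RTL; the class and method names are invented for illustration.

```python
from collections import deque

class AlternatingPriorityBuffer:
    """Software model of a ping-pong buffer pair with priority queuing:
    one bank is filled while the other is drained, and urgent frames
    are placed at the head of the bank being filled."""

    def __init__(self):
        self.banks = [deque(), deque()]  # ping-pong buffer pair
        self.write_bank = 0              # index of the bank being filled

    def push(self, frame, urgent=False):
        bank = self.banks[self.write_bank]
        if urgent:
            bank.appendleft(frame)       # high priority: jump the queue
        else:
            bank.append(frame)

    def swap_and_drain(self):
        """Swap banks and drain the bank that was just being written,
        so new arrivals land in the other bank meanwhile."""
        drained = self.banks[self.write_bank]
        self.write_bank ^= 1             # writer moves to the other bank
        out = list(drained)
        drained.clear()
        return out
```

In hardware the "drain" side would be the transmit datapath running concurrently with reception; the swap corresponds to the alternation that balances instantaneous load.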
To address the high resource overhead and low efficiency of current FPGA configuration-bitstream decryption and authentication, this paper proposes the GMAC_GF32 authentication algorithm based on multiplication over the finite field GF(2^32). Combined with AES encryption in CTR mode, we design and implement an efficient, highly secure decryption-and-authentication method for FPGA configuration bitstreams. The method employs a four-stage pipelined design for the AES256_CTR decryption module, aligning each decryption cycle with the time required to transmit 128 bits of data and thereby maximizing the FPGA's decryption throughput. Additionally, each pipeline stage uses sixteen S-boxes operating in parallel, enhancing resistance to power side-channel attacks. The authentication module extends the existing verification code to 32 bits through GF(2^32) operations, effectively mitigating the inefficiency of serial verification-code computation and improving clock utilization; it further enhances security by incorporating built-in polynomial functions to prevent the loading of malicious bitstreams. Experimental validation on an FPGA prototype board demonstrates that the proposed pipelined decryption approach optimizes the AES256_CTR algorithm, compressing the decryption process to four clock cycles. The authentication method significantly reduces the additional authentication data volume and hidden time cost while maintaining security strength, achieving a 96.5% reduction in the area consumed by the authentication algorithm and thus no noticeable increase in the overall decryption-authentication circuit area. The proposed method is well suited to FPGA chip-design scenarios requiring high performance and strong security.
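The core primitive, carry-less multiplication in GF(2^32) used GHASH-style to accumulate a 32-bit tag, can be sketched as below. The reduction polynomial and the final keystream masking are placeholders: the paper's built-in polynomial functions and exact tag construction are not specified in the abstract, and the function names are invented for illustration.

```python
# Placeholder reduction polynomial x^32 + x^7 + x^3 + x^2 + 1
# (low 32 bits only; the x^32 term is implicit).
POLY = 0x8D

def gf32_mul(a: int, b: int) -> int:
    """Multiply two 32-bit values as polynomials over GF(2),
    reduced modulo the degree-32 polynomial above."""
    acc = 0
    for i in range(32):
        if (b >> i) & 1:
            acc ^= a << i                  # carry-less partial product
    full = (1 << 32) | POLY                # polynomial with x^32 term
    for i in range(62, 31, -1):            # reduce degrees >= 32
        if (acc >> i) & 1:
            acc ^= full << (i - 32)
    return acc & 0xFFFFFFFF

def gmac32_tag(blocks, h: int, keystream_word: int) -> int:
    """GHASH-style accumulation of 32-bit blocks under hash key h,
    masked with one CTR keystream word (construction assumed)."""
    tag = 0
    for blk in blocks:
        tag = gf32_mul(tag ^ blk, h)       # absorb block, multiply by key
    return tag ^ keystream_word
```

The field multiplication is the part that maps naturally to FPGA logic: the partial-product XOR tree and reduction are pure combinational XOR networks, which is why a 32-bit tag can keep pace with the 128-bit-per-cycle decryption pipeline.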
To address the serial-computation bottleneck of the MCUs and DSPs widely used in permanent magnet synchronous motor control, as well as the insufficient dynamic accuracy and long development cycles of complex vector-control algorithms, a vector-control technique for Permanent Magnet Synchronous Motors (PMSM) based on a domestic FPGA is proposed. Using modular design methods built on a Hardware Description Language (HDL) and Electronic Design Automation (EDA) tools, the dual closed-loop feedforward PI control strategy, the Space Vector Pulse Width Modulation (SVPWM) algorithm, and key underlying modules such as coordinate transformation and encoder feedback are designed independently, and logic and timing simulations of the key functional modules are carried out in ModelSim. Finally, a PMSM vector-control hardware system based on the domestic FPGA is constructed. SVPWM waveform tests, dynamic-accuracy tests under multiple inputs such as step and square-wave signals, and chip performance analysis verify the effectiveness of the proposed vector-control design for PMSM on the domestic FPGA.
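The coordinate-transformation modules mentioned above implement the standard Clarke and Park transforms of field-oriented control. A minimal floating-point reference of that math (amplitude-invariant convention assumed; the FPGA design would realize this in fixed-point HDL) might look like:

```python
import math

def clarke(ia: float, ib: float, ic: float):
    """abc -> stationary alpha/beta frame (amplitude-invariant form)."""
    alpha = (2.0 * ia - ib - ic) / 3.0
    beta = (ib - ic) / math.sqrt(3.0)
    return alpha, beta

def park(alpha: float, beta: float, theta: float):
    """alpha/beta -> rotating d/q frame at electrical angle theta
    (theta typically comes from the encoder-feedback module)."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q
```

For a balanced three-phase current aligned with theta, the d/q outputs are constant, which is what lets the dual PI loops regulate flux and torque with simple DC setpoints.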
Aiming at the excessive resource overhead, high memory consumption, and low routing efficiency in the routing of large-scale FPGAs, a resource-friendly coarse-grained parallel routing method tailored to large-scale FPGAs is proposed. First, a non-intrusive data-optimization technique is proposed to reduce the resource overhead and memory consumption of the routing-resource graph, addressing the memory-explosion problem caused by the growing scale of FPGAs and providing the data foundation for the routing method. Second, adaptive load balancing and high-fanout net partitioning are introduced to tackle the low parallelism of coarse-grained parallel routing, thereby improving overall routing efficiency. Experimental results show that the proposed method achieves a 3.18× runtime speedup while reducing resource and memory consumption by 90%, without compromising quality metrics such as wirelength and critical-path delay.
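To make the load-balancing idea concrete, here is an illustrative sketch (not the paper's algorithm) of one classic way to spread routing work across workers: assign nets to the currently least-loaded worker, most expensive first (the longest-processing-time heuristic). All names and the cost model are hypothetical.

```python
import heapq

def balance_nets(net_costs: dict, num_workers: int):
    """net_costs: net name -> estimated routing cost.
    Returns a list of (total_cost, [net names]) per worker."""
    # Min-heap of (accumulated cost, worker index); least-loaded on top.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_workers)]
    # Place expensive nets first (classic LPT greedy rule).
    for net, cost in sorted(net_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        bins[w].append(net)
        heapq.heappush(heap, (load + cost, w))
    totals = [sum(net_costs[n] for n in b) for b in bins]
    return list(zip(totals, bins))
```

High-fanout net partitioning complements this: splitting a many-sink net into sub-nets yields more, smaller work items, so a greedy balancer like the one above has finer granularity to even out per-worker load.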
The optical flow method constructs a dense motion-field representation by analyzing pixel displacements between frames, quantifying the motion direction and velocity of objects in the scene with sub-pixel accuracy; it is a core technology for applications such as embodied perception, intelligent sensing in the low-altitude economy, and localization and navigation. However, dense optical-flow algorithms face high computational complexity, and their multi-layer pyramid structure and inter-layer data dependencies lead to inefficient memory access and idle compute resources, which together limit real-time, efficient deployment of the algorithm at the edge. To solve this problem, this paper proposes a real-time, efficient FPGA hardware-acceleration scheme for the dense pyramidal Lucas-Kanade (LK) optical-flow algorithm, based on a co-design strategy spanning algorithm, architecture, and circuit. The scheme improves algorithm accuracy and hardware friendliness through batched bilinear interpolation and temporal-gradient generation, improves hardware parallelism through a multi-layer folded pyramid design, and improves the memory-access efficiency of the pyramid downsampling process through a three-stage segmented architecture, significantly raising the energy efficiency and real-time performance of dense LK optical-flow computation. Measurements on the AMD KV260 platform show that the hardware accelerator processes 102 times faster than a high-performance CPU and achieves 62 f/s real-time processing at 752×480 resolution, with an average endpoint error (AEE) of 0.522 pixel and an average angular error (AAE) of 0.325°, providing a high-accuracy, low-latency hardware-accelerated solution for highly dynamic visual-perception scenes.
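The computational core that the accelerator replicates per pixel is the single-window LK step: build a 2×2 system from spatial and temporal gradients and solve it for one flow vector. A minimal pure-Python sketch of that step (illustration only; the paper's pipelined multi-level pyramid design goes far beyond this, and the function signature is invented):

```python
def lk_flow(I1, I2, xs, ys):
    """Estimate one (u, v) flow vector over the window points xs x ys.
    I1, I2: functions (x, y) -> intensity of the two frames.
    Solves the normal equations  G * [u, v]^T = b  of Lucas-Kanade."""
    gxx = gxy = gyy = bx = by = 0.0
    for y in ys:
        for x in xs:
            ix = (I1(x + 1, y) - I1(x - 1, y)) / 2.0  # spatial gradient x
            iy = (I1(x, y + 1) - I1(x, y - 1)) / 2.0  # spatial gradient y
            it = I2(x, y) - I1(x, y)                  # temporal gradient
            gxx += ix * ix; gxy += ix * iy; gyy += iy * iy
            bx -= ix * it;  by -= iy * it
    det = gxx * gyy - gxy * gxy                       # invert the 2x2 system
    u = (gyy * bx - gxy * by) / det
    v = (gxx * by - gxy * bx) / det
    return u, v
```

Every term here is a multiply-accumulate over a sliding window, which is why the algorithm maps so well to DSP-rich FPGA fabric; the pyramid layers in the paper repeat this step coarse-to-fine to handle displacements larger than one pixel.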