HU Xianghong, YIN Feiyue, LIANG Kelong, FENG Zhaozhang, LIN Yuanmiao, CAI Shuting, XIONG Xiaoming
Accepted: 2026-01-09
With the rapid development of artificial intelligence and deep learning applications, tensor computing increasingly demands hardware accelerators that are both energy-efficient and multi-precision. Traditional general-purpose processors face energy-efficiency bottlenecks when processing large-scale matrix multiplication, while existing dedicated accelerators often lack the flexibility to support diverse data precisions and hybrid computing modes. This paper presents a multi-precision and mixed-precision tensor processing unit (TPU) based on a reconfigurable architecture. It supports five data formats (INT4, INT8, FP16, BF16, FP32) and two hybrid modes (FP16+FP32, BF16+FP32), and efficiently performs matrix multiply-accumulate operations for three matrix dimensions (m16n16k16, m32n8k16, m8n32k16). By incorporating a reconfigurable computing array, dynamic data flow control, a multi-mode buffer design, and a unified floating-point processing unit, the design achieves high hardware reuse and significantly improved computational efficiency. Synthesized on the VCU118 FPGA platform at 251.13 MHz, it delivers a peak theoretical performance of 257.16 GOPS/GFLOPS (INT4/INT8/FP16/BF16) and 64.29 GFLOPS (FP32). The design is well suited to applications such as deep learning inference, autonomous driving, and medical imaging, where both computational efficiency and flexibility are critical.
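The reported figures can be sanity-checked with a little arithmetic. The sketch below is illustrative only and not taken from the paper: it assumes the shape names follow the usual (m, n, k) convention, where an m x k matrix is multiplied by a k x n matrix and accumulated into an m x n result. The per-cycle operation counts are inferred from the reported peak figures and clock rather than stated by the authors, and the helper names are hypothetical.

```python
# Matrix shapes named in the abstract, read as (m, n, k):
# C[m x n] += A[m x k] * B[k x n]  (assumed convention)
SHAPES = {
    "m16n16k16": (16, 16, 16),
    "m32n8k16": (32, 8, 16),
    "m8n32k16": (8, 32, 16),
}

CLOCK_MHZ = 251.13  # clock frequency reported in the abstract


def macs_per_tile(m, n, k):
    """Multiply-accumulate operations needed for one m x n x k tile."""
    return m * n * k


def implied_ops_per_cycle(peak_gops, clock_mhz=CLOCK_MHZ):
    """Operations per clock cycle implied by a peak throughput figure."""
    return peak_gops * 1000.0 / clock_mhz


def peak_gops(ops_per_cycle, clock_mhz=CLOCK_MHZ):
    """Peak throughput in GOPS for a given number of operations per cycle."""
    return ops_per_cycle * clock_mhz / 1000.0


tile_macs = {name: macs_per_tile(*shape) for name, shape in SHAPES.items()}

if __name__ == "__main__":
    print(tile_macs)                             # every shape needs 4096 MACs
    print(round(implied_ops_per_cycle(257.16)))  # 1024 (low-precision modes)
    print(round(implied_ops_per_cycle(64.29)))   # 256 (FP32)
```

All three shapes require the same 4096 multiply-accumulates per tile, which is consistent with a single array being reconfigured across them. The implied rates of about 1024 and 256 operations per cycle reproduce the reported 257.16 and 64.29 figures; if one multiply-accumulate is counted as two operations (a common convention, assumed here), that would correspond to 512 and 128 multiply-accumulates per cycle.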