Special Issue of the 9th China College IC Competition
HU Xianghong, LIANG Kelong, YIN Feiyue, FENG Zhaozhang, LIN Yuanmiao, CAI Shuting, XIONG Xiaoming
With the rapid development of artificial intelligence and deep learning applications, tensor computing urgently demands high-efficiency and multi-precision computing hardware accelerators. The traditional general-purpose processors face energy efficiency bottlenecks when processing large-scale matrix multiplication operations, while existing dedicated accelerators often lack flexibility in supporting diverse data precision and hybrid computing modes. This paper presents a multi-precision and mixed-precision tensor processing unit (TPU), designed based on a reconfigurable architecture, which supports five data formats (INT4, INT8, FP16, BF16, FP32) and two hybrid modes (FP16+FP32, BF16+FP32). It is capable of efficiently performing matrix multiplication and accumulation across three different dimensions (m16n16k16, m32n8k16, m8n32k16). By incorporating a reconfigurable computing array, dynamic data flow control, multi-mode buffer design, and a unified floating-point processing unit, the design achieves high hardware reuse and significantly improved computational efficiency. Synthesized on the VCU118 FPGA platform at 251.13 MHz, it delivers a peak theoretical performance of 257.16 GOPS/GFLOPS (INT4/INT8/FP16/BF16) and 64.29 GFLOPS (FP32). This design is well-suited for applications such as deep learning inference, autonomous driving, and medical imaging, where both computational efficiency and flexibility are critical.