Design of multi-precision reconfigurable tensor computing unit

HU Xianghong; LIANG Kelong; YIN Feiyue; FENG Zhaozhang; LIN Yuanmiao; CAI Shuting; XIONG Xiaoming

doi:10.20193/j.ices2097-4191.2025.0109

Integrated Circuits and Embedded Systems, 2026, Vol. 26, Issue 3: 81-89.

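Special issue: outstanding works from the 9th National College Student Integrated Circuit Innovation and Entrepreneurship Competition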

Abstract

With the rapid development of artificial intelligence and deep learning applications, tensor computing urgently needs hardware accelerators that are both energy-efficient and multi-precision. Traditional general-purpose processors hit an energy-efficiency bottleneck on large-scale matrix multiplication, while existing dedicated accelerators often cannot flexibly support multiple data precisions and mixed-precision computing modes. This paper presents a tensor processing unit (TPU), built on a reconfigurable architecture, that supports five data precisions (INT4, INT8, FP16, BF16, FP32) and two mixed-precision modes (FP16+FP32, BF16+FP32). It efficiently performs matrix multiply-accumulate operations for three matrix shapes (m16n16k16, m32n8k16, m8n32k16). A reconfigurable computing array, dynamic data-flow control, a multi-mode buffer design, and a unified floating-point processing unit together yield high hardware reuse and significantly improved computational efficiency. Synthesized on the VCU118 FPGA platform, the design reaches 251.13 MHz and delivers a peak theoretical performance of 257.16 GOPS/GFLOPS (INT4/INT8/FP16/BF16) and 64.29 GFLOPS (FP32). It is well suited to applications such as deep learning inference, autonomous driving, and medical imaging, where both computational efficiency and flexibility are critical.
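To make the shape names and the peak figures in the abstract concrete, here is a minimal software sketch, not the authors' hardware design. It assumes that a name such as m32n8k16 means D (m×n) = A (m×k) · B (k×n) + C (m×n), which is the common convention for such names; the abstract does not spell it out. The helpers `to_fp16`, `to_fp32`, `mma_fp16_fp32`, and `ops_per_cycle` are hypothetical names introduced for illustration, and rounding the accumulator to FP32 after every addition is likewise an assumption.

```python
import struct

# Matrix shapes named in the abstract, read as (m, n, k).
# Assumed convention: D[m x n] = A[m x k] @ B[k x n] + C[m x n].
SHAPES = {
    "m16n16k16": (16, 16, 16),
    "m32n8k16": (32, 8, 16),
    "m8n32k16": (8, 32, 16),
}


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_fp16_fp32(a, b, c):
    """Mixed-precision D = A @ B + C with FP16 operands and FP32 accumulation.

    Rounding the accumulator after every add is an assumption of this
    sketch; the abstract does not describe the hardware's rounding points.
    """
    m, k, n = len(a), len(b), len(b[0])
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for p in range(k):
                # Operands are quantized to FP16; the running sum stays in FP32.
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            d[i][j] = acc
    return d


def ops_per_cycle(peak_gops: float, freq_mhz: float) -> float:
    """Operations per clock cycle implied by a peak rate and a clock frequency."""
    return peak_gops * 1e9 / (freq_mhz * 1e6)
```

With the abstract's figures, `ops_per_cycle(257.16, 251.13)` comes out at about 1024 and `ops_per_cycle(64.29, 251.13)` at about 256. If one multiply-accumulate is counted as two operations (an assumption, not stated in the abstract), that would correspond to roughly 512 and 128 multiply-accumulates per cycle. All three shapes contain the same number of multiply-accumulates, m·n·k = 4096.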

Cite this article

HU Xianghong, LIANG Kelong, YIN Feiyue, et al. Design of multi-precision reconfigurable tensor computing unit[J]. Integrated Circuits and Embedded Systems, 2026, 26(3): 81-89. https://doi.org/10.20193/j.ices2097-4191.2025.0109

CLC number: TP872 (remote control and signaling; remote control and signaling systems)
