Design of multi-precision reconfigurable tensor computing unit

HU Xianghong, LIANG Kelong, YIN Feiyue, FENG Zhaozhang, LIN Yuanmiao, CAI Shuting, XIONG Xiaoming

Integrated Circuits and Embedded Systems, 2026, Vol. 26, Issue 3: 81-89. DOI: 10.20193/j.ices2097-4191.2025.0109
Special Issue: Outstanding Works from the 9th National College Student Integrated Circuit Innovation and Entrepreneurship Competition


Abstract

With the rapid development of artificial intelligence and deep learning applications, tensor computing places urgent demands on energy-efficient, multi-precision hardware accelerators. Traditional general-purpose processors hit energy-efficiency bottlenecks on large-scale matrix multiplication, while existing dedicated accelerators often lack the flexibility to support multiple data precisions and mixed-precision computing modes. This paper presents a multi-precision, mixed-precision tensor processing unit (TPU) built on a reconfigurable architecture. It supports five data formats (INT4, INT8, FP16, BF16, FP32) and two mixed-precision modes (FP16+FP32, BF16+FP32), and efficiently performs matrix multiply-accumulate operations for three matrix dimensions (m16n16k16, m32n8k16, m8n32k16). Through a reconfigurable computing array, dynamic dataflow control, a multi-mode buffer design, and a unified floating-point processing unit, the design achieves a marked improvement in hardware reuse and computational efficiency. Synthesized on the VCU118 FPGA platform, it reaches 251.13 MHz and delivers a peak throughput of 257.16 GOPS/GFLOPS (INT4/INT8/FP16/BF16) and 64.29 GFLOPS (FP32). The design is well suited to scenarios with high demands on computational energy efficiency and flexibility, such as deep learning inference, autonomous driving, and medical imaging.
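As a cross-check on the figures above, the short NumPy sketch below models one tile-level multiply-accumulate (D = A × B + C) for each supported matrix dimension and reproduces the quoted peak-throughput numbers from the 251.13 MHz clock. It is illustrative only: the per-cycle MAC counts (512 for the narrow formats, 128 for FP32) are back-calculated from the reported 257.16 GOPS/GFLOPS and 64.29 GFLOPS figures rather than taken from the paper, and plain FP32 accumulation stands in for the unit's unified floating-point datapath.

import numpy as np

TILE_SHAPES = [(16, 16, 16), (32, 8, 16), (8, 32, 16)]  # (m, n, k)

def mma_reference(a, b, c):
    # Functional model of one tile: D = A x B + C, accumulated in FP32.
    return a.astype(np.float32) @ b.astype(np.float32) + c.astype(np.float32)

def peak_gops(freq_mhz, macs_per_cycle):
    # Peak rate in GOPS/GFLOPS, counting one multiply-accumulate as two operations.
    return freq_mhz * 1e6 * macs_per_cycle * 2 / 1e9

rng = np.random.default_rng(0)
for m, n, k in TILE_SHAPES:
    a = rng.standard_normal((m, k), dtype=np.float32)
    b = rng.standard_normal((k, n), dtype=np.float32)
    c = rng.standard_normal((m, n), dtype=np.float32)
    d = mma_reference(a, b, c)
    print(f"m{m}n{n}k{k}: {m * n * k} MACs per tile, output {d.shape}")

print(peak_gops(251.13, 512))  # about 257.16 (INT4/INT8/FP16/BF16, assumed 512 MACs/cycle)
print(peak_gops(251.13, 128))  # about 64.29  (FP32, assumed 128 MACs/cycle)

Note that all three supported dimensions contain the same 4096 multiply-accumulates per tile, which is consistent with a single reconfigurable array being remapped across modes rather than resized.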

Key words

tensor processing unit / multi-precision computation / reconfigurable architecture / matrix multiplication / hardware reuse

Cite this article

HU Xianghong, LIANG Kelong, YIN Feiyue, et al. Design of multi-precision reconfigurable tensor computing unit[J]. Integrated Circuits and Embedded Systems, 2026, 26(3): 81-89. https://doi.org/10.20193/j.ices2097-4191.2025.0109
CLC number: TP872 (remote control and signaling; remote control and signaling systems)

Funding

National Natural Science Foundation of China (62301165)
Guangzhou Science and Technology Plan Project (2023B01J0007)
