Design of a multi-precision reconfigurable tensor computing unit
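Hu Xianghong (胡湘宏), Liang Kelong (梁克龙), Yin Feiyue (尹飞跃), Feng Zhaozhang (冯兆樟), Lin Yuanmiao (林元妙), Cai Shuting (蔡述庭), Xiong Xiaoming (熊晓明)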
Integrated Circuits and Embedded Systems (集成电路与嵌入式系统), 2026, Vol. 26, Issue 3: 81-89.
With the rapid development of artificial intelligence and deep learning applications, tensor computing places urgent demands on energy-efficient, multi-precision hardware accelerators. Traditional general-purpose processors hit energy-efficiency bottlenecks on large-scale matrix multiplication, while existing dedicated accelerators often cannot flexibly support multiple data precisions and mixed-precision computing modes. This paper presents a tensor processing unit (TPU), based on a reconfigurable architecture, that supports five data precisions (INT4, INT8, FP16, BF16, FP32) and two mixed-precision modes (FP16+FP32, BF16+FP32). It efficiently performs matrix multiply-accumulate operations for three matrix shapes (m16n16k16, m32n8k16, m8n32k16). A reconfigurable computing array, dynamic data-flow control, a multi-mode buffer design, and a unified floating-point processing unit together yield significant improvements in hardware reuse and computational efficiency. Synthesized on the VCU118 FPGA platform, the design reaches 251.13 MHz and delivers a peak theoretical performance of 257.16 GOPS/GFLOPS (INT4/INT8/FP16/BF16) and 64.29 GFLOPS (FP32). The design is well suited to applications that demand both computational efficiency and flexibility, such as deep learning inference, autonomous driving, and medical imaging.
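The reported peak figures can be sanity-checked with simple arithmetic. The sketch below is not taken from the paper: it assumes the common convention that one multiply-accumulate (MAC) counts as two operations and that peak throughput equals clock frequency times operations per cycle. Under those assumptions the abstract's numbers imply 512 MACs per cycle in the INT4/INT8/FP16/BF16 modes and 128 MACs per cycle in FP32, and all three tile shapes contain the same number of MACs (m × n × k = 4096). The function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the abstract's peak-throughput figures.
# Assumption (not stated in the source): one multiply-accumulate (MAC)
# counts as 2 operations, and peak = frequency * operations per cycle.

FREQ_MHZ = 251.13  # synthesized clock frequency reported in the abstract

# Peak figures reported in the abstract, in GOPS/GFLOPS.
REPORTED_PEAK = {"INT4/INT8/FP16/BF16": 257.16, "FP32": 64.29}

# Tile shapes (m, n, k) for D = A(m x k) * B(k x n) + C(m x n).
SHAPES = {
    "m16n16k16": (16, 16, 16),
    "m32n8k16": (32, 8, 16),
    "m8n32k16": (8, 32, 16),
}


def implied_macs_per_cycle(peak_gops: float, freq_mhz: float) -> int:
    """MACs per cycle implied by a peak figure, assuming 2 ops per MAC."""
    ops_per_cycle = peak_gops * 1e3 / freq_mhz  # GOPS -> MOPS, then per MHz
    return round(ops_per_cycle / 2)


def macs_per_tile(m: int, n: int, k: int) -> int:
    """Each of the m*n output elements needs k multiply-accumulates."""
    return m * n * k


if __name__ == "__main__":
    for mode, peak in REPORTED_PEAK.items():
        print(f"{mode}: {implied_macs_per_cycle(peak, FREQ_MHZ)} MACs/cycle")
    for name, shape in SHAPES.items():
        print(f"{name}: {macs_per_tile(*shape)} MACs per tile")
```

Because every shape has the same MAC count, a fixed-size compute array can in principle be kept equally busy in all three configurations, which is consistent with the abstract's emphasis on hardware reuse. The actual array organization is described in the paper itself, not here.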
tensor processing unit / multi-precision computation / reconfigurable architecture / matrix multiplication / hardware reuse