Design of efficient Winograd convolution hardware and its quantization scheme

YAN Zheng, ZHANG Chenshuo, BAI Yichuan, DU Yuan, DU Li

Integrated Circuits and Embedded Systems, 2025, Vol. 25, Issue (8): 41-52. DOI: 10.20193/j.ices2097-4191.2025.0042
Special Issue on Emerging Computing Chip Design Research


Abstract

Convolution is a common operation in convolutional neural networks (CNNs), and its multiply-accumulate operations consume considerable power, which limits the performance of many CNN hardware accelerators. Reducing the number of multiplications in convolution is therefore one effective way to improve accelerator performance. As a fast convolution algorithm, the Winograd algorithm can eliminate up to 75% of the multiplications in convolution. However, the weights in Winograd convolution follow a markedly different distribution, so a longer quantization bit width is needed to maintain similar accuracy, which offsets the hardware savings gained from performing fewer multiplications. In this paper, we analyze this problem quantitatively and propose a new quantization scheme for Winograd convolution; the quantized Winograd computation hardware module is implemented with an accuracy loss of less than 1%. To further reduce hardware cost, we apply an approximate multiplier (AM) to Winograd convolution. Compared with a conventional convolution computation block, the Winograd block saves 27.3% of the area, and applying the approximate multiplier in the Winograd block saves 39.6% of the area, without significant performance loss.
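To make the multiplication savings concrete, the following is a minimal sketch of the textbook one-dimensional Winograd algorithm F(2,3), which computes two outputs of a 3-tap filter with 4 multiplications instead of 6. It is not the authors' hardware design: the function names are hypothetical, and the tile sizes passed to `multiplication_saving` are assumptions, since the abstract does not state which tile size the paper uses.

```python
from fractions import Fraction


def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def transform_filter(g):
    """Winograd-domain weights for F(2,3); computed once per filter."""
    half = Fraction(1, 2)
    return [
        g[0],
        half * (g[0] + g[1] + g[2]),
        half * (g[0] - g[1] + g[2]),
        g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs as direct_f23, using 4 element-wise multiplications."""
    u = transform_filter(g)  # transformed weights
    v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]  # transformed inputs
    m = [vi * ui for vi, ui in zip(v, u)]  # the only data-by-weight products
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def multiplication_saving(m, r):
    """Fraction of multiplications removed per 2D tile F(m x m, r x r)."""
    direct = m * m * r * r        # m*m outputs, r*r products each
    winograd = (m + r - 1) ** 2   # one product per transformed-tile element
    return Fraction(direct - winograd, direct)


if __name__ == "__main__":
    d, g = [1, 2, 3, 4], [1, 0, -1]
    print(direct_f23(d, g), winograd_f23(d, g))
    print(multiplication_saving(2, 3), multiplication_saving(4, 3))
```

Under these assumptions, `multiplication_saving(4, 3)` evaluates to 3/4, which matches the "up to 75%" figure for a 4 x 4 output tile with a 3 x 3 kernel. Note also that the multipliers operate on the transformed weights returned by `transform_filter`, not on the original filter values, which is why the value distribution seen by a quantizer differs from that of ordinary convolution.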

Key words

convolutional neural networks / Winograd algorithm / model quantization / approximate multiplier / hardware accelerator

Cite this article

YAN Zheng, ZHANG Chenshuo, BAI Yichuan, et al. Design of efficient Winograd convolution hardware and its quantization scheme[J]. Integrated Circuits and Embedded Systems. 2025, 25(8): 41-52 https://doi.org/10.20193/j.ices2097-4191.2025.0042
CLC number: TN492 (application-specific integrated circuits)

