Design of efficient Winograd convolution hardware and its quantization scheme
Convolution is the most common operation in convolutional neural networks (CNNs), and its multiply-accumulate operations consume considerable power, which limits the performance of many CNN hardware accelerators. Reducing the number of multiplications in convolution is therefore an effective way to improve accelerator performance. As a fast convolution algorithm, the Winograd algorithm can eliminate up to 75% of the multiplications in convolution. However, the weights of a Winograd convolution follow a significantly different distribution, so a wider quantization bit width is needed to maintain similar accuracy, which offsets the hardware savings gained from the reduced multiplication count. This paper analyzes the problem quantitatively and proposes a new quantization scheme for Winograd convolution; the quantized Winograd computation hardware module is implemented with an accuracy loss of less than 1%. To further reduce hardware cost, an approximate multiplier (AM) is applied to Winograd convolution. Compared with the conventional convolution computation block, the Winograd block saves 27.3% of the area, and applying the approximate multiplier within the Winograd block saves 39.6% of the area without significant performance loss.
convolutional neural networks / Winograd algorithm / model quantization / approximate multiplier / hardware accelerator
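To make the multiplication savings and the weight-distribution issue concrete, the following is a minimal sketch of the textbook Winograd F(2×2, 3×3) transform in NumPy. It is not the hardware design or quantization scheme of the paper, whose tile size is not stated in the abstract. The transform matrices are the standard F(2, 3) ones; the function names `winograd_f2x2_3x3` and `direct_2x2` are hypothetical helpers introduced only for this illustration. For this tile size the element-wise product needs 16 multiplications instead of 36; the "up to 75%" figure corresponds to larger tiles such as F(4×4, 3×3), which needs 36 instead of 144.

```python
import numpy as np

# Standard transform matrices for Winograd F(2, 3).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)


def winograd_f2x2_3x3(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 kernel g."""
    U = G @ g @ G.T        # transformed weights, 4x4 (can be precomputed offline)
    V = B_T @ d @ B_T.T    # transformed input tile, 4x4
    M = U * V              # 16 element-wise multiplications
    return A_T @ M @ A_T.T  # inverse transform back to the 2x2 output


def direct_2x2(d, g):
    """Reference sliding-window result: 4 outputs x 9 taps = 36 multiplications."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.standard_normal((4, 4))
    g = rng.uniform(-1.0, 1.0, size=(3, 3))

    print("results match:", np.allclose(winograd_f2x2_3x3(d, g), direct_2x2(d, g)))

    # The transformed weights occupy a different value range than the
    # original weights, which is why reusing the same quantization bit
    # width can cost accuracy.
    U = G @ g @ G.T
    print("max |g|:", np.abs(g).max(), " max |G g G^T|:", np.abs(U).max())
```

In this sketch the transformed kernel `U` mixes sums and differences of the original weights, so its entries can exceed the original range (an all-ones kernel yields an entry of 2.25) while other entries shrink. This is one simple way to see why a quantization scheme tuned for spatial-domain weights may not transfer directly to the Winograd domain.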