面向稀疏矩阵向量乘法的GPU性能建模和算法优化

马澄宇; 李锁兰; 刘一诺; 赵文哲; 任鹏举; 夏天

doi:10.20193/j.ices2097-4191.2025.0081

集成电路与嵌入式系统 ›› 0 DOI: 10.20193/j.ices2097-4191.2025.0081

面向稀疏矩阵向量乘法的GPU性能建模和算法优化

马澄宇, 李锁兰, 刘一诺, 赵文哲, 任鹏举, 夏天

作者信息 +

GPU Performance Modeling and Algorithm Optimization for SpMV

Ma, Chengyu, Li, Suolan, Liu, Yinuo, Zhao, Wenzhe, Ren, Pengju, Xia, Tian

Author information +

文章历史 +

摘要

针对GPU平台上稀疏矩阵向量乘（SpMV）操作的性能瓶颈问题，本文提出了一种基于行重分割的优化算法及其配套性能评估模型。该方法首先基于矩阵行长度与计算资源分配之间的量化映射关系，通过设定动态阈值将原始矩阵划分为长行和短行子矩阵，分别采用线程级和线程块级并行策略进行计算，从而有效缓解了GPU SIMT执行特性与稀疏矩阵非规则数据分布之间的矛盾。为量化预处理过程中引入的额外开销，文中分别建立了针对Atomic Conflict和Padding的性能损失模型，将额外的访存和计算转化为可计算的开销函数。基于上述模型，本文构建了参数空间搜索算法，通过预先获取硬件性能指标和矩阵非零元分布信息，快速在参数集合中搜索得到最优预处理参数。实验结果表明，该优化算法在多种典型稀疏矩阵数据集上均优于传统的GPU稀疏计算库cuSPARSE，在部分场景下性能提升达1.26×及1.17×。此外，参数搜索开销较低，且该方法具备良好的通用性，可适配不同的输入矩阵与GPU硬件架构。

Abstract

To address the performance bottleneck of Sparse Matrix-Vector Multiplication (SpMV) on GPU platforms, this paper proposes an optimization algorithm based on row re-segmentation and its accompanying performance evaluation model. The method first establishes a quantitative mapping relationship between matrix row lengths and computational resource allocation. By setting dynamic thresholds, the original matrix is partitioned into long-row and short-row submatrices, which are then computed using thread-level and thread-block-level parallel strategies respectively. This approach effectively alleviates the inherent conflict between GPU SIMT execution characteristics and irregular data distribution in sparse matrices. To quantify the additional overhead introduced during preprocessing, performance penalty models for Atomic Conflict and Padding are developed, transforming extra memory access and computation into computable cost functions. Building upon these models, a parameter space search algorithm is constructed that rapidly identifies optimal preprocessing parameters within predefined parameter sets by leveraging pre-acquired hardware performance metrics and matrix non-zero element distribution information. Experimental results demonstrate that the proposed optimization algorithm outperforms traditional GPU sparse computation library cuSPARSE across multiple benchmark sparse matrix datasets, achieving performance improvements of up to 1.26× and 1.17× in specific scenarios. Furthermore, the parameter search process incurs low overhead, and the method exhibits strong generalizability, demonstrating adaptability to diverse input matrices and GPU hardware architectures.

导出引用

马澄宇, 李锁兰, 刘一诺, 赵文哲, 任鹏举, 夏天. 面向稀疏矩阵向量乘法的GPU性能建模和算法优化[J]. 集成电路与嵌入式系统. 0 https://doi.org/10.20193/j.ices2097-4191.2025.0081

Ma, Chengyu, Li, Suolan, Liu, Yinuo, Zhao, Wenzhe, Ren, Pengju, Xia, Tian. GPU Performance Modeling and Algorithm Optimization for SpMV[J]. Integrated Circuits and Embedded Systems. 0 https://doi.org/10.20193/j.ices2097-4191.2025.0081