Just Accepted

Note: The articles listed below have been peer-reviewed and accepted for publication in this journal. These articles have not yet been scheduled for a specific issue; their content and layout may undergo minor changes in the final published version. Please refer to the final published version as the definitive one. This journal has assigned each of these articles a unique and persistent DOI. You may use the DOI to cite this article directly.
  • Zong, Pengchen, Qu, Shaoru, Zhao, Wenzhe, Ren, Pengju, Xia, Tian
    Accepted: 2025-11-20
    Indirect memory accesses, prevalent in data-intensive applications such as graph processing and sparse linear algebra, exhibit irregular patterns that severely degrade cache performance because of their low spatial and temporal locality. Traditional stride-based prefetchers fail to capture such patterns, in which target addresses are dynamically computed through index arrays (e.g., x[a[i]]). This paper proposes the dynamic multi-pattern-aware prefetcher (DMP) to address these challenges. DMP introduces a lightweight shifted differential matching mechanism that autonomously identifies indirect access patterns by comparing index data sequences with target address sequences. Implemented on the open-source XuanTie C910 RISC-V processor, DMP reduces L1 data cache miss rates by 27.3% and achieves speedups of 1.07–1.22× for the Sparse Matrix-Vector Multiplication (SpMV) algorithm. This work provides a hardware-efficient solution for non-contiguous memory access patterns in modern processors.
  • Ma, Chengyu, Li, Suolan, Liu, Yinuo, Zhao, Wenzhe, Ren, Pengju, Xia, Tian
    Accepted: 2025-11-20
    To address the performance bottleneck of Sparse Matrix-Vector Multiplication (SpMV) on GPU platforms, this paper proposes an optimization algorithm based on row re-segmentation and its accompanying performance evaluation model. The method first establishes a quantitative mapping relationship between matrix row lengths and computational resource allocation. By setting dynamic thresholds, the original matrix is partitioned into long-row and short-row submatrices, which are then computed using thread-level and thread-block-level parallel strategies respectively. This approach effectively alleviates the inherent conflict between GPU SIMT execution characteristics and irregular data distribution in sparse matrices. To quantify the additional overhead introduced during preprocessing, performance penalty models for Atomic Conflict and Padding are developed, transforming extra memory access and computation into computable cost functions. Building upon these models, a parameter space search algorithm is constructed that rapidly identifies optimal preprocessing parameters within predefined parameter sets by leveraging pre-acquired hardware performance metrics and matrix non-zero element distribution information. Experimental results demonstrate that the proposed optimization algorithm outperforms traditional GPU sparse computation library cuSPARSE across multiple benchmark sparse matrix datasets, achieving performance improvements of up to 1.26× and 1.17× in specific scenarios. Furthermore, the parameter search process incurs low overhead, and the method exhibits strong generalizability, demonstrating adaptability to diverse input matrices and GPU hardware architectures.
  • Zhang, Jianfeng
    Accepted: 2025-11-20
    Traditional hardware verification relies on manual analysis of waveform signals, which is inefficient, error-prone, and makes transaction-level behavior difficult to trace. This paper therefore proposes an auxiliary tool for verifying the CHI protocol in multi-core processors, based on VCD data and the PyVCD library, that improves the efficiency of transaction waveform analysis. VCD (Value Change Dump) is an international standard Verilog waveform data file format, and PyVCD is an open-source pure-Python library used for parsing VCD files. The method uses TCL scripts to export waveform data for specified signals from various simulation tools and converts it to VCD format. The PyVCD library is then used to analyze the waveforms, implementing structured waveform parsing and transaction reconstruction algorithms that combine discrete Flits from different nodes and channels into a complete sequence of transaction objects. Once the transaction object sequence is obtained, each transaction object is converted into an ASCII string to generate a character signal sequence, and a VCD file is created for viewing transaction-level waveforms in waveform software; the performance parameters of transactions in the protocol are analyzed; and the Goldmemory tool, developed in this work, analyzes the transaction object sequences of multiple nodes in the system and automatically identifies data errors and other scenarios. A platform based on this method has been deployed in multi-core processor engineering, where waveform analysis of CHI transactions has greatly improved the efficiency of simulation verification. It can also quickly locate performance bottlenecks in the architecture design, enabling rapid iterative optimization of the architecture.
  • Liangshun Wu, Tao Tao
    Accepted: 2025-11-20
    As neural network models become increasingly complex, Network-on-Chip (NoC) plays a critical communication role in heterogeneous computing systems. However, traditional NoC simulation tools generally lack support for heterogeneous computing units such as matrix processing units and RISC-V programmable cores, making it difficult to meet the requirements of large-scale AI tasks in terms of real-time performance, throughput, and energy efficiency. To address these challenges, this paper proposes and implements a behavior-level NoC simulation framework for heterogeneous computing. The framework features high-precision node modeling, a dynamic pipelining mechanism, a hybrid task-aware routing algorithm, and full-path visualization and debugging capabilities. Experimental results demonstrate that the proposed framework significantly outperforms traditional methods in average latency, throughput, and visualization debugging efficiency. Notably, it exhibits greater stability and scalability in scenarios involving hybrid task flows and hardware faults, providing strong support for the design and optimization of NoC in next-generation intelligent computing platforms.
  • Accepted: 2025-11-20
    To address inconsistent abstraction levels in algorithm accelerator models, the complexity of building verification environments, and the difficulty of cross-toolchain and multi-language collaboration, an agile verification platform for algorithm accelerators based on AST and DPI was designed and implemented. The C program of the algorithm model is parsed into its syntax tree (AST), and specific algorithm functions are mapped to the SV DPI interface to generate UVM reference models and directed test vectors. The RTL code is automatically parsed to generate a UVM-based verification environment, and the output of the reference model is compared with the actual output through the generated DPI interface to verify functional correctness. The platform effectively lowers the barrier to entry for engineers verifying algorithm accelerators. With a set of automated tools, it can directly create a verification environment suitable for industrial use, significantly shortening the verification cycle.
  • Guo Tao, Zhou Haiyang, Fan Xiaochang, Yu Yuxin
    Accepted: 2025-11-20
    In response to the demands of intelligent equipment electronic systems, this article designs a neural network accelerator soft core and supporting quantization and compilation software based on the programmable logic in the "Hongxin" intelligent reconfigurable platform. It realizes unified quantization, compilation, and deployment of neural network models for the self-developed accelerator soft core, and extends the "Hongtu" embedded real-time operating system to support hardware-accelerated execution of neural networks. Experimental testing shows that the performance of the neural network accelerator soft core is comparable to that of the AMD Xilinx DPU soft core. The performance of the "Hongtu" embedded real-time operating system running ResNet18 and ResNet50 is four times higher than that of the AMD Xilinx PetaLinux environment, enhancing the artificial intelligence capabilities of the "Hongxin" intelligent reconfigurable platform.
  • Accepted: 2025-11-20
    This paper presents the embedded deployment of the PVAC model, designed to predict the risk of ventilator-associated complications (VAC) in patients with acute respiratory failure. The PVAC model employs the USMOTE(0.9) algorithm to address imbalanced data and integrates an AdaBoost classifier, achieving an accuracy of 71.11% and a precision of 68.89%. To overcome the limitations of existing AI medical systems that rely on cloud servers, we implemented a fully embedded deployment of the PVAC model using the PYNQ-Z2 development board. This solution offers three key advantages: offline standalone operation, hardware acceleration for improved computational efficiency, and cost-effectiveness. Experimental results demonstrate that the hardware-software co-design approach reduces the total execution time from 46.3 ms to 10.2 ms, a reduction of 78% (a speedup of about 4.5×). Meanwhile, the ARM processor's workload decreases from 98% to 28%, with only a 0.2% drop in prediction accuracy, effectively preserving the model's original performance. This study not only validates the feasibility of embedding the PVAC model but also provides a reference for the localized deployment of other medical AI applications. Future work may focus on further optimizing the decision tree structure, leveraging the dynamic reconfigurability of FPGAs to support more complex models, extending the capability to process temporal signals, and developing low-power modes to extend device usage time, thereby enhancing the system's practicality and applicability.
  • Accepted: 2025-11-20
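    To meet the core requirements of unmanned systems for independent controllability, real-time response, and intelligent collaboration, this paper proposes a full-stack domestic solution based on the Ruihua embedded real-time operating system (ReWorks). By building a dual-core heterogeneous architecture consisting of an "Ascend AI cerebrum + Ruihua real-time cerebellum" and combining the ROS2 communication framework with the microROS embedded extension, the solution achieves deep coordination between intelligent decision-making and hard real-time control. Validation on domestic hardware platforms such as the Loongson 2K1000 and Phytium D2000 shows that both the interrupt response time and the task-switching latency are ≤2 μs, and that the interrupt response time is more than 500 times better than that of Linux. The solution provides a fully independent and controllable full-stack technical path for unmanned-system applications such as underwater robots and unmanned aerial vehicles.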
  • Accepted: 2025-11-20
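    Underground oil pipelines fracture as a result of corrosion, fatigue, creep, erosion, wear-induced thinning, and other factors, causing heavy losses. To detect such fractures, a ring-array ultrasonic circuit for circumferential target detection is designed based on the synthetic aperture principle. The circuit comprises an ultrasonic excitation module, an array-probe control circuit, an ultrasonic transmit/receive circuit, and a data acquisition and storage circuit. The excitation module amplifies a continuous square-wave excitation signal and applies it across the ultrasonic probe. The array control circuit drives the ultrasonic transmission module, operating in a one-transmit, multiple-receive mode at any given moment. The ultrasonic receiving circuit handles signal amplification and detection, and the data acquisition module finally collects and stores the data and uploads it to a PC for processing. Experimental results show that, in air, the circuit can detect any object within a 180° range, at a detection distance greater than 2 m.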
  • Accepted: 2025-11-19
    The transponder is a critical component of the launch vehicle measurement system, capable of receiving and coherently relaying two C-band velocity measurement signals. To achieve high-precision coherent signal relay, the project team utilized an FPGA hardware platform and implemented methods such as improved quantization accuracy, innovative quantization approaches for relay ratios, cross-relay operation modes, and rational allocation of signal processing time. These measures enabled the design of high-precision coherent relay software for velocity measurement signals. Taking the commonly used 200 kHz Doppler frequency shift as an example, the designed velocity measurement accuracy has reached 0.0023 Hz. Additionally, unlike the independent operation modes where Channel A's main station transmits and Channel A's main/auxiliary stations receive, and Channel B's main station transmits and Channel B's main/auxiliary stations receive, the system now supports bidirectional non-common-source velocity measurement. When either Channel A or B fails to receive signals normally, the design allows any main station of Channel A or B to transmit, while the main/auxiliary stations of both channels synchronously receive. This enhances the system's velocity measurement accuracy under abnormal conditions.
  • Accepted: 2025-11-17
    In FPGA-based high-speed storage devices, the ability to cascade devices is crucial for compatibility and scalability. This paper therefore designs a cascading storage system for FPGA-based high-speed storage devices, which combines the high bandwidth of such devices with the flexible scalability of general storage devices. Experimental results show that under a "one master, multiple slaves" management mode with global clock synchronization and token polling, the cascading storage system sustains a storage bandwidth of 6.40 GB/s. In continuous write and replay tests on large-scale data, the data is written stably and verification finds no bit errors, effectively achieving transparent expansion of the storage system.
  • Accepted: 2025-11-17
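    This paper studies the page-table mechanism that QEMU uses to map the virtual machine to the host, and analyzes in depth how page-table entries are filled and how read and write instructions trigger distinct handling paths for different types of memory. By adding a new flag bit to the page attributes, and by setting and checking that bit during page-table filling and in the helper functions through which instructions read and write memory, addresses with a given attribute can be located and dispatched to a specific callback function. Following the procedure for adding peripherals in the leon3 example shipped with QEMU, the interface functions of a dynamic library are designed, including device creation, initialization, and read/write callback functions. QEMU's read/write flow and parameter-passing characteristics for MMIO peripherals are analyzed to derive the principle for locating a peripheral and the basic parameters that the callback functions require. On this basis, the exact call sites for the read/write callback functions of an off-chip MMIO peripheral in the dynamic library are designed and given. Finally, experiments analyze the correctness and speed sensitivity of the approach. They show that the method cleanly separates peripheral code from QEMU code, produces correct results, and runs at more than 97% of the speed achieved when the peripheral source code is compiled together with the QEMU source code. This work offers a useful reference for virtual machine developers and users of open-source QEMU.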
  • Accepted: 2025-11-11
    Helicopters are widely used in military, civil, and livelihood-related fields because of their superior flexibility and mobility, and they hold an irreplaceable position in specific fields. Since the advent of the helicopter, faults have emerged one after another, and fault detection has typically relied on manually checking each part one by one. To improve on this traditional approach, this paper uses an EP4CE-series FPGA chip and the ADC128S022, a 12-bit, 8-channel analog-to-digital converter (ADC) chip, to acquire helicopter vibration data in real time, and extracts feature values from the collected data by fast Fourier transform (FFT). The experimental results show that the alarm rate of helicopter fault detection is more than 99%.
  • Accepted: 2025-10-20
    Dynamic reconfiguration technology has numerous applications, but research targeting DSP chips is extremely scarce. This paper proposes a DSP-based partial dynamic reconfiguration method that takes the most frequently modified functions in application programs as the reconfiguration elements. It allocates the DSP memory space and FLASH storage space occupied by the functions to be reconfigured, and replaces the function data in this space online on demand, thereby realizing partial dynamic reconfiguration. Tests on the domestic FT-M6678 show that this method can effectively change the behavior of the reconfigurable functions without affecting the operation of other software modules. It provides practical methods and experience for the flexible use of DSPs and has good application prospects.
  • Accepted: 2025-09-24
    With the gradual adoption of embedded systems in industrial control, the need to establish a data-centric digital factory to support production management, scheduling decisions, and the intelligent configuration of production resources has become increasingly prominent. Efficient and reliable data transmission plays a crucial underlying role in this digital construction and is the prerequisite for the orderly operation of the entire embedded system. Data Distribution Service (DDS), a high-performance communication middleware, provides a specification for data sharing between different systems and has received widespread attention in recent years. However, current full DDS implementations still have problems on embedded platforms: embedded devices cannot directly join the DDS distributed network as communication nodes, and the real-time delivery of urgent messages cannot be guaranteed when network resources are contended. To address these issues, this paper proposes an optimization strategy based on software and hardware co-design, focusing on the operational characteristics of DDS. The strategy uses a dedicated SRAM for rapid loading of DDS modules and DMA technology to improve the energy efficiency of data interaction, and it includes multi-level parallel computing based on module decoupling and a high-availability software design strategy based on the Master-Works pattern. Testing and verification were conducted on STM32H4, and the analysis results show that the method designed in this paper is suitable for real-time performance analysis of data distribution services in network environments. Compared to centralized data centers, the packet loss rate is reduced by 5%, and the data transmission efficiency is improved by approximately 8%.
  • Accepted: 2025-09-24
    Currently, for SoC chips integrated with neural network processors on the market, the post-processing part of the YOLO algorithm is executed on the CPU, which increases the overall time consumption of the algorithm. This paper proposes a hardware acceleration scheme for YOLO post-processing based on FPGA chips using RTL logic. First, the algorithm execution process is optimized to greatly reduce redundant calculations. Next, the numerical distribution of feature maps is analyzed and restricted, and the variable range is reasonably defined. Subsequently, the RAM lookup process is organized to complete the mapping of nonlinear functions. Then, the data flow control logic architecture of the overall post-processing algorithm is elaborated, and some practical techniques are proposed for key functional modules. Finally, the acceleration scheme is tested on a board based on the domestic ZYNQ chip, and the performance is evaluated from multiple dimensions with the reasons analyzed. The experimental results show that the implementation occupies less than 3% of the logic resources, with a calculation accuracy loss of about 0.5%, and the calculation efficiency is 7 times higher than that of the CPU. When connected to real-time video acquisition, the FPGA system runs stably, and the target frame detection and marking are accurate.
  • Accepted: 2025-09-17
    Existing debugging functionality for application development on embedded real-time operating systems covers variable inspection, breakpoint management, and memory read/write operations, which generally satisfy users’ debugging requirements for multitasking applications. However, little attention has been paid to debugging a specific task during multitasking debugging. In particular, when multiple tasks invoke the same function, debugging becomes inconvenient for users. This paper presents a method for debugging specified tasks in multitasking programs, implemented on an embedded real-time operating system and self-developed debugger software for the “HunXin” digital signal processor. This approach significantly enhances the efficiency of multitasking application debugging and reduces application development time.
  • Accepted: 2025-09-17
    The PCIe interface bus enables low-latency, high-bandwidth data transmission between a CPU and an FPGA. The key factor is the design of a DMA engine, which allows data to be transferred without CPU involvement. However, the majority of current CPU+FPGA data transmission solutions are based on foreign FPGA devices from Xilinx, and there is a severe shortage of commercial IP cores for domestic FPGAs, making it challenging to port these solutions to domestic FPGA platforms. This paper therefore uses a domestic FPGA to design a PCIe-based DMA engine and its corresponding driver, hiding the parsing of transaction layer packets in the PCIe protocol stack and reducing the development complexity of PCIe-based applications on domestic FPGAs. Experimental results demonstrate that the DMA engine achieves a read throughput of 784 MB/s and a write throughput of 800 MB/s over a PCIe 2.0 x2 bus, reaching 82% and 84% of the theoretical maximum bandwidth of PCIe 2.0 x2, respectively.
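
    The percentages quoted in the PCIe DMA abstract above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the standard PCIe 2.0 figures of 5 GT/s per lane with 8b/10b encoding, and the function and variable names are hypothetical. The abstract does not say how its "theoretical maximum" is defined, so the code simply backs out the maximum implied by the reported numbers (roughly 950 to 960 MB/s) and compares the throughput with the 1000 MB/s post-encoding line rate.

```python
# Sanity check of the reported PCIe throughput figures.
# Assumptions (not stated in the abstract): PCIe 2.0 runs at 5 GT/s per lane
# with 8b/10b encoding, and 1 MB = 10**6 bytes.

def line_rate_mb_s(gt_per_s: float, lanes: int, encoding_efficiency: float) -> float:
    """Post-encoding link bandwidth in MB/s, before any packet overhead."""
    bits_per_s = gt_per_s * 1e9 * encoding_efficiency * lanes
    return bits_per_s / 8 / 1e6

def implied_maximum(throughput_mb_s: float, fraction: float) -> float:
    """Maximum bandwidth implied by a throughput and its reported fraction."""
    return throughput_mb_s / fraction

# The abstract's link is PCIe 2.0 x2.
LINE_RATE = line_rate_mb_s(5.0, 2, 8 / 10)

# Reported in the abstract: (throughput in MB/s, fraction of theoretical maximum).
REPORTED = {"read": (784.0, 0.82), "write": (800.0, 0.84)}

for name, (mb_s, frac) in REPORTED.items():
    print(
        f"{name}: implied maximum {implied_maximum(mb_s, frac):.0f} MB/s, "
        f"{mb_s / LINE_RATE:.1%} of the post-encoding line rate"
    )
```

    The implied maximum falls below the post-encoding line rate, which would be consistent with a definition that also accounts for packet overhead, although the abstract does not confirm this.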