Current Issue

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    WU Liangshun, TAO Tao, ZHANG Bin

    As neural network models become increasingly complex, the Network-on-Chip (NoC) plays a critical communication role in heterogeneous computing systems. However, traditional NoC simulation tools generally lack support for heterogeneous computing units such as matrix processing units and RISC-V programmable cores, making it difficult to meet the real-time, throughput, and energy-efficiency requirements of large-scale AI tasks. To address these challenges, this paper proposes and implements a behavior-level NoC simulation framework for heterogeneous computing. The framework features high-precision node modeling, a dynamic pipelining mechanism, a hybrid task-aware routing algorithm, and full-path visualization and debugging capabilities. The experimental results demonstrate that the proposed framework significantly outperforms traditional methods in average latency, throughput, and visualization debugging efficiency. Notably, it exhibits greater stability and scalability in scenarios involving hybrid task flows and hardware faults, providing strong support for the design and optimization of NoC in next-generation intelligent computing platforms.
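
    Below is a minimal, illustrative sketch of what a task-aware routing decision on a 2D-mesh NoC can look like; the `Packet` class, the task-class labels, and the congestion heuristic are hypothetical stand-ins, not the routing algorithm proposed in the paper.

    ```python
    # Illustrative sketch (not the paper's algorithm): task-aware route selection
    # on a 2D-mesh NoC, where latency-critical packets adaptively prefer the less
    # congested of the two minimal directions at each hop.
    from dataclasses import dataclass

    @dataclass
    class Packet:
        dst: tuple          # (x, y) destination node
        task_class: str     # "latency" or "throughput" (hypothetical labels)

    def next_hop(cur, pkt, queue_len):
        """Pick the next hop from node `cur` toward `pkt.dst`.

        `queue_len[(node, direction)]` is the output-buffer occupancy, standing in
        for the congestion information a real router would expose.
        """
        x, y = cur
        dx, dy = pkt.dst[0] - x, pkt.dst[1] - y
        candidates = []
        if dx:
            candidates.append(("E" if dx > 0 else "W", (x + (1 if dx > 0 else -1), y)))
        if dy:
            candidates.append(("N" if dy > 0 else "S", (x, y + (1 if dy > 0 else -1))))
        if not candidates:
            return None  # already at the destination
        if pkt.task_class == "latency" and len(candidates) == 2:
            # latency-critical traffic: pick the less congested minimal direction
            return min(candidates, key=lambda c: queue_len.get((cur, c[0]), 0))[1]
        # throughput traffic: deterministic XY order keeps paths simple and in-order
        return candidates[0][1]

    # Example: one routing decision at node (1, 1)
    pkt = Packet(dst=(3, 2), task_class="latency")
    print(next_hop((1, 1), pkt, {((1, 1), "E"): 4, ((1, 1), "N"): 1}))  # -> (1, 2)
    ```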

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    GUO Tao, ZHOU Haiyang, YU Yuxin, FAN Xiaochang, WANG Shuo, ZHANG Yanlong, CHEN Lei

    In response to the demand for intelligent equipment electronic systems, this article designs a neural network accelerator soft core and supporting quantization and compilation software based on the programmable logic of the "Hongxin" intelligent reconfigurable platform. It realizes unified quantization, compilation, and deployment of neural network models for the self-developed accelerator soft core, and extends the "Hongtu" embedded real-time operating system to support hardware-accelerated execution of neural networks. Experimental testing shows that the performance of the neural network accelerator soft core is comparable to that of the AMD Xilinx DPU soft core, and that the "Hongtu" embedded real-time operating system running ResNet18 and ResNet50 delivers four times higher performance than the AMD Xilinx PetaLinux environment. These results enhance the artificial intelligence capabilities of the "Hongxin" intelligent reconfigurable platform.
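
    As a rough illustration of the quantization step such a compilation flow performs, the sketch below applies per-tensor symmetric INT8 quantization to a weight tensor; it is a generic example, not the paper's toolchain, and the tensor is synthetic.

    ```python
    # Generic per-tensor symmetric INT8 weight quantization sketch (illustrative,
    # not the "Hongxin" toolchain): compute a scale, round weights to int8, and
    # check the reconstruction error.
    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Return (q, scale) such that w ~= q * scale with q in int8."""
        max_abs = np.abs(w).max()
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
        return q, scale

    w = np.random.randn(64, 64).astype(np.float32)   # a hypothetical weight tensor
    q, scale = quantize_int8(w)
    print("max abs error:", np.abs(w - q.astype(np.float32) * scale).max())
    ```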

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    LI Yawei, XIE Xinyang, SUN Yirui, ZHAO Shuangmei, SU Guobin

    To address the issues of inconsistent abstraction levels in algorithm accelerator models, complex verification environment construction, and cross-toolchain, multi-language collaboration, an agile verification platform for algorithm accelerators based on AST and DPI is designed. The C algorithm model is parsed into an abstract syntax tree (AST), and specific algorithm functions are mapped to the SystemVerilog DPI interface to generate UVM reference models and directed test vectors. The RTL code is automatically parsed to generate a UVM-based verification environment, and the reference model is compared with the actual output through the generated DPI interface to verify functional correctness. This platform effectively lowers the entry barrier for algorithm accelerator verification engineers. With automated tools, it can directly create a verification environment consistent with industrial-grade output, significantly shortening the verification cycle.
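
    The sketch below illustrates the AST-to-DPI mapping idea on a simplified C function signature; a real flow would traverse a full C AST (e.g., with a parser library), and the type-mapping table and `dpi_import` helper shown here are hypothetical.

    ```python
    # Illustrative sketch of the AST-to-DPI idea: given a (simplified) C function
    # signature, emit the matching SystemVerilog DPI-C import so the C reference
    # model can be called from a UVM environment.
    C_TO_SV = {"int": "int", "unsigned int": "int unsigned",
               "short": "shortint", "char": "byte", "void": "void"}

    def dpi_import(ret: str, name: str, params: list[tuple[str, str]]) -> str:
        """Map a C return type, name, and (type, name) parameter list to DPI-C."""
        args = ", ".join(f"input {C_TO_SV[t]} {n}" for t, n in params)
        return f'import "DPI-C" function {C_TO_SV[ret]} {name}({args});'

    # e.g. a reference model exposing: int fir_step(int sample, int coeff_id)
    print(dpi_import("int", "fir_step", [("int", "sample"), ("int", "coeff_id")]))
    # -> import "DPI-C" function int fir_step(input int sample, input int coeff_id);
    ```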

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    HUANG He, YANG Fan, PU Tao, AI Jingmei

    Addressing the core requirements of unmanned systems in terms of autonomous controllability, real-time response, and intelligent collaboration, this paper proposes a full-stack localized unmanned intelligent control system solution based on the ReWorks embedded real-time operating system and the openEuler open-source operating system. By constructing a dual-system heterogeneous architecture of "AI brain + real-time cerebellum", combined with the ROS2 communication framework and the microROS embedded extension, deep collaboration between intelligent decision-making and hard real-time control is achieved. Verification on domestic hardware platforms such as the Loongson 2K1000 and Feiteng D2000 shows that the real-time performance indicators of this solution are significantly better than those of Linux, providing a full-stack, autonomously controllable technology path for unmanned system application scenarios such as underwater robots and drones.
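
    For orientation, the sketch below shows the ROS2 publish side of such a split, with an "AI brain" node publishing commands that a microROS node on the real-time side could subscribe to; the topic, message type, and rate are illustrative assumptions, not the paper's interfaces.

    ```python
    # Minimal rclpy sketch of the "AI brain" side publishing velocity commands.
    # Topic name, message type, and rate are hypothetical; this only illustrates
    # the ROS 2 pub/sub pattern, not the paper's actual system.
    import rclpy
    from rclpy.node import Node
    from geometry_msgs.msg import Twist

    class BrainNode(Node):
        def __init__(self):
            super().__init__("ai_brain")
            self.pub = self.create_publisher(Twist, "/cmd_vel", 10)
            self.timer = self.create_timer(0.1, self.tick)   # 10 Hz command rate

        def tick(self):
            cmd = Twist()
            cmd.linear.x = 0.2       # placeholder decision output
            self.pub.publish(cmd)

    def main():
        rclpy.init()
        rclpy.spin(BrainNode())

    if __name__ == "__main__":
        main()
    ```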

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    JIN Ziyi, ZHU Zhichen, DU Jiang, CHEN Yixiang

    This paper presents the embedded deployment of the PVAC model to predict the risk of ventilator-associated complications (VAC) in patients with acute respiratory failure. The PVAC model employs the USMOTE (0.9) algorithm to address imbalanced data and integrates an AdaBoost classifier, achieving an accuracy of 71.11% and a precision of 68.89%. To overcome the limitations of existing AI medical systems that rely on cloud servers, we implemented a fully embedded deployment of the PVAC model using the PYNQ-Z2 development board. This solution offers three key advantages: offline standalone operation, hardware acceleration for improved computational efficiency, and cost-effectiveness. Experimental results demonstrate that the hardware-software co-design approach reduces the total execution time from 46.3 ms to 10.2 ms, a reduction of approximately 78%. Meanwhile, the ARM processor's workload decreases dramatically from 98% to 28%, with only a 0.2% drop in prediction accuracy, effectively preserving the model's original performance. This study not only validates the feasibility of embedding the PVAC model but also provides a reference for the localized deployment of other medical AI applications. Future work may focus on further optimizing the decision tree structure, leveraging the dynamic reconfigurability of FPGAs to support more complex models, extending the capability to process temporal signals, and developing low-power modes to extend device usage time, thereby enhancing the system's practicality and applicability.
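
    A rough software-side sketch of this kind of pipeline is shown below, using standard SMOTE (sampling_strategy=0.9) from imbalanced-learn as a stand-in for USMOTE (0.9) and synthetic data, so its numbers will not reproduce the reported accuracy.

    ```python
    # Oversample the minority class, then train an AdaBoost classifier. Standard
    # SMOTE is a stand-in for the paper's USMOTE variant; the dataset is synthetic.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Resample so the minority class reaches 0.9x the majority class size
    X_bal, y_bal = SMOTE(sampling_strategy=0.9, random_state=0).fit_resample(X_tr, y_tr)
    clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

    pred = clf.predict(X_te)
    print("accuracy:", accuracy_score(y_te, pred), "precision:", precision_score(y_te, pred))
    ```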

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    ZHANG Yi, ZHANG Yuling, YANG Xuecong

    Memory access latency remains a major bottleneck for many applications on modern processors. To optimize memory access performance, it is crucial to exploit the locality of reference in memory accesses. Data layout optimization techniques, through operations such as merging, splitting, and reorganizing data structures, can significantly improve the locality of memory accesses. This paper first provides an overview of the technological background of memory architecture and data organization involved in layout optimization. It then introduces the key issues that data layout techniques aim to address, the core ideas behind these techniques, and the main technologies upon which their implementation relies. Given the significant differences in storage and access patterns across various types of data, this paper focuses on systematically summarizing and categorizing the relevant research, comparing the strengths and weaknesses of different approaches, and analyzing promising future research directions.
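
    One of the simplest layout transformations in this space is splitting an array-of-structures into a structure-of-arrays so that loops touching a single field read contiguous memory; the sketch below illustrates the idea with arbitrary field names and machine-dependent timings.

    ```python
    # AoS vs SoA sketch: summing one field of an array-of-structures reads strided
    # memory, while the split-out (structure-of-arrays) field is contiguous.
    import numpy as np, time

    N = 2_000_000
    aos = np.zeros(N, dtype=[("x", np.float64), ("y", np.float64), ("z", np.float64)])
    soa_x = np.ascontiguousarray(aos["x"])        # split out the hot field

    t0 = time.perf_counter(); s1 = aos["x"].sum(); t1 = time.perf_counter()
    t2 = time.perf_counter(); s2 = soa_x.sum();   t3 = time.perf_counter()
    print(f"AoS field sum: {t1 - t0:.4f}s  SoA field sum: {t3 - t2:.4f}s")
    ```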

  • Research Paper
    HE Lianjie, CHEN Xiang, WANG Xijun

    The PCIe bus enables low-latency, high-bandwidth data transmission between the CPU and the FPGA, the key component being a DMA engine that allows data transfer without CPU intervention. However, the majority of current CPU+FPGA data transmission solutions are based on foreign FPGA devices from Xilinx, and there is a severe shortage of commercial IP cores for domestic FPGAs, making it challenging to port these solutions to domestic FPGA platforms. To address this limitation, this paper designs a PCIe-based DMA engine and its corresponding driver on a domestic FPGA platform. The design encapsulates the parsing of transaction layer packets within the PCIe protocol stack, thereby reducing the development complexity of PCIe-based applications on domestic FPGA devices. The experimental results demonstrate that the DMA engine achieves a read throughput of 784 MB/s and a write throughput of 800 MB/s over a PCIe 2.0 x2 link, reaching 82% and 84% of the theoretical maximum bandwidth of PCIe 2.0 x2, respectively.
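
    As a back-of-the-envelope check (an assumption-laden one, since the effective maximum depends on TLP payload size and protocol overhead not stated in the abstract), the reported figures imply the following baselines:

    ```python
    # PCIe 2.0 x2: 2 lanes x 5 GT/s with 8b/10b encoding ~= 1000 MB/s raw link rate.
    # The percentages in the abstract then imply an effective theoretical maximum
    # of roughly 950-960 MB/s after protocol overhead.
    lanes, gt_per_s, encoding = 2, 5e9, 8 / 10
    raw_MBps = lanes * gt_per_s * encoding / 8 / 1e6
    for name, measured, pct in [("read", 784, 0.82), ("write", 800, 0.84)]:
        implied_max = measured / pct
        print(f"{name}: {measured} MB/s, implied theoretical max ~{implied_max:.0f} MB/s "
              f"(raw link {raw_MBps:.0f} MB/s)")
    ```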

  • Research Paper
    XU Pengcheng, LI Guangfei, SHEN Wei, HE Xun

    With the gradual adoption of embedded systems in industrial control systems, the need to establish a data-centric digital factory to support production management, scheduling decisions, and the intelligent configuration of production resources has become increasingly prominent. Efficient and reliable data transmission plays a crucial role in this digital construction and is a prerequisite for the orderly operation of the entire embedded system. Data Distribution Service (DDS), a high-performance communication middleware, provides a specification for data sharing between different systems and has received widespread attention in recent years. However, current data distribution services for embedded platforms exhibit two gaps: embedded devices cannot directly participate as communication nodes, and time-critical messages lack real-time guarantees when network resources are contended. To address these issues, this paper proposes an optimization strategy based on software and hardware co-design, focusing on the operational characteristics of DDS. The strategy employs a dedicated SRAM for rapid loading of DDS modules, utilizes DMA to improve the energy efficiency of data interaction, and further incorporates multi-level parallel computing based on module decoupling together with a high-availability software design based on the Master-Worker pattern. Testing and verification were conducted on STM32H4, and the analysis results show that the proposed method is suitable for real-time data distribution services in networked environments. Compared to a centralized data center, the packet loss rate is reduced by 5%, and the data transmission efficiency is improved by approximately 8%.
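
    The Master-Worker pattern referenced above can be sketched generically as below; this is only an illustration of the pattern, not the paper's DDS implementation, which targets an MCU rather than desktop Python.

    ```python
    # Generic Master-Worker sketch: a master enqueues decoupled work items, a pool
    # of workers processes them, and an individual task failure does not take a
    # worker down (the high-availability idea in miniature).
    import queue, threading

    tasks, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:            # shutdown signal from the master
                break
            try:
                results.put(("ok", item, item * 2))     # placeholder processing
            except Exception as exc:    # a failed item is reported, not fatal
                results.put(("err", item, exc))
            finally:
                tasks.task_done()

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
    for w in workers: w.start()
    for i in range(8): tasks.put(i)     # master dispatches decoupled work items
    tasks.join()
    for _ in range(4): tasks.put(None)  # tell each worker to stop
    while not results.empty(): print(results.get())
    ```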

  • Research Paper
    ZHANG Jianfeng, SHAO Jingjie, LIAO Xianglong, ZENG Pin

    Traditional hardware verification relies on manual analysis of waveform signals, which is inefficient, error-prone, and makes it difficult to trace transaction-level behavior. This paper therefore proposes an auxiliary tool, based on VCD data and the PyVCD library, for verifying the CHI protocol in multi-core processors, which improves the efficiency of transaction-level waveform analysis. VCD (Value Change Dump) is a standard Verilog waveform data file format, and PyVCD is an open-source pure-Python library for handling VCD files. The method exports waveform data of specified signals from various simulation tools through TCL scripts and converts it to VCD format. The PyVCD library is then used to analyze the waveforms, implementing structured waveform parsing and transaction reconstruction algorithms that aggregate distributed Flit data from different nodes and channels into a complete sequence of transaction objects. Each transaction object is converted into an ASCII string to generate a character signal sequence, and a new VCD file is created so that transaction-level waveforms can be viewed in waveform software and the performance parameters of protocol transactions can be analyzed. A Goldmemory tool was also developed to analyze the transaction object sequences of multiple nodes in the system and automatically identify data errors and other abnormal scenarios. The platform based on this method has been deployed in multi-core processor engineering; through waveform analysis of CHI transactions, the efficiency of simulation verification has been greatly improved, and performance bottlenecks in the architecture design can be quickly located, enabling rapid iterative optimization of the architecture.
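
    The core transaction-reconstruction step can be sketched as below on already-decoded flit records; the field names, grouping key, and opcodes are hypothetical simplifications of what the paper's tool extracts from real CHI waveforms via PyVCD.

    ```python
    # Group per-cycle flit records (already extracted from a VCD dump) into
    # transactions by transaction ID, then render each transaction as an ASCII
    # label suitable for writing back out as a character-valued waveform signal.
    from collections import defaultdict

    # (time_ns, node, channel, txn_id, opcode) -- stand-ins for decoded flit fields
    flits = [
        (100, "RN0", "REQ",  7, "ReadNoSnp"),
        (140, "HN0", "DAT",  7, "CompData"),
        (150, "RN0", "RSP",  7, "CompAck"),
        (120, "RN1", "REQ", 12, "WriteNoSnpFull"),
    ]

    txns = defaultdict(list)
    for f in sorted(flits):                 # time-order, then aggregate by txn_id
        txns[f[3]].append(f)

    for txn_id, parts in txns.items():
        start, end = parts[0][0], parts[-1][0]
        label = "->".join(p[4] for p in parts)   # e.g. ReadNoSnp->CompData->CompAck
        print(f"txn {txn_id}: {label}  latency={end - start} ns")
    ```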

  • Research Paper
    PEI Jianhua, TANG Jun, ZHOU Yang

    For SoCs on the market that integrate neural network accelerators, YOLO's post-processing is typically executed on the CPU, increasing overall inference latency. This paper proposes a hardware acceleration scheme for YOLO post-processing implemented in RTL logic on FPGA. First, the algorithm execution flow is optimized to greatly reduce redundant calculations. Second, the numerical distribution of the feature maps is analyzed and restricted, and the variable ranges are reasonably bounded. Third, a RAM lookup scheme is designed to implement the mapping of the nonlinear functions. Fourth, the data-flow control logic architecture of the overall post-processing algorithm is elaborated, and practical techniques are proposed for the key functional modules. Finally, the acceleration scheme is tested on a board based on a domestic ZYNQ chip, and the performance is evaluated from multiple dimensions with the underlying causes analyzed. The experimental results show that the implementation occupies less than 3% of the logic resources, incurs about 0.5% loss in calculation accuracy, and achieves roughly a 7-fold speedup over a CPU baseline. When connected to real-time video acquisition, the FPGA system runs stably and produces accurate target detection and annotation.
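
    The RAM-lookup idea for nonlinear functions can be sketched as below, precomputing a sigmoid table over a restricted input range; the table depth, input range, and output width are illustrative choices rather than the paper's parameters.

    ```python
    # Precompute sigmoid over a restricted range into a table that an RTL module
    # would store in block RAM, replacing the runtime exponential with an
    # index-and-read.
    import numpy as np

    LO, HI, DEPTH = -8.0, 8.0, 1024                      # restricted range, 1K-entry LUT
    xs = np.linspace(LO, HI, DEPTH)
    lut = np.round(1.0 / (1.0 + np.exp(-xs)) * 255).astype(np.uint8)   # 8-bit outputs

    def sigmoid_lut(x: float) -> int:
        idx = int((min(max(x, LO), HI) - LO) / (HI - LO) * (DEPTH - 1))  # clamp + index
        return int(lut[idx])                              # one RAM read instead of exp()

    print(sigmoid_lut(0.0), sigmoid_lut(4.0))             # -> values near 127 and 250
    ```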

  • Industry Viewpoint
    YIN Pengyue, CHAI Jing, ZHANG Yalin