To address the performance bottleneck of Sparse Matrix-Vector Multiplication (SpMV) on GPU platforms, this paper proposes an optimization algorithm based on row re-segmentation, together with an accompanying performance evaluation model. The method first establishes a quantitative mapping between matrix row lengths and computational resource allocation. By setting dynamic thresholds, the original matrix is partitioned into long-row and short-row submatrices, which are then computed using thread-level and thread-block-level parallel strategies, respectively. This approach effectively alleviates the inherent conflict between GPU SIMT execution characteristics and the irregular data distribution of sparse matrices. To quantify the additional overhead introduced during preprocessing, performance penalty models for Atomic Conflict and Padding are developed, transforming the extra memory accesses and computation into computable cost functions. Building upon these models, a parameter space search algorithm is constructed that rapidly identifies optimal preprocessing parameters within predefined parameter sets by leveraging pre-acquired hardware performance metrics and the distribution of non-zero elements in the matrix. Experimental results demonstrate that the proposed optimization algorithm outperforms the conventional GPU sparse computation library cuSPARSE across multiple benchmark sparse matrix datasets, achieving performance improvements of up to 1.26× and 1.17× in specific scenarios. Furthermore, the parameter search process incurs low overhead, and the method generalizes well, adapting to diverse input matrices and GPU hardware architectures.
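The threshold-based row partitioning described in this abstract can be illustrated with a minimal CPU-side sketch. It assumes a CSR layout and a single fixed threshold; the function names (`split_rows_by_length`, `spmv_rows`, `spmv_split`) are hypothetical, and the two loops merely stand in for the two GPU kernels, whose actual design and threshold selection are not given here.

```python
def split_rows_by_length(row_ptr, threshold):
    """Partition CSR row indices into (short_rows, long_rows).

    A row is 'long' when its non-zero count exceeds `threshold`
    (an assumed rule; the paper's dynamic thresholds are not specified here).
    """
    short_rows, long_rows = [], []
    for r in range(len(row_ptr) - 1):
        nnz = row_ptr[r + 1] - row_ptr[r]
        if nnz > threshold:
            long_rows.append(r)
        else:
            short_rows.append(r)
    return short_rows, long_rows


def spmv_rows(row_ptr, col_idx, vals, x, rows, y):
    """Compute y[r] = sum(A[r, :] * x) for the selected rows only."""
    for r in rows:
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[r] = acc


def spmv_split(row_ptr, col_idx, vals, x, threshold):
    """SpMV that handles the two row groups in separate passes.

    On a GPU each pass would be a separate kernel with its own
    parallel strategy; here both passes are plain loops.
    """
    short_rows, long_rows = split_rows_by_length(row_ptr, threshold)
    y = [0.0] * (len(row_ptr) - 1)
    spmv_rows(row_ptr, col_idx, vals, x, short_rows, y)
    spmv_rows(row_ptr, col_idx, vals, x, long_rows, y)
    return y


# Example matrix (3 x 4):
#   [1 0 2 0]
#   [0 3 0 0]
#   [4 5 6 7]
row_ptr = [0, 2, 3, 7]
col_idx = [0, 2, 1, 0, 1, 2, 3]
vals = [1, 2, 3, 4, 5, 6, 7]
x = [1, 1, 1, 1]
y = spmv_split(row_ptr, col_idx, vals, x, threshold=2)
```

Because the two row groups are disjoint, the passes write to different entries of `y` and need no synchronization in this sketch.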
In FPGA-based high-speed storage devices, the ability to cascade devices is crucial for compatibility and scalability. This paper therefore designs a cascaded storage system for FPGA-based high-speed storage devices that combines the high bandwidth of such devices with the flexible scalability of general-purpose storage devices. Experimental results show that, under a "one master, multiple slaves" management mode with global clock synchronization and token polling, the cascaded storage system sustains a storage bandwidth of 6.40 GB/s. In continuous write and replay tests on large-scale data, the data is written stably and verification finds no bit errors, effectively achieving transparent expansion of the storage system.
Helicopters are widely used in military, civilian, and public sectors due to their superior flexibility and maneuverability, and they hold an irreplaceable position in specific fields. To address the high false alarm rate of traditional helicopter fault detection, this study uses an EP4CE-series FPGA chip and the ADC128S022, a 12-bit, 8-channel analog-to-digital converter, to acquire helicopter vibration data in real time. The collected data is processed using the Fast Fourier Transform (FFT), and feature values are extracted for analysis. Experimental results demonstrate that the system achieves an accuracy of over 99% in helicopter fault detection and alarming.
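The FFT-based feature extraction step can be sketched as follows. This is an illustrative software model only, not the FPGA implementation: it uses a textbook recursive radix-2 FFT, and the choice of "dominant frequency and its amplitude" as the feature, as well as the function names, are assumptions made for the example.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate_hz):
    """Return (frequency_hz, amplitude) of the strongest non-DC bin.

    Only bins 1 .. n/2 - 1 are searched, since the upper half of the
    spectrum mirrors the lower half for real-valued input.
    """
    n = len(samples)
    spectrum = fft(samples)
    k = max(range(1, n // 2), key=lambda i: abs(spectrum[i]))
    return k * sample_rate_hz / n, 2 * abs(spectrum[k]) / n


# Synthetic vibration signal: an 8 Hz sine sampled at 64 Hz for one second.
signal = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
freq_hz, amplitude = dominant_frequency(signal, sample_rate_hz=64)
```

A detector built on such features would compare them against limits derived from healthy-state data; those limits are not stated in the abstract and are therefore left out here.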
When an ARINC653 partitioned operating system and its applications run on the MPC750 processor hardware simulator provided by QEMU 5.1.0, partition applications fail to execute correctly. To solve this problem, this paper analyzes the cause of the fault, studies the related techniques, and investigates the problematic code. Drawing on the source code of the PPC simulator provided by QEMU, the MPC750 processor documentation, and the relevant ARINC653 operating system code, the defect in the QEMU code is located by analyzing the error messages printed by the operating system, observing changes to the simulator's memory state, and testing the settings of the MMU-related status registers. As a result, the partition application can start and run normally in the environment formed by the MPC750 processor hardware simulator and the ARINC653 operating system.
This paper presents a control and management platform for the Airborne Integrated Communication System (AICS), addressing its urgent need for efficient data processing and resource management under complex multi-network, multi-bus, and multi-task conditions. Built on the independently designed FT-2000/4 high-performance processor, the design follows a generalized, miniaturized approach and enhances external communication interface capabilities through an FPGA and protocol conversion bridges. It runs the Tianmai 3 embedded operating system and implements an object-oriented, layered software design. Test validation and engineering application show that the platform efficiently handles system control and management functions over the SRIO network and fully meets real-time and reliability requirements. It demonstrates excellent applicability and promotion value on airborne embedded platforms.
This paper investigates the page table mechanism for address translation between the QEMU virtual machine and the host, providing an in-depth analysis of how page tables are filled and how memory read/write instructions trigger distinct processing flows for different memory types. By introducing a new flag bit into the page attributes, setting it during page table population, and checking it in the memory access helper functions, the method recognizes addresses with specific attributes and invokes the corresponding callback functions. With reference to the peripheral addition process in the QEMU leon3 example, interface functions for dynamic libraries are designed, including device creation, initialization, and read/write callbacks. The read/write process and parameter-passing characteristics of QEMU for MMIO peripherals are analyzed to clarify how a peripheral is located and which basic parameters the callback functions require. On this basis, the precise call sites for the read/write callback functions of off-chip MMIO peripheral dynamic libraries are designed and presented. Experiments verify the correctness and performance sensitivity of the proposed approach. Results demonstrate that the method effectively separates peripheral code from the QEMU source code, ensures correct execution, and achieves over 97% of the performance obtained by compiling the peripheral source code directly into QEMU. This study offers valuable insights for virtual machine developers and users of open-source QEMU.
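The flag-bit idea can be illustrated with a toy model. The sketch below is not QEMU code and uses no QEMU APIs: `ToyAddressSpace`, `FLAG_EXTERNAL_MMIO`, and the callback signatures are all hypothetical names chosen for the example. It only shows the general pattern of setting an attribute bit when a page entry is filled and checking that bit in the load/store helpers to decide whether to call a peripheral's callbacks.

```python
PAGE_SIZE = 4096
FLAG_EXTERNAL_MMIO = 0x1  # hypothetical attribute bit, not a QEMU constant


class ToyAddressSpace:
    """Toy model: lazily filled page attributes plus device callbacks.

    Devices must be registered before the pages they cover are first
    accessed, because filled entries are never refreshed in this sketch.
    """

    def __init__(self):
        self.page_flags = {}  # page number -> attribute flags
        self.ram = {}         # address -> value, stands in for ordinary RAM
        self.devices = []     # (base, size, read_cb, write_cb)

    def add_device(self, base, size, read_cb, write_cb):
        self.devices.append((base, size, read_cb, write_cb))

    def _find_device(self, addr):
        for dev in self.devices:
            if dev[0] <= addr < dev[0] + dev[1]:
                return dev
        return None

    def _fill(self, page):
        # Page entry population: set the flag if any device overlaps the page.
        start, end = page * PAGE_SIZE, (page + 1) * PAGE_SIZE
        flags = 0
        for base, size, _, _ in self.devices:
            if base < end and start < base + size:
                flags |= FLAG_EXTERNAL_MMIO
        self.page_flags[page] = flags
        return flags

    def _flags(self, addr):
        page = addr // PAGE_SIZE
        flags = self.page_flags.get(page)
        return self._fill(page) if flags is None else flags

    def load(self, addr):
        # Access helper: check the flag, then dispatch to the read callback.
        if self._flags(addr) & FLAG_EXTERNAL_MMIO:
            dev = self._find_device(addr)
            if dev is not None:
                return dev[2](addr - dev[0])
        return self.ram.get(addr, 0)

    def store(self, addr, value):
        # Access helper: check the flag, then dispatch to the write callback.
        if self._flags(addr) & FLAG_EXTERNAL_MMIO:
            dev = self._find_device(addr)
            if dev is not None:
                dev[3](addr - dev[0], value)
                return
        self.ram[addr] = value


# Usage: one hypothetical peripheral at 0x80000000 covering 0x100 bytes.
writes = []
space = ToyAddressSpace()
space.add_device(0x80000000, 0x100,
                 lambda offset: offset + 100,
                 lambda offset, value: writes.append((offset, value)))
space.store(0x10, 7)            # ordinary RAM path
ram_value = space.load(0x10)
dev_value = space.load(0x80000004)  # peripheral read callback
space.store(0x80000008, 42)         # peripheral write callback
```

The callbacks receive an offset relative to the device base, which is one plausible choice for the "basic parameters" mentioned in the abstract; the actual interface is not specified there.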
Indirect memory accesses, prevalent in data-intensive applications such as graph processing and sparse linear algebra, exhibit irregular patterns that severely degrade cache performance due to their low spatial and temporal locality. Traditional stride-based prefetchers fail to capture such patterns, in which target addresses are computed dynamically through index arrays. This paper proposes a dynamic multi-pattern-aware prefetcher (DMP) to address these challenges. DMP introduces a lightweight shifted differential matching mechanism that autonomously identifies indirect access patterns by comparing index data sequences with target address sequences. Implemented on the open-source XuanTie C910 RISC-V processor, DMP reduces the L1 data cache miss rate by 27.3% and achieves speedups of 1.07× to 1.22× on the Sparse Matrix-Vector Multiplication (SpMV) algorithm. This work provides a hardware-efficient solution for non-contiguous memory access patterns in modern processors.
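One plausible reading of shifted differential matching can be sketched in software. The sketch assumes that an indirect access has the form `address = base + (index << shift)`, so that consecutive address differences equal consecutive index differences shifted left by the element-size exponent. The function names and the candidate shift set are hypothetical, and the real hardware mechanism may differ.

```python
def match_indirect_pattern(index_values, target_addrs, shifts=(0, 1, 2, 3, 4)):
    """Return (shift, base) if target_addrs[i] == base + (index_values[i] << shift).

    Matching is done on differences between consecutive entries, so the
    unknown base address cancels out. Returns None when no candidate shift
    fits or when the index never changes (the shift would be ambiguous).
    """
    if len(index_values) < 2 or len(index_values) != len(target_addrs):
        return None
    idx_deltas = [b - a for a, b in zip(index_values, index_values[1:])]
    addr_deltas = [b - a for a, b in zip(target_addrs, target_addrs[1:])]
    if not any(idx_deltas):
        return None
    for shift in shifts:
        if all(ad == (idl << shift) for idl, ad in zip(idx_deltas, addr_deltas)):
            base = target_addrs[0] - (index_values[0] << shift)
            return shift, base
    return None


def predict_address(pattern, next_index):
    """Predict the target address for an index value that has been loaded early."""
    shift, base = pattern
    return base + (next_index << shift)


# Observed: x[col[i]] with 8-byte elements and x located at 0x1000.
observed_indices = [5, 2, 9, 4]
observed_addrs = [0x1000 + (i << 3) for i in observed_indices]
pattern = match_indirect_pattern(observed_indices, observed_addrs)
prefetch_addr = predict_address(pattern, 7)
```

Once a pattern is matched, a prefetcher can read index values ahead of the demand stream and issue prefetches for the predicted targets, which is the general motivation for this kind of matching.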
Dynamic reconfiguration technology has numerous applications, yet research focused on DSP chips is notably limited. This paper proposes a DSP-based partial dynamic reconfiguration method that uses the most frequently modified functions in application programs as the reconfiguration elements. It allocates the DSP memory space and FLASH storage space occupied by the functions to be reconfigured, and replaces the function data in this space online and on demand, thereby realizing partial dynamic reconfiguration. Tests on the domestic FT-M6678 show that the method can effectively change the behavior of the reconfigurable functions without affecting the operation of other software modules. This work provides practical strategies and valuable insights for the flexible use of DSPs, demonstrating significant potential for real-world applications.
Underground oil pipelines are prone to fracture due to factors such as corrosion, fatigue, creep, erosion, and wear thinning, resulting in huge losses. To address the problem of underground pipeline fracture detection, a ring-array ultrasonic circumferential target detection circuit based on the synthetic aperture principle is designed. The circuit comprises an ultrasonic excitation module, an array probe control circuit, an ultrasonic transceiver and generator circuit, and a data acquisition and storage circuit. The ultrasonic excitation module amplifies the continuous square-wave excitation signal and applies it across the two terminals of the ultrasonic probe. The array control circuit controls the ultrasonic emission module, which operates with one transmitter and multiple receivers at any given time. The ultrasonic receiving circuit amplifies and detects the signal, and the data acquisition module then collects and stores the data and uploads it to a PC for processing. Experimental results show that the circuit can detect objects within a 180° range in air, with a detection distance exceeding 2 m.
As a critical component of the launch vehicle measurement system, the transponder receives and coherently relays two C-band velocity measurement signals. To achieve high-precision coherent signal relay, the project team used an FPGA hardware platform and applied methods such as improved quantization accuracy, an innovative quantization approach for relay ratios, cross-relay operation modes, and rational allocation of signal processing time. These measures enabled the design of high-precision coherent relay software for velocity measurement signals. Taking the commonly used 200 kHz Doppler frequency shift as an example, the design achieves a velocity measurement accuracy of 0.0023 Hz. Additionally, unlike the independent operation modes, in which Channel A's main station transmits and Channel A's main/auxiliary stations receive while Channel B's main station transmits and Channel B's main/auxiliary stations receive, the system now supports bidirectional non-common-source velocity measurement. When either Channel A or Channel B fails to receive signals normally, the design allows the main station of either channel to transmit while the main/auxiliary stations of both channels receive synchronously. This enhances the system's velocity measurement accuracy under abnormal conditions.
Existing debugging functions for application development on embedded real-time operating systems include variable inspection, breakpoint management, and memory read/write operations, which generally satisfy users’ debugging requirements for multitasking applications. However, little attention has been paid to debugging a specific task during multitasking debugging. In particular, when multiple tasks invoke the same function, debugging becomes inconvenient for users. This paper presents a method for debugging specified tasks in multitasking programs, implemented on an embedded real-time operating system and an independently developed debugger for the “HunXin” digital signal processor. The approach significantly improves the efficiency of debugging multitasking applications and reduces application development time.