Most Viewed

  • Special Issue of Emerging Computing Chip Design
    SHU Yuhao, LI Yifei, WANG Jincheng, LIU Weiqiang, HA Yajun
    Integrated Circuits and Embedded Systems. 2025, 25(8): 23-30. https://doi.org/10.20193/j.ices2097-4191.2025.0046

    With the rapid advancement of cutting-edge technologies such as artificial intelligence and quantum computing, the demand for high-performance computing chips continues to increase. However, traditional von Neumann architectures are increasingly constrained by the memory wall and the power wall, making it difficult to meet the computing demands of data-intensive applications. Cryogenic in-memory computing combines the superior electrical properties of cryogenic CMOS devices with the high-bandwidth, low-latency advantages of in-memory computing architectures, providing a new solution to overcome these computing bottlenecks. This review summarizes the key characteristics of CMOS devices and various memory media at cryogenic temperatures, and systematically reviews representative architectures, key implementations, and performance metrics of cryogenic in-memory computing in the fields of artificial intelligence and quantum computing. It then analyzes the challenges and development trends at the levels of device technology, circuit systems, and EDA tools.

  • Special Issue of Emerging Computing Chip Design
    YAN Peiran, ZHI Qinzhe, LIU Lifeng, JIA Tianyu
    Integrated Circuits and Embedded Systems. 2025, 25(8): 31-40. https://doi.org/10.20193/j.ices2097-4191.2025.0043

    As Moore's Law slows down, domain-specific SoCs (DSSoCs) have emerged as a promising energy-efficient design strategy by integrating domain-specific accelerators (DSAs). However, the design process for DSSoCs remains highly complex, leading to prolonged development cycles and significant labor effort. Recent advances in large language models (LLMs) have introduced new methodologies for agile chip design, demonstrating substantial potential in code and EDA script generation. In this work, an LLM-based multi-agent framework for DSSoC design is proposed, covering end-to-end design stages from architecture definition to code generation and EDA physical implementation. The approach is validated through two case studies involving 2- to 4-week SoC designs at process nodes of 22 nm and 7 nm. The evaluations show that the generated SoCs achieve energy efficiency improvements of 4.84× and 3.82× compared to SoCs generated by the existing framework.

  • Special Issue of Emerging Computing Chip Design
    LI Qingxin, WEI Jinghe, GAO Ying, HAN Yujie, JU Hu, CAI Shujun, JIANG Jianfei
    Integrated Circuits and Embedded Systems. 2025, 25(8): 64-73. https://doi.org/10.20193/j.ices2097-4191.2025.0052

    A communication interface between the NoC and the Flash controller is designed, consisting mainly of a request path module, a protocol conversion module, and a response path module. The request path module performs data verification and cross-clock-domain processing of request packets sent by the NoC. The protocol conversion module converts the processed packets into configuration instructions in the form of AHB bus signals, configuring the Flash controller and directing the Flash storage device to complete erase, read, and write operations. When the Flash storage device generates response data, the protocol conversion module packs the received response data into response packets and feeds them back to the NoC through the response path module. This communication interface improves packet transmission efficiency between the NoC and the Flash controller, addressing the difficulty of efficient packet exchange among interconnected chiplets and providing a technical foundation for the development of multi-chiplet integration technology.
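
The request-to-AHB conversion flow described above can be sketched in miniature. The packet fields, parity scheme, and register offsets below are illustrative assumptions, not the interface's actual format:

```python
from dataclasses import dataclass

# Hypothetical NoC request packet: the real packet format is not given in
# the abstract, so the field names and register map here are illustrative.
@dataclass
class NocRequest:
    opcode: int    # e.g. 0 = read, 1 = write, 2 = erase
    addr: int      # Flash address
    data: int      # payload word (writes only)
    parity: int    # parity bit over the other fields

def parity_of(*words: int) -> int:
    """XOR-reduce all bits of all words down to one parity bit."""
    acc = 0
    for w in words:
        while w:
            acc ^= w & 1
            w >>= 1
    return acc

def to_ahb_ops(pkt: NocRequest):
    """Verify a request packet, then convert it into AHB-style
    (register, value) writes for a hypothetical Flash-controller map."""
    if parity_of(pkt.opcode, pkt.addr, pkt.data) != pkt.parity:
        raise ValueError("request packet failed parity check")
    CTRL, ADDR, DATA = 0x00, 0x04, 0x08   # illustrative register offsets
    ops = [(ADDR, pkt.addr)]
    if pkt.opcode == 1:                    # a write needs a data phase
        ops.append((DATA, pkt.data))
    ops.append((CTRL, pkt.opcode))         # kick off the operation last
    return ops
```

The response path would run the same packing in reverse: collect the controller's read data, attach a parity field, and hand the packet to the NoC.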

  • Special Issue of Emerging Computing Chip Design
    YU Tianyang, WU Bi, CHEN Ke, LIU Weiqiang
    Integrated Circuits and Embedded Systems. 2025, 25(8): 1-9. https://doi.org/10.20193/j.ices2097-4191.2025.0047

    Hyperdimensional computing (HDC), an emerging computing paradigm drawing inspiration from the human brain, boasts several notable advantages, including low complexity, exceptional robustness, and high interpretability. Consequently, it holds immense potential for a wide array of edge-side applications. HDC serves as an innovative approach that mimics the human brain's information processing mechanisms. By leveraging hyperdimensional vectors and straightforward logical operations, it can accomplish complex cognitive functions. Instead of relying on the complicated architecture of multi-layer neural networks, it employs a lightweight encoding-querying process, paving a fresh technical avenue for the development of highly efficient edge-side artificial intelligence chips. This review provides a meticulous and in-depth analysis of the theoretical foundations and the progressive development of algorithms within HDC, and thoroughly investigates the viability of implementing hardware acceleration techniques at every step of HDC. On this basis, the review focuses on dedicated hardware for the querying step, summarizes the three implementation methods of FPGA, ASIC, and in-memory computing, and analyzes the advantages and disadvantages of each. Moreover, considering the prevalent shortcomings inherent in existing hardware for hyperdimensional querying, this review presents some of the most recent research advancements. Finally, the challenges confronting hardware for HDC are delineated, and promising avenues for future research are outlined.
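
The encoding-querying process the review describes can be illustrated with a minimal sketch. The dimensionality, noise rate, and majority-vote bundling rule below are generic HDC choices, not details of any specific chip surveyed:

```python
import random

D = 10000  # hypervector dimensionality; HDC's robustness comes from high D

def rand_hv(rng):
    """A random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(D)]

def bundle(vecs):
    """Majority vote per dimension: superposes samples into a class vector."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*vecs)]

def similarity(a, b):
    """Normalized dot product: ~0 for unrelated vectors, ~1 for a match."""
    return sum(x * y for x, y in zip(a, b)) / D

# Encoding: bundle noisy samples of a class into associative memory.
# Querying: return the stored class vector most similar to the query.
rng = random.Random(0)
class_a = rand_hv(rng)
class_b = rand_hv(rng)
samples = [[v if rng.random() > 0.1 else -v for v in class_a] for _ in range(5)]
memory = {"A": bundle(samples), "B": class_b}
query = [v if rng.random() > 0.1 else -v for v in class_a]  # noisy sample of A
best = max(memory, key=lambda k: similarity(memory[k], query))
```

The querying step is exactly the similarity search that the surveyed FPGA, ASIC, and in-memory-computing designs accelerate in hardware.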

  • Special Issue of Emerging Computing Chip Design
    TAN Jiahui, SU Jiongzhe, ZHOU Rong, ZHANG Chunzheng, CAI Hao
    Integrated Circuits and Embedded Systems. 2025, 25(8): 53-63. https://doi.org/10.20193/j.ices2097-4191.2025.0045

    Computing-In-Memory (CIM) based on Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is expected to be an effective way to overcome the "memory wall" bottleneck. This paper proposes an energy-efficient time-domain CIM design scheme for STT-MRAM: a custom series-connected memory cell structure, built from series transistors and complementary MTJ pairs, forms a magnetoresistance chain across multiple rows of memory cells in computing mode, and a time-domain conversion circuit converts the chain resistance into a pulse delay signal. Further, a complementary series array architecture is designed, generating differential time signals through separate storage of positive and negative weights to support signed-number calculations. For the quantization circuit, a Successive Approximation Register (SAR) Time-to-Digital Converter (TDC) is proposed, combining a voltage-adjustable delay chain with a flip-flop. To achieve multi-bit multiply-accumulate operations, a signed-number weight encoding scheme and a digital post-processing architecture are proposed: through encoded weight mapping and digital shift-accumulate algorithms, the 8-bit-input, 8-bit-weight multiply-accumulate operation is decomposed into a low 5-bit time-domain calculation and a high-bit digital-domain calculation, outputting a 21-bit full-precision result. Layout design and post-layout simulation were completed in a 28 nm CMOS process. At 0.9 V, a 9-bit multiply-accumulate operation with a resolution margin of 270 ps was achieved, with an energy consumption of only 16 fJ per operation. The designed 5-bit SAR-TDC achieves highly linear time-to-digital conversion. A 9 Kb time-domain CIM macrocell with an area of 0.026 mm² was designed, including a memory cell array, SAR-TDC module, computing circuit, and read-write control circuit. The macrocell achieves energy efficiencies of 26.4 TOPS/W and 42.8 TOPS/W for convolutional-layer and fully-connected-layer calculations, respectively, at 8-bit precision, with an area efficiency of 0.523 TOPS/mm².
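
The shift-accumulate decomposition can be checked arithmetically with a small sketch. Only the bit-split identity is modeled here, not the time-domain circuit; the 5-bit split point follows the abstract, while the unsigned treatment of weights is a simplifying assumption:

```python
def mac_split(inputs, weights, lo_bits=5):
    """Multiply-accumulate with each weight split into a low field
    (the part computed in the time domain per the abstract) and a high
    field (computed digitally), recombined by shift-and-add.
    Models only the arithmetic identity, not the circuit."""
    mask = (1 << lo_bits) - 1
    acc_lo = sum(x * (w & mask) for x, w in zip(inputs, weights))
    acc_hi = sum(x * (w >> lo_bits) for x, w in zip(inputs, weights))
    return (acc_hi << lo_bits) + acc_lo

# Width check: 8-bit inputs x 8-bit weights accumulated over 32 rows
# fit in 8 + 8 + log2(32) = 21 bits, matching the 21-bit result.
```

The split leaves the slow, dense part of the sum to the analog time domain and the cheap high-order correction to digital logic.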

  • Special Issue of Emerging Computing Chip Design
    HU Dongwei, BA Xiaohui, LIU Gengting, WANG Linan, LEI Yuejun
    Integrated Circuits and Embedded Systems. 2025, 25(8): 81-90. https://doi.org/10.20193/j.ices2097-4191.2025.0044

    In the cache-coherent Network-on-Chip (NoC) of a many-core CPU, the snooping and snooping response process (SNP Process) incurs long latency. To address this, two techniques, multicast routing and adaptive routing, are proposed in this paper. Based on the requirements of these two techniques, NoC packet formats for the Snooping Request Channel (SNP REQ Ch) and Snooping Response Channel (SNP RESP Ch) are proposed, and the NoC routers for both channels are implemented in VLSI. The implementation results show that the routers for the SNP REQ Ch and SNP RESP Ch occupy 85 940.3 μm² and 103 518.5 μm², respectively, while an 8×8 network occupies 5.57 mm², which is feasible for large-scale chips. Simulations are employed to compare the latencies of four configurations: unicast deterministic routing, unicast adaptive routing, multicast deterministic routing, and multicast adaptive routing. The simulation results show that, compared to unicast deterministic routing, multicast adaptive routing cuts the latency of the SNP Process by 45% for a single snooping request, well below DDR/HBM access latency, and by 73% for 32 consecutive snooping requests with the outstanding technique employed at the Point of Coherency (PoC), which validates the effectiveness of the proposed techniques.

  • Special Issue of Emerging Computing Chip Design
    DU Xirui, YIN Guodong, CHEN Yiming, CHEONG Ling-An, YU Tianyi, YANG Huazhong, LI Xueqing
    Integrated Circuits and Embedded Systems. 2025, 25(8): 10-22. https://doi.org/10.20193/j.ices2097-4191.2025.0041

    Neural networks are representative algorithms of artificial intelligence, but their huge number of parameters poses new challenges to hardware deployment at the edge. On the one hand, application flexibility requires the computing hardware to adapt the deployed model across tasks through parameter fine-tuning at the edge. On the other hand, improving computing energy efficiency and performance requires large-capacity on-chip storage to reduce off-chip memory access costs. The recently proposed ROM-SRAM hybrid compute-in-memory architecture is a promising solution under mature CMOS technology. Thanks to high-density ROM-based compute-in-memory, most of the weights of the neural network can be stored on chip, reducing reliance on off-chip memory access, while SRAM-based compute-in-memory supplies the flexibility that the fixed ROM contents lack. To expand the design and application space of the ROM-SRAM hybrid compute-in-memory architecture, it is necessary to further improve the density of ROM-based compute-in-memory to support larger networks and to explore ways of obtaining greater flexibility from a small amount of SRAM compute-in-memory. This paper introduces several common techniques for improving the memory density of ROM-based compute-in-memory, as well as neural network fine-tuning methods based on the ROM-SRAM hybrid architecture that improve flexibility. Solutions for deploying ultra-large-scale neural networks and for the bottleneck of dynamic matrix multiplication in large language models with long sequences are discussed, and an outlook on the broad design space and application prospects of the ROM-SRAM hybrid compute-in-memory architecture is provided.
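
The division of labor between frozen ROM weights and a small tunable SRAM portion can be sketched as follows. The additive-delta formulation is an illustrative simplification, not the paper's circuit-level design:

```python
# Sketch of the flexibility idea behind a ROM-SRAM hybrid CIM array:
# the bulk of each weight is frozen in dense ROM, while a small
# SRAM-resident correction is updated during on-device fine-tuning.
# Pure-Python matrix math; the split is illustrative only.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_rom = [[2, -1, 0], [1, 3, -2]]          # frozen at tape-out
dW_sram = [[0, 0, 0], [0, 0, 0]]          # tunable, initially zero

def forward(x):
    """Output = (ROM weights + SRAM delta) applied to the input."""
    rom = matvec(W_rom, x)
    delta = matvec(dW_sram, x)
    return [r + d for r, d in zip(rom, delta)]

# "Fine-tuning" touches only the small SRAM portion:
dW_sram[0][1] = 1                         # adjust one weight post-deployment
```

Because only the delta array is writable, the tunable storage can stay a small fraction of the dense ROM capacity.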

  • Special Issue of Emerging Computing Chip Design
    XU Junjie, WEI Jinghe, LIU Guozhu, HE Jian, ZHANG Zheng
    Integrated Circuits and Embedded Systems. 2025, 25(8): 74-80. https://doi.org/10.20193/j.ices2097-4191.2025.0051

    PCIe and SRIO are mainstream high-speed communication interface protocols. In large-data application scenarios such as artificial intelligence, compatibility between these protocols is key to building high-compute-power systems that break through storage and computing bottlenecks. To meet this requirement, the CIP interconnection core implements multi-protocol conversion among PCIe, SRIO, DDR, and NAND FLASH over a unified routing network. Among these, PCIe is the main human-computer interaction interface, and building a PCIe RP (Root Port) system is the basis of PCIe communication. Existing operating-system-based PCIe read/write approaches suffer from high latency and poor operability. To solve these problems, a PCIe RP system is built on a Cortex-M3 processor, and the corresponding drivers and software are developed, achieving efficient and accurate data transmission between PCIe and various devices. Beyond the basic functions, stability tests of 50 000, 100 000, and 150 000 large-scale data interactions were completed. The results show that the system remains stable under large-scale data interaction, providing a solution for data interaction between the processor and PCIe devices.

  • Special Issue of Emerging Computing Chip Design
    YAN Zheng, ZHANG Chenshuo, BAI Yichuan, DU Yuan, DU Li
    Integrated Circuits and Embedded Systems. 2025, 25(8): 41-52. https://doi.org/10.20193/j.ices2097-4191.2025.0042

    Convolution is the most common operation in CNNs, and the multiply-accumulate operations in convolution consume considerable power, limiting the performance of many CNN hardware accelerators. Reducing the number of multiplications in convolution is one of the effective ways to improve CNN accelerator performance. As a fast convolution algorithm, the Winograd algorithm can reduce the number of multiplications in convolution by up to 75%. However, the weights of a model transformed for Winograd convolution have a significantly different distribution, which requires a longer quantization bit width to maintain comparable accuracy and offsets the hardware savings brought by the reduced multiplication count. In this paper, we analyze this problem quantitatively and propose a new quantization scheme for Winograd convolution. The quantized Winograd computation hardware module is implemented with an accuracy loss of less than 1%. To further reduce hardware cost, we apply an approximate multiplier (AM) to Winograd convolution. Compared with the conventional convolution computation block, the Winograd block saves 27.3% of the area, and applying the approximate multiplier in the Winograd block saves 39.6% of the area without significant performance loss.
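
The multiplication saving can be seen concretely in the smallest Winograd instance, F(2,3), which produces two convolution outputs with 4 multiplications instead of the direct method's 6; the 75% figure in the abstract corresponds to larger tiles such as F(4×4, 3×3):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 1-D convolution of a length-4
    input tile d with a length-3 filter g, using 4 multiplications.
    The halved filter terms are precomputed once per filter in practice."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: the same two outputs by direct sliding dot product (6 mults)."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
```

The transformed terms such as (g0 + g1 + g2)/2 are exactly the reshaped weights whose wider distribution motivates the paper's quantization scheme.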

  • Industry Viewpoint
    CAO Kaihua, ZHANG He, LIU Hongxi, WANG Gefei, WANG Zhaohao, ZHAO Weisheng
    Integrated Circuits and Embedded Systems. 2026, 26(1): 1-4. https://doi.org/10.20193/j.ices2097-4191.2025.0120

    To meet the urgent demand of artificial intelligence, edge computing, and high-reliability embedded systems for high-speed, low-power, non-volatile storage, spin-orbit torque magnetic random access memory (SOT-MRAM) has become an important direction for next-generation memory. Beihang University, in collaboration with Zhizhen Storage (Beijing) Technology Co., Ltd., carried out coordinated innovation at the material, device, process, and architecture levels and developed the world's first 8 Mb SOT-MRAM chip. An independently controlled 8-inch manufacturing platform was used to build a hybrid integration technology route compatible with mainstream CMOS processes, achieving a breakthrough in capacity scaling while retaining the advantages of sub-nanosecond ultra-fast writing, ultra-high reliability, and low power consumption. These results provide a key path for SOT-MRAM to move from technology validation to engineering and industrialization, and are of great significance in leading the development of China's emerging memory industry.

  • Special Topic on IC Design Automation and High-reliability Design
    WANG Qitao, FENG Haoran, LAO Junjie, YOU Jiaxin, LIN Zefan, LAI Liyang
    Integrated Circuits and Embedded Systems. 2026, 26(4): 41-50. https://doi.org/10.20193/j.ices2097-4191.2025.0133

    With the increasing complexity and integration levels of integrated circuits, Diagnosis-Driven Yield Analysis (DDYA) has become increasingly important in accelerating physical failure analysis and improving yield. However, the low diagnostic resolution of scan chain diagnosis based on scan testing remains a weak link in DDYA. This paper studies a scan chain diagnosis technique based on a hardware architecture improvement: sideway scan. This technique groups scan chains by clock domain or layout constraints and introduces a cyclic-shift sideway transmission path between adjacent scan chains within each group. By transmitting data from the faulty chain to a normal chain and then unloading it, followed by analysis with the sideway diagnostic algorithm, the technique enables precise diagnosis of various fault scenarios. This architecture offers lower hardware overhead than two-dimensional scan and higher diagnostic resolution than bidirectional scan. Comparative experiments across multiple circuits demonstrate that, compared to software-based scan chain diagnosis, sideway scan achieves up to 41% improvement in single-fault diagnostic resolution, up to 80% in double-fault diagnosis, and up to 168% in triple-fault diagnosis. Meanwhile, across various fault scenarios, diagnosis time is reduced by over 90%, with the maximum reduction reaching 99%. The study demonstrates the feasibility, stability, time advantage, and diagnostic resolution advantage of sideway scan, providing a more efficient and precise solution for fault diagnosis in complex integrated circuits.

  • Special Topic on IC Design Automation and High-reliability Design
    MA Jingbo, ZHANG Guangda, WANG Huiquan, PEI Bingxi, FANG Jian, HUANG Chenglong, LUO Hui, JIANG Yande
    Integrated Circuits and Embedded Systems. 2026, 26(4): 26-33. https://doi.org/10.20193/j.ices2097-4191.2025.0137

    As SoC architectures evolve to meet the computational intensity of diverse AI applications, the pursuit of high-performance throughput must be balanced with uncompromising reliability. Consequently, parity check mechanisms have emerged as a cornerstone of modern circuit design, essential for safeguarding the integrity of massive data movement within the SoC fabric. However, in wide-bit-width data transmission scenarios, traditional parity check circuit designs face challenges such as high verification complexity and significant decoding latency, which in turn constrain the overall performance of SoCs, including the system master clock frequency and data access bandwidth. To address this technical challenge, this paper proposes a multi-stage pipelined parity check circuit design method for the AXI bus in SoC memory. The design employs a pipelined architecture to optimize the verification process in stages, significantly reducing the critical path delay in the data pathway. The experimental results demonstrate that, at a minimal cost of a 0.47% increase in total circuit area and a 0.24% rise in power consumption, the proposed design method achieves timing optimization of the data read/write bus critical path, reducing the maximum delay of the AXI bus write and read data circuit paths by 18.62% and 25.60%, respectively, effectively enhancing the overall performance and reliability of the SoC.
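
The staging idea can be sketched as a logarithmic XOR fold, where each fold would occupy its own pipeline stage with a register in between. The 64-bit width and stage count are illustrative, not the paper's AXI configuration:

```python
# Sketch of staged parity checking: a wide XOR tree is split so each
# stage's logic depth stays short, shortening the critical path at the
# cost of latency in cycles. Widths here are illustrative.

def fold_stage(x: int, width: int) -> int:
    """One pipeline stage: XOR the upper half of a width-bit word onto
    the lower half, halving the remaining width."""
    half = width // 2
    return (x ^ (x >> half)) & ((1 << half) - 1)

def pipelined_parity(word: int, width: int = 64) -> int:
    """Parity of word, folded over log2(width) short stages instead of
    one deep width-input XOR tree on the critical path."""
    w = width
    while w > 1:
        word = fold_stage(word, w)   # in hardware: register between stages
        w //= 2
    return word

def reference_parity(word: int) -> int:
    """Reference: popcount parity."""
    return bin(word).count("1") & 1
```

Each fold preserves the parity of the word while halving the fan-in of the next stage, which is why the staged version computes the same bit as the flat XOR tree.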

  • Special Topic on IC Design Automation and High-reliability Design
    CHEN Kehao, LI Zepeng, LIN Ziqing, LIU Genggeng
    Integrated Circuits and Embedded Systems. 2026, 26(4): 1-13. https://doi.org/10.20193/j.ices2097-4191.2025.0136

    As integrated circuit feature sizes continue to shrink, the antenna effect increasingly impacts chip reliability. Layer assignment, a critical step in physical design, allocates 2D routing segments into a multi-layer 3D space. Improper assignment can cause wires to form excessively long antennas that accumulate charge and damage gates. However, existing research primarily focuses on delay and via optimization without adequately considering antenna effects. Moreover, the non-default-rule (NDR) wire technology widely adopted at advanced nodes exacerbates antenna effects due to larger wire widths. This paper proposes an antenna-aware layer assignment algorithm for advanced technology nodes comprising four core strategies: an antenna-cost-aware dynamic programming strategy that reduces violations during initialization; a high-layer-priority segment reassignment strategy that precisely controls antenna area growth; a timing-aware NDR replacement strategy that fixes violations while limiting delay impact; and a g-edge resource negotiation strategy that releases routing resources through cross-net coordination. The experimental results demonstrate that the proposed algorithm significantly reduces antenna-violating nets and pins while maintaining excellent delay and via count performance.
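
The rule being optimized against can be stated compactly: an antenna violation occurs when the ratio of charge-collecting metal area to connected gate area exceeds a foundry limit. A toy check, with an illustrative limit rather than a real PDK value:

```python
# Minimal model of the antenna check: during fabrication, a metal
# segment collects charge for the gates it connects to, and foundry
# rules bound the metal-area / gate-area ratio. The limit below is
# illustrative, not a real process parameter.

def antenna_ratio(metal_area_um2: float, gate_area_um2: float) -> float:
    return metal_area_um2 / gate_area_um2

def violates(metal_area_um2: float, gate_area_um2: float,
             limit: float = 400.0) -> bool:
    """True if the segment exceeds the (illustrative) allowed ratio.
    Wider NDR wires raise metal area for the same length, which is why
    NDR nets are more antenna-prone, as the abstract notes."""
    return antenna_ratio(metal_area_um2, gate_area_um2) > limit
```

Doubling the wire width doubles the metal area, so an NDR net can violate where the default-width net on the same route did not; this is the trade-off the timing-aware NDR replacement strategy navigates.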

  • Special Topic on IC Design Automation and High-reliability Design
    ZHOU Shiqi, CAI Huayang, WANG Jingyi, LIU Genggeng
    Integrated Circuits and Embedded Systems. 2026, 26(4): 51-60. https://doi.org/10.20193/j.ices2097-4191.2025.0134

    Continuous-flow microfluidic biochips (CFMBs) are widely used in biochemical analysis due to their high precision and reliability. CFMBs consist of a flow layer and a control layer. To manage complex logic in the control layer with limited control pins, multiplexers are extensively employed. However, the physical design of multiplexers, specifically the co-optimization of valve placement and channel routing, remains underexplored. To address this, this paper proposes a co-optimization method based on Discrete Particle Swarm Optimization (DPSO). First, valve placement regions are constrained via preprocessing to ensure routing feasibility. Second, a DPSO framework encodes placements into particle positions and uses an embedded A* router to provide routing cost as the fitness value, establishing a closed-loop feedback mechanism between placement and routing. Third, X-architecture routing is introduced to expand the solution space and minimize wirelength. Experimental results demonstrate that the proposed method reduces the average control channel length by 8.27%. Notably, the X-architecture contributes a 5.01% improvement over traditional R-type routing, significantly enhancing both layout quality and routing efficiency.

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    WU Liangshun, TAO Tao, ZHANG Bin
    Integrated Circuits and Embedded Systems. 2025, 25(12): 1-7. https://doi.org/10.20193/j.ices2097-4191.2025.0063

    As neural network models become increasingly complex, the Network-on-Chip (NoC) plays a critical communication role in heterogeneous computing systems. However, traditional NoC simulation tools generally lack support for heterogeneous computing units such as matrix processing units and RISC-V programmable cores, making it difficult to meet the requirements of large-scale AI tasks in terms of real-time performance, throughput, and energy efficiency. To address these challenges, this paper proposes and implements a behavior-level NoC simulation framework for heterogeneous computing. The framework features high-precision node modeling, a dynamic pipelining mechanism, a hybrid task-aware routing algorithm, and full-path visualization and debugging capabilities. The experimental results demonstrate that the proposed framework significantly outperforms traditional methods in average latency, throughput, and visualization debugging efficiency. Notably, it exhibits greater stability and scalability in scenarios involving hybrid task flows and hardware faults, providing strong support for the design and optimization of NoCs in next-generation intelligent computing platforms.

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    ZHANG Yi, ZHANG Yuling, YANG Xuecong
    Integrated Circuits and Embedded Systems. 2025, 25(12): 40-51. https://doi.org/10.20193/j.ices2097-4191.2025.0089

    Memory access latency remains a major bottleneck for many applications on modern processors. To optimize memory access performance, it is crucial to exploit the locality of reference in memory accesses. Data layout optimization techniques, through operations such as merging, splitting, and reorganizing data structures, can significantly improve the locality of memory accesses. This paper first provides an overview of the technological background of memory architecture and data organization involved in layout optimization. It then introduces the key issues that these techniques aim to address, the core ideas behind them, and the main technologies upon which their implementation relies. Given the significant differences in storage and access patterns across various types of data, this paper focuses on systematically summarizing and categorizing relevant research, comparing the strengths and weaknesses of different approaches, and analyzing promising future research directions.

  • Special Topic on IC Design Automation and High-reliability Design
    TIAN Chunsheng, ZHAO Xiangyu, WANG Shuo, WANG Zhuoli, CAO Yongzheng, ZHOU Jing, ZHANG Yaowei, CHEN Lei
    Integrated Circuits and Embedded Systems. 2026, 26(4): 14-25. https://doi.org/10.20193/j.ices2097-4191.2025.0138

    The widespread integration of Field Programmable Gate Arrays (FPGAs) in high-performance computing, AI inference, and 5G communications has led to an unprecedented escalation in design scale and timing-constraint complexity. These trends impose stringent demands on the runtime efficiency of Static Timing Analysis (STA). Current FPGA STA tools, primarily anchored in single-core or multi-core CPU architectures, are increasingly hitting a performance wall: despite persistent algorithmic refinements, they struggle with computational bottlenecks and suboptimal memory throughput when confronted with large-scale designs. In recent years, Graphics Processing Units (GPUs), with their massive parallel computing capabilities, have provided new opportunities for improving FPGA STA performance. However, challenges in memory access patterns under heterogeneous GPU architectures, the optimization of timing-graph loop detection, and heterogeneous parallel acceleration strategies continue to limit the effectiveness of current GPU-accelerated methods in FPGA STA scenarios. To address these issues, we propose an FPGA STA algorithm accelerated by an efficient heterogeneous parallel strategy. First, targeting the discontinuous memory access and field interleaving of traditional object-oriented data structures under CPU-GPU heterogeneous architectures, a structure-of-arrays (SoA)-based data layout strategy is presented. Combined with data reordering, this approach effectively reduces memory access latency and improves bandwidth utilization, providing a data foundation for high-performance FPGA STA computation engines. Second, to overcome the low efficiency and poor robustness of timing-graph loop detection, a parallel loop detection optimization algorithm based on color propagation is designed, enabling efficient acceleration of the preprocessing stage of FPGA STA. Furthermore, a task decomposition and timing-graph traversal method tailored for CPU-GPU heterogeneous architectures is proposed, achieving efficient acceleration of core STA operations such as delay calculation, levelization, and graph propagation. Finally, experimental results on both OpenCores and industrial-grade FPGA benchmarks demonstrate that, compared with traditional CPU implementations, the proposed method achieves a runtime speedup of 3.125× to 33.333×, with overall performance surpassing that of the OpenTimer tool. This research provides a practical and feasible approach for efficient timing verification in large-scale FPGA designs.
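
The AoS-to-SoA transformation at the heart of the data layout strategy can be illustrated in miniature. The field names are hypothetical, and in Python the benefit is only structural; on a GPU, the per-field contiguous arrays are what enable coalesced memory access across threads:

```python
# Array-of-structures vs structure-of-arrays, the layout change applied
# before GPU traversal. With AoS, reading one field strides over whole
# records; with SoA, each field lives in its own contiguous array.
from array import array

# AoS: each timing node's fields interleaved (hypothetical fields)
aos = [{"delay": 1.0, "slew": 0.1, "level": 0},
       {"delay": 2.0, "slew": 0.2, "level": 1},
       {"delay": 3.0, "slew": 0.3, "level": 1}]

# SoA: one contiguous native array per field
soa = {
    "delay": array("d", (n["delay"] for n in aos)),
    "slew":  array("d", (n["slew"] for n in aos)),
    "level": array("l", (n["level"] for n in aos)),
}

def max_delay_aos(nodes):
    """Scans every record, touching unused fields along the way."""
    return max(n["delay"] for n in nodes)

def max_delay_soa(cols):
    """Scans one contiguous array: the access pattern a GPU can coalesce."""
    return max(cols["delay"])
```

The paper pairs this layout with data reordering so that nodes processed together (e.g. one levelization level) are also adjacent in each array.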

  • Research Paper
    DONG Chunlei, ZHAO Bo, LYU Ping, LI Peijie, ZHANG Xia
    Integrated Circuits and Embedded Systems. 2025, 25(10): 47-54. https://doi.org/10.20193/j.ices2097-4191.2025.0010

    High-speed SerDes rates have progressed from 56 Gb/s to 112 Gb/s and beyond. Maintaining signal integrity at such ultra-high speeds while balancing power consumption, reliability, flexibility, and cost-effectiveness is a hot topic in current research. This paper reviews key technologies for 112 Gb/s SerDes from four perspectives: transmitter, receiver, clock structure, and low-power techniques, based on the current mainstream architecture of analog-to-digital conversion and digital signal processing. This exploration is provided as a reference for research related to high-speed SerDes technology.

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    HUANG He, YANG Fan, PU Tao, AI Jingmei
    Integrated Circuits and Embedded Systems. 2025, 25(12): 27-32. https://doi.org/10.20193/j.ices2097-4191.2025.0078

    Addressing the core requirements of unmanned systems in terms of autonomous controllability, real-time response, and intelligent collaboration, this paper proposes a full-stack localized unmanned intelligent control system solution based on ReWorks embedded real-time operating system and openEuler open-source operating system. By constructing a dual-system heterogeneous architecture of "AI brain + real-time cerebellum", combined with the ROS2 communication framework and microROS embedded extension, deep collaboration between intelligent decision-making and hard real-time control is achieved. Verification on domestic hardware platforms such as Loongson 2K1000 and Feiteng D2000 shows that the real-time performance indicators of this solution are significantly better than those of Linux, providing a full-stack autonomous controllable technology path for unmanned system application scenarios such as underwater robots and drones.

  • Special Topic of Intelligent Embedded System Software and Hardware Collaborative Design and Application
    JIN Ziyi, ZHU Zhichen, DU Jiang, CHEN Yixiang
    Integrated Circuits and Embedded Systems. 2025, 25(12): 33-39. https://doi.org/10.20193/j.ices2097-4191.2025.0061

    This paper presents the embedded deployment of the PVAC model to predict the risk of ventilator-associated complications (VAC) in patients with acute respiratory failure. The PVAC model employs the USMOTE (0.9) algorithm to address imbalanced data and integrates an AdaBoost classifier, achieving an accuracy of 71.11% and a precision of 68.89%. To overcome the limitations of existing AI medical systems that rely on cloud servers, we implemented a fully embedded deployment of the PVAC model using the PYNQ-Z2 development board. This solution offers three key advantages: offline standalone operation, hardware acceleration for improved computational efficiency, and cost-effectiveness. Experimental results demonstrate that the hardware-software co-design approach significantly reduces the total execution time from 46.3 ms to 10.2 ms, a reduction of 78%. Meanwhile, the ARM processor's workload decreases dramatically from 98% to 28%, with only a 0.2% drop in prediction accuracy, effectively preserving the model's original performance. This study not only validates the feasibility of embedding the PVAC model but also provides a reference for the localized deployment of other medical AI applications. Future work may focus on further optimizing the decision tree structure, leveraging the dynamic reconfigurability of FPGAs to support more complex models, extending the capability to process temporal signals, and developing low-power modes to extend device usage time, thereby enhancing the system's practicality and applicability.