Most Viewed

  • Special Issue of Emerging Computing Chip Design
    SHU Yuhao, LI Yifei, WANG Jincheng, LIU Weiqiang, HA Yajun
    Integrated Circuits and Embedded Systems. 2025, 25(8): 23-30. https://doi.org/10.20193/j.ices2097-4191.2025.0046

    With the rapid advancement of cutting-edge technologies such as artificial intelligence and quantum computing, the demand for high-performance computing chips continues to grow. However, traditional von Neumann architectures are increasingly constrained by the memory wall and power wall, making it difficult to meet the computing demands of data-intensive applications. Cryogenic in-memory computing combines the superior electrical properties of cryogenic CMOS devices with the high-bandwidth, low-latency advantages of in-memory computing architectures, offering a new route past these computing bottlenecks. This review summarizes the key characteristics of CMOS devices and various memory media at cryogenic temperatures, and systematically surveys representative architectures, key implementations, and performance metrics of cryogenic in-memory computing in artificial intelligence and quantum computing. It also analyzes the challenges and development trends at the levels of device technology, circuit systems, and EDA tools.

  • Special Issue of Emerging Computing Chip Design
    YAN Peiran, ZHI Qinzhe, LIU Lifeng, JIA Tianyu
    Integrated Circuits and Embedded Systems. 2025, 25(8): 31-40. https://doi.org/10.20193/j.ices2097-4191.2025.0043

    As Moore's Law slows down, domain-specific SoCs (DSSoCs) have emerged as a promising energy-efficient design strategy by integrating domain-specific accelerators (DSAs). However, the DSSoC design process remains highly complex, leading to prolonged development cycles and significant labor effort. Recent advances in large language models (LLMs) have introduced new methodologies for agile chip design, demonstrating substantial potential in code and EDA script generation. In this work, an LLM-based multi-agent framework for DSSoC design is proposed, covering end-to-end design stages from architecture definition to code generation and EDA physical implementation. The approach is validated through two case studies involving 2- to 4-week SoC designs at 22 nm and 7 nm process nodes. The evaluations show that the generated SoCs achieve energy efficiency improvements of 4.84× and 3.82× compared to SoCs generated by an existing framework.

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    SONG Chifeng, YANG Hangyu, LIU Jiyuan, TANG Yongming, YUAN Xiaodong, LI He
    Integrated Circuits and Embedded Systems. 2025, 25(6): 39-47. https://doi.org/10.20193/j.ices2097-4191.2025.0020

    Real-time simulation of new-type power systems places higher demands on CPU-FPGA heterogeneous computing and multi-FPGA distributed computing, where communication efficiency can become a bottleneck. Given the limitations of Gigabit Ethernet in bandwidth and real-time performance, this paper proposes an FPGA-based lightweight 10 GbE high-bandwidth, low-latency interface design. The PHY is built on GT transceivers to achieve low latency and high reliability. In the UDP stack, alternating caching and priority queuing are adopted to improve data throughput and balance instantaneous load. On-board test results show that the design achieves low hardware resource consumption, a maximum transmission bandwidth of 9.70 Gb/s, an average transmission latency of 0.45 μs, and stable, interference-free interaction between protocol layers, providing efficient communication support for power-system simulation and other applications.

  • Special Issue of Emerging Computing Chip Design
    LI Qingxin, WEI Jinghe, GAO Ying, HAN Yujie, JU Hu, CAI Shujun, JIANG Jianfei
    Integrated Circuits and Embedded Systems. 2025, 25(8): 64-73. https://doi.org/10.20193/j.ices2097-4191.2025.0052

    A communication interface between a NoC and a Flash controller is designed, consisting mainly of a request path module, a protocol conversion module, and a response path module. The request path module performs data verification and cross-clock-domain processing on request packets sent by the NoC. The protocol conversion module converts the processed packets into configuration instructions in the form of AHB bus signals, configuring the Flash controller and directing the Flash storage device to complete erase, read, and write operations. When the Flash storage device generates response data, the protocol conversion module packs the received data into response packets and feeds them back to the NoC through the response path module. This communication interface improves packet transmission efficiency between the NoC and the Flash controller, addressing the difficulty of efficient packet exchange over multi-chiplet interconnects and providing a technical foundation for the development of multi-chiplet integration technology.

  • Special Issue of Emerging Computing Chip Design
    YU Tianyang, WU Bi, CHEN Ke, LIU Weiqiang
    Integrated Circuits and Embedded Systems. 2025, 25(8): 1-9. https://doi.org/10.20193/j.ices2097-4191.2025.0047

    Hyperdimensional computing (HDC), an emerging computing paradigm inspired by the human brain, offers several notable advantages, including low complexity, exceptional robustness, and high interpretability, and consequently holds immense potential for a wide array of edge-side applications. HDC mimics the brain's information processing mechanisms: by leveraging hyperdimensional vectors and straightforward logical operations, it can accomplish complex cognitive functions. Instead of relying on the complicated architecture of multi-layer neural networks, it employs a lightweight encoding-querying process, opening a fresh technical avenue for highly efficient edge-side artificial intelligence chips. This review provides an in-depth analysis of the theoretical foundations and algorithmic development of HDC, and thoroughly investigates the viability of hardware acceleration at every step of HDC. On this basis, the review focuses on dedicated hardware for the querying step, summarizes three implementation approaches (FPGA, ASIC, and in-memory computing), and analyzes their respective advantages and disadvantages. Moreover, considering the prevalent shortcomings of existing hardware for hyperdimensional querying, this review presents some of the most recent research advances. Finally, the challenges confronting HDC hardware are delineated, and promising avenues for future research are outlined.
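
The encoding-querying process described above can be sketched in a few lines with bipolar hypervectors. The dimensionality, the role-vector binding, and the nearest-prototype query below are generic illustrative choices, not the specifics of any chip surveyed in the review:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality (typically in the thousands)

def random_hv():
    """A random bipolar {-1, +1} hypervector."""
    return rng.choice([-1, 1], size=D)

def bundle(hvs):
    """Bundling: elementwise majority vote, forming a class prototype."""
    return np.sign(np.sum(hvs, axis=0))

def similarity(a, b):
    """Normalized dot product; near 0 for unrelated hypervectors."""
    return a @ b / D

# Encoding: bind (elementwise multiply) each sample with a role hypervector.
role = random_hv()
samples_a = [role * random_hv() for _ in range(21)]
samples_b = [role * random_hv() for _ in range(21)]
proto_a, proto_b = bundle(samples_a), bundle(samples_b)

# Querying: classify a sample by nearest-prototype similarity.
query = samples_a[0]
sim_a, sim_b = similarity(query, proto_a), similarity(query, proto_b)
label = "A" if sim_a > sim_b else "B"
```

The querying step, the focus of the review's hardware discussion, reduces to exactly this similarity search over class prototypes.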

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    HUANG Sixiao, PENG Haoxiang, SHI Xu, SU Zhifeng, HUANG Mingqiang, YU Hao
    Integrated Circuits and Embedded Systems. 2025, 25(6): 1-13. https://doi.org/10.20193/j.ices2097-4191.2025.0023

    In recent years, with the widespread application of large models (such as GPT, LLaMA, and DeepSeek), computing power and energy efficiency in the inference stage have become increasingly prominent concerns. Although traditional GPU solutions can provide high throughput, they face challenges in power consumption, real-time performance, and cost. FPGAs have become an important alternative for large-model inference deployment thanks to their customizable architecture, deterministic low latency, and high energy efficiency. This paper systematically reviews the network structures of large models and the technologies for implementing large-model inference on FPGAs, covering three major directions: hardware architecture adaptation, algorithm-hardware co-optimization, and system-level challenges. At the hardware level, the focus is on computing-unit design and memory hierarchy optimization strategies; at the algorithm level, key technologies such as model compression, dynamic quantization, and compiler optimization are analyzed; at the system level, challenges such as multi-FPGA scaling, thermal management, and emerging compute-in-memory architectures are discussed. In addition, this paper summarizes the limitations of the current FPGA inference ecosystem (such as insufficient toolchain maturity) and looks ahead to future trends, including chiplet heterogeneous integration, photonic computing fusion, and the establishment of a standardized evaluation system. The results show that the architectural flexibility of FPGAs gives them a unique advantage in efficient large-model inference, but interdisciplinary collaboration is still needed to bring the technology into practice.

  • Special Issue of Emerging Computing Chip Design
    TAN Jiahui, SU Jiongzhe, ZHOU Rong, ZHANG Chunzheng, CAI Hao
    Integrated Circuits and Embedded Systems. 2025, 25(8): 53-63. https://doi.org/10.20193/j.ices2097-4191.2025.0045

    Computing-in-memory (CIM) based on spin-transfer-torque magnetic random access memory (STT-MRAM) is expected to be an effective way to overcome the "memory wall" bottleneck. This paper proposes a highly energy-efficient time-domain CIM design scheme for STT-MRAM: a custom series-connected memory cell structure, built from series transistors and complementary MTJs, forms a magnetoresistance chain across multiple rows of memory cells in computing mode, and a time-domain conversion circuit converts the resistance value into a pulse delay signal. A complementary series array architecture is further designed, generating differential time signals by storing positive and negative weights separately to support signed computation. For the quantization circuit, a successive approximation register (SAR) time-to-digital converter (TDC) is proposed, combining a voltage-adjustable delay chain with flip-flops. To achieve multi-bit multiply-accumulate operations, a signed weight encoding scheme and a digital post-processing architecture are proposed: through encoded weight mapping and digital shift-accumulate algorithms, the 8-bit-input, 8-bit-weight multiply-accumulate operation is decomposed into a low 5-bit time-domain computation and a high-bit digital-domain computation, producing a 21-bit full-precision result. The layout design and post-layout simulation were completed in a 28 nm CMOS process. At 0.9 V, a 9-bit multiply-accumulate operation with a resolution margin of 270 ps was achieved at an energy cost of only 16 fJ per operation, and the designed 5-bit SAR-TDC achieves highly linear time-to-digital conversion. A 9 Kb time-domain CIM macro with an area of 0.026 mm² was designed, including the memory cell array, SAR-TDC module, computing circuit, and read-write control circuit. The macro achieves energy efficiencies of 26.4 TOPS/W and 42.8 TOPS/W for convolutional-layer and fully-connected-layer computations, respectively, at 8-bit precision, with an area efficiency of 0.523 TOPS/mm².
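
The shift-accumulate decomposition behind the low-bit/high-bit split can be illustrated arithmetically. The unsigned 5/3-bit split below is an assumption made purely for illustration; the paper's actual scheme handles signed weights through its encoding and post-processing:

```python
def split_mac(xs, ws, low_bits=5):
    """Decompose a MAC over 8-bit weights into a low-bit part (computed in
    the time domain on-chip in the paper) and a high-bit part (digital
    domain), recombined by shift-accumulate. Unsigned weights assumed here
    for simplicity; the paper's signed encoding is not reproduced."""
    mask = (1 << low_bits) - 1
    low = sum(x * (w & mask) for x, w in zip(xs, ws))        # time-domain part
    high = sum(x * (w >> low_bits) for x, w in zip(xs, ws))  # digital part
    return (high << low_bits) + low

# Recombination is exact: shifting the high partial sum restores the product.
xs = [17, 200, 3, 255]
ws = [91, 14, 250, 7]
result = split_mac(xs, ws)
```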

  • Special Issue of Emerging Computing Chip Design
    HU Dongwei, BA Xiaohui, LIU Gengting, WANG Linan, LEI Yuejun
    Integrated Circuits and Embedded Systems. 2025, 25(8): 81-90. https://doi.org/10.20193/j.ices2097-4191.2025.0044

    In the cache-coherent Network-on-Chip (NoC) of a many-core CPU, the snooping and snooping-response process (SNP process) incurs long latency. To address this, two techniques, multicast routing and adaptive routing, are proposed in this paper. Based on the requirements of these two techniques, NoC packet formats for the Snooping Request Channel (SNP REQ Ch) and Snooping Response Channel (SNP RESP Ch) are proposed, and the NoC routers for both channels are implemented in VLSI. The implementation results show that the routers for the SNP REQ Ch and SNP RESP Ch occupy 85 940.3 μm² and 103 518.5 μm², respectively, while an 8×8 network occupies 5.57 mm², which is feasible for large-scale chips. Simulations compare the latencies of four configurations: unicast deterministic routing, unicast adaptive routing, multicast deterministic routing, and multicast adaptive routing. The results show that, compared with unicast deterministic routing, multicast adaptive routing cuts the SNP-process latency by 45% for a single snooping request, well below DDR/HBM access latency, and by 73% for 32 consecutive snooping requests with the outstanding technique employed at the Point of Coherency (PoC), validating the effectiveness of the proposed techniques.

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    ZHENG Ziyang, WANG Pengjun, LI Gang, CHEN Bo, YANG Xinrong, LI Xiangyu
    Integrated Circuits and Embedded Systems. 2025, 25(6): 29-38. https://doi.org/10.20193/j.ices2097-4191.2025.0018

    With the rapid advancement of information technology and artificial intelligence, the increasingly complex functions of IoT terminal devices, combined with their limited hardware resources, have led to significant security threats. To address this, this paper proposes a dual-mode configurable software PUF (Physical Unclonable Function) design based on the DSP IP core, leveraging the timing-violation behavior of sampling registers and the combinational logic delay characteristics within the FPGA's DSP IP core. First, the internal structure of the DSP IP cores in a Xilinx Artix-7 FPGA is analyzed, and the clock period range for normal data transmission is determined from their combinational logic delay information and timing constraints. Next, two distinct operating modes are configured according to the required challenge bit length, with an overclocked clock applied to induce abnormal computational results through timing violations in the sampling registers. Finally, a hash algorithm and parity check compress the abnormal data of varying bit lengths into a 1-bit PUF response. This design eliminates the need for additional bias extraction circuits and allows flexible configuration of two different challenge bit lengths for the software PUF without modifying the hardware structure. The experimental results demonstrate that both operating modes achieve a reliability of over 98%, with excellent uniqueness and resistance to machine learning attacks, validating the scheme's feasibility and its advantages in security and practicality.

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    BAO Chaowei, FAN Wei, XIONG Gengfan, TANG Wantao
    Integrated Circuits and Embedded Systems. 2025, 25(6): 58-67. https://doi.org/10.20193/j.ices2097-4191.2025.0029

    To address the problems of the serial computation of MCUs and DSPs widely used in permanent magnet synchronous motor control, namely insufficient dynamic accuracy and long development cycles for complex vector control algorithms, a vector control technique for Permanent Magnet Synchronous Motors (PMSM) based on a domestic FPGA is proposed. Using modular Hardware Description Language (HDL) design and Electronic Design Automation (EDA) methods, the dual closed-loop feedforward PI control strategy, the Space Vector Pulse Width Modulation (SVPWM) algorithm, and underlying key modules such as coordinate transformation and encoder feedback are designed independently, and logic and timing simulation of the key functional modules is carried out in ModelSim. Finally, a PMSM vector control hardware system based on the domestic FPGA is constructed. SVPWM waveform tests, dynamic-accuracy tests under multiple input signals such as step and square-wave signals, and chip performance tests verify the effectiveness of the proposed vector control technique.
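
The coordinate-transformation module mentioned above implements the standard Clarke and Park transforms of field-oriented control. A minimal floating-point software model (not the fixed-point HDL of the paper; the amplitude-invariant Clarke form is an illustrative choice) looks like:

```python
import math

def clarke(ia, ib, ic):
    """Clarke transform: three-phase currents -> stationary alpha-beta
    frame (amplitude-invariant form)."""
    alpha = (2.0 * ia - ib - ic) / 3.0
    beta = (ib - ic) / math.sqrt(3.0)
    return alpha, beta

def park(alpha, beta, theta):
    """Park transform: alpha-beta -> rotating d-q frame at rotor angle theta."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q

# Balanced sinusoidal phase currents map to constant d-q values, which is
# what lets the dual-loop PI controllers regulate DC-like quantities.
theta = 0.7
ia = math.cos(theta)
ib = math.cos(theta - 2.0 * math.pi / 3.0)
ic = math.cos(theta + 2.0 * math.pi / 3.0)
d, q = park(*clarke(ia, ib, ic), theta)
```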

  • Special Issue of Emerging Computing Chip Design
    DU Xirui, YIN Guodong, CHEN Yiming, CHEONG Ling-An, YU Tianyi, YANG Huazhong, LI Xueqing
    Integrated Circuits and Embedded Systems. 2025, 25(8): 10-22. https://doi.org/10.20193/j.ices2097-4191.2025.0041

    Neural networks are representative artificial intelligence algorithms, but their huge parameter counts pose new challenges for hardware deployment at the edge. On the one hand, application flexibility requires that the computing hardware be able to adapt the deployed model across tasks through parameter fine-tuning at the edge. On the other hand, improving computing energy efficiency and performance requires large-capacity on-chip storage to reduce off-chip memory access costs. The recently proposed ROM-SRAM hybrid compute-in-memory architecture is a promising solution under mature CMOS technology: thanks to high-density ROM-based compute-in-memory, most of the neural network's weights can be stored on chip, cutting reliance on off-chip memory access, while SRAM-based compute-in-memory provides the flexibility that high-density ROM lacks. To expand the design and application space of the ROM-SRAM hybrid architecture, it is necessary to further improve the density of ROM-based compute-in-memory to support larger networks, and to explore ways of gaining greater flexibility from a small amount of SRAM compute-in-memory. This paper introduces several common techniques for improving the memory density of ROM-based compute-in-memory, as well as neural network fine-tuning methods based on the ROM-SRAM hybrid architecture to improve flexibility. Solutions for deploying ultra-large-scale neural networks and for the dynamic matrix multiplication bottleneck of long-sequence large language models are discussed, and an outlook on the broad design space and application prospects of the ROM-SRAM hybrid compute-in-memory architecture is provided.

  • Special Issue of Emerging Computing Chip Design
    XU Junjie, WEI Jinghe, LIU Guozhu, HE Jian, ZHANG Zheng
    Integrated Circuits and Embedded Systems. 2025, 25(8): 74-80. https://doi.org/10.20193/j.ices2097-4191.2025.0051

    PCIe and SRIO are mainstream high-speed communication interface protocols. In big-data application scenarios represented by artificial intelligence, compatibility among these protocols is key to building large computing systems that break through storage and computing-power bottlenecks. To meet this requirement, the CIP interconnection core realizes multi-protocol conversion and interaction among PCIe, SRIO, DDR, and NAND Flash over a unified routing network. Among these, PCIe is the main human-computer interaction interface, and building a PCIe RP (Root Port) system is the basis of PCIe communication. Existing operating-system-based PCIe read/write approaches suffer from high latency and poor operability. To solve these problems, a PCIe RP system is built around a Cortex-M3 processor, and the corresponding drivers and software are developed, achieving efficient and accurate data transfer between PCIe and various devices. On top of the basic functions, stability tests of 50 000, 100 000, and 150 000 large-scale data interactions were completed. The results show that the system remains stable across large-scale data interaction events, providing a solution for data interaction between the processor and PCIe.

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    FAN Jicong, YU Zongguang, XIE Da, SHAN Yueer, XU Zhongyan
    Integrated Circuits and Embedded Systems. 2025, 25(6): 14-28. https://doi.org/10.20193/j.ices2097-4191.2025.0016

    Compared to custom-designed chips, Field Programmable Gate Arrays (FPGAs) support flexible hardware reconfiguration and offer advantages such as shorter design cycles and lower development costs. They are widely used in fields such as communications, data centers, radar, and aerospace. FPGA architecture design aims to create highly programmable FPGA chips while minimizing the area and performance costs of reconfigurability. With the continuous evolution of application demands and process technology capabilities, FPGA architecture design is entering a new phase. This article briefly describes the basic FPGA architecture and its evaluation, summarizes the latest developments in novel FPGA architectures and circuit design technologies, and discusses the technical challenges and development trends of novel FPGA architecture and circuit design.

  • Special Issue of Emerging Computing Chip Design
    YAN Zheng, ZHANG Chenshuo, BAI Yichuan, DU Yuan, DU Li
    Integrated Circuits and Embedded Systems. 2025, 25(8): 41-52. https://doi.org/10.20193/j.ices2097-4191.2025.0042

    Convolution is the most common operation in CNNs, and the multiply-accumulate operations in convolution are power-hungry, limiting the performance of many CNN hardware accelerators. Reducing the number of multiplications in convolution is an effective way to improve accelerator performance. As a fast convolution algorithm, the Winograd algorithm can eliminate up to 75% of the multiplications in convolution. However, the model weights transformed for Winograd convolution have a significantly different distribution, which requires a longer quantization bit width to maintain comparable accuracy and neutralizes the hardware savings brought by the reduced multiplication count. In this paper, we analyze this problem quantitatively and propose a new quantization scheme for Winograd convolution. The quantized Winograd computation hardware module is implemented with an accuracy loss of less than 1%. To further reduce hardware cost, we apply approximate multipliers (AM) to Winograd convolution. Compared with a conventional convolution computation block, the Winograd block saves 27.3% of the area, and applying the approximate multiplier in the Winograd block saves 39.6% of the area without significant performance loss.
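
The multiplication saving comes from Winograd's transform identity. A minimal 1-D F(2, 3) sketch (4 multiplications instead of 6 for two outputs of a 3-tap filter; larger tiles such as F(4×4, 3×3) reach the 75% figure) also shows where the quantization problem arises, since the transformed filter Gg has a wider value range than g:

```python
import numpy as np

# Winograd F(2, 3) transform matrices (standard values).
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs, 4 multiplications."""
    U = G @ g            # transformed filter: the differently distributed weights
    V = BT @ d           # transformed input
    return AT @ (U * V)  # elementwise product: the 4 multiplications

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
y = winograd_f23(d, g)
# Direct sliding-window computation for reference (6 multiplications).
ref = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
```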

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    TIAN Chunsheng, CHEN Lei, WANG Shuo, ZHOU Jing, WANG Zhuoli, ZHANG Yaowei
    Integrated Circuits and Embedded Systems. 2025, 25(6): 68-77. https://doi.org/10.20193/j.ices2097-4191.2025.0024

    To address problems such as excessive resource overhead, high memory consumption, and low routing efficiency in the routing of large-scale FPGAs, a resource-friendly coarse-grained parallel routing method tailored to large-scale FPGAs is proposed. First, a non-intrusive data optimization technique is proposed to reduce the resource overhead and memory consumption caused by the routing resource graph, addressing the memory explosion that comes with growing FPGA scale and providing a data foundation for the routing method. Second, adaptive load balancing and high-fanout net partitioning techniques are introduced to tackle the low parallelism of coarse-grained parallel routing, improving overall routing efficiency. The experimental results show that the proposed method achieves a 3.18× speedup in runtime while reducing resource and memory consumption by 90%, without compromising performance metrics such as wirelength and critical path delay.

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    ZHANG Yanlong, ZHOU Jing, TAI Yu, CAI Haiyang, WANG Shuo, XIAO Ke, ZHANG Xueting, DONG Han, DU Zhong
    Integrated Circuits and Embedded Systems. 2025, 25(6): 48-57. https://doi.org/10.20193/j.ices2097-4191.2025.0022

    To address the high resource overhead and low efficiency of current FPGA configuration bitstream decryption and authentication, this paper proposes the GMAC_GF32 authentication algorithm based on multiplication over the finite field GF(2³²). Combined with AES encryption in CTR mode, an efficient and highly secure FPGA configuration bitstream decryption and authentication method is designed and implemented. The method employs a four-stage pipeline for the AES256_CTR decryption module, ensuring that each decryption cycle aligns with the time required to transmit 128 bits of data, thereby maximizing the FPGA's decryption throughput. Each pipeline stage also enhances power side-channel security by utilizing sixteen S-boxes operating in parallel. The authentication module extends the verification code to 32 bits through GF(2³²) operations, effectively mitigating the inefficiency of serial verification-code computation and improving clock utilization, and it enhances security by incorporating built-in polynomial functions to prevent the loading of malicious bitstreams. Experimental validation on an FPGA prototype board demonstrates that the proposed pipelined decryption approach optimizes the AES256_CTR algorithm, compressing the decryption process to four clock cycles. The authentication method significantly reduces the additional authentication data volume and hidden time cost while maintaining security strength, achieving a 96.5% reduction in area resource consumption for the authentication algorithm and no noticeable increase in the overall decryption-authentication circuit area. The proposed method is well suited to FPGA chip design scenarios requiring high performance and robust security.
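
A GF(2³²) multiplication of the kind the authentication algorithm relies on is a carry-less product followed by reduction modulo a degree-32 polynomial, chained Horner-style over message blocks. The sketch below uses the CRC-32 polynomial purely as an illustrative modulus; the paper's built-in polynomial functions are not specified here:

```python
POLY = 0x104C11DB7  # degree-32 CRC-32 polynomial (illustrative choice only)

def gf32_mul(a, b):
    """Multiply two 32-bit field elements: XOR-based (carry-less) product,
    then reduction modulo the degree-32 polynomial, all over GF(2)."""
    acc = 0
    for i in range(32):          # carry-less multiplication
        if (b >> i) & 1:
            acc ^= a << i
    for i in range(62, 31, -1):  # reduce the up-to-63-bit product
        if (acc >> i) & 1:
            acc ^= POLY << (i - 32)
    return acc

def gmac_tag(blocks, h):
    """Horner-style GHASH-like tag over 32-bit message blocks with key h:
    tag = (...((m1*h) ^ m2)*h ...) * h."""
    tag = 0
    for m in blocks:
        tag = gf32_mul(tag ^ m, h)
    return tag
```

In hardware, each such multiplication is a block of XOR trees, which is what makes a 32-bit verification code per cycle feasible.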

  • Research Paper
    YE Anlong, MA Lingkun, QU Zongyi
    Integrated Circuits and Embedded Systems. 2025, 25(5): 52-59. https://doi.org/10.20193/j.ices2097-4191.2025.0006

    The Least Mean Square (LMS) algorithm, a typical adaptive filtering algorithm, is widely used for noise suppression. Its implementation is mainly based on general-purpose processors, which suffer from low computational efficiency and performance. The RISC-V architecture, with its open-source, streamlined, and extensible nature, is well suited to dedicated processors. In this paper, a RISC-V based specialized processor is designed for the LMS algorithm. The F instruction-set extension is used to process floating-point numbers, and MAC instructions are added to the coprocessor to accelerate the LMS algorithm. The experimental results show that the processor achieves effective noise cancellation: with an input signal-to-noise ratio of 5 dB, the signal-to-noise ratio after cancellation is 17.5 dB. When the system executes the LMS algorithm with the FPU alone, the instruction count is 220 354 and execution takes 586 221 cycles; in the proposed FPU+MAC mode, the instruction count is 31 621 and execution takes 89 412 cycles, a remarkable efficiency improvement.

  • Cover Article
    XING Gen, HU Zhiyuan, BI Dawei
    Integrated Circuits and Embedded Systems. 2025, 25(5): 1-7. https://doi.org/10.20193/j.ices2097-4191.2024.0089

    In industrial control algorithms, the computation of mathematical functions typically requires a large number of clock cycles, limiting algorithm performance. This paper analyzes and compares methods for computing floating-point mathematical functions, and designs a piecewise table-lookup polynomial fitting method based on the Remez algorithm, which is well suited to hardware implementation. The corresponding hardware circuits are implemented as RISC-V custom instructions, tightly coupled with the RISC-V processor core. The experimental results show that, compared with no custom extension instructions, the processor's latency in computing mathematical functions is reduced by 93.62%; compared with computing them via the CORDIC instruction set, the latency is reduced by 79.83%. This work provides new ideas and solutions for RISC-V embedded microprocessors with low cost and stringent real-time requirements.
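
The piecewise table-lookup polynomial method reduces to a segment index followed by Horner evaluation, a handful of MACs per function call. In the sketch below, np.polyfit's least-squares fit stands in for the minimax Remez fit of the paper, and sin(x) on [0, π/2] with 16 cubic segments is an illustrative choice:

```python
import numpy as np

# Build a table of cubic polynomials, one per segment of [0, pi/2].
SEGMENTS = 16
WIDTH = (np.pi / 2) / SEGMENTS
edges = np.arange(SEGMENTS) * WIDTH
table = []
for lo in edges:
    xs = np.linspace(lo, lo + WIDTH, 64)
    # Least-squares fit in the local variable t = x - lo (coeffs, highest first);
    # the paper uses a Remez (minimax) fit here instead.
    table.append(np.polyfit(xs - lo, np.sin(xs), 3))

def sin_lut(x):
    """Evaluate sin(x) by segment lookup plus Horner's rule (3 MACs)."""
    i = min(int(x / WIDTH), SEGMENTS - 1)
    t = x - edges[i]
    c3, c2, c1, c0 = table[i]
    return ((c3 * t + c2) * t + c1) * t + c0

max_err = max(abs(sin_lut(x) - np.sin(x))
              for x in np.linspace(0.0, np.pi / 2, 1001))
```

In hardware, the table index comes from the operand's leading bits and the Horner chain maps onto a short fixed pipeline, which is what makes the custom-instruction latency so low.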

  • Research Paper
    ZHAO Chenxu, QU Yingjie, WANG Haiting
    Integrated Circuits and Embedded Systems. 2025, 25(4): 47-53. https://doi.org/10.20193/j.ices2097-4191.2024.0088

    With the development of intelligent transportation systems, license plate recognition systems have moved from traditional PC platforms to portable embedded terminals, imposing higher demands on the accuracy, speed, and security of existing systems. RISC-V is an open-source, streamlined, efficient, low-power, and modular instruction set architecture offering a high degree of flexibility. In this paper, a license plate recognition system based on the Hummingbird E203 RISC-V processor is designed, using an improved eight-direction Sobel operator for high-precision edge detection, and implemented on the Da Vinci PRO development board. The experimental results show that the system achieves a recognition accuracy of 96% with an average recognition time of around 45 ms, demonstrating high accuracy and real-time performance and outperforming traditional license plate recognition systems.
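
An eight-direction Sobel operator evaluates rotated 3×3 kernels and keeps the strongest response per pixel. The rotation-by-ring construction below sketches the classic eight-direction scheme only; the paper's improved operator is not reproduced here:

```python
import numpy as np

# The 8 directional kernels come from rotating the ring of eight neighbor
# weights of the 0-degree Sobel kernel one step at a time (45 deg each).
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
BASE = [-1, 0, 1, 2, 1, 0, -1, -2]  # 0-degree (vertical-edge) Sobel weights

def sobel_kernels():
    kernels = []
    for rot in range(8):
        k = np.zeros((3, 3))
        for idx, (r, c) in enumerate(RING):
            k[r, c] = BASE[(idx - rot) % 8]
        kernels.append(k)
    return kernels

def edge_response(img):
    """Max absolute response over the 8 directions at each interior pixel."""
    h, w = img.shape
    ks = sobel_kernels()
    out = np.zeros((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            patch = img[r:r + 3, c:c + 3]
            out[r, c] = max(abs(np.sum(patch * k)) for k in ks)
    return out

# A vertical step edge between columns 2 and 3 gives a strong response there.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
resp = edge_response(img)
```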

  • Special Issue on FPGA Cutting-edge Technologies and Applied Research
    FENG Yutai, XU Wenyang, CHEN Fan, WANG Jiaxing, TANG Yongming, SUN Hao
    Integrated Circuits and Embedded Systems. 2025, 25(6): 78-86. https://doi.org/10.20193/j.ices2097-4191.2025.0021

    The optical flow method constructs a dense motion-field representation by analyzing pixel displacements between frames, quantifying the motion direction and velocity of objects in the scene with sub-pixel accuracy; it is a core technology for applications such as embodied intelligence, intelligent sensing in the low-altitude economy, and localization and navigation. However, dense optical flow algorithms have high computational complexity, and their multi-layer pyramid structure and inter-layer data dependencies lead to inefficient memory access and idle computational resources, which together limit real-time, efficient deployment at the edge. To solve this problem, this paper proposes a real-time, efficient FPGA hardware acceleration scheme for the dense Lucas-Kanade (LK) pyramid optical flow algorithm, based on a co-design of algorithm, architecture, and circuits. The scheme optimizes accuracy and hardware friendliness through batched bilinear interpolation and temporal gradient generation, improves hardware parallelism through a folded multi-layer pyramid design, and improves the memory-access efficiency of the pyramid downsampling process through a three-stage segmented architecture, significantly improving the energy efficiency and real-time performance of dense LK optical flow computation. Measurements on the AMD KV260 platform show that the hardware accelerator processes 102 times faster than a high-performance CPU, achieving 62 frame/s real-time processing at 752×480 resolution with an average endpoint error (AEE) of 0.522 pixel and an average angular error (AAE) of 0.325°, providing a high-accuracy, low-latency hardware-accelerated solution for highly dynamic visual perception scenarios.