Hyperdimensional computing (HDC), an emerging computing paradigm inspired by the human brain, offers several notable advantages, including low complexity, strong robustness, and high interpretability, giving it great potential for a wide range of edge-side applications. HDC mimics the brain's information-processing mechanisms: by leveraging hyperdimensional vectors and simple logical operations, it can accomplish complex cognitive functions. Instead of relying on the complicated architecture of multi-layer neural networks, it employs a lightweight encoding-querying process, opening a fresh technical avenue for highly efficient edge-side artificial intelligence chips. This review provides an in-depth analysis of the theoretical foundations and algorithmic development of HDC, and investigates the feasibility of hardware acceleration at every step of the HDC pipeline. On this basis, the review focuses on dedicated hardware for the querying step, summarizes three implementation approaches (FPGA, ASIC, and in-memory computing), and analyzes the advantages and disadvantages of each. Moreover, considering the shortcomings of existing hardware for hyperdimensional querying, the review presents the most recent research advances. Finally, the challenges facing HDC hardware are delineated, and promising directions for future research are outlined.
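As a concrete illustration of the encoding-querying process described above, the following minimal Python sketch builds a class prototype from random bipolar hypervectors and answers a query by similarity search; the dimensionality, the multiplicative binding, and the majority-vote bundling are generic HDC conventions assumed for illustration rather than details drawn from the review.

```python
import numpy as np

D = 10000                      # hypervector dimensionality (a typical HDC choice)
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice((-1, 1), size=D)

def bind(a, b):
    """Binding: element-wise multiplication, the XOR analogue for bipolar HVs."""
    return a * b

def bundle(hvs):
    """Bundling: element-wise majority vote with a deterministic tie-break."""
    s = np.sum(hvs, axis=0)
    return np.where(s >= 0, 1, -1)

def similarity(a, b):
    """Normalized dot product used in the querying step."""
    return float(a @ b) / D

# Encoding: bundle feature-value bindings into a class prototype.
features = [random_hv() for _ in range(4)]
values = [random_hv() for _ in range(4)]
prototype = bundle([bind(f, v) for f, v in zip(features, values)])

# Querying: a sample encoded from the same pairs matches its prototype.
sample = bundle([bind(f, v) for f, v in zip(features, values)])
print(similarity(sample, prototype))    # ~1.0: correct class
print(similarity(sample, random_hv()))  # ~0.0: unrelated vector
```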
Neural networks are representative artificial intelligence algorithms, but their huge number of parameters poses new challenges for hardware deployment at the edge. On the one hand, application flexibility requires that the computing hardware be able to adapt a deployed model across tasks through parameter fine-tuning at the edge. On the other hand, improving computing energy efficiency and performance requires large-capacity on-chip storage to reduce the cost of off-chip memory access. The recently proposed ROM-SRAM hybrid compute-in-memory architecture is a promising solution under mature CMOS technology. Thanks to high-density ROM-based compute-in-memory, most of the weights of a neural network can be stored on chip, cutting the reliance on off-chip memory access, while SRAM-based compute-in-memory provides the flexibility that high-density ROM lacks. To expand the design and application space of this architecture, it is necessary to further improve the density of ROM-based compute-in-memory to support larger networks, and to explore ways of obtaining greater flexibility from a small amount of SRAM-based compute-in-memory. This paper introduces several common techniques for improving the memory density of ROM-based compute-in-memory, as well as neural network fine-tuning methods based on the ROM-SRAM hybrid architecture that improve flexibility. Solutions for deploying ultra-large-scale neural networks and for the dynamic matrix-multiplication bottleneck of large language models with long sequences are discussed, and an outlook on the broad design space and application prospects of the ROM-SRAM hybrid compute-in-memory architecture is provided.
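To make the division of labor concrete, the sketch below models a layer whose weights are split into a large frozen ROM portion and a small rewritable SRAM correction; the additive low-rank form of the SRAM part is an illustrative assumption, not the specific fine-tuning method surveyed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

class HybridCIMLayer:
    """Toy model of one layer on a ROM-SRAM hybrid CIM architecture.

    W_rom is large, high-density, and frozen at tape-out (ROM-based CIM);
    A and B form a small low-rank correction held in rewritable SRAM-based
    CIM. The additive low-rank split is an illustrative assumption.
    """

    def __init__(self, d_in, d_out, sram_rank=4):
        self.W_rom = rng.standard_normal((d_in, d_out))         # frozen weights
        self.A = 0.01 * rng.standard_normal((d_in, sram_rank))  # SRAM factor
        self.B = np.zeros((sram_rank, d_out))                   # SRAM factor

    def forward(self, x):
        # Output = fixed ROM matmul + small SRAM correction.
        return x @ self.W_rom + (x @ self.A) @ self.B

    def finetune_step(self, x, grad_out, lr=1e-2):
        # Edge fine-tuning touches only the SRAM-resident factors.
        grad_A = x.T @ (grad_out @ self.B.T)
        grad_B = (x @ self.A).T @ grad_out
        self.A -= lr * grad_A
        self.B -= lr * grad_B

layer = HybridCIMLayer(64, 32)
x = rng.standard_normal((8, 64))
y = layer.forward(x)                                # inference path
layer.finetune_step(x, grad_out=np.ones_like(y))    # task adaptation at the edge
```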
With the rapid advancement of cutting-edge technologies such as artificial intelligence and quantum computing, the demand for high-performance computing chips continues to grow. However, traditional von Neumann architectures are increasingly constrained by the memory wall and the power wall, making it difficult to meet the computing demands of data-intensive applications. Cryogenic in-memory computing combines the superior electrical properties of cryogenic CMOS devices with the high-bandwidth, low-latency advantages of in-memory computing architectures, providing a new route past these computing bottlenecks. This review summarizes the key characteristics of CMOS devices and various memory media at cryogenic temperatures, and systematically surveys representative architectures, key implementations, and performance metrics of cryogenic in-memory computing in the fields of artificial intelligence and quantum computing. It then analyzes the challenges and development trends at the levels of device technology, circuit and system design, and EDA tools.
As Moore’s Law slows down, domain-specific SoCs (DSSoCs) have emerged as a promising energy-efficient design strategy by integrating domain-specific accelerators (DSAs). However, the DSSoC design process remains highly complex, leading to prolonged development cycles and significant labor effort. Recent advances in large language models (LLMs) have introduced new methodologies for agile chip design, demonstrating substantial potential in code and EDA script generation. In this work, an LLM-based multi-agent framework for DSSoC design is proposed, covering end-to-end design stages from architecture definition through code generation to EDA physical implementation. The approach is validated through two case studies involving two- to four-week SoC designs at the 22 nm and 7 nm process nodes. Evaluations show that the generated SoCs achieve energy-efficiency improvements of 4.84× and 3.82× compared with SoCs generated by an existing framework.
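A hypothetical skeleton of such a staged multi-agent flow is sketched below; the agent roles, prompts, and the `call_llm` helper are placeholders and do not reflect the actual framework's interfaces.

```python
# Hypothetical skeleton of a staged LLM multi-agent flow for DSSoC design.
# `call_llm` is a placeholder, not a real API.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a chat-completion call with a role-specific system prompt."""
    raise NotImplementedError("wire this to an actual LLM endpoint")

def architect_agent(spec: str) -> str:
    return call_llm("architect", f"Define a DSSoC architecture for: {spec}")

def rtl_agent(arch: str) -> str:
    return call_llm("rtl-coder", f"Generate synthesizable RTL for: {arch}")

def eda_agent(rtl: str, node: str) -> str:
    return call_llm("eda", f"Emit synthesis and P&R scripts at {node} for: {rtl}")

def design_flow(spec: str, node: str = "22nm") -> str:
    arch = architect_agent(spec)   # stage 1: architecture definition
    rtl = rtl_agent(arch)          # stage 2: code generation
    return eda_agent(rtl, node)    # stage 3: EDA physical implementation
```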
Convolution is the most common operation in convolutional neural networks (CNNs), and the high power consumption of its multiply-accumulate operations limits the performance of many CNN hardware accelerators. Reducing the number of multiplications in convolution is therefore an effective way to improve accelerator performance. As a fast convolution algorithm, the Winograd algorithm can reduce the multiplications in convolution by up to 75%. However, the weights of a model transformed for Winograd convolution follow a markedly different distribution, which requires a longer quantization bit width to maintain comparable accuracy and offsets the hardware savings brought by the reduction in multiplications. In this paper, we analyze this problem quantitatively and propose a new quantization scheme for Winograd convolution. The quantized Winograd computation hardware module is implemented with an accuracy loss of less than 1%. To further reduce hardware cost, we apply an approximate multiplier (AM) to Winograd convolution. Compared with a conventional convolution computation block, the Winograd block saves 27.3% of the area, and applying the approximate multiplier in the Winograd block saves 39.6% of the area without significant performance loss.
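The savings come from transforms such as the one-dimensional F(2,3) case sketched below, which produces two outputs of a 3-tap filter with four multiplications instead of six; the transform matrices are the standard Winograd ones, and the sketch is independent of the quantization scheme proposed in the paper.

```python
import numpy as np

# Standard 1-D Winograd F(2,3) transform matrices.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of the 3-tap filter over d, using 4 multiplies instead of 6."""
    U = G @ g     # transformed filter (can be precomputed offline)
    V = B_T @ d   # transformed input tile
    M = U * V     # element-wise products: the only multiplications
    return A_T @ M

d = np.array([1.0, 2.0, 3.0, 4.0])          # input tile of 4 samples
g = np.array([0.5, 1.0, -1.0])              # 3-tap filter
print(winograd_f23(d, g))                   # [-0.5, 0.0]
print(np.correlate(d, g, mode="valid"))     # direct CNN-style result: matches
```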
Computing-in-memory (CIM) based on spin-transfer-torque magnetic random access memory (STT-MRAM) is expected to be an effective way to overcome the "memory wall" bottleneck. This paper proposes a high-energy-efficiency time-domain CIM design scheme for STT-MRAM. A custom series-connected memory cell structure, built from series transistors and complementary MTJ pairs, forms a magnetoresistance chain across multiple rows of memory cells in computing mode, and a time-domain conversion circuit converts the chain resistance into a pulse-delay signal. A complementary series array architecture is further designed, generating differential time signals by storing positive and negative weights separately to support signed computation. For the quantization circuit, a successive-approximation-register (SAR) time-to-digital converter (TDC) is proposed, combining a voltage-adjustable delay chain with flip-flops. To achieve multi-bit multiply-accumulate operations, a signed weight-encoding scheme and a digital post-processing architecture are proposed: through encoded weight mapping and digital shift-accumulate algorithms, the 8-bit-input, 8-bit-weight multiply-accumulate operation is decomposed into a low 5-bit time-domain calculation and a high-bit digital-domain calculation, producing a 21-bit full-precision result. Layout design and post-layout simulation were completed in a 28 nm CMOS process. At 0.9 V, a 9-bit multiply-accumulate operation with a resolution margin of 270 ps was achieved at an energy cost of only 16 fJ per operation, and the designed 5-bit SAR-TDC achieves highly linear time-to-digital conversion. A 9 Kb time-domain CIM macrocell with an area of 0.026 mm² was designed, including the memory cell array, SAR-TDC module, computing circuit, and read-write control circuit. The macrocell achieves energy efficiencies of 26.4 TOPS/W and 42.8 TOPS/W for convolutional-layer and fully connected-layer calculations, respectively, while delivering 8-bit-precision computation and an area efficiency of 0.523 TOPS/mm².
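The shift-accumulate recombination can be checked at the bit level with the sketch below, which splits each operand at the 5-bit boundary and reassembles the full-precision result from shifted partial terms; the unsigned split shown here is a generic stand-in, since the paper's actual signed weight-encoding scheme is not detailed in the abstract.

```python
def split(v, low_bits=5):
    """Split an unsigned value: v == (hi << low_bits) | lo."""
    return v >> low_bits, v & ((1 << low_bits) - 1)

def mac_decomposed(xs, ws, low_bits=5):
    """Multiply-accumulate via low/high partial products and digital shifts.

    The low x low partial products stand in for the time-domain computation;
    the shifted cross and high terms stand in for the digital-domain part.
    Unsigned operands are used for simplicity; the paper's signed weight
    encoding is not detailed in the abstract.
    """
    acc = 0
    for x, w in zip(xs, ws):
        x_hi, x_lo = split(x, low_bits)
        w_hi, w_lo = split(w, low_bits)
        acc += x_lo * w_lo                               # low x low
        acc += (x_hi * w_lo + x_lo * w_hi) << low_bits   # cross terms, shifted
        acc += (x_hi * w_hi) << (2 * low_bits)           # high x high, shifted
    return acc

xs, ws = [200, 17, 255], [45, 255, 3]
assert mac_decomposed(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```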
A communication interface between the NoC and the Flash controller is designed, consisting mainly of a request-path module, a protocol-conversion module, and a response-path module. The request-path module performs data verification and clock-domain-crossing processing on request packets sent by the NoC. The protocol-conversion module converts the processed packets into configuration commands in the form of AHB bus signals, configuring the Flash controller and directing the Flash storage device to complete erase, read, and write operations. When the Flash storage device returns response data, the protocol-conversion module packs the received data into response packets and feeds them back to the NoC through the response-path module. This communication interface improves packet-transmission efficiency between the NoC and the Flash controller, addressing the difficulty of efficient packet exchange in multi-chiplet interconnects and providing a technical foundation for the development of multi-chiplet integration technology.
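A schematic of the request path is sketched below: a verified NoC packet is translated into a sequence of AHB-style register writes; the packet fields, register offsets, and checksum are hypothetical placeholders, as the abstract does not specify the actual formats.

```python
from dataclasses import dataclass

# Hypothetical request-packet layout and AHB command; the real field widths
# and register map are not specified in the abstract.

@dataclass
class NocRequestPacket:
    dest_id: int      # target Flash controller node
    opcode: int       # e.g. 0 = read, 1 = write, 2 = erase
    address: int      # Flash address
    payload: bytes    # write data (empty for read/erase)
    parity: int       # simple checksum used for data verification

    def checksum(self) -> int:
        body = self.dest_id ^ self.opcode ^ self.address
        for b in self.payload:
            body ^= b
        return body & 0xFF

@dataclass
class AhbCommand:
    """AHB-style register write used to configure the Flash controller."""
    haddr: int
    hwdata: int

def convert(pkt: NocRequestPacket, ctrl_base: int = 0x4000_0000) -> list[AhbCommand]:
    """Protocol conversion: verified NoC packet -> AHB configuration writes."""
    if pkt.checksum() != pkt.parity:
        raise ValueError("request packet failed data verification")
    cmds = [AhbCommand(ctrl_base + 0x00, pkt.opcode),
            AhbCommand(ctrl_base + 0x04, pkt.address)]
    # Write data is streamed word by word through a data register.
    for i in range(0, len(pkt.payload), 4):
        word = int.from_bytes(pkt.payload[i:i + 4].ljust(4, b"\0"), "little")
        cmds.append(AhbCommand(ctrl_base + 0x08, word))
    return cmds
```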
PCIe and SRIO are mainstream high-speed communication interface protocols. In large-data application scenarios such as artificial intelligence, compatibility among these protocols is key to building high-compute-power systems that break through storage and compute bottlenecks. To meet this requirement, the CIP interconnect core implements multi-protocol conversion and interaction among PCIe, SRIO, DDR, and NAND Flash over a unified routing network. Among these, PCIe is the main human-computer interaction interface, and constructing a PCIe root-port (RP) system is the basis of PCIe communication. Existing operating-system-based PCIe read/write approaches suffer from high latency and poor operability. To address these problems, a PCIe RP system is built around a Cortex-M3 processor, and the corresponding drivers and software are developed, achieving efficient and accurate data transmission between PCIe and various devices. Beyond the basic functionality, stability tests with 50 000, 100 000, and 150 000 large-scale data interactions were completed. The results show that the system remains stable under large-scale data interaction, providing a solution for data interaction between the processor and PCIe devices.
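The following sketch shows the kind of configuration-space access an RP system performs during device enumeration, using the standard PCIe ECAM address layout; the base address is hypothetical and `read32` stands in for the bare-metal memory-mapped read implemented in the Cortex-M3 driver.

```python
# ECAM address = base + (bus << 20) + (device << 15) + (function << 12) + offset,
# per the PCIe specification (4 KB of config space per function).

ECAM_BASE = 0xE000_0000  # hypothetical; depends on the SoC memory map

def read32(addr: int) -> int:
    """Placeholder for the bare-metal MMIO read in the Cortex-M3 RP driver."""
    raise NotImplementedError("memory-mapped read on the target hardware")

def cfg_addr(bus: int, dev: int, fn: int, offset: int) -> int:
    return ECAM_BASE + (bus << 20) + (dev << 15) + (fn << 12) + offset

def probe(bus: int, dev: int, fn: int = 0):
    """Read Vendor/Device ID at config offset 0; 0xFFFF means no device present."""
    ident = read32(cfg_addr(bus, dev, fn, 0x00))
    vendor_id = ident & 0xFFFF
    return None if vendor_id == 0xFFFF else (vendor_id, ident >> 16)
```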
In the Cache-Coherent Network-on-Chip (NoC) of a many-core CPU, the snooping and snooping-response process (SNP Process) incurs long latency. To address this, two techniques, multicast routing and adaptive routing, are proposed in this paper. According to the requirements of these two techniques, NoC packet formats for the Snooping Request Channel (SNP REQ Ch) and Snooping Response Channel (SNP RESP Ch) are proposed, and the NoC routers for both channels are implemented in VLSI. The implementation results show that the SNP REQ Ch and SNP RESP Ch routers occupy 85 940.3 μm² and 103 518.5 μm², respectively, while an 8×8 network occupies 5.57 mm², which is feasible for large-scale chips. Simulations compare the latencies of four configurations: unicast deterministic routing, unicast adaptive routing, multicast deterministic routing, and multicast adaptive routing. The results show that, relative to unicast deterministic routing, multicast adaptive routing cuts the SNP Process latency by 45% for a single snooping request, yielding a latency much shorter than a DDR/HBM access, and by 73% for 32 consecutive snooping requests when the outstanding technique is employed at the Point of Coherency (PoC), validating the effectiveness of the proposed techniques.
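The multicast saving can be made concrete with the toy latency model below, which compares back-to-back unicast snoops with a single multicast delivered to all sharers on an 8×8 mesh; the XY hop-count model and the destination set are illustrative assumptions, not the proposed packet format or simulation setup.

```python
# Illustrative comparison of unicast vs. multicast snoop delivery on an
# 8x8 mesh. Hop counts stand in for latency; real router pipelines,
# contention, and the proposed packet formats are not modeled.

MESH = 8

def hops(src: int, dst: int) -> int:
    """XY-routing hop count between two node IDs on the mesh."""
    sx, sy = src % MESH, src // MESH
    dx, dy = dst % MESH, dst // MESH
    return abs(sx - dx) + abs(sy - dy)

def unicast_snoop_cost(src: int, sharers: list[int]) -> int:
    """One request per sharer, injected back-to-back: traffic serializes."""
    return sum(hops(src, d) for d in sharers)

def multicast_snoop_cost(src: int, sharers: list[int]) -> int:
    """Single packet forked in the network: cost ~ farthest destination."""
    return max(hops(src, d) for d in sharers)

src, sharers = 0, [7, 56, 63, 27]
print(unicast_snoop_cost(src, sharers))    # 34 hops of serialized traffic
print(multicast_snoop_cost(src, sharers))  # 14 hops to the farthest sharer
```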