Performance demands on modern microprocessors continue to escalate, making efficient reduction of critical-path delay a key research challenge. To address this problem, we propose a timing optimization method based on local logic remapping targeted at critical paths. The method uses critical-path information to construct small, localized sub-netlists within the topological neighborhood of each critical path. Each of these local netlists is then remapped: a netlist pool stores the results of synthesis runs under varying parameters, and the remapped netlist with the best timing is selected from the pool to replace the original local segment. We implemented this algorithm and evaluated it on seven open-source designs, using the Nangate 45 nm library at the typical process corner and the OpenROAD physical design flow. The results show that the optimized netlists achieve significant improvements in timing slack: the worst negative slack (WNS) improves by at least 1.120% and the total negative slack (TNS) by at least 11.646%. These gains validate the effectiveness of the proposed method for meeting the stringent timing constraints of high-performance microprocessor design and provide a practical means of improving timing closure in advanced digital circuits.
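To make the selection step concrete, here is a minimal Python sketch of the netlist-pool idea under toy assumptions: `resynthesize` merely simulates a synthesis-plus-timing run and is a hypothetical stand-in, not the authors' tool interface.

```python
import random

def resynthesize(window, params, rng):
    # Stand-in for a real synthesis + timing run; returns a simulated
    # worst negative slack (ns) for this window under these parameters.
    return {"window": window, "params": params, "wns": rng.uniform(-0.5, 0.1)}

def remap_window(window, param_sets, rng):
    # Netlist pool: one candidate per synthesis parameter set.
    pool = [resynthesize(window, p, rng) for p in param_sets]
    # Select the candidate with the least negative (best) worst slack.
    return max(pool, key=lambda c: c["wns"])

rng = random.Random(0)
best = remap_window("window_around_path_7", ["delay", "area", "balanced"], rng)
print(best["params"], round(best["wns"], 3))
```

In a real flow the replacement would only be committed when the winning candidate's slack beats the original local segment's slack.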
This paper presents the design and implementation of a high-efficiency computing circuit chip in a 22 nm FD-SOI process covering a wide voltage range from 0.2 V to 0.8 V. Energy efficiency is analyzed and optimized at four levels: architectural design, cell library selection, low-power logic synthesis, and physical design. By implementing and simulating different computing architectures, the architecture with the best overall performance is identified. Standard cell libraries with different channel lengths and threshold voltages are evaluated, and a mix of high-drive and low-leakage cells is used to balance performance and power consumption, reducing the energy per operation to 102.64 fJ/Op, a 17.5% improvement over a single-cell-type design. In the logic synthesis stage, the DesignWare-LP flow achieves a further 6.7% improvement in energy efficiency through logic restructuring and low-power cell replacement. In the physical design phase, cell placement density is controlled to further reduce parasitic capacitance. The optimized chip was validated through tape-out, with test results showing that at a 0.24 V operating voltage the adder and multiplier reach energy efficiencies of 1.55 fJ/Op and 14.1 fJ/Op, respectively, with latency below 100 ns, effectively addressing the shortcomings of existing solutions in wide-voltage adaptability and multi-dimensional energy efficiency optimization.
The rapid development of Generative Artificial Intelligence (GenAI), driven by breakthroughs in Deep Learning (DL) and Large Language Model (LLM) technologies, imposes increasingly stringent requirements on the performance and energy efficiency of the underlying computing hardware and its algorithms. As a fundamental operation, General Matrix Multiplication (GEMM) underlies the vast majority of computation in deep neural network training and inference, so the efficiency of matrix multiplication kernels directly affects model training time, inference latency, and operational cost, factors that are crucial to the practical deployment and scalability of AI solutions. The use of matrix extension instructions for AI workloads still leaves considerable room for optimization, and tuning matrix multiplication algorithms for domestic microprocessors is therefore of great significance. This paper focuses on GEMM performance optimization using matrix extension instructions on domestic microprocessors. GEMM efficiency is improved through instruction optimization, pipeline scheduling, and an outer-product extension, and the correctness and feasibility of the optimization scheme are verified through testing. Experimental results show that the method improves the efficiency of single-precision floating-point matrix multiplication by more than 10%.
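As an illustration of the outer-product formulation mentioned above, the following NumPy sketch accumulates C with one rank-1 (outer-product) update per step of the K dimension, inside cache-friendly tiles. It models the data flow that matrix extension instructions typically accelerate; it is not the paper's actual kernel, and the tile size is an arbitrary assumption.

```python
import numpy as np

def gemm_outer_product(A, B, tile=64):
    """Tiled GEMM where each K step contributes one outer product."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(K):  # rank-1 update: column of A times row of B
                C[i:i+tile, j:j+tile] += np.outer(A[i:i+tile, k], B[k, j:j+tile])
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 80).astype(np.float32)
assert np.allclose(gemm_outer_product(A, B), A @ B, rtol=1e-4, atol=1e-4)
```

The outer-product form exposes long independent accumulation chains per tile, which is why it maps well onto matrix/tile register extensions.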
As a representative approach to neuromorphic computing, spiking neural networks (SNNs) have been widely applied in perception and control tasks and in software-hardware collaborative computing. However, the Leaky Integrate-and-Fire (LIF) neuron model traditionally used in SNNs can only encode input features as binary spike signals, which severely limits feature expressiveness and performance in complex visual tasks such as object detection. This paper proposes a spiking neural network with a ternary activation layer based on a multi-valued logic neuron model. By modifying the activation range within the convolutional layers, the ternary activation layer significantly improves the performance of object detection models relative to binary activation. Experimental results on public datasets show that the proposed ternary activation raises the average precision (AP) across three classes of traffic signs from 80.8% to 92.5% compared with binary activation. In addition, to run this spiking neural network on a multi-valued logic computing system based on novel nano-devices, we also evaluate the model after parameter quantization. The results show that after quantizing the parameters to the integer range, model performance decreases by only slightly more than 1%. Compared with an equivalent ANN, accuracy decreases by 4.2% while the parameter count decreases by 81.6%.
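One plausible form of the ternary activation is sketched below in NumPy: the membrane potential is thresholded to {-1, 0, +1} instead of the binary {0, 1} firing of a standard LIF neuron. The activation levels and the symmetric threshold here are assumptions for illustration; the exact scheme in the paper may differ.

```python
import numpy as np

def binary_spike(v, theta=1.0):
    # Standard LIF-style firing: spike (1) only above the threshold.
    return (v >= theta).astype(np.float64)

def ternary_spike(v, theta=1.0):
    # Assumed ternary scheme: +1 above +theta, -1 below -theta, else 0,
    # giving the neuron three expressible output levels per time step.
    return np.where(v >= theta, 1.0, np.where(v <= -theta, -1.0, 0.0))

v = np.array([-2.3, -0.4, 0.0, 0.8, 1.7])  # toy membrane potentials
print(binary_spike(v))   # [0. 0. 0. 0. 1.]
print(ternary_spike(v))  # [-1.  0.  0.  0.  1.]
```

The extra level lets a single spike carry sign information, which is one intuition for the AP gain reported over binary activation.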
With semiconductor processes advancing into the nanometer scale, SoC complexity has surged dramatically, posing significant challenges to the efficiency and flexibility of conventional debugging approaches. As a critical component for SoC-external system interaction, the functional correctness and performance stability of DDR memory are paramount to chip reliability. This paper proposes an intelligent debugging solution based on GDB-OpenOCD co-design to address the debugging challenges of high-speed DDR controllers in modern SoCs. By deeply integrating the GDB and OpenOCD frameworks, the solution achieves unified JTAG management for multi-device systems and enables cross-module parallel debugging, substantially enhancing SoC observability. To tackle DDR debugging difficulties, we develop a modular parameter configuration architecture that supports dynamic reconfiguration of controller timing parameters, reducing debugging cycles by 50%. Furthermore, a systematic verification framework is established, incorporating a comprehensive memory stress test suite covering 16 test scenarios to thoroughly validate DDR functionality and performance. Through this hardware-software co-design approach, the solution significantly improves debugging efficiency and controllability, providing key technical references for autonomous debugging of domestic chips.
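As a hedged illustration of how such dynamic parameter reconfiguration can be scripted, the sketch below defines a custom command through GDB's documented Python API (gdb.Command, gdb.execute), to be sourced inside a GDB session attached via OpenOCD. The register names and addresses are hypothetical, not the actual controller map or the authors' implementation.

```python
# Source inside GDB:  (gdb) source set_ddr_timing.py
import gdb

# Hypothetical DDR controller timing register map (name -> MMIO address).
DDR_TIMING_REGS = {"tRCD": 0x40001010, "tRP": 0x40001014}

class SetDdrTiming(gdb.Command):
    """set-ddr-timing NAME VALUE: rewrite one DDR timing register."""

    def __init__(self):
        super().__init__("set-ddr-timing", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        name, value = arg.split()
        addr = DDR_TIMING_REGS[name]
        # Write the new timing value through the OpenOCD-backed target.
        gdb.execute(f"set *(unsigned int*){addr:#x} = {int(value, 0)}")

SetDdrTiming()  # register the command with GDB
```

Usage would then look like `(gdb) set-ddr-timing tRCD 0x8`, letting timing parameters be changed without rebuilding firmware.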
In resource-constrained near-sensor smart sensing systems, the deployment of deep neural networks (DNNs) faces severe energy efficiency and area challenges. Computing-in-memory (CIM) architectures circumvent the data movement overhead of the von Neumann architecture by performing parallel multiply-accumulate (MAC) operations in situ within memory arrays, achieving significant improvements in both energy and area efficiency. However, as the bit width and scale of MAC computation grow, high-precision analog-to-digital and digital-to-analog conversion and long-distance data routing incur unacceptable energy and latency overheads, eroding CIM's energy efficiency. To address these issues, this work proposes an all-analog CIM macro supporting multi-bit MAC. The design employs a grouped row-capacitor scheme for DAC-less parallel conversion of multi-bit input activations (IA), and integrated C-2C capacitor ladders weight the signed multi-bit weights within the analog MAC core. The macro is implemented in a TSMC 22 nm process with a power consumption of 0.128 mW and an area of 0.06 mm². The measured throughput is 76.8 GOPS, achieving a high energy efficiency of 600 TOPS/W and an area efficiency of 1.28 TOPS/mm².
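The following is an idealized Python behavioral model of the C-2C weighting idea, under the assumption that each ladder stage halves the significance of the next weight bit and that the sign bit follows two's complement; analog non-idealities (capacitor mismatch, parasitics) are ignored, and the macro's real signal chain is not described here.

```python
def c2c_weight(bits_msb_first, signed=True):
    """Assumed behavioral value of a C-2C-weighted multi-bit weight."""
    B = len(bits_msb_first)
    value = sum(b << (B - 1 - i) for i, b in enumerate(bits_msb_first))
    if signed and bits_msb_first[0]:
        value -= 1 << B            # two's-complement sign bit
    return value / (1 << (B - 1))  # normalize to [-1, 1)

def analog_mac(activations, weight_bits):
    # Charge-domain accumulation modeled as an ideal weighted sum.
    return sum(a * c2c_weight(w) for a, w in zip(activations, weight_bits))

print(c2c_weight([0, 1, 1, 0]))   # +0.75
print(c2c_weight([1, 0, 1, 0]))   # -0.75
print(analog_mac([0.5, 0.25], [[0, 1, 0, 0], [1, 1, 0, 0]]))  # 0.125
```

Because the ladder performs the bit weighting passively in the analog domain, no per-bit digital-to-analog conversion step is needed, which is the overhead the macro avoids.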
As integrated circuit technology approaches its physical limits, carbon-based electronic devices, with their high carrier mobility, have emerged as a crucial pathway for overcoming the bottlenecks of silicon-based technology. However, the existing PKUCNTFET process platform remains immature and differs significantly from traditional silicon-based processes in its design rules, so conventional silicon-based SRAM compilers cannot be directly adapted. Since SRAM is a critical component of processors, its development in laboratory settings currently relies entirely on manual design, hindering the advancement and application of carbon-based processors and memory. This paper presents the first reconfigurable SRAM compiler designed specifically for carbon-based technology. It employs fully custom cell designs and a modular architecture (parameter parsing → circuit generation → layout output). Using the Hanan grid algorithm for multilayer interconnect optimization, combined with A* search and via collision detection to reduce routing delay, the compiler achieves a fully automated design flow, resolving the core challenges of carbon-based process adaptation and multi-mode configuration. Experimental results demonstrate that the generated SRAM arrays pass LVS/DRC verification. The compiler supports three operating modes, single-port read/write (1RW), dual-port synchronous read/write (2R2W), and one-read-one-write (1R1W), and can generate SRAM arrays with configurable widths (8–256 bits) and depths (64–4 096 words). It also enables automated Liberty timing characterization across 27 PVT corners. This work provides a self-controllable domestic memory solution for the laboratory development of carbon-based integrated circuits.
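A single-layer sketch of the routing idea follows, under simplified assumptions: candidate nodes come from the Hanan grid of the pin coordinates (all crossings of pin x- and y-coordinates), and A* with a Manhattan-distance heuristic avoids blocked nodes standing in for via collisions. The compiler's multilayer version necessarily differs.

```python
import heapq

def hanan_nodes(pins):
    # Hanan grid: every crossing of a pin's x-line with a pin's y-line.
    xs = sorted({x for x, _ in pins})
    ys = sorted({y for _, y in pins})
    return {(x, y) for x in xs for y in ys}

def astar(nodes, blocked, src, dst):
    xs = sorted({x for x, _ in nodes})
    ys = sorted({y for _, y in nodes})
    def h(n):  # admissible Manhattan heuristic toward dst
        return abs(n[0] - dst[0]) + abs(n[1] - dst[1])
    def neighbors(n):
        i, j = xs.index(n[0]), ys.index(n[1])
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= i + di < len(xs) and 0 <= j + dj < len(ys):
                yield (xs[i + di], ys[j + dj])
    frontier, seen = [(h(src), 0, src, [src])], set()
    while frontier:
        _, g, n, path = heapq.heappop(frontier)
        if n == dst:
            return path
        if n in seen or n in blocked:   # blocked = e.g. via collisions
            continue
        seen.add(n)
        for m in neighbors(n):  # unit grid-step cost for simplicity
            heapq.heappush(frontier, (g + 1 + h(m), g + 1, m, path + [m]))
    return None  # unroutable with current blockages

pins = [(0, 0), (3, 1), (1, 4)]
print(astar(hanan_nodes(pins), {(1, 1)}, (0, 0), (3, 1)))
```

Restricting the search to Hanan-grid crossings keeps the candidate space small without sacrificing rectilinear-optimal connection points.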
Based on physics-of-failure (PoF) analysis methods, this paper proposes a standard for SiP reliability evaluation. The paper analyzes the applicability of domestic and international PoF-based SiP reliability evaluation methods. Using computer simulation technology, a SiP reliability evaluation framework is established, covering model construction, stress profile analysis, reliability prediction, and lifetime forecasting. The core content of the proposed standard is identified, covering evaluation processes, work items, and detailed requirements. Application to an actual SiP product is also explored, demonstrating high effectiveness and engineering applicability; the results show good consistency with those from accelerated life testing. The research outcomes help address current challenges in SiP reliability evaluation, such as the lack of unified and effective methods, poorly targeted evaluations, long test cycles, insufficient failure data, and high experimental costs.
With the increasing demand for non-volatile storage in embedded systems, functional verification of embedded flash (eFlash) controllers has become a crucial step in ensuring system reliability. To address the low efficiency and poor timing compatibility of traditional directed testing in eFlash controller verification, this paper designs and implements an efficient verification platform for eFlash controllers based on the Universal Verification Methodology (UVM) and targeting the AHB-Lite bus. The platform uses the core components of UVM to build a hierarchical architecture, employs automated scripts and an integrated register abstraction layer (RAL) model, and adopts constrained-random testing and coverage-driven strategies, ensuring verification completeness while shortening the verification cycle. Verification results show that the platform effectively exercises the various functions of the eFlash controller, achieving 100% code coverage and 100% functional coverage.
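The platform itself is SystemVerilog/UVM; the Python toy below only illustrates the coverage-driven constrained-random loop in a language-agnostic way: keep generating constrained transactions until every functional-coverage bin is hit. The operations, regions, and the constraint are invented for illustration, not the paper's coverage model.

```python
import random

OPS = ["read", "program", "erase"]
REGIONS = ["main", "info"]

def random_txn(rng):
    # Constraint: in this toy model, erase is only legal on the main region.
    op = rng.choice(OPS)
    region = "main" if op == "erase" else rng.choice(REGIONS)
    return op, region

# Coverage goal: every legal (op, region) bin.
goal = {(op, r) for op in OPS for r in REGIONS} - {("erase", "info")}
covered, rng, n = set(), random.Random(1), 0
while covered < goal:            # loop until all bins are covered
    covered.add(random_txn(rng))
    n += 1
print(f"full functional coverage after {n} transactions")
```

The same closed loop is what lets a coverage-driven flow stop as soon as completeness is demonstrated, instead of running a fixed directed-test list.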
With the ever-increasing demand for computational power and chip performance, multi-chiplet integration has emerged as a critical approach to enhancing integration density and processing capability. The chiplet interconnect interface is the key enabler of chiplet architectures, and compatibility design, particularly support for diverse interconnect protocols, poses significant challenges. Given the widespread adoption of the AXI protocol in System-on-Chip (SoC) designs, achieving efficient AXI compatibility in chiplet interconnects is of substantial importance. This paper presents a comprehensive cross-die AXI transmission architecture designed at the protocol layer of an in-house chiplet interconnect standard, implementing a local-agent-based flow control mechanism to manage AXI channel handshaking across dies. We detail the methodology for mapping AXI protocol signals to interconnect data packets to enable end-to-end transmission. Functional correctness is verified with a Universal Verification Methodology (UVM) testbench, and performance is evaluated on an FPGA validation platform. With 1 024-bit read/write data widths and 4 KB burst write operations, measurements show an 85-clock-cycle transmission latency (theoretical minimum: 64 cycles) and 84.92% bandwidth utilization at a 1 GHz interconnect frequency. This work provides a systematic reference for protocol-layer design in chiplet interconnect standards adapting AXI or similar communication protocols.
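A toy Python model of the local-agent flow control idea follows: AXI write beats are packetized and cross the die boundary only while the sender holds credits, with the remote side returning a credit per consumed packet. The packet fields and credit counts are illustrative assumptions, not the in-house standard's format.

```python
from collections import deque

class LocalAgent:
    """Credit-based sender standing in for the per-die local agent."""

    def __init__(self, credits):
        self.credits = credits   # remote buffer slots we may still fill
        self.queue = deque()     # packetized AXI beats awaiting credits

    def push_beat(self, awid, data):
        # Map one AXI write beat onto an interconnect packet (toy format).
        self.queue.append({"awid": awid, "payload": data})

    def send(self, link):
        # Forward packets across the die boundary only while credits remain.
        while self.queue and self.credits:
            link.append(self.queue.popleft())
            self.credits -= 1

    def credit_return(self, n=1):
        self.credits += n        # remote agent freed n buffer slots

agent, link = LocalAgent(credits=2), []
for beat in range(4):            # a 4-beat AXI burst
    agent.push_beat(awid=0, data=f"beat{beat}")
agent.send(link)                 # only 2 packets cross the die initially
agent.credit_return(2)
agent.send(link)                 # remaining beats follow as credits return
print(len(link))                 # -> 4
```

Decoupling the AXI handshake from the physical link in this way is what allows the channels to make progress without stalling the die-to-die lanes on every beat.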