Beyond Conventional Pipelining: Unpacking an Intel P6-Inspired Processor Design

Team Members

Team Members:

Chenyi Li, Fan Zhang, Yucheng Zhang, Yulie Lu, Tianhao Liang, Jiewei Chen

Instructors:

Xinfei Guo

Project Description

Problem

In high-performance applications, strictly sequential instruction execution was phased out in the last century. Most modern processors support features such as multiple issue and out-of-order execution. To handle the data hazards that threaten the correctness of out-of-order execution, the industry proposed the Tomasulo algorithm. However, the original algorithm does not support precise interrupts. Our project implements an out-of-order processor based on the Tomasulo algorithm and adds a reorder buffer, which enables precise interrupt support.

Concept Generation

The Tomasulo algorithm is a dynamic instruction scheduling technique that facilitates out-of-order execution. In this algorithm, instructions are dispatched to operation-specific Reservation Stations (RS). Each instruction executes as soon as its operands become available, so independent instructions do not stall behind one that is waiting on a data hazard. Once completed, an instruction broadcasts its result to all RS through a Common Data Bus (CDB), providing operands for waiting instructions. When dispatched, instructions carry copies of their available operands, which eliminates false (write-after-read and write-after-write) dependencies. This mechanism keeps the functional units busy and improves overall processor performance.
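The dispatch, execute, and broadcast steps described above can be illustrated with a small behavioral model. This is a minimal sketch in Python, not the project's Verilog: the names `Station`, `simulate`, and `LATENCY`, the cycle counts, and the choice to dispatch every instruction up front and to allow several broadcasts per cycle are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed execution latencies in cycles (illustrative only).
LATENCY = {"add": 1, "sub": 1, "mul": 3}
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

@dataclass
class Station:
    """One reservation-station entry (hypothetical layout)."""
    op: str
    dest: str
    vj: Optional[int] = None  # operand values, copied at dispatch when ready
    vk: Optional[int] = None
    qj: Optional[int] = None  # tags of the stations that will produce them
    qk: Optional[int] = None
    remaining: int = 0        # execution cycles left

def simulate(program, regs):
    """Run (op, dest, src1, src2) tuples; return final regs and completion order."""
    regs = dict(regs)
    producer = {}   # register -> tag of the newest station that will write it
    stations = {}   # tag -> Station
    # Dispatch in program order, assuming enough free stations.
    for tag, (op, dest, src1, src2) in enumerate(program):
        st = Station(op, dest, remaining=LATENCY[op])
        if src1 in producer:
            st.qj = producer[src1]   # wait for the producing station
        else:
            st.vj = regs[src1]       # copy the value now
        if src2 in producer:
            st.qk = producer[src2]
        else:
            st.vk = regs[src2]
        producer[dest] = tag
        stations[tag] = st
    completion_order = []
    while stations:
        finished = []
        for tag, st in stations.items():
            if st.qj is None and st.qk is None:   # both operands available
                st.remaining -= 1
                if st.remaining == 0:
                    finished.append((tag, OPS[st.op](st.vj, st.vk)))
        for tag, value in finished:               # broadcast on the CDB
            st = stations.pop(tag)
            completion_order.append(tag)
            if producer.get(st.dest) == tag:      # only the newest writer updates the register
                regs[st.dest] = value
                del producer[st.dest]
            for other in stations.values():       # wake up waiting stations
                if other.qj == tag:
                    other.vj, other.qj = value, None
                if other.qk == tag:
                    other.vk, other.qk = value, None
    return regs, completion_order

program = [
    ("mul", "r1", "r2", "r3"),  # tag 0: long latency
    ("add", "r4", "r1", "r5"),  # tag 1: waits for r1, reads the old r5
    ("add", "r5", "r7", "r8"),  # tag 2: independent, overwrites r5
]
initial = {"r1": 0, "r2": 2, "r3": 3, "r4": 0, "r5": 10, "r7": 1, "r8": 1}
final_regs, completion_order = simulate(program, initial)
```

In this example the third instruction finishes first even though it overwrites `r5`, because the second instruction already copied the old value of `r5` when it was dispatched. The second instruction still computes its result from that old value once the multiply broadcasts `r1`.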

Design Description

This design focuses on developing a high-performance out-of-order pipelined processor based on the scalar Intel P6-style architecture. The design incorporates the Tomasulo algorithm and a Reorder Buffer to enhance processor performance and instruction-level parallelism.
The processor uses reservation stations to hold instructions and their operands, so that independent instructions can execute in parallel. Register renaming removes false data dependencies and the hazards they would otherwise cause. Instructions are issued, executed, and written back in stages, with the Tomasulo algorithm dynamically scheduling instructions based on operand availability.
To ensure precise exception handling, we have implemented a Reorder Buffer. This component tracks the original program order and holds completed results until every older instruction has finished, so that instructions commit in order.
The Verilog implementation provides a reliable and versatile solution for high-performance computing. It aligns the VeriSimpleV processor with the Intel P6 architecture, significantly improving instruction-level parallelism and throughput.
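The in-order commit behavior of the Reorder Buffer described above can be sketched as a small behavioral model. This is an illustrative Python sketch under stated assumptions, not the project's Verilog: the class name `ReorderBuffer`, its methods, the entry fields, and the policy of squashing the faulting instruction together with all younger ones are hypothetical choices made for this example.

```python
from collections import deque

class ReorderBuffer:
    """Behavioral sketch of a reorder buffer (hypothetical interface)."""

    def __init__(self):
        self.entries = deque()  # oldest instruction on the left
        self.next_tag = 0

    def allocate(self, dest):
        """Reserve an entry in program order and return its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append(
            {"tag": tag, "dest": dest, "value": None,
             "done": False, "exception": False}
        )
        return tag

    def complete(self, tag, value=None, exception=False):
        """Record a result; results may arrive in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry.update(value=value, done=True, exception=exception)
                return
        raise KeyError(tag)

    def commit(self, regs):
        """Retire finished instructions from the head only, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                # Precise exception: nothing at or after the faulting
                # instruction has changed the register file yet.
                self.entries.clear()
                break
            self.entries.popleft()
            regs[head["dest"]] = head["value"]
            committed.append(head["tag"])
        return committed

regs = {}
rob = ReorderBuffer()
t0 = rob.allocate("r1")
t1 = rob.allocate("r2")
t2 = rob.allocate("r3")

rob.complete(t2, 30)                 # youngest instruction finishes first
first = rob.commit(regs)             # nothing commits: the head is not done
rob.complete(t0, 10)
second = rob.commit(regs)            # only the oldest instruction retires
rob.complete(t1, exception=True)     # the middle instruction faults
third = rob.commit(regs)             # it and the younger entry are squashed
```

Because the result of the youngest instruction stays in the buffer until it reaches the head, the fault in the middle instruction is taken before `r3` is ever written, which is the property that makes the exception precise.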

Validation

Validation Process:
We validated our out-of-order processor design by comparing it with the baseline 5-stage in-order pipelined processor from Lab 3.
Both functional and performance testing were conducted through behavioral simulation in Xilinx Vivado. During the simulation, a suite of benchmark programs, including computation-intensive and control-intensive workloads, was executed to assess the functional correctness and performance of our processor.
Validation Results:
Based on the validation results, most specifications were successfully met.
- Testbench output matched the baseline.
- Memory status matched the baseline.
- CPI (Cycles Per Instruction) was kept at or below 1.5.
- Speedup was equal to or greater than 2.
Our out-of-order design, utilizing Tomasulo's algorithm with a reorder buffer, outperformed the baseline model in most cases, demonstrating improved efficiency.

Modeling and Analysis

In the testbench, we tested ALUI, ALUR, ALUB, and LD/ST instructions.
To keep things simple, we did not model memory-access latency, even though hiding that latency is a significant contributor to the speedup of out-of-order execution. Our analysis therefore focuses on verifying the correctness of the pipeline's output and on the speedup obtained by resolving data hazards and control hazards through out-of-order execution.
To ensure the correctness of our processor, we executed the same code on both our processor and the 5-stage in-order pipeline processor we implemented and tested in lab 3. Subsequently, we compared the testbench output and the memory log of these two processors to check if they produced identical results.
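The comparison of the two processors' outputs can be automated with a short script. The sketch below is an assumption about how such a check might look, not the team's actual tooling: the function name `first_mismatch` and the log line format are hypothetical, and it presumes that the logs have already been reduced to lines that should match between the two designs (for example, final memory contents rather than per-cycle traces).

```python
from itertools import zip_longest

def first_mismatch(baseline_lines, candidate_lines):
    """Return (line_number, baseline_line, candidate_line) for the first
    difference, or None when both logs are identical."""
    pairs = zip_longest(baseline_lines, candidate_lines)
    for number, (expected, actual) in enumerate(pairs, start=1):
        if expected != actual:
            return number, expected, actual
    return None

# Hypothetical memory-log lines in an assumed "address: value" format.
baseline = ["0x0000: 0000000a", "0x0004: 00000014"]
matching = ["0x0000: 0000000a", "0x0004: 00000014"]
differing = ["0x0000: 0000000a", "0x0004: 00000015"]
truncated = ["0x0000: 0000000a"]

same_result = first_mismatch(baseline, matching)
diff_result = first_mismatch(baseline, differing)
short_result = first_mismatch(baseline, truncated)
```

A missing line is reported as `None` on the shorter side, so a log that ends early is flagged as a mismatch instead of passing silently.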
To evaluate the performance of our processor, we used the 5-stage pipelined processor as the baseline. We compared the two designs by their average CPI (Cycles Per Instruction) when executing the same code. The CPI is reported in the testbench output.
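The CPI comparison reduces to simple arithmetic. As a sketch, assuming both processors execute the same number of instructions at the same clock period, the speedup equals the ratio of the baseline CPI to the out-of-order CPI. The function names and the cycle counts below are made up for illustration and are not measured results from the project.

```python
def cpi(cycles, instructions):
    """Average cycles per instruction."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return cycles / instructions

def speedup(baseline_cpi, new_cpi):
    """Speedup over the baseline, assuming equal instruction count
    and equal clock period for both designs."""
    if new_cpi <= 0:
        raise ValueError("CPI must be positive")
    return baseline_cpi / new_cpi

# Hypothetical counts, not taken from the project's testbench.
baseline_cpi = cpi(cycles=300, instructions=100)
ooo_cpi = cpi(cycles=150, instructions=100)
ratio = speedup(baseline_cpi, ooo_cpi)
```

If the two designs ran at different clock frequencies, the ratio of execution times (cycles multiplied by clock period) would be the appropriate measure instead.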

Conclusion

Our out-of-order pipelined processor design, incorporating the Tomasulo algorithm with a Reorder Buffer, effectively harnesses instruction-level parallelism to achieve enhanced performance. The crux of our design lies in this combination: instructions are scheduled and executed dynamically, yet they commit in program order, so the architectural state remains precise.

Acknowledgement

Sponsor: UM-SJTU Joint Institute
Xinfei Guo from UM-SJTU Joint Institute

Reference

[1] R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Research and Development, vol. 11, no. 1, pp. 25-33, Jan. 1967. doi: 10.1147/rd.111.0025.
[2] University of Michigan, EECS470 WN21 Lecture 8: P6 Microarchitecture, University of Michigan EECS Department, 2021.

VM495 Abstract
