Ryan Zhi
Summary of Deploying Vortex GPGPU on K7 FPGA Development Board

Project Background

Our team consists of four members, each with a specific role. I am responsible for overall project coordination, scheduling, hardware architecture design, system integration, and data flow management. I also support team members when they encounter difficulties and escalate issues to our technical lead, Mr. Geng, if necessary. One team member focuses on testing, another on driver development, and the third on IP core modularization.
The primary goal of this project is to lay the groundwork for future Compute-in-Memory (CIM) chips by implementing a minimal-cost RISC-V-based GPGPU on a K7 FPGA. The configuration is the minimal one: a single cluster containing a single socket with a single core.

What is Compute-in-Memory (CIM)?

Compute-in-Memory is an architecture that integrates computing capabilities within memory units, enabling efficient 2D and 3D matrix operations (multiplication and addition). Its key advantages include:
Breaking the memory wall by reducing unnecessary data movement delays and power consumption.
Enhancing computational efficiency by orders of magnitude while reducing costs.
CIM is a non-von Neumann architecture that can deliver significantly higher performance (over 1,000 TOPS) and efficiency (10-100+ TOPS/W) than existing ASIC chips.

Project Design and Background

Why Choose Vortex GPGPU Based on RISC-V?

Efficiency and Open Ecosystem: RISC-V's lean base instruction set can yield simpler decode logic, smaller chip area, and competitive performance relative to ARM. Vortex GPGPU is open-source, offering greater customization and innovation opportunities compared to closed ecosystems like NVIDIA CUDA and AMD ROCm.
Research and Innovation: The RISC-V ecosystem is still rapidly evolving, providing ample opportunities for academic and industrial innovation, especially in hardware optimization and algorithm acceleration.

Challenges and Solutions in Deploying Vortex GPGPU on K7

Hardware Architecture Differences:

Challenge: The U280 is built on the UltraScale+ architecture while the K7 is a Kintex-7 part, with far fewer logic resources and much lower memory bandwidth.
Solution: Modular design using Vivado Block Design, optimized pipeline depth, and reduced non-essential logic to fit K7's resources.

PCIe Interface and Driver Differences:

Challenge: The U280 is supported by the Xilinx Runtime (XRT) and provides on-package HBM, while the K7 requires a custom Linux driver built around the XDMA IP core.
Solution: Developed a Generic Virtual Interface (GVI) driver to encapsulate XDMA, optimized PCIe data channels, and used DMA queues to improve throughput.
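As a rough host-side illustration of why a DMA queue improves throughput, the sketch below coalesces many small transfers into a few larger bursts, reducing per-transfer setup overhead on the PCIe link. `DmaQueue` and its methods are hypothetical names for illustration only, not the actual GVI driver API.

```python
from collections import deque

class DmaQueue:
    """Illustrative software model of a DMA descriptor queue: small
    transfers are coalesced into larger bursts so the PCIe link spends
    less time on per-transfer setup overhead."""

    def __init__(self, burst_bytes=4096):
        self.burst_bytes = burst_bytes
        self.pending = deque()

    def submit(self, buf: bytes):
        """Queue a buffer instead of issuing one DMA per call."""
        self.pending.append(buf)

    def flush(self):
        """Coalesce queued buffers into bursts of up to burst_bytes."""
        bursts = []
        current = bytearray()
        while self.pending:
            buf = self.pending.popleft()
            if current and len(current) + len(buf) > self.burst_bytes:
                bursts.append(bytes(current))
                current = bytearray()
            current += buf
        if current:
            bursts.append(bytes(current))
        return bursts
```

In the real driver the bursts would become XDMA descriptor entries; here they are just returned so the batching effect is visible.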

Resource Limitations on K7:

Challenge: Limited LUTs, DSPs, and DDR3 memory bandwidth.
Solution: Optimized logic resource usage, split complex operations into multiple cycles, and enhanced local BRAM utilization for caching.
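The multi-cycle splitting can be sketched as time-multiplexing a single multiply-accumulate (MAC) unit: instead of N parallel multipliers, which would not fit the K7's DSP budget, one MAC is reused over N cycles. This is a behavioral Python model for illustration; the real implementation is RTL.

```python
def dot_multicycle(a, b):
    """Model of time-multiplexing one MAC unit over len(a) cycles:
    each loop iteration stands for one clock cycle in which the single
    shared DSP slice performs one multiply-accumulate."""
    assert len(a) == len(b)
    acc = 0
    for cycle in range(len(a)):      # one MAC per clock cycle
        acc += a[cycle] * b[cycle]   # the single shared multiplier
    return acc
```

The trade is straightforward: N cycles of latency in exchange for a 1/N reduction in DSP usage.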

Performance Differences Between K7 and U280

Hardware Resources:

U280: UltraScale+ architecture with abundant LUTs, FFs, DSPs, and HBM.
K7: Kintex-7 architecture with limited resources and DDR3 support.

Computational Capability:

U280: Supports multi-cluster, multi-socket, multi-core GPGPU architecture.
K7: Limited to a single cluster, socket, and core due to resource constraints.

PCIe Support:

U280: PCIe Gen3.
K7: PCIe Gen2.
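The practical gap can be estimated from the link parameters: Gen2 runs at 5 GT/s per lane with 8b/10b encoding, Gen3 at 8 GT/s with 128b/130b encoding. The sketch below computes theoretical one-direction payload bandwidth; the lane counts (Gen2 x8 for the K7 integrated block, Gen3 x16 for the U280) are assumptions based on common configurations of these parts, and protocol overhead beyond line encoding is ignored.

```python
def pcie_bandwidth_gbps(gen, lanes):
    """Approximate peak one-direction PCIe payload bandwidth in Gb/s.
    Gen2: 5 GT/s per lane, 8b/10b encoding   -> 4.0 Gb/s usable/lane
    Gen3: 8 GT/s per lane, 128b/130b encoding -> ~7.88 Gb/s usable/lane"""
    per_lane = {2: 5.0 * 8 / 10, 3: 8.0 * 128 / 130}
    return per_lane[gen] * lanes

k7_gbps = pcie_bandwidth_gbps(2, 8)     # assumed K7 link: Gen2 x8 -> 32 Gb/s
u280_gbps = pcie_bandwidth_gbps(3, 16)  # assumed U280 link: Gen3 x16 -> ~126 Gb/s
```

Under these assumptions the U280 link offers roughly 4x the raw bandwidth, which compounds the resource gap described above.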

Optimization on K7

Minimal GPGPU Architecture: Implemented a basic 1-cluster, 1-socket, 1-core architecture, removing non-essential features.
Resource Reuse and Optimization: Shared modules, optimized key paths for common operations (e.g., matrix multiplication), and increased local cache usage.
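A blocked matrix multiply illustrates the local-cache idea: operands are processed in small tiles, modeling how sub-blocks are staged in on-chip BRAM so the slower DDR3 is touched once per block rather than once per element. This is a behavioral Python model, not the actual optimized path.

```python
def matmul_tiled(A, B, tile=2):
    """Blocked square-matrix multiply. Each (i0, j0, k0) tile triple
    corresponds to a working set small enough to hold in BRAM."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # on the FPGA, these sub-blocks would be loaded into
                # BRAM once and reused across the inner loops
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The tile size would be chosen so two input tiles and one output tile fit the available BRAM.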

Impact on System Performance

Throughput and Latency: Lower throughput and higher latency on K7 due to limited resources and lower PCIe bandwidth.
Task Scale: Suitable only for small-scale tasks, not for large deep learning models.
Advantages: Low-cost validation platform, 95% cost reduction, and scalable architecture for future upgrades.

Task Planning and Team Coordination

Task Allocation:

Project Manager: Overall coordination, hardware architecture, system integration, and data flow management.
Tester: Functional and performance testing, OpenCL script execution.
Driver Developer: GVI driver development and debugging.
IP Core Engineer: Optimization and modularization of GPGPU core for K7.

Collaboration Methods:

Tools: Gantt charts, project management tools, enterprise messaging, and knowledge base for documentation.
Meetings: Daily stand-ups and weekly project reviews to synchronize progress and address challenges.
Support: Regular check-ins to assist team members with technical issues and facilitate cross-domain collaboration.

Balancing Hardware and Software Teams:

Clear Interface Definition: Established standardized interfaces and functional requirements between hardware and software.
Milestone Setting: Aligned hardware IP delivery with software driver development and testing.
Communication: Regular joint meetings to ensure consistent understanding of interface requirements and avoid rework.

Overcoming Challenges:

Progress Synchronization: Set short-term goals to avoid delays due to mismatched hardware and software development paces.
Cross-Domain Support: Provided cross-disciplinary technical guidance through documentation and knowledge sharing.

Modular Design in Vivado Block Design

Principles:

Single Responsibility: Each module handles a specific function (e.g., data transfer, GPGPU computation).
Standardized Interfaces: Use AXI4, AXI4-Lite, and AXI4-Stream for module communication.
Hierarchical Design: Organize modules into layers (e.g., data transfer, computation, control) for easier maintenance.

Scalability:

Parameterized Modules: Allow dynamic adjustment of module parameters (e.g., PCIe bandwidth, GPGPU core count).
Reserved Interfaces: Include extra AXI ports and interrupt lines for future expansion.
Flexible Module Replacement: Standardized interfaces enable easy module upgrades.
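A small parameter record shows how such parameterization might be captured and validated on the host side; all field names here are hypothetical, chosen only to mirror the parameters mentioned above.

```python
from dataclasses import dataclass

@dataclass
class GpgpuBlockConfig:
    """Illustrative parameter record for the block design.
    Field names are hypothetical, not the actual IP parameter names."""
    pcie_lanes: int = 8        # assumed K7 link width
    axi_data_width: int = 128  # AXI4 data bus width in bits
    core_count: int = 1        # K7 build: single core
    spare_axi_ports: int = 1   # reserved interfaces for expansion

    def validate(self):
        assert self.axi_data_width in (32, 64, 128, 256), "bad AXI width"
        assert self.core_count >= 1, "need at least one core"
        return self
```

Validating parameters once, up front, catches configuration mistakes before a multi-hour synthesis run.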

Maintainability:

Module Packaging: Use the Vivado IP Packager to create reusable IP cores with GUI-configurable parameters.
Layered Debugging: Develop independent test cases for each module and use simulation tools for verification.
Documentation and Naming: Maintain consistent naming conventions and comprehensive documentation for future reference.

Conclusion

Through clear task allocation, effective tools, and strong team coordination, we successfully overcame technical challenges and deployed the Vortex GPGPU on the K7 FPGA. This project not only validated the feasibility of GPGPU design on resource-constrained hardware but also provided a low-cost platform for future development and expansion.
