Compute System Architecture
This document details the computational backend of PyCauset, including the CPU/GPU abstraction layer, parallelization strategies, and hardware-specific optimizations.
1. Architecture Overview
PyCauset uses a unified compute architecture that abstracts the underlying hardware (CPU or GPU) from the high-level matrix operations.
```mermaid
classDiagram
    class ComputeContext {
        +instance()
        +get_device()
        +enable_gpu()
    }
    class ComputeDevice {
        <<interface>>
        +matmul()
        +add()
        +inverse()
    }
    class AutoSolver {
        +select_device()
    }
    class CpuDevice {
        +solver_: CpuSolver
    }
    class CudaDevice {
        +cublasHandle
        +cusolverHandle
    }
    ComputeContext --> AutoSolver : manages
    AutoSolver --|> ComputeDevice
    AutoSolver --> CpuDevice : delegates (small matrices)
    AutoSolver --> CudaDevice : delegates (large matrices)
    CpuDevice --|> ComputeDevice
    CudaDevice --|> ComputeDevice
```
1.1 ComputeContext
The ComputeContext is a singleton that serves as the entry point for all hardware-accelerated operations. It manages the lifecycle of the compute devices and handles the dynamic loading of the CUDA backend.
1.2 AutoSolver
The AutoSolver is a smart dispatcher that implements the ComputeDevice interface. It automatically selects the best device for a given operation based on:
* Problem Size: Uses an element-count threshold (gpu_threshold_elements_) to avoid PCIe transfer overhead on small workloads.
* Measured Speedup: When a GPU is available, AutoSolver runs a small startup micro-benchmark and may prefer CPU if the GPU is slower for the tested workload.
* Hardware Availability: If no GPU is detected, it falls back to the CPU.
* Operation + Type Support: GPU routing is operation-specific and gated by matrix type + dtype compatibility (for example, dense float32/float64 with matching dtypes). Many operations are intentionally CPU-only for now (e.g., matrix-vector multiply, outer product, elementwise multiply/divide, dot/sum/norm).
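The routing rules above can be condensed into a small decision function. This is an illustrative sketch, not the actual AutoSolver source: the function name, argument names, and the threshold value are assumptions.

```python
# Hypothetical sketch of AutoSolver-style routing; names and the threshold
# value are illustrative, not the real PyCauset internals.

GPU_THRESHOLD_ELEMENTS = 1 << 20  # assumed value of gpu_threshold_elements_

# Operations the document lists as intentionally CPU-only for now.
CPU_ONLY_OPS = {"gemv", "outer", "ewise_mul", "ewise_div", "dot", "sum", "norm"}

def select_device(op, n_elements, gpu_available, dtypes_match,
                  measured_gpu_speedup=1.0):
    """Return 'cpu' or 'gpu' following the routing rules described above."""
    if not gpu_available:
        return "cpu"                        # hardware availability
    if op in CPU_ONLY_OPS or not dtypes_match:
        return "cpu"                        # operation + type support gate
    if n_elements < GPU_THRESHOLD_ELEMENTS:
        return "cpu"                        # PCIe overhead dominates small work
    if measured_gpu_speedup <= 1.0:
        return "cpu"                        # startup micro-benchmark preferred CPU
    return "gpu"
```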
1.3 IO Acceleration Integration
To minimize page faults during computation, the compute backend integrates with the IO Accelerator (see MemoryArchitecture).
- Prefetching: Before starting a heavy operation (like `matmul` or `inverse`), the solver calls `matrix->get_accelerator()->prefetch()`. This hints the OS to load the data into RAM asynchronously.
- Discarding: For temporary intermediate results, the solver may call `discard()` after use to free memory immediately.
1.4 Debug trace tags (kernel vs IO)
For deterministic testing and debugging (especially device routing and “did this trigger evaluation?” checks), PyCauset exposes a small trace surface:
- Kernel trace (thread-local): `pycauset._debug_clear_kernel_trace()`, `pycauset._debug_last_kernel_trace()`
- IO trace (separate channel, thread-local): `pycauset._debug_clear_io_trace()`, `pycauset._debug_last_io_trace()`
The IO trace is intentionally separate so IO hints (prefetch/discard) don’t clobber kernel dispatch traces.
2. CPU Backend
The CPU backend is designed for low-latency execution of small-to-medium workloads and robust fallback for all operations.
2.1 CpuDevice & CpuSolver
- CpuDevice: A thin wrapper that implements the `ComputeDevice` interface.
- CpuSolver: The core implementation class containing the algorithms. It handles:
  - Dense Matrix Ops: Blocked matrix multiplication (tiled for cache efficiency).
    - Lazy Initialization: Uses a dynamic `beta` parameter (0.0 for the first block, 1.0 for the rest) to avoid globally zero-filling output matrices. This prevents unnecessary page faults for out-of-core datasets.
  - Bit Matrix Ops: Optimized using `std::popcount` (AVX-512/NEON) for ~30x speedups over naive loops.
  - Element-wise Ops: Parallelized addition, subtraction, and multiplication.
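The dynamic-`beta` trick can be illustrated with a toy blocked multiply in Python (the real CpuSolver is C++; this is only a model of the idea): the first block over the reduction dimension overwrites the output (`beta = 0`), and later blocks accumulate (`beta = 1`), so the output buffer never needs a separate zero-fill pass.

```python
# Illustrative sketch (not the actual CpuSolver) of blocked matmul with a
# dynamic beta: first k-block writes, later k-blocks accumulate, so C is
# never globally zero-filled before the computation starts.

def blocked_matmul(A, B, n, block=2):
    C = [[None] * n for _ in range(n)]          # deliberately uninitialized
    for k0 in range(0, n, block):
        beta = 0.0 if k0 == 0 else 1.0          # dynamic beta parameter
        for i in range(n):
            for j in range(n):
                acc = sum(A[i][k] * B[k][j]
                          for k in range(k0, min(k0 + block, n)))
                # beta == 0 overwrites (first touch), beta == 1 accumulates
                C[i][j] = acc if beta == 0.0 else C[i][j] + acc
    return C
```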
2.2 Parallelization (ParallelUtils)
PyCauset uses a custom thread pool (ThreadPool) to manage parallelism with a Dynamic Scheduling model (Work-Stealing Approximation).
- ParallelFor: A helper function that splits loops across available threads.
  - Dynamic Scheduling: Unlike static partitioning (which suffers from stalls if one thread hits a page fault), `ParallelFor` uses an atomic index to hand out small chunks of work (`grain_size`) to threads as they finish previous tasks.
  - Load Balancing: This ensures that if one thread is blocked by I/O or OS paging, other threads continue processing the remaining work, maximizing CPU utilization.
  - Granularity: The grain size is tuned (heuristic: `range / (threads * 4)`) to balance load distribution against atomic contention overhead.
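The chunk-claiming loop can be sketched in Python (the real implementation is a C++ thread pool with an atomic counter; here a lock-protected index stands in for the atomic):

```python
# Minimal Python model of ParallelFor's dynamic scheduling: a shared index
# hands out grain-sized chunks to workers as they finish, so a stalled
# thread does not block the rest of the iteration range.
import threading

def parallel_for(n, body, num_threads=4):
    grain = max(1, n // (num_threads * 4))  # grain heuristic from the text
    next_index = [0]
    lock = threading.Lock()                 # stands in for the atomic counter

    def worker():
        while True:
            with lock:                      # atomically claim the next chunk
                start = next_index[0]
                if start >= n:
                    return
                next_index[0] = start + grain
            for i in range(start, min(start + grain, n)):
                body(i)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```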
3. GPU Backend (CUDA)
The GPU backend leverages NVIDIA CUDA for massive parallelism, particularly for \(O(N^3)\) operations like matrix multiplication and inversion.
3.1 CudaDevice
- Libraries: Wraps `cuBLAS` (for matrix multiplication) and `cuSOLVER` (for decomposition/inversion).
- Memory Management: Maintains persistent device buffers (`d_A`, `d_B`, `d_C`) to avoid `cudaMalloc` overhead on every call.
3.2 Custom Kernels
For operations not supported by standard libraries, PyCauset implements custom CUDA kernels:
* BitMatrix Multiplication: Uses a "Transpose-then-Popcount" strategy.
1. Transpose Matrix B using warp shuffles (__ballot_sync).
2. Perform bitwise AND + POPCOUNT on 64-bit words.
3. Achieves theoretical peak throughput (64 ops/cycle/thread).
* Selected Element-wise Ops: k_add, k_sub, k_mul_scalar are available for dense float matrices when the AutoSolver routes the operation to CUDA. Some elementwise operations (notably elementwise multiply/divide) are currently CPU-only.
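The transpose-then-popcount idea is easiest to see in scalar form: once rows of A and columns of B (rows of B transposed) are packed into machine words, each output element is just `popcount(row & col)`. A pure-Python model of the arithmetic (the real kernel operates on 64-bit words in CUDA):

```python
# Sketch of "transpose-then-popcount": pack bit-rows into integers, then
# each C[i][j] is the popcount of the AND of A's row word and B's column word.

def pack_bits(bits):
    """Pack a list of 0/1 values into one integer word (bit i = element i)."""
    word = 0
    for i, b in enumerate(bits):
        word |= b << i
    return word

def bit_matmul(A_rows, B_cols):
    """C[i][j] = popcount(A_rows[i] & B_cols[j]) -- counts matching 1-bits."""
    return [[bin(a & b).count("1") for b in B_cols] for a in A_rows]
```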
3.3 Pinned Memory
To maximize data transfer speeds, PyCauset uses Pinned Memory (Page-Locked Memory) when a GPU is active.
* Mechanism: Uses cudaHostAlloc instead of malloc.
* Benefit: Allows the GPU's DMA engine to read/write directly to host RAM, bypassing CPU staging buffers. This typically doubles transfer bandwidth.
4. Hardware Detection & Capabilities
PyCauset uses a ComputeDevice abstraction to manage hardware backends. The CudaDevice implementation includes logic to query NVIDIA GPU properties and determine the optimal execution strategy.
Compute Capability Heuristics
The preferred_precision() method in CudaDevice returns an integer code:
* 1: Float32 (Single Precision)
* 2: Float64 (Double Precision)
Logic
The decision is based on the GPU's Compute Capability (CC):
| CC | Architecture | Example Cards | FP64 Rate | Preference |
|---|---|---|---|---|
| 6.0 | Pascal | Tesla P100 | 1/2 FP32 | Float64 |
| 6.1 | Pascal | GTX 10-series | 1/32 FP32 | Float32 |
| 7.0 | Volta | Tesla V100 | 1/2 FP32 | Float64 |
| 7.5 | Turing | RTX 20-series | 1/32 FP32 | Float32 |
| 8.0 | Ampere | Tesla A100 | 1/2 FP32 | Float64 |
| 8.6 | Ampere | RTX 30-series | 1/64 FP32 | Float32 |
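Every row of the table follows one pattern: data-center parts have compute capability `x.0` and full-rate FP64, while consumer parts (`x.1`, `x.5`, `x.6`) have throttled FP64. A hedged reconstruction of the heuristic, assuming that pattern is the actual rule (the real `preferred_precision()` logic may differ):

```python
# Assumed reconstruction of preferred_precision() from the table above:
# minor version 0 => data-center GPU with fast FP64 => Float64 (code 2);
# otherwise consumer GPU => Float32 (code 1). Hypothetical, not the C++ source.

def preferred_precision(major, minor):
    return 2 if minor == 0 else 1
```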
Python Binding Integration
PyCauset’s Python bindings include an internal “NumPy → PyCauset” import helper (referred to in code as native.asarray) that converts NumPy arrays into PyCauset vectors/matrices based on shape and dtype.
Note: this is not a public pycauset.asarray API; the user-facing constructors are pycauset.vector(...) and pycauset.matrix(...).
* Shape: Supports 1D (vector) and 2D (matrix) arrays.
* Dtype policy (current):
* 1D float32 vectors are promoted to float64 vectors.
* 2D float32 matrices remain float32 matrices.
* This conversion does not currently depend on GPU availability or preferred_precision().
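The current dtype policy reduces to a small lookup on array rank and dtype. A sketch of that policy as a standalone function (the function name is illustrative; the real logic lives inside the native `asarray` helper):

```python
# Illustrative model of the documented NumPy -> PyCauset dtype policy.
# target_dtype is a hypothetical name, not part of the pycauset API.

def target_dtype(ndim, dtype):
    if ndim == 1 and dtype == "float32":
        return "float64"      # 1D float32 vectors are promoted to float64
    return dtype              # 2D float32 matrices stay float32; others pass through
```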
5. Async GPU Architecture & Streaming
PyCauset uses an Asynchronous Heterogeneous Computing model to handle matrices larger than GPU memory. Instead of blocking the CPU while the GPU computes, or blocking the GPU while the CPU loads data, we use a Double-Buffered Pipeline.
The AsyncStreamer Class
The core of this architecture is the AsyncStreamer<T> class (src/accelerators/cuda/AsyncStreamer.hpp).
Responsibilities
- Pinned Memory Management: Allocates `cudaMallocHost` memory, which is required for asynchronous DMA transfers.
- Stream Management: Maintains a dedicated `transfer_stream` separate from the `compute_stream`.
- Synchronization: Uses `cudaEvent_t` to coordinate the CPU (Producer) and GPU (Consumer) without blocking the CPU thread unnecessarily.
Pipeline Logic
The pipeline consists of two buffers (Index 0 and Index 1) and utilizes Hybrid CPU/GPU Parallelism.
- CPU Phase (Producer):
  - Wait: Waits for Buffer \(i\) to be free (consumed by the GPU in the previous cycle).
  - Fill (Parallelized): Uses `pycauset::ParallelFor` (Thread Pool) to pack data from the source matrix into Pinned Memory Buffer \(i\). This multi-threaded approach ensures that memory bandwidth is maximized and the CPU does not become a bottleneck for fast GPUs.
  - Submit: Submits an asynchronous `cudaMemcpyAsync` on the `transfer_stream`.
  - Record: Records `event_transfer_complete`.
- GPU Phase (Consumer):
  - Wait: The `compute_stream` waits for `event_transfer_complete`.
  - Compute: Executes kernels (e.g., `cublasDgemm`) reading from Buffer \(i\) (Device Memory).
  - Record: Records `event_compute_complete`.

This architecture allows for True Overlap:
* Time \(T\): GPU computes Batch \(k\).
* Time \(T\): CPU threads pack Batch \(k+1\).
* Time \(T\): DMA Engine transfers Batch \(k+1\) (once packing is done).
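The two-buffer handshake can be modeled in Python with two pairs of events standing in for `event_transfer_complete` and `event_compute_complete` (a toy model of the structure, not the CUDA implementation):

```python
# Toy double-buffered producer/consumer pipeline mirroring AsyncStreamer's
# structure: the producer fills buffer k % 2 while the consumer works on the
# other buffer; events model the CUDA transfer/compute-complete events.
import threading

def run_pipeline(batches, compute):
    buffers = [None, None]
    free = [threading.Event(), threading.Event()]   # ~ event_compute_complete
    ready = [threading.Event(), threading.Event()]  # ~ event_transfer_complete
    for e in free:
        e.set()                                     # both buffers start free
    results = []

    def producer():
        for k, batch in enumerate(batches):
            i = k % 2
            free[i].wait(); free[i].clear()         # wait for buffer i
            buffers[i] = batch                      # "fill" + "transfer"
            ready[i].set()                          # record: transfer done

    def consumer():
        for k in range(len(batches)):
            i = k % 2
            ready[i].wait(); ready[i].clear()       # wait on transfer event
            results.append(compute(buffers[i]))     # kernel consumes buffer i
            free[i].set()                           # record: compute done

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start(); t1.join(); t2.join()
    return results
```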
Supported Operations
1. Matrix Multiplication (matmul)
- Function: `CudaDevice::matmul_streaming`
- Strategy: Tiled GEMM.
- Pipeline:
  - Streams tiles of Matrix A and Matrix B.
  - Accumulates results into a persistent tile of Matrix C on the GPU.
  - Allows multiplying matrices of arbitrary size, limited only by system RAM.
2. Eigenvalue Solver (batch_gemv)
- Function: `CudaDevice::batch_gemv_streaming`
- Strategy: Blocked Matrix-Vector Multiplication.
- Pipeline:
  - Keeps vectors \(X\) and \(Y\) in VRAM.
  - Streams Matrix \(A\) in chunks from Host to Device.
  - Computes \(Y += A_{chunk} \times X\).
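The chunked accumulation is simple to model in scalar Python (the real path runs cuBLAS GEMV on device memory): the vector \(X\) stays resident while row chunks of \(A\) arrive one at a time, each producing its slice of \(Y\).

```python
# Sketch of the blocked matrix-vector strategy: x stays "resident" while
# row chunks of A are streamed in, each contributing its part of y.

def streamed_gemv(row_chunks, x):
    y = []
    for chunk in row_chunks:          # each chunk ~ one Host->Device transfer
        for row in chunk:
            y.append(sum(a * b for a, b in zip(row, x)))  # y part from A_chunk @ x
    return y
```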
3. Matrix Inversion (inverse)
- Function: `CudaSolver::invert`
- Strategy: Right-Looking Blocked LU Decomposition.
- Pipeline:
  - Panel Factorization: Loads a block column and factorizes it in VRAM.
  - Trailing Update: Streams the trailing submatrix and updates it using the factorized panel (\(A' = A - L \times U\)).
  - Reuses the `matmul_streaming` infrastructure for the update step.
- Performance: Enables inversion of matrices significantly larger than GPU memory.
6. Optimization Roadmap & Future Work
This section outlines the plan to address the remaining gaps in the PyCauset acceleration layer.
Phase 1: Dense Eigenvalues on GPU (Complete)
- Status: ✅ Complete (Symmetric Only)
- Implementation: Uses `syevd` for symmetric matrices (checked at runtime) and falls back to CPU for non-symmetric ones.
Phase 2: Element-wise Operations on GPU (Partial)
- Status: 🟡 Partial
- Implementation: CUDA supports `add`, `subtract`, and `multiply_scalar` for dense float32/float64 with matching dtypes. Elementwise multiply/divide are currently CPU-only.
Phase 3: BitMatrix Optimization (Complete)
- Status: ✅ Complete
- Implementation: "Bit-Packed GEMM" kernel implemented using `__popc` and shared memory tiling.
Phase 4: Precision Control
- Objective: Allow users to force Float64 on consumer GPUs.
- Plan: Add a `PrecisionMode` enum (`AUTO`, `FORCE_FLOAT32`, `FORCE_FLOAT64`) to `AcceleratorConfig`.
1.5 Resilience & Fallback (R1_SAFETY)
To ensure robustness against hardware instability (e.g., GPU driver crashes, OOM), AutoSolver implements a Circuit Breaker pattern:
- Pessimistic Initialization: GPU context creation is wrapped in try-catch. If it fails, the GPU is disabled for the session.
- Operation Guard: Every GPU operation (matmul, inverse, etc.) is guarded.
- Fallback: If a GPU operation throws a hardware exception:
- A warning is logged to stderr.
- The GPU is permanently disabled (gpu_device_ = nullptr).
- The operation is retried immediately on the CPU.
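The circuit-breaker flow can be sketched as follows (class and method names are hypothetical; the real guard is in the C++ AutoSolver): the first hardware exception logs a warning, trips the breaker by nulling the GPU device, and the call falls through to the CPU path.

```python
# Circuit-breaker sketch with hypothetical names: a GPU failure logs a
# warning, permanently disables the GPU for the session, and the same
# operation is retried immediately on the CPU.
import sys

class AutoSolverSketch:
    def __init__(self, gpu_device, cpu_device):
        self.gpu_device = gpu_device        # set to None once the breaker trips
        self.cpu_device = cpu_device

    def matmul(self, a, b):
        if self.gpu_device is not None:     # operation guard
            try:
                return self.gpu_device(a, b)
            except RuntimeError as err:     # stands in for a hardware exception
                print(f"warning: GPU op failed ({err}); disabling GPU",
                      file=sys.stderr)      # warning to stderr
                self.gpu_device = None      # permanently disable for the session
        return self.cpu_device(a, b)        # immediate CPU retry / fallback
```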