# OpenBLAS Integration Plan

**Goal:** Replace manual C++ compute loops with optimized BLAS (Basic Linear Algebra Subprograms) calls to achieve NumPy-level performance while retaining PyCauset's out-of-core streaming architecture.

**Strategy: the "Hybrid Engine"**
- **Chassis:** PyCauset `MemoryGovernor` & `MemoryMapper` (handles I/O, pinning, and streaming).
- **Engine:** OpenBLAS (handles raw compute via AVX2/AVX-512).
## Phase 1: Preparation & Cleanup
Objective: Simplify the codebase before adding new dependencies.
- **Do not target Float16 for CPU BLAS acceleration** (DONE)
  - Reason: OpenBLAS does not provide a Float16 GEMM on CPU.
  - Clarification: `float16` is still a first-class dtype in PyCauset (storage + interop) and is not retired.
  - Codebase evidence: `DataType::FLOAT16` exists; `DenseMatrix<float16_t>` exists; Python exposes `pycauset.float16` and `pycauset.Float16Matrix`.
  - Current behavior: CPU `matmul` supports `Float16Matrix` results via a fallback that accumulates in float32 and stores float16.
  - Action: Keep Float16 as a storage dtype; for BLAS-backed matmul, use Float32/Float64.
  - Action: Document the Float16 limitation: on CPU, matmul is not BLAS-accelerated for Float16.
- **Snapshot Benchmarks**
  - Action: Record current "Native C++" performance for Float64 (already done: ~22 GFLOPS).
  - Action: Record current Float32 performance (if implemented) to establish a baseline.
## Phase 2: Dependency Integration (Build System)
Objective: Successfully link OpenBLAS in the Windows/MSVC environment.
- **Acquire OpenBLAS** (DONE)
  - Action: Configured CMake to automatically download pre-built OpenBLAS binaries (v0.3.26) for Windows during the build.
- **Update CMake Configuration** (DONE)
  - Action: Modify `CMakeLists.txt` to find the OpenBLAS library. (DONE)
  - Action: Add include directories for `cblas.h`. (DONE)
  - Action: Link `pycauset_core` against `libopenblas.lib`. (DONE)
  - Action: Ensure the `.dll` is copied to the build output directory so tests can run. (DONE)
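A hedged sketch of what this `CMakeLists.txt` wiring could look like. The `extern/openblas` path is illustrative, not the repository's actual layout; only the `pycauset_core` target name comes from this plan.

```cmake
# Sketch only: assumes the downloaded OpenBLAS release was unpacked into
# ${CMAKE_SOURCE_DIR}/extern/openblas (this path is an illustrative assumption).
set(OPENBLAS_ROOT "${CMAKE_SOURCE_DIR}/extern/openblas")

find_library(OPENBLAS_LIB
    NAMES openblas libopenblas
    HINTS "${OPENBLAS_ROOT}/lib")

target_include_directories(pycauset_core PRIVATE "${OPENBLAS_ROOT}/include")
target_link_libraries(pycauset_core PRIVATE "${OPENBLAS_LIB}")

# Copy the runtime DLL next to the built binaries so the tests can load it.
add_custom_command(TARGET pycauset_core POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E copy_if_different
            "${OPENBLAS_ROOT}/bin/libopenblas.dll"
            "$<TARGET_FILE_DIR:pycauset_core>")
```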
- **Verify Linkage**
  - Action: Create a minimal "Hello BLAS" C++ test file that just calls `cblas_dgemm` on a 2x2 matrix.
  - Action: Compile and run this test to confirm the build system works before touching the main code.
## Phase 3: Implementation (The "Surgical" Replacement)
Objective: Swap the inner loops for BLAS calls inside the Streaming Architecture.
- **Modify `CpuSolver.cpp` Headers** (DONE)
  - Action: Include `<cblas.h>`. (DONE)
- **Implement Float64 (Double) Support** (DONE)
  - Action: Update `matmul_streaming_f64`. (DONE - refactored to the template `matmul_streaming<T>`)
  - Change: Inside the inner streaming loop (previously `ParallelFor` loops), replace the logic with a call to `cblas_dgemm`. (DONE)
  - Detail: We must calculate `LDA`, `LDB`, `LDC` (leading dimensions/strides) correctly based on the full matrix width, even when processing a small tile. (DONE)
- **Implement Float32 (Single) Support** (DONE)
  - Action: Create/update `matmul_streaming_f32`. (DONE - handled by the template)
  - Change: Use `cblas_sgemm`. (DONE)
- **Implement Complex Number Support**
  - Note: BLAS supports single complex (`cgemm`) and double complex (`zgemm`).
  - Action: Implement `matmul_streaming_c64` (complex double) using `cblas_zgemm`.
  - Action: Implement `matmul_streaming_c32` (complex float) using `cblas_cgemm`.
  - Detail: Ensure the `std::complex<double>` memory layout matches BLAS expectations (it usually does: real, imag, real, imag, ...).
## Phase 4: Verification & Benchmarking
Objective: Prove correctness and performance gains.
- **Unit Testing**
  - Action: Run `test_symmetric_matrix` and the other existing tests.
  - Action: Create specific tests for non-square matrices to ensure the `LDA`/`LDB` strides are correct (a common BLAS bug).
- **Performance Benchmarking**
  - Action: Run `benchmark_native.exe` (Float64).
  - Target: We expect GFLOPS to jump from ~22 to 200+ (matching NumPy).
  - Action: Run benchmarks for Float32 and the complex types.
- **Large Scale Test**
  - Action: Run a benchmark with N > RAM size (e.g., N = 30,000 on a 16 GB machine).
  - Verify: Ensure it does not crash (streaming works) and runs reasonably fast (BLAS works).
## Phase 5: Documentation & Cleanup
Objective: Finalize the transition.
- **Update Documentation**
  - Action: Update `documentation/internals/plans/SUPPORT_READINESS_FRAMEWORK.md` with the BLAS-backed status and thresholds.
  - Action: Update the `internals` docs to explain the BLAS dependency.
  - Action: Remove the old "Micro-Kernel" implementation plans.
- **Code Cleanup**
  - Action: Remove the old generic `matmul_impl` C++ loop implementation if it is no longer used by any type.
## Notes on Complex Numbers
- Float32 complex corresponds to `std::complex<float>`; BLAS routine: `cblas_cgemm`.
- Float64 complex corresponds to `std::complex<double>`; BLAS routine: `cblas_zgemm`.
- Implementation: The streaming logic remains identical. We just pin the memory (which is \(2 \times\) larger per element) and pass the pointer to the appropriate BLAS function.