Protocol: Adding New Matrix/Vector Operations
Objective: make new ops easy to add without a “hunt across the codebase”, and ensure both matrix and vector variants are covered.
The current architecture (where things go)
- Frontend (type resolution + allocation + entry point)
  - Declarations: `include/pycauset/math/LinearAlgebra.hpp`
  - Definitions: `src/math/LinearAlgebra.cpp`
- Context (`ComputeContext`): `include/pycauset/compute/ComputeContext.hpp`, `src/compute/ComputeContext.cpp`
- Dispatcher (`AutoSolver`): `include/pycauset/compute/AutoSolver.hpp`, `src/compute/AutoSolver.cpp`
- Interface (`ComputeDevice`): `include/pycauset/compute/ComputeDevice.hpp`
- Implementations (CPU / CUDA)
  - CPU device: `include/pycauset/compute/cpu/CpuDevice.hpp`, `src/compute/cpu/CpuDevice.cpp`
  - CPU algorithms: `include/pycauset/compute/cpu/CpuSolver.hpp`, `src/compute/cpu/CpuSolver.cpp`
  - CUDA device: `src/accelerators/cuda/CudaDevice.hpp`, `src/accelerators/cuda/CudaDevice.cu`
The “Support Checklist” (authoritative)
When adding a new operation, you must decide and/or implement support along each of these axes:

- Operand rank
  - Matrix–Matrix
  - Vector–Vector
  - Matrix–Vector and Vector–Matrix (if applicable)
- Scalar kind + flags
  - Fundamental kinds: `bit`, `int`, `float`
  - Flags/permutations: `complex`, `unsigned`
  - Special rule: `bit` is allowed to have op-specific exceptions (bitwise vs numeric vs error-by-design). See internals/DType System.
- Structure/storage
  - Dense vs triangular vs symmetric/antisymmetric vs identity/diagonal
  - Vector special cases like `UnitVector`
- Device coverage
  - CPU must be correct.
  - GPU is optional; if unsupported, routing must be prevented or the op must throw a clear error.
- Python surface
  - Bindings: `src/bindings/` (`bind_matrix.cpp`, `bind_vector.cpp`, `bind_complex.cpp`) and the aggregator `src/bindings.cpp`.
  - Python API helpers/wrappers may also need updates in `python/pycauset/`.
- Documentation + tests
  - Add/modify API docs in `documentation/docs/` and guides in `documentation/guides/`.
  - Add tests (Python and/or C++) covering dtype permutations and “error-by-design” cases.
- Properties (R1_PROPERTIES)
  - Decide which properties the op consumes (e.g., `is_upper_triangular`, `is_hermitian`).
  - Decide whether the op produces/propagates/changes properties on its outputs.
  - Enforce the no-scan rule: properties must never be validated by scanning payload data.
  - Document “power user” semantics where relevant: properties are gospel and may override payload truth.
- Cached-derived properties (R1_PROPERTIES)
  - Decide which cached-derived properties (e.g., `trace`) the op invalidates (default: invalidate all cached-derived values on any payload mutation).
  - If the op is a metadata-only transform, decide which cached-derived values can be propagated via explicit \(O(1)\) rules (otherwise clear them).
  - If the op is parallelized, decide whether it should emit a constant-size effect summary so the post-op health check can update metadata without a second payload pass.
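The property and cached-derived rules above can be sketched in a few lines. This is a minimal illustration with stand-in types (`Properties`, `propagate_transpose`, `invalidate_on_mutation` are assumed names, not the project's actual API); the point is that propagation is pure \(O(1)\) metadata work with no payload scan:

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Stand-in property bag; field names are illustrative.
struct Properties {
    bool is_upper_triangular = false;
    bool is_lower_triangular = false;
    std::optional<double> cached_trace;  // cached-derived value
};

// Metadata-only transform: transpose swaps the triangular flags and may
// keep the cached trace, since trace of A^T equals trace of A. No scan.
inline Properties propagate_transpose(Properties p) {
    std::swap(p.is_upper_triangular, p.is_lower_triangular);
    return p;
}

// Default rule for any payload mutation: clear every cached-derived value.
inline Properties invalidate_on_mutation(Properties p) {
    p.cached_trace.reset();
    return p;
}
```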
Step-by-step (what to edit, in order)
1) Add the device-interface method

Add a pure virtual method to `include/pycauset/compute/ComputeDevice.hpp`, typically one matrix overload and one vector overload.
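As a sketch, a hypothetical op `my_new_op` (the name also used by the routing example in step 3) could be declared with one matrix overload and one vector overload. The `MatrixBase`/`VectorBase` types here are stand-ins so the snippet compiles on its own; the real signatures live in `ComputeDevice.hpp`:

```cpp
#include <cstddef>

// Stand-ins for the real matrix/vector types; only here so the sketch
// is self-contained.
struct MatrixBase { std::size_t rows = 0, cols = 0; };
struct VectorBase { std::size_t size = 0; };

class ComputeDevice {
public:
    virtual ~ComputeDevice() = default;

    // Matrix variant: out = my_new_op(a, b).
    virtual void my_new_op(const MatrixBase& a, const MatrixBase& b,
                           MatrixBase& out) = 0;

    // Vector variant: out = my_new_op(a, b).
    virtual void my_new_op(const VectorBase& a, const VectorBase& b,
                           VectorBase& out) = 0;
};
```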
2) Implement CPU algorithm + wire CpuDevice
- Add the implementation to `include/pycauset/compute/cpu/CpuSolver.hpp` and `src/compute/cpu/CpuSolver.cpp`.
- Wire it through `include/pycauset/compute/cpu/CpuDevice.hpp` and `src/compute/cpu/CpuDevice.cpp` as a thin passthrough.
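The "thin passthrough" shape looks roughly like this. Everything here is a stand-in (a flat `std::vector<double>` payload and an element-wise kernel instead of the real solver); the point is that `CpuDevice` contains no logic beyond delegation to the solver:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace cpu_solver {
// The actual algorithm lives in CpuSolver; an element-wise sum stands in
// for the real kernel here.
inline void my_new_op(const std::vector<double>& a,
                      const std::vector<double>& b,
                      std::vector<double>& out) {
    out.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
}
}  // namespace cpu_solver

class CpuDevice {
public:
    // Thin passthrough: no validation, no allocation, just delegation.
    void my_new_op(const std::vector<double>& a,
                   const std::vector<double>& b,
                   std::vector<double>& out) {
        cpu_solver::my_new_op(a, b, out);
    }
};
```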
3) Add AutoSolver routing
Update `include/pycauset/compute/AutoSolver.hpp` and `src/compute/AutoSolver.cpp`.
- Default: route by size with `select_device(elements)->my_new_op(...)`.
- If the GPU does not support the op/dtypes/structures, either:
  - enforce “CPU only” in AutoSolver for that op, or
  - keep GPU routing but ensure the GPU throws a clear error.
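A sketch of the "CPU only" guard, option (a). The threshold, the `DeviceKind` enum, and the `kGpuSupportsMyNewOp` flag are assumptions for illustration; the real size-based selection lives inside AutoSolver:

```cpp
#include <cassert>
#include <cstddef>

enum class DeviceKind { Cpu, Gpu };

// Stand-in for AutoSolver's size-based selection; the real cutoff is not
// reproduced here.
inline DeviceKind select_device(std::size_t elements) {
    constexpr std::size_t kGpuThreshold = std::size_t{1} << 20;  // assumed
    return elements >= kGpuThreshold ? DeviceKind::Gpu : DeviceKind::Cpu;
}

constexpr bool kGpuSupportsMyNewOp = false;  // flip once the CUDA path lands

// Option (a) from above: clamp routing to CPU while the GPU path is missing,
// so the op never reaches a device that would produce wrong results.
inline DeviceKind route_my_new_op(std::size_t elements) {
    DeviceKind d = select_device(elements);
    if (d == DeviceKind::Gpu && !kGpuSupportsMyNewOp)
        d = DeviceKind::Cpu;
    return d;
}
```

Option (b) would instead keep the routing unconditional and have the CUDA device throw a clear "unsupported op" error.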
4) (Optional) Implement CUDA
If supported, implement the op in `src/accelerators/cuda/CudaDevice.cu`.
5) Add the frontend wrapper(s)
Add a top-level frontend function in `include/pycauset/math/LinearAlgebra.hpp` + `src/math/LinearAlgebra.cpp`.
Responsibilities of the frontend:
- validate shapes,
- determine the result dtype/structure,
- allocate the result using `ObjectFactory::create_matrix` / `ObjectFactory::create_vector`,
- dispatch via `ComputeContext::instance().get_device()->...`.
Additional responsibility (R1_PROPERTIES):
- Compute any effective structure category once (e.g., zero/identity/diagonal/triangular/general) from properties and pass that decision down to avoid repeated property lookups in inner loops.
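The frontend responsibilities above can be sketched as follows. The `Matrix` type and `device_my_new_op` are stand-ins (the real code allocates through `ObjectFactory` and dispatches through `ComputeContext::instance().get_device()`), but the ordering of the steps is the point:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for the real matrix type created by ObjectFactory.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
};

// Stand-in for ComputeContext::instance().get_device()->my_new_op(...).
inline void device_my_new_op(const Matrix& a, const Matrix& b, Matrix& out) {
    for (std::size_t i = 0; i < out.data.size(); ++i)
        out.data[i] = a.data[i] + b.data[i];
}

inline Matrix my_new_op(const Matrix& a, const Matrix& b) {
    // 1) Validate shapes before touching any payload.
    if (a.rows != b.rows || a.cols != b.cols)
        throw std::invalid_argument("my_new_op: shape mismatch");
    // 2) + 3) Determine result dtype/structure and allocate the result
    //    (ObjectFactory::create_matrix in the real code).
    Matrix out(a.rows, a.cols);
    // 4) Dispatch to the selected device.
    device_my_new_op(a, b, out);
    return out;
}
```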
6) Bind to Python
Add the binding to the appropriate `src/bindings/bind_*.cpp` file and ensure it is reachable from `src/bindings.cpp`.
7) Declare dtype/coverage expectations (policy + tests)
Every new op must explicitly state its dtype behavior. In particular:
- Fundamental-kind rule (bit/int/float): do not “promote down” across kinds. For example, `matmul(bit, float64) -> float64`.
- Underpromotion within floats: `matmul(float32, float64) -> float32` by default.
- Overflow: integer overflow throws; large integer matmul may emit an advisory risk warning.
These rules are defined in internals/DType System and summarized in project/Philosophy.
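The two examples above can be captured in a tiny model. This is illustrative only (the real dtype lattice is defined in internals/DType System, and the `Kind`/`DType` names are assumptions): across kinds the higher fundamental kind wins, and within a kind the narrower width wins by default:

```cpp
#include <cassert>

// Ordered so a larger enum value is the "higher" fundamental kind.
enum class Kind { Bit = 0, Int = 1, Float = 2 };

struct DType {
    Kind kind;
    int bits;
    bool operator==(const DType& o) const {
        return kind == o.kind && bits == o.bits;
    }
};

inline DType result_dtype(DType a, DType b) {
    // Cross-kind: never promote down; the higher fundamental kind wins.
    if (a.kind != b.kind)
        return static_cast<int>(a.kind) > static_cast<int>(b.kind) ? a : b;
    // Same kind: underpromote to the narrower width by default.
    return a.bits <= b.bits ? a : b;
}
```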
Streaming manager wiring (per-op template)
Use this when deciding whether an op should stream and how to express it.
- Decide streamability and invariants. Specify when streaming is allowed (e.g., associative/commutative reductions, block-separable transforms) and when to fall back to single-shot (e.g., global normalization constants, shape-dependent fusion that would regress correctness). Capture the rationale so AutoSolver can gate.
- Define the streaming descriptor. Choose chunk shape (rows/tiles/blocks), prefetch distance, double-buffering vs single-buffer, and whether pinned host staging is required. Provide CPU defaults and GPU defaults separately; GPU defaults should include host<->device staging policy and max in-flight buffers.
- Route through AutoSolver. Add a routing path that prefers the streaming plan when the device supports it and the op’s streamability predicates are satisfied; otherwise route to the non-streaming implementation or raise a clear “streaming not supported for this configuration” error.
- Budgeting with the memory governor. Set per-op limits: max residency for staging buffers, spill/evict behavior, and what to do when budgets are tight (e.g., shrink chunk size, drop to CPU-only, or disable streaming). Avoid hidden allocations outside the governor.
- Telemetry and tests. Add a minimal streaming test that asserts the op uses the streaming path (e.g., via a debug hook or perf counter) and a fallback test that confirms graceful degradation when streaming is blocked. Document the defaults and any known non-streamable cases in the op’s docstring/API doc.
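As one way to express the template above, a per-op streaming descriptor plus a streamability gate might look like this. All field names and defaults are assumptions for illustration, not the streaming manager's actual API:

```cpp
#include <cassert>
#include <cstddef>

// Per-op streaming descriptor; GPU defaults would flip pinned_host_staging
// and tune max_in_flight against the memory governor's budget.
struct StreamingDescriptor {
    std::size_t chunk_rows = 256;       // chunk shape: rows per block
    std::size_t prefetch_distance = 2;  // chunks requested ahead of compute
    bool double_buffered = true;        // overlap I/O with compute
    bool pinned_host_staging = false;   // GPU staging path would set this
    std::size_t max_in_flight = 2;      // cap on concurrently staged buffers
};

// Streamability gate, as AutoSolver might consult it: stream only when the
// op is block-separable (e.g., an associative/commutative reduction) and
// needs no global constant computed over the whole payload first.
inline bool can_stream(bool block_separable, bool needs_global_constant) {
    return block_separable && !needs_global_constant;
}
```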
Minimal “Definition of Done” for a new op
- `ComputeDevice` interface updated.
- CPU implementation exists and is correct.
- `AutoSolver` dispatch correct (CPU-only or GPU-enabled).
- Frontend wrapper(s) in `src/math/LinearAlgebra.cpp` (matrix and/or vector).
- Python bindings updated.
- Dtype behavior documented (including `bit` exceptions and cross-kind rules).
- Tests cover supported dtypes + at least one “error-by-design” dtype.
If the op is property-aware (R1_PROPERTIES):
- The op’s behavior with properties is documented (which properties it reads, and which it writes/propagates).
- Property behavior is deterministic and respects the “no truth validation / no data scans” rule.
If the op interacts with cached-derived properties (R1_PROPERTIES):
- Cache behavior is documented (which cached-derived values can be preserved/propagated vs must be cleared).
- Cache behavior is deterministic and uses only \(O(1)\) rules (no scans / no extra passes).
For `bit` specifically, always state whether the new op is:
- bitwise (stays `bit`, stays packed), or
- numeric (may widen to `int`/`float`, potentially huge), or
- error-by-design for `bit` unless the user explicitly requests widening.