Skip to content

Support Readiness Framework (DTypes × Ops × Devices)

Status: Active (replaces OPTIMIZATION_CHECKLIST.md) Goal: A scalable, template-driven checklist to keep dtype + operation coverage correct, fast, and consistent across CPU/GPU and RAM/out-of-core.

This document is intentionally not a huge per-dtype × per-op grid. Instead it provides:

  • A canonical inventory of the dtypes and operations the project currently declares/exposes.
  • A small set of gates (correctness, storage, device routing, lookahead hints, benchmarks) that must be satisfied.
  • Two reusable templates:
  • “Add a dtype” template (end-to-end integration)
  • “Add an op” template (end-to-end integration)

Use this as the “definition of done” for dtype expansion and operation coverage.


1) Canonical dtype inventory

1.1 Frontend dtype tokens (factory layer)

These are the dtype tokens treated as the frontend contract for the public constructors/allocators (e.g. pycauset.matrix(..., dtype=...), pycauset.vector(..., dtype=...), pycauset.zeros(..., dtype=...)).

Small implementation note: some tokens may be temporarily “declared but not yet wired end-to-end”; when that happens, the system should fail clearly (or route through a compat layer) rather than silently producing a different dtype.

Real dtypes

  • bool / bit (packed)
  • int16
  • int32 (also reachable via int)
  • float16
  • float32
  • float64 (also reachable via float)

Complex dtypes (same scalar system; not a separate universe)

  • complex_float16 (first-class; storage-optimized)
  • complex_float32 (a.k.a. complex64)
  • complex_float64 (a.k.a. complex128)

Non-goal (by design): PyCauset does not support complex permutations of non-float dtypes (complex int* / complex bit).

Rationale: in practical workloads, complex-valued linear algebra (eigen/spectral analysis, stable solvers, phase-sensitive kernels) inevitably requires floating-point compute and typically relies on established float/complex-float backends (BLAS/cuBLAS). Supporting complex ints/bits would add significant surface area (promotion, overflow rules, kernels, tests) without enabling meaningful acceleration or common workflows.

Implementation note (small but important): the framework treats these complex permutations as part of the dtype system (i.e. they “exist” conceptually), but they are not yet wired end-to-end in the codebase.

How float works (must be explicit)

  • float means float64 (storage and compute unless an op’s policy explicitly differs).
  • float32 is the explicit smaller float storage/compute dtype.
  • Mixed-float behavior is underpromotion by default: e.g. float32 + float64 operations select float32 unless an op’s promotion policy says otherwise (warnings may apply).

Float16/ComplexFloat16 execution note (scope + expectations)

  • float16 and complex_float16 are first-class storage dtypes.
  • On CPU, it is acceptable (and currently implemented in some ops) to upcast internally for compute (e.g., accumulate in float32 / complex) and then store back to float16/complex_float16.
  • CPU BLAS acceleration is not expected for float16; treat float16 primarily as a storage/bandwidth dtype on CPU unless/until a dedicated backend exists.
  • SRP requirement: when an op upcasts internally, that must be explicit in the op’s contract/support status (so users understand precision/perf tradeoffs).

1.2 Core scalar model + C++ DataType snapshot

PyCauset’s scalar system is best understood as a base dtype plus a small set of orthogonal flags.

  • Base kind: bit | int | float
  • Base widths: 1 for bit; 16/32/64 (and others as added) for int/float
  • Flags: complex (and later unsigned for integers)

Under this model, complex_float64 is “float64 + {complex}”, and similarly for other float base dtypes.

C++ implementation snapshot

Authoritative source: include/pycauset/core/Types.hpp.

  • BIT
  • INT16
  • INT32
  • FLOAT16
  • FLOAT32
  • FLOAT64

Policy: If a dtype exists in DataType, it must have an explicit status in this framework: - Public (reachable via Python factories), or - Internal-only (reachable only via internal constructors/bindings), or - Planned (declared but intentionally not wired end-to-end yet).

Complex representation rule: complex is a flag/permutation over float base dtypes. Storage strategies:

  • Complex floats (complex float32/float64) should use true complex numeric storage where possible (BLAS/cuBLAS path).
  • Complex float16 is first-class for storage/bandwidth wins at very large scale; solvers may upcast internally (e.g., compute in float32) where required for stability.

Reference implementation plan: documentation/internals/DTYPE_COMPLEX_OVERFLOW_PLAN.md.


2) Canonical operation inventory

2.1 Frontend (LinearAlgebra) operations

Authoritative sources: - include/pycauset/math/LinearAlgebra.hpp - src/math/LinearAlgebra.cpp

Matrix × Matrix - add(a, b) - subtract(a, b) - elementwise_multiply(a, b) - elementwise_divide(a, b) - dispatch_matmul(a, b) (matmul)

Vector × Vector - add_vectors(a, b) - subtract_vectors(a, b) - dot_product(a, b) - dot_product_complex(a, b) - cross_product(a, b)

Reductions - norm(v) (vector L2) - norm(m) (matrix Frobenius) - sum(v) (complex-valued accumulator; real inputs return imag=0) - sum(m) (complex-valued accumulator; real inputs return imag=0)

Matrix/Vector mixed - matrix_vector_multiply(m, v) (a.k.a. “matvec”) - vector_matrix_multiply(v, m) (a.k.a. “vecmat”) - outer_product(x, y)

Scalar operations on vectors - scalar_multiply_vector(v, s) (overloads for double, int64_t, complex<double>) - scalar_add_vector(v, s) (overloads for double, int64_t)

Special - compute_k_matrix(...)

Definition of “all operations” (minimum): everything declared in include/pycauset/math/LinearAlgebra.hpp.

2.2 Device interface operations

Authoritative source: include/pycauset/compute/ComputeDevice.hpp.

Note: The device layer is the implementation contract. If an operation exists here, it must have: - CPU correctness (required) - GPU coverage (optional, but must be explicitly routed/blocked)

2.2.1 Current per-op device/out-of-core status (SRP handoff table)

This table is a living SRP handoff artifact. It records the current routing policy:

  • CPU: whether a correctness implementation exists (required)
  • GPU: GPU-enabled vs CPU-route vs blocked
  • Out-of-core: naive vs streaming-enabled
ComputeDevice op CPU GPU Out-of-core Notes
matmul GPU-enabled (Dense float32/64; bit×bit→int32) streaming-enabled (CUDA VRAM chunking) AutoSolver uses heuristic + dtype checks
inverse GPU-enabled (Dense float32/64) naive AutoSolver uses heuristic + dtype checks
batch_gemv GPU-enabled streaming-enabled (CUDA VRAM chunking) AutoSolver uses size-based selection
matrix_vector_multiply CPU-route naive CUDA implementation not present
vector_matrix_multiply CPU-route naive CUDA implementation not present
outer_product CPU-route naive CUDA implementation not present
add GPU-enabled (Dense float32/64; matching dtype) naive AutoSolver uses heuristic + dtype checks
subtract GPU-enabled (Dense float32/64; matching dtype) naive AutoSolver uses heuristic + dtype checks
elementwise_multiply CPU-route naive CUDA implementation not present
elementwise_divide CPU-route naive CUDA implementation not present
multiply_scalar GPU-enabled (Dense float32/64; matching dtype) naive AutoSolver uses heuristic + dtype checks
dot CPU-route naive Always CPU for now
dot_complex CPU-route naive Always CPU for now
sum(VectorBase) CPU-route naive Always CPU for now
l2_norm CPU-route naive Always CPU for now
add_vector CPU-route naive Always CPU for now
subtract_vector CPU-route naive Always CPU for now
scalar_multiply_vector CPU-route naive Always CPU for now
scalar_multiply_vector_complex CPU-route naive Always CPU for now
scalar_add_vector CPU-route naive Always CPU for now
cross_product CPU-route naive Always CPU for now
compute_k_matrix CPU-route naive Always CPU for now
frobenius_norm CPU-route naive Always CPU for now
sum(MatrixBase) CPU-route naive Always CPU for now
trace CPU-route naive Always CPU for now
determinant CPU-route naive Always CPU for now
qr CPU-route naive Always CPU for now

2.2.2 Python-level linalg endpoints (Phase G, endpoint-first)

These are high-level Python entrypoints exposed at pycauset.*.

They are allowed to be implemented initially via composition (calling existing ops) or via a NumPy fallback, as long as behavior is deterministic and documented.

Python endpoint Current status Notes
solve(a, b) ✅ baseline Currently uses invert(a) @ b when no dedicated solver exists
lstsq(a, b) ✅ baseline Normal-equations baseline: \((A^T A)^{-1} A^T b\) (returns only x)
slogdet(a) ✅ baseline Uses a.determinant() then returns (sign, logabsdet)
cond(a) ✅ baseline Uses norm(a) * norm(invert(a))
eigh(a) ✅ baseline NumPy fallback (numpy.linalg.eigh)
eigvalsh(a) ✅ baseline NumPy fallback (numpy.linalg.eigvalsh)
solve_triangular(...) blocked Not implemented yet
lu(...) blocked Not implemented yet
cholesky(...) blocked Not implemented yet
svd(...) blocked Not implemented yet
pinv(...) blocked Not implemented yet

2.3 Object protocol (required for any public dtype)

Independently of “math ops”, a dtype is not considered publicly supported unless its matrix/vector types satisfy:

  • Construction (size-based + data-based)
  • Element access + mutation (get, set, __getitem__, __setitem__ as applicable)
  • Complex safety: complex-valued objects must not silently drop imaginary parts (real-only access must be guarded or error)
  • Transpose behavior (view vs materialize policy is explicit)
  • Scalar handling (where supported) and dtype reporting
  • Persistence hooks (copy_storage, _from_storage) and round-trip load/save
  • NumPy interop where available (e.g., NumPy import/export fast-paths like internal native.asarray and np.asarray(obj))

3) Storage + persistence invariants

3.1 In-memory storage expectations ("optimal storage")

“Optimal storage” is defined by the smallest representation that preserves semantics and supports the required access patterns for the operation set.

Minimum expectations by kind:

  • bit
  • Use packed storage for dense/triangular bit matrices and bit vectors.
  • “Numeric” ops must explicitly widen (and document the widening), or be error-by-design.

  • int16 / int32

  • Dense storage is row-major contiguous.
  • Mixed-precision integer ops must either:

    • use dedicated integer kernels, or
    • widen via documented promotion rules.
  • float32 / float64

  • Dense storage is row-major contiguous.
  • Large-size defaults may prefer float32 for storage efficiency when policy says so.

  • float16

  • Dense storage uses float16 payload (pycauset::float16_t) and is treated as a first-class dtype.
  • CPU kernels may compute in higher precision (typically float32) and then store float16 results.
  • If an operation’s correctness depends on upcasted compute, the support status must say so (compute dtype vs storage dtype).

  • complex (flag/permutation)

  • Supported complex dtypes are complex floats only.
  • Storage is either true complex float storage (performance path) or two-plane float16 storage for complex_float16 (scale-first / bandwidth path).

Implementation reality check (must remain aligned with code):

  • CPU matmul for float16 currently uses float32 accumulation with float16 storage.
  • CPU matmul for complex_float16 currently uses upcasted complex compute with float16-plane storage.

3.2 On-disk format expectations

Persistence is a single-file .pycauset container with an mmap-friendly raw payload and a typed metadata block.

Invariants - Metadata must record at least: - matrix_type - data_type - dimensions + seed + scalar + transpose flag - The payload region is the raw storage bytes used by _from_storage(...) to memory-map.

Complex payloads (current): complex_float16 uses two-plane storage in-memory, but persistence round-trips via a single raw payload containing both planes contiguously. Metadata identifies the dtype as complex_float16 and normal shape/layout fields.

Complex payloads (optional future enhancement): multiple payload regions (or named blobs) could be added later for ease of inspection/tooling, but are not required for correctness.

Definition of done: saving + loading must round-trip for each (dtype, structure) that is public.


4) Cooperative Compute Architecture (CCA) / Lookahead protocol gate

Authoritative source: documentation/internals/CooperativeArchitecture.md.

Rule: For operations that stream or stride through persistent storage (especially matmul-like kernels), the solver must emit MemoryHints before heavy reads.

Minimum expectations: - Emit Sequential hints for row-major streaming reads - Emit Strided hints for transposed/column-like access - Tests must cover that hints are emitted (at least for CPU matmul path)


5) Device coverage + routing rules

  • CPU correctness is mandatory.
  • GPU coverage is optional per operation and per dtype/structure subset.
  • Device selection is expected to be automatic by default: choose CPU vs GPU based on hardware capability and a lightweight micro-benchmark/heuristic.
  • If GPU does not support a case, the system must do exactly one of:
  • route to CPU in AutoSolver (explicit CPU-only policy for that op/case), or
  • throw a clear error when GPU is selected/forced.

“Silent wrong answers” is the only unacceptable outcome.

5.1 Current implementation note (AutoSolver)

The current design already aligns with the “measure and choose” policy:

  • AutoSolver runs a small matmul micro-benchmark when a GPU device is enabled.
  • Dispatch uses thresholds + the measured speedup factor to prefer CPU when GPU would be slower.

Future direction (not implemented yet): cooperative CPU+GPU tandem execution for a single operation.


6) Gates (what must be true before claiming “supported”)

Treat each gate as a checklist you can apply to either “a new dtype” or “a new op”.

Gate A — Correctness

  • Small deterministic cases (hand-checkable)
  • Randomized cases (property-style)
  • Cross-dtype cases (promotion/overflow/underpromotion)
  • Complex-variant coverage for public complex dtypes (and “error-by-design” assertions where complex closure is planned)
  • “Error-by-design” cases are explicit and tested

Gate B — Storage + persistence

  • Round-trip save/load for dense + triangular + vector variants that are public
  • Out-of-core path is exercised (memory-mapped load + compute)

Gate C — Routing + device policy

  • CPU path exists
  • GPU path is implemented or explicitly blocked/routed
  • Behavior is consistent across frontend (LinearAlgebra) and Python surface

Gate D — CCA lookahead hints

  • Operation emits hints for persistent operands when it has a predictable access pattern

Gate E — Benchmarks (vs NumPy)

  • In-memory benchmarks vs NumPy for at least one representative shape regime
  • Disk-backed benchmark that exercises paging behavior (large mmap-backed payload)

7) Template: Adding a new dtype (end-to-end)

Fill this template for each new dtype you add.

7.1 Declare and normalize

  • Add/confirm DataType enum value
  • Define dtype token(s) for Python
  • Implement dtype normalization (accept pc.*, np.*, and case-insensitive strings)

7.2 Storage types and matrix/vector coverage

  • Dense matrix type exists
  • Triangular matrix type exists if required by the structure policy
  • Vector type exists
  • Identity/Diagonal participation is defined (supported or explicitly excluded)

7.3 Factory + persistence

  • ObjectFactory supports create/load/clone
  • Python save/load metadata mapping exists
  • _from_storage works for mmap-backed data

7.4 Operations

For each canonical operation group (Section 2): - [ ] Explicit dtype behavior statement (promotion, overflow, underpromotion) - [ ] CPU correctness - [ ] GPU policy (supported or blocked)

7.5 Tests + benchmarks

  • Gate A–E satisfied

8) Template: Adding a new operation (end-to-end)

Fill this template for each new operation you add.

8.1 Define the contract

  • Operand ranks supported (M×M, V×V, M×V, V×M)
  • Shape rules and error messages
  • Result dtype + structure rules (promotion + matrix-type promotion)

8.2 Wire end-to-end

  • ComputeDevice interface method added
  • CPU implementation (CpuSolver) + passthrough (CpuDevice)
  • AutoSolver routing policy (CPU-only or GPU-enabled)
  • Frontend wrapper in LinearAlgebra (allocation + dispatch)
  • Python bindings + Python API

8.3 Coverage axes (Protocols.md)

  • Operand rank
  • Scalar kind + flags (bit/int/float; complex/unsigned if applicable)
  • Structure/storage (dense/triangular/identity/diagonal/unit-vector)
  • Device coverage (CPU required; GPU optional)
  • Python surface
  • Documentation + tests

8.4 CCA lookahead

  • Emit memory hints for persistent operands when applicable

8.5 Tests + benchmarks

  • Gate A–E satisfied