Block Matrices

Block matrices provide a storage-first way to represent a large matrix as a 2D grid of smaller matrices (“blocks”), potentially with heterogeneous dtypes.

The core goals are:

Preserve structure and avoid global densification (“no silent densify”).
Keep compute routed through the existing leaf compute boundary (AutoSolver / device routing).
Support semi-lazy orchestration (thunked per-block results, evaluated only on triggers).
Persist block matrices as a snapshot without writing a single giant dense payload.

Where it lives

Python implementation:

python/pycauset/_internal/blockmatrix.py: BlockMatrix container + orchestration (block_matmul, block_add, block_sub, block_mul, block_div).
python/pycauset/_internal/submatrix_view.py: SubmatrixView (no-copy rectangular view).
python/pycauset/_internal/thunks.py: ThunkBlock (lazy, cached per-block evaluation).
python/pycauset/_internal/persistence.py: matrix_type=BLOCK save/load support via a sidecar directory.

Integration points:

python/pycauset/__init__.py: pycauset.matrix(...) block-grid construction disambiguation.
python/pycauset/_internal/ops.py: pycauset.matmul(a, b) routes to block orchestration if either operand is a BlockMatrix.

Data model

`BlockMatrix`

A BlockMatrix is a structural container of blocks laid out in a rectangular grid.

Invariants enforced at construction:

Grid must be rectangular (every block-row has the same number of block-cols).
All blocks in a block-row share the same height.
All blocks in a block-col share the same width.

The container exposes:

Elementwise indexing via M[i, j].
Block access via get_block(r, c) and set_block(r, c, block).
Partition metadata via row_partitions / col_partitions.

`SubmatrixView`

SubmatrixView(source, row0, col0, rows, cols) is a lightweight, no-copy rectangle.

Element reads delegate to the source.
repr/str are structure-only.
A view-of-a-view composes deterministically into a single view.

In block orchestration, SubmatrixView is used to tile operands when block boundaries do not align. Block-aware slicing returns tiled SubmatrixView blocks (no densify); unsupported view shapes error deterministically.

`ThunkBlock`

A ThunkBlock represents a deferred computation that produces a concrete matrix-like object.

It caches the computed result.
It is thread-safe for single-eval concurrency.
It is triggered by element access (get / __getitem__) or explicit materialize().

Staleness (snapshot-at-creation, R1):

ThunkBlock pins version on captured sources; evaluation/cache hits check versions and raise on mismatch (no auto-recompute).
BlockMatrix increments its own version on set_block, invalidating cached/thunked blocks owned by the container.
Leaf mutations are expected to bump their version; stale access is an error.

Orchestration semantics

“Once block, always block”

If either operand is a BlockMatrix, operations preserve block-ness by returning a BlockMatrix result (typically thunked):

Matmul: A @ B or pycauset.matmul(A, B)
Elementwise: +, -, *, /

Mixed operands are handled by wrapping the non-block operand as a 1×1 BlockMatrix, then refining partitions to align.

block_matmul refines the shared dimension using sorted(set(A.col_partitions) | set(B.row_partitions)).
Elementwise ops refine both axes using the union of row/col partitions.

The refinement step creates SubmatrixView tiles when necessary.

Leaf compute boundary

When orchestration reaches “leaf × leaf” matmul between native matrices, it routes through the public dispatch boundary (pycauset.matmul) so property-aware conversions (diagonal/triangular) still apply.

Device routing follows Compute Architecture per leaf op: AutoSolver decides CPU vs GPU for each block. Complex matmul is CPU-only on CUDA builds today; mixed-dtype containers stay heterogeneous because routing is per leaf op.

Evaluation triggers (semi-lazy)

Trigger evaluation of the minimal required block(s): element access, crossing the compute boundary, dense conversion (np.asarray), or persistence (pycauset.save).
Non-triggers: repr/str, shape/partition metadata, and get_block.
Cached results are reused until a version mismatch is detected; stale hits raise.

Concurrency: each ThunkBlock uses single-eval locking (e.g., once_flag/mutex) so concurrent requests compute once and reuse the cached block.

Deterministic accumulation per output block

Fixed k order for Σ_k A_ik @ B_kj.
Accumulator dtype is chosen from metadata before evaluation by folding the add-result dtype across term dtypes.
Local promotion is per-block; container stays heterogeneous.

IO accelerator integration

Orchestrated evaluation performs best-effort IO hints:

Prefetch before using a backing file: obj.get_accelerator().prefetch(0, size)
Discard after a temporary is no longer needed: discard(0, size)

These hooks are intentionally best-effort and should never be required for correctness.

Persistence format

Saving a block matrix uses a single .pycauset container file plus a sibling sidecar directory:

Container path: bm.pycauset
Sidecar directory: bm.pycauset.blocks/

The container stores:

matrix_type = "BLOCK"
data_type = "MIXED"
block_manifest with:
row_partitions, col_partitions
children: a grid of {path, payload_uuid} entries

Child blocks are stored as block_r{r}_c{c}.pycauset files in the sidecar directory.

Snapshot integrity:

Each manifest entry pins the child payload_uuid.
Load validates the pinned UUID; mismatch errors deterministically.

Save policies (Release 1):

Stale thunks fail save deterministically (no implicit recompute).
Overwrite cleanup deletes only deterministic child filenames within the sidecar.
Saves stage child files (and nested sidecars) then commit/rename to reduce partial updates.
No block-level cached-derived persistence (trace/determinant/norm/sum) is defined; caches remain per leaf child.

View blocks on save:

Persisting SubmatrixView blocks materializes the view block-locally into a small NumPy array, then converts via native.asarray.
This avoids global densification while still producing stable on-disk storage.

Debugging and traceability

Kernel trace:

pycauset._debug_clear_kernel_trace()
pycauset._debug_last_kernel_trace()

IO trace (separate channel):

pycauset._debug_clear_io_trace()
pycauset._debug_last_io_trace()

For device routing expectations and thunk trigger testing, prefer trace-based integration tests over timing.