Memory Architecture & Tiered Storage
Status: Implemented (Phases 1-4). Last Updated: December 8, 2025.
Overview
PyCauset employs a Tiered Storage Architecture to balance the conflicting goals of "Infinite Scale" (Disk) and "High Performance" (RAM).
Instead of treating the disk as a simple extension of RAM (OS Paging), PyCauset actively manages where data lives.
The Hierarchy
- L1: Physical RAM (Hot)
  - Objects are stored in anonymous memory (`std::vector` or `malloc`).
  - Fastest access.
  - Managed by `MemoryGovernor`.
- L2: Memory-Mapped Disk (Warm/Cold)
  - Under RAM pressure, objects may spill by switching from a RAM-only mapping to a file-backed (memory-mapped) mapper.
  - The file-backed mapper uses session backing files (for example, `.tmp` files under the backing directory).
  - Portable persistence uses explicit `.pycauset` snapshots written by `save()`.
Export Safety (Materialization Guard)
A specific risk in this architecture is "Implosion": naively converting an L2 object (e.g., a 1 TB file-backed matrix) into an L1 object (a NumPy array), instantly exhausting RAM.
The Export Guard mechanism prevents this:
- File-backed or large spill-backed objects are flagged as huge.
- APIs like `__array__` (used by `np.array()`) check this flag.
- If the flag is set, export fails unless the user provides an explicit override (`allow_huge=True` via `to_numpy`).
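The guard can be illustrated with a minimal sketch. This is not PyCauset's actual implementation; the `GuardedMatrix` class and its `huge` attribute are illustrative stand-ins for the flag and APIs described above.

```python
import numpy as np

class GuardedMatrix:
    """Toy sketch of the Export Guard (illustrative, not the real class)."""

    def __init__(self, data, huge=False):
        self._data = np.asarray(data)
        self.huge = huge  # set for file-backed / large spill-backed objects

    def __array__(self, dtype=None):
        # np.array(obj) routes through __array__; refuse silent materialization
        if self.huge:
            raise MemoryError(
                "Refusing to materialize a huge disk-backed object; "
                "use to_numpy(allow_huge=True) to override."
            )
        return self._data if dtype is None else self._data.astype(dtype)

    def to_numpy(self, allow_huge=False):
        # The explicit override path: the user consciously accepts the cost.
        if self.huge and not allow_huge:
            raise MemoryError("Pass allow_huge=True to export a huge object.")
        return self._data.copy()
```

Small objects export normally; huge ones require the explicit opt-in, so a stray `np.array(matrix)` cannot exhaust RAM by accident.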
Persistence format status
There is exactly one on-disk format for .pycauset: a single-file binary container with an mmap-friendly payload region and a sparse, self-describing, typed metadata block.
Metadata updates are crash-consistent and do not shift the payload.
Authoritative plans:
- `documentation/internals/plans/completed/R1_STORAGE_PLAN.md` (container format)
- `documentation/internals/plans/completed/R1_PROPERTIES_PLAN.md` (metadata semantics)
Snapshot semantics (mutation policy)
PyCauset is designed so that persisted `.pycauset` files behave like immutable snapshots when loaded.
- Payload writes should not implicitly overwrite the snapshot.
- The implementation uses copy-on-write working copies so snapshot payload bytes are not overwritten by incidental mutation.
See guides/Storage and Memory for the canonical policy and crash-safety notes (including big-blob caches).
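The copy-on-write policy above can be sketched in a few lines. This is a toy model, not the real mapper: `Snapshot` stands in for the loaded `.pycauset` payload, and `WorkingCopy` copies the bytes on first write so the snapshot is never touched.

```python
class Snapshot:
    """Stands in for the payload bytes of a loaded .pycauset file."""

    def __init__(self, payload: bytes):
        self.payload = bytearray(payload)

class WorkingCopy:
    """Copy-on-write view: reads go to the snapshot until the first write."""

    def __init__(self, snapshot: Snapshot):
        self._snapshot = snapshot
        self._private = None  # allocated lazily on first mutation

    def read(self, i: int) -> int:
        buf = self._private if self._private is not None else self._snapshot.payload
        return buf[i]

    def write(self, i: int, value: int) -> None:
        if self._private is None:
            # COW: copy the payload exactly once, on the first mutation
            self._private = bytearray(self._snapshot.payload)
        self._private[i] = value
```

After a write, the working copy diverges while the snapshot bytes remain byte-for-byte intact, which is the mutation policy stated above.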
1. The Memory Governor (Phase 1)
The MemoryGovernor is a singleton that acts as the central resource manager.
Responsibilities
- Dynamic Budgeting:
  - It polls the OS for actual available RAM (`GlobalMemoryStatusEx` on Windows, `sysinfo` on Linux).
  - It maintains a "Safety Margin" (default: 10% of RAM or 2GB) to prevent the OS from choking.
- Priority Queue:
  - It tracks all `PersistentObject` instances in an LRU (Least Recently Used) list.
  - When an object is accessed (`touch()`), it moves to the front of the queue.
- Eviction:
  - When a new allocation is requested via `request_ram(size)`, the Governor checks whether `Available RAM > size + margin`.
  - If not, it evicts the least-recently-used objects (spilling them to disk) until enough space is freed.
Usage
```cpp
// In PersistentObject::initialize
if (MemoryGovernor::instance().request_ram(size_bytes)) {
    // Allocate in RAM
    allocate_ram_buffer();
    MemoryGovernor::instance().register_object(this, size_bytes);
} else {
    // Spill to disk immediately
    initialize_mmap_file();
}

// In Solver
MemoryGovernor::instance().touch(matrix_a);
```
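The budget/LRU/eviction loop is language-agnostic, so it can be sketched compactly in Python. This is an illustration of the responsibilities listed above, not the C++ implementation; the `spill_to_disk()` callback and field names are assumptions.

```python
from collections import OrderedDict

class MemoryGovernor:
    """Toy budget + LRU + eviction loop (illustrative sketch)."""

    def __init__(self, budget_bytes: int, margin_bytes: int):
        self.budget = budget_bytes
        self.margin = margin_bytes
        self.used = 0
        self._lru = OrderedDict()  # object -> size; end = most recently used

    def request_ram(self, size: int) -> bool:
        # Evict least-recently-used objects until the request (plus the
        # safety margin) fits, or nothing is left to evict.
        while self.used + size + self.margin > self.budget and self._lru:
            victim = next(iter(self._lru))  # front of the queue = LRU
            self.evict(victim)
        return self.used + size + self.margin <= self.budget

    def register_object(self, obj, size: int) -> None:
        self._lru[obj] = size
        self.used += size

    def touch(self, obj) -> None:
        self._lru.move_to_end(obj)  # mark as most recently used

    def evict(self, obj) -> None:
        obj.spill_to_disk()  # demote from L1 (RAM) to L2 (file-backed)
        self.used -= self._lru.pop(obj)
```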
2. IO Accelerator (Phase 3)
The IOAccelerator optimizes the interaction between the application and the OS paging system for L2 (Disk-backed) objects.
Mechanism
- Windows:
  - Write Optimization: Uses `SetFileValidData` (requires `SE_MANAGE_VOLUME_NAME`) to extend files without zero-filling, solving the "Import Gap".
  - Read Optimization: Uses `PrefetchVirtualMemory` to issue asynchronous requests to the OS memory manager, bringing pages into RAM before the CPU faults on them.
- Linux:
  - Write Optimization: Uses `fallocate` to pre-allocate disk blocks.
  - Read Optimization: Uses `madvise` with `MADV_WILLNEED` (or `MAP_POPULATE` at mapping time) to trigger read-ahead.
Workflow
- Creation: When a new file is created, `SetFileValidData`/`fallocate` is called to reserve space instantly.
- Prefetch: Before a heavy compute operation (e.g., matrix multiplication), the solver calls `accelerator->prefetch()`. This hints to the OS to populate the page cache.
- Compute: The CPU accesses the memory. Since pages are likely already in RAM, major page faults are minimized.
- Discard: After the operation, if the data is intermediate or unlikely to be reused soon, `accelerator->discard()` is called. This uses `OfferVirtualMemory` (Windows) or `MADV_DONTNEED` (Linux) to tell the OS these pages can be evicted immediately, freeing RAM for other tasks.
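The same reserve/prefetch/compute/discard cycle can be demonstrated with Python's `mmap` wrapper around `madvise`. This mirrors the workflow above, not PyCauset's C++ accelerator; the `MADV_*` attributes are platform-dependent (Linux), hence the `hasattr` guards.

```python
import mmap
import os
import tempfile

# Creation: reserve space up front (stands in for fallocate/SetFileValidData)
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 1 << 20)  # 1 MiB, zero-filled
m = mmap.mmap(fd, 0)

# Prefetch: hint the kernel to read pages ahead of the first access
if hasattr(mmap, "MADV_WILLNEED"):
    m.madvise(mmap.MADV_WILLNEED)

# Compute: touch one byte per 4 KiB page; pages are ideally already cached
total = sum(m[i] for i in range(0, len(m), 4096))

# Discard: tell the kernel these pages may be evicted immediately
if hasattr(mmap, "MADV_DONTNEED"):
    m.madvise(mmap.MADV_DONTNEED)

m.close()
os.close(fd)
os.remove(path)
print(total)  # the file is zero-filled, so total is 0
```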
4. Pinned Memory & Direct Path (Phase 4)
To achieve maximum performance for operations that fit entirely in RAM, PyCauset implements a Direct Path optimization that bypasses the standard paging/tiling overhead.
The "Nanny Problem" & Anti-Nanny Logic
Standard memory-mapped files rely on the OS to page data in and out. While safe, this introduces latency (page faults). However, forcing manual streaming/tiling on RAM-resident data that the OS pager could serve directly is also inefficient: the code "nannies" memory that needs no supervision (the "Nanny Problem").
The Solution: should_use_direct_path()
The MemoryGovernor now exposes a centralized decision method: should_use_direct_path(total_bytes).
- Check 1 (Pinning): If data fits in the Pinned Memory Budget, we pin it and use the BLAS Direct Path. This is the fastest possible mode.
- Check 2 (Anti-Nanny): If data fits in Total Available RAM (but exceeds the pinning budget), we still use the Direct Path (without pinning). We trust the OS pager to handle the memory efficiently, avoiding the overhead of manual tiling.
- Fallback: Only if the data exceeds Available RAM do we switch to the Streaming/Out-of-Core Solver.
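The three-way decision above reduces to two threshold comparisons. A minimal sketch (the function name and parameters are illustrative stand-ins for `should_use_direct_path` and the Governor's budgets):

```python
def choose_path(total_bytes: int, pinned_budget: int, available_ram: int) -> str:
    """Toy version of the Direct Path decision described above."""
    if total_bytes <= pinned_budget:
        return "direct-pinned"    # Check 1: pin and run BLAS at full speed
    if total_bytes <= available_ram:
        return "direct-unpinned"  # Check 2 (Anti-Nanny): trust the OS pager
    return "streaming"            # Fallback: tiled out-of-core solver
```

Note the ordering matters: pinning is tried first because it is the fastest mode, and streaming is reached only when the data genuinely cannot fit in RAM.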
Direct Path Workflow
In CpuSolver, before starting a heavy operation (like Matrix Multiplication):
- Attempt Pinning: Call `attempt_direct_path()`. If successful, run BLAS on pinned memory.
- Check Anti-Nanny: Call `MemoryGovernor::should_use_direct_path()`. If true, run BLAS on unpinned memory (OS paging).
- Streaming Fallback: If both fail, run `matmul_streaming` (manual tiling + prefetching).
This strategy allows PyCauset to match the performance of in-memory libraries (like NumPy) for medium-sized datasets, while gracefully falling back to the robust, tiled, out-of-core solver for massive datasets.
Generalized Implementation
The "Direct Path" logic is implemented in a generalized helper attempt_direct_path<T> which supports both double and float. It is automatically attempted before falling back to the streaming/tiling logic.
Diagram: Copy-on-Write Mapper Sharing
```mermaid
graph TD
    subgraph Initial State
        A[Matrix A] -->|shared_ptr| M1[MemoryMapper 1]
    end
    subgraph After B = A.copy
        A2[Matrix A] -->|shared_ptr| M2[MemoryMapper 1]
        B[Matrix B] -->|shared_ptr| M2
    end
    subgraph After B.set_element
        A3[Matrix A] -->|shared_ptr| M2
        B2[Matrix B] -->|shared_ptr| M3[MemoryMapper 2]
    end
```
5. File Formats (R1_SAFETY)
PyCauset uses two distinct file formats for disk-backed storage.
5.1 Snapshot Container (.pycauset)
- Role: Portable, persistent storage for matrices and causal sets.
- Manager: `python/pycauset/_internal/persistence.py`.
- Structure:
  - Header (4KB): Magic `PYCAUSET`, Version 1, slot pointers.
  - Slots (A/B): Double-buffered pointers to metadata blocks (crash consistency).
  - Metadata: Typed JSON-like map (shape, dtype, properties).
  - Payload: Raw binary data (aligned to 4KB).
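The double-buffered slot scheme can be sketched abstractly: metadata is written to the inactive slot, and only then is the header pointer flipped, so a crash mid-write leaves the old slot intact. The layout details below are invented for illustration; in a real container the pointer flip would be preceded by an fsync.

```python
class Container:
    """Toy slot A/B metadata update; the header flip is the commit point."""

    def __init__(self):
        self.slots = [b"", b""]  # two metadata blocks (slot A, slot B)
        self.active = 0          # header pointer: which slot is current

    def update_metadata(self, new_meta: bytes) -> None:
        inactive = 1 - self.active
        self.slots[inactive] = new_meta  # write the inactive slot first
        # ...fsync the slot here in a real implementation...
        self.active = inactive           # single atomic flip commits the update
```

Because the payload lives in its own region, these metadata updates never shift payload bytes, matching the crash-consistency claim above.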
5.2 Backing File (.tmp)
- Role: Temporary storage for spilled objects or large intermediates.
- Manager: `src/core/MemoryMapper.cpp`.
- Structure:
  - Header (64 bytes): Magic `PYCAUSET`, Version 1, Reserved (zeros).
  - Payload: Raw binary data.
- Safety: The 64-byte header prevents accidental loading of raw files as valid PyCauset containers (or vice versa). `MemoryMapper` distinguishes the two formats by checking the "Reserved" field (which is non-zero in `.pycauset` headers).
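The header check can be illustrated with a short classifier. The offsets and field widths below are assumptions for illustration only; the real layout is defined by `MemoryMapper.cpp` and the container spec.

```python
MAGIC = b"PYCAUSET"

def classify_header(header: bytes) -> str:
    """Toy classifier: .tmp backing files zero their Reserved bytes, while a
    .pycauset snapshot holds non-zero slot pointers at the same offsets.
    Offsets here (magic 8B + version 4B, then Reserved to byte 64) are
    illustrative, not the actual format."""
    if len(header) < 64 or header[:8] != MAGIC:
        return "unknown"
    reserved = header[12:64]
    return "backing-tmp" if not any(reserved) else "snapshot-container"
```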