# NumPy Integration Guide
pycauset is designed to work seamlessly with the Python scientific stack, particularly NumPy. While pycauset uses its own optimized storage (RAM or disk-backed) for handling massive datasets, it provides smooth interoperability with NumPy arrays for convenience and flexibility.
## Converting NumPy Arrays to PyCauset
You can convert NumPy arrays into pycauset objects using `pycauset.matrix` and `pycauset.vector`. These constructors automatically detect the data type of the NumPy array and create the corresponding optimized object.

Note: PyCauset does not expose a `pycauset.asarray` API. In PyCauset, "arrays" are not a first-class concept; matrices and vectors are.
Rectangular 2D arrays are supported for dense numeric matrices (int/uint/float/complex). Boolean 2D arrays are bit-packed (DenseBitMatrix) and also support rectangular (rows, cols) shapes.
Supported dtypes include:

- Integers: `int8`/`int16`/`int32`/`int64` and `uint8`/`uint16`/`uint32`/`uint64`
- Floats: `float16`/`float32`/`float64`
- Complex floats: `complex64`/`complex128` (mapped to `complex_float32`/`complex_float64`)
- Booleans: `bool_` (mapped to bit-packed storage)
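To get a feel for why bit-packed boolean storage matters, NumPy's own `np.packbits` shows the 8x size reduction. This is only an analogy; PyCauset's `DenseBitMatrix` uses its own layout:

```python
import numpy as np

# A 1000x1000 boolean array: NumPy stores one full byte per element
dense = np.zeros((1000, 1000), dtype=bool)
print(dense.nbytes)   # 1000000 bytes

# Bit-packed along rows: one *bit* per element, an 8x reduction
packed = np.packbits(dense, axis=1)
print(packed.nbytes)  # 125000 bytes
```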
Performance Note: Import uses a parallelized direct path for large arrays, achieving >10GB/s on modern hardware. Non-contiguous arrays (slices) are automatically handled via an optimized parallel copy.
```python
import numpy as np
import pycauset as pc

# Convert 1D NumPy array to Vector
arr_1d = np.array([1.0, 2.0, 3.0])
vec = pc.vector(arr_1d)  # Returns [pycauset.FloatVector](<../docs/classes/vector/pycauset.FloatVector.md>)

# Convert 2D NumPy array to Matrix
arr_2d = np.array(((1, 2), (3, 4)), dtype=np.int32)
mat = pc.matrix(arr_2d)  # Returns [pycauset.IntegerMatrix](<../docs/classes/matrix/pycauset.IntegerMatrix.md>)

# Unsigned integers
arr_u = np.array(((1, 2), (3, 4)), dtype=np.uint32)
mat_u = pc.matrix(arr_u)  # Returns [pycauset.UInt32Matrix](<../docs/classes/matrix/pycauset.UInt32Matrix.md>)

# Complex
arr_c = np.array(((1 + 2j, 0), (0, 3 - 4j)), dtype=np.complex64)
mat_c = pc.matrix(arr_c)  # Returns [pycauset.ComplexFloat32Matrix](<../docs/classes/matrix/pycauset.ComplexFloat32Matrix.md>)

# Convert Boolean array
arr_bool = np.array([True, False], dtype=bool)
vec_bool = pc.vector(arr_bool)  # Returns [pycauset.BitVector](<../docs/classes/vector/pycauset.BitVector.md>)
```
Note: This operation creates a copy of the data. Depending on the size and the configured memory threshold, the new object will be stored in RAM or on disk. See guides/Storage and Memory.
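The performance note above distinguishes contiguous arrays (direct path) from non-contiguous slices (parallel copy). You can check which case applies from NumPy before converting:

```python
import numpy as np

a = np.random.rand(1000, 1000)
print(a.flags['C_CONTIGUOUS'])   # True: eligible for the direct import path

s = a[:, ::2]                    # strided slice
print(s.flags['C_CONTIGUOUS'])   # False: takes the parallel-copy path

# You can also force a contiguous buffer yourself ahead of time
c = np.ascontiguousarray(s)
print(c.flags['C_CONTIGUOUS'])   # True
```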
## NumPy UFunc Support

PyCauset matrices support NumPy universal functions (ufuncs) such as `np.sin` and `np.add`. These operations return lazy expressions that are evaluated efficiently.
```python
import numpy as np
import pycauset as pc

A = pc.matrix(np.random.rand(100, 100))

# Element-wise sine
B = np.sin(A)

# Element-wise addition
C = np.add(A, A)
```
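Ufunc interception is built on NumPy's `__array_ufunc__` protocol. A minimal sketch of how a wrapper class can intercept calls like `np.sin` (PyCauset's real implementation builds lazy expressions rather than evaluating eagerly, so this toy class is illustrative only):

```python
import numpy as np

class Wrapped:
    """Toy wrapper that intercepts NumPy ufuncs via __array_ufunc__."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != '__call__':
            return NotImplemented
        # Unwrap any Wrapped operands, apply the ufunc, re-wrap the result
        args = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*args, **kwargs))

w = Wrapped([0.0, np.pi / 2])
result = np.sin(w)            # dispatches to Wrapped.__array_ufunc__
print(type(result).__name__)  # Wrapped
print(result.data)            # [0. 1.]
```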
## Mixed Arithmetic and Ergonomics

PyCauset supports mixed operations between NumPy arrays and PyCauset objects. The operators (`+`, `-`, `*`, `@`) automatically route to the optimized PyCauset implementation when possible.
```python
A = pc.matrix(np.random.rand(1000, 1000))
B = np.random.rand(1000, 1000)

# Works efficiently: PyCauset handles the add, returning a FloatMatrix
C = A + B

# Reverse operand order also works (reflected-add support)
D = B + A
```
## Converting PyCauset Objects to NumPy

All pycauset Matrix and Vector classes implement the NumPy array protocol (`__array__`). This means you can pass any pycauset object directly to `np.array()` or to any function that expects an array-like object.
```python
v = pc.vector([1, 2, 3])

# Convert to NumPy array
arr = np.array(v)

# Use in NumPy functions
mean_val = np.mean(v)
std_val = np.std(v)
```
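The `__array__` protocol itself is simple to demonstrate with a toy class. This sketch shows only the protocol mechanics; the real pycauset classes export their optimized storage instead of a Python list:

```python
import numpy as np

class MiniVector:
    """Toy vector exposing its data through the NumPy array protocol."""
    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # np.array(obj), np.mean(obj), etc. call this to get a plain ndarray
        return np.array(self._values, dtype=dtype)

v = MiniVector([1, 2, 3])
print(np.array(v))  # [1 2 3]
print(np.mean(v))   # 2.0
```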
## Zero-Copy Views (`copy=False`)

By default, conversion creates a copy (safe). However, if you want high-performance access without duplication, you can request a view using `pc.to_numpy(..., copy=False)`.
- Success: Returns a read-only NumPy array viewing the PyCauset memory.
- Fallback: If the object cannot be viewed (e.g., bit-packed matrices or complex expression templates), it issues a `UserWarning` and falls back to a copy.
```python
M = pc.matrix(np.eye(1000))

# Try to get a view.
# Warning: the returned array is READ-ONLY. Modifying it is undefined behavior.
view = pc.to_numpy(M, copy=False)
```
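Read-only NumPy arrays behave the same way regardless of what backs them. Assuming the returned view is marked read-only via NumPy's standard `writeable` flag, the failure mode you would hit on a write looks like this:

```python
import numpy as np

a = np.eye(3)
view = a.view()
view.flags.writeable = False  # the standard flag for a read-only array (assumption)

try:
    view[0, 0] = 42.0
except ValueError as e:
    # Writes to a non-writeable array raise ValueError
    print("write rejected:", e)
```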
## Safety rules (materialization)
Converting a massive out-of-core matrix to NumPy is dangerous—it forces the entire dataset into RAM, which can crash your process. PyCauset guards against this.
- Snapshot-backed (`.pycauset`) and RAM-backed (`:memory:`) objects: `np.array(obj)` is allowed and returns a copy.
- Spill/file-backed objects (e.g., `.tmp`): `np.array(obj)` raises by default to prevent surprise full materialization. Opt in explicitly via `pc.to_numpy(obj, allow_huge=True)` if you truly want to load it into RAM.
- Ceiling control: `pc.set_export_max_bytes(bytes_or_None)` sets a materialization limit. `None` disables the size ceiling; file-backed objects still require `allow_huge=True`.
If you see an export error, either downsize the data, keep it in PyCauset operations, or opt in intentionally with `allow_huge=True`.
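The rules above reduce to simple arithmetic on `nbytes`. Here is a hypothetical sketch of the guard logic; `export_guard` and `EXPORT_MAX_BYTES` are illustrative names, not PyCauset API (the real knobs are `pc.set_export_max_bytes` and `allow_huge`):

```python
import numpy as np

EXPORT_MAX_BYTES = 512 * 1024 * 1024  # e.g. a 512 MiB ceiling

def export_guard(shape, dtype, file_backed, allow_huge=False):
    """Illustrative check mirroring the materialization rules described above."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    if file_backed and not allow_huge:
        raise RuntimeError("file-backed object: pass allow_huge=True to materialize")
    if EXPORT_MAX_BYTES is not None and nbytes > EXPORT_MAX_BYTES:
        raise RuntimeError(f"export of {nbytes} bytes exceeds the configured ceiling")
    return nbytes

print(export_guard((1000, 1000), np.float64, file_backed=False))  # 8000000
```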
## On-disk conversions (NumPy formats)
If you need to move data between PyCauset snapshots and NumPy container files, use `pc.convert_file`.

Important note: exporting from `.pycauset` to `.npy`/`.npz` still produces a dense NumPy array in-process today (guarded by `allow_huge`), because NumPy's writers expect dense arrays.
- Supported formats: `.pycauset`, `.npy`, `.npz` (import/export in any direction).
- `npz_key` selects a named array inside an archive; defaults to the first key.
- Exports honor the same materialization guard: spill/file-backed sources require `allow_huge=True`.
Example:
```python
# Snapshot -> npy -> snapshot round-trip
pc.convert_file("A.pycauset", "A.npy")
pc.convert_file("A.npy", "A_roundtrip.pycauset")

# Pick a specific array inside an npz
pc.convert_file("bundle.npz", "vec.pycauset", npz_key="vector0")
```
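On the NumPy side, `.npy` and `.npz` are plain container files. A sketch of the equivalent pure-NumPy operations, which is what `npz_key` selection maps onto (file paths here are illustrative):

```python
import os
import tempfile
import numpy as np

tmp = tempfile.mkdtemp()

# A .npy file holds exactly one dense array
A = np.arange(6, dtype=np.float64).reshape(2, 3)
np.save(os.path.join(tmp, "A.npy"), A)
B = np.load(os.path.join(tmp, "A.npy"))

# A .npz file is a zip of named arrays; the key selects a member,
# which is what npz_key does during conversion
np.savez(os.path.join(tmp, "bundle.npz"), vector0=np.array([1.0, 2.0]))
with np.load(os.path.join(tmp, "bundle.npz")) as bundle:
    v = bundle["vector0"]

print(np.array_equal(A, B))  # True
print(v)                     # [1. 2.]
```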
## Mixed Arithmetic Examples

You can perform arithmetic operations directly between pycauset objects and NumPy arrays; pycauset handles the interoperability automatically.

Important note: when you mix a pycauset object with a NumPy array, the NumPy side is typically converted to a temporary pycauset object, and the operation is executed through PyCauset's dtype rules (promotion, underpromotion, overflow). See documentation/internals/DType System.md.
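NumPy's own promotion rules are a useful baseline for what to expect from mixed-dtype operands. Note this shows only NumPy's behavior via `np.result_type`; PyCauset applies its own dtype rules, which may differ:

```python
import numpy as np

# NumPy's type promotion for mixed operand dtypes
print(np.result_type(np.int32, np.float64))      # float64
print(np.result_type(np.float32, np.complex64))  # complex64
print(np.result_type(np.int8, np.uint8))         # int16 (smallest type holding both)
```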
### Vector + NumPy Array
```python
v = pc.vector([1, 2, 3])
arr = np.array([10, 20, 30])

# Result is a pycauset Vector (operation happens in the C++ backend)
result = v + arr  # [11, 22, 33]
```
### Matrix @ NumPy Vector
You can use NumPy arrays as operands in matrix multiplication.
```python
M = pc.matrix(((1, 0), (0, 1)))  # Identity
v_np = np.array([5.0, 6.0])

# Result is a pycauset Vector
v_result = M @ v_np  # [5.0, 6.0]
```
## Performance Considerations
- PyCauset as Primary: When you perform operations like `pycauset_obj + numpy_obj`, pycauset attempts to handle the operation. The NumPy array is temporarily converted to a pycauset object (backed by RAM or a temporary file), and the operation runs using the optimized C++ backend. The result is a new pycauset object.
- NumPy as Primary: If you use a NumPy function like `np.add(pycauset_obj, numpy_obj)`, NumPy will convert the pycauset object to an in-memory array first. This can be slower and memory-intensive for large datasets.
Best Practice: For massive datasets, stick to pycauset native operations and objects as much as possible, only converting to NumPy for small results or specific analysis steps that pycauset doesn't yet support.