DType / Complex / Overflow Plan (Implementation)
Status (2025-12-16): This implementation plan is complete. Phase 1 complete; Phase 2 complete (int8/int16/int32/int64 + uint8/uint16/uint32/uint64 + float16 end-to-end); Phase 3 complete (complex floats are now first-class end-to-end on CPU for the current core-op surface); Phase 4 complete (support matrix declared + enforced in tests/tools). Optional backlog items are still listed below.
Scope update (2025-12)
This plan originally sketched “complex permutations for all base dtypes” (including complex int* and complex bit).
Current project direction: complex support is limited to complex floats only (complex_float16, complex_float32, complex_float64). Complex permutations of non-float dtypes (complex int* / complex bit) are a non-goal by design due to high implementation surface area (promotion/overflow/kernels/persistence/tests) with low practical payoff for PyCauset’s workloads.
As a result:
- Phase 2 includes first-class float16 as a general dtype, plus the full signed/unsigned integer width set.
- Phase 3 (complex) should be interpreted as “complex float integration”, with complex_float16 implemented after float16 readiness.
Phase completion status
- Phase 0 — Documentation & policy grounding: Complete
- Phase 1 — Centralize promotion + overflow policies: Complete
- Phase 2 — Scalar system expansion: Complete (int8/int16/int32/int64, uint8/uint16/uint32/uint64, float16 end-to-end through factories/promotion/CPU dispatch/persistence/bindings/NumPy for the core op surface)
- Phase 3 — Complex system integration: Complete (complex_float16/32/64 are first-class dtypes through factories/promotion/CPU dispatch/persistence/bindings/NumPy for core ops)
- Phase 4 — Coverage enforcement: Complete (support matrix declared + enforced in tests/tools)
This file is an implementation plan. The authoritative dtype behavior documentation lives in:
documentation/internals/DType System.md
0) Problem statement
PyCauset supports several fundamentally different scalar/storage types (bit-packed bit, integers, floats) plus a partially-separate complex system. Adding a new operation currently requires touching multiple layers and remembering many dtype-specific corner cases:
- type/promotion rules are split between global helpers and per-op frontends,
- CPU kernels often dispatch on “result dtype” and omit some types,
- complex numbers are currently not a first-class `MatrixBase` dtype and therefore drift from the main dispatch/type-resolution path,
- missing coverage is easy to ship because there is no single enforceable "support matrix".
This document proposes a new, centralized dtype architecture that:
- makes complex floats first-class in the scalar type system,
- adds multiple integer widths (signed/unsigned),
- defines explicit promotion + overflow policies,
- keeps the “anti-promotion / smallest type” ethos,
- keeps performance and out-of-core constraints as first-class concerns.
1) Key constraints (from project philosophy + recent decisions)
- Scale-first: matrices may be 100GB+; memory blowups are unacceptable.
- Underpromotion default: when PyCauset underpromotes, it means compute and result storage both use the smallest selected dtype.
- No silent widening for accuracy: no hidden “compute in float64 then downcast” in the default path.
- Bit matrices are numeric for arithmetic ops: treat `bit` values as 0/1 numeric values for arithmetic ops (e.g., `+`, `*`, `dot`, `matmul`). Bitwise ops are explicit and must preserve bit-packed storage.
- Overflow behavior: integer overflow is a runtime error. PyCauset does not auto-promote to avoid overflow.
- Overflow warning: for large integer matmul, run a worst-case bound preflight and emit a warning when overflow looks plausible.
- Complex floats are first-class: complex support is limited to float base dtypes. `complex_float32`/`complex_float64` are BLAS-backed where applicable (native complex types `complex64`/`complex128`). `complex_float16` is implemented as a first-class dtype using a two-plane float16 storage model.
- Complex non-floats are a non-goal: `complex int*`/`complex bit` are intentionally unsupported to avoid a large promotion/overflow/kernel/persistence surface area with low payoff.
- Fundamental-kind rule (bit/int/float): PyCauset never "promotes down" across fundamental kinds. If an operation mixes kinds, the result kind is the higher kind required by the operation's semantics.
2) Terminology
- Scalar type: the per-element numeric type (bit/int/float plus width and flags).
- Matrix structure: dense/triangular/symmetric/etc. (storage layout and indexing constraints).
- Operation (op): add/subtract/elementwise multiply/matmul/inverse/eigvals/etc.
- Promotion policy: rules for selecting result dtypes for mixed-input ops.
- Overflow policy: what happens when integer arithmetic overflows.
3) Proposed scalar type model (flags/permutations)
Represent scalar types as:
- `kind`: `bit` | `int` | `float`
- `width_bits`: for int/float (8/16/32/64), and 1 for bit
- `flags`: a small set of orthogonal modifiers
  - `complex` (supported for float scalar types only)
  - `unsigned` (valid only for `int`)
Examples:
- `bit` = (bit, 1, {})
- `int16` = (int, 16, {})
- `uint16` = (int, 16, {unsigned})
- `float16` = (float, 16, {})
- `complex float16` = (float, 16, {complex})
- `float32` = (float, 32, {})
- `complex float32` (complex64) = (float, 32, {complex})
- `float64` = (float, 64, {})
- `complex float64` (complex128) = (float, 64, {complex})
Supported scalar set (initial target)
- bit
- int8/int16/int32/int64
- uint8/uint16/uint32/uint64
- float16/float32/float64
- complex_float16/complex_float32/complex_float64
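The (kind, width_bits, flags) model above can be sketched as a small descriptor. This is an illustrative Python sketch, not the project's actual C++ API; the name `ScalarType` and the validation rules simply mirror the constraints stated in this section (complex only for floats, unsigned only for ints).

```python
from dataclasses import dataclass, field

VALID_FLAGS = {"complex", "unsigned"}

@dataclass(frozen=True)
class ScalarType:
    kind: str            # "bit" | "int" | "float"
    width_bits: int      # 8/16/32/64 for int/float, 1 for bit
    flags: frozenset = field(default_factory=frozenset)

    def __post_init__(self):
        assert self.kind in ("bit", "int", "float")
        assert self.flags <= VALID_FLAGS
        # Plan decision: complex is supported for float base dtypes only.
        if "complex" in self.flags:
            assert self.kind == "float"
        # unsigned is valid only for int.
        if "unsigned" in self.flags:
            assert self.kind == "int"

BIT = ScalarType("bit", 1)
UINT16 = ScalarType("int", 16, frozenset({"unsigned"}))
COMPLEX_FLOAT16 = ScalarType("float", 16, frozenset({"complex"}))
```

Making the descriptor frozen/hashable is deliberate: it lets the promotion resolver use scalar types directly as table keys.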
4) Complex implementation strategy
4.1 Complex floats (performance path)
- Implement `complex_float32` (complex64) and `complex_float64` (complex128) as true complex numeric types.
- Prefer BLAS-backed complex GEMM where applicable.
4.2 Complex float16 (two-plane storage path)
- Represent `complex_float16` as two float16 planes (real + imag).
- Motivation: there is no ubiquitous, efficient "native complex half" representation across the stack, and forcing complex-half into complex-float32 would violate the "smallest type" ethos.
- Persistence must round-trip as a single complex dtype (one logical object, two payload planes).
4.3 Explicit non-goals
- Complex permutations of non-float dtypes (`complex int*`, `complex bit`) are intentionally out of scope.
- If/when we ever revisit this, it must be driven by concrete workloads and come with a scoped support matrix (ops × dtype) rather than a blanket "closure" rule.
This plan does not assume automatic widening in integer matmul. Under the current policy:
- integer overflow throws, and
- the system does not silently widen storage to avoid overflow.
If we ever decide that a particular op’s semantic result dtype must be wider (e.g., a count-producing op), that must be a named, explicit promotion rule and must be documented as semantics, not an overflow workaround.
5) Promotion policy (centralized, op-specific)
Create a single authoritative table/function:
- `resolve_result_scalar(op, a_scalar, b_scalar) -> scalar`
- `resolve_result_structure(op, a_structure, b_structure) -> structure`
Design principles:
- Default to the smallest dtype that can represent the result per op semantics.
- Mixed float precision underpromotes by default (compute+store in the smaller float), with a configurable option to promote instead.
- Complex is a flag: complex-ness is preserved unless an op is explicitly defined to drop it.
- Unsigned is preserved where meaningful; if an op can generate negatives, rules must define whether to promote to signed or throw.
5.1 Fundamental kinds (bit / int / float) and “no promote down”
PyCauset distinguishes three fundamental kinds:
- `bit` (bit-packed boolean storage; special rules allowed)
- `int` (signed/unsigned integers)
- `float` (float16/float32/float64)
Rules:
1) No promote down across kinds. If kinds differ, the result kind cannot be the “lower” kind.
2) When a float participates, the result kind is float. Example: matmul(bit, float64) -> float64.
3) When only integers/bits participate, the result kind is integer unless the op is explicitly bitwise.
4) Underpromotion applies within a kind, not across kinds. Example: matmul(float32, float64) -> float32 by default.
This strikes a balance:
- it preserves the “smallest type” ethos where it is meaningful (within float precision),
- it avoids absurd outcomes like underpromoting a float computation to bit storage,
- it keeps `bit` special (bitwise ops remain bitwise; numeric ops may change kind).
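Rules 1–4 above can be condensed into a few lines of logic. This is a hedged sketch (function name and tuple encoding are illustrative); the bit-special numeric widening rules from 5.2 are deliberately omitted here.

```python
KIND_ORDER = {"bit": 0, "int": 1, "float": 2}

def resolve_result_kind(op, a, b):
    """a, b = (kind, width_bits) pairs. Returns the result (kind, width_bits)."""
    (ka, wa), (kb, wb) = a, b
    # Rules 1-3: never promote down across kinds; the higher kind wins.
    kind = max(ka, kb, key=KIND_ORDER.get)
    # Rule 4: underpromotion applies *within* a kind -- pick the smallest
    # width among operands that already have the result kind (a lower-kind
    # operand never drags the width down across kinds).
    widths = [w for k, w in (a, b) if k == kind]
    return (kind, min(widths))

assert resolve_result_kind("matmul", ("bit", 1), ("float", 64)) == ("float", 64)
assert resolve_result_kind("matmul", ("float", 32), ("float", 64)) == ("float", 32)
```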
5.2 Bit is special (scale-first exceptions)
Bit matrices/vectors are used to represent large binary structures (e.g., spacetime relations) where the storage is often 10s–100s of GB.
As a result:
- Bitwise ops (e.g., NOT/AND/OR/XOR) should preserve `bit` and stay bit-packed.
- Numeric ops that inherently create non-binary results (e.g., `bit + bit`, `matmul(bit, bit)` producing integer counts) may require widening to `int` or `float`.
For such numeric ops, widening can be prohibitively expensive. Therefore, for bit we allow explicit, op-specific behavior:
- supported with a documented widening result kind, or
- error-by-design unless the user explicitly requests a widened dtype.
The support matrix must record which choice is made for each op.
Config hooks:
- `promotion_policy.float_mixed`: `underpromote_warn` (default) | `promote` | `underpromote_no_warn`
Warning controls (exact API TBD, but must exist):
- `warning_policy.float_underpromotion`: on by default when `promotion_policy.float_mixed=underpromote_warn`
- `warning_policy.int_reduction_acc_widen`: on by default; emitted when `dot`/`matmul` widens the accumulator dtype
- `warning_policy.int_overflow_risk_preflight`: on by default for "large" integer matmul; emitted when conservative bounds indicate plausible overflow in the requested output dtype
6) Overflow policy
6.1 Runtime behavior
- Overflow is a hard error.
- PyCauset does not auto-promote storage to avoid overflow.
6.1.1 Why this focuses on integer overflow (and not float overflow)
Floating-point overflow is real (e.g., float32 can overflow to +inf), but it behaves differently:
- IEEE-754 overflow typically becomes `inf` (and may raise a floating-point flag), which then propagates.
- This is often detectable after the fact (e.g., `isfinite` checks), whereas integer overflow in C++ can be undefined behavior or silent wrap depending on the implementation.
Policy-wise:
- For integers: overflow must throw (no silent wrap).
- For floats: overflow results in `inf`/`nan` according to IEEE-754; optional "finite-check" validation can exist as a debug/strict mode, but it is not the default because scanning 100GB+ outputs is expensive.
6.2 Preflight warning for large integer matmul
For integer matmul (and potentially some other high-risk ops), run a cheap preflight to estimate overflow risk:
1) sample blocks/rows to estimate max_abs(A) and max_abs(B) (including scalar metadata factors if they apply)
2) compute a conservative bound: \(\text{bound} = K \cdot \max|A| \cdot \max|B|\), where \(K\) is the inner dimension (for square matmul, \(K = N\)).
If the bound approaches/exceeds the target dtype max value, emit a warning:
PyCausetWarning: matmul(<lhs_dtype>, <rhs_dtype>) may overflow <out_dtype> (conservative bound). Consider requesting a wider output dtype or scaling.
Notes:
- This is a heuristic. It should warn on risk; it does not guarantee overflow will happen.
- It avoids inner-loop overflow checks in the performance-critical kernel.
Documentation requirement:
- Add an “Overflow” section/doc describing the policy, the preflight warning, and user mitigations.
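A minimal sketch of the preflight heuristic, assuming `max_abs_a`/`max_abs_b` come from the sampling step above. Function and dtype-limit names are illustrative, not the actual PyCauset API.

```python
import warnings

INT_MAX = {"int16": 2**15 - 1, "int32": 2**31 - 1, "int64": 2**63 - 1}

def preflight_matmul_overflow(lhs_dtype, rhs_dtype, out_dtype,
                              k, max_abs_a, max_abs_b):
    """Cheap, conservative overflow-risk check; warns, never throws."""
    bound = k * max_abs_a * max_abs_b   # conservative: K * max|A| * max|B|
    if bound >= INT_MAX[out_dtype]:
        warnings.warn(
            f"matmul({lhs_dtype}, {rhs_dtype}) may overflow {out_dtype} "
            f"(conservative bound {bound}). Consider requesting a wider "
            f"output dtype or scaling.")
    return bound
```

Note the asymmetry matching the policy: this check runs once per call on sampled summaries, so the hot kernel loop carries no per-element overflow checks.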
6.3 Reduction-aware accumulator width (dot/matmul) + required warning
Some integer reductions (especially dot/matmul) can overflow the accumulator even when inputs are representable and the requested output dtype is unchanged.
To keep integer math defined and to uphold “overflow throws” without requiring expensive per-multiply-add overflow checks inside the hot loop, PyCauset uses a reduction-aware accumulator width for integer reductions.
Key clarifications (scale-first):
- This rule is about the accumulator dtype (compute registers / local scratch), not about materializing inputs.
- In particular, `bit` inputs stay bit-packed; `matmul(bit, int16)` does not expand the `bit` matrix to `int32` elements.
- This rule does not silently widen the result storage dtype. If the user requests `int16` output, the result is stored as `int16` and overflow remains a hard error (typically detected at the final cast from the wider accumulator).
6.3.1 Accumulator-width selection (deterministic / conservative)
For matmul/dot over integer kinds (including bit treated as numeric 0/1), choose an accumulator dtype wide enough that the worst-case bound for the reduction fits.
For C = A @ B with inner dimension K:
- Use a conservative magnitude bound based on dtype limits (no sampling required):
  - For `bit`, \(\max|A| = 1\).
  - For integer dtypes, \(\max|A|\) and \(\max|B|\) may be taken as the maximum representable magnitude for their dtypes (e.g., 32767 for int16). This is conservative and ensures accumulator selection is correctness-preserving without needing an extra pass over out-of-core data.
This is intentionally conservative: it is designed to be computed cheaply and to be correct without relying on probabilistic assumptions.
Optionally (future optimization): when it is cheap relative to the matmul itself and does not force an extra out-of-core pass, tighten the bound using exact streaming summaries such as row popcounts for bit and per-column max-abs for the integer operand.
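The selection rule above amounts to "smallest signed accumulator whose max covers the worst-case reduction bound". A hedged sketch (dtype tables and the function name are illustrative):

```python
DTYPE_MAX_ABS = {"bit": 1, "int8": 127, "int16": 32767, "int32": 2**31 - 1}
SIGNED_ACCS = [("int16", 2**15 - 1), ("int32", 2**31 - 1), ("int64", 2**63 - 1)]

def choose_accumulator(lhs_dtype, rhs_dtype, k):
    """Deterministic: pick the smallest signed accumulator such that
    K * max|A| * max|B| (dtype-limit bound, no data pass) cannot overflow."""
    bound = k * DTYPE_MAX_ABS[lhs_dtype] * DTYPE_MAX_ABS[rhs_dtype]
    for name, limit in SIGNED_ACCS:
        if bound <= limit:
            return name
    raise OverflowError("no integer accumulator wide enough; overflow throws")

# matmul(bit, int16) with K=50: bound = 50 * 1 * 32767 = 1_638_350 -> int32
assert choose_accumulator("bit", "int16", 50) == "int32"
```

Because the bound uses only dtype limits and K, the choice is reproducible across runs and devices, which keeps the accumulator-widen warning deterministic too.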
6.3.2 User-visible warning (required)
Whenever the chosen accumulator dtype is wider than what a reader would naively expect from the inputs (e.g., matmul(bit, int16) accumulating into int32), PyCauset must emit a warning so users understand what is happening.
The warning must include:
- operation name (e.g., `matmul`/`dot`)
- lhs dtype and rhs dtype
- chosen accumulator dtype
- output storage dtype (explicitly stating whether it changed or not)
- reason (reduction-aware widening to keep integer overflow defined)
Suggested warning text (exact wording not required, but content is):
PyCausetWarning: matmul(bit, int16) will accumulate in int32 (reduction-aware integer width). Output dtype remains int16; overflow still throws on cast. Bit input remains bit-packed (no materialization).
Noise control:
- Warn once per call site (or once per unique `(op, lhs_dtype, rhs_dtype, out_dtype, acc_dtype)` tuple) to avoid spam.
- Provide a user-facing way to silence/route warnings (Python `warnings.warn(...)` category, and/or a context flag).
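One way to get both requirements (warn-once per unique tuple, plus user-routable suppression) with the standard `warnings` machinery. The category name `PyCausetWarning` follows the warning text earlier in this doc; the dedup-set approach is an illustrative sketch, not the shipped implementation.

```python
import warnings

class PyCausetWarning(UserWarning):
    pass

_seen = set()

def warn_acc_widen(op, lhs, rhs, out, acc):
    key = (op, lhs, rhs, out, acc)       # warn once per unique tuple
    if key in _seen:
        return
    _seen.add(key)
    warnings.warn(
        f"{op}({lhs}, {rhs}) will accumulate in {acc} (reduction-aware "
        f"integer width). Output dtype remains {out}; overflow still "
        f"throws on cast.", PyCausetWarning, stacklevel=2)
```

Users then silence or route it with the standard machinery, e.g. `warnings.filterwarnings("ignore", category=PyCausetWarning)`.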
7) Enforceable op coverage (“support matrix”)
Introduce an explicit coverage matrix that enumerates for each operation:
- required scalar families (bit/int/float + complex)
- supported widths
- supported structures (dense/triangular/symmetric/etc.)
- required behaviors (defined, error-by-design, or unimplemented)
Goal:
- When a new op is added, missing dtype coverage becomes a failing test/tool run, not a surprise at runtime.
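A toy version of what "executable support matrix" means in practice: declarations are data, and a checker (run as a CI test) fails when a declared-supported combination lacks an implementation. Op/dtype names are illustrative; the real matrix would also track structures and devices.

```python
# Declared behavior per (op, dtype): "defined" | "error-by-design" | "unimplemented"
SUPPORT = {
    ("matmul", "float32"): "defined",
    ("matmul", "complex_float16"): "defined",
    ("matmul", "complex int32"): "error-by-design",
}

# What the codebase actually implements (would be discovered/registered).
IMPLEMENTED = {("matmul", "float32"), ("matmul", "complex_float16")}

def check_support_matrix():
    """Fails (like a CI test) if declared support lacks an implementation."""
    missing = [key for key, status in SUPPORT.items()
               if status == "defined" and key not in IMPLEMENTED]
    if missing:
        raise AssertionError(f"declared but unimplemented: {missing}")

check_support_matrix()  # passes for this toy declaration
```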
8) Implementation sequence (phased)
Phase 0 — Documentation & policy grounding (Complete)
- Update project philosophy to explicitly define underpromotion and overflow behavior.
- Add roadmap entry for multi-int widths + unsigned.
- Add this plan doc.
Phase 1 — Centralize promotion + overflow policies (Complete)
- Single promotion resolver per op.
- Central overflow policy + preflight warning for integer matmul.
- Reduction-aware accumulator width for integer `dot`/`matmul` + required user warning when accumulator widens.
- Add mandatory tests for resolver correctness, warning emission, and reduction accumulator selection (see "Mandatory tests").
Phase 2 — Scalar system expansion (Complete)
- Add integer widths + unsigned.
- Ensure constructors, IO, numpy interop, and basic ops exist.
Phase 3 — Complex system integration (Complete)
- Core complex-float dtype integration is implemented (CPU + persistence + Python/NumPy for key ops).
- See “Phase 3 — Complex system integration (Detailed)” in Section 8.1.
Phase 4 — Coverage enforcement (Complete)
- Support matrix exists and is executed by unit tests and a dev checker tool, so declared support can’t silently regress.
8.1) Phase 3 — Complex system integration (Detailed)
Objective: Make complex float dtypes first-class and integrate them into the same end-to-end pipeline as real dtypes (frontend allocation → promotion resolver → CPU/GPU dispatch → persistence → Python).
User-facing requirement: complex float dtypes must behave like normal dtypes on the frontend. For example, pc.complex_float16 (or equivalent public token) must be a valid dtype= argument to Matrix/Vector factories.
Scope for Phase 3: expand complex support to float base dtypes only:
- `float16` → `complex_float16` (two float16 planes)
- `float32` → `complex_float32` (a.k.a. `complex64`)
- `float64` → `complex_float64` (a.k.a. `complex128`)
Out of scope: complex permutations of non-float dtypes (complex int*, complex bit).
3.x Phase 3 status update (2025-12-16)
Completed in the current codebase:
- First-class complex float dtypes exist end-to-end: `complex_float16/32/64`.
- Storage:
  - `complex_float32/64`: dense storage uses native complex element types.
  - `complex_float16`: two-plane float16 storage (real+imag) for both matrices and vectors.
- Dispatch/promotion:
  - promotion resolver supports complex results for matmul/add/sub/elementwise, plus dot/matvec/vecmat/outer.
  - CPU solver contains complex implementations for dot/matvec/vecmat/outer and vector elementwise/scalar ops.
- Python/NumPy/persistence:
  - dtype tokens + factory inference + `np.array(...)` interop + container persistence round-trip.
  - `dot` returns Python `complex` when either operand is complex.
Optional backlog (not required for plan completion):
- Ensure solver/eigensystem outputs use first-class complex dtypes end-to-end (no parallel complex object model).
- BLAS/cBLAS complex GEMM path for dense complex matmul on CPU (and GPU complex where applicable).
- Expand complex coverage across additional operations beyond the current core set.
3.0 Replace legacy ComplexMatrix / ComplexVector (compat layer)
Current state (updated 2025-12-16):
- First-class complex float matrices/vectors now exist as `MatrixBase`/`VectorBase` dtypes (`complex_float16/32/64`).
- The legacy `ComplexMatrix`/`ComplexVector` concept may still exist in some solver/eigensystem return paths. That legacy path is now considered technical debt (it drifts from the first-class dtype pipeline).
Plan (still valid):
- Ensure any remaining solver/eigensystem paths route through first-class complex dtype matrices/vectors.
- Long-term goal: complex is a normal `MatrixBase`/`VectorBase` dtype, so `LinearAlgebra` and `ComputeDevice` don't need a parallel complex universe.
Frontend contract note:
- Provide explicit dtype tokens for complex floats (at minimum: `complex_float16`, `complex_float32`, `complex_float64`).
- These tokens must normalize through the same dtype normalization funnel as real dtypes and participate in the same factory code paths.
3.1 Make “complex” first-class in the scalar type model
Requirement: represent scalar types as (kind, width_bits, flags) where flags includes at least {complex, unsigned}.
Implementation direction:
- Introduce a `ScalarType` descriptor (or equivalent) that can represent:
  - base dtype (`float16`/`float32`/`float64`)
  - flags (`complex`)
- Plumb this through the type-resolution path so promotion is defined as:
  `resolve_result_scalar(op, a_scalar, b_scalar) -> scalar`
Design constraint (to match the frontend requirement):
- Even though complex can be represented as `(base_dtype + complex flag)`, it must be treated as a distinct dtype identity for:
  - promotion resolution,
- dispatch selection,
- persistence metadata,
- and the support-matrix enforcement (coverage must be tracked per complex permutation).
Back-compat note:
- The existing `DataType` enum can remain as a legacy base-type id during migration, but Phase 3 must ensure complex-ness is not "out-of-band" anymore.
3.2 Storage strategy for complex (by base kind)
We intentionally use two different representations depending on the float width, to balance performance and scale-first storage efficiency.
3.2.1 Complex floats (performance path)
- `complex_float32` (complex64) and `complex_float64` (complex128) are true complex numeric types.
- Implement dense complex storage as contiguous `std::complex<float>`/`std::complex<double>` (or ABI-compatible equivalent).
- Route matmul to BLAS complex GEMM where possible.
- GPU: use cuBLAS complex GEMM when available.
3.2.2 Complex float16 (two-plane storage path)
- Represent `complex_float16` as two float16 planes of equal shape:
  - real plane: float16
  - imag plane: float16
- Motivation: avoid forcing half-precision complex values into float32 complex storage, and avoid depending on a non-portable "native complex half" ABI.
Important clarification:
- “Two-plane storage” is an implementation detail. The object is still a single complex-typed matrix/vector from the API perspective, and it must round-trip via persistence as a complex dtype (not as two unrelated real objects).
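Two-plane arithmetic is just the textbook four-real-products decomposition. A NumPy sketch of a two-plane `complex_float16` elementwise multiply, cross-checked against a native complex computation in float32 (layout and function name are illustrative):

```python
import numpy as np

def twoplane_mul(ar, ai, br, bi):
    """(ar + i*ai) * (br + i*bi) on separate float16 planes via 4 real products."""
    rr = (ar * br - ai * bi).astype(np.float16)   # real plane of the product
    ii = (ar * bi + ai * br).astype(np.float16)   # imag plane of the product
    return rr, ii

rng = np.random.default_rng(0)
ar, ai, br, bi = (rng.standard_normal((4, 4)).astype(np.float16) for _ in range(4))
rr, ii = twoplane_mul(ar, ai, br, bi)

# Reference: native complex arithmetic in float32 precision.
ref = (ar.astype(np.float32) + 1j * ai.astype(np.float32)) \
    * (br.astype(np.float32) + 1j * bi.astype(np.float32))
assert np.allclose(rr, ref.real, atol=5e-2)
assert np.allclose(ii, ref.imag, atol=5e-2)
```

This is also the shape of the validation test required in 3.4: plane-wise results must agree with NumPy complex results to within float16 tolerances.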
3.3 First-class complex matrices/vectors in the core object model
Hard requirement: complex objects must participate in factories, persistence, and dispatch the same way other dtypes do.
Minimum deliverables:
- A `MatrixBase`-derived complex matrix implementation for:
  - `complex_float32`/`complex_float64` (dense)
  - `complex_float16` (two-plane storage)
- A `VectorBase`-derived complex vector implementation (same split).
Interface hazards to address explicitly (to avoid “biting us later”):
- Many existing code paths use `get_element_as_double(...)`. For complex dtypes, this must never silently drop the imaginary part.
- Either implement `get_element_as_double` as a hard error for complex matrices, or ensure it is only used behind a "real-only" guard.
- Complex-aware paths must use `get_element_as_complex(...)`.
- `ComputeDevice::multiply_scalar` currently takes `double`; Phase 3 must define the complex-scalar story:
  - either add complex-scalar device entry points, or
  - restrict complex-scalar multiply to frontend methods that dispatch to complex kernels.
3.4 Operation coverage policy for complex
Phase 3 does not require “every op supports every complex dtype” on day one, but it must make coverage enforceable:
- For each op in the canonical LinearAlgebra surface (at least `LinearAlgebra.hpp`):
  - declare complex propagation rules (preserve complex, drop complex, or error-by-design)
  - declare result dtype selection rules (including for `bit` special cases)
- Ensure the resolver has explicit rows for complex permutations.
Coverage principle (mathematical independence):
- Complex permutations must be treated as separate coverage targets even when they reuse plane-wise kernels.
- “Works because it decomposes into two real ops” is not a substitute for tests: each complex dtype/op combination must be explicitly tested (or explicitly error-by-design with a stable error).
Specific expectations:
- `complex_float32`/`complex_float64`: add/sub/elementwise/matmul must work on CPU.
  - GPU support is optional, but routing must be correct (fallback to CPU when unsupported).
- `complex_float16`: add/sub/elementwise/matmul must work on CPU.
  - if implemented via two-plane arithmetic, correctness must be validated vs NumPy complex computations.
3.5 Persistence format for complex
Current implementation note (updated 2025-12-16):
- `complex_float16` uses a two-plane in-memory layout (real + imag), but is persisted as a single contiguous raw payload containing both planes back-to-back.
- Typed metadata records the dtype identity (`complex_float16`) and the normal shape/layout fields; there is no need for multi-member payloads to round-trip correctly.
Future option (not required for correctness):
- Multi-member payloads could still be introduced later for tooling/inspection convenience, but would be an on-disk format enhancement rather than a correctness requirement.
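The "single contiguous payload, two planes back-to-back" layout can be illustrated with NumPy. Metadata field names here are assumptions for the sketch; the point is that one logical object (dtype identity + shape) round-trips to one payload.

```python
import numpy as np

def pack_complex_float16(real, imag):
    """Serialize a two-plane complex_float16 object as one payload."""
    meta = {"dtype": "complex_float16", "shape": real.shape}
    payload = np.concatenate([real.ravel(), imag.ravel()]) \
                .astype(np.float16).tobytes()        # real plane, then imag plane
    return meta, payload

def unpack_complex_float16(meta, payload):
    flat = np.frombuffer(payload, dtype=np.float16)
    n = flat.size // 2                               # planes are equal-sized
    shape = meta["shape"]
    return flat[:n].reshape(shape), flat[n:].reshape(shape)

real = np.arange(6, dtype=np.float16).reshape(2, 3)
imag = -real
meta, payload = pack_complex_float16(real, imag)
r2, i2 = unpack_complex_float16(meta, payload)
assert np.array_equal(r2, real) and np.array_equal(i2, imag)
```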
3.6 GPU/CPU selection policy
Match project intent:
- Default behavior: benchmark/poll hardware once, then pick the fastest device.
- If GPU does not support a dtype/op/structure, fall back to CPU.
- Avoid exploding “one kernel per infinitesimal device” by using:
- a small set of coarse regimes (dtype/shape thresholds)
- a micro-benchmark-derived speedup factor
3.7 Tests (keep the explosion under control)
The only way this stays maintainable is if we separate:
- pure-logic resolver tests (exhaustive across dtype permutations), from
- kernel correctness tests (representative shapes), from
- error-by-design tests (stable error messages).
Phase 3 must add a minimal “complex smoke matrix” for the LinearAlgebra surface:
- `complex_float64`: add/sub/elementwise/matmul correctness vs NumPy
- `complex_float32`: same, smaller shapes + tolerances
- `complex_float16`: add/sub/elementwise/matmul correctness vs NumPy (two-plane storage) + persistence round-trip
9) Mandatory tests
These tests are required. They exist to prevent dtype coverage drift and to catch correctness/performance regressions early.
9.1 Pure-logic dtype resolution tests
Add unit tests (no kernels) that exercise the resolver tables/functions. At minimum:
- Fundamental kind rule: never promote down across `bit -> int -> float`.
- Float underpromotion: e.g., `matmul(float32, float64) -> float32` by default.
- Complex flag behavior: for each op, verify complex propagation/behavior is explicit (preserve/drop/error-by-design) and covered.
- Unsigned flag behavior: verify signed/unsigned mixing rules are explicit and tested.
- Error-by-design paths: verify they error with stable, specific messages.
These tests should be table-driven and exhaustive across the supported dtype set for each resolver entry.
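The table-driven shape looks like this. The `resolve` function below is a toy stand-in (underpromote float width, preserve the complex flag) so the example is self-contained; the real tests would import the actual resolver and enumerate every supported dtype pair per op.

```python
FLOAT_WIDTH = {"float16": 16, "float32": 32, "float64": 64}

def resolve(op, a, b):
    """Toy float/complex-float resolver: underpromote width, preserve complex."""
    def split(dtype):
        return (dtype.startswith("complex_"), dtype.replace("complex_", ""))
    (ca, base_a), (cb, base_b) = split(a), split(b)
    base = base_a if FLOAT_WIDTH[base_a] <= FLOAT_WIDTH[base_b] else base_b
    return ("complex_" if (ca or cb) else "") + base

CASES = [
    # (op, lhs, rhs, expected result dtype)
    ("matmul", "float32", "float64", "float32"),               # underpromotion
    ("add", "complex_float64", "float64", "complex_float64"),  # preserve complex
    ("add", "complex_float16", "complex_float32", "complex_float16"),
]
for op, lhs, rhs, expected in CASES:
    assert resolve(op, lhs, rhs) == expected, (op, lhs, rhs)
```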
9.2 Kernel/integration correctness tests
Add tests that validate numeric correctness and overflow behavior for representative ops and shapes:
- `dot`/`matmul` integer correctness across widths.
- Overflow throws deterministically (no silent wrap).
For reduction-aware accumulator widening specifically, add at least one test where:
- `matmul(bit, int16)` (or `dot(bit, int16)`) produces a value that would overflow an `int16` accumulator but fits in `int32` output.
- The test asserts:
- correct numeric result,
- accumulator-widen warning is emitted and mentions: op name, lhs/rhs dtypes, accumulator dtype, and output dtype.
9.3 Warning tests (user-facing behavior)
Add Python-level tests (and C++ tests where applicable) that validate warnings are:
- emitted when required,
- de-duplicated (warn-once policy),
- informative (message includes the dtypes involved and what is happening),
- suppressible/routable via a user-facing control.
Warnings to cover:
- float underpromotion warning (if enabled)
- integer overflow-risk preflight warning (heuristic)
- integer reduction accumulator-widen warning (deterministic)
9.4 Scale-first regression tests (bit materialization guard)
Add a regression test that guards the key scale-first property for bit operands:
- `bit` inputs must remain bit-packed during `dot`/`matmul` (no full materialization to an int/float element buffer).
Implementation note (testability): this may require a test-only hook (e.g., allocation tracer, “materialized_bit_elements” counter, or a debug trace flag) so the test can assert that no allocation proportional to A.numel() * sizeof(int32) occurred.
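One shape such a test-only hook could take: a debug counter that the kernel bumps whenever it expands bit elements, wrapped in an asserting context manager. All names here (`MATERIALIZED_BIT_ELEMENTS`, `assert_no_bit_materialization`) are hypothetical sketches of the idea, not existing PyCauset APIs.

```python
import contextlib

# Debug counter a bit kernel would increment on any bit-element expansion.
MATERIALIZED_BIT_ELEMENTS = 0

@contextlib.contextmanager
def assert_no_bit_materialization():
    before = MATERIALIZED_BIT_ELEMENTS
    yield
    assert MATERIALIZED_BIT_ELEMENTS == before, "bit operand was materialized"

def fake_bit_dot(bits, other):
    # A well-behaved kernel reads packed words directly and never touches
    # MATERIALIZED_BIT_ELEMENTS; this stand-in models that behavior.
    return sum(b * x for b, x in zip(bits, other))

with assert_no_bit_materialization():
    fake_bit_dot([1, 0, 1], [5, 7, 9])   # passes: no materialization counted
```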
9.5 Support-matrix completeness test
The support matrix must be executable as a test/tool:
- It must fail CI if an op claims support for a dtype/structure/device combination that lacks an implementation or test coverage.
10) Acceptance criteria
- Adding a new operation requires changing:
  - the op implementation,
  - one promotion rule table,
  - one coverage declaration,
  - tests.
  It must not require "hunt across the codebase".
- Complex dtypes are supported for float base dtypes only (`complex_float16/32/64`).
- Overflow behavior is consistent:
- overflow throws,
- large integer matmul emits a risk warning when appropriate,
- no auto-promotion to avoid overflow.
11) Open questions (to confirm before implementation)
- Exact list of supported ops for “core coverage” in the support matrix (minimal set to enforce first).
- Whether unsigned + signed mixing rules should default to promoting to signed or throwing in ops that can go negative.
- Default behavior for numeric ops on `bit` when the semantic result is not representable in `bit` without widening: default widen vs error-by-design unless the caller explicitly requests an output dtype.