Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype

The NumPy array object can take many concrete forms. It might be a one-dimensional (1D) array of Booleans, or a three-dimensional (3D) array of 8-bit unsigned integers. As the built-in function isinstance() will show, every array is an instance of np.ndarray, regardless of shape or the type of elements stored in the array, i.e., the dtype. Similarly, many type-annotated interfaces still only specify np.ndarray:

import numpy as np

def process(
    x: np.ndarray,
    y: np.ndarray,
    ) -> np.ndarray: ...

Such type annotations are insufficient: most interfaces have strong expectations of the shape or dtype of passed arrays. Most code will fail if a 3D array is passed where a 1D array is expected, or an array of dates is passed where an array of floats is expected.
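As a small illustration of why a bare np.ndarray annotation says so little, the sketch below builds two very different arrays (the variable names are only illustrative) and shows that isinstance() cannot tell them apart; only the ndim and dtype attributes do.

```python
import numpy as np

# Two arrays that differ in both shape and dtype.
flags = np.array([True, False, True])          # 1D array of Booleans
image = np.zeros((4, 4, 3), dtype=np.uint8)    # 3D array of 8-bit unsigned integers

# Both are plain np.ndarray instances as far as isinstance() is concerned.
both_ndarray = isinstance(flags, np.ndarray) and isinstance(image, np.ndarray)

# The differences are only visible in the run-time attributes.
flags_info = (flags.ndim, flags.dtype)
image_info = (image.ndim, image.dtype)

print(both_ndarray, flags_info, image_info)
```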

Taking full advantage of the generic np.ndarray, array shape and dtype characteristics can now be fully specified:

def process(
    x: np.ndarray[tuple[int], np.dtype[np.bool_]],
    y: np.ndarray[tuple[int, int, int], np.dtype[np.uint8]],
    ) -> np.ndarray[tuple[int], np.dtype[np.float64]]: ...

With such detail, recent versions of static analysis tools like mypy and pyright can find issues before code is even run. Further, run-time validators specialized for NumPy, like StaticFrame's sf.CallGuard, can re-use the same annotations for run-time validation.

Generic Types in Python

Generic built-in containers such as list and dict can be made concrete by specifying, for each interface, the contained types. A function can declare it takes a list of str with list[str]; or a dict of str to bool can be specified with dict[str, bool].
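A minimal sketch of concrete built-in generics in function signatures follows. The function names are hypothetical and chosen only for illustration; the subscripted built-ins assume Python 3.9 or later.

```python
def count_long(names: list[str], min_len: int) -> int:
    """Return how many names have at least min_len characters."""
    return sum(1 for name in names if len(name) >= min_len)


def enabled_keys(flags: dict[str, bool]) -> list[str]:
    """Return the keys whose flag is True, in insertion order."""
    return [key for key, value in flags.items() if value]


print(count_long(["ab", "abcd"], 3))
print(enabled_keys({"cache": True, "debug": False}))
```

A type checker can reject a call such as count_long([1, 2], 3) because list[int] is not compatible with list[str]; the same idea carries over to the two type parameters of np.ndarray.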

The Generic np.ndarray

An np.ndarray is an N-dimensional array of a single element type (or dtype). The np.ndarray generic takes two type parameters: the first defines the shape with a tuple, the second defines the element type with the generic np.dtype. While np.ndarray has taken two type parameters for some time, the definition of the first parameter, shape, was not fully specified until NumPy 2.1.

The Shape Type Parameter

When creating an array with interfaces like np.empty or np.full, a shape argument is given as a tuple. The length of the tuple defines the array’s dimensionality; the magnitude of each position defines the size of that dimension. Thus a shape (10,) is a 1D array of 10 elements; a shape (10, 100, 1000) is a three-dimensional array of size 10 by 100 by 1000.
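The relationship between the shape tuple and dimensionality can be checked directly at run time. The following sketch uses the two shapes mentioned above; the fill value and dtype are arbitrary choices.

```python
import numpy as np

# The length of the shape tuple sets the number of dimensions.
a1 = np.empty((10,))                              # 1D, 10 elements (values uninitialized)
a3 = np.full((10, 100, 1000), 0, dtype=np.uint8)  # 3D, 10 by 100 by 1000

print(a1.ndim, a1.shape)   # 1 (10,)
print(a3.ndim, a3.shape)   # 3 (10, 100, 1000)
print(a3.size)             # 1000000 elements in total
```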

When using a tuple to define shape in the np.ndarray generic, at present only the number of dimensions can generally be used for type checking. Thus, a tuple[int] can specify a 1D array; a tuple[int, int, int] can specify a 3D array; a tuple[int, ...], specifying a tuple of zero or more integers, denotes an N-dimensional array. It might be possible in the future to type-check an np.ndarray with specific magnitudes per dimension (using Literal), but this is not yet broadly supported.

The dtype Type Parameter

The NumPy dtype object defines element types and, for some types, other characteristics such as size (for Unicode and string types) or unit (for np.datetime64 types). The dtype itself is generic, taking a NumPy "generic" type as a type parameter. The most narrow types specify specific element characteristics, for example np.uint8, np.float64, or np.bool_. Beyond these narrow types, NumPy provides more general types, such as np.integer, np.inexact, or np.number.
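The narrow and general types form an ordinary class hierarchy, which can be inspected at run time. The sketch below uses issubclass() and np.issubdtype() to show how a few narrow types relate to the general ones, and reads the size carried by a Unicode dtype.

```python
import numpy as np

# Narrow scalar types are subclasses of the more general abstract types.
uint8_is_unsigned = issubclass(np.uint8, np.unsignedinteger)   # True
uint8_is_integer = issubclass(np.uint8, np.integer)            # True
float64_is_inexact = issubclass(np.float64, np.inexact)        # True
bool_is_number = issubclass(np.bool_, np.number)               # False: Booleans are not numbers

# np.issubdtype() answers the same question starting from a dtype instance.
uint8_is_signed = np.issubdtype(np.dtype(np.uint8), np.signedinteger)  # False

# Some dtypes carry extra characteristics, such as the size of a Unicode dtype.
unicode_itemsize = np.dtype("U10").itemsize   # 40 bytes: 10 characters at 4 bytes each

print(uint8_is_unsigned, uint8_is_integer, float64_is_inexact)
print(bool_is_number, uint8_is_signed, unicode_itemsize)
```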

Making np.ndarray Concrete

The following examples illustrate concrete np.ndarray definitions:

A 1D array of Booleans:

np.ndarray[tuple[int], np.dtype[np.bool_]]

A 3D array of unsigned 8-bit integers:

np.ndarray[tuple[int, int, int], np.dtype[np.uint8]]

A two-dimensional (2D) array of Unicode strings:

np.ndarray[tuple[int, int], np.dtype[np.str_]]

A 1D array of any numeric type:

np.ndarray[tuple[int], np.dtype[np.number]]
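Because these definitions are long, one option is to bind them to names and reuse them across signatures. The sketch below assumes a NumPy version in which np.ndarray and np.dtype can be subscripted at run time; the alias and function names are hypothetical. Whether a static checker accepts the function body depends on how precisely the installed NumPy stubs infer the result of reshape, so treat it as a run-time illustration only.

```python
from typing import get_args

import numpy as np

# Hypothetical aliases for two of the concrete definitions above.
Bool1D = np.ndarray[tuple[int], np.dtype[np.bool_]]
UInt8x3D = np.ndarray[tuple[int, int, int], np.dtype[np.uint8]]


def to_mask(image: UInt8x3D, threshold: int) -> Bool1D:
    """Flatten a 3D image and flag the values above a threshold."""
    return image.reshape(-1) > threshold


# The two type parameters can be recovered from an alias at run time.
shape_param, dtype_param = get_args(Bool1D)

image = np.zeros((2, 2, 2), dtype=np.uint8)
image[0, 0, 0] = 200
mask = to_mask(image, 100)
print(mask.ndim, mask.dtype, int(mask.sum()))
```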

Static Type Checking with Mypy

Once the generic np.ndarray is made concrete, mypy or similar type checkers can, for some code paths, identify values that are incompatible with an interface.

For example, the function below requires a 1D array of signed integers. As shown below, unsigned integers, or dimensionalities other than one, fail mypy checks.

def process1(x: np.ndarray[tuple[int], np.dtype[np.signedinteger]]): ...

a1 = np.empty(100, dtype=np.int16)
process1(a1) # mypy passes

a2 = np.empty(100, dtype=np.uint8)
process1(a2) # mypy fails
# error: Argument 1 to "process1" has incompatible type
# "ndarray[tuple[int], dtype[unsignedinteger[_8Bit]]]";
# expected "ndarray[tuple[int], dtype[signedinteger[Any]]]"  [arg-type]

a3 = np.empty((100, 100, 100), dtype=np.int64)
process1(a3) # mypy fails
# error: Argument 1 to "process1" has incompatible type
# "ndarray[tuple[int, int, int], dtype[signedinteger[_64Bit]]]";
# expected "ndarray[tuple[int], dtype[signedinteger[Any]]]"

Runtime Validation with sf.CallGuard

Not all array operations can statically define the shape or dtype of a resulting array. For this reason, static analysis will not catch all mismatched interfaces. Rather than creating redundant validation code across many functions, type annotations can be re-used for run-time validation with tools specialized for NumPy types.

The StaticFrame CallGuard interface offers two decorators, check and warn, which raise exceptions or warnings, respectively, on validation errors. These decorators validate type annotations against the characteristics of run-time objects.

For example, by adding sf.CallGuard.check to the function below, the arrays fail validation with expressive CallGuard exceptions:

import static_frame as sf

@sf.CallGuard.check
def process2(x: np.ndarray[tuple[int], np.dtype[np.signedinteger]]): ...

b1 = np.empty(100, dtype=np.uint8)
process2(b1)
# static_frame.core.type_clinic.ClinicError:
# In args of (x: ndarray[tuple[int], dtype[signedinteger]]) -> Any
# └── In arg x
#     └── ndarray[tuple[int], dtype[signedinteger]]
#         └── dtype[signedinteger]
#             └── Expected signedinteger, provided uint8 invalid

b2 = np.empty((10, 100), dtype=np.int8)
process2(b2)
# static_frame.core.type_clinic.ClinicError:
# In args of (x: ndarray[tuple[int], dtype[signedinteger]]) -> Any
# └── In arg x
#     └── ndarray[tuple[int], dtype[signedinteger]]
#         └── tuple[int]
#             └── Expected tuple length of 1, provided tuple length of 2
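To make the idea of re-using annotations at run time more concrete, here is a minimal sketch of a decorator that reads the np.ndarray annotation of each argument and checks the number of dimensions and the dtype. This is not how sf.CallGuard is implemented and covers far fewer cases; check_arrays and process3 are hypothetical names, and the sketch assumes annotations written in the two-parameter form used throughout this article.

```python
import functools
import inspect
from typing import get_args, get_origin, get_type_hints

import numpy as np


def check_arrays(func):
    """Hypothetical decorator: validate ndarray arguments against their annotations."""
    hints = get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            hint = hints.get(name)
            if get_origin(hint) is not np.ndarray:
                continue  # only np.ndarray[...] annotations are handled here
            shape_hint, dtype_hint = get_args(hint)
            dims = get_args(shape_hint)
            # tuple[int, ...] accepts any dimensionality; otherwise count the entries.
            if Ellipsis not in dims and value.ndim != len(dims):
                raise TypeError(
                    f"{name}: expected {len(dims)} dimension(s), provided {value.ndim}"
                )
            (scalar_type,) = get_args(dtype_hint)
            if not np.issubdtype(value.dtype, scalar_type):
                raise TypeError(
                    f"{name}: expected {scalar_type.__name__}, provided {value.dtype}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_arrays
def process3(x: np.ndarray[tuple[int], np.dtype[np.signedinteger]]) -> int:
    return int(x.size)


print(process3(np.zeros(5, dtype=np.int16)))  # a 1D signed-integer array passes
```

Calling process3 with a uint8 array, or with a 2D array, raises TypeError in this sketch. A dedicated tool additionally handles return values, nested generics, and more informative error reporting.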

Conclusion

More can be done to improve NumPy typing. For example, the np.object_ type could be made generic such that Python types contained in an object array could be defined. For example, a 1D object array of pairs of integers could be annotated as:

np.ndarray[tuple[int], np.dtype[np.object_[tuple[int, int]]]]

Further, units of np.datetime64 cannot yet be statically specified. For example, date units could be distinguished from nanosecond units with annotations like np.dtype[np.datetime64[Literal['D']]] or np.dtype[np.datetime64[Literal['ns']]].
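Although the unit is not available to static checkers, it is part of the dtype at run time. As a brief sketch, np.datetime_data() can be used to read it; the example dates are arbitrary.

```python
import numpy as np

days = np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[D]")
nanos = days.astype("datetime64[ns]")

# Both arrays share the same scalar type, so np.dtype[np.datetime64] describes both.
same_scalar_type = days.dtype.type is nanos.dtype.type

# The unit is only distinguishable at run time, via the dtype itself.
day_unit = np.datetime_data(days.dtype)    # ('D', 1)
ns_unit = np.datetime_data(nanos.dtype)    # ('ns', 1)

print(same_scalar_type, day_unit, ns_unit)
```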

Even with limitations, fully-specified NumPy type annotations catch errors and improve code quality. As shown, static analysis can identify mismatched shape or dtype, and validation with sf.CallGuard can provide strong run-time guarantees.