Design

Cubed is composed of five layers, from the storage layer at the bottom to the Array API layer at the top:

[Figure: five-layer diagram]

Blue blocks are implemented in Cubed, green in Rechunker, and red in other projects like Zarr and Beam.

Let’s go through the layers from the bottom:

Storage

Every array in Cubed is backed by a Zarr array. This means that the array type inherits Zarr attributes including the underlying store (which may be on local disk, or a cloud store, for example), as well as the shape, dtype, and chunks. Chunks are the unit of storage and computation in this system.
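To make the role of chunks concrete, here is a minimal sketch in plain Python of how an array shape and a chunk shape together define a grid of blocks. It does not use Zarr or Cubed, and the helper names `num_blocks` and `block_slices` are hypothetical, chosen only for illustration.

```python
from itertools import product
from math import ceil


def num_blocks(shape, chunks):
    """Number of blocks along each axis of a regular chunk grid."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def block_slices(shape, chunks):
    """Yield (block_id, slices) for every block; edge blocks may be smaller."""
    for block_id in product(*(range(n) for n in num_blocks(shape, chunks))):
        slices = tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(block_id, chunks, shape)
        )
        yield block_id, slices


# Hypothetical example: a 5x4 array stored in 2x2 chunks.
shape, chunks = (5, 4), (2, 2)
grid = num_blocks(shape, chunks)
blocks = dict(block_slices(shape, chunks))
```

In this example the grid has 3 x 2 blocks, and the last row of blocks covers only one row of the array because 5 is not a multiple of 2.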

Runtime

Cubed uses external runtimes for computation. It follows the Rechunker model (and uses its algorithm) to delegate tasks to stateless executors, which include Python (in-process), Lithops, Modal, and Apache Beam.
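The idea of delegating tasks to stateless executors can be sketched with the standard library alone. This is not Cubed's executor interface: `run_tasks` and `double_chunk` are hypothetical names, and a thread pool stands in for a remote runtime. The point is only that each task depends on nothing but its own input, so the same work can run in-process or on a pool.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(function, task_inputs, executor=None):
    """Run a pure function once per input, in-process or on a pool.

    Each task depends only on its own input, so tasks can run in any
    order and can be retried safely.
    """
    if executor is None:
        # In-process, sequential execution.
        return [function(x) for x in task_inputs]
    # Executor.map returns results in input order.
    return list(executor.map(function, task_inputs))


def double_chunk(chunk):
    # Stand-in for per-chunk work: no shared state is read or written.
    return [2 * v for v in chunk]


chunks_in = [[1, 2], [3, 4], [5]]
sequential = run_tasks(double_chunk, chunks_in)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = run_tasks(double_chunk, chunks_in, executor=pool)
```

Both calls produce the same result, which is the property that lets the choice of executor be swapped without changing the computation.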

Primitive operations

There are two primitive operations on arrays:

blockwise

Apply a function to multiple blocks from multiple inputs, expressed using concise indexing rules.

rechunk

Change the chunking of an array, without changing its shape or dtype.
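The two primitives can be illustrated with a simplified sketch in plain Python. It is not Cubed's implementation, and the function names are hypothetical. For blockwise, it assumes each array is labelled with index letters: an output block reads the input block at the coordinates given by the shared letters, and every block along any letter that does not appear in the output. For rechunk, it regroups a one-dimensional chunked list by materialising all values in memory, which a real rechunking algorithm is designed to avoid.

```python
from itertools import product


def blockwise_input_blocks(out_ind, out_block, in_ind, numblocks):
    """Block coordinates of one input needed to compute one output block.

    out_ind and in_ind are index labels such as "ij"; numblocks maps
    each label to the number of blocks along that index.
    """
    coords = dict(zip(out_ind, out_block))
    axes = [
        # A shared label pins one block; a missing label needs them all.
        [coords[label]] if label in coords else range(numblocks[label])
        for label in in_ind
    ]
    return list(product(*axes))


def rechunk_1d(chunks, new_size):
    """Regroup 1-D chunks into chunks of new_size (last may be shorter).

    Simplification: all values are held in memory at once.
    """
    flat = [v for chunk in chunks for v in chunk]
    return [flat[i:i + new_size] for i in range(0, len(flat), new_size)]


# Transpose: output "ji" from input "ij".
transpose_blocks = blockwise_input_blocks("ji", (2, 0), "ij", {"i": 1, "j": 3})

# Matrix-multiply-like rule: output "ik" reads all "j" blocks of input "ij".
matmul_blocks = blockwise_input_blocks("ik", (1, 0), "ij", {"i": 2, "j": 3, "k": 1})

# Rechunk from chunks of 2 to chunks of 3; the values are unchanged.
rechunked = rechunk_1d([[1, 2], [3, 4], [5]], 3)
```

Note that rechunking changes only how the values are grouped: the total number of elements is the same before and after.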

Core operations

These are built on top of the primitive operations, and provide functions that are needed to implement all array operations.

elemwise

Apply a function elementwise to array arguments, respecting broadcasting.

map_blocks

Apply a function to corresponding blocks from multiple input arrays.

map_direct

Apply a function across blocks of a new array, reading directly from side inputs (not necessarily in a blockwise fashion).

index

Subset an array, along one or more axes.

reduction

Apply a function to reduce an array along one or more axes.

arg_reduction

A reduction that returns the array indexes, not the values.
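As an illustration of the last two entries, the sketch below reduces a one-dimensional chunked list in two stages: each chunk is reduced on its own, then the partial results are combined. It is not Cubed's implementation; `reduce_chunks` and `argmax_chunks` are hypothetical names, and chunks are assumed to be non-empty. The arg-reduction variant carries a (value, global index) pair through the combine step so that the index, not the value, can be returned.

```python
def reduce_chunks(chunks, chunk_func, combine_func):
    """Reduce each chunk independently, then combine the partial results."""
    partials = [chunk_func(chunk) for chunk in chunks]
    return combine_func(partials)


def argmax_chunks(chunks):
    """Return the global index of the maximum value across 1-D chunks."""
    partials = []
    offset = 0
    for chunk in chunks:
        # Per chunk, keep both the maximum value and its global index.
        local = max(range(len(chunk)), key=chunk.__getitem__)
        partials.append((chunk[local], offset + local))
        offset += len(chunk)
    # Compare on value only; ties resolve to the first occurrence.
    best_value, best_index = max(partials, key=lambda p: p[0])
    return best_index


data = [[3, 9], [4, 9], [1]]
total = reduce_chunks(data, sum, sum)
largest_at = argmax_chunks(data)
```

Here the sum is computed from three per-chunk sums, and the argmax returns the position of the first 9 in the flattened data.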

Array API

The new Python Array API was chosen for the public API as it provides a useful, well-defined subset of the NumPy API. There are a few extensions, including Zarr IO, random number generation, and operations like map_blocks, which are heavily used in Dask applications.