Data layout¶
This page discusses how data is laid down in memory, and it provides some hints about how to use NumPy efficiently with the memory layout used by the LiteBIRD Simulation Framework. We focus mostly on the detectors time-ordered-data (TOD), as they are the largest object we expect to handle frequently.
The LiteBIRD Simulation Framework stores TOD as a matrix with one row
per detector and the other detector attributes as arrays with one
entry per detector. This means that, given an observation obs
, you
access quantities with patterns like obs.fknee[12]
(as opposed to
obs.detectors[12].fknee
). Note you can easily write single lines
that operate on all the detectors:
# Apply to each detector its own calibration factor
obs.tod *= obs.calibration_factors[:, None]
# Generate white noise at different level for each detector
obs.tod = (np.random.normal(size=obs.tod.shape)
* obs.wn_level[:, None])
TOD as two-dimensional arrays¶
Organizing the TOD in a matrix is ideal because it enables fast access
to (1) many samples for one detector and (2) many detectors for one
time sample. In fact, TOD are a large set of samples. Each sample is
identified by the time at which it was taken and by the detector that
took it. If all the detectors are synchronously sampled, these samples
can be organized in a matrix D-by-T, where D is the number of
detectors and T is the number of time samples. Following
numpy.ndarray
notation, we denote it as tod[row_index,
col_index]
. Accessing the \(i\)-th row can be done using the
notation tod[i]
: for instance, tod[3]
gets the fourth row (the
TOD of the fourth detector); accessing the fourth column, i.e., the
fourth sample of each detector, can be done with tod[:, 3]
(remember that in Python we start counting from 0). So, the
way the framework keeps TODs in memory makes easy to operate on the
«detector dimension» as well as on the «time dimension».
You can take advantage of NumPy functions to easily propagate
operations along one dimension or another. For instance, taking the
time stream of the mean signal across the focal plane is done with
tod.mean(axis=0)
. In general, an extremely large number of
operations can be expressed in terms of numpy
functions and they
can operate easily and efficiently over axes of multi-dimensional
arrays. Similarly, if C
is a cross-talk matrix, the operation C
@ tod
mixes the TODs.
The previous examples are not only easy to write, they are also (very likely)
as fast as they can be (compared to different data layouts). This is
particularly true for the last example. The matrix multiplication C @ tod
calls internally some highly optimized dense linear algebra routines.
Even when there is not really communications between detectors
involved, having data in a 2-dimensional array produces (arguably)
cleaner code and sometimes faster code (never slower). Suppose you
have a list of calibration_factors
and a list of time streams
tod
. You can apply the calibration factors with
for calibration_factor, det_tod in zip(calibration_factors, tod):
det_tod *= calibration_factor
but if calibration_factors
is an array and tod
a matrix you can
achieve the same result with the easier and (typically) faster
tod *= calibration_factors[:, None]
Splitting the TOD matrix¶
When working on a laptop (or a single compute node) we can live in the ideal case discussed above. We can benefit both from the API simplifications and the performance of the compiled codes that the numerical libraries typically call. Moreover, these libraries are often threaded (or easily threadable) and therefore we can in principle leverage on all the processors available.
However, when working on clusters, we have to split this matrix into pieces. We resort to supercomputers either when we want more CPU power than a laptop or because we need more memory to store our data. Both motivations apply to full-scale LiteBIRD simulations (4000 detectors, sampled at 20 GHz for 3 years take approximately 10 TB). Therefore, we have to distribute the matrix and a compute node has access only to the portion of the matrix that is in its memory.
To do this, we split the matrix into blocks that are stored as (sub-)matrices. At the present stage, there is no constraint on the shape of these blocks, they can also be a single row or a single column. The important thing is that, if the block spans more than a row, it is stored as a matrix (a 2-dimensional array). This means that, for the time- and detector-chunk assigned to a compute node, all the benefits discussed in the previous section apply.
Choosing the shape of the blocks¶
The most convenient block shape depends on the application:
Some operations prefer to store an entire row. For example, \(4\pi\) beam convolution has a memory overhead of the order of the GB for each beam, which is in principle different for each detector. For this operation is therefore more convenient to hold a detector for the entire length of the mission.
However, other operations prefer to have an entire column. For example, computing the common mode across the focal plane requires information from all the detectors at every time sample. This is trivial if the full detector column is in the memory, but it is very complicated if it is scattered across many compute nodes.
Another example is cross-talks. Mixing the TOD of different detectors is trivial when everything is in memory, but otherwise it requires sending large messages across the network. Sending messages is not only a performance issue, it means that probably a custom algorithm has to be written, and this algorithm will probably require MPI communications, which are notoriously hard to develop and debug.
Since there is no one-size-fit-all solution, the LiteBIRD Simulation Framework keeps the most generic approach. The shape of the blocks depends on the application, and it is possible to change it during the execution. (However, this takes time and should thus be done only when strictly necessary.)
Caveats¶
An important thing to remember is that, if the TOD matrix is chunked along the detector dimension, only the corresponding portion of the property array is detained in memory.
This has a few important implications:
Regardless if and how the TOD is distributed, both
obs.tod[i]
andobs.wn_level[i]
refer to the same detector;obs.tod
andobs.wn_level
have the same length;Operations like
obs.tod * obs.wn_level[:, None]
are correct.