Data layout
===========

This page discusses how data are laid out in memory, and it provides
some hints about how to use NumPy efficiently with the memory layout
used by the LiteBIRD Simulation Framework. We focus mostly on the
detectors' time-ordered data (TOD), as they are the largest objects we
expect to handle frequently.

The LiteBIRD Simulation Framework stores the TOD as a matrix with one
row per detector, and the other detector attributes as arrays with one
entry per detector. This means that, given an observation ``obs``, you
access quantities with patterns like ``obs.fknee[12]`` (as opposed to
``obs.detectors[12].fknee``). Note that you can easily write single
lines that operate on all the detectors::

    # Apply to each detector its own calibration factor
    obs.tod *= obs.calibration_factors[:, None]

    # Generate white noise at a different level for each detector
    obs.tod = (np.random.normal(size=obs.tod.shape)
               * obs.wn_level[:, None])

TOD as two-dimensional arrays
-----------------------------

Organizing the TOD in a matrix is ideal because it enables fast access
both to (1) many samples for one detector and (2) many detectors for
one time sample.

In fact, a TOD is a large set of samples, each identified by the time
at which it was taken and by the detector that took it. If all the
detectors are sampled synchronously, these samples can be organized in
a :math:`D \times T` matrix, where :math:`D` is the number of detectors
and :math:`T` is the number of time samples. Following the
``numpy.ndarray`` notation, we denote it as
``tod[row_index, col_index]``.

Accessing the :math:`i`-th row can be done using the notation
``tod[i]``: for instance, ``tod[3]`` gets the fourth row (the TOD of
the fourth detector). Accessing the fourth column, i.e., the fourth
sample of each detector, can be done with ``tod[:, 3]`` (remember that
in Python indices start from 0).

So, the way the framework keeps TODs in memory makes it easy to operate
on the «detector dimension» as well as on the «time dimension». You can
take advantage of NumPy functions to easily propagate operations along
one dimension or the other. For instance, the time stream of the mean
signal across the focal plane is obtained with ``tod.mean(axis=0)``. In
general, an extremely large number of operations can be expressed in
terms of ``numpy`` functions, which operate easily and efficiently over
the axes of multi-dimensional arrays. Similarly, if ``C`` is a
cross-talk matrix, the operation ``C @ tod`` mixes the TODs.

The previous examples are not only easy to write, they are also (very
likely) as fast as they can be, compared with different data layouts.
This is particularly true for the last example: the matrix
multiplication ``C @ tod`` internally calls highly optimized dense
linear-algebra routines.

Even when no communication between detectors is involved, having the
data in a 2-dimensional array produces (arguably) cleaner and sometimes
faster code (never slower). Suppose you have a *list* of
``calibration_factors`` and a *list* of time streams ``tod``. You can
apply the calibration factors with

.. code-block:: python

    for calibration_factor, det_tod in zip(calibration_factors, tod):
        det_tod *= calibration_factor

but if ``calibration_factors`` is an *array* and ``tod`` a *matrix*,
you can achieve the same result with the simpler and (typically) faster

.. code-block:: python

    tod *= calibration_factors[:, None]
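To make the examples above concrete, here is a minimal, self-contained
sketch of the ``tod.mean(axis=0)`` and ``C @ tod`` operations; the TOD
matrix and the cross-talk matrix are made up for illustration and do
not correspond to any real instrument:

.. code-block:: python

    import numpy as np

    # Made-up TOD matrix: 4 detectors, 1000 time samples
    tod = np.random.normal(size=(4, 1000))

    # Time stream of the mean signal across the focal plane
    # (one value per time sample)
    common_mode = tod.mean(axis=0)

    # Arbitrary cross-talk matrix: identity plus a small leakage
    # between neighbouring detectors
    C = np.identity(4) + 0.01 * (np.eye(4, k=1) + np.eye(4, k=-1))

    # Mixing the detectors' TODs is a single dense matrix multiplication
    mixed_tod = C @ tod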
Splitting the TOD matrix
------------------------

When working on a laptop (or a single compute node) we can live in the
ideal case discussed above: we benefit both from the API
simplifications and from the performance of the compiled code that the
numerical libraries typically call. Moreover, these libraries are often
threaded (or easily threadable), and therefore we can in principle
leverage all the available processors.

However, when working on clusters, we have to split this matrix into
pieces. We resort to supercomputers either because we want more CPU
power than a laptop provides or because we need more memory to store
our data. Both motivations apply to full-scale LiteBIRD simulations
(4000 detectors, sampled at 20 Hz for 3 years, take approximately
10 TB). Therefore, we have to distribute the matrix, and each compute
node has access only to the portion of the matrix that is in its
memory.

To do this, we split the matrix into blocks that are stored as
(sub-)matrices. At the present stage there is no constraint on the
shape of these blocks: they can also be a single row or a single
column. The important thing is that, if the block spans more than one
row, it is stored as a matrix (a 2-dimensional array). This means that,
for the time- and detector-chunk assigned to a compute node, all the
benefits discussed in the previous section apply.

Choosing the shape of the blocks
--------------------------------

The most convenient block shape depends on the application:

- Some operations prefer to store an entire row. For example,
  :math:`4\pi` beam convolution has a memory overhead of the order of a
  GB for each beam, which is in principle different for each detector.
  For this operation it is therefore more convenient to hold one
  detector for the entire length of the mission.

- However, other operations prefer to have an entire column. For
  example, computing the common mode across the focal plane requires
  information from all the detectors at every time sample. This is
  trivial if the full detector column is in memory, but it is very
  complicated if it is scattered across many compute nodes. Another
  example is cross-talks: mixing the TODs of different detectors is
  trivial when everything is in memory, but otherwise it requires
  sending large messages across the network. Sending messages is not
  only a performance issue: it probably means that a custom algorithm
  has to be written, and this algorithm will likely require MPI
  communications, which are notoriously hard to develop and debug.

Since there is no one-size-fits-all solution, the LiteBIRD Simulation
Framework keeps the most general approach: the shape of the blocks
depends on the application, and it is possible to change it during the
execution. (However, this takes time and should thus be done only when
strictly necessary.)

Caveats
-------

An important thing to remember is that, if the TOD matrix is chunked
along the detector dimension, only the corresponding portion of each
detector property array is kept in memory. This has a few important
implications:

1. Regardless of whether and how the TOD is distributed, ``obs.tod[i]``
   and ``obs.wn_level[i]`` always refer to the same detector;
2. ``obs.tod`` and ``obs.wn_level`` have the same length along the
   detector dimension;
3. Operations like ``obs.tod * obs.wn_level[:, None]`` are always
   correct (see the sketch below).
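As a minimal sketch of the last point, this is what a compute node
holding a block with two detectors would see; the arrays, shapes, and
noise levels below are made up for illustration, only the broadcasting
pattern matters:

.. code-block:: python

    import numpy as np

    # Made-up block held by one compute node: 2 (out of many)
    # detectors and 1000 time samples
    local_tod = np.random.normal(size=(2, 1000))

    # The property array is chunked consistently with the TOD rows,
    # so it has one entry per *local* detector
    local_wn_level = np.array([1.3e-5, 2.1e-5])

    assert local_tod.shape[0] == len(local_wn_level)

    # Broadcasting along the detector dimension is therefore always
    # valid on the local block
    local_tod *= local_wn_level[:, None]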