Lazy Evaluation in Python

As part of a project at work, I have had to write some code to load a collection of HDF5 files containing thousands of small cutout images (2D arrays of floating point values). The application that uses these files (hereafter, the “consumer”) will not need to use all of the cutouts that are contained within the collection of input files. Ideally, I don’t want the consumer to waste time and memory by loading the entire contents of all of the files. I also want the code to be clean and maintainable; if the format of the input files changes, I do not want to change the code within the consumer.

To manage the latter concern, I began implementing a set of classes for reading and abstracting the input files, and I placed these classes in the sub-project containing the program that produces the output files (the “producer”). The producer is primarily maintained by another developer. However, by adding my file-reading classes to the same project and then adding integration tests, I can ensure that whenever a change to the producer breaks the file reader, the tests fail and effectively set off an alarm until the file-reading classes have been updated. The code within the consumer does not need to change; it only needs to import the latest version of the file reader.
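The kind of integration test described above can be sketched roughly as follows. This is only an illustration: `produce()` and `read_flux()` are hypothetical stand-ins for the real producer and file reader, and a JSON file is used in place of HDF5 so that the sketch runs with the standard library alone.

```python
import json
import os
import tempfile


def produce(path: str) -> None:
    """Hypothetical stand-in for the producer: writes one record to disk."""
    with open(path, 'w') as out:
        json.dump({'42': {'FLUX': 1.5}}, out)


def read_flux(path: str, object_id: int) -> float:
    """Hypothetical stand-in for the file reader."""
    with open(path) as src:
        return float(json.load(src)[str(object_id)]['FLUX'])


def test_reader_understands_producer_output() -> None:
    """Round trip: run the producer, then read its output with the reader."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'cutouts.json')
        produce(path)
        # If the producer renames 'FLUX' or restructures the file,
        # this assertion (or the lookup inside read_flux) fails.
        assert read_flux(path, 42) == 1.5


test_reader_understands_producer_output()
```

Because the test exercises the producer and the reader together, a format change on either side shows up as a test failure in the shared project rather than as a surprise in the consumer.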

To handle the first concern, I have used lazy evaluation, specifically lazy initialization. I created a class called CutoutCollection, which handles the file loading and abstraction. The CutoutCollection appears to contain many objects of the class Cutout; however, the Cutout objects are not actually created until the contents of the collection are first requested, for example by name (id number). Furthermore, the contents of each Cutout are not loaded from the underlying HDF5 files until they are requested. Here is a section of the implementation of Cutout:

import h5py


class Cutout:
    """
    Represents a single cutout image.
    """
    __slots__ = ['_id', '_group', '_flux', '_wcs', '_pixels']

    def __init__(self, object_id: int, cutout_group: h5py.Group):
        """
        Constructs the Cutout from the HDF5 Group corresponding to the cutout

        Parameters
        ----------

        object_id: int
            The id number of the galaxy shown in the cutout image.

        cutout_group: HDF5 Group
            The HDF5 Group that will be ingested to create the `Cutout` object.
        """

        self._id = object_id

        self._group = cutout_group

        # Nothing is read from the group yet. We only read from the group when
        # data is requested.

        self._flux = None

        self._wcs = None

        self._pixels = None

    @property
    def flux(self) -> float:
        """The flux of the object shown in the cutout."""
        self._delayed_init()
        return self._flux

    def _delayed_init(self):
        """
        Performs delayed initialization
        """
        if self._flux is None:
            attrs = self._group.attrs

            self._flux = float(attrs['FLUX'])

            self._wcs = Wcs(attrs)

            self._pixels = Pixels(self._group['image'])

Note that __init__() sets the internal variables corresponding to the loaded quantities to None. When the flux property is accessed, it first calls _delayed_init(), which completes the actual initialization. If the object has already been initialized, then _delayed_init() does nothing. The Wcs and Pixels classes also delay their initialization until their member data is requested (not shown here).
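To see the pattern in isolation, here is a minimal sketch that runs without h5py. The `LazyRecord` class, its `read_count` counter and the plain dict used as a data source are all hypothetical; they only stand in for Cutout and its HDF5 group.

```python
class LazyRecord:
    """Toy stand-in for Cutout: defers a read until the first access."""

    __slots__ = ['_source', '_value', 'read_count']

    def __init__(self, source: dict):
        # Only keep a reference to the data source; nothing is read yet.
        self._source = source
        self._value = None
        self.read_count = 0  # hypothetical counter, for demonstration only

    @property
    def value(self) -> float:
        self._delayed_init()
        return self._value

    def _delayed_init(self):
        # Do nothing if the data has already been loaded.
        if self._value is not None:
            return
        self.read_count += 1
        self._value = float(self._source['FLUX'])


record = LazyRecord({'FLUX': '1.5'})
print(record.read_count)  # 0: construction alone reads nothing
print(record.value)       # 1.5: the first access triggers the read
print(record.value)       # 1.5: the second access reuses the stored value
print(record.read_count)  # 1: the source was read exactly once
```

One caveat of using None as the "not loaded yet" marker is that a quantity whose legitimate value is None would be reloaded on every access; a separate boolean flag avoids that if it ever matters.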

The CutoutCollection class looks like this:

import json
import os

import h5py


class CutoutCollection:
    """
    Provides an interface for handling the contents of the HDF5 files created by 
    SIR_CreateCuts.

    Upon initialization, only the metadata from the HDF5 files is actually loaded. 
    The contents of the files are loaded lazily (i.e., on-demand) in order to 
    prevent unnecessary I/O.
    """

    def __init__(self, filename: str):
        """
        Parameters
        ----------

        filename: str
            The name of a JSON file containing a list of HDF5 files, produced
            by `SIR_CreateCuts`.
        """

        # The name of the input JSON file
        self._json_filename = filename

        # A dict, mapping integer object IDs to HDF5 Group objects
        self._cutout_index = None

        # A list containing the HDF5 Files that are currently loaded
        self._hdf5_files = None

        self._load_json(filename)

    def __getitem__(self, object_id: int):
        """

        Parameters
        ----------

        object_id : int
            The id number of an object

        Returns
        -------

        cutout: Cutout
            The Cutout object associated with the specified `object_id`

        Raises
        ------

        KeyError
            If the object_id does not exist in the `CutoutCollection`.
        """
        self._make_index_if_necessary()

        return self._cutout_index[object_id]

    def __iter__(self):
        self._make_index_if_necessary()

        return iter(self._cutout_index.values())

    def __contains__(self, item: int) -> bool:
        self._make_index_if_necessary()

        return item in self._cutout_index

    def __len__(self):
        self._make_index_if_necessary()

        return len(self._cutout_index)

    def get_ids(self):
        self._make_index_if_necessary()

        return self._cutout_index.keys()

    def _make_index_if_necessary(self):
        """
        Indexes the cutouts by integer ID if the index has not already been 
        created.

        Side Effects
        ------------

        Creates the `self._cutout_index` dictionary.
        """
        if self._cutout_index is not None:
            return

        self._cutout_index = {}

        for hdf5_file in self._hdf5_files:
            for object_id, group in hdf5_file.items():
                oid = int(object_id)
                self._cutout_index[oid] = Cutout(oid, group)

    def _load_json(self, filename: str):
        """
        Loads a JSON file containing a list of HDF5 files and then loads each 
        of the HDF5 files.

        Note that the contents of the HDF5 files are not read at this point.
        Reading is deferred until the contents are requested, using
        `__getitem__()`.

        Parameters
        ----------

        filename: str
            The name of a JSON file containing a list of HDF5 files. The JSON 
            file is assumed to be located in the relevant working directory 
            (i.e., the directory which contains the data directory).
        """
        self._reset()

        full_filename = os.path.abspath(filename)

        work_dir = os.path.dirname(full_filename)

        with open(full_filename) as json_file:
            h5_filenames = json.load(json_file)

        for h5_filename in h5_filenames:
            self._load_hdf5(os.path.join(work_dir, 'data', h5_filename))

    def _load_hdf5(self, filename: str):
        """
        Loads an HDF5 file produced by `SIR_CreateCuts`.

        This is a lightweight method which only creates the HDF5 File objects. 
        It does not read the file contents.

        Parameters
        ----------

        filename: str
            The full path and filename of the file to be loaded.
        """
        if self._hdf5_files is None:
            self._hdf5_files = []

        self._hdf5_files.append(h5py.File(filename, mode='r'))

    def _reset(self):
        """
        Clears the contents of the object; returns it to its initial state so 
        that previously-loaded files are not confused with more recently loaded 
        files.
        """
        if self._hdf5_files is not None:
            for f in self._hdf5_files:
                f.close()

        self._cutouts = None
        self._cutout_index = None
        self._hdf5_files = None

The method _make_index_if_necessary() is called whenever the contents of the collection are requested. Until data is requested, the CutoutCollection object only contains references to the HDF5 files that it represents. I could have made this even more lightweight by delaying the index creation and only adding objects to the _cutout_index dictionary when they are requested. I’ve chosen not to do that: the current method works well enough because the Cutout objects also use lazy initialization.
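For comparison, the more lightweight alternative mentioned above might look something like the sketch below. It is not part of the actual implementation: `OnDemandCollection`, `created_count()` and the `factory` argument are hypothetical, and plain dicts with string keys stand in for the open HDF5 files.

```python
class OnDemandCollection:
    """Creates each wrapper object only when its ID is first requested."""

    def __init__(self, files, factory):
        # `files` is a list of mappings with string keys, standing in for
        # open HDF5 files; `factory` plays the role of the Cutout class.
        self._files = files
        self._factory = factory
        self._cache = {}

    def __getitem__(self, object_id: int):
        if object_id in self._cache:
            return self._cache[object_id]

        key = str(object_id)
        for f in self._files:
            if key in f:
                obj = self._factory(object_id, f[key])
                self._cache[object_id] = obj
                return obj

        raise KeyError(object_id)

    def created_count(self) -> int:
        """Number of objects created so far (for demonstration only)."""
        return len(self._cache)


files = [{'1': 'group-a'}, {'2': 'group-b', '3': 'group-c'}]
collection = OnDemandCollection(files, factory=lambda oid, group: (oid, group))

print(collection.created_count())  # 0: nothing has been created yet
print(collection[2])               # (2, 'group-b')
print(collection.created_count())  # 1: only the requested object exists
```

The trade-off is that every first lookup has to search the files for the requested ID, and operations such as len() or iteration would still need to walk all of the keys.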

You can find some timing results which illustrate the performance of the code here.