Lazy Evaluation in Python
As part of a project at work, I have had to write some code to load a collection of HDF5 files containing thousands of small cutout images (2D arrays of floating point values). The application that uses these files (hereafter, the “consumer”) will not need to use all of the cutouts that are contained within the collection of input files. Ideally, I don’t want the consumer to waste time and memory by loading the entire contents of all of the files. I also want the code to be clean and maintainable; if the format of the input files changes, I do not want to change the code within the consumer.
To manage the latter concern, I began implementing a set of classes for reading and abstracting the input files, and I placed these classes in the sub-project containing the program that produces those files (the “producer”). The producer is primarily maintained by another developer. However, because my file-reading classes live in the same project and are covered by integration tests, any change to the producer that breaks the file reader causes the tests to fail, which effectively sets off an alarm until the file-reading classes have been updated. The code within the consumer does not need to change; it only needs to import the latest version of the file reader.
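As a rough illustration of the kind of integration test described above, here is a minimal round-trip sketch. It is not the project's real test: `write_cutouts` and `read_cutouts` are hypothetical stand-ins for the producer and the file reader, and JSON stands in for HDF5 so that the example runs with only the standard library.

```python
import json
import os
import tempfile


def write_cutouts(path, cutouts):
    """Hypothetical stand-in for the producer: writes {id: flux} to disk."""
    payload = {str(oid): {"FLUX": flux} for oid, flux in cutouts.items()}
    with open(path, "w") as f:
        json.dump(payload, f)


def read_cutouts(path):
    """Hypothetical stand-in for the file reader: returns {id: flux}."""
    with open(path) as f:
        raw = json.load(f)
    return {int(oid): float(entry["FLUX"]) for oid, entry in raw.items()}


def test_round_trip():
    """Fails if the written format drifts away from what the reader expects."""
    expected = {1: 2.5, 2: 0.75}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cutouts.json")
        write_cutouts(path, expected)
        assert read_cutouts(path) == expected


test_round_trip()
```

If the stand-in producer renamed the `FLUX` key, for example, `read_cutouts` would raise a `KeyError` and the test would fail, which is the alarm described above.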
To handle the first concern, I have used lazy evaluation, specifically lazy initialization. I created a class called CutoutCollection, which handles the file loading and abstraction. The CutoutCollection appears to contain many objects of the class Cutout; however, the Cutout objects are not actually created until the contents of the collection are first requested. Furthermore, the contents of each Cutout are not loaded from the underlying HDF5 files until they are requested. Here is a section of the implementation of Cutout:
import h5py

# Wcs and Pixels are defined elsewhere in the project (not shown here).


class Cutout:
    """
    Represents a single cutout image.
    """
    __slots__ = ['_id', '_group', '_flux', '_wcs', '_pixels']

    def __init__(self, object_id: int, cutout_group: h5py.Group):
        """
        Constructs the Cutout from the HDF5 Group corresponding to the cutout.

        Parameters
        ----------
        object_id: int
            The id number of the galaxy shown in the cutout image.
        cutout_group: HDF5 Group
            The HDF5 Group that will be ingested to create the `Cutout` object.
        """
        self._id = object_id
        self._group = cutout_group
        # Nothing is read from the group yet. We only read from the group when
        # data is requested.
        self._flux = None
        self._wcs = None
        self._pixels = None

    @property
    def flux(self) -> float:
        """The flux of the object shown in the cutout."""
        self._delayed_init()
        return self._flux

    def _delayed_init(self):
        """
        Performs delayed initialization.
        """
        if self._flux is None:
            attrs = self._group.attrs
            self._flux = float(attrs['FLUX'])
            self._wcs = Wcs(attrs)
            self._pixels = Pixels(self._group['image'])
Note that __init__() sets the internal variables corresponding to the loaded quantities to None. When the flux property is accessed, it first calls _delayed_init(), which completes the actual initialization. If the object has already been initialized, then _delayed_init() does nothing. The Wcs and Pixels classes also delay their initialization until their member data is requested (not shown here).
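The pattern can be exercised without any HDF5 files. The following is a minimal sketch, not the real implementation: `CountingGroup` is a hypothetical stand-in for an `h5py.Group` that counts how often its attributes are read, and `LazyCutout` is a cut-down `Cutout` that loads only the flux.

```python
class CountingGroup:
    """Hypothetical stand-in for an h5py.Group that counts attribute reads."""

    def __init__(self, attrs):
        self._attrs = attrs
        self.reads = 0

    @property
    def attrs(self):
        self.reads += 1
        return self._attrs


class LazyCutout:
    """Simplified Cutout: only the flux is loaded, and only on first access."""

    __slots__ = ['_id', '_group', '_flux']

    def __init__(self, object_id, cutout_group):
        self._id = object_id
        self._group = cutout_group
        self._flux = None  # nothing has been read from the group yet

    @property
    def flux(self):
        self._delayed_init()
        return self._flux

    def _delayed_init(self):
        # Read from the group only if the data has not been loaded already.
        if self._flux is None:
            self._flux = float(self._group.attrs['FLUX'])


group = CountingGroup({'FLUX': 3.5})
cutout = LazyCutout(42, group)
reads_after_construction = group.reads  # constructing the cutout reads nothing
first = cutout.flux                     # first access triggers the read
second = cutout.flux                    # second access reuses the stored value
reads_after_access = group.reads
```

The standard library's `functools.cached_property` offers a similar effect with less code, but it stores its result in the instance `__dict__`, so it does not combine directly with a `__slots__` declaration like the one used here.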
The CutoutCollection class looks like this:
import json
import os

import h5py


class CutoutCollection:
    """
    Provides an interface for handling the contents of the HDF5 files created by
    SIR_CreateCuts.

    Upon initialization, only the metadata from the HDF5 files is actually loaded.
    The contents of the files are loaded lazily (i.e., on-demand) in order to
    prevent unnecessary I/O.
    """

    def __init__(self, filename: str):
        """
        Parameters
        ----------
        filename: str
            The name of a JSON file containing a list of HDF5 files, produced
            by `SIR_CreateCuts`.
        """
        # The name of the input JSON file
        self._json_filename = filename
        # A dict, mapping integer object IDs to Cutout objects
        self._cutout_index = None
        # A list containing the HDF5 Files that are currently loaded
        self._hdf5_files = None
        self._load_json(filename)

    def __getitem__(self, object_id: int):
        """
        Parameters
        ----------
        object_id : int
            The id number of an object

        Returns
        -------
        cutout: Cutout
            The Cutout object associated with the specified `object_id`

        Raises
        ------
        KeyError
            If the object_id does not exist in the `CutoutCollection`.
        """
        self._make_index_if_necessary()
        return self._cutout_index[object_id]

    def __iter__(self):
        self._make_index_if_necessary()
        return iter(self._cutout_index.values())

    def __contains__(self, item: int) -> bool:
        self._make_index_if_necessary()
        return item in self._cutout_index

    def __len__(self):
        self._make_index_if_necessary()
        return len(self._cutout_index)

    def get_ids(self):
        self._make_index_if_necessary()
        return self._cutout_index.keys()

    def _make_index_if_necessary(self):
        """
        Indexes the cutouts by integer ID if the index has not already been
        created.

        Side Effects
        ------------
        Creates the `self._cutout_index` dictionary.
        """
        if self._cutout_index is not None:
            return
        self._cutout_index = {}
        # `_hdf5_files` is still None if the JSON file listed no HDF5 files.
        for hdf5_file in self._hdf5_files or []:
            for object_id, group in hdf5_file.items():
                oid = int(object_id)
                self._cutout_index[oid] = Cutout(oid, group)

    def _load_json(self, filename: str):
        """
        Loads a JSON file containing a list of HDF5 files and then loads each
        of the HDF5 files.

        Note that the contents of the HDF5 files are not read at this point.
        Reading is deferred until the contents are requested, using
        `__getitem__()`.

        Parameters
        ----------
        filename: str
            The name of a JSON file containing a list of HDF5 files. The JSON
            file is assumed to be located in the relevant working directory
            (i.e., the directory which contains the data directory).
        """
        self._reset()
        full_filename = os.path.abspath(filename)
        work_dir = os.path.dirname(full_filename)
        with open(full_filename) as json_file:
            h5_filenames = json.load(json_file)
        for h5_filename in h5_filenames:
            self._load_hdf5(os.path.join(work_dir, 'data', h5_filename))

    def _load_hdf5(self, filename: str):
        """
        Loads an HDF5 file produced by `SIR_CreateCuts`.

        This is a lightweight method which only creates the HDF5 File objects.
        It does not read the file contents.

        Parameters
        ----------
        filename: str
            The full path and filename of the file to be loaded.
        """
        if self._hdf5_files is None:
            self._hdf5_files = []
        self._hdf5_files.append(h5py.File(filename, mode='r'))

    def _reset(self):
        """
        Clears the contents of the object; returns it to its initial state so
        that previously-loaded files are not confused with more recently loaded
        files.
        """
        if self._hdf5_files is not None:
            for f in self._hdf5_files:
                f.close()
        self._cutout_index = None
        self._hdf5_files = None
The method _make_index_if_necessary() is called whenever the contents of the collection are requested. Until data is requested, the CutoutCollection object only contains references to the HDF5 files that it represents. I could have made this even more lightweight by skipping the up-front index creation and only adding objects to the _cutout_index dictionary when they are requested. I’ve chosen not to do that because the current method works well enough, since the Cutout objects also use lazy initialization.
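For comparison, the more lightweight alternative mentioned above could look roughly like the sketch below. This is an illustration, not code from the project: `OnDemandCollection` and `make_cutout` are hypothetical names, plain dictionaries stand in for the open HDF5 files, and the sketch assumes that the group name for an object is exactly `str(object_id)`.

```python
class OnDemandCollection:
    """Creates cutout objects only when they are requested by ID."""

    def __init__(self, files, factory):
        # `files` stands in for the list of open HDF5 files; each one maps
        # string IDs to groups. `factory` builds a cutout from (id, group).
        self._files = files
        self._factory = factory
        self._cutout_index = {}

    def __getitem__(self, object_id):
        # Build the cutout on first request, then reuse the stored object.
        if object_id not in self._cutout_index:
            group = self._find_group(object_id)
            self._cutout_index[object_id] = self._factory(object_id, group)
        return self._cutout_index[object_id]

    def _find_group(self, object_id):
        # Assumption: group names are the decimal string form of the ID.
        key = str(object_id)
        for f in self._files:
            if key in f:
                return f[key]
        raise KeyError(object_id)


created = []  # records which IDs have actually been constructed


def make_cutout(object_id, group):
    created.append(object_id)
    return (object_id, group)


files = [{'1': 'group-1', '2': 'group-2'}, {'3': 'group-3'}]
collection = OnDemandCollection(files, make_cutout)
cutout = collection[3]
same_cutout = collection[3]
```

After these two lookups, `created` contains only the ID 3. The trade-off is that operations such as `len()` or iteration would still have to visit the keys of every file, so this variant mainly helps when only a few cutouts are looked up by ID.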
You can find some timing results that illustrate the performance of the code here.