Highlighted Projects
Infrastructure Projects
Migration of SDC-US to CephFS 
I led the migration of NASA’s data center for Euclid (SDC-US) from a legacy, low-performance filesystem to a Ceph cluster-based filesystem, enabling much higher performance, better robustness, and improved scalability.
When I became the leader of SDC-US in late 2023, the data center was incapable of fulfilling its data processing responsibilities due to a severe bottleneck: a single 8 Gbps connection between the Oracle file server and the SAN storage appliance. Hardware for a Ceph storage cluster had already been purchased before I became the team leader, but it would not arrive until several months after I entered the role.
As a stop-gap measure to make SDC-US fully operational, I instructed the team to temporarily re-purpose a compute node as a file server. We set up a software RAID array using several 14 TB NVMe drives and shared the filesystem with the compute nodes via NFS. The new file-server node had a 25 Gbps Ethernet link and much lower latency, which greatly improved upon the previous solution. The I/O-intensive Euclid processing pipelines used the new filesystem while the older filesystem continued to be used for “colder” long-term storage. This allowed us to have a functional data center while the Ceph cluster was being prepared.
Once the hardware for the Ceph storage cluster arrived and had been provisioned, the team installed and configured Ceph. I led the testing process, which involved stress testing, performance testing, and simulation of hardware failures. When we determined that the system was ready for operations, I orchestrated the migration of data to CephFS and performed the switch-over. A significant sub-task within the orchestration process involved migrating the systems in the remainder of the data center from CentOS 7 to RHEL 9.
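The test plan itself isn't reproduced here; as a rough illustration, performance tests of this kind can be driven by a small harness around the standard fio benchmarking tool (the CephFS mount path and job parameters below are placeholders, not the actual test configuration):

```python
import json
import subprocess

# Hypothetical CephFS mount point used for benchmark runs (placeholder path).
TEST_DIR = "/mnt/cephfs/benchmarks"

def run_fio(rw, block_size, jobs, runtime_s=60):
    """Run one fio job against the CephFS mount and return its parsed JSON report."""
    cmd = [
        "fio",
        f"--name={rw}-{block_size}",
        f"--directory={TEST_DIR}",
        f"--rw={rw}",              # e.g. "write", "randread", "randwrite"
        f"--bs={block_size}",      # e.g. "4k", "1m"
        "--size=4g",
        f"--numjobs={jobs}",
        "--time_based",
        f"--runtime={runtime_s}",
        "--direct=1",
        "--group_reporting",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)

if __name__ == "__main__":
    # Sweep a few access patterns; fio reports aggregate bandwidth in KiB/s.
    for rw, bs in [("write", "1m"), ("randwrite", "4k"), ("randread", "4k")]:
        report = run_fio(rw, bs, jobs=8)
        side = "read" if "read" in rw else "write"
        bw_gib = report["jobs"][0][side]["bw"] / 1024 / 1024
        print(f"{rw:>9} {bs:>3}: {bw_gib:.2f} GiB/s aggregate")
```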
We currently have roughly 2 PB of usable storage on our Ceph cluster and the ability to easily increase the capacity, as the mission evolves. The system has sustained a throughput of slightly over 25 GB/s (over 210 Gbps) during heavy processing events.
I should point out that I did not do most of the hands-on work of setting up the Ceph cluster. The work was done by a small team of systems engineers, mostly by an individual whom I interviewed for the role. I provided guidance on the cluster configuration and defined the filesystem pools. Once the system had been configured, I performed much of the testing, identified issues, and planned the migration process. Since this was our organization’s first experience with CephFS, we hired a Ceph consulting firm to audit our cluster; they produced a report with a few useful suggestions for improving our configuration settings.
JSP Science Platform Prototype 
I created a science platform prototype for the NASA-NSF-DOE Joint Survey Processing (JSP) project, a proposed effort to jointly re-process data from the Euclid, Roman, and Rubin telescopes. The ultimate goal of the science platform was to allow astronomers to log in and perform custom analyses on large datasets without needing to download the data to their local machines.
I created two Singularity container images: one contained a suite of hundreds of astronomy-related libraries and applications (including some custom ones), as well as many software development tools, so that users could develop their own custom software within the platform; the other contained most of the system software used to build the platform. Vagrant was used to create a cluster of VMs to simulate a real computing cluster. JupyterHub and JupyterLab provided the main user interface, with SSH access also available. Users’ home directories were stored within separate disk images that could be compressed when not in use. LDAP was set up for storing user account information.
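The prototype's actual configuration isn't reproduced here, but wiring JupyterHub to an LDAP directory is conventionally done with the ldapauthenticator package; a minimal, illustrative jupyterhub_config.py along those lines (the server address, DN template, and paths are placeholders, not the prototype's settings) would look like:

```python
# jupyterhub_config.py -- illustrative sketch only, not the prototype's actual settings.
c = get_config()  # provided by JupyterHub when it loads this file

# Authenticate users against the LDAP directory that stores account information.
c.JupyterHub.authenticator_class = "ldapauthenticator.LDAPAuthenticator"
c.LDAPAuthenticator.server_address = "ldap.example.org"          # placeholder
c.LDAPAuthenticator.bind_dn_template = [
    "uid={username},ou=people,dc=example,dc=org",                 # placeholder DN layout
]

# Spawn single-user JupyterLab servers; home directories live on shared storage.
c.Spawner.default_url = "/lab"
c.Spawner.notebook_dir = "/home/{username}"
```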
The same Singularity image that the users accessed within the science platform was also used to process a subset of sample data at NERSC. The key point is that the users of the platform would have access to the exact same software that was used to perform the joint processing of the data from the Euclid, Roman, and Rubin telescopes. The platform would have also provided access to the data produced by each survey, along with the jointly-processed data products.
Unfortunately, the JSP project was not funded. The application image that I developed was used as the starting point for an experimental science platform initiative at IPAC.
Research Cluster at UC Riverside 
In 2014, my first task as an Assistant Project Scientist at UC Riverside involved building a small computing cluster.
The system consisted of 8 SuperMicro MicroBlade servers, 1 workstation / head node, and a network switch.
I was hired by UC Riverside to develop NebulOS, a distributed data analysis platform for scientists. In order to test NebulOS during development, I needed a few compute nodes. So, I submitted a request for my supervisor to buy some hardware. Once the hardware arrived, I installed Ubuntu 14.04 on all nodes and set up a network. Ansible was used for software configuration. I set up a small NFS share, installed the Hadoop Distributed File System, and started working on NebulOS.
Once NebulOS was in a working state, my supervisor used the cluster for analyzing his cosmological “multiverse” simulation (an ensemble of cosmological simulations) and several other researchers were also given access to the system (they were essentially beta testers for NebulOS). The cluster was also used by students during a “big data” summer school program, in which I taught the students the basics of Apache Spark.
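The course material isn't reproduced here; a classic first exercise at roughly that level, written here for modern PySpark (the HDFS path is a placeholder), is a word count:

```python
from pyspark.sql import SparkSession

# Start (or attach to) a Spark session on the cluster.
spark = SparkSession.builder.appName("summer-school-demo").getOrCreate()

# Count word occurrences in a text file stored on HDFS (placeholder path).
lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Print the ten most common words.
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()
```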
Basic specs of each worker node were modest: 4-core Xeon (Haswell) CPU, 32 GB RAM, 2× 4 TB HDDs, 1 Gbps Ethernet.
Software Projects
SpectraDecontaminator
Dates | January 2017 - present |
Type | Work assignment |
Language | Python 3 | Libraries & Frameworks | |
Summary
The 2D spectral decontamination module is a key component of Euclid’s infrared spectra processing pipeline. One of the instruments onboard Euclid is a “slitless” spectrometer, which captures many spectra per exposure by simply dispersing the incoming light onto the pixels of the detector array with a grism (a diffraction grating combined with a prism). The spectra of neighboring objects in the sky overlap one another; this overlapping is known as contamination. The job of the decontamination module (sketched in simplified form after the list below) is to:
- Flag regions of strong contamination in each individual spectrum.
- Attempt to model and remove the contamination from each spectrum.
- Update the variance layer of the resulting decontaminated spectra.
- Output a file containing each individual spectrum in an exposure, with each spectrum cropped out so that it is effectively presented in isolation.
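The production implementation is part of the Euclid pipeline and isn't reproduced here; the toy sketch below only conveys the general flavor of the first three steps above for a single spectrum (the array layout, threshold rule, and error model are illustrative stand-ins, not the real algorithm):

```python
import numpy as np

def decontaminate(spectrum, variance, contaminant_models, flag_threshold=0.1):
    """Toy illustration of per-spectrum decontamination.

    spectrum, variance  : 2D arrays (dispersion x cross-dispersion) for one object
    contaminant_models  : list of 2D model images of neighbouring spectra,
                          already projected onto this spectrum's pixel grid
    """
    # Model the total contamination as the sum of the neighbours' model images.
    if contaminant_models:
        contamination = np.sum(contaminant_models, axis=0)
    else:
        contamination = np.zeros_like(spectrum)

    # Flag pixels where the contamination is a large fraction of the signal.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(spectrum != 0, contamination / np.abs(spectrum), np.inf)
    strong_contamination_mask = ratio > flag_threshold

    # Subtract the modelled contamination from the data.
    decontaminated = spectrum - contamination

    # Propagate an (assumed) model uncertainty into the variance layer.
    model_uncertainty = 0.05 * contamination          # placeholder error model
    updated_variance = variance + model_uncertainty**2

    return decontaminated, updated_variance, strong_contamination_mask
```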
The next few steps in the pipeline measure the flux of the decontaminated spectra to obtain 1D spectra and then combine the multiple spectral observations of each galaxy from different exposures into a single 1D spectrum for that galaxy. These spectra are ultimately used in a separate analysis pipeline, which determines the redshift of each observed galaxy.
The general algorithm was determined in the early stages of the Euclid project; it was my job to decide upon most of the implementation details, implement the algorithm, and then work with the astronomers, upstream developers, and downstream developers to make improvements over time. For more details on the spectral pipeline, refer to the official paper. More information about Euclid’s spectrometer can be found here.
The code was used as the starting point for the decontamination module for the Roman Space Telescope.
Decontamination inSpector
Dates | Primarily May 2018 - June 2018 |
Type | Work-related productivity tool |
Language | Python 3 | Libraries & Frameworks | |
Summary
The Decontamination inSpector is a Qt-based GUI application for inspecting the output of Euclid’s SpectraDecontamination module. Before the inSpector was created, there was no convenient way to view and examine the details of what the SpectraDecontamination module was doing; the process involved creating plots and comparing spectral coordinates with the pixel coordinates in detector images. The inSpector allows us to easily see where the spectra are located within each detector of each set of exposures and to examine the details of the decontamination procedure. It also allows us to easily view and inspect the neighboring spectra, either by clicking on them directly or by showing the list of contaminating spectra and selecting the contaminants from that list.
The inSpector gives us an easy way to get a holistic view of the decontamination procedure and spot problems. It was especially useful in identifying issues with the input data, because it is fairly obvious when the locations of the spectra are wrong, when the measured fluxes are wrong, or when the locations of some spectra are missing from the input files. A video demonstration is posted here.
nrstatic
Dates | Primarily June 2025 - July 2025 |
Type | Personal productivity tool |
Languages | Python 3, Bash, JavaScript, HTML5 | Libraries & Frameworks | |
Summary
nrstatic is a static web page generator that I developed to build websites for myself (specifically for writing technical blog posts) because I became frustrated with maintaining WordPress and could not find an existing static website generator with the features that I wanted. With nrstatic, Python code can be executed inline to generate parts of the page, and variables can be set for later use (nearly everything within the page can be parameterized). The output of bash commands can also be used during the page-building process. There are built-in environments for plotting and creating other types of figures while keeping the source code for the figures in the source Markdown document. Image galleries can also be generated very conveniently. $\rm\LaTeX$ is used for typesetting math, and there is a standardized method for hiding and revealing sections of the page so that pages can contain a lot of information without looking cluttered. Read more about nrstatic in the introductory blog post.
Approx-net
Dates | September 2016 |
Type | Work-related research / learning exercise |
Language | C++11 | Libraries & Frameworks | OpenMP |
Summary
I spent September of 2016 reading through the literature on deep neural networks and getting up to speed on the state of the art at the time (“attention” was seen as the next big thing, but the Transformers paper had not quite been published yet). In my final week of work at UC Riverside, I implemented a deep neural network code, called Approx-net, to gain a more complete understanding. The name is a reference to the fact that artificial neural networks are, in principle, universal approximators. The Approx-net code used stochastic gradient descent and the backpropagation algorithm. I included the ability to output a Graphviz graphical representation of the current state of the network so I could observe the network as it learned. Visualization is only practical for very small networks, of course, but it was an interesting exercise. Here is a blog post about one experiment that I performed using the code, and there is an accompanying animation on YouTube.
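Approx-net was written in C++ and isn't reproduced here; a rough NumPy analogue of the same ingredients (a small fully-connected network trained by backpropagation with stochastic gradient descent, on the classic XOR problem) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network, analogous in spirit to Approx-net (not its actual code).
W1 = rng.normal(0, 0.5, (2, 8))
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1))
b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dataset: XOR, the classic non-linearly-separable example.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 1.0
for step in range(20000):
    i = rng.integers(len(X))                  # stochastic: one sample per update
    x, t = X[i:i + 1], y[i:i + 1]

    # Forward pass.
    h = sigmoid(x @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass (chain rule) for squared-error loss.
    d_out = (out - t) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # SGD parameter updates.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * x.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# Predictions should end up close to [[0], [1], [1], [0]].
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```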
SPHEREx Photo-z
Dates | September 2015 - August 2016 |
Type | Job assignment |
Languages | C++11, Bash, Python | Libraries & Frameworks | OpenMP, NebulOS |


Summary
During the proposal stage of the SPHEREx project, I was asked to improve upon the rough draft of the photometric redshift estimation pipeline simulation code, which was being used to demonstrate what SPHEREx would be capable of achieving cosmologically. For any non-astronomers who might be reading this: photometry is essentially the process of measuring the intensity (flux) of light that passes through an optical filter, and redshift (denoted $z$) is a measure of how significantly the wavelength of light increases between being emitted and being observed, due to relative velocities or gravitational (spacetime geometry) effects.
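Concretely, for a spectral feature emitted at wavelength $\lambda_{\mathrm{emit}}$ and observed at wavelength $\lambda_{\mathrm{obs}}$, the redshift is defined by the standard relation

$$ z = \frac{\lambda_{\mathrm{obs}} - \lambda_{\mathrm{emit}}}{\lambda_{\mathrm{emit}}}. $$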
The code did the following:
- Created simulated images of what SPHEREx would be able to observe, using a catalog of data from the COSMOS field and information about the telescope’s proposed hardware design.
- Measured the fluxes of sources in the simulated observations.
- Estimated the redshift of each source (galaxy / star) in the simulated observations (a simplified sketch of this step is given after the list).
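The redshift estimation was based on fitting a library of spectral templates to the measured fluxes. A minimal chi-squared sketch of that idea (the array layout and the single-amplitude scaling are simplifications, not the pipeline's actual structure):

```python
import numpy as np

def estimate_photo_z(fluxes, flux_errors, template_fluxes, z_grid):
    """Minimal chi-squared template-fitting sketch (illustrative, not the pipeline code).

    fluxes, flux_errors : (n_bands,) measured photometry of one source
    template_fluxes     : (n_templates, n_z, n_bands) template photometry,
                          pre-computed on the redshift grid z_grid
    """
    w = 1.0 / flux_errors**2                      # inverse-variance weights
    best = (np.inf, None, None)                   # (chi2, z, template index)
    for t in range(template_fluxes.shape[0]):
        for i, z in enumerate(z_grid):
            model = template_fluxes[t, i]
            # Best-fit amplitude for this template/redshift (linear least squares).
            amp = np.sum(w * fluxes * model) / np.sum(w * model**2)
            chi2 = np.sum(w * (fluxes - amp * model) ** 2)
            if chi2 < best[0]:
                best = (chi2, z, t)
    chi2_min, z_best, template_best = best
    return z_best, template_best, chi2_min
```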
Many more details are available in this article. I completely re-wrote the initial draft of the code, fixing several algorithmic errors in the process, and achieved a roughly 10× increase in performance per thread. I set up an Amazon Machine Image (AMI) so that the simulation pipeline could be run on top of NebulOS in a virtual cluster within AWS, which further improved performance by using multiple compute nodes. With the improved performance, a much larger library of spectral templates could be used, and the photometric redshift estimation part of the code became more fine-grained, so the redshifts could be determined with greater precision.
Since it was built into an AMI containing NebulOS, the team could easily start up a new virtual cluster in AWS and run a new simulation whenever a significant change was made to the telescope’s design or the survey plan.
NebulOS
Dates | Primarily August, 2014 – May, 2015 |
Type | Job assignment |
Languages | C++11, Python 2.7, Bash | Libraries & Frameworks | |
Summary
NebulOS is a flexible, user-friendly Big Data analysis platform. It can be thought of as a cluster operating system that allows a user to treat a group of Linux systems (e.g., a typical data center) as a single machine. Apache Mesos and the Hadoop Distributed File System (HDFS) act as the OS kernel and file system, respectively. The component that differentiates NebulOS from other Big Data systems is its Mesos-based framework, which allows the user to…
- run pre-existing software on the cluster, without modification.
- easily write monitoring code in any language to examine the standard error and output streams, memory usage, and CPU usage of tasks launched on the system.
- write code which performs actions, based upon the behavior of the individual tasks. For instance, tasks that meet certain user-defined conditions can be terminated and automatically relaunched with modified parameters or modified input data.
Of course, the system is also able to handle node failures seamlessly and it is aware of data locality; tasks are preferentially performed on nodes that contain the greatest amount of relevant data. The user interface is Python-based, so that the user can issue commands interactively or write Python scripts to build more complex analysis routines. More details can be found here.
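NebulOS's real Python interface is described in the draft paper mentioned under Motivation and is not reproduced here. Purely to illustrate the monitor-and-relaunch pattern described above, a hypothetical callback (all names below are invented for illustration) might look like:

```python
from dataclasses import dataclass, field

# Everything below is a hypothetical illustration of the monitoring pattern,
# not NebulOS's actual API.

@dataclass
class TaskStatus:
    args: list = field(default_factory=list)
    seconds_since_last_output: float = 0.0
    memory_usage_gb: float = 0.0

def monitor(task: TaskStatus):
    """Decide what to do with a running task based on its observed behaviour."""
    # Treat a long-silent simulation as hung and relaunch it with modified parameters.
    if task.seconds_since_last_output > 3600:
        return "relaunch", task.args + ["--retry"]
    # Kill runaway tasks before they exhaust a worker node's memory.
    if task.memory_usage_gb > 30:
        return "kill", None
    return "continue", None

print(monitor(TaskStatus(args=["run_sim"], seconds_since_last_output=7200)))
```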
NebulOS can be installed on local hardware and there is also an AWS installation script (within the AMI image) for automatically building a NebulOS cluster of EC2 instances.
Motivation
Researchers at UC Riverside (primarily Miguel A. Aragon-Calvo) needed a Big Data framework to efficiently analyze cosmological simulation data. They desired a fault-tolerant system with high data throughput, capable of being used with existing software. Automated task monitoring was also highly desired, since the system would also be used to perform a large number of simulations with a simulation code prone to hanging. Rather than building a solution that would only work for a specific application, the researchers decided that a general-purpose framework would be more valuable.
Existing tools, such as TORQUE, Hadoop MapReduce, Hama, and Spark, are not ideal for analyzing terabytes or petabytes of scientific data in custom binary formats because these tools are either difficult to use or do not simultaneously allow high data throughput, fault tolerance, and flexibility. Ideally, scientists would like to use pre-existing analysis software so that time, resources, and effort can be spent on doing science rather than writing software. So the framework needed to be able to handle pre-existing software and stay out of the user’s way.
I was hired to develop the framework described above. I evaluated several available technologies and eventually decided to use Apache Mesos and HDFS as the core components of the system. I then began implementing and testing the software and gathering feedback from potential users. A draft paper announcing the software can be found here.
Pretty Parametric Plots
Dates | Primarily February, 2014 – April, 2014 |
Type | Personal project, for fun |
Languages | JavaScript, PHP, HTML5, CSS3, SVG | Libraries & Frameworks | |
Summary
Pretty Parametric Plots is a small web application that generates artistic-looking parametric plots as in-line SVG images. The plotting algorithm adaptively spaces the SVG control points (i.e., Bézier spline control points) along the path so that fine details of the plot are faithfully represented without wasting control points on the regions of low curvature. This allows the web browser’s SVG rendering engine to render high-quality plots quickly. Without adaptive spacing, there is a significant trade-off between quality and rendering time. A slightly more detailed description can be found here.
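The exact point-placement algorithm isn't reproduced here; a generic flatness-driven subdivision sketch in Python conveys the idea of concentrating samples where the curve bends sharply (the real code emits Bézier control points for the SVG path rather than polyline vertices):

```python
import numpy as np

def adaptive_sample(curve, t0, t1, tol=1e-3, max_depth=18):
    """Adaptively choose parameter values so that samples cluster where the curve
    bends sharply.  Generic flatness-based subdivision, not the exact PPP algorithm."""
    def midpoint_error(a, b):
        # How far the curve's midpoint deviates from the chord's midpoint.
        pa, pb = curve(a), curve(b)
        pm = curve(0.5 * (a + b))
        return np.linalg.norm(pm - 0.5 * (pa + pb))

    def recurse(a, b, depth):
        if depth >= max_depth or midpoint_error(a, b) < tol:
            return [b]
        m = 0.5 * (a + b)
        return recurse(a, m, depth + 1) + recurse(m, b, depth + 1)

    return np.array([t0] + recurse(t0, t1, 0))

# Example: a Lissajous-like parametric curve; samples land densely where curvature is high.
curve = lambda t: np.array([np.cos(3 * t), np.sin(5 * t)])
ts = adaptive_sample(curve, 0.0, 2 * np.pi)
points = np.array([curve(t) for t in ts])
print(len(ts), "sample points")
```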
Motivation
While playing with the parametric plotter plugin that comes with Inkscape, I stumbled upon an interesting class of functions whose plots were especially visually pleasing. The quality and performance of the Inkscape plotter (and other plotters) were disappointing, so I wrote a C++ program capable of generating much higher-quality plots and saving them as SVG images. A few days later, I decided to learn JavaScript and ported the C++ code to JavaScript as a learning exercise. In the process, I learned to use jQuery and a few other JavaScript libraries, and I became familiar with the DOM and CSS3.
GSnap
Dates | December, 2011 – September, 2013 |
Type | Ph.D. research tool (and for fun) |
Languages | C++11 | Libraries & Frameworks | |
Summary
GSnap is a tool for analyzing, viewing, and manipulating snapshots from galaxy simulations. GSnap was initially written to measure the velocity dispersion of particles in galaxy simulations, but it is now useful for interactively exploring (rotating and zooming) snapshots, measuring distances between objects (and sizes of objects), interpolating between snapshots, manipulating / editing snapshots, and creating high-quality visualizations of the stars and gas in galaxy simulation snapshots. In addition to a GUI, GSnap offers a powerful command-line interface, which allows the user to operate the program from a script, and a built-in ECMAScript interpreter, which allows the user to extend GSnap’s functionality. Currently, only very specialized GADGET-2 and GADGET-3 type 1 snapshots are supported; however, adding a new file format is a fairly straightforward task. No significant work has been done on the project since September of 2013, when I was finishing up my Ph.D. research. For more information, visit GSnap’s web page. My blog post about improvements to GSnap and the blog post about the interpolation scheme are also helpful.
Refer to the paper, Stellar Velocity Dispersion in Dissipative Galaxy Mergers with Star Formation, for examples of the types of analysis that can be performed with GSnap.
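The velocity-dispersion measurement mentioned above amounts to a mass-weighted second moment of the line-of-sight velocities. A minimal NumPy sketch of that basic quantity (GSnap itself is written in C++; this is only an illustration with random example data, not its code):

```python
import numpy as np

def los_velocity_dispersion(masses, velocities, line_of_sight=(0.0, 0.0, 1.0)):
    """Mass-weighted line-of-sight velocity dispersion of a set of particles.

    masses     : (N,) particle masses
    velocities : (N, 3) particle velocities
    """
    n = np.asarray(line_of_sight, dtype=float)
    n /= np.linalg.norm(n)

    v_los = velocities @ n                        # project onto the line of sight
    w = masses / masses.sum()                     # mass weights
    v_mean = np.sum(w * v_los)                    # mass-weighted mean velocity
    sigma2 = np.sum(w * (v_los - v_mean) ** 2)    # mass-weighted second moment
    return np.sqrt(sigma2)

# Example with random particles (velocities drawn with a 120 km/s spread per axis):
rng = np.random.default_rng(1)
m = rng.uniform(0.5, 1.5, 10_000)
v = rng.normal(0.0, 120.0, (10_000, 3))
print(f"sigma_los = {los_velocity_dispersion(m, v):.1f} km/s")   # should be close to 120
```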
Motivation
I began working on GSnap because I needed to efficiently and automatically analyze thousands of snapshots from galaxy merger simulations as part of my Ph.D. research. No existing software was able to perform the analyses that I needed, so I developed the code myself. I continued adding features to the software for fun after its minimal functionality had been implemented. Most notably, the volume rendering and snapshot interpolation features were irrelevant to my actual research, but they were fun and educational to implement and yielded a lot of nice images.
A gallery of images, plus two more videos
The video below is a demonstration of the cubic spline snapshot interpolation scheme in action. It shows the gas instead of the stars:
The final GSnap video shows a merger from another perspective, with the star particles explicitly shown as points:
PNG Tagger
Dates | Primarily February, 2013 – March, 2013 |
Type | Personal productivity |
Languages | C++11 | Libraries & Frameworks | Qt Framework |
Summary
PNG Tagger is a photo organization tool that allows Facebook-like tags, descriptions, and date information to be stored in the metadata of a PNG image. This allows people to share photos while retaining the tag information, without the need to update a separate database. Furthermore, the photos are viewed and tagged offline, so an Internet connection is not needed. Private photos can remain private because they never need to be uploaded to a public server.
In total, less than two full weeks of work have gone into PNG Tagger, but I may add a few more features eventually. For instance, the program could eventually make it easy to filter photos by date range and to search for keywords, specific people, and locations. At the moment, the only way to do this with images tagged using PNG Tagger is to use a program like pngmeta, along with grep, to do the search from the command line.
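PNG Tagger itself is a C++/Qt application, but the core idea, storing tag data in a PNG's text chunks so that it travels with the file, can be illustrated in a few lines of Python with Pillow (the key names and file paths below are illustrative, not necessarily the ones PNG Tagger uses):

```python
from PIL import Image
from PIL.PngImagePlugin import PngInfo

# Write tag information into the PNG's text chunks (key names are illustrative).
img = Image.open("family_photo.png")
meta = PngInfo()
meta.add_text("People", "Grandma; Uncle Joe")
meta.add_text("Date", "1974-07-04")
meta.add_text("Location", "Virginia")
meta.add_text("Description", "Fourth of July picnic in the backyard.")
img.save("family_photo_tagged.png", pnginfo=meta)

# Read the metadata back; the text chunks stay with the file when it is shared.
tagged = Image.open("family_photo_tagged.png")
print(tagged.text)   # e.g. {'People': 'Grandma; Uncle Joe', 'Date': '1974-07-04', ...}
```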
Motivation
While visiting my family in Virginia during February of 2013, we scanned many family photos. We have thousands of old family photos, and we often have to look at the backs of the prints to see who appears in each photo, as well as its date and location. I wanted an easy way to find all of that information while looking at the scanned images, so I began working on PNG Tagger during my vacation.
Direct N-body
Dates | June, 2010 – August, 2010 |
Type | Ph.D. research |
Languages | C++03 | Libraries & Frameworks | |
Summary
As part of my dissertation research, I wrote an N-body code for studying the dynamics of galaxy mergers. The code performed the following tasks:
- Constructed dynamically stable model galaxies.
- Placed model galaxies on a user-defined collision course.
- Evolved the system of particles forward in time, using an adaptive time-step leapfrog integrator (a simplified sketch follows the list).
- Computed mass-weighted statistics on the stellar dynamics of the system at fixed time intervals. Optionally, a toy model for dust attenuation could be used to perform flux-weighted statistics.
- Generated simple visualizations of the merger from various directions while the statistics were being computed.
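The original code was C++ and isn't reproduced here; a minimal fixed-step, kick-drift-kick leapfrog sketch in Python illustrates the core of the integration step (the actual code used adaptive time-steps; the softening length, units, and example orbit below are arbitrary):

```python
import numpy as np

G = 1.0          # gravitational constant in simulation units
EPS = 0.05       # softening length (arbitrary for this sketch)

def accelerations(pos, mass):
    """Direct-sum softened gravitational accelerations, O(N^2)."""
    diff = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]          # diff[i, j] = r_j - r_i
    dist2 = np.sum(diff**2, axis=-1) + EPS**2
    inv_d3 = dist2**-1.5
    np.fill_diagonal(inv_d3, 0.0)                                  # no self-force
    return G * np.sum(diff * (mass[np.newaxis, :, None] * inv_d3[:, :, None]), axis=1)

def leapfrog_step(pos, vel, mass, dt):
    """One kick-drift-kick leapfrog step (symplectic and time-reversible)."""
    acc = accelerations(pos, mass)
    vel_half = vel + 0.5 * dt * acc          # kick
    pos_new = pos + dt * vel_half            # drift
    acc_new = accelerations(pos_new, mass)
    vel_new = vel_half + 0.5 * dt * acc_new  # kick
    return pos_new, vel_new

# Tiny example: two equal masses on a bound orbit.
pos = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
vel = np.array([[0.0, -0.35, 0.0], [0.0, 0.35, 0.0]])
mass = np.array([1.0, 1.0])
for _ in range(1000):
    pos, vel = leapfrog_step(pos, vel, mass, dt=0.01)
print(pos)
```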
The code is described in more detail in the resulting publication.
Motivation
When I began my Ph.D. research, I did not have access to any of the standard tools for building model galaxies, nor the codes used for analyzing simulations performed by the standard simulation codes. Additionally, I wanted to have a better understanding of how galaxies are modeled and how N-body simulations of galaxy mergers worked, so I wrote my own tools from scratch. It was a great learning experience for both software engineering and astrophysics.
OpenConvection
Dates | January, 2008 – April, 2008 |
Type | Master's degree research |
Languages | MATLAB | Libraries & Frameworks | Built-in MATLAB functionality |



Summary
Given solar wind conditions and a magnetic field model, OpenConvection computes the electrical currents, potential distribution, particle drift velocities, and pressure in the inner magnetosphere and ionosphere. It is an implementation of the algorithm behind the Rice Convection Model, which I generalized to include hemispheric asymmetry caused by seasonal variation.
Motivation
The fact that Earth’s axis of rotation is not perpendicular to its orbital plane causes the conductivity of the northern and southern hemispheres of the ionosphere to differ considerably throughout the year. The effect is similar to the seasonal variation in the average surface temperature. The conductivity changes because UV radiation from the Sun is responsible for ionizing the upper atmosphere; when there are more ions, the conductivity increases. The existing version of the Rice Convection Model did not account for this asymmetry and the authors were unwilling to share their source code, so I was forced to implement the algorithm myself, based on the published descriptions.