Projects
Three projects that show how I think about problems — from building production infrastructure, to designing algorithms for subtle signals, to leading teams through ambiguous questions.
Open Nova Catalog
Classical novae — thermonuclear explosions on the surfaces of white dwarf stars — produce enormous amounts of observational data spread across dozens of public archives, each with different formats, conventions, and quirks. Researchers who want to study these events spend months manually collecting and reconciling data before they can do any actual science.
I'm building a serverless data platform that solves this problem end-to-end: automated ingestion from public archives, validation and identity resolution across heterogeneous sources, and a clean frontend for exploring the results.
The system runs entirely on AWS, built with CDK in Python. The backend uses 17 Lambda functions and 7 Step Functions workflows to orchestrate ingestion, validation, and artifact generation. Data lands in DynamoDB (single-table design), with processed products served through S3 and CloudFront. The ingestion pipeline handles coordinate-based deduplication, profile-driven FITS validation, SHA-256 fingerprinting, and explicit quarantine semantics for irreconcilable conflicts between sources.
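Two of those ingestion steps can be sketched in a few lines. This is an illustrative toy, not the production code: the record shape, the 2-arcsecond match radius, and the helper names are all assumptions made for the example.

```python
import hashlib
import json
import math

MATCH_RADIUS_ARCSEC = 2.0  # illustrative threshold, not the production value


def fingerprint(record: dict) -> str:
    """Stable SHA-256 fingerprint of a source record (canonical JSON, sorted keys)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def angular_separation_arcsec(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation between two sky positions (degrees in, arcseconds out)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: numerically stable for the small separations that matter here.
    d_ra, d_dec = ra2 - ra1, dec2 - dec1
    a = math.sin(d_dec / 2) ** 2 + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600.0


def is_duplicate(candidate: dict, existing: list[dict]) -> bool:
    """Coordinate-based dedup: a candidate within the match radius of a known source."""
    return any(
        angular_separation_arcsec(candidate["ra"], candidate["dec"], e["ra"], e["dec"])
        <= MATCH_RADIUS_ARCSEC
        for e in existing
    )
```

The fingerprint is computed over canonicalized JSON so that two archives reporting the same record with different key ordering still hash identically.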
I wrote 30+ architectural decision records before writing code — covering everything from identity resolution strategy to an immutable release model where all data products are written to a new S3 prefix before an atomic pointer update makes them visible. Rollback is a single JSON write.
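The release model is easiest to see in miniature. In this sketch an in-memory dict stands in for S3 and the key layout is hypothetical, but the pattern is the one described above: every artifact lands under a fresh immutable prefix, and a single small JSON pointer write publishes a release, or rolls it back.

```python
import json


class ReleaseStore:
    """In-memory stand-in for an S3 bucket, illustrating the immutable release pattern.

    In production each write below would be an S3 PutObject call; readers always
    resolve the pointer object first, so they never see a half-written release.
    """

    POINTER_KEY = "releases/current.json"  # hypothetical pointer object

    def __init__(self) -> None:
        self.objects: dict[str, str] = {}

    def publish(self, release_id: str, artifacts: dict[str, str]) -> None:
        # 1. Write every artifact under a brand-new, never-overwritten prefix.
        for name, body in artifacts.items():
            self.objects[f"releases/{release_id}/{name}"] = body
        # 2. Atomic pointer update: one JSON write makes the release visible.
        self.objects[self.POINTER_KEY] = json.dumps({"release_id": release_id})

    def rollback(self, release_id: str) -> None:
        # Rollback is the same single JSON write, aimed at an older prefix.
        self.objects[self.POINTER_KEY] = json.dumps({"release_id": release_id})

    def read(self, name: str) -> str:
        current = json.loads(self.objects[self.POINTER_KEY])["release_id"]
        return self.objects[f"releases/{current}/{name}"]
```

Because old prefixes are never mutated, any prior release remains a valid rollback target for as long as it is retained.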
The frontend is React/Next.js with interactive Plotly.js visualizations: spectra waterfall plots, multi-regime light curves, and density-preserving downsampled time series. Every service boundary uses Pydantic models with mypy strict enforcement. End-to-end smoke tests run against live AWS infrastructure.
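One standard approach to density-preserving downsampling is largest-triangle-three-buckets (LTTB), which keeps the points that contribute most to the visual shape of the curve. The sketch below is a generic illustration of that algorithm, not the project's actual implementation.

```python
def lttb_indices(x: list[float], y: list[float], n_out: int) -> list[int]:
    """Largest-Triangle-Three-Buckets: pick n_out indices that preserve the
    visual shape (peaks, dips) of a long series when plotted."""
    n = len(x)
    if n_out >= n or n_out < 3:
        return list(range(n))
    bucket = (n - 2) / (n_out - 2)  # interior points split into n_out - 2 buckets
    kept = [0]                       # always keep the first point
    a = 0                            # index of the most recently kept point
    for i in range(n_out - 2):
        start = int(1 + i * bucket)
        end = int(1 + (i + 1) * bucket)
        nxt_end = min(int(1 + (i + 2) * bucket), n)
        # Average of the *next* bucket acts as the third triangle vertex.
        avg_x = sum(x[end:nxt_end]) / (nxt_end - end)
        avg_y = sum(y[end:nxt_end]) / (nxt_end - end)

        def area(j: int) -> float:
            # Twice the triangle area formed by the last kept point,
            # candidate j, and the next bucket's centroid.
            return abs((x[a] - avg_x) * (y[j] - y[a]) - (x[a] - x[j]) * (avg_y - y[a]))

        best = max(range(start, end), key=area)
        kept.append(best)
        a = best
    kept.append(n - 1)               # always keep the last point
    return kept
```

Narrow spikes produce large triangles, so they survive downsampling, which is exactly the property you want when rendering a light curve with transient events.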
Time-Series Signal Analysis & Algorithm Validation
Much of my scientific work involves extracting subtle periodic signals from noisy, nonstationary data — observations where the thing you're looking for is buried under instrumental artifacts, evolving baselines, and environmental drift.
I designed piecewise time-series algorithms to tackle this: establishing pre-event baselines using Lomb–Scargle periodograms, building adaptive foreground models to subtract dominant transient contributions, and applying session-specific normalization to control systematic variance. I incorporated acquisition-condition covariates using a reference channel to correct environment-induced periodic artifacts, enabling reliable comparison of pre- and post-event periodic structure.
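The baseline-recovery step can be illustrated with a toy example: an unevenly sampled sinusoid with noise and a slow drift, detrended and scanned with a Lomb-Scargle periodogram. The signal parameters and the linear detrend are stand-ins chosen for the example, far simpler than the adaptive foreground models described above.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)

# Unevenly sampled observations: a 0.5 Hz sinusoid plus noise and a slow
# drift, a toy stand-in for the pre-event baseline data described above.
t = np.sort(rng.uniform(0.0, 20.0, 300))
y = np.sin(2 * np.pi * 0.5 * t) + 0.05 * t + 0.3 * rng.standard_normal(t.size)

# Crude detrend (a stand-in for the adaptive foreground model): subtract a
# linear fit so the drift does not leak power into low frequencies.
y_detrended = y - np.polyval(np.polyfit(t, y, 1), t)

# Lomb-Scargle handles uneven sampling directly; scipy expects angular
# frequencies, so convert the Hz grid with a factor of 2*pi.
freqs_hz = np.linspace(0.05, 2.0, 2000)
power = lombscargle(t, y_detrended, 2 * np.pi * freqs_hz)

best_hz = freqs_hz[np.argmax(power)]  # recovered baseline periodicity
```

Running the same scan on pre- and post-event segments, after session-specific normalization, is what allows the periodic structure on either side of the event to be compared on equal footing.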
The methodology was validated by demonstrating consistent recovery of baseline periodicity and by quantifying statistically significant changes in post-event periodic structure — the kind of rigorous algorithm validation that translates directly to biosignal processing, sensor fusion, or any domain where you need to find a real signal in messy data.
Applied Machine Learning Portfolio
Over two years, I designed and led a project-based machine learning course where student teams tackled real-world problems end-to-end: scoping the question, sourcing and cleaning data, selecting and validating models, and delivering reproducible results. Thirty projects across two cohorts, spanning regression, deep learning, cluster analysis, transformers, and exploratory data analysis.
My role was part technical lead, part project manager — translating vague problem statements into concrete implementation plans, guiding teams through feature engineering and model evaluation, and holding the bar on reproducibility. The range of domains (from climate data to medical imaging to text classification) forced constant context-switching and a strong emphasis on methodology over domain-specific tricks.