Developer in Statistical Metagenomics with the Penn-CHOP Microbiome Program (PCMP)

What is Metagenomics?

If genetics is the study of individual genes and genomics is the study of the collections of genes that make up individual genomes, metagenomics is the study of populations of genomes, specifically populations that span multiple species, genera, families, and so on. The human gut microbiome, the main focus of my work, is one such population.

What exactly do you do?

I'm the only designated software engineer in the PCMP; others in the program range from pure wet lab scientists to PhD bioinformaticians. At a high level, my job is to set standards for software across the program and ensure everything meets those standards. In practice, that also means I am responsible for the bulk of the implementation and refactoring of the software behind published work. On top of that, I've taken on architecting, building, and deploying internal services such as preprocessing automation, compute resources, and utility services.

What stack do you use?

I work on a lot of different projects, so I jump between a bunch of tech stacks. Python is the most common language, followed by R. For pipelines I typically use Snakemake, a Python workflow DSL, and Prefect, a Python automation framework. Compute is split between in-house HPCs and AWS. Version control is split between Enterprise GitHub and GitHub. CI/CD is largely implemented through GitHub Actions.

How do you use ML/AI?

Machine learning and, to a growing degree, generative artificial intelligence are hugely important in bioinformatics. A whole host of deep learning tools are integrated into our pipelines, and downstream analyses commonly involve unsupervised classification techniques. I have also begun to integrate self-hosted agents and knowledge bases into pipelines, but the administrative hurdles to deploying those in a research hospital are significant.

Can I see?

Yes! Most of my work is open source and I've listed out projects I've worked on below.

Metagenomics Pipelines

Most of the work I do with the PCMP involves taking sequencing data in one form and transforming it into another. Doing this in a way that is well documented, reproducible, and performant is critical for being able to publish the results of downstream analysis.

Sunbeam: a robust, extensible metagenomic sequencing pipeline

A pipeline written in Snakemake that simplifies and automates many of the steps in metagenomic sequencing analysis. It uses Conda and Docker to manage dependencies and can be deployed on most Linux workstations and clusters. It is easily extensible to perform further analyses. To read more, check out the paper in Microbiome.

AutoBfx: An automation framework for metagenomic analysis

AutoBfx is starting life focused on a different part of the metagenomic sequencing data's journey than Sunbeam, automating the most common steps from directly off the machine through to early analysis. It is built on Prefect and uses an event-driven architecture to automatically process new runs off the sequencers, interoperating with SLURM for computationally intensive jobs. It provides a central, fault-tolerant data pipeline (with dashboard), which frees up bioinformaticians' valuable time for more important tasks further down the analytical pipeline. The codebase is currently on Enterprise GitHub.

Autobfx: A modern, extensible metagenomic analysis pipeline

You're not seeing double: I started an offshoot of the above AutoBfx to tackle the backend of the metagenomic analysis process. Like Sunbeam, this tool is extensible, reproducible, and simple to use. Unlike Sunbeam, it solves some of the main frustrations with the strictness of Snakemake without sacrificing reproducibility or reusability. Autobfx allows users to run pre-made pipelines, define their own using existing components, or create plugins to integrate new tools into the ecosystem. It can interoperate with numerous compute and storage options, provides an interactive dashboard for monitoring jobs, and is installable with pip.

sbx_assembly

A Sunbeam extension for assembly of contigs using Megahit, gene annotation using Prodigal, and annotation using BLAST and DIAMOND. It can also map reads to contigs and calculate per-base coverage using Minimap2 and samtools.

sbx_coassembly

A Sunbeam extension to perform co-assembly of reads from arbitrary groups of samples from a given project using Megahit.

sbx_mapping

A Sunbeam extension for mapping reads against reference genomes using bwa and performing custom filtering.

sbx_kraken

A Sunbeam extension for taxonomic assignment of reads to databases using Kraken. You can get pre-built Kraken databases at the Kraken homepage or build your own, then specify the path to your database in the config.

sbx_demic

A Sunbeam extension for estimating bacterial growth rates via peak-to-trough ratios (PTRs) using DEMIC. To prepare inputs for DEMIC, reads are first assembled using Megahit, then binned by inferred genome using MaxBin2, after which reads are mapped back onto contigs using Bowtie2 and Samtools.

sbx_gene_clusters

A Sunbeam extension for read-level alignment to gene clusters of interest, e.g. the bai operon or butyrate-producing genes.

sbx_genome_assembly

A Sunbeam extension for de novo microbial genome assembly. This pipeline uses SPAdes for single genome assembly and CheckM and read map coverage of the assembled genome for quality assessment (sbx_SPARC rules). In addition, this pipeline uses hmmer to identify SCCGs (sbx_SCCG rules).

sbx_virus_id

A Sunbeam extension that identifies, filters, and annotates viruses from metagenomic samples. It can be configured to use several different assemblers and virus identification tools.

sbx_marker_magu

A Sunbeam extension for detecting and quantifying phages, bacteria, and archaea in whole genome shotgun reads from human-derived samples using Marker-MAGu.

sbx_mgv

A Sunbeam extension for classifying viral sequences using MGV.

sbx_seeker

A Sunbeam extension for discriminating between bacterial and phage sequences using the alignment-free Seeker deep learning algorithm.

q2-unassigner

A QIIME2 plugin for evaluating the closeness of 16S rRNA marker gene sequences to named bacterial species using unassigner.

WFRCWF

Still in the early stages of development, WFRCWF is a pipeline DSL that will hopefully fix some of our biggest pet peeves with Snakemake and Nextflow. The basic concept is to adopt Nextflow's DAG building paradigm (explicit declaration; no solvers) but simplify it, connect it more strongly to the filesystem, and write it in Python (Python and R are far more popular in the bioinformatics space than Java and its derivatives). NOTE: See Autobfx above for the most recent developments on this front.

Statistical Packages

Some of the cleanest and most straightforward work I do is helping researchers who are developing a new statistical method create a well-packaged software solution to accompany publication. Statistical packages that are easy to install and use help drive citations for the paper.

DEMIC: dynamic estimator of bacterial growth rates

A dynamic estimator of microbial communities. It employs a multi-sample algorithm based on contigs and coverage values to infer the relative distances of contigs from the replication origin and to accurately compare bacterial growth rates between samples.
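
To make the growth-rate idea concrete, here is a minimal sketch of the final step only: turning per-contig coverage into a peak-to-trough ratio. It is not DEMIC's implementation. It assumes the relative distance of each contig from the replication origin is already known (0 at the origin, 1 at the terminus) and that log2 coverage falls off linearly with that distance; the function name `estimate_ptr` is made up for illustration.

```python
from typing import Sequence


def estimate_ptr(distance: Sequence[float], log2_coverage: Sequence[float]) -> float:
    """Estimate a peak-to-trough ratio from a least-squares line.

    Fits log2_coverage = intercept + slope * distance, where distance is 0 at
    the replication origin and 1 at the terminus (an assumption of this sketch).
    The ratio of fitted coverage at the origin to fitted coverage at the
    terminus is then 2 ** (-slope).
    """
    n = len(distance)
    if n < 2 or n != len(log2_coverage):
        raise ValueError("need at least two points and equal-length inputs")

    mean_x = sum(distance) / n
    mean_y = sum(log2_coverage) / n
    sxx = sum((x - mean_x) ** 2 for x in distance)
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, log2_coverage))

    slope = sxy / sxx
    return 2.0 ** (-slope)
```

A steeper drop in coverage from origin to terminus gives a larger ratio, which is read as faster replication.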

unassigner

Evaluates consistency with named bacterial species for short 16S rRNA marker gene sequences. This is achieved by ruling out taxonomic bins for each input 16S sequence rather than by trying to assign them to specific bins, avoiding certain pitfalls of the assignment approach such as disparate matching bins.

ZIBR: Zero-Inflated Beta Random Effects Model

A two-part zero-inflated Beta regression model with random effects for testing the association between microbial abundance and clinical covariates for longitudinal microbiome data.
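
A rough sketch of the two-part structure, with notation chosen here for illustration (check the ZIBR paper for the exact formulation): one part models whether a taxon is present at all, the other models its relative abundance when it is present, and each part gets its own subject-level random intercept.

```latex
% Y_{it}: relative abundance of a taxon for subject i at time point t
% X_{it}, Z_{it}: covariate vectors for the presence and abundance parts
\[
Y_{it} \sim
\begin{cases}
  0 & \text{with probability } 1 - p_{it}, \\
  \mathrm{Beta}\bigl(u_{it}\phi,\ (1 - u_{it})\phi\bigr) & \text{with probability } p_{it},
\end{cases}
\]
\[
\operatorname{logit}(p_{it}) = a_i + X_{it}^{\top}\alpha, \qquad a_i \sim N(0, \sigma_1^2),
\]
\[
\operatorname{logit}(u_{it}) = b_i + Z_{it}^{\top}\beta, \qquad b_i \sim N(0, \sigma_2^2).
\]
```

Testing whether a clinical covariate is associated with the taxon then amounts to testing the corresponding entries of \alpha and \beta.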

DAFOT: Detector of Active Flow On a Tree

The main goal of this package is to provide a new method for two-sample testing for microbial compositional data by leveraging the phylogenetic tree information. Empirical evidence from real data sets suggests that the phylogenetic microbial composition difference between two populations is usually sparse. Motivated by this observation, this package implements a new maximum type test that is particularly powerful against sparse phylogenetic composition differences and enjoys certain optimality.

Utility Libraries

Some methods just aren't worth publishing, whether they're too mundane or too unique to bother. But these methods still need to be reliable and easy to use internally.

ShotgunUniFrac

A dual-use program for downloading and extracting genes from NCBI and for creating phylogenetic trees for many marker genes and merging the results into one. This was my first from-scratch project for the PCMP.

PyCov3

A package for generating cov3 files from SAM files, which provide coverage information, and a FASTA file of binned contigs. Cov3 files are used as input for the DEMIC R package, which calculates PTR, an estimate of bacterial growth rate.

primertrim

Detect short primer sequences in FASTQ reads and trim the reads accordingly.
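
The general idea can be sketched in a few lines. This is an illustration, not primertrim's code or API: the function names are hypothetical, matching is a simple sliding window with a mismatch budget, and the choice to drop the primer plus everything after it is an assumption of the sketch.

```python
from typing import Optional, Tuple


def find_primer(seq: str, primer: str, max_mismatches: int = 0) -> Optional[int]:
    """Return the start index of the first window that matches the primer.

    A window matches when it differs from the primer at no more than
    max_mismatches positions. Returns None when no window matches.
    """
    n = len(primer)
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        mismatches = sum(1 for a, b in zip(window, primer) if a != b)
        if mismatches <= max_mismatches:
            return start
    return None


def trim_read(
    seq: str, qual: str, primer: str, max_mismatches: int = 0
) -> Tuple[str, str]:
    """Cut the read at the primer, keeping only the bases before it.

    Assumption of this sketch: the primer and everything after it are
    removed. Reads without a detectable primer are returned unchanged.
    """
    start = find_primer(seq, primer, max_mismatches)
    if start is None:
        return seq, qual
    return seq[:start], qual[:start]
```

Sequence and quality strings are trimmed together so the record stays a valid FASTQ entry.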

DNAbc

Identify DNA barcodes in FASTQ data files and write demultiplexed data.
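
Demultiplexing boils down to sorting reads into per-sample bins by barcode. The sketch below is illustrative only and is not DNAbc's implementation: it assumes the barcode sits at the start of each read and must match exactly, and the names `demultiplex` and `"unassigned"` are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, str]  # (read id, sequence, quality string)


def demultiplex(
    reads: Iterable[Read], sample_barcodes: Dict[str, str]
) -> Dict[str, List[Read]]:
    """Group reads by sample using an exact match on a leading barcode.

    sample_barcodes maps sample name -> barcode sequence. The barcode is
    stripped from matching reads; reads that match no barcode are collected
    under the key "unassigned".
    """
    bins: Dict[str, List[Read]] = defaultdict(list)
    for read_id, seq, qual in reads:
        sample = "unassigned"
        for name, barcode in sample_barcodes.items():
            if seq.startswith(barcode):
                sample = name
                # Remove the barcode from both sequence and quality.
                seq, qual = seq[len(barcode):], qual[len(barcode):]
                break
        bins[sample].append((read_id, seq, qual))
    return dict(bins)
```

Writing the demultiplexed data is then a matter of emitting one FASTQ file per key.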

heyfastq

A Python library for reading, writing, and applying common mutations to FASTQ files.
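
For a sense of what this kind of library does, here is a small standard-library sketch of reading four-line FASTQ records and applying one simple transformation. It does not use or reproduce heyfastq's API; `parse_fastq` and `trim_to_length` are hypothetical names for illustration.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (read id, sequence, quality string)


def parse_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        yield header[1:].rstrip("\n"), seq, qual


def trim_to_length(records: Iterable[Record], length: int) -> Iterator[Record]:
    """Truncate every read (and its quality string) to at most `length` bases."""
    for read_id, seq, qual in records:
        yield read_id, seq[:length], qual[:length]


if __name__ == "__main__":
    data = "@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\nIIIIII\n"
    for record in trim_to_length(parse_fastq(io.StringIO(data)), 3):
        print(record)
```

Because both functions are generators, transformations can be chained without loading a whole file into memory.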

Services

I've built and modified some services for tracking and storing sequencing data as it passes through the PCMP.

PCMP Metadata Checker

A Flask app used by bioinformaticians and collaborators to ensure their metadata meet the standards of the PCMP. It maintains a database of projects and their metadata sheet(s). I have experimented with numerous deployment methods for this app, including PythonAnywhere, a dedicated CHOP server with nginx, a serverless app on AWS (similar to this site), and, the current method, a containerized microservice in a CHOP-hosted Kubernetes cluster. Unfortunately for you, it is hosted on Enterprise GH to take advantage of hosted CI/CD runners, and its domain is restricted to CHOP systems. But you can easily run a dev version with a test database; instructions are in the README.

PCMP Sample Registry

A Flask app used by the CHOP Microbiome Center to track every sample they sequence. A Python library enables various mutations on the registry, while the app provides data views with compiled statistics. It is deployed to a Kubernetes cluster and interacts with a CHOP-hosted PostgreSQL database via SQLAlchemy. The actual site is only available on CHOP's intranet, but instructions for setting up your own demo site and database are provided in the README.

DevOps Tools

I've developed some tools for improving bioinformatics software development. Many of them revolve around the Sunbeam ecosystem, particularly creating and testing new extensions.

Conda Env Check: a Conda environment health check workflow

A GitHub Action that provides a number of health metrics for Conda environments, including verifying that an environment can be created, verifying that any pinned versions are retrievable, and checking versions against the latest and solved versions.
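
One small piece of such a check is working out which dependencies are pinned in the first place. The sketch below is not the Action's code: it assumes dependencies arrive as plain Conda spec strings (as in the `dependencies` list of an environment file), treats only `name=version` and `name==version` (optionally with a build string) as pins, and uses a hypothetical function name.

```python
import re
from typing import Dict, List

# name=version or name==version, optionally followed by =build.
# Ranges (>=, <), wildcards (1.*), and bare names are not treated as pins.
_PIN = re.compile(
    r"^(?P<name>[A-Za-z0-9_.\-]+)={1,2}(?P<version>[^=<>!*,\s]+)(=\S+)?$"
)


def pinned_versions(dependencies: List[str]) -> Dict[str, str]:
    """Map package name -> exact version for every exactly pinned spec."""
    pins: Dict[str, str] = {}
    for spec in dependencies:
        match = _PIN.match(spec.strip())
        if match:
            pins[match.group("name")] = match.group("version")
    return pins


if __name__ == "__main__":
    deps = ["python=3.11", "snakemake>=7", "samtools==1.17", "megahit"]
    print(pinned_versions(deps))
```

Each pinned name and version could then be looked up against the configured channels to confirm it is still retrievable.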

sbx_test_action

A GitHub Action that manages setup of the core Sunbeam pipeline for testing extensions in a CI environment. It automatically handles branching, forking, and user information to always test the correct version of any given extension.

sbx_template

A template repository that allows for quick and easy creation of new Sunbeam extensions. It includes workflows to automatically fill in the new extension name and organization in many files.