Seattle, Washington, United States
Previously, as a Research Scientist at the Gates Foundation, I focused on empowering global health scientists to create the most up-to-date and accurate scientific datasets to guide decisions and modeling. Our team, Extralit Labs, is currently changing how scientific knowledge is extracted, with an AI-assisted tool that accurately extracts datasets from research papers to create a comprehensive database. By open-sourcing this tool, I am committed to accessibility and collaboration in the scientific community. Coming from a machine learning and software engineering background, I develop and deploy machine learning and deep learning models, data pipelines, and application logic on optimized infrastructure. During my PhD, my research focused on graph and NLP algorithms for large-scale bioinformatics challenges. I’m interested in where machine learning can be applied to uncover insights in unstructured and heterogeneous datasets, ultimately contributing to the fight against diseases such as cancer and Alzheimer's.
Building evidence-grounded Clinical AI
Developing an end-to-end solution for 3x faster and more accurate data extraction from scientific literature (e.g., from full-page tables in PDFs) with LLMs. GitHub: https://github.com/extralit/extralit
Open-sourcing an AI-assisted data extraction tool to create databases from unstructured scientific literature.
• Collaborated with statisticians, bioinformaticians, and genomics scientists to apply ML to quality control (QC) of sequencing pipelines for clinical impact. • Architected a pipeline to harmonize 5 datasets, simulate 3 sequencing parameters, and extract hundreds of custom features from TBs of unstructured genomics data using Python & Spark. Optimized it with dynamic programming, saving over 36% of HPC runtime and tens of TBs.
• Maintained data processing pipelines to compute order processing dates by developing ETL code on IBM DataStage and scheduling automated jobs on enterprise Linux servers. • Developed a general-purpose data visualization tool for business reports using Bokeh and Python to interface with Hadoop and Hive using SQL queries.