Greater Vancouver Metropolitan Area
Applied Data Scientist and Bioinformatician with over 7 years of experience in biomedical research, specializing in reproducible data analytics workflows and software tools that drive measurable impact across genomics and population health research. Proven track record of building end-to-end pipelines that transform raw sequencing and epidemiological data into analysis-ready outputs, and applying ML, AI, and statistical methods to uncover meaningful insights and expedite data delivery. 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞𝐬 & 𝐓𝐨𝐨𝐥𝐬: Python, R, SQL | 𝐌𝐋/𝐀𝐈: PyTorch, Scikit-learn, Transformers, HuggingFace LLMs, and others | 𝐁𝐢𝐨𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐜𝐬: BWA, STAR, DESeq2, QIIME, samtools, bamtools, and others | 𝐈𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞: AWS, HPC, Docker, Git, CWL, and Snakemake.
𝐌𝐚𝐧𝐚𝐠𝐞𝐝 𝐭𝐡𝐞 𝐝𝐚𝐭𝐚 𝐨𝐩𝐞𝐫𝐚𝐭𝐢𝐨𝐧𝐬 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐒𝐞𝐪𝐮𝐞𝐧𝐜𝐢𝐧𝐠 𝐚𝐧𝐝 𝐁𝐢𝐨𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐜𝐬 𝐂𝐨𝐧𝐬𝐨𝐫𝐭𝐢𝐮𝐦 (𝐒𝐁𝐂) 𝐜𝐨𝐫𝐞 𝐬𝐞𝐪𝐮𝐞𝐧𝐜𝐢𝐧𝐠 𝐟𝐚𝐜𝐢𝐥𝐢𝐭𝐲, 𝐢𝐧𝐜𝐥𝐮𝐝𝐢𝐧𝐠 𝐝𝐚𝐭𝐚 𝐦𝐚𝐧𝐚𝐠𝐞𝐦𝐞𝐧𝐭, 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐢𝐧𝐠, 𝐪𝐮𝐚𝐥𝐢𝐭𝐲 𝐜𝐨𝐧𝐭𝐫𝐨𝐥 𝐚𝐧𝐝 𝐚𝐧𝐚𝐥𝐲𝐬𝐢𝐬 𝐨𝐟 𝐍𝐆𝐒 𝐚𝐧𝐝 𝐒𝐚𝐧𝐠𝐞𝐫 𝐝𝐚𝐭𝐚. 𝐊𝐞𝐲 𝐀𝐜𝐡𝐢𝐞𝐯𝐞𝐦𝐞𝐧𝐭𝐬: ❇️ Delivered 30+ next-generation sequencing (NGS) projects, including DNA, RNA, and amplicon sequencing, using Python, R, Bash, bioinformatics libraries, and Snakemake workflows deployed on a high-performance computing (HPC) cluster. ❇️ Collaborated extensively with wet lab scientists to identify suitable data analysis techniques for each project and to troubleshoot data quality issues during data generation. ❇️ Managed all data-related operations of the facility, including general data management, supervision of junior bioinformatics staff, and integration of new sequencing instruments into existing workflows.
𝐋𝐞𝐝 𝐝𝐚𝐭𝐚 𝐚𝐧𝐚𝐥𝐲𝐭𝐢𝐜𝐬 𝐟𝐨𝐫 𝐂𝐚𝐧𝐚𝐝𝐢𝐚𝐧 𝐂𝐇𝐈𝐋𝐃 𝐂𝐨𝐡𝐨𝐫𝐭 𝐒𝐭𝐮𝐝𝐲 𝐩𝐫𝐨𝐣𝐞𝐜𝐭𝐬. 𝐊𝐞𝐲 𝐀𝐜𝐡𝐢𝐞𝐯𝐞𝐦𝐞𝐧𝐭𝐬: ❇️ Reduced manual curation time by 66% after developing an AI-based recommendation system to suggest the most suitable ontology terms to assign to 1500+ free-text survey responses from study participants. Used Python libraries (e.g., PyTorch, Pandas, Transformers, Scikit-learn, Matplotlib, and others) and HuggingFace open-source LLMs. ❇️ Reduced data processing times by 77% after developing an AI-based data extraction tool that processed images of scanned medical laboratory forms, extracted key information, and organized it into an analysis-ready format. Used Python libraries (e.g., PyTorch, Pandas, Transformers, Scikit-learn, Matplotlib, PaddleOCR, and others) and a combination of open-source multimodal LLMs and traditional OCR techniques in an HPC environment. ❇️ Investigated possible associations between mothers' exposures during pregnancy and their children's health outcomes by applying statistical and machine learning methods. Used Python, R, SQL, and data science libraries (manuscript under review).
MBB342 (taught 3 times): Led bioinformatics labs for 30+ students in an introductory genomics and bioinformatics course. Students gained hands-on experience with relevant software and databases. Revised existing labs and created new ones to modernize course content. BISC102 (taught 3 times): Led tutorial sessions for 40+ students in an introductory biology course.
𝐂𝐨𝐧𝐭𝐫𝐢𝐛𝐮𝐭𝐞𝐝 𝐭𝐨 𝐝𝐚𝐭𝐚 𝐜𝐨𝐧𝐬𝐨𝐥𝐢𝐝𝐚𝐭𝐢𝐨𝐧 𝐞𝐟𝐟𝐨𝐫𝐭𝐬 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐎𝐯𝐚𝐫𝐢𝐚𝐧 𝐂𝐚𝐧𝐜𝐞𝐫 𝐑𝐞𝐬𝐞𝐚𝐫𝐜𝐡 𝐂𝐞𝐧𝐭𝐞𝐫 (𝐎𝐕𝐂𝐀𝐑𝐄). 𝐊𝐞𝐲 𝐀𝐜𝐡𝐢𝐞𝐯𝐞𝐦𝐞𝐧𝐭𝐬: ❇️ Expedited the delivery of custom data requests to various ovarian cancer research projects by developing three extract, transform, and load (ETL) pipelines that consolidated clinical records from multiple sources and data formats into a central SQL database. Used R, SQL, and Git for version control. ❇️ Enabled researchers across multiple ovarian cancer studies and non-technical stakeholders to independently track and explore data stored in a central SQLite database by creating an interactive PowerBI dashboard.
𝐌𝐚𝐧𝐚𝐠𝐞𝐝 𝐭𝐡𝐞 𝐜𝐨𝐥𝐥𝐞𝐜𝐭𝐢𝐨𝐧, 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐢𝐧𝐠, 𝐚𝐧𝐝 𝐢𝐧𝐭𝐞𝐠𝐫𝐚𝐭𝐢𝐨𝐧 𝐨𝐟 𝐝𝐚𝐭𝐚 𝐚𝐧𝐝 𝐦𝐞𝐭𝐚𝐝𝐚𝐭𝐚 𝐟𝐨𝐫 𝐭𝐡𝐞 𝟒𝐃𝐍𝐮𝐜𝐥𝐞𝐨𝐦𝐞 𝐏𝐫𝐨𝐣𝐞𝐜𝐭. 𝐊𝐞𝐲 𝐀𝐜𝐡𝐢𝐞𝐯𝐞𝐦𝐞𝐧𝐭𝐬: ❇️ Streamlined analysis and visualization of thousands of next-generation sequencing (NGS) experiments through a web data platform by deploying over a dozen automated data processing pipelines using Python, bioinformatics libraries, AWS (S3 and EC2), Bash scripting, Docker, CWL, and Git for version control. ❇️ Facilitated downstream data analysis by identifying critical genomic regions of chromatin contact through PCA and other statistical methods applied to chromatin contact maps. Used Python, Jupyter Notebooks, and scientific computing libraries (e.g., Pandas, NumPy, Matplotlib).
𝐊𝐞𝐲 𝐀𝐜𝐡𝐢𝐞𝐯𝐞𝐦𝐞𝐧𝐭𝐬: ❇️ Identified a potential aging biomarker candidate by performing RNA-seq data analysis on human blood samples. Used R and bioinformatics tools (e.g., STAR, DESeq2).