New York, New York, United States
I was a Software Engineer on the bioinformatics R&D Database Engineering team, where I designed, built, and maintained data aggregation and sample tracking platforms on AWS for oncology and clinical research studies. Our team led the data aspects of collaborations with multinational partners, and our platforms enabled researchers to discover new disease categories, diagnoses, therapies, and genomic linkages to improve healthcare, disease prevention, and treatment. My main focus was designing and delivering clinical applications, including databases, ETL pipelines for scientific, healthcare, and digital device data, and web portals, all hosted in the cloud.
Working on a healthcare news recommendation engine for healthcare professionals (HCPs).
• Built and maintained an oncology data extraction engine that uses text mining and NLP techniques to convert tens of terabytes of semi-structured physician diagnosis notes and Electronic Medical Records data into structured data under predefined data models.
• Created ELT pipelines that extract cancer genomic test data from GCP Cloud SQL, load it into Redshift, and apply the data extraction engine. Served model results to the MySQL database that powers the Sema4 intelligent oncology analytics platform, and archived historical data to S3 Glacier for cost optimization.
• Architected a data security system for Protected Health Information (PHI) together with DevOps and IT: network isolation using a private VPC and a bastion host, data encryption using AES-256 and KMS, and access monitoring using detection scripts that scan for unauthorized access and inappropriate data permissions.
• Served as scrum master for the team: planned and drove backlog grooming, sprint planning, and daily stand-ups; conducted sprint reviews and retrospectives; worked with stakeholders to maintain the backlog and ensure stories were well defined and accurately pointed; and tracked sprint metrics, including velocity and burndown charts.
• Used text analytics (NLP) methods to uncover shifts in firms' exposure to systematic risk by analyzing the risk disclosures (Form 10-K, Item 1A) of public companies over the past 12 years.
• Processed 5,000+ daily user search logs from the SEC website on the Blue Hive cluster using Python scripts.
Business Intelligence Analyst Jun. 2018–Aug. 2018
• Built a data pipeline from Hive to Tableau to support product managers by tracking overseas advertising revenue and relative performance using ad KPI metrics (CPM, CTR, CVR, ACP, etc.).
• Designed an online A/B test to compare the performance of two advertisement recommendation systems and explained the underlying reasons for the 19% gap in ARPU (Average Revenue Per User).