New York City Metropolitan Area
Hi there! I am a recent first-generation graduate with a bachelor's degree in Mathematics and a minor in Economics. As a software engineer, I am committed to the field of computer science and continually strive to improve my skills. My work ethic, sense of responsibility, and determination were all cultivated during my childhood: from the age of six, I gathered recyclable materials by the Shing Mun River in Hong Kong SAR every weekend. My hobbies include chess, working out, playing basketball and American football, and studying math, physics, and computer science. I also love anime, and my favorite is Evangelion. I'm always looking to connect with engineers working on large-scale distributed systems. If you've made interesting design decisions at scale, I'd love to learn from you! Website: caleb-leung-kwan-ho.github.io/github-website/ Let's connect :)
• Engineered a multistage, event-driven ingestion pipeline using a producer-consumer architecture, aggregating ~3,000 documents/day across 10+ integrations into PostgreSQL, S3, and OpenSearch. It replaced a monolithic single-function pipeline with independently scalable stages, allowing resources to be allocated specifically to bottleneck stages.
• Designed the extraction and transformation logic within the ETL pipeline, implementing a custom parser that normalizes heterogeneous HTML/PDF content into structured formats and automatically identifies critical metadata.
• Engineered a custom in-house OCR solution that fully replaced AWS Textract, eliminating ~$300K/year in third-party document processing costs by running text extraction entirely within the existing ECS infrastructure. Improved fault tolerance: Textract silently failed on malformed documents, whereas the new pipeline processes all document types to completion in processes bounded to ≤400 MB of memory, enabling predictable horizontal scaling under high-volume workloads.
• Designed a custom chunking library for financial documents that preserves paragraph boundaries, table integrity, and section header coherence, replacing LangChain's RecursiveCharacterTextSplitter. Achieved ~99% boundary integrity across sampled production documents and significantly improved downstream retrieval accuracy.
• Implemented a hybrid matching algorithm combining semantic similarity and fuzzy string matching to identify and align section headers across heterogeneous financial documents, improving header classification accuracy by ~50% over the prior approach.
• Built a document classification model achieving 98.5% accuracy by optimizing embedding strategies and scoring functions for financial data categorization.
• Restructuring the application's OAuth 2.0 authentication flow by migrating from a custom-built implementation to Arctic with PKCE and CSRF state validation, reducing maintenance overhead.
• Refactored legacy Django codebase to resolve critical model inconsistencies, significantly improving application stability and PDF rendering performance.
• Optimized a dynamic PDF generation engine that automated legal document assembly, reducing end-to-end turnaround from days (manual lawyer review) to seconds, eliminating the latency of the human bottleneck in the form creation workflow.
• Built an email notification system integrated with core application logic, resolving a production gap where clients were not receiving transactional notifications.