Post by Ho Wai Mak

Data Scientist | Quantitative Researcher | Machine Learning | Python, C++, Bash, SQL, R, Julia

🏆 Global Impact in AI: Our 2026 EY AI & Data Challenge Journey

I am thrilled to share that Kevin Xu and I competed in the 2026 EY AI & Data Challenge, tackling the critical issue of clean water supply in South Africa. After 10 days of intense iteration and over 200 experiments, our team achieved a strong standing at several levels:
- Top 60 Globally (54th) out of 3,000+ participants
- Podium Finishes: 🥇 1st in Hong Kong, 🥈 2nd in East Asia, and 🥉 3rd in the UK

The core difficulty was Geographic Out-of-Distribution (OOD) prediction: forecasting water quality in river regions the model had never encountered. Most models failed by relying on "spatial proxies", features that memorise locations rather than physics.

Our Experiment Journey & Key Takeaways:
- Overcoming "Spatial Proxies": We had to ruthlessly eliminate every feature that encodes a geographic location rather than the underlying physical processes, because such features carry no signal in a region the model has never seen.
- The Breakthrough: We discovered that point-level and local-scale physical features generalised effectively across unseen regions, whereas broader grid-level and regional features failed catastrophically.
- Our Final Architecture: We developed a novel GAM-Residual Ensemble Boosting architecture. This two-stage model combines an interpretable Generalised Additive Model (GAM) that interpolates trends with residual corrections from an ensemble of tree models (XGBoost, LightGBM, and CatBoost).
- The Results: Our final model achieved an R² score of 0.4309, more than doubling the competition baseline of 0.203.

By prioritising geographic generalisation, we've developed a model that isn't just a technical achievement; it's a scalable tool for real-world environmental monitoring in data-scarce regions where traditional ground sensors are absent.
Our work was driven by the sobering reality shared by the World Health Organization that billions still lack safe water. By focusing on geographic generalisation, we aim to support the goals outlined by the UN Environment Programme to mitigate water stress caused by climate change.

A huge thank you to EY for organising a challenge that highlights how data and AI can be leveraged for environmental good and public health. It was an incredibly rewarding experience to apply machine learning to a problem that impacts the livelihoods of billions.

📄 Technical Report: Explore our methodology on GAM-Residual Ensemble Boosting and Spatial Proxy Theory on ResearchGate. DOI: 10.13140/RG.2.2.26233.12647. https://lnkd.in/e8wZFkYP

💻 Open-Source: Access our complete feature engineering and model training pipeline on GitHub. https://lnkd.in/ePQUgwYP

#AI #MachineLearning #GeospatialAI #OOD #DataScience #EYChallenge #EYAIChallenge2026 #DataChallenge #BetterWorkingWorld #WaterQuality #ESG #Sustainability #AIForGood #TechForGood #SDG6
