Post by Nazril Ravi Pratama

Ex-CRM Data Officer + Best Intern #1 of All Divisions at Jalan Langit Foundation | Data Science & AI Researcher | Freelance Data Scientist & AI Engineer | GEMASTIK 2025 Data Mining Finalist | Data Science Student

🥇 1st Place, National Data Mining Competition, Explore.AH! 2026

Thrilled to share that our team, Juara di Malang (Muhammad Habib Nur Aiman and I), from Universitas Negeri Surabaya, won 1st place at the Explore.AH! 2026 National Data Mining Competition, hosted by ITB Asia Malang.

We worked with the U.S. Crime Dataset (Jan. 2020 – Sept. 2024), covering 982,638 records, 28 features, 21 LAPD districts, and 140 crime types across Los Angeles. The core challenge: only 9.01% of cases ended in arrest. We set out to understand why and to build something actionable from it.

Preprocessing & Feature Engineering

The raw data required domain-knowledge-driven cleaning rather than generic imputation. The weapon columns were 66.81% missing, and those gaps were filled with 'NONE' as a valid category. Records with coordinates at (0, 0), a point in the Gulf of Guinea, were removed.

We engineered 24 new features from the 28 original columns, including:
- Cyclical time encoding via sin/cos transformations to preserve circular continuity
- dist_to_cbd_km, a Haversine-approximated distance to LA City Hall
- crime_category, grouping the 140 crime types into 9 semantic classes
- report_delay_days, which proved to be among the strongest predictors in our model

Methodology

Rather than applying a single model, we built an integrated multi-pillar architecture across 8 analytical layers.

Clustering via UMAP + HDBSCAN identified 52 micro-zones (Silhouette Score 0.66, Davies-Bouldin Index 0.47), with cluster labels propagated as features into downstream models.

Association Rule Mining with FP-Growth revealed that violent crime predicts weapon presence at 99.94% confidence with a lift of 3.02.

Classification using LightGBM achieved a ROC-AUC of 0.81 on a severely imbalanced 9% positive class, with SHAP ranking dist_to_cbd_km as the strongest arrest predictor, ahead of crime type.

Spatial Autocorrelation via Global Moran's I (0.1755, p-value 0.028) confirmed non-random clustering, with LISA identifying 4 High-High hotspot districts in south LA.
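
For anyone curious what two of the engineered features look like in practice, here is a minimal Python sketch of the cyclical time encoding and the distance-to-City-Hall feature. It is an illustration only, not the competition code: the function names are hypothetical, the City Hall coordinates are approximate values assumed for the example, and the Earth radius is the usual mean value of 6371 km.

```python
import math

# Assumption: approximate coordinates of LA City Hall (not stated in the post).
CITY_HALL_LAT, CITY_HALL_LON = 34.0537, -118.2427
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour 0-23 with period 24) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def dist_to_cbd_km(lat, lon):
    """Distance from an incident location to the assumed City Hall point."""
    return haversine_km(lat, lon, CITY_HALL_LAT, CITY_HALL_LON)


def encoded_gap(h1, h2, period=24):
    """Euclidean distance between two hours after sin/cos encoding."""
    s1, c1 = cyclical_encode(h1, period)
    s2, c2 = cyclical_encode(h2, period)
    return math.hypot(s1 - s2, c1 - c2)
```

With this encoding, 23:00 and 00:00 land next to each other on the circle (encoded_gap(23, 0) is small), whereas the raw hour values would place them 23 apart. That is the circular continuity the sin/cos transformation is meant to preserve.
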
Forecasting with Prophet outperformed SARIMA, with a MAPE of 50.71% versus 63.53%.

COVID-19 structural breaks were addressed through Difference-in-Differences, which estimated that the lockdown reduced crime by about 54 cases per month in high-crime areas (coefficient -53.58, p-value 0.035).

Survival Analysis via Cox Proportional Hazards found that weapon-involved cases resolved about 2x faster (HR 2.08), while property crimes resolved 69% slower (HR 0.31).

A Fairness Audit found consistent accuracy across ethnic groups, with a maximum disparity of 4.12%.

All findings are deployed in an interactive Decision Support System for direct use by law enforcement and policymakers.

Live dashboard: https://lnkd.in/dWQM3HwD

Grateful to my teammate and our supervisor for pushing this to what it became.