Singapore, Singapore
Building large-scale (TBs to PBs), flexible, and secure Data Lakehouses with modern data stack components such as Git, Hadoop/S3, Kubernetes, Hive Metastore, Snowflake/Redshift, Iceberg, Glue/EMR/Spark, Trino/Athena, Confluent Kafka, dbt, Airflow, and more. Applying CI/CD with Git and container technologies to make data engineering work reproducible and monitorable, as in software engineering. Designing and building highly scalable, highly available multi-tenant search and reporting services that use big data and suitable distributed systems to serve highly concurrent queries over large datasets.
- Building data tools and pipelines that support AI-powered applications
- Migrating data for a complex system
- Building a vendor-agnostic Data Platform on Cloud using open-source projects and Kubernetes
- Migrating old EMR pipelines to the new data platform
- Building data models for complex data problems
Deployed Data Platform on Cloud:
- AWS services such as Glue, Athena, Redshift, and S3
- GitLab CI/CD
- SQL + dbt + SQL as code + Airflow + self-service CI/CD
- Airflow
- DataHub
- GitLab
- Terraform
- Data Platform with Versioning
- Designed and deployed data models for data pipelines and data services.
- Designed and deployed a highly available, scalable, low-latency search system over hundreds of millions of records using advanced search technologies: Elasticsearch and Vespa.ai.
- Built a custom Spark Thrift Server with Kerberos authentication and Ranger authorization that made Spark work as an ETL query engine on Hive/Hadoop data.
- Designed and deployed complex reporting systems using Lambda Architecture.
- Designed and deployed Data as a Service (API) using Cassandra, Spring Boot, and more.
- Maintained Cloudera Platform.
- Able to deploy Confluent Platform, including Kafka and Kafka Connect source/sink connectors (MySQL, Elasticsearch).
- Developed and optimized data models and pipelines for large datasets using Spark SQL.
- Developed a headless crawler tool for crawling single-page applications.