Singapore
We are looking for research interns to work on foundational areas for coding language models, including pre-training data, mid-training data, synthetic data generation, evaluation, and agentic coding.
Responsibilities
* Explore data-centric methods for improving coding LLMs, including data filtering, quality assessment, deduplication, data mixture, and diversity analysis.
* Build synthetic data and evaluation pipelines for code generation, code editing, repo-level reasoning, tool use, and multi-step coding tasks.
* Run experiments to analyze how data, model, and training strategies affect coding capabilities.
* Work with large-scale code corpora, developer activity data, and agentic coding trajectories.
Requirements
Preferred Qualifications
plus.
What We Offer
* Access to large-scale real-world coding data and agentic trajectories.
* Rich compute resources and model APIs for fast research iteration.
* Opportunities to work on real-world coding model applications and the full model development loop.