Athens
Why Join Us
GRNET S.A. provides Internet connectivity, high-quality e-Infrastructures and advanced services to the Greek Educational, Academic and Research community, aiming to minimize the digital divide and to ensure the equal participation of its members in the global Society of Knowledge. GRNET provides advanced services to the following sectors: Education, Research, Health, Culture.
In 2026, GRNET is expected to host the DAEDALUS supercomputer, which is projected to rank among Europe's top supercomputers and will also serve Pharos, the Greek AI factory, and its special needs for AI workflows. DAEDALUS is based on HPE's NVIDIA GH200 direct liquid-cooled architecture and is designed for about 89 petaflops sustained (115 petaflops peak) for traditional HPC, AI and Big Data/HPDA workloads across CPU and GPU-accelerated partitions, backed by 1 PB of high-performance NVMe and 10 PB of usable storage.
As a Distributed AI Support Engineer, you will help researchers, startups, and industry teams turn this cutting-edge infrastructure into real-world AI breakthroughs, working alongside leading European universities, supercomputing centres, and industrial partners in the broader EuroHPC ecosystem. More specifically, you will contribute to the following focus areas. You are not expected to know all the technologies listed below. We are looking for strong AI and Python programming skills, solid fundamentals, and motivation to learn the necessary tools and workflows.
Focus Areas
Provide first-line support for AI on HPC workloads (LLM, computer vision and other GPU-accelerated workloads): ticket triage, quick diagnosis of failed runs, escalation when hardware issues are suspected. Support users in writing/reviewing/debugging Slurm job scripts launching multi-GPU/multi-node jobs via torchrun, accelerate launch or deepspeed, and support Ray/DeepSpeed and vLLM inference workflows where appropriate.
Maintain and test shared AI/LLM and computer-vision stacks for HPC and Cloud (PyTorch, DDP/FSDP, Hugging Face Transformers & Accelerate, PEFT/LoRA, Unsloth, DeepSpeed, Bitsandbytes, TensorFlow, RAPIDS, Ray, vLLM and related tooling), ensuring compatibility with NVIDIA drivers, CUDA and NCCL. Design, publish and support recommended Apptainer/Singularity containers (including NGC-based images) for training, fine-tuning, inference and RAG.
Diagnose common AI/LLM failures (CUDA errors, NCCL timeouts, GPU OOM, distributed hangs, misconfigured environment). Validate driver/CUDA/NCCL stacks and profile/tune workloads using PyTorch Profiler, NVIDIA Nsight (Systems/Compute), TensorBoard, MLflow and Weights & Biases (WandB).
Guide users on scalable distributed training with PyTorch DDP/FSDP and DeepSpeed (ZeRO/pipeline/tensor parallelism), plus Ray and higher-level frameworks (PyTorch Lightning, Hydra), mapped to node/GPU topology. Support 8-bit/4-bit quantisation and QLoRA workflows (Unsloth, Bitsandbytes) and large-scale inference frameworks (vLLM, NVIDIA TensorRT-LLM, Triton Inference Server); contribute to AI/LLM and computer-vision benchmarking.
Advise on effective storage use for tokenised datasets, vector indices, checkpoints and logs (layout, sharding, cleanup). Troubleshoot dataloader/I/O bottlenecks and recommend suitable formats and caching/staging, including use of NVIDIA DALI, WebDataset, RAPIDS and Dask where appropriate.
Monitor AI/LLM usage metrics (GPU hours, job success rates, queue waiting times, typical model sizes/frameworks) to drive improvements in stacks, docs and training. Support Access Call evaluation via technical review of AI/LLM proposals and resource feasibility checks.
Develop and maintain task-oriented documentation and cookbooks for AI/LLM workflows on HPC and Cloud. Prepare hands-on tutorials/demos (PyTorch, TensorFlow, Hugging Face Transformers, vLLM, Ray/DeepSpeed, RAPIDS, JupyterLab/TensorBoard/MLflow).
Prepare technical reports on trainings offered; maintain dashboards/databases for trainings, KPIs and survey data. Prepare web content (news, training/service pages), coordinate announcements (newsletters, social media), and support stakeholders and user access processes.
Key Technologies and Tools
Requirements
Required Qualifications
Desirable Qualifications
Benefits
GRNET provides a creative, dynamic and challenging working environment that encourages team spirit, cooperation and continuous learning of state-of-the-art technologies.
GRNET is an equal opportunity employer that is committed to diversity and inclusion in the workplace. People with a diverse range of backgrounds are encouraged to apply. We do not discriminate against any person based on their race, age, color, gender identity and expression, disability, national origin, medical conditions, religion, parental status, or any other characteristic protected by law.
All applications will be treated with strict confidentiality.