About The Role

This role owns the architecture and deployment of large language model (LLM) systems, moving past simple API wrappers to build robust, scalable agentic workflows and retrieval-augmented generation (RAG) architectures. The focus is on turning cutting-edge research into stable production software that handles high-concurrency enterprise workloads.

The engineer will collaborate with infrastructure and product teams to optimize inference latency, implement sophisticated grounding mechanisms, and establish rigorous automated evaluation pipelines to ensure model safety and accuracy in real-world environments.

Key Responsibilities

Architect and maintain production-grade RAG pipelines using LangChain or LlamaIndex, integrating advanced retrieval techniques like hybrid search and reranking.
Implement and manage vector database infrastructure (Pinecone, Weaviate, or Milvus) to support high-dimensional similarity search at scale.
Develop and deploy systematic evaluation frameworks utilizing 'LLM-as-a-judge' and deterministic benchmarking to quantify model performance and prevent regressions.
Execute fine-tuning jobs using PEFT techniques such as LoRA and QLoRA to adapt open-source models (Llama 3, Mistral) to domain-specific tasks.
Build and optimize backend services in Python (FastAPI/Pydantic) to serve model outputs with low latency, incorporating streaming and caching strategies.
Design observability and monitoring systems to track token usage, cost, and hallucination rates in live production environments.

What We Are Looking For

3–6 years of experience in software engineering or machine learning, with at least 1 year of hands-on experience deploying LLMs in a production capacity.
Deep technical proficiency in Python and familiarity with deep learning frameworks such as PyTorch or JAX.
Proven experience with LLM orchestration frameworks (e.g., LangChain, LlamaIndex) and vector databases for semantic search and memory management.
Solid understanding of NLP fundamentals, including tokenization, attention mechanisms, and transformer architectures.
Strong background in cloud infrastructure (AWS/GCP) and containerization using Docker and Kubernetes.
Bonus: Experience with model quantization (GGUF, AWQ), low-level inference optimization (vLLM, TensorRT-LLM), or contributions to major open-source AI projects.

LLM / GenAI Engineer
