
Job Description


Evaluation Engineer


Job_304000

4–8 years

  • Pune, Maharashtra, India (PUN)

Role Summary 

We are seeking an AI Evaluation Engineer to build and operate the end‑to‑end evaluation system for Infoblox IQ—our agentic, network‑troubleshooting assistant. You will design evaluation datasets, build automated evaluators (LLM‑as‑judge + rule‑based), validate agent tool‑use correctness, assess workflow success, and define CI/CD gating that ensures safe, trustworthy behavior in production. 

 

Responsibilities 

Implement the IQ Assistant’s end-to-end evaluation framework (Model, Agent Execution, Workflow, System Reliability, UX).

Build test harnesses, evaluation pipelines, and automated scoring systems similar to industry‑standard auto‑evaluators. 

Maintain regression gates and continuously monitor quality‑of‑service metrics. 

Create and maintain reference datasets. 

Develop binary pass/fail rubrics, gold responses, and evaluation criteria (accuracy, completeness, hallucination, citation faithfulness). 

Implement LLM‑as‑judge and rule‑based evaluators for answer correctness, reasoning validity, grounding, and safety. 
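
For illustration, a minimal sketch of how a rule-based check and an LLM-as-judge verdict might combine into a binary pass/fail result. The judge prompt, the citation rule, and the call_model hook are assumptions for the sketch, not the actual IQ implementation:

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    passed: bool
    reasons: list[str]

def rule_based_checks(answer: str, required_citations: list[str]) -> EvalResult:
    """Deterministic checks: every required doc ID must actually appear in the answer."""
    reasons = [f"missing citation: {d}" for d in required_citations if d not in answer]
    # Illustrative rule: flag claims of resolution with no grounded source marker.
    if re.search(r"\bresolved\b", answer, re.I) and "[source:" not in answer:
        reasons.append("claim of resolution without a grounded source")
    return EvalResult(passed=not reasons, reasons=reasons)

JUDGE_PROMPT = """You are grading a network-troubleshooting answer.
Question: {question}
Gold response: {gold}
Candidate: {candidate}
Return strict JSON: {{"pass": true|false, "reason": "<one sentence>"}}"""

def llm_as_judge(question: str, gold: str, candidate: str,
                 call_model: Callable[[str], str]) -> EvalResult:
    """call_model is any function that sends a prompt to an LLM and returns text."""
    raw = call_model(JUDGE_PROMPT.format(question=question, gold=gold, candidate=candidate))
    # Assumes the judge reliably emits strict JSON; production code would guard this.
    verdict = json.loads(raw)
    return EvalResult(passed=bool(verdict["pass"]), reasons=[verdict["reason"]])
```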

Validate agent tool‑call behavior: argument correctness, retries, timeout handling, and safe use of operational APIs.  
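
As a sketch of tool-call validation, the checks below compare captured tool arguments against a JSON Schema and a retry budget; the DNS_LOOKUP_SCHEMA fields and the retry limit are hypothetical:

```python
from jsonschema import ValidationError, validate

# Hypothetical schema for a DNS-lookup tool; field names are illustrative.
DNS_LOOKUP_SCHEMA = {
    "type": "object",
    "properties": {
        "hostname": {"type": "string", "minLength": 1},
        "record_type": {"enum": ["A", "AAAA", "CNAME", "MX", "TXT"]},
        "timeout_s": {"type": "number", "exclusiveMinimum": 0, "maximum": 30},
    },
    "required": ["hostname", "record_type"],
    "additionalProperties": False,
}

def check_tool_call(name: str, args: dict, retries_seen: int) -> list[str]:
    """Return a list of violations for one tool call captured from an agent trace."""
    violations = []
    try:
        validate(instance=args, schema=DNS_LOOKUP_SCHEMA)
    except ValidationError as e:
        violations.append(f"{name}: bad arguments ({e.message})")
    if retries_seen > 3:  # illustrative retry budget
        violations.append(f"{name}: exceeded retry budget ({retries_seen} retries)")
    return violations
```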

Evaluate multi‑step reasoning chains and agent trajectories, ensuring logical, safe paths. 

Measure retrieval precision/recall, grounding faithfulness, citation accuracy, and indexing correctness. 
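
A minimal example of per-query precision@k / recall@k over labeled document IDs, the building block for the retrieval metrics above:

```python
def precision_recall_at_k(retrieved_ids: list[str],
                          relevant_ids: set[str], k: int) -> tuple[float, float]:
    """Standard precision@k / recall@k over document IDs for one query."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Usage: averaged over a labeled query set to track retrieval quality per release.
p, r = precision_recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3)
assert (p, r) == (1 / 3, 0.5)
```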

Conduct safety evals: hallucination detection, prompt injection, jailbreak attempts, policy violations, harmful actions. 

Observability: Instrument evaluation runs to capture reasoning traces, tool I/O, retrieved docs, chain‑of‑thought, and latency.  
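
A toy tracing decorator showing the kind of instrumentation meant here; a real system would emit spans to an observability backend rather than printing JSON lines:

```python
import functools
import json
import time
import uuid

def traced(step_name: str):
    """Record inputs, outputs, and latency for one eval step as a span-like record."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            span = {"span_id": str(uuid.uuid4()), "step": step_name,
                    "input": repr((args, kwargs))[:500]}
            start = time.perf_counter()
            try:
                out = fn(*args, **kwargs)
                span["output"] = repr(out)[:500]
                return out
            finally:
                span["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
                print(json.dumps(span))  # stand-in for an exporter/SDK call
        return inner
    return wrap
```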

RCA (root-cause analysis): Debug broken workflows, identify error propagation, and produce trace‑based RCA reports. 

CI/CD: Add eval suites to CI/CD pipelines (smoke tests & full nightly packs) with strict release gates.  
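
A sketch of a release gate a CI/CD job could run after an eval pack completes; the threshold values are placeholders, not agreed acceptance criteria:

```python
import json
import sys

# Placeholder thresholds; real gates come from agreed acceptance criteria.
GATES = {"pass_rate": 0.95, "hallucination_rate_max": 0.02, "p95_latency_s_max": 8.0}

def enforce_gates(results_path: str) -> int:
    """Read an eval-run summary (JSON) and fail the pipeline on any gate violation."""
    with open(results_path) as f:
        summary = json.load(f)
    failures = []
    if summary["pass_rate"] < GATES["pass_rate"]:
        failures.append(f"pass_rate {summary['pass_rate']:.3f} < {GATES['pass_rate']}")
    if summary["hallucination_rate"] > GATES["hallucination_rate_max"]:
        failures.append(f"hallucination_rate {summary['hallucination_rate']:.3f} too high")
    if summary["p95_latency_s"] > GATES["p95_latency_s_max"]:
        failures.append(f"p95 latency {summary['p95_latency_s']:.1f}s too high")
    for msg in failures:
        print(f"GATE FAILED: {msg}", file=sys.stderr)
    return 1 if failures else 0  # non-zero exit blocks the release

if __name__ == "__main__":
    sys.exit(enforce_gates(sys.argv[1]))
```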

Continuous Monitoring: Implement production sampling, drift detection, and metric dashboards. 
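
One simple way to flag drift on a sampled production metric is a two-sample Kolmogorov–Smirnov test against a release-time baseline; the alpha here is an illustrative choice:

```python
from scipy.stats import ks_2samp

def detect_drift(baseline_scores: list[float], live_scores: list[float],
                 alpha: float = 0.01) -> bool:
    """A small p-value suggests the live distribution (e.g. judge scores
    sampled from production) has drifted from the baseline captured at release."""
    result = ks_2samp(baseline_scores, live_scores)
    return result.pvalue < alpha
```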

Work with AI Engineering, Product, SMEs, and Platform teams to define acceptance thresholds and close execution gaps. 

Design and maintain benchmark evaluation suites for assistant quality, including curated datasets and scenario-based test cases. 

Run model and agent benchmarking across releases, tracking quality regressions and improvements. 

Design human-in-the-loop evaluation workflows, including annotation guidelines, labeling tools, and quality control processes. 

Design and run evaluation experiments (model selection, A/B testing, prompt variations, agent strategy comparisons) to measure improvements. 
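
For the statistical side of such experiments, a bootstrap confidence interval on the difference in pass rates between two variants is a common, assumption-light approach; this sketch uses plain NumPy:

```python
import numpy as np

def bootstrap_pass_rate_diff(a: np.ndarray, b: np.ndarray,
                             n_boot: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """95% bootstrap CI for the difference in pass rates between variants A and B.

    a and b are arrays of 0/1 pass flags from the same eval set run under two
    prompt/model variants. If the CI excludes 0, the gap is unlikely to be noise.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=a.size).mean()
                    - rng.choice(b, size=b.size).mean())
    return float(np.percentile(diffs, 2.5)), float(np.percentile(diffs, 97.5))
```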

Evaluate trade-offs between quality, latency, and cost across models and agent workflows. 

Develop failure taxonomies to categorize assistant errors (retrieval failure, reasoning errors, tool misuse, hallucination). 
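
A failure taxonomy can be as simple as an enum plus a tally over labeled failed cases; the categories below mirror the ones named in this item:

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    RETRIEVAL = "retrieval_failure"
    REASONING = "reasoning_error"
    TOOL_MISUSE = "tool_misuse"
    HALLUCINATION = "hallucination"
    OTHER = "other"

def failure_breakdown(labeled_failures: list[FailureMode]) -> dict[str, float]:
    """Share of each failure mode across a labeled batch of failed eval cases."""
    counts = Counter(labeled_failures)
    total = len(labeled_failures) or 1
    return {mode.value: counts.get(mode, 0) / total for mode in FailureMode}
```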

Technical Skills 

 

Strong proficiency in Python, including writing production‑quality evaluation code.  

Experience building LLM/agent evaluation pipelines: model scoring, automated metrics, trace inspection, dataset management.  

Knowledge of network troubleshooting domains is a plus. 

Familiarity with RAG workflows, embeddings, vector stores, and grounding metrics.  
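
As one hedged example of a grounding metric: score the fraction of answer sentences whose embedding sits close to at least one retrieved chunk. The 0.75 threshold and the upstream embedding model are assumptions to tune:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def grounding_score(answer_sentences: list[np.ndarray],
                    chunk_embeddings: list[np.ndarray],
                    threshold: float = 0.75) -> float:
    """Fraction of answer sentences close to at least one retrieved chunk;
    a crude proxy for grounding faithfulness, not a definitive metric."""
    if not answer_sentences:
        return 1.0
    grounded = sum(
        1 for s in answer_sentences
        if max(cosine(s, c) for c in chunk_embeddings) >= threshold
    )
    return grounded / len(answer_sentences)
```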

Experience with observability tooling: tracing, spans, and metrics dashboards. 

Experience with CI/CD, regression suites, and automated gating. 

Experience evaluating LLM behavior: hallucination detection, completeness, factuality, and citation correctness.  

Understanding of multi‑turn agent evaluation, reasoning-path assessment, and tool‑use correctness. 

Familiarity with safety assessment: jailbreak testing, adversarial inputs, bias/toxicity checks. 

Experience with evaluation frameworks such as OpenAI Evals, LangSmith, Ragas, or DeepEval. 

Experience with experiment design and statistical analysis for evaluation metrics. 

Experience building and managing large evaluation datasets and labeling pipelines. 

Experience designing evaluation prompts and scoring prompts for LLM-as-judge systems. 

Experience with red teaming and adversarial testing of LLM systems. 

Ability to work with engineering teams to drive model and agent improvements based on evaluation findings.