Evaluation

Articles

vibrantlabsai/ragas - Supercharge Your LLM Application Evaluations
promptfoo/promptfoo - Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI
confident-ai/deepeval - The LLM Evaluation Framework
AgentEvalHQ/AgentEval - AgentEval is the comprehensive .NET toolkit for AI agent evaluation—tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison—built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo and DeepEval do for Python, AgentEval does for .NET
elbruno/elbruno-ai-evaluation - AI Testing & Observability Toolkit for .NET - deterministic evaluators, synthetic data, golden datasets, regression detection
SWE-bench/SWE-bench - SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
harbor-framework/terminal-bench - A benchmark for LLMs on complicated tasks in the terminal
Aider-AI/aider/benchmark
openai/human-eval - Code for the paper “Evaluating Large Language Models Trained on Code”