Compare with Other Products on the Market

Brief overview of engine-agnostic tools for search and LLM evaluation — and how they differ from TestMySearch.

TestMySearch (highlights)

  • Engine-agnostic batch runner. Fetches results from multiple search engines/configurations and evaluates them together.
  • IR metrics & stats. nDCG, MAP, Precision/Recall, overlap/rank-correlation, and pairwise statistical tests with clear visuals.
  • LLM-powered assessment. Optional LLM judging of document relevance and automatic query generation to expand coverage.
  • Reports & workflow. Sandboxes, Baskets, and Generated Reports for side-by-side comparisons and decision-ready summaries.

See details: Metrics · Virtual Assessor · A/B Testing
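For reference, the headline metric above fits in a few lines of Python. This is a minimal nDCG@k sketch using the linear-gain formulation; tools differ on gain functions, so treat it as illustrative rather than any product's exact definition:

    import math

    def dcg_at_k(gains, k):
        """Discounted cumulative gain over the top-k graded gains."""
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    def ndcg_at_k(ranked_gains, k):
        """nDCG@k: DCG of the actual ranking over DCG of the ideal ranking."""
        ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
        return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

    # Graded judgments (0-3) of the returned documents, in rank order.
    print(ndcg_at_k([3, 2, 0, 1, 0], k=5))  # ~0.985: one relevant doc ranked low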

Engine-agnostic tools considered

Each tool below is summarized by primary focus, IR metrics, LLM-based evaluation, monitoring, what it lacks vs. TestMySearch, and links.
Evidently
Primary focus: Open-source evaluation & observability for ML/LLM systems (drift, quality checks, test suites).
IR metrics: General-purpose metrics (classification, regression, NLP); IR metrics require custom setup.
LLM-based eval: Yes — supports LLM judges and model-graded checks.
Monitoring: Yes — dashboards and monitoring.
Missing vs. TestMySearch:
  • No built-in multi-engine search batch runner.
  • No search-specific side-by-side configuration reports.
  • No integrated Expected Results & Query Sets workflow.
Links: GitHub · Website
Promptfoo
Primary focus: Open-source LLM evals, red teaming, guardrails; model-graded scoring.
IR metrics: Generic scoring; not IR-focused by default.
LLM-based eval: Yes — model-graded evals and adversarial tests.
Monitoring: Primarily testing, not monitoring.
Missing vs. TestMySearch:
  • No orchestration to fetch results from multiple search engines.
  • No standard IR metric suite or pairwise statistical tests.
  • No reporting layer for search configuration comparisons.
Links: Website · GitHub
DeepEval
Primary focus: Open-source LLM evaluation framework (pytest-like).
IR metrics: Generic LLM metrics (e.g., hallucination, relevancy, RAGAS); not IR-specific by default.
LLM-based eval: Yes — uses LLMs and local NLP models.
Monitoring: Evaluation-first; hosted monitoring via Confident AI.
Missing vs. TestMySearch:
  • No multi-engine search batch runner or connectors.
  • No integrated Expected Results & side-by-side IR reports.
Links: GitHub · Confident AI
pytrec_eval
Primary focus: Python bindings for the classic TREC evaluation measures.
IR metrics: Yes — nDCG, MAP, Precision@k, etc. (via trec_eval).
LLM-based eval: No — not LLM-graded.
Monitoring: No — library only.
Missing vs. TestMySearch:
  • No result collection from search engines.
  • No LLM-powered assessments or query generation.
  • No reporting/visual comparison across configurations.
Links: GitHub · PyPI
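Typical pytrec_eval usage looks roughly like this; qrels and runs are plain nested dicts, and per-query scores come back the same way:

    import pytrec_eval

    # Graded judgments: query_id -> {doc_id: relevance grade}
    qrels = {'q1': {'d1': 2, 'd2': 0, 'd3': 1}}
    # One system's run: query_id -> {doc_id: retrieval score}
    run = {'q1': {'d1': 1.2, 'd3': 0.9, 'd2': 0.4}}

    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'ndcg'})
    print(evaluator.evaluate(run))  # {'q1': {'map': ..., 'ndcg': ...}}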
trec_eval
Primary focus: Reference IR evaluation tool used by the TREC community.
IR metrics: Yes — canonical TREC measures (MAP, nDCG, etc.).
LLM-based eval: No — not LLM-graded.
Monitoring: No — CLI tool.
Missing vs. TestMySearch:
  • No orchestration of queries or results fetching.
  • No LLM assessments or query generation.
  • No dashboarding or side-by-side reporting.
Links: GitHub
Pyserini
Primary focus: Lucene-based IR toolkit for reproducible baselines with datasets, indexes, and evaluation scripts.
IR metrics: Yes — supports standard benchmarks (e.g., BEIR) with evaluation utilities.
LLM-based eval: No — not focused on LLM-graded evals.
Monitoring: No — toolkit/library.
Missing vs. TestMySearch:
  • Not a multi-engine production A/B and reporting layer.
  • No LLM assessors or query generation workflow.
  • No Sandboxes/Baskets/Reports workflow for decision-making.
Links: GitHub
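A minimal Pyserini sketch: a BM25 baseline over one of its prebuilt indexes (the index name and query here are illustrative):

    # pip install pyserini (requires a Java runtime)
    from pyserini.search.lucene import LuceneSearcher

    # Pull a prebuilt MS MARCO passage index and run a BM25 query.
    searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
    hits = searcher.search('what is information retrieval', k=10)
    for hit in hits:
        print(f'{hit.docid}\t{hit.score:.4f}')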
Quepid
Primary focus: Human-in-the-loop relevance tuning and test cases across engines.
IR metrics: Yes — calculates metrics over judged queries/test cases.
LLM-based eval: No — no built-in LLM assessor or query generation.
Monitoring: Limited — primarily tuning rather than monitoring.
Missing vs. TestMySearch:
  • No automated multi-engine batch runner for large query sets.
  • No statistical pairwise tests or domain-overlap/rank-correlation visualizations.
  • No integrated LLM Judgement reports.
Links: Website · GitHub
RRE (Rated Ranking Evaluator)
Primary focus: Open-source offline IR evaluation framework for search quality.
IR metrics: Yes — supports standard IR measures (nDCG, MAP, etc.).
LLM-based eval: No — no LLM-based judging.
Monitoring: No — framework, not a monitoring suite.
Missing vs. TestMySearch:
  • No built-in connectors or batch orchestration across multiple engines.
  • No visual side-by-side configuration reports with pairwise significance.
  • No LLM query generation or judgment reports.
Links: GitHub · Overview

Last updated 2025-08-12.

Why pick TestMySearch

Multi-engine, offline-first

Run batch tests across engines/configs safely, then ship with confidence. Complement with online A/B as needed.
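A hand-rolled sketch of what the batch runner automates, assuming hypothetical local Solr and Elasticsearch endpoints; TestMySearch's connectors replace this glue code:

    import requests

    queries = ['wireless headphones', 'usb-c charger']  # a small query set

    def solr_top10(q):
        # Hypothetical local Solr core; returns ranked doc ids.
        r = requests.get('http://localhost:8983/solr/products/select',
                         params={'q': q, 'rows': 10, 'wt': 'json'})
        return [d['id'] for d in r.json()['response']['docs']]

    def es_top10(q):
        # Hypothetical local Elasticsearch index (URI search).
        r = requests.get('http://localhost:9200/products/_search',
                         params={'q': q, 'size': 10})
        return [h['_id'] for h in r.json()['hits']['hits']]

    # Top-10 ids per configuration per query, ready for offline scoring.
    engines = {'solr-bm25': solr_top10, 'es-default': es_top10}
    results = {name: {q: fetch(q) for q in queries}
               for name, fetch in engines.items()}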

LLM judgments & query generation

Bootstrap or expand coverage with LLM-generated queries and document-level LLM assessors.
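A document-level LLM assessment boils down to a structured prompt. Here is a minimal sketch using the openai client; the model name and 0-3 rubric are illustrative, not TestMySearch's actual prompts. Query generation follows the same pattern with a different instruction:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge(query: str, document: str) -> int:
        """Ask an LLM for a graded relevance judgment (0-3)."""
        resp = client.chat.completions.create(
            model='gpt-4o-mini',  # illustrative model choice
            messages=[{'role': 'user', 'content': (
                'Rate how relevant the document is to the query on a 0-3 scale '
                '(0 = irrelevant, 3 = perfectly relevant). Reply with one digit.\n\n'
                f'Query: {query}\n\nDocument: {document}'
            )}],
        )
        return int(resp.choices[0].message.content.strip()[0])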

Rich reports

nDCG, MAP, precision/recall, overlap, rank-correlation, and pairwise tests — all in decision-ready views.
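With per-query scores in hand, the pairwise-test, rank-correlation, and overlap pieces are standard scipy and set operations. A sketch with made-up numbers:

    from scipy import stats

    # Per-query nDCG for the same query set under two configurations.
    ndcg_a = [0.82, 0.61, 0.90, 0.45, 0.77]
    ndcg_b = [0.85, 0.66, 0.88, 0.52, 0.81]

    # Paired significance test: is config B a real improvement over A?
    t, p = stats.ttest_rel(ndcg_b, ndcg_a)
    print(f'paired t-test: t={t:.3f}, p={p:.4f}')

    # Rank correlation: positions of the same documents under each config.
    tau, _ = stats.kendalltau([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
    print(f"Kendall's tau: {tau:.2f}")  # 0.60

    # Domain overlap between two top-k result sets (Jaccard).
    a, b = {'d1', 'd2', 'd3', 'd4'}, {'d2', 'd3', 'd5', 'd6'}
    print(f'overlap: {len(a & b) / len(a | b):.2f}')  # 0.33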

Pragmatic workflow

Accounts, Sandboxes, Baskets, and Processors streamline end-to-end evaluation.