Large Language Models

TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Elena Bruches, Vadim Alperovich, Dari Baturova, Roman Derunets, Daniil Grebenkin, Georgy Mkrtchyan, Oleg Sedukhin, Mikhail Klementev, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev
Published: January 26, 2026
Authors: 11
Word Count: 6,218

TAM-Eval benchmarks LLMs for realistic test maintenance tasks.

Abstract

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
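The abstract describes a reference-free evaluation protocol built on three signals: test suite pass rate, code coverage, and mutation testing. As an illustration of how such per-scenario metrics might be aggregated, here is a minimal Python sketch; the `TestRunResult` type, field names, and `evaluate` function are hypothetical, not TAM-Eval's actual API or scoring formula.

```python
from dataclasses import dataclass


@dataclass
class TestRunResult:
    """Hypothetical outcome of executing one maintained test file."""
    tests_passed: int
    tests_total: int
    line_coverage: float   # fraction of lines covered, in [0.0, 1.0]
    mutants_killed: int
    mutants_total: int


def evaluate(result: TestRunResult) -> dict:
    """Aggregate the three reference-free metrics for one scenario."""
    pass_rate = (result.tests_passed / result.tests_total
                 if result.tests_total else 0.0)
    mutation_score = (result.mutants_killed / result.mutants_total
                      if result.mutants_total else 0.0)
    return {
        "pass_rate": pass_rate,
        "coverage": result.line_coverage,
        "mutation_score": mutation_score,
    }


# Example: 8 of 10 tests pass, 65% line coverage, 12 of 20 mutants killed.
print(evaluate(TestRunResult(8, 10, 0.65, 12, 20)))
```

Keeping the three metrics separate, rather than collapsing them into a single score, mirrors the paper's framing: a suite can pass fully yet cover little code or kill few mutants, so each signal answers a different question about test effectiveness.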

Key Takeaways

  1. LLMs show promise but limited capabilities in test maintenance.

  2. Go is the most suitable language for LLMs in testing.

  3. Execution errors are the dominant failure mode for LLMs.

Limitations

  • High computational requirements for running the benchmark.

  • Significant variability in performance across tasks and languages.

Keywords

Large Language Models, unit testing, test suite maintenance, test automation, test generation, oracle prediction, test file level evaluation, repository context, test suite pass rate, code coverage, mutation testing, agentic workflows, reference-free protocol
