
How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

Yapei Chang, Kyle Lo, Mohit Iyyer, Luca Soldaini

Published: February 9, 2026
Authors: 4
Word count: 8,771
Code: included

Framework to mine and evaluate how-to procedures for improving LLM reasoning capabilities.

Abstract

Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
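The abstract describes How2Score as a binary protocol: an LLM judge checks whether a generated procedure contains any critical failure, and the benchmark score is the fraction of generations that pass. A minimal sketch of that aggregation logic is below; the function names and the JSON verdict schema are assumptions for illustration, not the authors' implementation, and the judge call itself is replaced by pre-computed judge outputs.

```python
import json

def parse_verdict(judge_output: str) -> bool:
    """Return True if the judge found NO critical failure (a pass).

    Assumes the judge emits JSON like {"critical_failure": bool, "reason": str};
    this schema is a hypothetical stand-in for the paper's actual prompt format.
    """
    verdict = json.loads(judge_output)
    return not verdict["critical_failure"]

def how2score(judge_outputs: list[str]) -> float:
    """Fraction of generations with no critical failure flagged."""
    passes = [parse_verdict(o) for o in judge_outputs]
    return sum(passes) / len(passes)

# Two illustrative judge verdicts for one goal each:
outputs = [
    '{"critical_failure": false, "reason": "all steps sound"}',
    '{"critical_failure": true,  "reason": "skips a step required to reach the goal"}',
]
print(how2score(outputs))  # 0.5
```

Because the signal is a single pass/fail bit per generation, the same score can serve directly as a sparse RL reward, which is how the abstract describes the improvement loop.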

Key Takeaways

  1. LLMs generate plausible but sometimes incomplete procedures, missing critical steps that break real-world usability.

  2. The How2Everything framework extracts 351,000 validated procedures from the web across fourteen diverse topic categories.

  3. The How2Score evaluation protocol uses an LLM judge, validated against human annotators, to identify critical failures that prevent procedural goals from being achieved.

Limitations

  • Initial inter-annotator agreement was very low (Krippendorff's alpha of 0.273), making it difficult to converge on a shared definition of what counts as a critical failure.

  • Existing evaluation methods like BLEU scores and perplexity fail to capture whether procedures actually work in practice.
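The agreement figure cited above is Krippendorff's alpha, a chance-corrected reliability coefficient. For intuition, here is a self-contained computation of alpha for nominal labels (such as binary critical-failure judgments) via the standard coincidence-matrix formulation; this is a generic sketch of the metric, not the paper's annotation pipeline.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units: list[list[int]]) -> float:
    """Krippendorff's alpha for nominal data.

    `units` is a list of items, each holding the labels assigned to that
    item by its annotators (items with fewer than two labels are skipped).
    """
    units = [u for u in units if len(u) >= 2]
    # Build the coincidence matrix: each ordered pair of labels within an
    # item contributes 1 / (m_u - 1), where m_u is that item's label count.
    o = Counter()
    for u in units:
        m = len(u)
        for a, b in permutations(u, 2):
            o[(a, b)] += 1 / (m - 1)
    n_total = sum(o.values())
    marginals = Counter()
    for (a, _b), w in o.items():
        marginals[a] += w
    # Observed disagreement: off-diagonal mass of the coincidence matrix.
    d_obs = sum(w for (a, b), w in o.items() if a != b) / n_total
    # Expected disagreement under chance, from the label marginals.
    d_exp = sum(
        marginals[a] * marginals[b]
        for a in marginals for b in marginals if a != b
    ) / (n_total * (n_total - 1))
    return 1 - d_obs / d_exp
```

Perfect agreement yields alpha = 1, chance-level agreement yields alpha around 0, so a value of 0.273 indicates annotators agreed only modestly more often than chance.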

Keywords

goal-conditioned procedure generation, How2Mine, How2Bench, How2Score, LLM judge, distillation, reinforcement learning, pretraining, closed-loop evaluation
