LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics

Published 14 Apr 2026 in cs.LG and cs.SE | (2604.12218v1)

Abstract: System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in LLMs offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely-used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our experiments reveal that while fine-tuned transformers achieve the highest F1-scores (0.96-0.99), prompt-based LLMs demonstrate remarkable zero-shot capabilities (F1: 0.82-0.91) without requiring any labeled training data -- a significant advantage for real-world deployment where labeled anomalies are scarce. We further analyze the cost-accuracy trade-offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.


Authors (1)

Disha Patel

Summary

The paper introduces prompt-based LLM methods like GPT-4 with SLCP that close performance gaps in label-scarce scenarios.
It shows that fine-tuned transformer models achieve the highest accuracy, with F1 scores up to 98.9% in data-rich environments.
Cost and latency trade-offs indicate that LLMs are best suited to low-throughput, high-value diagnostics, while classical approaches remain faster and cheaper.

LLM-Enhanced Log Anomaly Detection: An Authoritative Analysis

Introduction

Automated log anomaly detection is essential for the reliability and maintainability of modern large-scale software systems. As log data grows in scale and heterogeneity, traditional anomaly detection pipelines—typically consisting of log parsing followed by machine learning classifiers—face increasing challenges in robustness, adaptability, and operational cost. The development of LLMs trained on extensive textual corpora including code and logs introduces new paradigms in automated log analysis, offering the potential for superior generalization, reduced need for domain-specific engineering, and improved label efficiency. This paper systematically benchmarks LLM-based methods against both traditional pipelines and fine-tuned transformer models for log anomaly detection, yielding actionable findings for researchers and practitioners.

Methodological Framework

The study evaluates three principal categories for log anomaly detection:

Traditional Parser-Based Pipelines: These systems (e.g., Drain, Spell, AEL) parse unstructured log messages into templates, then apply classifiers such as Logistic Regression, Random Forests, SVM, and Isolation Forest for anomaly detection.
Fine-Tuned Transformer Models: Vanilla transformer architectures (BERT, RoBERTa, DeBERTa-v3), endowed with substantial model capacity and contextual understanding, are directly fine-tuned on raw log data.
Prompt-Based LLMs: State-of-the-art models (GPT-3.5, GPT-4, LLaMA-3) are evaluated in both zero-shot and few-shot configurations. Furthermore, the study introduces Structured Log Context Prompting (SLCP), which provides the LLM with enhanced task-relevant structure via system context, temporal metadata, semantic markers, and explicit instructions.
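
As a rough illustration of what a structured prompt of this kind might look like, the sketch below assembles the four SLCP ingredients named above (system context, temporal metadata, semantic markers, explicit instructions) into one prompt string. The field names, the marker keywords, the `[SUSPECT]`/`[INFO]` tags, and the answer format are all assumptions made for illustration; this summary does not give the paper's actual template.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical keyword list used to tag lines; not taken from the paper.
SEMANTIC_MARKERS = ("error", "fail", "exception", "timeout", "denied")


@dataclass
class LogLine:
    timestamp: str  # kept as an opaque string in this sketch
    message: str


def mark(message: str) -> str:
    """Prefix a message with [SUSPECT] if it contains a marker keyword."""
    lowered = message.lower()
    hit = any(word in lowered for word in SEMANTIC_MARKERS)
    return ("[SUSPECT] " if hit else "[INFO] ") + message


def build_slcp_prompt(system_name: str, lines: List[LogLine]) -> str:
    """Assemble a structured zero-shot prompt for one window of log lines."""
    if not lines:
        raise ValueError("a log window needs at least one line")
    # 1) system context, 2) temporal metadata, 3) marked lines, 4) instruction
    header = f"System context: logs from {system_name}."
    span = (f"Time span: {lines[0].timestamp} to {lines[-1].timestamp} "
            f"({len(lines)} lines).")
    body = "\n".join(f"{line.timestamp} {mark(line.message)}" for line in lines)
    task = "Instruction: answer with exactly one word, NORMAL or ANOMALOUS."
    return "\n".join([header, span, "Log window:", body, task])


# Made-up log lines, for illustration only.
example_window = [
    LogLine("10:00:01", "block served to client"),
    LogLine("10:00:02", "connection timeout while writing block"),
]
example_prompt = build_slcp_prompt("HDFS", example_window)
print(example_prompt)
```

The resulting string would be sent to whichever model is being evaluated; the model call itself is left out because it depends on the provider and is not described in this summary.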

Evaluation uses four public log datasets from LogHub (HDFS, BGL, Thunderbird, Spirit) reflecting diverse operational environments. Standard metrics (Precision, Recall, F1, AUC) are reported, with F1 emphasized due to severe class imbalance. The context-sensitive trade-off between accuracy, latency, and cost, particularly relevant for high-frequency deployment environments, is systematically analyzed.
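
To make the emphasis on F1 concrete, the following sketch computes precision, recall, F1, and accuracy from confusion-matrix counts. The counts are invented for illustration (10,000 windows, 100 of them anomalous) and do not come from the paper; they show how a detector that never raises an alarm can still score very high accuracy on imbalanced data while its F1 is zero.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); each is 0.0 when its denominator is 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical detector A: labels every window as normal.
always_normal = dict(tp=0, fp=0, fn=100, tn=9900)
# Hypothetical detector B: finds 80 of the 100 anomalies, with 40 false alarms.
detector = dict(tp=80, fp=40, fn=20, tn=9860)

for name, c in (("always-normal", always_normal), ("detector", detector)):
    p, r, f1 = precision_recall_f1(c["tp"], c["fp"], c["fn"])
    print(f"{name}: accuracy={accuracy(**c):.3f} "
          f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these made-up counts the two detectors differ by less than half a percentage point in accuracy, while F1 separates them clearly.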

Quantitative Results and Comparative Analysis

Fine-tuned transformers set the accuracy upper bound (DeBERTa-v3: F1 = 95.3%–98.9%), provided sufficient labeled data is available. These models consistently outperform both parser-based and prompt-based alternatives across all datasets.

Prompt-based LLMs, notably GPT-4 with SLCP, exhibit strong zero-shot F1 scores (up to 91.2%), superior to several traditional supervised pipelines, despite the absence of any fine-tuning. This demonstrates compelling out-of-the-box performance, particularly valuable in real-world, label-scarce scenarios.

SLCP, the proposed prompting strategy, yields statistically significant improvements: it raises zero-shot LLM F1 by 2.9–3.1 percentage points, with semantic markers as the dominant contributing factor. Few-shot prompting further narrows the gap between prompt-based and fine-tuned models.

Classical methods remain competitive: Drain+RF achieves F1 = 86.4%–95.1%, underscoring that well-engineered baselines should not be disregarded, especially under latency and cost constraints.

Cost and latency differentials are pronounced: GPT-4 deployment incurs substantial operational cost and latency ($8.40/1000 predictions; 890 ms/prediction), making it suitable primarily for low-throughput or high-value diagnostic situations. Fine-tuned transformers and classical models are orders of magnitude cheaper and faster.
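
A back-of-the-envelope calculation shows why these figures point to low-throughput use. The sketch below takes the two reported numbers ($8.40 per 1000 predictions, 890 ms per prediction) and scales them to a daily volume. The daily volumes are hypothetical, and the stream estimate assumes constant per-prediction latency with no batching, which is a simplification not stated in the summary.

```python
import math

COST_PER_1000_USD = 8.40   # reported GPT-4 cost per 1000 predictions
LATENCY_SECONDS = 0.890    # reported GPT-4 latency per prediction
SECONDS_PER_DAY = 86_400


def daily_cost_usd(predictions_per_day: int) -> float:
    """API cost for one day at the reported per-prediction price."""
    return predictions_per_day * COST_PER_1000_USD / 1000


def min_parallel_streams(predictions_per_day: int) -> int:
    """Fewest sequential request streams that fit the volume into one day,
    assuming every prediction takes exactly LATENCY_SECONDS (an assumption)."""
    busy_seconds = predictions_per_day * LATENCY_SECONDS
    return math.ceil(busy_seconds / SECONDS_PER_DAY)


# Hypothetical workloads: a small triage queue and a high-volume stream.
for volume in (10_000, 1_000_000):
    print(f"{volume:>9} predictions/day: "
          f"${daily_cost_usd(volume):,.2f}/day, "
          f"{min_parallel_streams(volume)} stream(s)")
```

At the hypothetical one million predictions per day, the reported price works out to roughly $8,400 per day and the workload no longer fits in a single sequential stream, whereas ten thousand predictions per day costs about $84 and fits comfortably in one.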

Label efficiency highlights regime transitions: Prompt-based LLMs substantially outperform both transformers and classical methods when label availability is severely limited (<5%). Beyond 25% label availability, fine-tuned DeBERTa-v3 dominates.

Failure Modes and Ablation Insights

Traditional pipelines primarily misclassify novel or out-of-distribution log templates and subtle anomalies that manifest as atypical event sequences. Fine-tuned transformers are robust to new templates but are bounded by sequence length limitations. Prompt-based LLMs exhibit high semantic understanding yet are unstable across runs (due to sampling non-determinism tied to temperature settings) and struggle to detect purely numerical or threshold-based anomalies.

Ablation on SLCP components demonstrates additive improvements, with semantic markers the most important single component (+2.1% F1 alone). Window size analysis reveals that classical methods are sensitive to this hyperparameter, whereas transformer-based methods are more robust across window lengths.
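
For readers unfamiliar with the window-size hyperparameter, the sketch below shows one common way a stream of parsed template IDs is cut into fixed-size sliding windows and turned into event-count vectors for a classical classifier. This is a generic illustration, not the paper's preprocessing: the function names, the template IDs, and the choice of count features are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def sliding_windows(events: Sequence[str], size: int, step: int) -> List[List[str]]:
    """Cut an event sequence into windows of `size` events, advancing by `step`.
    A sequence shorter than `size` yields a single, shorter window."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not events:
        return []
    last_start = max(len(events) - size, 0)
    return [list(events[i:i + size]) for i in range(0, last_start + 1, step)]


def count_vector(window: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Count how often each known template ID occurs in the window."""
    counts = Counter(window)
    return [counts[template] for template in vocabulary]


# Made-up template IDs, as a log parser such as Drain might emit them.
stream = ["E1", "E2", "E1", "E3", "E1", "E2"]
vocab = ["E1", "E2", "E3"]

for size in (3, 4):
    vectors = [count_vector(w, vocab) for w in sliding_windows(stream, size, step=size)]
    print(f"window size {size}: {vectors}")
```

Changing `size` changes both the number of samples and the feature values each sample carries, which is one plausible reason a count-based classifier would be sensitive to it.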

Practical Implications and Deployment Guidelines

The study's thorough benchmarking leads to clear deployment heuristics:

High accuracy and labeled data: Prioritize fine-tuned DeBERTa-v3.
Low-label environments, moderate cost tolerance: Use GPT-4 with SLCP.
Cost-sensitive, label-poor settings: Favor locally deployed LLaMA-3 with SLCP.
High-throughput/low-latency: Opt for classical methods such as Drain+RF.
Rapidly evolving log formats: Prefer transformer or LLM-based solutions for robustness to format drift.
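
The heuristics above can be read as a small decision procedure. The sketch below encodes four of them as a function. The order in which the conditions are checked is an assumption (the summary lists the rules without ranking them), and the format-drift preference is left out because it names a family of methods rather than a single choice.

```python
def recommend_method(labels_available: bool,
                     cost_sensitive: bool,
                     high_throughput: bool) -> str:
    """Map coarse deployment constraints to one of the listed recommendations.

    The priority order (throughput, then labels, then cost) is an assumption
    made for this sketch, not something stated in the source.
    """
    if high_throughput:
        return "Drain+RF"
    if labels_available:
        return "fine-tuned DeBERTa-v3"
    if cost_sensitive:
        return "locally deployed LLaMA-3 with SLCP"
    return "GPT-4 with SLCP"


# Example: few labels, some budget for API calls, modest log volume.
print(recommend_method(labels_available=False,
                       cost_sensitive=False,
                       high_throughput=False))
```

A real selection would weigh these constraints against each other rather than short-circuiting on the first match, so the function is best read as a compact restatement of the list.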

Limitations and Prospects for Future Research

Evaluations are limited to English-language logs and publicly available datasets, which may not capture all production complexities. Cost analyses depend on prevailing API pricing, and the focus is restricted to binary anomaly detection. Avenues for future work include multi-class anomaly detection, online streaming scenarios, application to multilingual and structured (multimodal) log data, and end-to-end integration with AIOps toolchains.

Conclusion

This comprehensive benchmark establishes that fine-tuned transformer models currently yield the highest log anomaly detection performance in label-rich settings, but LLMs, particularly with structured prompting, deliver robust accuracy under label constraints—a vital advantage for practical system diagnostics. The development of advanced prompting techniques like SLCP is critical to extracting maximal zero-shot utility from LLMs. The results offer a concrete foundation for selecting and deploying anomaly detection pipelines tailored to operational constraints, with implications for ongoing research in log analysis, trustworthy AI diagnostics, and self-healing systems.

Reference: "LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of LLMs for Automated System Diagnostics" (2604.12218).
