TM-RugPull: A Temporally Sound, Multimodal Dataset for Early Detection of Rug Pulls Across the Tokenized Ecosystem

Published 25 Feb 2026 in cs.CR | (2602.21529v1)

Abstract: Rug-pull attacks pose a systemic threat across the blockchain ecosystem, yet research into early detection is hindered by the lack of scientific-grade datasets. Existing resources often suffer from temporal data leakage, narrow modality, and ambiguous labeling, particularly outside DeFi contexts. To address these limitations, we present TM-RugPull, a rigorously curated, leakage-resistant dataset of 1,028 token projects spanning DeFi, meme coins, NFTs, and celebrity-themed tokens. TM-RugPull enforces strict temporal hygiene by extracting all features (on-chain behavior, smart contract metadata, and OSINT signals) strictly from the first half of each project's lifespan. Labels are grounded in forensic reports and longevity criteria, verified through multi-expert consensus. This dataset enables causally valid, multimodal analysis of rug-pull dynamics and establishes a new benchmark for reproducible fraud detection research.

Authors (4)

Summary

The paper introduces a rigorously curated multimodal dataset for early detection of rug pulls, integrating strict temporal alignment and diverse token categories.
It employs on-chain metrics and OSINT signals collected before project midpoints to prevent data leakage and improve detection reliability.
Empirical benchmarks demonstrate enhanced class separability and probability calibration, establishing a robust foundation for blockchain fraud analytics.

TM-RugPull: Multimodal, Temporally Rigorous Dataset for Early Rug Pull Detection

Introduction

TM-RugPull introduces a rigorously curated, multimodal dataset for early detection of rug-pull scams across the tokenized ecosystem. It addresses three deficiencies in prior benchmarks: temporal leakage, narrow modality (often limited to DeFi on-chain metrics), and unreliable labeling protocols, all of which constrain causal analysis and downstream detection efficacy. TM-RugPull’s contributions are strict temporal hygiene, coverage of diverse token categories, and the integration of both on-chain and OSINT modalities, which together offer a realistic, reproducible foundation for causality-centered fraud analytics.

Dataset Construction and Scope

TM-RugPull encompasses 1,028 token projects sourced from Ethereum, BSC, Polygon, Arbitrum, and Fantom, spanning 2016 to 2025. The dataset’s coverage extends beyond DeFi to include meme coins, NFT-based games, and celebrity-themed tokens, explicitly reflecting the distribution of rug-pull phenotypes observed in production environments. Data were collected from live, deployed smart contracts, excluding synthetic and simulated samples to ensure representational fidelity.

Projects are labeled using a three-part operational definition of a rug pull: (i) total liquidity withdrawal, (ii) cessation of meaningful on-chain activity, and (iii) asset price collapse. These conditions are verified against post-mortem forensic reports (CertiK, De.Fi, Rekt.news) and confirmed by consensus among multiple domain analysts. Only projects that satisfy all three conditions for at least 72 hours are labeled as scams; legitimate tokens must show sustained on-chain activity and no scam reports for more than a year.
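The labeling rule above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `ProjectEvidence` record and all of its field names are hypothetical, and requiring analyst agreement for both classes is one reading of the text. Only the 72-hour and one-year thresholds come from the description.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SCAM_HOURS = 72    # minimum duration stated in the text
MIN_LEGIT_DAYS = 365   # "over a year" of sustained activity


@dataclass
class ProjectEvidence:
    """Hypothetical evidence record; field names are assumptions for illustration."""
    liquidity_withdrawn: bool     # (i) total liquidity withdrawal
    activity_ceased: bool         # (ii) no meaningful on-chain activity
    price_collapsed: bool         # (iii) asset price collapse
    hours_conditions_held: float  # how long all three conditions persisted
    forensic_report_found: bool   # e.g. a post-mortem report exists
    analysts_agree: bool          # multi-analyst consensus reached
    active_days: int              # days of sustained on-chain activity
    scam_reports: int             # number of known scam reports


def label_project(e: ProjectEvidence) -> Optional[str]:
    """Return 'scam', 'legitimate', or None for ambiguous projects (excluded)."""
    all_conditions = e.liquidity_withdrawn and e.activity_ceased and e.price_collapsed
    if (all_conditions
            and e.hours_conditions_held >= MIN_SCAM_HOURS
            and e.forensic_report_found
            and e.analysts_agree):
        return "scam"
    if (not all_conditions
            and e.active_days > MIN_LEGIT_DAYS
            and e.scam_reports == 0
            and e.analysts_agree):
        return "legitimate"
    return None  # neither rule applies: treat as ambiguous and drop
```

Returning `None` for projects that meet neither rule mirrors the exclusion of ambiguous samples described later in the summary.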

Each feature is extracted strictly prior to an explicit project midpoint, preventing the incorporation of post-collapse artifacts. This temporal boundary is consistently applied across projects regardless of lifespan, as evidenced by the midpoint ratio distribution.
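A minimal sketch of the midpoint cutoff, assuming each observation is a `(timestamp, payload)` tuple and that a project's lifespan is given by explicit start and end timestamps. The function names and the event representation are hypothetical; the source does not describe how lifespans are measured.

```python
from datetime import datetime
from typing import Any, List, Tuple

Event = Tuple[datetime, Any]  # (timestamp, payload); representation is an assumption


def project_midpoint(start: datetime, end: datetime) -> datetime:
    """Timestamp halfway between the start and end of a project's lifespan."""
    if end <= start:
        raise ValueError("end must be after start")
    return start + (end - start) / 2


def events_before_midpoint(events: List[Event], start: datetime, end: datetime) -> List[Event]:
    """Keep only observations strictly earlier than the midpoint."""
    cutoff = project_midpoint(start, end)
    return [event for event in events if event[0] < cutoff]
```

Features would then be computed from the filtered events only, so nothing observed at or after the midpoint can influence them.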

Figure 1: The data collection pipeline aggregates heterogeneous on-chain, OSINT, and labeling signals, temporally aligned and verified for robust rug-pull analysis.

Multimodal and Cross-Domain Coverage

TM-RugPull’s schema integrates:

On-chain behavioral metrics (token concentration, holder variance, transaction statistics).
Smart contract metadata (platform, token standard, contract status).
OSINT-derived features (Google Search volume, Twitter/X activity), temporally capped at the midpoint.

The dataset spans heterogeneous token categories, addressing gaps in prior research which focused primarily on DeFi.

Figure 2: The token category distribution shows that more than 40% of rug pulls originate from meme, NFT, and celebrity tokens, underscoring the need for non-DeFi coverage.

Statistically significant divergences in on-chain structure between scam and legitimate projects are observed, supporting the operational discriminative power of early-stage concentration and distribution features.

Figure 3: Scam tokens demonstrate markedly higher concentration and holder variance in the top 1%—key early indicators not reliant on post-attack signals.
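To make the concentration features concrete, here is one plausible way to compute a top-1% holder share and a holder variance from a snapshot of balances. The exact feature definitions used in the dataset are not given in this summary, so the formulas and function names below are assumptions.

```python
from statistics import pvariance
from typing import Sequence


def top_holder_share(balances: Sequence[float], top_percent: int = 1) -> float:
    """Fraction of total supply held by the largest `top_percent` percent of holders."""
    if not balances:
        raise ValueError("balances must not be empty")
    total = sum(balances)
    if total <= 0:
        raise ValueError("total supply must be positive")
    # Integer ceiling of len * top_percent / 100, with at least one holder.
    k = max(1, -(-len(balances) * top_percent // 100))
    largest = sorted(balances, reverse=True)[:k]
    return sum(largest) / total


def holder_share_variance(balances: Sequence[float]) -> float:
    """Population variance of each holder's share of the total supply."""
    total = sum(balances)
    if total <= 0:
        raise ValueError("total supply must be positive")
    return pvariance([b / total for b in balances])
```

For leakage resistance, the balance snapshot passed to these functions would have to be taken before the project midpoint.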

Temporal Alignment and Leakage Resistance

All on-chain and OSINT modalities are collected and aligned strictly prior to the enforced temporal boundary (project midpoint). This prevents leakage common in prior datasets, where post-rug-pull phenomena such as price collapse or coordinated defamation skew both features and labels.

Longitudinal analysis of OSINT patterns reveals that off-chain hype (as captured by social media and search spikes) frequently precedes both the midpoint and observable on-chain anomalies, reinforcing the validity of these signals for proactive detection.

Figure 4: OSINT activity (social/search volume) peaks substantially before the midpoint, substantiating the causal role of off-chain signals in scam emergence.

Benchmarking: Effects of Diversity and Modality

Empirical evaluations are performed using identical classification pipelines on both the full, cross-domain dataset and a DeFi-only subset. The inclusion of heterogeneous token categories and aligned multimodal features results in substantially improved class separability and more confident probability calibration, relative to DeFi-centric baselines.

Figure 5: The full benchmark exhibits superior probability separation, indicative of greater discriminative signal and reduced labeling ambiguity.
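The summary does not name the metrics behind "class separability" and "probability calibration". As an illustrative assumption, the sketch below uses two standard quantities: the Brier score for calibration, and the gap between mean predicted probabilities of the two classes for separation. Both would be applied to predictions from the same classifier pipeline on the full dataset and on the DeFi-only subset.

```python
from statistics import mean
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted scam probability and the 0/1 label."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def mean_probability_gap(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean predicted probability for scams minus that for legitimate projects."""
    scam = [p for p, y in zip(probs, labels) if y == 1]
    legit = [p for p, y in zip(probs, labels) if y == 0]
    if not scam or not legit:
        raise ValueError("both classes must be present")
    return mean(scam) - mean(legit)
```

A lower Brier score and a larger gap on the full benchmark than on the DeFi-only subset would be consistent with the improvement described above.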

The distribution of midpoint ratios confirms consistently applied temporal boundaries across variable project lifetimes.

Figure 6: Midpoint ratios cluster around 0.5 across projects, confirming unbiased temporal enforcement.
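One plausible definition of the midpoint ratio, offered as an assumption since the summary does not define it, is the position of the feature cutoff within the project lifespan. A value of 0.5 means the cutoff sits exactly halfway through.

```python
from datetime import datetime


def midpoint_ratio(start: datetime, end: datetime, feature_cutoff: datetime) -> float:
    """Position of the feature cutoff within the lifespan: 0.0 at start, 1.0 at end."""
    lifespan = end - start
    if lifespan.total_seconds() <= 0:
        raise ValueError("end must be after start")
    # Dividing two timedeltas yields a plain float.
    return (feature_cutoff - start) / lifespan
```

Checking that this ratio stays at or below 0.5 for every project is a simple audit of the temporal boundary.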

Design Rationale and Implications

The dataset’s methodology mitigates data leakage, hindsight bias, and overfitting to domain-specific artifacts, which are common pitfalls in blockchain fraud research. Including multimodal and domain-diverse features improves generalization, which in turn affects the reliability of machine-learning-based risk scoring in asynchronous and data-scarce tokenized environments.

By anchoring all features temporally before the rug-pull, TM-RugPull is causally aligned with operational early-warning use cases. Its human-in-the-loop labeling protocol and exclusion of ambiguous or low-fidelity samples establish a high-integrity standard for future fraud detection benchmarks.

Conclusion

TM-RugPull establishes a dataset-centric foundation for multimodal, causally valid research into early-stage rug-pull detection. Through stringent temporal hygiene, broad token domain coverage, and expert-verifiable labeling, it eliminates chronic sources of leakage and ambiguity found in existing datasets. Statistical analyses validate strong discriminatory power of early on-chain and OSINT signals (token concentration, pre-collapse hype), and experimental benchmarks demonstrate enhanced probability calibration and class separability. TM-RugPull is publicly available for reproducible research, providing a scalable substrate for future progress in trustworthy blockchain analytics and general anomaly detection in decentralized finance and beyond.
