What Data Does AI Use to Analyze the Crypto Market?
AI doesn’t “predict crypto” from a single magic chart. In practice, most AI systems analyze the crypto market by combining multiple data streams—market microstructure (trades + order books), derivatives metrics, on-chain activity, sentiment and news signals, and broader macro context—then cleaning, aligning, and transforming that data into features a model can learn from.
Why AI needs multiple crypto data sources
Crypto is fragmented (many exchanges, many venues, 24/7 trading) and influenced by very different drivers depending on the asset and market regime. That’s why serious AI workflows usually blend:
- Price + volume (what happened)
- Order books (how liquidity is positioned right now)
- Derivatives (how leveraged traders are positioned)
- On-chain (what’s happening on the blockchain itself)
- Sentiment/news (what people believe and react to)
- Macro + correlations (what external markets and policy are doing)
Institutional-grade crypto data vendors explicitly separate these categories (trades/OHLCV, Level 1–2 order book depth, derivatives metrics, etc.) because each answers a different question about market behavior. (Kaiko)
1) Spot price data (OHLCV) and basic market stats
What it includes
The most common “starter dataset” for AI is OHLCV:
- Open, High, Low, Close prices over a time interval (1m, 5m, 1h, 1d…)
- Volume traded during that interval
Often accompanied by:
- Market cap (price × circulating supply)
- Circulating supply (if available)
- VWAP (volume-weighted average price)
- Return series (log returns, percent changes)
Why AI uses it
OHLCV is:
- Easy to obtain
- Easy to align across time
- Useful for regime detection (trend vs chop), volatility estimation, and return prediction baselines
Typical AI features engineered from OHLCV
- Returns (1-step, multi-step), momentum, moving averages
- Volatility (rolling std, ATR), drawdown
- Volume spikes, volume trends
- Price structure features (candle body/wick ratios)
- Cross-asset relationships (BTC → altcoin lead/lag)
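A few of the features above can be sketched in plain Python. This is a minimal illustration, not a production pipeline; the function names, window sizes, and sample prices are all made up for the example:

```python
import math
from statistics import mean, stdev

def log_returns(closes):
    """1-step log returns from a close-price series."""
    return [math.log(c2 / c1) for c1, c2 in zip(closes, closes[1:])]

def rolling_vol(returns, window):
    """Rolling standard deviation of returns (a simple volatility proxy)."""
    return [stdev(returns[i - window + 1:i + 1])
            for i in range(window - 1, len(returns))]

def volume_zscore(volumes, window):
    """Z-score of the latest volume vs a trailing window (spike detector)."""
    hist = volumes[-window - 1:-1]
    return (volumes[-1] - mean(hist)) / stdev(hist)

closes = [100, 101, 99, 102, 104, 103, 107]   # toy OHLCV closes
vols   = [10, 11, 9, 10, 12, 11, 40]          # toy volumes; last bar spikes

rets  = log_returns(closes)          # one fewer value than the close series
vol3  = rolling_vol(rets, window=3)  # volatility over 3-return windows
spike = volume_zscore(vols, window=6)  # large positive z-score on the spike
```

In a real system these would typically be vectorized (e.g., with pandas rolling windows), but the feature definitions are the same.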
Important warning: “volume” can be unreliable
Crypto has a long history of inflated or fake volume on some venues (e.g., wash trading), which can mislead any model that treats reported volume as ground truth. (Investopedia)
2) Trade-level (“tick”) data
What it includes
Tick data records individual prints:
- timestamp (often milliseconds)
- price
- size/quantity
- side (buyer-initiated vs seller-initiated, if inferred)
- trade ID
Why AI uses it
Tick data helps models learn micro-patterns that OHLCV hides, such as:
- short-term order flow imbalance
- bursty volatility
- reaction time after news/on-chain events
Common AI uses
- High-frequency features (trade intensity, average trade size)
- Event studies (“how does price move 1–10 minutes after X?”)
- Better slippage simulation in backtests
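As a sketch of the first item, tick prints can be bucketed by time to produce trade-intensity and average-trade-size features; the bucket width and tuple layout here are illustrative:

```python
from collections import defaultdict

def bucket_trade_features(trades, bucket_ms=1000):
    """Aggregate tick prints into per-bucket trade count ("intensity")
    and average trade size. `trades` is a list of
    (timestamp_ms, price, size) tuples."""
    buckets = defaultdict(list)
    for ts, _price, size in trades:
        buckets[ts // bucket_ms].append(size)
    return {b: {"intensity": len(sizes), "avg_size": sum(sizes) / len(sizes)}
            for b, sizes in buckets.items()}

# Toy ticks: three prints in the first second, one in the next
ticks = [(1000, 100.0, 0.5), (1200, 100.1, 0.2),
         (1900, 100.2, 0.3), (2100, 100.3, 1.0)]
feats = bucket_trade_features(ticks)
```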
Data providers often advertise “every tick trade” coverage for derivatives/spot because it matters for robust modeling and backtesting. (data.coindesk.com)
3) Order book data (Level 1 / Level 2, depth, liquidity)
What it includes
Order book data is the live supply/demand stack:
- Level 1: best bid/ask, spread
- Level 2: multiple price levels of bids/asks (market depth)
Sometimes:
- order book updates (diff streams)
- snapshots at intervals
Why AI uses it
Order books are where liquidity and intent show up. AI models use order book data to estimate:
- liquidity (how hard it is to move price)
- short-term support/resistance from depth clusters
- imbalance (more bids than asks near mid price)
Kaiko and other providers describe Level 1–2 market data as covering trading activity, order books, and liquidity insights across many venues—exactly the ingredients used in microstructure-based signals. (Kaiko)
Example engineered features
- Bid-ask spread, spread changes
- Depth at N basis points from mid
- Order book imbalance (OBI)
- “Spoofing-like” patterns (rapid add/cancel—harder to do reliably without full message data)
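The first two features translate directly into code. A minimal sketch, assuming the book is represented as best-first lists of (price, size) pairs:

```python
def order_book_imbalance(bids, asks, levels=5):
    """OBI in [-1, 1]: positive when bid depth dominates near the top
    of book. `bids`/`asks` are (price, size) lists, best price first."""
    bid_depth = sum(size for _p, size in bids[:levels])
    ask_depth = sum(size for _p, size in asks[:levels])
    return (bid_depth - ask_depth) / (bid_depth + ask_depth)

def spread_bps(bids, asks):
    """Bid-ask spread expressed in basis points of the mid price."""
    best_bid, best_ask = bids[0][0], asks[0][0]
    mid = (best_bid + best_ask) / 2
    return (best_ask - best_bid) / mid * 10_000

# Toy snapshot: a bid-heavy book with a 20 bps spread
bids = [(99.9, 5.0), (99.8, 3.0), (99.7, 2.0)]
asks = [(100.1, 1.0), (100.2, 1.5), (100.3, 0.5)]
obi = order_book_imbalance(bids, asks)   # positive: more depth on the bid
```

Depth-at-N-bps features follow the same pattern, summing only levels within a price band around the mid.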
4) Derivatives data (futures, perpetuals, options)
Derivatives often drive crypto's biggest moves because leverage turns adverse price moves into forced liquidations, which can cascade.
Core derivatives datasets AI uses
- Open interest (OI): total outstanding contracts
- Funding rates: cost of holding long/short positions in perpetual swaps
- Derivatives volume
- Liquidations (where available)
- Options implied volatility (IV), skew (puts vs calls), term structure
CoinMarketCap explicitly tracks derivatives market metrics like open interest, trading volume, and funding rates, and CoinDesk Data describes granular datasets including tick trades, open interest, and funding rate updates. (CoinMarketCap)
How AI models use derivatives
- Detect crowded positioning (OI surges + one-sided funding)
- Identify mean-reversion risk (extreme funding often precedes pullbacks)
- Volatility forecasting (IV as a forward-looking signal)
- Tail-risk indicators (options skew can reflect crash hedging demand)
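The "crowded positioning" idea can be sketched as a simple flag combining an open-interest surge with one-sided funding. The lookback and thresholds below are illustrative placeholders, not calibrated values:

```python
def crowded_long_flag(oi, funding, oi_lookback=24,
                      oi_pct=0.10, fund_thresh=0.0005):
    """Flag a 'crowded long' regime: open interest surged vs its
    lookback AND funding is strongly positive (longs paying shorts).
    All thresholds are illustrative and should be calibrated."""
    if len(oi) <= oi_lookback:
        return False
    oi_change = oi[-1] / oi[-oi_lookback - 1] - 1
    return oi_change > oi_pct and funding[-1] > fund_thresh

# Toy series: 20% OI surge over the lookback plus elevated funding
oi = [100.0] * 24 + [120.0]
funding = [0.0001] * 24 + [0.001]
```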
5) On-chain data (blockchain activity and network health)
What it includes
On-chain data comes directly from blockchains (or indexed interpretations of them), such as:
- transaction count
- active addresses
- transfer volume
- fees, block space usage
- hashrate / mining metrics (for PoW chains)
- exchange inflows/outflows (requires labeling/clustering)
- realized cap and profit/loss style metrics (depends on methodology)
Binance’s explanations of on-chain analysis describe it as using blockchain-recorded indicators like active addresses, transaction volume, mining/hashrate-related data, etc. (Binance)
Glassnode positions itself around “on-chain market intelligence,” providing on-chain metrics via API and dashboards. (Glassnode)
Why AI uses it
On-chain data is valuable because it can reflect usage, flows, and participant behavior that aren’t visible in exchange-only price charts.
Practical examples of on-chain signals
- Exchange netflows: large net inflows can imply potential sell pressure; outflows may imply accumulation (interpret carefully)
- Activity proxies: active addresses / transaction count (noisy, chain-dependent)
- Fee pressure: demand for block space can correlate with network usage and speculative intensity
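The netflow signal above reduces to simple arithmetic once inflow/outflow series exist (which, per the caveats below, is the hard part). A minimal sketch with made-up per-interval flows:

```python
def exchange_netflows(inflows, outflows):
    """Per-interval net flow onto exchanges (inflow - outflow).
    Positive values are often read as potential sell pressure,
    but labeling/clustering quality dominates this signal."""
    return [i - o for i, o in zip(inflows, outflows)]

def cumulative(series):
    """Running total, useful for spotting sustained
    accumulation or distribution phases."""
    total, out = 0.0, []
    for x in series:
        total += x
        out.append(total)
    return out

net = exchange_netflows([120, 80, 200], [100, 150, 90])
cum = cumulative(net)
```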
Important caveat
On-chain data is not always “clean truth.” It can be:
- distorted by batching, bridging, L2 activity, and internal exchange movements
- heavily methodology-dependent (e.g., how “entities” are clustered and labeled)
6) DeFi and DEX data (liquidity pools, TVL, on-chain order flow)
If you trade or model assets that depend on DeFi venues, AI may also ingest:
- TVL (total value locked)
- pool liquidity and changes
- swap volumes and price impact
- on-chain MEV indicators (specialized)
- stablecoin supply changes in ecosystems
Many analysts treat these as part of “on-chain” + “market microstructure,” but the key point is: DEX activity is its own flow channel, and AI systems often include it when relevant.
7) News data and event feeds
What it includes
- Headlines + timestamps
- article metadata (source, tags)
- structured “event” labels (listing, hack, lawsuit, ETF news, upgrade)
- sentiment scores (rule-based or model-based)
- engagement/votes (in some platforms)
CryptoPanic, for example, provides a developer API for real-time crypto news and sentiment signals. (CryptoPanic)
Why AI uses it
Crypto reacts quickly to information, and NLP models can:
- detect topic shifts (regulation vs ETF vs hack)
- estimate event impact probabilities
- avoid trading into major scheduled events (or at least flag risk)
Common NLP modeling approaches
- headline embeddings + classifier/regression
- topic modeling (clusters of themes)
- “event study” labels (before/after abnormal return windows)
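As a toy stand-in for the structured event labels mentioned above, here is a rule-based headline tagger. Real systems use trained classifiers or embeddings; the labels and keyword patterns below are purely illustrative:

```python
import re

# Illustrative label -> regex map; production taggers are learned, not hand-written
EVENT_PATTERNS = {
    "hack":       r"\b(hack|exploit|breach)\b",
    "regulation": r"\b(sec|lawsuit|regulat\w+)\b",
    "etf":        r"\betf\b",
    "listing":    r"\b(list(s|ed|ing)?)\b",
}

def tag_headline(headline):
    """Return the set of event labels whose pattern matches the headline."""
    text = headline.lower()
    return {label for label, pat in EVENT_PATTERNS.items()
            if re.search(pat, text)}

tags = tag_headline("SEC lawsuit targets exchange after ETF delay")
```

Even this crude version yields timestamped event labels that can feed an event study (abnormal returns in windows before/after the tag).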
8) Social sentiment data (Twitter/X, Reddit, etc.)
What it includes
- post text + timestamp
- author metadata (followers, verified, etc.)
- engagement (likes/retweets/replies)
- derived sentiment (positive/negative/neutral)
- keyword frequency, topic distribution
Why AI uses it
Sentiment can proxy:
- attention
- hype cycles
- fear/greed swings
- coordination / narrative momentum
There is published research on using public Twitter sentiment to predict crypto returns (with varying results and many caveats). (ScienceDirect)
Common engineered features
- sentiment score rolling averages
- “tweet volume” spikes (attention proxy)
- influencer-weighted sentiment (risky; often overfits)
- narrative/topic shifts (e.g., “staking,” “ETF,” “airdrop”)
Major pitfalls
- bots and spam
- sarcasm and multilingual ambiguity
- regime dependence (sentiment works until it doesn’t)
- data access restrictions and sampling bias
9) Search trends and attention data (Google Trends, Wikipedia views)
What it includes
- search interest indices (not absolute counts)
- keyword-level time series (e.g., “bitcoin,” “buy crypto,” “ethereum ETF”)
- regional breakdowns (sometimes useful)
Why AI uses it
Search activity can act as a proxy for:
- retail attention
- emerging narratives
- FOMO or panic
Academic work and preprints have explored Google Trends as a predictor/feature for Bitcoin and related asset movements (results vary by period and method). (UPCommons)
10) Market “sentiment indices” (Fear & Greed and similar composites)
What it includes
Some products combine multiple inputs (volatility, momentum, surveys, social signals, etc.) into a single score.
Example: the Crypto Fear & Greed Index provides a 0–100 sentiment gauge, and it documents that it aggregates multiple data points to produce the score. (Alternative.me)
Why AI uses it
Composite indices can be used as:
- a quick regime feature (risk-on vs risk-off)
- a filter for mean reversion strategies (extremes sometimes matter)
But they can also hide important details (you lose granularity).
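Used as a coarse regime feature, a composite score needs nothing more than bucketing. The breakpoints below are illustrative, not taken from any index's methodology:

```python
def regime_from_index(score):
    """Map a 0-100 fear/greed-style score to a coarse regime label.
    Breakpoints are illustrative; calibrate against your own data."""
    if score <= 25:
        return "extreme_fear"
    if score >= 75:
        return "extreme_greed"
    return "neutral"
```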
11) Macro and cross-asset data (rates, equities, FX, commodities)
Even if you only trade crypto, many AI systems add:
- US interest rates / yields
- dollar strength proxies
- equity index returns (e.g., Nasdaq)
- volatility indices (e.g., the VIX)
- major risk events calendar
Reason: crypto can trade like a high-beta risk asset in some regimes, and macro can dominate.
12) “Asset fundamentals” and project metadata
Depending on the use case, AI may also ingest:
- token emissions schedules
- unlock events
- staking APR changes
- governance proposal outcomes
- exchange listings/delistings
- protocol upgrades (forks, hard forks, major releases)
This data is often event-driven and must be normalized carefully.
How AI turns raw crypto data into model inputs
Collecting data is only half the job. Most of the “edge” is in preprocessing:
1) Normalization
- unify symbols (BTC-USD vs XBTUSD)
- align timestamps (UTC), fix missing intervals
- adjust for exchange-specific quirks
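The first two normalization steps can be sketched directly. The symbol map below is a hypothetical stub; real systems maintain much larger per-venue mapping tables:

```python
from datetime import datetime, timezone

# Hypothetical venue-symbol -> canonical-symbol map (illustrative only)
SYMBOL_MAP = {"XBTUSD": "BTC-USD", "BTCUSDT": "BTC-USD", "BTC-USD": "BTC-USD"}

def normalize_row(symbol, ts_ms):
    """Map a venue symbol to a canonical one and convert a millisecond
    epoch timestamp to a UTC ISO-8601 string."""
    canonical = SYMBOL_MAP.get(symbol.upper())
    if canonical is None:
        raise KeyError(f"unmapped symbol: {symbol}")
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return canonical, dt.isoformat()

sym, ts = normalize_row("XBTUSD", 1_700_000_000_000)
```

Failing loudly on unmapped symbols (rather than passing them through) is a deliberate choice: silent symbol mismatches are a classic way to merge the wrong series.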
2) Feature engineering
- rolling stats (mean, std, z-scores)
- microstructure features (spread, imbalance)
- on-chain deltas (netflow changes)
- NLP embeddings for text data
3) Labeling and target design
- next-period return direction
- volatility forecast
- probability of drawdown
- classification: “breakout,” “range,” “crash risk”
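The simplest of these targets, next-period return direction, looks like this. The deadband parameter (illustrative here) keeps near-zero moves from being labeled as signal:

```python
def direction_labels(closes, horizon=1, deadband=0.0):
    """Label each bar by the sign of the forward return over `horizon`
    bars: +1 up, -1 down, 0 inside the deadband (to avoid labeling
    noise as direction). The last `horizon` bars get no label."""
    labels = []
    for i in range(len(closes) - horizon):
        fwd = closes[i + horizon] / closes[i] - 1
        if fwd > deadband:
            labels.append(1)
        elif fwd < -deadband:
            labels.append(-1)
        else:
            labels.append(0)
    return labels

labels = direction_labels([100, 101, 101, 99], horizon=1, deadband=0.001)
```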
4) Backtesting with realistic assumptions
- slippage + fees
- latency (especially if using order books)
- survivorship bias (dead coins)
- data snooping (too many features, too little signal)
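The fee/slippage point is worth making concrete: a strategy that looks profitable gross can be unprofitable net. A minimal sketch with illustrative (not venue-specific) cost levels:

```python
def net_return(gross_return, fee_bps=10, slippage_bps=5, round_trip=True):
    """Deduct fees and slippage (basis points per side) from a gross
    per-trade return. Cost assumptions here are illustrative."""
    sides = 2 if round_trip else 1
    cost = sides * (fee_bps + slippage_bps) / 10_000
    return gross_return - cost

net = net_return(0.01)   # 1% gross minus 30 bps of round-trip costs
```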
The biggest data quality problems in crypto (and how AI can fail)
Fake volume and manipulated microstructure
If volume is inflated, a model can "learn" patterns that are artifacts. This is why venue selection and cross-validation across exchanges matter. (Investopedia)
Exchange fragmentation
The “true price” differs across venues; arbitrage can be slow during stress.
On-chain interpretation risk
Wallet labeling, entity clustering, and exchange flow estimates vary by provider and methodology. (Glassnode)
Non-stationarity
Crypto regimes change fast (new narratives, new market structure, new regulations). A model trained on 2021 may fail badly in 2022–2023 style conditions.
Practical “data stack” examples (what many AI systems actually use)
Starter stack (simpler swing-trading research)
- OHLCV (multiple timeframes)
- derivatives: funding + open interest
- one sentiment proxy (Fear & Greed or simple news sentiment)
More advanced stack (systematic + risk-aware)
- multi-exchange trades + order books
- futures/perps + options IV/skew
- on-chain netflows + activity metrics
- NLP embeddings from curated news feed
- macro cross-asset returns/regime features
Key takeaway
AI analyzes crypto best when it treats the market as a multi-signal system: price action, liquidity, leverage positioning, blockchain flows, and attention/sentiment all provide different pieces of information. The “secret sauce” is usually not a single indicator—it’s data quality + alignment + robust validation across market regimes.
References (sources used)
- CoinDesk Data – Cryptocurrency Derivatives Data (open interest, funding rate updates, tick trades). (data.coindesk.com)
- CoinMarketCap – Derivatives market metrics (open interest, funding rates, derivatives volumes). (CoinMarketCap)
- Kaiko – Level 1 & Level 2 market data (order books, liquidity, spot/derivatives coverage). (Kaiko)
- Binance Square / Academy – On-chain analysis and on-chain metrics examples (active addresses, transactions, hashrate, etc.). (Binance)
- Glassnode – On-chain market intelligence platform and metrics access. (Glassnode)
- CryptoPanic – Developer API for crypto news and sentiment feeds. (CryptoPanic)
- Research on sentiment/attention data:
  - Kraaijeveld & De Smedt (2020) on Twitter sentiment predictive power for crypto returns. (ScienceDirect)
  - Arratia et al. (PDF) and other studies exploring Google Trends as a predictor/feature. (UPCommons)
- Alternative.me + Kraken explainer – Crypto Fear & Greed Index concept and aggregation of inputs. (Alternative.me)
- Investopedia – volume/market indicators and concerns about manipulated/fake volume. (Investopedia)