CryptoNER: XLM-RoBERTa

A multilingual (English + Chinese) named entity recognition model fine-tuned on crypto/finance news. Recognizes 5 entity types relevant to the blockchain and digital asset space.

Labels

| Label | Description | Examples |
|---|---|---|
| EXCHANGE | Centralized crypto trading platforms (CEX) | Binance, Coinbase, OKX, Bybit |
| ORG | Companies, funds, banks, regulators, government agencies | BlackRock, SEC, Federal Reserve, a16z |
| PERSON | Named individuals | Vitalik Buterin, Michael Saylor, CZ |
| COUNTRY | Countries, regions, geopolitical blocs | United States, EU, Singapore, 中国 (China) |
| PROJECT | Blockchain networks, L1/L2 chains, DeFi protocols | Ethereum, Solana, Uniswap, Aave |

Note: Cryptocurrency tickers (BTC, ETH, USDT) are not extracted by this model; they are handled separately via symbol matching.
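
The symbol matcher itself is not described in this card. As a rough illustration only, a dictionary-based matcher might look like the sketch below. The ticker set, the regular expression, and the `find_tickers` helper are all assumptions made for this example, not the author's implementation.

```python
import re

# Hypothetical ticker list for illustration; a real system would load its own.
KNOWN_TICKERS = {"BTC", "ETH", "USDT"}

# Candidate symbols: 2-10 uppercase letters/digits that are not glued to other
# ASCII letters or digits. Lookarounds are used instead of \b so that a ticker
# directly adjacent to Chinese characters is still found.
TICKER_PATTERN = re.compile(r"(?<![A-Za-z0-9])([A-Z][A-Z0-9]{1,9})(?![A-Za-z0-9])")


def find_tickers(text, known=KNOWN_TICKERS):
    """Return (symbol, start, end) for every known ticker found in text."""
    hits = []
    for match in TICKER_PATTERN.finditer(text):
        symbol = match.group(1)
        if symbol in known:  # drop uppercase words that are not tickers (e.g. SEC)
            hits.append((symbol, match.start(1), match.end(1)))
    return hits


print(find_tickers("BTC and ETH rallied while the SEC watched."))
```

Running the matcher alongside the NER pipeline on the same text gives ticker hits and entity spans that can be merged downstream by character offset.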

Training

  • Base model: xlm-roberta-base
  • Task: Token classification (BIO tagging scheme)
  • Training samples: 7,784
  • Eval samples: 865
  • Epochs: 10
  • Batch size: 16
  • Learning rate: 2e-5
  • Max sequence length: 256
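
To make the setup above concrete, the sketch below derives the BIO tag set implied by the five entity types and restates the hyperparameters as a plain dictionary. The label ordering, the dictionary key names, and the step arithmetic (single device, no gradient accumulation, final partial batch kept) are assumptions for illustration; the authoritative label mapping is the one stored in the published model's `config.id2label`.

```python
import math

ENTITY_TYPES = ["EXCHANGE", "ORG", "PERSON", "COUNTRY", "PROJECT"]

# BIO scheme: "O" for non-entity tokens, plus a B- (begin) and I- (inside)
# tag per entity type, i.e. 1 + 2 * 5 = 11 labels.
# The ordering here is an assumption and may differ from the released model.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

# Values come from the list above; the key names are illustrative.
HYPERPARAMS = {
    "base_model": "xlm-roberta-base",
    "num_train_epochs": 10,
    "train_batch_size": 16,
    "learning_rate": 2e-5,
    "max_seq_length": 256,
}

TRAIN_SAMPLES = 7784
# Optimizer steps, assuming one device and no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / HYPERPARAMS["train_batch_size"])
total_steps = steps_per_epoch * HYPERPARAMS["num_train_epochs"]

print(len(LABELS), steps_per_epoch, total_steps)
```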

Data

Training data was collected from English and Chinese crypto/finance news feeds and annotated by a DeepSeek LLM acting as the teacher model in a knowledge distillation pipeline. The raw corpus contains ~8,600 news items spanning 30+ days.
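
The format of the teacher model's annotations is not documented here. As a minimal sketch, assuming the teacher returns character-level spans as `(start, end, label)` tuples with an exclusive end, they could be converted to word-level BIO tags as follows. The `spans_to_bio` helper and the whitespace tokenization are hypothetical simplifications; the actual pipeline would need to align labels with XLM-RoBERTa subword tokens, and whitespace splitting does not segment Chinese text.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-level entity spans into per-word BIO tags.

    spans: list of (start, end, label) tuples, end exclusive (assumed format).
    Returns (words, tags) with one tag per whitespace-separated word.
    """
    words = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(words)
    for start, end, label in spans:
        inside = False
        for i, (_, word_start, word_end) in enumerate(words):
            # A word belongs to the span if the two character ranges overlap.
            if word_start < end and word_end > start:
                tags[i] = f"{'I' if inside else 'B'}-{label}"
                inside = True
    return [word for word, _, _ in words], tags


text = "Vitalik Buterin announced Ethereum upgrades."
words, tags = spans_to_bio(text, [(0, 15, "PERSON"), (26, 34, "PROJECT")])
for word, tag in zip(words, tags):
    print(f"{word:12s} {tag}")
```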

Label distribution in training corpus:

| Entity type | Count | Share |
|---|---|---|
| ORG | 12,881 | 43.0% |
| PROJECT | 6,343 | 21.2% |
| COUNTRY | 3,937 | 13.1% |
| PERSON | 3,871 | 12.9% |
| EXCHANGE | 2,921 | 9.8% |
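
The shares follow directly from the counts. The short check below recomputes them and also reports the ratio between the most and least frequent types, which gives a feel for the class imbalance; the variable names are illustrative.

```python
# Entity counts as reported in the table above.
counts = {
    "ORG": 12881,
    "PROJECT": 6343,
    "COUNTRY": 3937,
    "PERSON": 3871,
    "EXCHANGE": 2921,
}

total = sum(counts.values())
# Percentage share of each entity type, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
# Ratio between the most and least frequent entity types.
imbalance = max(counts.values()) / min(counts.values())

for label, n in counts.items():
    print(f"{label:9s} {n:6d} {shares[label]:5.1f}%")
print(f"total={total}, most/least frequent ratio={imbalance:.2f}")
```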

Usage

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ethanzhrepo/cryptoner-xlm-roberta",
    aggregation_strategy="simple",
)

results = ner("Binance and Coinbase are facing scrutiny from the SEC in the United States.")
for r in results:
    print(r["word"], "→", r["entity_group"], f"({r['score']:.2f})")

To inspect raw per-token predictions instead, load the tokenizer and model directly:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ethanzhrepo/cryptoner-xlm-roberta")
model = AutoModelForTokenClassification.from_pretrained("ethanzhrepo/cryptoner-xlm-roberta")

inputs = tokenizer("Vitalik Buterin announced Ethereum upgrades.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits.argmax(-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"{token:20s} {label}")
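
The loop above prints one label per subword token, so a name such as "Vitalik Buterin" can appear as several pieces. A minimal sketch of merging those pieces back into entity spans is shown below, roughly what `aggregation_strategy="simple"` does inside the pipeline. The `group_entities` helper is hypothetical, the example tokenization is illustrative rather than actual tokenizer output, and the code assumes continuation subwords carry an `I-` tag of the same type.

```python
WORD_START = "\u2581"  # SentencePiece marker that prefixes the first piece of a word
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def group_entities(tokens, labels):
    """Merge per-token BIO labels into (text, entity_type) spans."""
    entities = []
    current_pieces, current_type = [], None

    def flush():
        # Close the span being built, if any, and reset the accumulator.
        nonlocal current_pieces, current_type
        if current_pieces:
            text = "".join(current_pieces).replace(WORD_START, " ").strip()
            entities.append((text, current_type))
        current_pieces, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS or label == "O":
            flush()
            continue
        prefix, etype = label.split("-", 1)
        # A B- tag or a change of entity type starts a new span.
        if prefix == "B" or etype != current_type:
            flush()
            current_type = etype
        current_pieces.append(token)
    flush()
    return entities


# Illustrative tokens and labels; real output depends on the tokenizer and model.
tokens = ["<s>", "\u2581Vita", "lik", "\u2581But", "erin", "\u2581announced", "\u2581Ethereum", "</s>"]
labels = ["O", "B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "O", "B-PROJECT", "O"]
print(group_entities(tokens, labels))
```

With the earlier snippet, `tokens` and `[model.config.id2label[p.item()] for p in predictions]` can be passed in directly.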

Limitations

  • Optimized for crypto/finance news; expect lower accuracy on general-domain text
  • ORG is a broad catch-all: because the taxonomy has only 5 labels, it also absorbs media outlets and research divisions
  • DEX protocols (Uniswap, dYdX) are labeled PROJECT, not EXCHANGE