CryptoNER: XLM-RoBERTa
A multilingual (English + Chinese) named entity recognition model fine-tuned on crypto/finance news. Recognizes 5 entity types relevant to the blockchain and digital asset space.
Labels
| Label | Description | Examples |
|---|---|---|
| `EXCHANGE` | Centralized crypto trading platforms (CEX) | Binance, Coinbase, OKX, Bybit |
| `ORG` | Companies, funds, banks, regulators, government agencies | BlackRock, SEC, Federal Reserve, a16z |
| `PERSON` | Named individuals | Vitalik Buterin, Michael Saylor, CZ |
| `COUNTRY` | Countries, regions, geopolitical blocs | United States, EU, Singapore, 中国 |
| `PROJECT` | Blockchain networks, L1/L2 chains, DeFi protocols | Ethereum, Solana, Uniswap, Aave |
Note: Cryptocurrency tickers (BTC, ETH, USDT) are not extracted by this model; they are handled separately via symbol matching.
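The symbol matching step sits outside the model and is not described in this card. As a minimal sketch of what such a step could look like, assuming a hand-maintained ticker set (`KNOWN_TICKERS`, `TICKER_RE`, and `find_tickers` are hypothetical names, not part of this repository), a boundary-aware regex can pick up tickers alongside the model's entities:

```python
import re

# Hypothetical ticker list; a real system would load this from a maintained source.
KNOWN_TICKERS = {"BTC", "ETH", "USDT"}

# Runs of 2-10 uppercase letters/digits that are not embedded in a longer
# ASCII alphanumeric word. CJK neighbours do not block a match.
TICKER_RE = re.compile(r"(?<![A-Za-z0-9])[A-Z][A-Z0-9]{1,9}(?![A-Za-z0-9])")


def find_tickers(text):
    """Return (ticker, start, end) tuples for known tickers found in text."""
    return [
        (m.group(), m.start(), m.end())
        for m in TICKER_RE.finditer(text)
        if m.group() in KNOWN_TICKERS
    ]


print(find_tickers("BTC rallied while ETH and the SEC made headlines."))
```

Filtering against a known set keeps uppercase acronyms such as SEC out of the ticker results, leaving them to the NER model.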
Training
- Base model: `xlm-roberta-base`
- Task: Token classification (BIO tagging scheme)
- Training samples: 7,784
- Eval samples: 865
- Epochs: 10
- Batch size: 16
- Learning rate: 2e-5
- Max sequence length: 256
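Under the BIO scheme, each of the five entity types gets a `B-` (begin) and an `I-` (inside) tag, plus a shared `O` tag for everything else, which gives 11 classes. The sketch below builds such a label set; the id ordering is an assumption, and the authoritative mapping is whatever `model.config.id2label` reports for the published checkpoint.

```python
ENTITY_TYPES = ["EXCHANGE", "ORG", "PERSON", "COUNTRY", "PROJECT"]

# "O" marks tokens outside any entity; B- opens an entity, I- continues it.
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}  # assumed ordering
id2label = {i: label for label, i in label2id.items()}

print(len(labels), labels)

# Illustrative word-level tagging of an English sentence.
words = ["Vitalik", "Buterin", "announced", "Ethereum", "upgrades", "."]
tags = ["B-PERSON", "I-PERSON", "O", "B-PROJECT", "O", "O"]
for word, tag in zip(words, tags):
    print(f"{word:10s} {tag}")
```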
Data
Training data was collected from crypto/finance news feeds (English and Chinese) and annotated using a DeepSeek LLM teacher model via a knowledge distillation pipeline. The raw corpus contains ~8,600 news items spanning 30+ days.
Label distribution in training corpus:
| Entity type | Count | Share |
|---|---|---|
| ORG | 12,881 | 43.0% |
| PROJECT | 6,343 | 21.2% |
| COUNTRY | 3,937 | 13.1% |
| PERSON | 3,871 | 12.9% |
| EXCHANGE | 2,921 | 9.8% |
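The shares follow directly from the counts. The short sketch below recomputes them and also reports the ratio between the most and least frequent types, a quick way to gauge the imbalance when reading per-type results:

```python
# Entity counts copied from the table above.
counts = {
    "ORG": 12881,
    "PROJECT": 6343,
    "COUNTRY": 3937,
    "PERSON": 3871,
    "EXCHANGE": 2921,
}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:10s} {n:6,d} {shares[label]:5.1f}%")

# Ratio of the most frequent type to the least frequent one.
imbalance = max(counts.values()) / min(counts.values())
print(f"total={total:,} imbalance={imbalance:.1f}x")
```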
Usage
```python
from transformers import pipeline

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
ner = pipeline(
    "token-classification",
    model="ethanzhrepo/cryptoner-xlm-roberta",
    aggregation_strategy="simple",
)

results = ner("Binance and Coinbase are facing scrutiny from the SEC in the United States.")
for r in results:
    print(r["word"], "→", r["entity_group"], f"({r['score']:.2f})")
```
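With `aggregation_strategy="simple"`, each result is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. A downstream step will often want these grouped by type with low-confidence spans dropped. The helper below is a minimal sketch of that; `group_entities` and its `min_score` threshold are hypothetical, and the sample list is hand-written with made-up scores rather than real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group pipeline results by entity type, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for r in results:
        if r["score"] >= min_score:
            grouped[r["entity_group"]].append(r["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output; scores are illustrative only.
sample = [
    {"entity_group": "EXCHANGE", "score": 0.99, "word": "Binance", "start": 0, "end": 7},
    {"entity_group": "EXCHANGE", "score": 0.98, "word": "Coinbase", "start": 12, "end": 20},
    {"entity_group": "ORG", "score": 0.97, "word": "SEC", "start": 50, "end": 53},
    {"entity_group": "COUNTRY", "score": 0.42, "word": "United States", "start": 61, "end": 74},
]

print(group_entities(sample, min_score=0.5))
```

A suitable threshold depends on the application; 0.5 here is only a placeholder.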
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ethanzhrepo/cryptoner-xlm-roberta")
model = AutoModelForTokenClassification.from_pretrained("ethanzhrepo/cryptoner-xlm-roberta")

inputs = tokenizer("Vitalik Buterin announced Ethereum upgrades.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring label id for each sub-word token of the single input sentence
predictions = outputs.logits.argmax(-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred in zip(tokens, predictions):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"{token:20s} {label}")
```
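The manual loop prints one line per sub-word token, so a name can appear split across several SentencePiece pieces. The sketch below shows one way to merge token-level BIO labels back into entity spans. It assumes continuation pieces carry `I-` tags and that `▁` marks the start of a new word, as in XLM-RoBERTa's tokenizer; `merge_bio` is a hypothetical helper, and the sample tokens are hand-written, so the real tokenization may differ.

```python
WORD_START = "\u2581"  # SentencePiece word-boundary marker
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def merge_bio(tokens, labels):
    """Merge SentencePiece tokens and BIO labels into (text, entity_type) spans."""
    spans = []
    current_text, current_type = "", None

    def flush():
        nonlocal current_text, current_type
        if current_type is not None:
            spans.append((current_text.strip(), current_type))
        current_text, current_type = "", None

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if label == "O":
            flush()  # an O tag closes any open entity
            continue
        prefix, entity_type = label.split("-", 1)
        piece = token.replace(WORD_START, " ")
        if prefix == "B" or entity_type != current_type:
            flush()  # a B- tag or a type change starts a new entity
            current_text, current_type = piece, entity_type
        else:
            current_text += piece
    flush()
    return spans


# Hand-written example; not actual tokenizer or model output.
tokens = ["<s>", "▁Vita", "lik", "▁But", "erin", "▁announced", "▁Ethereum", "▁upgrade", "s", ".", "</s>"]
labels = ["O", "B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "O", "B-PROJECT", "O", "O", "O", "O"]
print(merge_bio(tokens, labels))
```

For most uses the `pipeline` call with `aggregation_strategy="simple"` already performs this kind of grouping; a manual merge is mainly useful when custom span rules are needed.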
Limitations
- Optimized for crypto/finance news; performance on general-domain text will be lower
- `ORG` is a broad catch-all category that includes media outlets and research divisions due to the 5-label taxonomy
- DEX protocols (Uniswap, dYdX) are labeled `PROJECT`, not `EXCHANGE`