CryptoNER — XLM-RoBERTa

A multilingual (English + Chinese) named entity recognition model fine-tuned on crypto/finance news. It recognizes five entity types relevant to the blockchain and digital asset space.

Labels

| Label | Description | Examples |
| --- | --- | --- |
| `EXCHANGE` | Centralized crypto trading platforms (CEX) | Binance, Coinbase, OKX, Bybit |
| `ORG` | Companies, funds, banks, regulators, government agencies | BlackRock, SEC, Federal Reserve, a16z |
| `PERSON` | Named individuals | Vitalik Buterin, Michael Saylor, CZ |
| `COUNTRY` | Countries, regions, geopolitical blocs | United States, EU, Singapore, 中国 |
| `PROJECT` | Blockchain networks, L1/L2 chains, DeFi protocols | Ethereum, Solana, Uniswap, Aave |

Note: Cryptocurrency tickers (BTC, ETH, USDT) are not extracted by this model — they are handled separately via symbol matching.
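
The symbol-matching step is not described further in this card. As a rough illustration only, a minimal sketch of whitelist-based ticker matching might look like the following. The `KNOWN_TICKERS` set, the regular expression, and the `find_tickers` helper are all hypothetical and are not part of the model or its pipeline.

```python
import re

# Hypothetical ticker whitelist; a real system would load a maintained list.
KNOWN_TICKERS = {"BTC", "ETH", "USDT"}

# Candidate symbols: 2-10 uppercase letters/digits, optionally prefixed with "$",
# and not glued to other ASCII letters or digits on either side.
TICKER_PATTERN = re.compile(r"(?<![A-Za-z0-9])\$?([A-Z][A-Z0-9]{1,9})(?![A-Za-z0-9])")


def find_tickers(text):
    """Return (ticker, start, end) tuples for whitelisted symbols found in text."""
    matches = []
    for m in TICKER_PATTERN.finditer(text):
        symbol = m.group(1)
        if symbol in KNOWN_TICKERS:
            matches.append((symbol, m.start(1), m.end(1)))
    return matches


text = "BTC fell 3% while $ETH and USDT volumes rose on Binance."
for symbol, start, end in find_tickers(text):
    print(symbol, start, end)
```

Because the boundary checks only exclude ASCII letters and digits, this sketch would also pick up a ticker written directly next to Chinese characters.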

Training

Base model: xlm-roberta-base
Task: Token classification (BIO tagging scheme)
Training samples: 7,784
Eval samples: 865
Epochs: 10
Batch size: 16
Learning rate: 2e-5
Max sequence length: 256
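
With five entity types, a standard BIO scheme would yield 11 tags: `O` plus one `B-` and one `I-` tag per type. The sketch below illustrates that tagging on whitespace-split tokens. The label order in the released model's `id2label` mapping is not stated in this card, so the ordering here is an assumption, and the `spans_to_bio` helper is hypothetical.

```python
ENTITY_TYPES = ["EXCHANGE", "ORG", "PERSON", "COUNTRY", "PROJECT"]

# Standard BIO label set: "O" plus B-/I- per type (this ordering is an assumption).
LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]


def spans_to_bio(tokens, spans):
    """Convert (start_token, end_token_exclusive, type) spans to one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"  # continuation tokens
    return tags


tokens = ["Vitalik", "Buterin", "announced", "Ethereum", "upgrades", "."]
spans = [(0, 2, "PERSON"), (3, 4, "PROJECT")]
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token:12s} {tag}")
```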

Data

Training data was collected from crypto/finance news feeds (English and Chinese) and annotated using a DeepSeek LLM teacher model via a knowledge distillation pipeline. The raw corpus contains ~8,600 news items spanning 30+ days.

Label distribution in training corpus:

| Entity type | Count | Share |
| --- | --- | --- |
| ORG | 12,881 | 43.0% |
| PROJECT | 6,343 | 21.2% |
| COUNTRY | 3,937 | 13.1% |
| PERSON | 3,871 | 12.9% |
| EXCHANGE | 2,921 | 9.8% |
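
The shares follow directly from the counts, which sum to 29,953. A short check of that arithmetic, along with the ratio between the most and least frequent types as a rough gauge of class imbalance:

```python
counts = {
    "ORG": 12881,
    "PROJECT": 6343,
    "COUNTRY": 3937,
    "PERSON": 3871,
    "EXCHANGE": 2921,
}

total = sum(counts.values())
# Percentage share of each entity type, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:10s} {n:>7,d} {shares[label]:5.1f}%")

# Ratio of the most to the least frequent type.
imbalance = max(counts.values()) / min(counts.values())
print(f"total={total:,d} max/min ratio={imbalance:.2f}")
```

ORG appears roughly 4.4 times as often as EXCHANGE, which may be worth keeping in mind when reading per-type metrics.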

Usage

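# Quick start with the high-level pipeline API
from transformers import pipeline

# aggregation_strategy="simple" merges subword tokens into whole entity spans
ner = pipeline(
    "token-classification",
    model="ethanzhrepo/cryptoner-xlm-roberta",
    aggregation_strategy="simple",
)

results = ner("Binance and Coinbase are facing scrutiny from the SEC in the United States.")
# Each result holds the entity text, its label, and a confidence score
for r in results:
    print(r["word"], "→", r["entity_group"], f"({r['score']:.2f})")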
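
The pipeline returns a list of dicts with `entity_group`, `word`, `score`, `start`, and `end` keys. As a sketch of one way to post-process that output, the helper below drops low-confidence mentions and groups the rest by type. The `group_entities` helper and the 0.8 threshold are hypothetical choices, and the sample records use made-up scores rather than real model output.

```python
from collections import defaultdict

# Illustrative records in the shape returned with aggregation_strategy="simple";
# the scores are invented for the example, not real model output.
sample_results = [
    {"entity_group": "EXCHANGE", "word": "Binance", "score": 0.99, "start": 0, "end": 7},
    {"entity_group": "EXCHANGE", "word": "Coinbase", "score": 0.98, "start": 12, "end": 20},
    {"entity_group": "ORG", "word": "SEC", "score": 0.55, "start": 50, "end": 53},
    {"entity_group": "COUNTRY", "word": "United States", "score": 0.97, "start": 61, "end": 74},
]


def group_entities(results, min_score=0.8):
    """Group entity strings by type, skipping low-confidence and duplicate mentions."""
    grouped = defaultdict(list)
    for r in results:
        if r["score"] < min_score:
            continue  # below the (arbitrary) confidence threshold
        if r["word"] not in grouped[r["entity_group"]]:
            grouped[r["entity_group"]].append(r["word"])
    return dict(grouped)


print(group_entities(sample_results))
```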

# Lower-level usage: run the model directly and read per-token predictions
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ethanzhrepo/cryptoner-xlm-roberta")
model = AutoModelForTokenClassification.from_pretrained("ethanzhrepo/cryptoner-xlm-roberta")

inputs = tokenizer("Vitalik Buterin announced Ethereum upgrades.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label ID for each token of the single input sequence
predictions = outputs.logits.argmax(-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Print only tokens tagged as part of an entity; subword pieces appear separately
for token, pred in zip(tokens, predictions):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"{token:20s} {label}")
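
The manual loop above prints one line per subword piece. A minimal sketch of merging those pieces back into whole entities, assuming SentencePiece-style tokens where `▁` marks the start of a word and labels follow the `B-`/`I-` convention. The `merge_bio_pieces` helper is hypothetical, and the token split and labels shown are illustrative rather than actual tokenizer or model output.

```python
def merge_bio_pieces(tokens, labels):
    """Merge SentencePiece tokens with BIO labels into (text, type) entities."""
    entities = []
    current_pieces, current_type = [], None

    def flush():
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_pieces, current_type
        if current_pieces:
            text = "".join(current_pieces).replace("▁", " ").strip()
            entities.append((text, current_type))
        current_pieces, current_type = [], None

    for token, label in zip(tokens, labels):
        if label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        # Start a new entity on B- or when the type changes mid-span.
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_pieces.append(token)
    flush()
    return entities


# Illustrative pieces and labels (not real tokenizer/model output).
tokens = ["<s>", "▁Vita", "lik", "▁But", "erin", "▁announced", "▁Ethereum", "▁upgrade", "s", ".", "</s>"]
labels = ["O", "B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "O", "B-PROJECT", "O", "O", "O", "O"]
print(merge_bio_pieces(tokens, labels))
```

In practice the pipeline's `aggregation_strategy` option already performs this kind of merging, so a helper like this is mainly useful when working with raw logits.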

Limitations

Optimized for crypto/finance news; performance on general-domain text is likely to be lower
ORG is a broad catch-all: with only five labels in the taxonomy, it also covers media outlets and research divisions
DEX protocols (Uniswap, dYdX) are labeled PROJECT, not EXCHANGE
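
If a downstream application prefers to separate decentralized exchanges from other projects, the label could be remapped after inference. The following is a sketch of one possible post-processing step, not something the model does itself. The `DEX_NAMES` set, the `DEX` label, and the `relabel_dex` helper are hypothetical.

```python
# Hypothetical list of DEX protocol names maintained by the application.
DEX_NAMES = {"uniswap", "dydx"}


def relabel_dex(entities, dex_label="DEX"):
    """Return copies of the entities with known DEX protocols moved from PROJECT to dex_label."""
    relabeled = []
    for entity in entities:
        updated = dict(entity)  # avoid mutating the caller's records
        is_known_dex = entity["word"].strip().lower() in DEX_NAMES
        if entity["entity_group"] == "PROJECT" and is_known_dex:
            updated["entity_group"] = dex_label
        relabeled.append(updated)
    return relabeled


# Illustrative records in the shape of the pipeline output (scores omitted).
sample = [
    {"entity_group": "PROJECT", "word": "Uniswap"},
    {"entity_group": "PROJECT", "word": "Ethereum"},
    {"entity_group": "EXCHANGE", "word": "Binance"},
]
print(relabel_dex(sample))
```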

Model size: 0.3B params (Safetensors)
Tensor type: F32
