---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
- web3se/SmartBERT-v2
pipeline_tag: fill-mask
tags:
- fill-mask
- smart-contract
- web3
- software-engineering
- embedding
- codebert
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---

# SmartBERT V3 CodeBERT

![SmartBERT](https://huggingface.co/web3se/SmartBERT-v2/resolve/main/framework.png)

## Overview

**SmartBERT V3** is a domain-adapted pre-trained programming language model for **smart contract code understanding**, built upon **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**. It continues pre-training from **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)** on a substantially larger corpus of smart contracts, yielding improved robustness and richer semantic representations of **function-level smart contract code**.

SmartBERT V3 is particularly suitable for tasks such as:

- Smart contract intent detection
- Code representation learning
- Code similarity analysis
- Vulnerability detection
- Smart contract classification

Compared with **SmartBERT V2**, this version significantly expands the training corpus and improves the model's ability to capture semantic patterns in smart contract functions.

---

## Training Data

SmartBERT V3 was trained on a total of **80,000 smart contracts**, including:

- **16,000 contracts** used in **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)**
- **64,000 additional smart contracts** collected from public blockchain repositories

All contracts are primarily written in **Solidity** and processed at the **function level** to better capture fine-grained semantic structures of smart contract code.

---

## Training Objective

The model is trained with the **Masked Language Modeling (MLM)** objective, following the same training paradigm as **CodeBERT**.
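As a rough illustration of this objective, the masking step can be sketched in plain Python. Note that the 15% masking rate and the `<mask>` token are assumptions mirroring the RoBERTa/CodeBERT convention, not this model's exact training recipe:

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mlm_probability=0.15, seed=0):
    """Randomly replace a fraction of tokens with the mask token (illustrative only)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(mask_token)
            labels.append(tok)    # the model must predict the original token here
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored in the MLM loss
    return masked, labels

# A whitespace-split Solidity function signature as a toy "token" sequence:
code = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
masked, labels = mask_tokens(code)
```

In the real pipeline the tokens come from the model's byte-pair tokenizer rather than whitespace splitting, and masked positions are additionally sometimes kept or replaced with random tokens, as in RoBERTa.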
During training:

- A subset of tokens in the input code is randomly masked
- The model learns to predict the masked tokens from the surrounding context

This process enables the model to learn deeper **syntactic and semantic representations** of smart contract programs.

---

## Training Setup

Training was conducted with the **HuggingFace Transformers** framework.

- **Hardware:** 2 × Nvidia A100 (80 GB)
- **Training Duration:** over **30 hours**
- **Training Dataset:** 80,000 smart contracts
- **Evaluation Dataset:** 1,500 smart contracts

Example training configuration:

```python
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were replaced with a single space to ensure a consistent input format for tokenization.

---

## Base Model

SmartBERT V3 builds upon the following models:

* **Original Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
* **Intermediate Model**: [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)

---

## Usage

Example usage with HuggingFace Transformers. Note that the input must contain the tokenizer's `<mask>` token, otherwise the fill-mask pipeline raises an error:

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')

code_example = "function totalSupply() external view <mask> (uint256);"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(code_example)
print(outputs)
```

---

## How to Use

To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).
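For the embedding-oriented tasks listed above (code representation learning, similarity analysis), the encoder's hidden states are typically pooled into a single function-level vector. The mean-pooling strategy below is a common choice but an assumption on our part, not part of the released model; the commented lines show how it would plug into the fill-mask snippet's tokenizer and checkpoint:

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real (non-padding) positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)  # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid divide-by-zero
    return summed / counts

# With the real model (assumed usage; requires downloading the checkpoint):
# from transformers import RobertaTokenizer, RobertaModel
# tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')
# encoder = RobertaModel.from_pretrained('web3se/SmartBERT-v3')
# inputs = tokenizer("function totalSupply() external view returns (uint256);",
#                    return_tensors='pt')
# hidden = encoder(**inputs).last_hidden_state.detach().numpy()
# embedding = mean_pool(hidden, inputs['attention_mask'].numpy())

# Shape check on dummy data (batch=2, seq=4, hidden=8):
hidden = np.random.default_rng(0).normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
embedding = mean_pool(hidden, mask)
```

The resulting vectors can then be compared with cosine similarity for code-similarity or clustering tasks.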
---

## Contributor

* [Youwei Huang](https://www.devil.ren)
* [Sen Fang](https://github.com/TomasAndersonFang)

---

## Citation

```tex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```

---

## Acknowledgement

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)