TripoSR iOS (ONNX)
Single image → 3D mesh, on your iPhone.
The ONNX-converted encoder from TripoSR by Stability AI × Tripo AI, optimized for on-device inference.
| 419M Parameters | 1.6 GB Model Size | < 0.5s Inference (A100) | ONNX Format | MIT License |
Demo
| Input Photo | 3D Output |
|---|---|
Benchmarks
Evaluated on GSO and OmniObject3D datasets. Results from the TripoSR paper.
F-Score @ 0.1 (higher is better)
Chamfer Distance (lower is better)
F-Score Across Thresholds
Cross-Dataset Comparison
Full Results Table
GSO Dataset
| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|---|---|---|---|---|
| One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
| OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
| ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
| TGS | 0.122 | 0.637 | 0.846 | 0.968 |
| TripoSR | 0.111 | 0.651 | 0.871 | 0.980 |
OmniObject3D Dataset
Architecture
One forward pass: no diffusion, no iterative denoising.
```mermaid
graph LR
    A["Input Image<br/>(512x512)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
    B --> C["Transformer Decoder<br/>+ Cross-Attention"]
    C --> D["Post Processor<br/>Triplane Features"]
    D --> E["Marching Cubes<br/>3D Mesh"]
    style A fill:#4a9eff,stroke:#30363d,color:#fff
    style B fill:#7c3aed,stroke:#30363d,color:#fff
    style C fill:#7c3aed,stroke:#30363d,color:#fff
    style D fill:#7c3aed,stroke:#30363d,color:#fff
    style E fill:#3fb950,stroke:#30363d,color:#fff
```
| Component | Parameters | Role |
|---|---|---|
| DINO ViT-B/16 | ~86M | Pretrained image encoder |
| Transformer Decoder | ~268M | Cross-attention to image tokens |
| Triplane Post-Processor | ~65M | Tokens → triplane features (3x40x64x64) |
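The headline parameter count is simply the sum of the three components in the table, as a quick sanity check shows:

```python
# Approximate per-component parameter counts from the table above.
components = {
    "DINO ViT-B/16": 86e6,
    "Transformer Decoder": 268e6,
    "Triplane Post-Processor": 65e6,
}

total = sum(components.values())
print(f"~{total / 1e6:.0f}M parameters")  # ~419M
```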
PyTorch vs. This Model
| | Original | This Conversion |
|---|---|---|
| Format | PyTorch | ONNX |
| Size | ~3 GB+ | 1.6 GB |
| Runs on | GPU server | iPhone / iPad / Mac |
| Dependencies | torch, einops, transformers | onnxruntime |
| Connectivity | Cloud API | Fully offline |
What I Learned Getting This to Work Well
Getting TripoSR to produce clean 3D meshes on a phone took more work than just converting the model to ONNX. The raw model expects a very specific kind of input β a single object, centered, on a neutral background β and if you just feed it a raw photo, the results are pretty rough.
The biggest improvement came from stripping the background before inference. I'm using Apple's Vision framework (VNGenerateForegroundInstanceMaskRequest on iOS 17+) to automatically detect and isolate the main subject. This is the same API that powers the "lift subject from background" feature in Photos β it's fast, runs on-device, and handles edges surprisingly well. The isolated subject gets composited onto a flat gray background (RGB 0.5, 0.5, 0.5), which matches what TripoSR was trained on.
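The compositing step itself is simple alpha blending. Here's a Python sketch of it (the app does this with Vision and Core Image on-device); `composite_on_gray` is a hypothetical helper that assumes you already have a soft foreground mask, e.g. the output of `VNGenerateForegroundInstanceMaskRequest`:

```python
import numpy as np
from PIL import Image

def composite_on_gray(rgb: Image.Image, mask: Image.Image) -> Image.Image:
    """Blend the masked subject onto the flat gray (0.5, 0.5, 0.5)
    background TripoSR was trained on. `mask` is a grayscale image
    where white = subject, black = background."""
    fg = np.asarray(rgb.convert("RGB"), dtype=np.float32) / 255.0
    alpha = np.asarray(mask.convert("L"), dtype=np.float32)[..., None] / 255.0
    gray = np.full_like(fg, 0.5)              # neutral training-time background
    out = fg * alpha + gray * (1.0 - alpha)   # standard alpha blend
    return Image.fromarray((out * 255.0 + 0.5).astype(np.uint8))
```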
The second big win was smart cropping and centering. After removing the background, I analyze the remaining foreground pixels to find the bounding box, then scale and center the subject so it fills roughly 85-95% of the frame. Too small and the model loses detail; too large and geometry gets clipped. The fill ratio adapts based on the object's shape β tall/narrow objects get a bit more breathing room, compact objects fill more of the frame. A small amount of padding (2-6%) prevents edge artifacts.
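A minimal sketch of that crop logic, assuming a foreground alpha mask is available. The fixed `fill` and `pad` values here are illustrative defaults, not the app's actual adaptive constants:

```python
import numpy as np

def crop_and_center(alpha: np.ndarray, fill: float = 0.90, pad: float = 0.04):
    """Given a foreground alpha mask (H, W) in [0, 1], return a square
    crop box (left, top, right, bottom), centered on the subject, sized
    so the subject fills roughly `fill` of the frame plus `pad` padding."""
    ys, xs = np.nonzero(alpha > 0.5)
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    # Square side chosen so the subject's larger dimension fills ~`fill`.
    side = max(bottom - top, right - left) / fill * (1.0 + pad)
    cy, cx = (top + bottom) / 2, (left + right) / 2
    half = side / 2
    return (int(cx - half), int(cy - half), int(cx + half), int(cy + half))
```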
I also added a lightweight image enhancement pipeline before inference: noise reduction, luminance sharpening, and edge smoothing after the resize. Lanczos resampling (instead of bilinear) for the 512x512 resize made a noticeable difference in preserving fine detail. All of this runs through Core Image with Metal acceleration, so it adds minimal overhead.
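A rough Python stand-in for the resize-and-sharpen step (the app runs this through Core Image with Metal; Pillow's Lanczos filter and unsharp mask play the same roles here, with illustrative parameter values):

```python
from PIL import Image, ImageFilter

def prepare_input(img: Image.Image, size: int = 512) -> Image.Image:
    """Resize with Lanczos resampling (sharper than bilinear when
    downscaling), then apply a mild unsharp mask to recover edge detail."""
    out = img.convert("RGB").resize((size, size), Image.LANCZOS)
    return out.filter(ImageFilter.UnsharpMask(radius=2, percent=60, threshold=3))
```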
The full pipeline β background removal, crop, center, enhance, infer β runs entirely on-device in Haplo AI. No server, no internet required.
Quick Start
Python
```python
import onnxruntime as ort
import numpy as np
from PIL import Image

session = ort.InferenceSession(
    "triposr_encoder.onnx",
    providers=["CPUExecutionProvider"]  # or "CoreMLExecutionProvider"
)

# Normalize to [0, 1] float32, then NCHW layout: (1, 3, 512, 512)
image = Image.open("photo.png").convert("RGB").resize((512, 512))
input_array = np.array(image).astype(np.float32) / 255.0
input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

scene_codes = session.run(None, {"input_image": input_array})[0]
# scene_codes.shape == (1, 3, 40, 64, 64)
```
Swift (iOS)
```swift
import OnnxRuntimeBindings

let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(env: env, modelPath: modelPath, sessionOptions: nil)

let inputTensor = try ORTValue(
    tensorData: imageData,  // NSMutableData of Float32 pixels, CHW order
    elementType: .float,
    shape: [1, 3, 512, 512]
)

let outputs = try session.run(
    withInputs: ["input_image": inputTensor],
    outputNames: ["scene_codes"],
    runOptions: nil
)
```
Files
| File | Size | Description |
|---|---|---|
| `triposr_encoder.onnx` | 2.6 MB | Model graph |
| `triposr_encoder.onnx.data` | 1.6 GB | Weights |
Citation
```bibtex
@article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
  author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan
          and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian
          and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
}
```