TripoSR iOS (ONNX)
Single image → 3D mesh, on your iPhone.
The ONNX-converted encoder from TripoSR by Stability AI × Tripo AI, optimized for on-device inference.
| 419M Parameters | 1.6 GB Model Size | < 0.5s Inference (A100) | ONNX Format | MIT License |
Demo
| Input Photo | 3D Output |
|---|---|
Benchmarks
Evaluated on GSO and OmniObject3D datasets. Results from the TripoSR paper.
F-Score @ 0.1 (higher is better)
Chamfer Distance (lower is better)
F-Score Across Thresholds
Cross-Dataset Comparison
Full Results Table
GSO Dataset
| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|---|---|---|---|---|
| One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
| OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
| ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
| TGS | 0.122 | 0.637 | 0.846 | 0.968 |
| TripoSR | 0.111 | 0.651 | 0.871 | 0.980 |
OmniObject3D Dataset
Architecture
One forward pass: no diffusion, no iterative denoising.
```mermaid
graph LR
    A["Input Image<br/>(512x512)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
    B --> C["Transformer Decoder<br/>+ Cross-Attention"]
    C --> D["Post Processor<br/>Triplane Features"]
    D --> E["Marching Cubes<br/>3D Mesh"]
    style A fill:#4a9eff,stroke:#30363d,color:#fff
    style B fill:#7c3aed,stroke:#30363d,color:#fff
    style C fill:#7c3aed,stroke:#30363d,color:#fff
    style D fill:#7c3aed,stroke:#30363d,color:#fff
    style E fill:#3fb950,stroke:#30363d,color:#fff
```
| Component | Parameters | Role |
|---|---|---|
| DINO ViT-B/16 | ~86M | Pretrained image encoder |
| Transformer Decoder | ~268M | Cross-attention to image tokens |
| Triplane Post-Processor | ~65M | Tokens → triplane features (3x40x64x64) |
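The headline parameter count is simply the sum of the three components in the table, as a quick sanity check shows:

```python
# Approximate per-component parameter counts from the table above.
components = {
    "DINO ViT-B/16": 86e6,
    "Transformer Decoder": 268e6,
    "Triplane Post-Processor": 65e6,
}

total = sum(components.values())
print(f"~{total / 1e6:.0f}M parameters")  # ~419M
```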
PyTorch vs. This Model
| | Original | This Conversion |
|---|---|---|
| Format | PyTorch | ONNX |
| Size | ~3 GB+ | 1.6 GB |
| Runs on | GPU server | iPhone / iPad / Mac |
| Dependencies | torch, einops, transformers | onnxruntime |
| Connectivity | Cloud API | Fully offline |
What I Learned Getting This to Work Well
Getting TripoSR to produce clean 3D meshes on a phone took more work than just converting the model to ONNX. The raw model expects a very specific kind of input β a single object, centered, on a neutral background β and if you just feed it a raw photo, the results are pretty rough.
The biggest improvement came from stripping the background before inference. I'm using Apple's Vision framework (VNGenerateForegroundInstanceMaskRequest on iOS 17+) to automatically detect and isolate the main subject. This is the same API that powers the "lift subject from background" feature in Photos β it's fast, runs on-device, and handles edges surprisingly well. The isolated subject gets composited onto a flat gray background (RGB 0.5, 0.5, 0.5), which matches what TripoSR was trained on.
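The compositing step itself is simple alpha blending. Here's a Python sketch of it (the app does this with Vision and Core Image on-device); `composite_on_gray` is a hypothetical helper that assumes you already have a soft foreground mask, e.g. the output of `VNGenerateForegroundInstanceMaskRequest`:

```python
import numpy as np
from PIL import Image

def composite_on_gray(rgb: Image.Image, mask: Image.Image) -> Image.Image:
    """Blend the masked subject onto the flat gray (0.5, 0.5, 0.5)
    background TripoSR was trained on. `mask` is a grayscale image
    where white = subject, black = background."""
    fg = np.asarray(rgb.convert("RGB"), dtype=np.float32) / 255.0
    alpha = np.asarray(mask.convert("L"), dtype=np.float32)[..., None] / 255.0
    gray = np.full_like(fg, 0.5)              # neutral training-time background
    out = fg * alpha + gray * (1.0 - alpha)   # standard alpha blend
    return Image.fromarray((out * 255.0 + 0.5).astype(np.uint8))
```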
The second big win was smart cropping and centering. After removing the background, I analyze the remaining foreground pixels to find the bounding box, then scale and center the subject so it fills roughly 85-95% of the frame. Too small and the model loses detail; too large and geometry gets clipped. The fill ratio adapts based on the object's shape β tall/narrow objects get a bit more breathing room, compact objects fill more of the frame. A small amount of padding (2-6%) prevents edge artifacts.
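A minimal sketch of that crop logic, assuming a foreground alpha mask is available. The fixed `fill` and `pad` values here are illustrative defaults, not the app's actual adaptive constants:

```python
import numpy as np

def crop_and_center(alpha: np.ndarray, fill: float = 0.90, pad: float = 0.04):
    """Given a foreground alpha mask (H, W) in [0, 1], return a square
    crop box (left, top, right, bottom), centered on the subject, sized
    so the subject fills roughly `fill` of the frame plus `pad` padding."""
    ys, xs = np.nonzero(alpha > 0.5)
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    # Square side chosen so the subject's larger dimension fills ~`fill`.
    side = max(bottom - top, right - left) / fill * (1.0 + pad)
    cy, cx = (top + bottom) / 2, (left + right) / 2
    half = side / 2
    return (int(cx - half), int(cy - half), int(cx + half), int(cy + half))
```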
I also added a lightweight image enhancement pipeline before inference: noise reduction, luminance sharpening, and edge smoothing after the resize. Lanczos resampling (instead of bilinear) for the 512x512 resize made a noticeable difference in preserving fine detail. All of this runs through Core Image with Metal acceleration, so it adds minimal overhead.
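A rough Python stand-in for the resize-and-sharpen step (the app runs this through Core Image with Metal; Pillow's Lanczos filter and unsharp mask play the same roles here, with illustrative parameter values):

```python
from PIL import Image, ImageFilter

def prepare_input(img: Image.Image, size: int = 512) -> Image.Image:
    """Resize with Lanczos resampling (sharper than bilinear when
    downscaling), then apply a mild unsharp mask to recover edge detail."""
    out = img.convert("RGB").resize((size, size), Image.LANCZOS)
    return out.filter(ImageFilter.UnsharpMask(radius=2, percent=60, threshold=3))
```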
The full pipeline β background removal, crop, center, enhance, infer β runs entirely on-device in Haplo AI. No server, no internet required.
Quick Start
Python
```python
import onnxruntime as ort
import numpy as np
from PIL import Image

session = ort.InferenceSession(
    "triposr_encoder.onnx",
    providers=["CPUExecutionProvider"]  # or "CoreMLExecutionProvider"
)

# Normalize to [0, 1] float32, then NCHW layout: (1, 3, 512, 512)
image = Image.open("photo.png").convert("RGB").resize((512, 512))
input_array = np.array(image).astype(np.float32) / 255.0
input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

scene_codes = session.run(None, {"input_image": input_array})[0]
# scene_codes.shape == (1, 3, 40, 64, 64)
```
Swift (iOS)
```swift
import OnnxRuntimeBindings

let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(env: env, modelPath: modelPath, sessionOptions: nil)

let inputTensor = try ORTValue(
    tensorData: imageData,  // NSMutableData of Float32 pixels, CHW order
    elementType: .float,
    shape: [1, 3, 512, 512]
)

let outputs = try session.run(
    withInputs: ["input_image": inputTensor],
    outputNames: ["scene_codes"],
    runOptions: nil
)
```
Files
| File | Size | Description |
|---|---|---|
| `triposr_encoder.onnx` | 2.6 MB | Model graph |
| `triposr_encoder.onnx.data` | 1.6 GB | Weights |
Citation
```bibtex
@article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
  author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan
          and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian
          and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
}
```