TripoSR iOS (ONNX)

Single image → 3D mesh, on your iPhone.

The ONNX-converted encoder from TripoSR by Stability AI × Tripo AI, optimized for on-device inference.


- Parameters: 419M
- Model size: 1.6 GB
- Inference: < 0.5 s (A100)
- Format: ONNX
- License: MIT

Demo

Input photo → 3D output.

Benchmarks

Evaluated on GSO and OmniObject3D datasets. Results from the TripoSR paper.

Benchmark charts (data reproduced in the full results tables below):

- F-Score @ 0.1 (higher is better)
- Chamfer Distance (lower is better)
- F-Score across thresholds
- Cross-dataset comparison (GSO vs. OmniObject3D)

Full Results Table

GSO Dataset

| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|---|---|---|---|---|
| One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
| OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
| ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
| TGS | 0.122 | 0.637 | 0.846 | 0.968 |
| TripoSR | 0.111 | 0.651 | 0.871 | 0.980 |

OmniObject3D Dataset

| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|---|---|---|---|---|
| One-2-3-45 | 0.197 | 0.445 | 0.698 | 0.907 |
| ZeroShape | 0.144 | 0.507 | 0.786 | 0.968 |
| OpenLRM | 0.155 | 0.486 | 0.759 | 0.959 |
| TGS | 0.142 | 0.602 | 0.818 | 0.949 |
| TripoSR | 0.102 | 0.677 | 0.890 | 0.986 |

Architecture

One forward pass: no diffusion, no iterative denoising.

```mermaid
graph LR
    A["Input Image<br/>(512x512)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
    B --> C["Transformer Decoder<br/>+ Cross-Attention"]
    C --> D["Post Processor<br/>Triplane Features"]
    D --> E["Marching Cubes<br/>3D Mesh"]

    style A fill:#4a9eff,stroke:#30363d,color:#fff
    style B fill:#7c3aed,stroke:#30363d,color:#fff
    style C fill:#7c3aed,stroke:#30363d,color:#fff
    style D fill:#7c3aed,stroke:#30363d,color:#fff
    style E fill:#3fb950,stroke:#30363d,color:#fff
```
| Component | Parameters | Role |
|---|---|---|
| DINO ViT-B/16 | ~86M | Pretrained image encoder |
| Transformer Decoder | ~268M | Cross-attention to image tokens |
| Triplane Post-Processor | ~65M | Tokens → triplane features (3x40x64x64) |

PyTorch vs. This Model

| | Original | This Conversion |
|---|---|---|
| Format | PyTorch | ONNX |
| Size | ~3 GB+ | 1.6 GB |
| Runs on | GPU server | iPhone / iPad / Mac |
| Dependencies | torch, einops, transformers | onnxruntime |
| Connectivity | Cloud API | Fully offline |

What I Learned Getting This to Work Well

Getting TripoSR to produce clean 3D meshes on a phone took more work than just converting the model to ONNX. The raw model expects a very specific kind of input: a single object, centered, on a neutral background. Feed it a raw photo and the results are pretty rough.

The biggest improvement came from stripping the background before inference. I'm using Apple's Vision framework (VNGenerateForegroundInstanceMaskRequest on iOS 17+) to automatically detect and isolate the main subject. This is the same API that powers the "lift subject from background" feature in Photos: it's fast, runs on-device, and handles edges surprisingly well. The isolated subject gets composited onto a flat gray background (RGB 0.5, 0.5, 0.5), which matches what TripoSR was trained on.
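The compositing step itself is simple once you have a mask. Here's a minimal sketch in NumPy, assuming a soft alpha mask from any segmenter (on iOS the mask comes from Vision; `composite_on_gray` is a hypothetical helper name):

```python
import numpy as np

def composite_on_gray(rgb: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Blend an RGB image (H, W, 3, float in [0, 1]) onto the flat
    0.5-gray background TripoSR expects, using a soft alpha mask (H, W)."""
    gray = np.full_like(rgb, 0.5)   # neutral background
    a = alpha[..., None]            # broadcast mask over the channel axis
    return rgb * a + gray * (1.0 - a)

# Fully transparent pixels become exactly 0.5 gray:
img = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
out = composite_on_gray(img, mask)
```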

The second big win was smart cropping and centering. After removing the background, I analyze the remaining foreground pixels to find the bounding box, then scale and center the subject so it fills roughly 85-95% of the frame. Too small and the model loses detail; too large and geometry gets clipped. The fill ratio adapts to the object's shape: tall, narrow objects get a bit more breathing room, while compact objects fill more of the frame. A small amount of padding (2-6%) prevents edge artifacts.
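As an illustration of that cropping logic, here is a Python sketch with NumPy. The `crop_and_center` helper, the aspect-ratio threshold, and the exact constants are hypothetical, chosen only to match the 85-95% fill range and 2-6% padding described above:

```python
import numpy as np

def crop_and_center(alpha: np.ndarray, pad_frac: float = 0.04):
    """Given a foreground mask (H, W), return a padded bounding box and a
    target fill ratio: elongated subjects get more breathing room."""
    ys, xs = np.nonzero(alpha > 0.5)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    h, w = y1 - y0, x1 - x0
    pad = round(pad_frac * max(h, w))          # small margin against edge artifacts
    y0, x0 = max(0, y0 - pad), max(0, x0 - pad)
    y1 = min(alpha.shape[0], y1 + pad)
    x1 = min(alpha.shape[1], x1 + pad)
    aspect = max(h, w) / max(1, min(h, w))
    fill = 0.85 if aspect > 1.6 else 0.95      # illustrative values
    return (y0, x0, y1, x1), fill

# Compact square subject: high fill ratio
square = np.zeros((64, 64), dtype=np.float32)
square[20:40, 20:40] = 1.0
box, fill = crop_and_center(square)

# Tall, narrow subject: lower fill ratio
tall = np.zeros((64, 64), dtype=np.float32)
tall[5:60, 30:34] = 1.0
_, fill_tall = crop_and_center(tall)
```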

I also added a lightweight image enhancement pipeline before inference: noise reduction, luminance sharpening, and edge smoothing after the resize. Lanczos resampling (instead of bilinear) for the 512x512 resize made a noticeable difference in preserving fine detail. All of this runs through Core Image with Metal acceleration, so it adds minimal overhead.
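The resize itself is a one-liner in Pillow, where `Image.LANCZOS` selects the Lanczos filter (the denoise, sharpen, and edge-smoothing filters run through Core Image on-device and aren't shown here; the input image is a synthetic stand-in):

```python
from PIL import Image

# Stand-in for the isolated, gray-composited subject
src = Image.new("RGB", (1024, 1024), (128, 128, 128))

# Lanczos resampling preserves fine detail better than bilinear here
resized = src.resize((512, 512), Image.LANCZOS)
```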

The full pipeline (background removal, crop, center, enhance, infer) runs entirely on-device in Haplo AI. No server, no internet required.


Quick Start

Python

```python
import onnxruntime as ort
import numpy as np
from PIL import Image

session = ort.InferenceSession(
    "triposr_encoder.onnx",
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider" on Apple hardware
)

# Preprocess: RGB, 512x512, float32 in [0, 1], NCHW layout
image = Image.open("photo.png").convert("RGB").resize((512, 512), Image.LANCZOS)
input_array = np.array(image).astype(np.float32) / 255.0
input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]  # (1, 3, 512, 512)

scene_codes = session.run(None, {"input_image": input_array})[0]
# scene_codes.shape == (1, 3, 40, 64, 64)
```
Swift (iOS)

```swift
import OnnxRuntimeBindings

// env: an ORTEnv created earlier, e.g. try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(env: env, modelPath: modelPath, sessionOptions: nil)

// imageData: NSMutableData holding 1 x 3 x 512 x 512 float32 values in [0, 1]
let inputTensor = try ORTValue(
    tensorData: imageData,
    elementType: .float,
    shape: [1, 3, 512, 512]
)

let outputs = try session.run(
    withInputs: ["input_image": inputTensor],
    outputNames: ["scene_codes"],
    runOptions: nil
)
// outputs["scene_codes"] holds the triplane features
```

Files

| File | Size | Description |
|---|---|---|
| triposr_encoder.onnx | 2.6 MB | Model graph |
| triposr_encoder.onnx.data | 1.6 GB | Weights |
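One practical note: onnxruntime resolves the external weights file relative to the graph file, so both files must sit in the same directory. A small sketch that derives the expected sibling filename (`external_weights_path` is a hypothetical helper, and the `models/` directory is an assumption for illustration):

```python
from pathlib import Path

def external_weights_path(model_path: str) -> Path:
    """ONNX external data lives next to the graph file; derive the
    conventional `<model>.onnx.data` sibling path."""
    model = Path(model_path)
    return model.parent / (model.name + ".data")

weights = external_weights_path("models/triposr_encoder.onnx")
```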

Citation

```bibtex
@article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
  author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan
          and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian
          and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
}
```
MIT License • Based on TripoSR by Stability AI × Tripo AI