	---
	pipeline_tag: text-to-image
	library_name: tensorrt
	inference: false
	license: other
	license_name: stabilityai-ai-community
	license_link: LICENSE.md
	tags:
	- tensorrt
	- sd3.5-large
	- text-to-image
	- depth
	- canny
	- blur
	- controlnet
	- onnx
	- fp8
	extra_gated_prompt: >-
	  By clicking "Agree", you agree to the [License
	  Agreement](https://huggingface.co/stabilityai/stable-diffusion-3.5-large/blob/main/LICENSE.md)
	  and acknowledge Stability AI's [Privacy
	  Policy](https://stability.ai/privacy-policy).
	extra_gated_fields:
	  Name: text
	  Email: text
	  Country: country
	  Organization or Affiliation: text
	  Receive email updates and promotions on Stability AI products, services, and research?:
	    type: select
	    options:
	      - 'Yes'
	      - 'No'
	  What do you intend to use the model for?:
	    type: select
	    options:
	      - Research
	      - Personal use
	      - Creative Professional
	      - Startup
	      - Enterprise
	  I agree to the License Agreement and acknowledge Stability AI's Privacy Policy: checkbox
	language:
	- en
	---

	# Stable Diffusion 3.5 Large ControlNet TensorRT
	## Introduction

	This repository hosts the TensorRT-optimized version of Stable Diffusion 3.5 Large ControlNets, developed in collaboration between [Stability AI](https://stability.ai) and [NVIDIA](https://huggingface.co/nvidia). This implementation leverages NVIDIA's TensorRT deep learning inference library to deliver significant performance improvements while maintaining the exceptional image quality of the original model.

	Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model with improved image quality, typography, complex prompt understanding, and resource efficiency. The TensorRT optimization makes these capabilities accessible for production deployment and real-time applications.

	The following control types are available:

	- Canny - Use a Canny edge map to guide the structure of the generated image. This is especially useful for illustrations, but works with all styles.

	- Depth - Use a depth map, generated by DepthFM, to guide generation. Example use cases include generating architectural renderings and texturing 3D assets.

	- Blur - Perform extremely high-fidelity upscaling. A common use case is to tile an input image, apply the ControlNet to each tile, and merge the tiles to produce a higher-resolution image.
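The tile, process, and merge workflow described for the Blur ControlNet can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the image is a list of pixel rows, `process_tile` is a hypothetical stand-in for a call into the Blur ControlNet pipeline, and `nearest_neighbor_2x` is a placeholder so the sketch runs without a model. It assumes each processed tile comes back exactly `scale` times larger than its input, and it uses non-overlapping tiles, whereas practical tiled upscalers often overlap tiles and blend the seams.

```python
from typing import Callable, Iterator, List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values (illustrative only)


def split_into_tiles(image: Image, tile: int) -> Iterator[Tuple[int, int, Image]]:
    """Yield (top, left, pixels) for non-overlapping tiles that cover the image."""
    height, width = len(image), len(image[0])
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            # Edge tiles may be smaller than `tile` when the size is not divisible.
            pixels = [row[left:left + tile] for row in image[top:top + tile]]
            yield top, left, pixels


def upscale_tiled(image: Image, tile: int, scale: int,
                  process_tile: Callable[[Image], Image]) -> Image:
    """Apply process_tile to each tile and paste the results into a larger canvas."""
    height, width = len(image), len(image[0])
    canvas = [[0] * (width * scale) for _ in range(height * scale)]
    for top, left, pixels in split_into_tiles(image, tile):
        result = process_tile(pixels)  # hypothetical stand-in for the ControlNet call
        for dy, row in enumerate(result):
            y = top * scale + dy
            x = left * scale
            canvas[y][x:x + len(row)] = row
    return canvas


def nearest_neighbor_2x(pixels: Image) -> Image:
    """Placeholder processor: repeats every pixel 2x2 instead of running a model."""
    out: Image = []
    for row in pixels:
        wide = [value for value in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out
```

Calling `upscale_tiled(image, tile=2, scale=2, process_tile=nearest_neighbor_2x)` returns an image with twice the height and width of `image`; swapping in a real per-tile generator only requires that it honor the same size contract.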

	## Model Details

	### Model Description
	This repository holds the ONNX exports of the Depth, Canny, and Blur ControlNet models in BF16 precision. FP8 quantized models are also available for the Depth and Canny ControlNets.


	## Performance using TensorRT 10.13
	#### Depth ControlNet: Timings for 40 steps at 1024x1024


	| Accelerator | Precision | VAE Encoder | CLIP-G | CLIP-L | T5 | MMDiT x 40 | VAE Decoder | Total |
	|-------------|-----------|-------------|------------|--------------|--------------|-----------------------|---------------------|------------------------|
	| H100 | BF16 | 74.97 ms | 11.87 ms | 4.90 ms | 8.82 ms | 18839.01 ms | 117.38 ms | 19097.19 ms |
	| H100 | FP8 | 31.24 ms | 11.99 ms | 4.96 ms | 8.39 ms | 9175.53 ms | 36.36 ms | 9308.86 ms |

	#### Canny ControlNet: Timings for 60 steps at 1024x1024


	| Accelerator | Precision | VAE Encoder | CLIP-G | CLIP-L | T5 | MMDiT x 60 | VAE Decoder | Total |
	|-------------|-----------|-------------|------------|--------------|--------------|-----------------------|---------------------|------------------------|
	| H100 | BF16 | 78.50 ms | 12.29 ms | 5.08 ms | 8.65 ms | 28057.08 ms | 106.49 ms | 28306.20 ms |
	| H100 | FP8 | 31.21 ms | 12.17 ms | 4.96 ms | 8.35 ms | 13936.82 ms | 36.63 ms | 14068.32 ms |


	#### Blur ControlNet: Timings for 60 steps at 1024x1024

	| Accelerator | Precision | VAE Encoder | CLIP-G | CLIP-L | T5 | MMDiT x 60 | VAE Decoder | Total |
	|-------------|-----------|-------------|------------|--------------|--------------|-----------------------|---------------------|------------------------|
	| H100 | BF16 | 74.48 ms | 11.71 ms | 4.86 ms | 8.80 ms | 28604.26 ms | 113.24 ms | 28859.06 ms |
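To put the Depth and Canny timings above in perspective, the short sketch below derives the end-to-end FP8 speedup and the per-step MMDiT cost. The numbers are copied from the H100 tables; the dictionary layout and function names are only illustrative.

```python
# Latencies in milliseconds, copied from the H100 tables above.
TOTAL_MS = {
    ("depth", "bf16"): 19097.19,
    ("depth", "fp8"): 9308.86,
    ("canny", "bf16"): 28306.20,
    ("canny", "fp8"): 14068.32,
}

# (MMDiT time for all steps in ms, number of denoising steps)
MMDIT_MS = {
    ("depth", "bf16"): (18839.01, 40),
    ("depth", "fp8"): (9175.53, 40),
    ("canny", "bf16"): (28057.08, 60),
    ("canny", "fp8"): (13936.82, 60),
}


def fp8_speedup(control: str) -> float:
    """End-to-end BF16 latency divided by FP8 latency for one control type."""
    return TOTAL_MS[(control, "bf16")] / TOTAL_MS[(control, "fp8")]


def mmdit_ms_per_step(control: str, precision: str) -> float:
    """Average MMDiT time per denoising step in milliseconds."""
    total_ms, steps = MMDIT_MS[(control, precision)]
    return total_ms / steps


for control in ("depth", "canny"):
    print(
        f"{control}: FP8 speedup {fp8_speedup(control):.2f}x, "
        f"MMDiT {mmdit_ms_per_step(control, 'bf16'):.1f} ms/step (BF16) vs "
        f"{mmdit_ms_per_step(control, 'fp8'):.1f} ms/step (FP8)"
    )
```

On these figures FP8 is roughly twice as fast end to end for both control types, and the MMDiT denoising loop accounts for nearly all of the total latency in every row.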



	## Usage Example
	1. Clone the TensorRT repository and launch an NGC container, following the [setup instructions](https://github.com/NVIDIA/TensorRT/blob/release/sd35/demo/Diffusion/README.md).
	```shell
	git clone https://github.com/NVIDIA/TensorRT.git
	cd TensorRT
	git checkout release/sd35
	docker run --rm -it --gpus all -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 /bin/bash
	```


	2. Install the libraries and requirements.
	```shell
	cd demo/Diffusion
	source setup.sh
	```

	3. Generate a Hugging Face user access token.
	To download the Stable Diffusion 3.5 model checkpoints, request access on the [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large), [Stable Diffusion 3.5 Large Depth ControlNet](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-controlnet-depth), [Stable Diffusion 3.5 Large Canny ControlNet](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-controlnet-canny), and [Stable Diffusion 3.5 Large Blur ControlNet](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-controlnet-blur) pages.
	Then obtain a `read` access token for the Hugging Face Hub and export it as shown below. See the [instructions](https://huggingface.co/docs/hub/security-tokens).

	```bash
	export HF_TOKEN=<your access token>
	```
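Because the demo commands pass the token through `--hf-token=$HF_TOKEN`, an unset variable only shows up later as a failed download. The sketch below is one way to check for it up front. The helper name `read_hf_token` is hypothetical and not part of the demo; it relies only on the `HF_TOKEN` variable exported above.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HF_TOKEN value, or raise if it is missing or blank.

    `env` defaults to the process environment; passing a mapping makes the
    helper easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export a read access token before running the demo."
        )
    return token
```

Calling `read_hf_token()` at the start of a wrapper script fails fast with a clear message instead of a download error partway through engine setup.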

	4. Perform TensorRT-optimized inference:

	- Stable Diffusion 3.5 Large Depth ControlNet in BF16 precision

	```shell
	python3 demo_controlnet_sd35.py \
	"a photo of a man" \
	--version=3.5-large \
	--bf16 \
	--controlnet-type depth \
	--download-onnx-models \
	--denoising-steps=40 \
	--guidance-scale 4.5 \
	--build-static-batch \
	--use-cuda-graph \
	--hf-token=$HF_TOKEN
	```

	- Stable Diffusion 3.5 Large Depth ControlNet in FP8 precision

	```shell
	python3 demo_controlnet_sd35.py \
	"a photo of a man" \
	--version=3.5-large \
	--fp8 \
	--controlnet-type depth \
	--download-onnx-models \
	--denoising-steps=40 \
	--guidance-scale 4.5 \
	--build-static-batch \
	--use-cuda-graph \
	--hf-token=$HF_TOKEN
	```

	- Stable Diffusion 3.5 Large Canny ControlNet in BF16 precision

	```shell
	python3 demo_controlnet_sd35.py \
	"A Night time photo taken by Leica M11, portrait of a Japanese woman in a kimono, looking at the camera, Cherry blossoms" \
	--version=3.5-large \
	--bf16 \
	--controlnet-type canny \
	--download-onnx-models \
	--denoising-steps=60 \
	--guidance-scale 3.5 \
	--build-static-batch \
	--use-cuda-graph \
	--hf-token=$HF_TOKEN
	```

	- Stable Diffusion 3.5 Large Canny ControlNet in FP8 precision

	```shell
	python3 demo_controlnet_sd35.py \
	"A Night time photo taken by Leica M11, portrait of a Japanese woman in a kimono, looking at the camera, Cherry blossoms" \
	--version=3.5-large \
	--fp8 \
	--controlnet-type canny \
	--download-onnx-models \
	--denoising-steps=60 \
	--guidance-scale 3.5 \
	--build-static-batch \
	--use-cuda-graph \
	--hf-token=$HF_TOKEN
	```

	- Stable Diffusion 3.5 Large Blur ControlNet in BF16 precision

	```shell
	python3 demo_controlnet_sd35.py \
	"generated ai art, a tiny, lost rubber ducky in an action shot close-up, surfing the humongous waves, inside the tube, in the style of Kelly Slater" \
	--version=3.5-large \
	--bf16 \
	--controlnet-type blur \
	--download-onnx-models \
	--denoising-steps=60 \
	--guidance-scale 3.5 \
	--build-static-batch \
	--use-cuda-graph \
	--hf-token=$HF_TOKEN
	```
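The five commands above differ only in prompt, precision, control type, step count, and guidance scale, so they can be generated from one helper. The following is a sketch under stated assumptions: `build_demo_command` is a hypothetical wrapper, not part of the demo, and it emits only the flags that appear in the commands above. It rejects FP8 for Blur because the Model Description lists FP8 models for Depth and Canny only.

```python
import shlex
from typing import List

CONTROL_TYPES = {"depth", "canny", "blur"}
PRECISIONS = {"bf16", "fp8"}
FP8_CONTROL_TYPES = {"depth", "canny"}  # per the Model Description; Blur is BF16 only


def build_demo_command(prompt: str, control_type: str, precision: str,
                       steps: int, guidance_scale: float, hf_token: str) -> List[str]:
    """Build the argument list for demo_controlnet_sd35.py (hypothetical helper)."""
    if control_type not in CONTROL_TYPES:
        raise ValueError(f"unknown control type: {control_type!r}")
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    if precision == "fp8" and control_type not in FP8_CONTROL_TYPES:
        raise ValueError(f"no FP8 model is listed for the {control_type} ControlNet")
    return [
        "python3", "demo_controlnet_sd35.py", prompt,
        "--version=3.5-large",
        f"--{precision}",
        "--controlnet-type", control_type,
        "--download-onnx-models",
        f"--denoising-steps={steps}",
        "--guidance-scale", str(guidance_scale),
        "--build-static-batch",
        "--use-cuda-graph",
        f"--hf-token={hf_token}",
    ]


# Example: the Depth FP8 invocation, printed with a placeholder token.
command = build_demo_command("a photo of a man", "depth", "fp8", 40, 4.5,
                             "<your access token>")
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run(command, check=True)` from the `demo/Diffusion` directory, which avoids shell quoting issues with long prompts.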