For your mockup pipeline, the winning move is architectural.
Do not let a generative model "merge" the blank and the IP art as a single creative step. Make placement and product fidelity deterministic first. Then use generation only in a tiny, controlled way.
Background: why your current approach fails (and why it is normal)
1) Structural integrity loss is expected in diffusion-style "edits"
Even when you provide a mask, many inpainting pipelines still modify pixels outside the masked region. This is reported in Diffusers ("mask == 0 still changes") and in AUTOMATIC1111 ("unmasked area degrades/blurred"). (GitHub)
So if your rule is "the blank is sacred," a pure generative merge step works against your requirement.
2) Texture hallucination is a direct consequence of denoising
Fabric weave, glaze, and micro-specular highlights are high-frequency details. Denoising tends to repaint them into "generic plausible texture." The SDXL inpainting model notes that at strong inpainting settings, images can look less sharp even while retaining non-masked content. (Hugging Face)
3) Batch "drift" is a placement problem, not a prompt problem
If the model is deciding translation, scale, rotation, and perspective each time, you will get drift. You need template-locked transforms.
Direct answers to your three questions
Q1) Is an API-based multimodal approach fundamentally limited here?
Yes, for pixel-perfect structural control.
An API "image merge" is optimized for plausible outputs, not invariants. You cannot reliably enforce "these pixels must not change" unless you own the full pipeline and explicitly enforce it in post.
Use multimodal APIs for:
- auto-tagging, QC triage, prompt drafting, or mask suggestions
Do not use them as the core compositor if you need rigid constraints.
Q2) Would self-hosted SD + ControlNet be more reliable for preserving geometry?
More reliable, but still not a hard guarantee.
ControlNet adds strong spatial conditioning (edges, depth, etc.). That helps reduce silhouette drift. (GitHub)
But inpainting leakage can still occur. (GitHub)
So the best setup is: ControlNet + hard enforcement.
Q3) LoRA and img2img workflows for texture fidelity?
LoRA is a "style consistency" lever, not a "pixel invariance" lever.
It can help keep a consistent "catalog look" across a product line, but it will not reliably stop geometry drift or protect micro-texture by itself. If the model is allowed to regenerate fabric or gloss, it will.
The texture-fidelity lever is: do not regenerate the blank texture at all. Preserve it deterministically.
The production-grade solution: "Deterministic compositor + optional local harmonizer + hard restore"
Core principle
The product blank is immutable outside the print zone.
This is exactly the kind of "latent mask is not pixel-equivalent" failure mode that recent work like PELC is about. (arXiv)
Translation: do not trust latent-space masking to protect pixels. Enforce it in pixel space.
Step-by-step pipeline that fits your products (mugs, tees, cases)
Step 1: Create a SKU-angle template pack (one-time per SKU and camera angle)
For each SKU and each angle you sell, store:
- Print mask: where ink may exist
- Occlusion mask: handle, seams, folds, straps
- Surface mapping from flat art space to image pixels
  - homography for near-planar areas (phone case backs, many tee chest shots)
  - cylindrical or piecewise warp for mugs
- Optional control maps derived from the blank
  - canny edges for silhouette guidance
  - depth map if you use depth ControlNet
This eliminates your "long-term memory" problem because coordinates are no longer guessed.
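One way to make the template pack concrete is a small record per (SKU, angle). Here is a minimal sketch in Python; every field name is an illustrative assumption, not an existing schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class SkuAngleTemplate:
    """One record per (SKU, camera angle). Field names are illustrative."""
    sku: str
    angle: str                                   # e.g. "front", "three_quarter"
    blank_path: str                              # canonical blank photo for this angle
    print_mask: np.ndarray                       # uint8 HxW, 255 where ink may exist
    occlusion_mask: np.ndarray                   # uint8 HxW, 255 where handle/seams/straps cover the print
    homography: Optional[np.ndarray]             # 3x3 flat-art -> image transform for near-planar zones
    warp_maps: Optional[Tuple[np.ndarray, np.ndarray]]  # (map_x, map_y) for cv2.remap on curved zones
    canny_control: Optional[np.ndarray]          # optional ControlNet edge map derived from the blank
    depth_control: Optional[np.ndarray]          # optional ControlNet depth map derived from the blank
```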
Step 2: Deterministic warp + compositing (no diffusion yet)
Take the IP art. Warp it onto the print zone with your template.
Then do a physically plausible "print-on-surface" composite:
- Preserve blank shading by modulating the art with local luminance
- Preserve highlights (especially on glossy cases) by reapplying a highlight layer from the blank over the print
- For tees, optionally simulate ink bleed with a tiny edge blur along the print mask boundary only
This step should already look "pretty real." It is fast, repeatable, and measurable.
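A minimal sketch of that composite with OpenCV/NumPy, assuming a planar print zone (homography) and the masks from the template above. The luminance modulation and highlight pass are simplified illustrations, not a calibrated print model:

```python
import cv2
import numpy as np

def composite_print(blank_bgr, art_bgra, homography, print_mask):
    """Warp flat art into the print zone and re-light it with the blank's shading.

    blank_bgr:  product blank photo (uint8 BGR)
    art_bgra:   flat IP art with alpha (uint8 BGRA)
    homography: 3x3 flat-art -> blank-image transform from the SKU template
    print_mask: uint8 mask, 255 where ink may exist
    """
    h, w = blank_bgr.shape[:2]

    # 1. Deterministic placement: warp the art into image space, no model involved.
    warped = cv2.warpPerspective(art_bgra, homography, (w, h))
    art_rgb = warped[:, :, :3].astype(np.float32) / 255.0
    alpha = warped[:, :, 3:].astype(np.float32) / 255.0
    alpha *= print_mask[..., None].astype(np.float32) / 255.0     # clamp ink to the print zone

    # 2. Preserve blank shading: modulate the art by the blank's low-frequency luminance.
    blank_f = blank_bgr.astype(np.float32) / 255.0
    lum = cv2.cvtColor(blank_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    shading = cv2.GaussianBlur(lum, (0, 0), sigmaX=5)[..., None]  # low-frequency light field
    lit_art = np.clip(art_rgb * (shading * 2.0), 0, 1)            # crude multiply-based re-light

    # 3. Composite, then re-apply a highlight layer (brightest speculars) from the blank.
    out = blank_f * (1 - alpha) + lit_art * alpha
    highlights = np.clip(lum[..., None] - 0.85, 0, 1) / 0.15      # keep only the brightest values
    out = np.clip(out + highlights * 0.6, 0, 1)                   # screen-like highlight pass

    return (out * 255).astype(np.uint8)
```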
Step 3: Seam blending with classical CV (optional)
If you mainly see edge seams or color mismatch, try classical blending first:
- Poisson-style blending can integrate gradients smoothly
- OpenCV exposes this family via seamlessClone and related functions (OpenCV documentation)
Pitfall: Poisson/seamlessClone can show unexpected placement shifts due to mask ROI and center interpretation. This is reported in OpenCV issues and Q&A. (GitHub)
So treat it as a tool, not a guarantee. Validate alignment.
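If you try it, the call looks roughly like the sketch below. The `center` argument is where the masked source region's bounding box gets placed in the destination, which is exactly where the reported alignment surprises come from; computing it from the mask is an assumption that you should still verify against your template:

```python
import cv2
import numpy as np

def poisson_blend(src, dst, blend_mask):
    """Poisson-blend the masked region of src into dst.

    src:        deterministic composite from Step 2 (uint8 BGR)
    dst:        original blank photo, same size (uint8 BGR)
    blend_mask: uint8, 255 over the print zone plus a small margin
    """
    # seamlessClone pastes the masked src region so that its bounding box is
    # centered at `center` in dst; derive that center from the mask itself so
    # the region stays where the template put it.
    x, y, w, h = cv2.boundingRect(blend_mask)
    center = (x + w // 2, y + h // 2)
    return cv2.seamlessClone(src, dst, blend_mask, center, cv2.NORMAL_CLONE)
```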
Step 4: Diffusion as a local harmonizer only (tiny mask, low strength)
Use SDXL inpainting only to fix what deterministic compositing cannot:
- micro edge interaction at the print boundary
- subtle contact shading at the edge
- very light specular continuity across the print on glossy materials
How to constrain it:
- Mask only the print + a thin boundary ring (not the whole product)
- Use low denoise strength (high strength repaints texture and softens detail) (Hugging Face)
- Add ControlNet Canny/Depth derived from the original blank to discourage geometry changes (GitHub)
- Use IP-Adapter to condition on the IP art appearance without letting the model "decide" placement (GitHub)
Reality check: even with this, some leakage can happen. (GitHub)
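A sketch of the boundary-ring mask plus a low-strength SDXL inpainting call with Diffusers. The checkpoint ID, ring width, strength, and file names are assumptions to tune; ControlNet and IP-Adapter conditioning are omitted for brevity, and the hard restore in Step 5 still runs afterwards:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionXLInpaintPipeline

def boundary_ring(print_mask, ring_px=12):
    """Mask covering only a thin ring around the print boundary."""
    kernel = np.ones((ring_px, ring_px), np.uint8)
    dilated = cv2.dilate(print_mask, kernel)
    eroded = cv2.erode(print_mask, kernel)
    return cv2.subtract(dilated, eroded)          # 255 only near the print edge

# print_mask comes from the SKU template; file names here are placeholders.
print_mask = cv2.imread("print_mask.png", cv2.IMREAD_GRAYSCALE)
composite = Image.open("composite.png").convert("RGB")   # Step 2 output
ring = Image.fromarray(boundary_ring(print_mask))

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",  # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")

harmonized = pipe(
    prompt="printed graphic on product, clean edge, photo",
    image=composite,
    mask_image=ring,
    strength=0.25,        # low strength: harmonize the seam, do not repaint texture
    guidance_scale=5.0,
).images[0]
# Step 5 (hard restore outside the editable region) still runs after this.
```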
Step 5: Hard restore (mandatory if you want a "pixel-perfect blank")
After diffusion returns an image:
- Final output = original blank outside editable region + edited result inside editable region
This is the enforcement step that turns a "soft constraint" model into a "hard constraint" pipeline. It is the practical answer to "inpainting changes unmasked pixels." (GitHub)
If you prototype in ComfyUI, nodes like "Overlay Inpainted Latent" exist specifically to support this blend pattern. (RunComfy)
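The enforcement step itself is a few lines of NumPy; a sketch, assuming `editable_mask` marks the print zone plus the boundary ring:

```python
import numpy as np

def hard_restore(original_blank, edited, editable_mask):
    """Copy the edited result only inside the editable region; everything
    else is byte-for-byte the original blank."""
    m = editable_mask[..., None] > 0
    return np.where(m, edited, original_blank)
```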
A very strong upgrade: per-pixel edit strength for boundary-only edits
Your ideal behavior is:
- boundary changes allowed
- interior texture of the blank must not change
Differential Diffusion is designed for this. It introduces a "change map" so each pixel can have a different strength. (arXiv)
In practice for mockups:
- Make the change map high on a 5-30 px ring around the print boundary
- Make it near-zero inside the protected blank texture region
- Still do hard restore outside the overall editable region
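One way to build such a change map from the print mask is a distance transform, so edit strength falls off smoothly from the boundary. Ring width and the falloff are assumptions to tune:

```python
import cv2
import numpy as np

def change_map(print_mask, ring_px=20):
    """Per-pixel edit strength in [0, 1]: high on a ring around the print
    boundary, near zero over the protected blank texture."""
    # Distance (in px) of every pixel from the print boundary.
    boundary = cv2.Canny(print_mask, 100, 200)                 # 255 on the mask edge
    dist = cv2.distanceTransform(255 - boundary, cv2.DIST_L2, 5)
    # Full strength at the boundary, fading to 0 over ring_px pixels.
    return np.clip(1.0 - dist / ring_px, 0.0, 1.0).astype(np.float32)
```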
Product-specific settings
T-shirts (fabric texture priority)
- Keep diffusion mask extremely tight
- Do more âprint realismâ via deterministic shading + highlight preservation
- Only let diffusion touch the boundary ring, never the entire shirt body
- Consider Differential Diffusion for boundary-only adaptation (arXiv)
Mugs (geometry and wrap priority)
- Use a cylindrical warp template for each mug angle (see the sketch after this list)
- Use ControlNet Canny from the blank to protect silhouette (GitHub)
- Diffusion only for boundary shading continuity
- Hard restore everything else
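A rough version of that cylindrical mapping with cv2.remap, assuming a vertical mug axis, a roughly front-on photo, and a simple orthographic cylinder model. Print-zone bounds and the wrap angle are placeholders you would calibrate per template:

```python
import cv2
import numpy as np

def cylindrical_maps(art_w, art_h, x0, x1, y0, y1, wrap_deg=160.0):
    """Build (map_x, map_y) so cv2.remap pulls flat art onto a front-on cylinder.

    x0..x1, y0..y1: print-zone bounds in the blank photo (pixels)
    wrap_deg: degrees of cylinder arc the visible print zone covers
    """
    cx = (x0 + x1) / 2.0
    half_width = (x1 - x0) / 2.0
    half_wrap = np.deg2rad(wrap_deg) / 2.0        # max angle from the cylinder's front

    xs = np.arange(x0, x1)
    u = np.clip((xs - cx) / half_width, -1.0, 1.0)
    theta = np.arcsin(u * np.sin(half_wrap))      # projected x -> angle on the cylinder
    src_x = (theta / half_wrap * 0.5 + 0.5) * (art_w - 1)

    ys = np.arange(y0, y1)
    src_y = (ys - y0) / (y1 - y0) * (art_h - 1)   # vertical stays a plain linear map

    map_x, map_y = np.meshgrid(src_x, src_y)      # shape: (y1 - y0, x1 - x0)
    return map_x.astype(np.float32), map_y.astype(np.float32)

# Example: warped = cv2.remap(art, map_x, map_y, cv2.INTER_LINEAR), pasted at (x0, y0).
```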
Phone cases (specular highlight priority)
- Treat highlights as part of the blank "lighting pass"
- Reapply blank specular layer after compositing and after diffusion (if you do diffusion at all)
- Avoid diffusion repainting the highlight field
Masking and template creation at scale (so this is not manual forever)
To scale SKU template creation:
- Use SAM for fast mask bootstrap (GitHub); see the sketch after this list
- Use Grounded-Segment-Anything when you want text-guided segmentation (faster batch template creation) (GitHub)
- If edge masks need extra precision, HQ-SAM is a common refinement approach (GitHub)
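For the SAM bootstrap, a minimal sketch with the segment_anything package; the checkpoint path, image, and prompt point are placeholders:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
predictor = SamPredictor(sam)

blank = cv2.cvtColor(cv2.imread("mug_front.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(blank)

# One positive click roughly in the middle of the printable area (placeholder coords).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 400]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print_mask = (masks[np.argmax(scores)] * 255).astype(np.uint8)  # best mask as uint8
```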
What I would implement first (minimal pivots, maximum leverage)
Phase 1: Stop drift and preserve blanks
- Build SKU-angle templates with deterministic warps
- Deterministic compositing only (no diffusion yet)
- Add QA gates (first gate sketched below):
  - pixel delta outside editable region must be 0
  - silhouette edge match in protected region
This alone should remove geometry warps and batch drift.
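The first QA gate is a one-line assertion over the protected pixels; a sketch, assuming `editable_mask` marks the region the compositor or diffusion was allowed to touch:

```python
import numpy as np

def assert_blank_untouched(original_blank, output, editable_mask):
    """Fail the batch item if any pixel outside the editable region changed."""
    protected = editable_mask == 0
    delta = np.abs(output.astype(np.int16) - original_blank.astype(np.int16))
    assert delta[protected].max() == 0, "blank modified outside editable region"
```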
Phase 2: Add âlocal harmonizerâ for realism
- Add SDXL inpainting with:
  - boundary-only mask
  - low strength
  - ControlNet Canny/Depth (GitHub)
  - IP-Adapter reference conditioning (GitHub)
- Hard restore outside editable region (GitHub)
Phase 3: Boundary-only precision upgrade
- Add Differential Diffusion change maps for even tighter "texture protection" (arXiv)
When to consider âobject insertionâ research systems (AnyDoor, InsertDiffusion, GLIGEN)
These are useful if you want the model to do more of the integration work, but they are not usually necessary for your constrained "decal on rigid product" task.
- AnyDoor focuses on object-level relocation/insertion with optional shape controls (GitHub)
- InsertDiffusion focuses on identity-preserving insertion in a training-free architecture (GitHub)
- GLIGEN gives grounded control via boxes and phrases (GitHub)
For e-commerce mockups, deterministic placement beats "generative placement" almost every time.
Summary bullets
- Hosted multimodal "merge" is structurally loose. It will keep warping blanks and repainting texture.
- Self-host SDXL + ControlNet + IP-Adapter helps, but you still need hard enforcement. (GitHub)
- The reliable architecture is deterministic warp compositing → optional boundary-only diffusion → hard restore outside editable region. (GitHub)
- For maximum texture protection, use boundary-only change maps (Differential Diffusion). (arXiv)