FSDP Auto Wrap does not work using `accelerate` in Multi-GPU Setup

Introduction

Hi,
I’m trying to fine-tune Llama-3-8B with the accelerate package using FSDP and flash-attention. I followed the official guide to configure accelerate with the FSDP (Fully Sharded Data Parallel) setting. However, when I rely on the model._no_split_modules option (i.e., with no fsdp_transformer_layer_cls_to_wrap defined), the following error appears: "ValueError: Could not find the transformer layer class L in the model." On the other hand, if I define fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer in the FSDP configuration file, the model printout seems to suggest that the layers are not wrapped by FSDP.

Software Info

OS
Linux

python
3.10

cuda
12.4

packages:

torch==2.4.1
transformers==4.44.2
datasets==2.21.0
accelerate==0.34.0
flash-attn==2.6.3

Accelerate config setup:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
fsdp_activation_checkpointing: false
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: [ ]
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Result of running accelerate launch

The print(model) output shows that the FSDP auto wrap was not successful and that the model layers are not distributed across GPUs:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)

Hi, I came across the same issue. Did you find a way out by any chance? Thanks!




You are mixing up two different signals.

  • The "class L" error is a real bug pattern in the “auto derive wrap classes from model._no_split_modules” path.
  • The “print(model) shows no wrapping” is often not a reliable indicator of whether FSDP is active, because (a) you might be printing before wrapping, or (b) you are expecting “layers distributed across GPUs”, which is not what FSDP does.

I will break this down in a way that maps directly to your two symptoms.


1) Why you get: ValueError: Could not find the transformer layer class L in the model

What Accelerate is doing under the hood

With fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP, Accelerate (and also PEFT’s helper) builds a list of class names to wrap like this:

  • Take model._no_split_modules if present
  • Convert it to a comma-separated string
  • split(",") to get the class-name list
  • For each class name, look up the actual Python class in the model and error if any are not found

You can see this exact join-then-split logic in a real Accelerate issue report (it quotes the relevant snippet from accelerate/utils/dataclasses.py). (GitHub)
You can also see the same pattern in PEFT’s fsdp_auto_wrap_policy, where it does:

",".join(model._no_split_modules) ... ).split(",") (Hugging Face)

How you end up searching for a class literally named "L"

That happens when model._no_split_modules is not a list, but a string.

Example of the failure mechanism:

  • Suppose model._no_split_modules == "LlamaDecoderLayer" (a string)
  • Then ",".join(model._no_split_modules) iterates characters and becomes:
    "L,l,a,m,a,D,e,c,o,d,e,r,L,a,y,e,r"
  • Then .split(",") yields ["L", "l", "a", ...]
  • First lookup is "L" and you get “could not find class L”

That explains your exact error message.
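
To see the mechanism in isolation, here is a minimal sketch (plain Python, no model or GPU needed; the helper name is hypothetical) of the join-then-split pattern applied to a list versus a string:

# Minimal reproduction of the join-then-split behavior quoted above (hypothetical helper name).
def derive_wrap_class_names(no_split_modules):
    return ",".join(no_split_modules).split(",")

# Expected case: a list of class names.
print(derive_wrap_class_names(["LlamaDecoderLayer"]))
# -> ['LlamaDecoderLayer']

# Failure case: a plain string. A str is iterable, so join() walks its characters.
print(derive_wrap_class_names("LlamaDecoderLayer"))
# -> ['L', 'l', 'a', 'm', 'a', 'D', 'e', 'c', 'o', 'd', 'e', 'r', 'L', 'a', 'y', 'e', 'r']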

Why model._no_split_modules becomes a string in real projects

Common causes:

  • Some wrappers (custom code, older utilities, or accidental assignment) override _no_split_modules.
  • You are not wrapping the base Transformers model, but a wrapper around it (PEFT LoRA model, TRL wrapper, custom module) and the attribute is forwarded incorrectly.

This pattern is common enough that people report closely related “could not find transformer layer class to wrap” errors in Accelerate issues. (GitHub)

Fix for the "L" error (robust and minimal)

Right before you call accelerator.prepare(...) (or right before Trainer starts wrapping), force _no_split_modules into the expected type:

# Put this right before FSDP wrapping happens
ns = getattr(model, "_no_split_modules", None)

if isinstance(ns, str):
    model._no_split_modules = [ns]

# Optional: filter out names that are not found in this exact model variant
from accelerate.utils.dataclasses import get_module_class_from_name
if isinstance(getattr(model, "_no_split_modules", None), (list, tuple)):
    model._no_split_modules = [
        n for n in model._no_split_modules
        if get_module_class_from_name(model, n) is not None
    ]

The “filter out missing names” workaround is explicitly recommended in an Accelerate issue because some models define _no_split_modules entries that are not present in all variants. (GitHub)
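
To sanity-check a single class name against your exact model object, you can reuse the same helper imported above (a small sketch; it assumes model is already loaded):

from accelerate.utils.dataclasses import get_module_class_from_name

# Returns the class object if a submodule with that class name exists in `model`, otherwise None.
layer_cls = get_module_class_from_name(model, "LlamaDecoderLayer")
print("LlamaDecoderLayer found in model:", layer_cls is not None)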


2) Why fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer “works” but print(model) looks unwrapped

Two separate points.

A) FSDP does not “distribute layers across GPUs”

FSDP shards parameters, gradients, and optimizer states across ranks. The model’s module graph still looks like the full model on every rank. That is the whole point: each rank runs the full forward/backward, but holds only its parameter shards outside of compute. (PyTorch Documentation)

So even with correct wrapping, you will still see:

  • ModuleList(0-31): 32 x LlamaDecoderLayer(...)

That is normal. It is not pipeline parallelism.

B) print(model) is only meaningful if you print the wrapped object

With Accelerate, FSDP wrapping happens when you call accelerator.prepare(...) / prepare_model(...). The docs describe that the FSDP parameters are applied via the Accelerate config or plugin at prepare time. (Hugging Face)

So:

  • If you print the model before accelerator.prepare, it will be unwrapped.
  • If you are using Trainer, you often need to inspect trainer.model_wrapped (outer wrapper), not just trainer.model. Trainer docs explain model_wrapped is the most external wrapped model. (Hugging Face)

How to prove wrapping is active (better than print(model))

Use one of these objective checks:

Check 1: Is the top-level object actually FSDP?

from accelerate import Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

accelerator = Accelerator()

model = ...
model = accelerator.prepare(model)

if accelerator.is_main_process:
    print("type(model) =", type(model))
    print("is FSDP =", isinstance(model, FSDP))

Check 2: Checkpoint layout should be sharded per rank

Accelerate recommends SHARDED_STATE_DICT and shows that accelerator.save_state(...) produces per-rank shards. (Hugging Face)

If you do:

accelerator.save_state("ckpt")

you should see shard files per process under ckpt/pytorch_model_0/ and ckpt/optimizer_0/ as in the docs. (Hugging Face)

If you only get one monolithic checkpoint, you likely are not in FSDP mode.
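
For completeness, a small sketch that saves and then lists what was written (the directory name is arbitrary):

import os

accelerator.save_state("ckpt")
accelerator.wait_for_everyone()  # make sure every rank has finished writing its shard

if accelerator.is_main_process:
    for root, _dirs, files in os.walk("ckpt"):
        for name in files:
            # With SHARDED_STATE_DICT you should see per-rank shard files here,
            # not a single monolithic pytorch_model.bin / model.safetensors.
            print(os.path.join(root, name))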


3) Your config has one high-probability pitfall: key names and nesting

A) Make sure fsdp_config: is actually a nested mapping

In your pasted YAML, everything after fsdp_config: is aligned with it (no indentation). That would make fsdp_config empty and all the fsdp keys become top-level, which Accelerate will not read as FSDP config.

Correct structure is:

distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  # ...

B) Your config key fsdp_backward_prefetch vs fsdp_backward_prefetch_policy

The Accelerate FSDP guide documents the config option as fsdp_backward_prefetch_policy. (Hugging Face)
The Accelerate CLI docs also list --fsdp_backward_prefetch_policy. (Hugging Face)

But many older configs and issues show fsdp_backward_prefetch. (GitHub)

Best practice:

  • Run accelerate config update on your config file so keys match your installed Accelerate version. The CLI explicitly supports updating configs. (Hugging Face)

C) Make sure you are editing the config file you are actually using

Accelerate stores the default config in the HF cache location unless you pass --config_file. The CLI docs spell out the default path behavior. (Hugging Face)

Concrete fix:

  • Launch with accelerate launch --config_file /path/to/your.yaml ...
  • Or check accelerate env output and confirm it points at the same file you edited.
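
A quick way to verify both the nesting from (A) and that you are editing the file you actually launch with from (C) is to load the YAML yourself (a sketch; the path is a placeholder):

import yaml  # PyYAML (pip install pyyaml if needed)

# Placeholder path: point this at the file you pass to --config_file,
# or at the default config under the Hugging Face cache directory.
with open("/path/to/your.yaml") as f:
    cfg = yaml.safe_load(f)

print("distributed_type:", cfg.get("distributed_type"))
print("fsdp_config is a nested mapping:", isinstance(cfg.get("fsdp_config"), dict))
print("fsdp_config keys:", list((cfg.get("fsdp_config") or {}).keys()))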

4) Additional FSDP-specific pitfalls that match “it looks like it didn’t wrap”

Pitfall: environment variables override what you think you configured

People often export FSDP_TRANSFORMER_CLS_TO_WRAP / FSDP_AUTO_WRAP_POLICY in their shell or job scripts. In a recent Accelerate issue, the user sets these env vars and gets “could not find transformer layer class …” when preparing a model that does not contain the specified classes. (GitHub)

Action:

  • env | grep -E 'FSDP_|ACCELERATE_USE_FSDP' and unset anything stale.
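
If you prefer to check from inside the training script, the equivalent sketch in Python:

import os

# Print any FSDP-related environment variables that could override the YAML config.
for key, value in sorted(os.environ.items()):
    if key.startswith("FSDP_") or key == "ACCELERATE_USE_FSDP":
        print(f"{key}={value}")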

Pitfall: fsdp_cpu_ram_efficient_loading: true requires init order discipline

Accelerate docs note that RAM-efficient loading requires the distributed process group to be initialized before calling from_pretrained, and also requires fsdp_sync_module_states=True. (Hugging Face)

If you load the model before Accelerate initializes distributed, you can get inconsistent weights or confusing behavior.

Safe pattern in an Accelerate-native script:

  • Create Accelerator() first
  • Then call from_pretrained(...)
  • Then call accelerator.prepare(model, ...)
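
A minimal sketch of that order (the model id and dtype below are placeholders; adjust to your setup):

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# 1) Create the Accelerator first so the distributed process group is initialized
#    before the RAM-efficient from_pretrained() call.
accelerator = Accelerator()

# 2) Load the model only after distributed init. The checkpoint id is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# 3) Let Accelerate apply the FSDP wrapping defined in the YAML config.
model = accelerator.prepare(model)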

5) What I think is happening in your specific case

Highest probability, based on your exact error string and your output:

  1. When relying on _no_split_modules, your model._no_split_modules is a string, not a list. That produces the per-character join and makes the first “class name” equal to "L". (Hugging Face)
  2. When you explicitly set fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer, you avoid the _no_split_modules bug, but you are concluding “not wrapped” from print(model). That is unreliable unless you printed after accelerator.prepare and inspected the wrapped object. Also, even correctly wrapped, the module graph still shows all 32 decoder layers because that is normal for FSDP. (PyTorch Documentation)
  3. There may also be a YAML nesting mistake (your pasted config shows fsdp_config: with no indented children), which would make Accelerate ignore all FSDP options entirely. This would fully explain “nothing wraps” even when you think you set LlamaDecoderLayer.


Bullet summary

  • "class L" strongly indicates model._no_split_modules is a string, causing character-wise join then split. Fix by forcing it to a list and optionally filtering missing names. (Hugging Face)
  • print(model) is not enough. Print after accelerator.prepare and check isinstance(model, FSDP). Also remember FSDP shards parameters, not “layers across GPUs”. (PyTorch Documentation)
  • Verify your YAML nesting under fsdp_config: and consider accelerate config update to normalize key names like fsdp_backward_prefetch_policy. (Hugging Face)