You are mixing up two different signals.
- The "class L" error is a real bug pattern in the “auto derive wrap classes from model._no_split_modules” path.
- The “print(model) shows no wrapping” is often not a reliable indicator of whether FSDP is active, because (a) you might be printing before wrapping, or (b) you are expecting “layers distributed across GPUs”, which is not what FSDP does.
I will break this down in a way that maps directly to your two symptoms.
1) Why you get: ValueError: Could not find the transformer layer class L in the model
What Accelerate is doing under the hood
With fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP, Accelerate (and also PEFT’s helper) builds a list of class names to wrap like this:
- Take model._no_split_modules if present
- Convert it to a comma-separated string with ",".join(...)
- Call split(",") on that string to get the class-name list back
- For each class name, look up the actual Python class in the model and error if any are not found
You can see this exact join-then-split logic in a real Accelerate issue report (it quotes the relevant snippet from accelerate/utils/dataclasses.py). (GitHub)
You can also see the same pattern in PEFT’s fsdp_auto_wrap_policy, where it does:
",".join(model._no_split_modules) ... ).split(",") (Hugging Face)
How you end up searching for a class literally named "L"
That happens when model._no_split_modules is not a list, but a string.
Example of the failure mechanism:
- Suppose model._no_split_modules == "LlamaDecoderLayer" (a string)
- Then ",".join(model._no_split_modules) iterates over the characters and produces "L,l,a,m,a,D,e,c,o,d,e,r,L,a,y,e,r"
- Then .split(",") yields ["L", "l", "a", ...]
- The first lookup is "L", and you get “could not find class L”
That explains your exact error message.
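You can reproduce the mechanism in plain Python, no Accelerate involved:

```python
# Pure-Python illustration: a str iterates character by character under ",".join
no_split = "LlamaDecoderLayer"          # should have been ["LlamaDecoderLayer"]
names = ",".join(no_split).split(",")
print(names[:4])                        # ['L', 'l', 'a', 'm'] -> the first lookup is "L"
```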
Why model._no_split_modules becomes a string in real projects
Common causes:
- Some wrappers (custom code, older utilities, or accidental assignment) override _no_split_modules.
- You are not wrapping the base Transformers model, but a wrapper around it (PEFT LoRA model, TRL wrapper, custom module) and the attribute is forwarded incorrectly.
This pattern is common enough that people report closely related “could not find transformer layer class to wrap” errors in Accelerate issues. (GitHub)
Fix for the "L" error (robust and minimal)
Right before you call accelerator.prepare(...) (or right before Trainer starts wrapping), force _no_split_modules into the expected type:
```python
# Put this right before FSDP wrapping happens (i.e. before accelerator.prepare / Trainer training)
ns = getattr(model, "_no_split_modules", None)
if isinstance(ns, str):
    model._no_split_modules = [ns]

# Optional: filter out names that are not found in this exact model variant
from accelerate.utils.dataclasses import get_module_class_from_name

if isinstance(getattr(model, "_no_split_modules", None), (list, tuple)):
    model._no_split_modules = [
        n for n in model._no_split_modules
        if get_module_class_from_name(model, n) is not None
    ]
```
The “filter out missing names” workaround is explicitly recommended in an Accelerate issue because some models define _no_split_modules entries that are not present in all variants. (GitHub)
2) Why fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer “works” but print(model) looks unwrapped
Two separate points.
A) FSDP does not “distribute layers across GPUs”
FSDP shards parameters, gradients, and optimizer states across ranks. The model’s module graph still looks like the full model on every rank. That is the whole point: each rank runs the full forward/backward, but outside of compute it holds only its shard of each parameter. (PyTorch Documentation)
So even with correct wrapping, you will still see:
(layers): ModuleList( (0-31): 32 x LlamaDecoderLayer(...) )
That is normal. It is not pipeline parallelism.
B) print(model) is only meaningful if you print the wrapped object
With Accelerate, FSDP wrapping happens when you call accelerator.prepare(...) / prepare_model(...). The docs describe that the FSDP parameters are applied via the Accelerate config or plugin at prepare time. (Hugging Face)
So:
- If you print the model before accelerator.prepare, it will be unwrapped.
- If you are using Trainer, you often need to inspect trainer.model_wrapped (the outer wrapper), not just trainer.model. The Trainer docs explain that model_wrapped is the most external wrapped model. (Hugging Face)
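A quick check on the Trainer side, assuming trainer is your Trainer instance and training has already started (Trainer only wraps the model inside train()):

```python
# Inspect the outermost wrapper that Trainer actually trains with
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

print(type(trainer.model_wrapped))              # expect an FSDP type when FSDP is active
print(isinstance(trainer.model_wrapped, FSDP))  # trainer.model stays the unwrapped model
```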
How to prove wrapping is active (better than print(model))
Use one of these objective checks:
Check 1: Is the top-level object actually FSDP?
```python
from accelerate import Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

accelerator = Accelerator()
model = ...
model = accelerator.prepare(model)

if accelerator.is_main_process:
    print("type(model) =", type(model))
    print("is FSDP =", isinstance(model, FSDP))
```
Check 2: Checkpoint layout should be sharded per rank
Accelerate recommends SHARDED_STATE_DICT and shows that accelerator.save_state(...) produces per-rank shards. (Hugging Face)
If you do:
accelerator.save_state("ckpt")
you should see shard files per process under ckpt/pytorch_model_0/ and ckpt/optimizer_0/ as in the docs. (Hugging Face)
If you only get one monolithic checkpoint, you likely are not in FSDP mode.
3) Your config has one high-probability pitfall: key names and nesting
A) Make sure fsdp_config: is actually a nested mapping
In your pasted YAML, everything after fsdp_config: is aligned with it (no indentation). That would make fsdp_config empty and all the fsdp keys become top-level, which Accelerate will not read as FSDP config.
Correct structure is:
```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  # ...
```
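To rule out the nesting problem mechanically, a quick check with PyYAML (the path is a placeholder):

```python
# Verify the FSDP keys really are nested under fsdp_config
import yaml

with open("/path/to/your.yaml") as f:
    cfg = yaml.safe_load(f)

print(type(cfg.get("fsdp_config")))  # should be <class 'dict'>, not NoneType
print((cfg.get("fsdp_config") or {}).get("fsdp_transformer_layer_cls_to_wrap"))
```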
B) Your config key fsdp_backward_prefetch vs fsdp_backward_prefetch_policy
The Accelerate FSDP guide documents the config option as fsdp_backward_prefetch_policy. (Hugging Face)
The Accelerate CLI docs also list --fsdp_backward_prefetch_policy. (Hugging Face)
But many older configs and issues show fsdp_backward_prefetch. (GitHub)
Best practice:
- Run accelerate config update on your config file so the keys match your installed Accelerate version. The CLI explicitly supports updating configs. (Hugging Face)
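For example, assuming your Accelerate version's config update subcommand accepts --config_file, as current versions do (the path is a placeholder):

```bash
# Rewrite the config file in place with key names matching the installed Accelerate version
accelerate config update --config_file /path/to/your.yaml
```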
C) Make sure you are editing the config file you are actually using
Accelerate stores the default config in the HF cache location unless you pass --config_file. The CLI docs spell out the default path behavior. (Hugging Face)
Concrete fix:
- Launch with accelerate launch --config_file /path/to/your.yaml ...
- Or check the accelerate env output and confirm it points at the same file you edited.
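Concretely (train.py and the path are placeholders):

```bash
# Be explicit about which config file is inspected and which one is launched with
accelerate env --config_file /path/to/your.yaml
accelerate launch --config_file /path/to/your.yaml train.py
```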
4) Additional FSDP-specific pitfalls that match “it looks like it didn’t wrap”
Pitfall: environment variables override what you think you configured
People often export FSDP_TRANSFORMER_CLS_TO_WRAP / FSDP_AUTO_WRAP_POLICY in their shell or job scripts. In a recent Accelerate issue, the user sets these env vars and gets “could not find transformer layer class …” when preparing a model that does not contain the specified classes. (GitHub)
Action: run env | grep -E 'FSDP_|ACCELERATE_USE_FSDP' and unset anything stale.
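In a shell, that check plus cleanup looks like this (the unset line names the variables mentioned above; adjust it to whatever your grep actually shows):

```bash
# Look for stale FSDP-related variables exported by your shell or job script
env | grep -E 'FSDP_|ACCELERATE_USE_FSDP'

# Unset anything you did not intend to set
unset FSDP_TRANSFORMER_CLS_TO_WRAP FSDP_AUTO_WRAP_POLICY
```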
Pitfall: fsdp_cpu_ram_efficient_loading: true requires init order discipline
Accelerate docs note that RAM-efficient loading requires the distributed process group to be initialized before calling from_pretrained, and also requires fsdp_sync_module_states=True. (Hugging Face)
If you load the model before Accelerate initializes distributed, you can get inconsistent weights or confusing behavior.
Safe pattern in an Accelerate-native script:
- Create Accelerator() first
- Then call from_pretrained(...)
- Then call accelerator.prepare(model, ...)
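A minimal sketch of that order in an Accelerate-native script (the model id is a placeholder):

```python
# Safe init order for fsdp_cpu_ram_efficient_loading under accelerate launch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # 1) initialize distributed first

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf"  # 2) load only after the process group exists
)

model = accelerator.prepare(model)  # 3) FSDP wrapping happens here
```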
5) What I think is happening in your specific case
Highest probability, based on your exact error string and your output:
- When relying on _no_split_modules, your model._no_split_modules is a string, not a list. That produces the per-character join and makes the first “class name” equal to "L". (Hugging Face)
- When you explicitly set fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer, you avoid the _no_split_modules bug, but you are concluding “not wrapped” from print(model). That is unreliable unless you printed after accelerator.prepare and inspected the wrapped object. Also, even when correctly wrapped, the module graph still shows all 32 decoder layers because that is normal for FSDP. (PyTorch Documentation)
- There may also be a YAML nesting mistake (your pasted config shows fsdp_config: with no indented children), which would make Accelerate ignore all FSDP options entirely. This would fully explain “nothing wraps” even when you think you set LlamaDecoderLayer.
6) Similar cases online and where to read
- Directly similar issue threads
- Official docs worth following (re-read them with the above pitfalls in mind)
- PyTorch grounding (for the mental model and debugging)
Bullet summary
"class L" strongly indicates model._no_split_modules is a string, causing character-wise join then split. Fix by forcing it to a list and optionally filtering missing names. (Hugging Face)
print(model) is not enough. Print after accelerator.prepare and check isinstance(model, FSDP). Also remember FSDP shards parameters, not “layers across GPUs”. (PyTorch Documentation)
- Verify your YAML nesting under fsdp_config: and consider accelerate config update to normalize key names like fsdp_backward_prefetch_policy. (Hugging Face)