Hyperparameters vs message format prompt tuning

I am using LoRA to fine-tune a model adapter, ultimately to classify tweets as sexist or not (“yes” or “no”).

The format of the system prompt and user text, which I use to build my inputs from a dataset of tweets, affects the final accuracy (e.g. “Your task is to detect sexism” vs “Your task is to detect misogyny”).

So do the LoRA hyperparameters.

Should I choose the best-performing LoRA hyperparameters first, and then do what is effectively “prompt tuning” on the fine-tuning data format? Or vice versa?


The appropriate training data and templates vary depending on which model class is tuned with which trainer. Regardless of the training data’s content, if the template is not applied correctly, training will fail badly.

Assuming the template and data format are appropriate, I believe data quality and quantity are crucial.

Set LoRA hyperparameters according to the actual data conditions. Performing short test runs makes it easier to avoid catastrophic failures.
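As a rough illustration, a short smoke test can be as simple as capping the step count and watching the loss (a minimal sketch assuming TRL’s SFTTrainer; the model name and small_subset are placeholders for your own setup, and exact config fields vary by TRL version):

```python
# Sketch of a short "smoke test" run (assumes TRL; model name and dataset are placeholders).
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="smoke-test",
    max_steps=50,                    # a few dozen steps is enough to catch template/loss bugs
    per_device_train_batch_size=4,
    logging_steps=5,                 # watch the loss curve closely on this tiny run
)
trainer = SFTTrainer(
    model="your-base-model",         # placeholder model id
    args=config,
    train_dataset=small_subset,      # placeholder: a few hundred examples is plenty here
)
trainer.train()
```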


What you are really tuning (and why the order matters)

You have two different levers that both affect validation accuracy, but they do so in different ways:

1) Message format / instruction text (“prompt template”)

This includes two categories:

  • Structural formatting (chat roles, special control tokens, separators, whether you use the model’s chat_template): using the wrong control tokens can “drastically” hurt performance for chat/instruct models. (Hugging Face)
  • Task semantics (the actual meaning of the instruction): “sexism” vs “misogyny” is not a harmless paraphrase for many labeling schemes—it can change what the model is being asked to learn.

2) LoRA hyperparameters

LoRA injects low-rank trainable matrices into selected layers while freezing base weights; r, alpha, dropout, and target modules determine the adapter’s capacity and how strongly it can update behavior. (arXiv)
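Concretely, LoRA replaces a frozen weight’s forward pass W0·x with W0·x + (alpha/r)·B·A·x, where B and A are the low-rank matrices of rank r; only B and A are trained, and alpha/r is the scaling that lora_alpha controls.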

Key point: these two levers interact. The “best LoRA config” under template A need not be best under template B, because LoRA learns whatever training signal your formatting + loss exposes.


The best answer to “LoRA first or prompt first?” (for your case)

Do format correctness first, then do a small template sweep, then do LoRA tuning, then do a small interaction check.

This is not a simple “one first, then the other”; it’s a staged approach designed to avoid wasting compute on runs that are invalid or incomparable.


Step 1 — Freeze the parts of “format” that are actually correctness, not tuning

A) If you’re training a chat/instruct model: always apply the model’s chat template

Use apply_chat_template() consistently for both training and inference so you’re not accidentally changing control tokens between runs. (Hugging Face)

If you don’t do this, you can see large swings that look like “prompt tuning works,” when it’s actually “sometimes the model recognizes roles correctly, sometimes it doesn’t.”
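A minimal sketch of keeping the template identical across runs (the model name and messages are illustrative):

```python
# Sketch: build training/inference text with the model's own chat template
# so control tokens stay identical across runs. Model name is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "Your task is to detect sexism. Answer with exactly YES or NO."},
    {"role": "user", "content": "<tweet text here>"},
]

# At inference, add_generation_prompt=True ends the string with the assistant header.
# For training examples, append the assistant turn containing the gold label instead.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```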

B) Make the loss target stable (especially with TRL/SFT)

For generative labeling (“Answer: YES/NO”), you typically want completion-only loss (loss on the label tokens, not on the prompt text). TRL’s SFTTrainer docs describe the training setup, and TRL issues repeatedly clarify that SFTTrainer is not automatically “completion-only” unless you configure it appropriately. (Hugging Face)

If the loss masking shifts when you change template wording, you’re no longer comparing prompt variants fairly—you’re changing what the model is trained to predict.
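One way to pin this down in TRL (a sketch; API details differ across TRL versions, and newer releases can compute completion-only loss automatically for prompt/completion-style datasets — here model, tokenizer, and train_dataset are assumed to be defined elsewhere):

```python
# Sketch: compute loss only on the answer tokens, not on the prompt text.
# Applies to TRL versions that ship DataCollatorForCompletionOnlyLM.
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

response_template = "Answer:"  # must exactly match the marker used in your formatted examples
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-yes-no", packing=False),  # packing must stay off with this collator
    train_dataset=train_dataset,
    data_collator=collator,
)
```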

Outcome of Step 1: runs are comparable. Any accuracy change is more likely due to true semantics/learned behavior, not pipeline variance.


Step 2 — Decide: are you building a classifier, or a chat behavior?

This is the biggest “better approach” lever.

Option A (usually best for your stated goal): Sequence classification + LoRA

If you want the best and most stable tweet → sexist? (binary) classifier, use AutoModelForSequenceClassification (logits over 2 classes). Hugging Face’s text classification guide is designed for this workflow. (Hugging Face)

Benefits:

  • Much less sensitivity to prompt wording/roles.
  • Cleaner evaluation (macro-F1/ROC-AUC/thresholds).
  • No parsing failures (e.g., “YES, because…”).
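A minimal sketch of Option A (the base model, column names, and target_modules are illustrative; check the module names of your own model):

```python
# Sketch: binary tweet classifier with a LoRA adapter on a sequence-classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "roberta-base"  # illustrative choice of base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # RoBERTa attention projections; differs per architecture
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only the adapter (+ head) should be trainable
```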

Option B: YES/NO generation + LoRA (only if you must keep chat I/O)

If you need to keep “system/user → assistant says YES/NO”:

  • Use correct chat templating. (Hugging Face)
  • Use completion-only (or equivalent) so training signal is stable. (Hugging Face)
  • Prefer scoring label logits (YES vs NO) at a fixed Answer: position over free-form .generate() + parsing (this turns it into a classifier-like decision while preserving chat format); see the sketch below. HF provides mechanisms for token-level scoring around generation. (Hugging Face)
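A sketch of that logit-scoring idea (the token handling is an assumption: many tokenizers encode “ YES” / “ NO” as single tokens, but verify this for your model, and make sure the prompt ends exactly where the answer should begin):

```python
# Sketch: compare next-token logits for the two label words at the answer position,
# instead of calling .generate() and parsing free text.
import torch

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Assumption: " YES" / " NO" each map to a single leading token for this tokenizer.
yes_id = tokenizer.encode(" YES", add_special_tokens=False)[0]
no_id = tokenizer.encode(" NO", add_special_tokens=False)[0]

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the token after the prompt

pred = "YES" if next_token_logits[yes_id] > next_token_logits[no_id] else "NO"
```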

Step 3 — Now tune the prompt template, but only within a fixed task definition

Here is the important conceptual split:

A) Do not treat “sexism vs misogyny” as mere prompt tuning

That’s a semantic change. If your annotation guideline labels “sexism” broadly, “misogyny” may narrow the task and can yield misleading “improvements” that won’t generalize.

B) Do treat as tunable:

  • Output constraint wording (“Answer with exactly YES or NO”)
  • Short definition aligned to the label policy
  • Where the answer is placed (Answer: marker, no extra tokens)

Background: prompt-based classification research shows the choice of template / label words can cause large variance. (arXiv)

Practical method: pick a baseline LoRA config and do a small sweep (e.g., 5–15 templates), keeping everything else identical.
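A sketch of what such a sweep might look like (the templates are illustrative, and train_and_eval() is a placeholder for your own “train adapter + measure macro-F1 on the dev set” routine):

```python
# Sketch: small template sweep under one fixed baseline LoRA config.
# The variants change only output constraints / definitions, not the task semantics.
templates = [
    "Your task is to detect sexism. Answer with exactly YES or NO.",
    "Decide whether the tweet is sexist according to the annotation guidelines. Answer: YES or NO.",
    "Classify the tweet as sexist or not sexist. Respond with a single word: YES or NO.",
]

baseline_lora = {"r": 8, "lora_alpha": 16, "lora_dropout": 0.05}

results = {}
for tpl in templates:
    # train_and_eval() is a hypothetical helper wrapping your training + evaluation pipeline
    results[tpl] = train_and_eval(template=tpl, lora_kwargs=baseline_lora, seed=42)

best_templates = sorted(results, key=results.get, reverse=True)[:2]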


Step 4 — Tune LoRA hyperparameters under the best 1–2 templates

Once format + loss + task semantics are stable, LoRA tuning becomes meaningful.

LoRA knobs (what they control; a config sketch follows this list):

  • r (rank): adapter capacity (how much the model can change). (arXiv)
  • alpha: scaling of the LoRA update (effective strength). (Hugging Face)
  • target_modules: where adaptation happens (attention proj only vs also MLP, etc.). (Hugging Face)
  • dropout: regularization for the adapter. (Hugging Face)
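For the generative setup (Option B), a config that touches each of these knobs might look roughly like this (the target_modules names are illustrative, Llama-style; inspect your base model to confirm them):

```python
# Sketch: a LoRA config for the generative YES/NO setup (Option B).
from peft import LoraConfig, TaskType

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                       # adapter capacity
    lora_alpha=32,              # update scaling (effective strength ~ alpha / r)
    lora_dropout=0.05,          # adapter regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention-only; add MLP projections to widen coverage
)
```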

What to tune first in practice (highest ROI; a staged sweep sketch follows this list):

  1. learning rate / epochs / early stopping (often dominates)
  2. target modules
  3. r
  4. alpha
  5. LoRA dropout
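That ordering could be encoded as a staged search space rather than one big grid (a sketch; the ranges are rough assumptions and should be adapted to your data size and base model):

```python
# Sketch: staged search order matching the priority list above.
# Each stage is swept while earlier stages are frozen at their best values.
search_stages = [
    {"learning_rate": [5e-5, 1e-4, 2e-4], "num_train_epochs": [1, 2, 3]},                      # 1. LR / epochs
    {"target_modules": [["q_proj", "v_proj"],
                        ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]]},     # 2. where to adapt
    {"r": [4, 8, 16, 32]},                                                                      # 3. rank
    {"lora_alpha": [8, 16, 32]},                                                                # 4. scaling
    {"lora_dropout": [0.0, 0.05, 0.1]},                                                         # 5. regularization
]
```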

Step 5 — Interaction check (minimal “joint” search)

Because prompt template and LoRA interact, do this at the end:

  • top 2–3 templates Ă— top 2–3 LoRA configs

This is the cheapest way to catch cases where template A wins with low LoRA capacity but template B wins when LoRA can represent the boundary.
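A sketch of that minimal joint check (the templates and configs are placeholders, and train_and_eval() is the same hypothetical helper as in the Step 3 sketch):

```python
# Sketch: small "joint" check over the survivors of the earlier sweeps.
from itertools import product

top_templates = ["template_A", "template_B"]                                   # survivors of the Step 3 sweep
top_lora_configs = [{"r": 8, "lora_alpha": 16}, {"r": 16, "lora_alpha": 32}]   # survivors of the Step 4 sweep

scores = {}
for tpl, lora in product(top_templates, top_lora_configs):
    key = (tpl, tuple(sorted(lora.items())))
    scores[key] = train_and_eval(template=tpl, lora_kwargs=lora, seed=42)

best_template, best_lora = max(scores, key=scores.get)
```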

If you have more compute, the “best” approach is to treat:

  • template choice (categorical)
  • LoRA params (continuous/discrete)

as one hyperparameter search space (factorial / Bayesian / multi-fidelity). But the staged approach above gets most of the benefit with far fewer runs.

Summary recommendation

  • If your end goal is truly “binary tweet sexism classifier”: Sequence classification + LoRA is the most robust “better approach.” (Hugging Face)
  • If you must keep chat/generation: freeze chat templating + loss definition first, then template sweep, then LoRA tuning, then a small interaction check. (Hugging Face)

thank you so much for this incredibly detailed response!
