You are seeing a normal “small model + sampling” tradeoff, made worse by two common setup problems.
- Wrong prompt wrapping for chat models. If you do not use the model’s expected chat template and control tokens, instruction-following and style control drop hard. Hugging Face explicitly warns that using the wrong control tokens makes chat models perform “drastically worse.” (Hugging Face)
- Using temperature to solve a decoding problem. Pushing temperature up often improves “voice” but increases repetition risk, which is the classic text-degeneration pattern described in the nucleus-sampling literature.
Below is what I would do for your exact CPU-only case, with concrete parameters, few-shot guidance, and small datasets.
0) The highest-leverage check: are you using the right chat template?
Mistral Instruct-style models
Mistral’s instruct model card states that prompts should be surrounded with [INST] ... [/INST] to leverage instruction fine-tuning. (Hugging Face)
If you feed a plain string (“Act as… Write… in tone…”) you may be bypassing the exact instruction format.
SmolLM 1.7B Instruct
The SmolLM-1.7B-Instruct model card recommends generation settings (temperature 0.2, top_p 0.9) and is designed to be used with the standard chat tooling. (Hugging Face)
Also, its config.json shows max_position_embeddings: 2048, so you have a 2K token context window to budget for prompt + output. (Hugging Face)
DeepSeek-R1 distilled models
DeepSeek’s official guidance says:
- temperature 0.5–0.7 (0.6 recommended) to prevent endless repetition or incoherence
- avoid adding a system prompt and put all instructions in the user prompt (GitHub)
If your prompt is injected as a “system” instruction by your wrapper, that can directly fight their recommendation.
Why this matters more than any decoding knob
HF’s chat templating guide gives the exact reason: different chat models rely on different special tokens ([INST], <|user|>, etc.). Wrong template means the model is not “in instruction mode.” (Hugging Face)
Actionable rule: for any “-Instruct” / chat model, build prompts as messages and call tokenizer.apply_chat_template(...) whenever possible. (Hugging Face)
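A minimal sketch of that rule, assuming a recent transformers version and using SmolLM-1.7B-Instruct as the checkpoint (swap in your own model ID):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-1.7B-Instruct"  # or your Mistral/DeepSeek checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Write a 3-sentence ad for our sourdough loaf in a casual, neighborly tone."},
]

# apply_chat_template inserts the model's own control tokens ([INST], <|user|>, ...),
# so you never hand-write them and never silently miss "instruction mode".
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=120)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```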
1) Fix the repetition the right way (instead of pushing temperature down)
You tried repetition_penalty and didn’t see improvement. That is common because it is a soft bias. You often need at least one hard repetition guard.
A. Add hard n-gram blocking
Use no_repeat_ngram_size=3 (or 4). This prevents repeating any 3-word sequence, which directly targets “delicious bread, tasty bread…” loops.
This control is widely requested even in production inference servers because repetition is common. Example: Hugging Face Text Generation Inference added/handled requests for no_repeat_ngram_size. (GitHub)
Transformers also implements this via a dedicated logits processor. (Hugging Face)
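A hedged sketch of what that looks like on top of your existing sampling settings (reusing the model, tokenizer, and inputs from the template example above):

```python
output = model.generate(
    inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    no_repeat_ngram_size=3,  # no 3-token sequence may ever repeat verbatim
)
```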
B. Switch decoding families: Contrastive search (best “tone + low repetition” default)
Contrastive search is explicitly supported in Transformers and kicks in when penalty_alpha > 0 and top_k > 1. (Hugging Face)
HF’s blog introduces it as a strong decoding method aimed at better quality than naive sampling. (Hugging Face)
A practical reference page suggests penalty_alpha often works in [0.3, 0.8] and larger top_k increases compute. (Hugging Face)
CPU-friendly starting point
- do_sample=False
- top_k=4 or 8
- penalty_alpha=0.6
- no_repeat_ngram_size=3
This often gives you “more voice than greedy” without the temperature=1.2 repetition spiral.
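Translated into a generate() call (reusing the objects from the earlier sketch; the values are starting points, not tuned settings):

```python
# Contrastive search in transformers is triggered by penalty_alpha > 0 and top_k > 1,
# with do_sample left False.
output = model.generate(
    inputs,
    max_new_tokens=150,
    do_sample=False,
    top_k=4,                 # try 4 or 8 on CPU; larger top_k costs more compute
    penalty_alpha=0.6,
    no_repeat_ngram_size=3,  # keep the hard anti-loop guard
)
```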
C. Typical decoding (typical_p) as a second option for style
Transformers includes a typical-decoding logits processor. (Hugging Face)
The key behavior: it prioritizes tokens whose log-prob is close to the distribution entropy, meaning very “default” tokens can be discarded. (Hugging Face)
This can help with your “casual becomes corporate” issue without cranking temperature.
Starting point
- do_sample=True
- temperature=0.8–1.0
- typical_p=0.9–0.95
- no_repeat_ngram_size=3
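As a sketch (same caveats as above):

```python
# Typical decoding: sample with typical_p instead of cranking temperature.
output = model.generate(
    inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.9,
    typical_p=0.92,
    no_repeat_ngram_size=3,
)
```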
2) “Tone collapses to corporate” is usually a prompting problem on 1–2B models
Small models tend to treat “casual bakery tone” as vague. They respond better to concrete constraints.
Replace a tone adjective with a style spec the model can follow mechanically
Instead of only:
“Write in a casual bakery tone”
Add:
- Voice: warm, neighborly, local
- Sentence length: short, punchy
- Must include: 1 sensory phrase, 1 friendly invitation, 1 specific product detail
- Avoid words: “synergy,” “leverage,” “innovative solutions,” “world-class,” “bespoke”
This works because it turns “tone” into observable rules.
Add lexical anchors
Give 3–6 “anchor phrases” that are unmistakably the tone you want:
- “fresh out of the oven”
- “pop in”
- “we saved you a slice”
- “your neighborhood bakery”
Small models copy anchors surprisingly well.
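Put together, a prompt built this way might look like the following (the bakery name and product details are placeholders, not from your data):

```python
style_spec = """Write a short ad for Maple Street Bakery's sourdough loaf.
Voice: warm, neighborly, local.
Sentences: short and punchy, 3-4 total.
Must include: one sensory phrase, one friendly invitation, one specific product detail.
Use at least two of these phrases: "fresh out of the oven", "pop in", "we saved you a slice", "your neighborhood bakery".
Avoid these words: "synergy", "leverage", "innovative solutions", "world-class", "bespoke"."""

messages = [{"role": "user", "content": style_spec}]
```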
3) Answers to your 3 specific questions
Q1) Parameter combinations you’re missing for CPU inference
CPU vs GPU does not change decoding logic, but CPU constraints make you care about cheap strategies.
Most impactful generation knobs you likely have not tried:
- no_repeat_ngram_size=3 (hard anti-loop) (Hugging Face)
- Contrastive search: top_k + penalty_alpha (Hugging Face)
- Typical decoding: typical_p (Hugging Face)
- Lower max_new_tokens and enforce structure (copywriting rarely needs 400 tokens). This reduces loop risk.
Model-specific constraint: DeepSeek recommends staying at ~0.6 temperature to avoid repetition. (GitHub)
So for DeepSeek-R1 distilled checkpoints, do not use temperature=1.2 unless you are also using hard repetition constraints.
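For a DeepSeek-R1 distilled checkpoint, that combination might look like this (a sketch, assuming the chat-template setup above; the prompt text is illustrative, and all instructions go in the user turn rather than a system prompt):

```python
messages = [
    {"role": "user", "content": "Write a casual 3-sentence bakery ad for our cinnamon rolls."},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(
    inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.6,         # inside DeepSeek's recommended 0.5-0.7 band
    no_repeat_ngram_size=3,  # hard guard on top of the temperature choice
)
```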
Q2) Is few-shot prompting feasible for 1.7B (context window concerns)?
Yes, if you keep it short.
- SmolLM-1.7B-Instruct has 2048 max positions, so you have a real but limited budget. (Hugging Face)
- For 1–2B models, 2 shots is usually the sweet spot. 3 shots can help, but only if each is tiny.
Practical sizing rule
- Keep each example 60–120 tokens.
- Keep the instruction + style spec short.
- Reserve 80–200 tokens for output.
If you blow half the context on long examples, small models start “forgetting” constraints and drift into generic phrasing.
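A quick way to enforce that budget is to count the prompt tokens before generating (assuming the SmolLM tokenizer and the messages list from earlier):

```python
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
prompt_len = prompt_ids.shape[-1]
budget = 2048 - 200  # reserve ~200 tokens for the generated copy
print(f"prompt tokens: {prompt_len} / {budget}")
assert prompt_len <= budget, "trim the few-shot examples or the style spec"
```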
Q3) Lightweight open-source datasets for copy tone (under 10k)
You will not find a perfect “marketing tone taxonomy” dataset that is both small and high quality. What works in practice is combining:
- small marketing/copy data
- a tone proxy dataset (formality)
Good under-10k building blocks
- Ad copy dataset (1,141 rows): smangrul/ad-copy-generation on Hugging Face. (Hugging Face) Great because it is already formatted for instruction-style training.
- Formality labels (tone proxy): osyvokon/pavlick-formality-scores provides sentence-level formality annotations from a TACL paper (human-labeled). (Hugging Face) Use it to train a small “formality head” or to create few-shot anchors for casual vs formal.
Bigger but you can downsample
- GYAFC (Yahoo Answers Formality Corpus). The original paper describes it as very large (110K pairs), but you can downsample to <10K. (ACL Anthology)
Caveat: many implementations note it is “available on request,” not always a direct download. (GitHub)
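A hedged loading sketch with the datasets library (split names and column layouts are assumptions; check each dataset card before relying on them):

```python
from datasets import load_dataset

ad_copy = load_dataset("smangrul/ad-copy-generation", split="train")
formality = load_dataset("osyvokon/pavlick-formality-scores", split="train")

print(ad_copy)       # ~1.1k instruction-style ad copy rows
print(formality[0])  # sentence-level formality score, usable as a tone proxy

# Downsample anything larger to stay under ~10k examples.
small_formality = formality.shuffle(seed=42).select(range(min(5000, len(formality))))
```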
4) CPU-only reality check: performance and memory choices matter
If you are truly running 7B locally on an 8GB machine, you are almost certainly using quantization or swapping.
A. Consider a CPU-native runtime for quantized models
For local CPU inference, the GGUF ecosystem (llama.cpp family) is often the practical path. It also supports applying chat templates stored in model metadata. (GitHub)
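A sketch with the llama-cpp-python bindings (the GGUF filename is a placeholder for whatever quantized file you actually download):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct-q4_k_m.gguf",  # hypothetical local file
    n_ctx=2048,
    n_threads=8,  # match your physical core count
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a casual 3-sentence bakery ad for fresh croissants."}],
    temperature=0.6,
    max_tokens=200,
    repeat_penalty=1.1,
)
print(out["choices"][0]["message"]["content"])
```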
If you stay in Python, you can also look at CPU optimization stacks:
- Optimum-Intel + OpenVINO for deployment and CPU throughput tuning (official HF blog + OpenVINO docs). (Hugging Face)
- PyTorch INT8 quantization progress on x86 CPUs (PyTorch blog). (PyTorch)
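A sketch of the Optimum-Intel path (assumes optimum with the OpenVINO extra is installed; the model ID is the SmolLM checkpoint discussed above):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the checkpoint to OpenVINO IR for CPU inference;
# generation kwargs work the same as with a regular transformers model.
ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a short, casual ad for our rye loaf."}],
    add_generation_prompt=True, return_tensors="pt",
)
output = ov_model.generate(inputs, max_new_tokens=120, no_repeat_ngram_size=3)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```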
B. Quantization can change tone control
SmolLM’s model card notes that 4-bit quantization degraded quality for the smallest sizes (135M, 360M). (Hugging Face)
The general lesson: aggressive quantization can make style adherence worse. If tone is your goal, prefer “reasonable” quantization levels and verify.
5) A concrete “do this next” recipe for your exact symptom pair
Step 1: enforce chat templates
Use apply_chat_template() and model-specific formatting. HF explains why this matters. (Hugging Face)
For Mistral Instruct specifically, follow [INST] ... [/INST]. (Hugging Face)
Step 2: stop using temperature=1.2 as your main style lever
For DeepSeek R1 family, they explicitly warn that higher temps can produce endless repetition and recommend 0.6. (GitHub)
Step 3: add hard repetition control
no_repeat_ngram_size=3. (Hugging Face)
Step 4: use contrastive search (first choice) or typical decoding (second choice)
Step 5: add a style spec + 2-shot examples
Keep total prompt size small enough to fit your 2K window for SmolLM. (Hugging Face)
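Putting steps 1–5 together, a hedged end-to-end sketch (reusing the model and tokenizer from earlier; the example copy is illustrative, not from any dataset):

```python
# 2-shot prompt with a style spec, rendered through the chat template,
# decoded with contrastive search plus hard n-gram blocking.
messages = [
    {"role": "user", "content": "Write a 3-sentence ad for banana bread. Voice: warm, neighborly. Short sentences."},
    {"role": "assistant", "content": "Banana bread, fresh out of the oven and still warm in the middle. Pop in before noon and grab a slice with your coffee. We saved you one."},
    {"role": "user", "content": "Now the same style for our new rosemary focaccia. Include one sensory detail and one invitation."},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(
    inputs,
    max_new_tokens=120,
    do_sample=False,
    top_k=4,
    penalty_alpha=0.6,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```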
Summary
- Fix prompt wrapping first. Use chat templates. Wrong control tokens can massively degrade tone control. (Hugging Face)
- Stop using temperature=1.2 to “buy tone.” Add no_repeat_ngram_size and switch to contrastive search or typical decoding. (Hugging Face)
- Few-shot works on 1.7B if you keep it to 2 short examples and respect the 2K context window (SmolLM-1.7B). (Hugging Face)
- For small datasets: start with ad-copy-generation (1,141 rows) plus a formality dataset as a tone proxy. (Hugging Face)