The models you mentioned are versatile multimodal models that also handle audio and video. They can write text, but writing isn't their strength, since their capacity is spread across tasks beyond text generation.
If you want a smaller model for writing, it’s best to start with a standard text-specialized Instruct LLM. As a rule of thumb, models with around 7B to 12B parameters can produce reasonably plausible text in major languages. Many people have also released derivatives of these models, fine-tuned for their own specific purposes. Still, I think it’s best to start with a basic Instruct model to test its raw suitability for your use case. Since LLMs each have significant quirks, fine-tuning one to your preferences will likely yield results closer to what you want.
Anyway, the best approach is simply to try out various representative models yourself.
If you want lightweight + beginner-friendly + good short copy, treat this as a three-part choice:
- Runner (the app you use)
- Model family (what it’s good at)
- A short-copy workflow (prompt shape + a couple settings)
Most “hype disappointments” come from picking a cool model name that is not optimized for your task (e.g., audio-first or multimodal-first), or running a decent model with defaults that force bland, robotic copy.
What “lightweight” actually means for copywriting
Lightweight usually means one or more of these:
- Small parameter count (roughly 1B–7B). Smaller tends to run on consumer machines more easily.
- Quantized GGUF builds (compressed model files used by local runners). These trade a bit of quality for much easier local running.
- Instruct-tuned (trained to follow instructions), which matters for “give me 12 captions under 110 characters.”
For short promo copy, the “sweet spot” is often:
- 3B–7B instruct models for noticeably better tone and variety than ~1B models, while still being manageable.
Your three named models, in context
1) Qwen3-Omni: powerful, but not beginner-text-first
Qwen3-Omni is a native omni-modal model that handles text, images, audio, and video, with streaming speech capability. That’s the point of the model. (Hugging Face)
What this means for you:
- If your goal is only short written copy, Omni is usually unnecessary complexity.
- The “beginner-friendly click-and-go” experience is weaker because omni serving often involves specialized tooling and workflows (for example, vLLM omni serving plus a demo stack). (vLLM)
- Community discussions around running it locally often focus on support gaps and setup friction. (Reddit)
When it is worth it:
- You actually want audio outputs (ads read aloud, character voices, multilingual spoken promos), or multimodal inputs.
2) Dia-1.6B: you guessed right, it’s voice-focused
Dia-1.6B is positioned as a text-to-speech model (generate spoken audio from scripts), not as a general “write marketing captions” text LLM. (kaggle.com)
So for your use case:
- It is not a primary copywriting model.
- It becomes relevant only if you want voiceovers for your promos.
3) Bling-Sheared-Llama: “enterprise automation” small model, not a writing-first vibe model
The BLING Sheared Llama line is described as enterprise automation / knowledge-intensive instructions, with an emphasis on narrower instruction sets suited to very small models. (Hugging Face)
Translation for short-form copy:
- It can produce text, but it is not marketed as “creative, human-sounding captions.”
- Expect more “functional assistant tone” unless you do heavier prompt shaping.
- Its strongest pitch is “small, runs locally, fits enterprise/RAG workflows,” not “non-robotic ad copy.” (Hugging Face)
The beginner-friendly models I would actually try first (2025-era, lightweight)
You asked for “under-the-radar” options that work without the hype trap. Here’s the practical shortlist, with why each tends to fit short copy.
Tier A: best first try for a total newbie (balanced quality and ease)
Qwen3-4B-Instruct-2507
- It’s explicitly an updated non-thinking mode variant of Qwen3-4B, aimed at straightforward instruction responses (good for captions, slogans, variations). (Hugging Face)
- It’s directly available in LM Studio’s model catalog pages, which is exactly the “click and go” path you want. (LM Studio)
Why it’s a good “newbie” pick:
- 4B is large enough to sound less like a toy.
- Non-thinking variants reduce the chance you get long “reasoning style” output when you only want final copy.
Tier B: if you want more natural writing cadence (and can run 7B)
Mistral-7B-Instruct-v0.3
- Widely used instruct model; the model card highlights v3 tokenizer and function-calling support, which tends to correlate with modern, well-maintained packaging in many runners. (Hugging Face)
Why it helps your “robotic” complaint:
- 7B models often produce better rhythm and phrasing variety than 3B–4B, especially in marketing-style text.
Tier C: very lightweight “surprisingly decent,” but settings matter
Gemma 2 2B IT (GGUF)
- The GGUF model card explicitly warns that in llama.cpp-style tools (including Ollama and LM Studio), you need to set flags correctly, especially repeat-penalty. (Hugging Face)
Why it can be good for you:
- If your hardware is limited, 2B is easier.
- If you tune repetition controls, you can get punchy short outputs.
Why beginners sometimes hate it:
- Wrong defaults can cause repetition or weirdness. The official GGUF card and llama.cpp discussion both emphasize repeat-penalty sensitivity. (Hugging Face)
Phi-3.5-mini-instruct
- Microsoft positions Phi-3.5-mini as a lightweight model with large context (128K) and training focused on high-quality datasets. (Hugging Face)
Why it’s useful for copy:
- Often strong at structured instruction-following (lists, variants, constraints).
- Can sound a bit “formal” by default, so you’ll rely on prompt voice examples.
Tier D: multilingual and documented prompt formats
Llama 3.2 3B Instruct
- Clear licensing and official prompt-format documentation exists, which helps avoid “why is the output weird” confusion. (Hugging Face)
Why it might matter:
- If you are writing Japanese or mixed JP/EN frequently, multilingual families can be a safer baseline.
Practical note:
- The license is not Apache-2.0; it is the Llama 3.2 Community License. Read it if you plan commercial usage. (Hugging Face)
“Click and go” setup: the two easiest routes
Route 1: LM Studio (most beginner-friendly GUI)
LM Studio’s docs explicitly describe a built-in model downloader from Hugging Face via the Discover tab. (LM Studio)
What you do:
- Install LM Studio.
- Go to Discover.
- Search and download something like “Qwen3 4B Instruct 2507.”
- Load and chat.
If you already downloaded a GGUF somewhere else:
- LM Studio documents how to import external GGUF models. (LM Studio)
Why this route fits your goal:
- Minimal touching of templates, backends, CLI flags.
- Easy to swap models and A/B test.
Route 2: Ollama (simple, but you will see a few “options” knobs)
Ollama’s docs explain Modelfiles and parameters, including stop sequences. Stop sequences matter if a model rambles or adds extra junk after your caption list. (docs.ollama.com)
Key detail that reduces frustration:
- Set a stop sequence so output cuts off cleanly after your caption list instead of trailing into extra commentary.
Why it can still feel “not beginner-friendly”:
- You may end up editing a Modelfile or options to get the exact behavior you want.
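If you do end up there, a minimal Modelfile sketch looks like this (the model tag, stop marker, and system text are placeholders; `FROM`, `PARAMETER`, and `SYSTEM` are documented Modelfile directives):

```
# Build with: ollama create caption-writer -f Modelfile
FROM qwen3:4b                  # placeholder; use whatever model tag you pulled
PARAMETER temperature 0.8      # slightly more creative than stock defaults
PARAMETER stop "###"           # stop cleanly when the model emits this marker
SYSTEM "You write short social captions. Output only the numbered list, then ###."
```

Then `ollama run caption-writer` uses those settings every time, instead of you re-typing them.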
If your priority is truly “no fiddling,” LM Studio usually wins first.
Getting non-robotic short copy: what actually works
Background: why models default to “robotic”
Short copy pushes models into safe templates:
- Generic hype words.
- Repetitive structures.
- Over-polished “brand voice” that feels fake.
You fix this mostly with:
- Better constraints (what to avoid, exact format)
- Batch generation (many options at once)
- A rewrite pass (turn “ad voice” into “human voice”)
A prompt template that reliably produces better captions
Use a single “caption generator” prompt structure:
Inputs
- Product and offer
- Audience
- Tone (3 adjectives)
- “Banned phrases”
- Output constraints (length, hashtags, emoji)
- Ask for many variants
Example (copy/paste)
“Write 20 social captions for: [product].
Audience: [who]. Offer: [deal].
Tone: [3 adjectives].
Rules: max 110 characters each, no hashtags, max 1 emoji, no exclamation marks.
Avoid these phrases: ‘game-changer’, ‘unlock’, ‘elevate’, ‘seamless’, ‘revolutionize’.
Make every caption meaningfully different: vary angle, sentence length, and rhythm.
Output exactly 20 numbered lines. No extra commentary.”
Then do a second pass:
“Rewrite #3, #9, #14 to sound like a real person texting a friend. Keep under 100 characters. Use contractions. Remove salesy words.”
This “two-pass” approach is the fastest path to “less robotic” without changing models.
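If you run that prompt a lot, a few lines of Python can mechanically verify the hard rules before you eyeball tone. This is a sketch: the limits mirror the example prompt above, and the emoji count is a rough codepoint-range heuristic, not a full Unicode emoji test.

```python
def check_caption(line: str, max_len: int = 110) -> list[str]:
    """Return a list of rule violations for one caption line (empty = pass)."""
    problems = []
    text = line.split(". ", 1)[-1]  # drop a leading "3. "-style number, if any
    if len(text) > max_len:
        problems.append(f"too long ({len(text)} chars)")
    if "#" in text:
        problems.append("contains a hashtag")
    if "!" in text:
        problems.append("contains an exclamation mark")
    # rough emoji heuristic: count codepoints in the main emoji blocks
    if sum(1 for ch in text if 0x1F300 <= ord(ch) <= 0x1FAFF) > 1:
        problems.append("more than one emoji")
    return problems

# Example: a caption that breaks two rules
print(check_caption("4. Unlock #deals now!"))
# → ['contains a hashtag', 'contains an exclamation mark']
```

Paste a model's 20 lines through this and you instantly see which ones to discard before the humanize pass.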
Tiny settings nudge (only if needed)
If outputs are bland or repetitive:
- Increase creativity slightly (temperature).
- Loosen sampling slightly (top-p).
- Adjust repetition control.
Why repetition control matters:
- Gemma’s GGUF card specifically calls out repeat-penalty sensitivity in llama.cpp-style tools. (Hugging Face)
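Concretely, those three knobs map to request parameters if you use a runner's local API instead of its chat window (LM Studio exposes an OpenAI-compatible server). A sketch, with assumptions flagged: the model id is a placeholder, and `repeat_penalty` is the llama.cpp-style name, which some runners ignore or call something else.

```python
import json

def build_request(prompt: str, temperature: float = 0.8, top_p: float = 0.95,
                  repeat_penalty: float = 1.1) -> dict:
    """Build a chat-completion request body with the three 'nudge' knobs."""
    return {
        "model": "qwen3-4b-instruct-2507",   # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,          # creativity
        "top_p": top_p,                      # sampling looseness
        "repeat_penalty": repeat_penalty,    # repetition control (llama.cpp-style key)
    }

body = build_request("Write 8 captions under 90 characters, no hashtags.")
print(json.dumps(body, indent=2))
```

Nudge one knob at a time and re-run the same prompt, so you can tell which change actually helped.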
How to pick without wasting days: a 30-minute “caption bake-off”
Step 1: pick 2–3 candidates
Start with:
- Qwen3-4B-Instruct-2507 (baseline)
- Mistral-7B-Instruct-v0.3 (if your machine handles 7B)
- Gemma 2 2B IT (only if hardware is tight)
Step 2: run the same 10 prompts on each model
Make prompts reflect your real work:
- “8 captions under 90 characters, no hashtags.”
- “10 promo lines that do not sound like marketing.”
- “6 variations using keyword X, avoid words Y.”
Score each output 1–5 on:
- Constraint-following
- Variety
- Human feel
- Lowest cringe
Pick the winner. Keep the runner. Stop shopping.
“Are there any good comparisons or leaderboards for writing?”
Yes, but use them correctly.
LMArena Creative Writing leaderboard (human preference)
This is useful because it’s explicitly a creative-writing slice and is updated frequently (the page shows a “Last Updated” date). (lmarena.ai)
How to use it for your case:
- Treat it as a shortlist generator, not a final answer.
- Then test your caption prompts locally.
Open LLM Leaderboard v2 (standardized, open models)
Hugging Face documents that it evaluates models on a fixed set of benchmarks via EleutherAI’s evaluation harness. (Hugging Face)
How to use it for copywriting:
- It’s good for “which small model is broadly capable.”
- It’s not directly measuring “sounds like a human marketer,” but it can filter out weaker instruction-followers.
What I would do if I were you (simple and realistic)
- Install LM Studio and use Discover to download models in-app. (LM Studio)
- Download Qwen3-4B-Instruct-2507 first. It’s a modern non-thinking instruct model and is straightforward to run. (Hugging Face)
- Download Mistral-7B-Instruct-v0.3 second only if you want more “human cadence” and your machine can handle it. (Hugging Face)
- Use the two-pass caption workflow (generate many, then humanize best picks).
- Only touch settings if you see repetition. If you try Gemma, take repeat-penalty seriously. (Hugging Face)
- Ignore Qwen3-Omni and Dia-1.6B for your first week. Omni is multimodal-first and Dia is TTS-first. (Hugging Face)
- If you still want a “niche under-the-radar” small model, try BLING only if you want enterprise-style instruction automation, not creative promo voice. (Hugging Face)
Curated reading list that maps to your pain points
- LM Studio basics and model downloading (true click-and-go) (LM Studio)
- Ollama Modelfile stop sequences (fix rambling, enforce clean outputs) (docs.ollama.com)
- Qwen3-4B-Instruct-2507 model card (what “non-thinking” update means) (Hugging Face)
- Qwen3-Omni repo + vLLM omni serving guide (why Omni is more complex) (GitHub)
- Gemma 2 2B IT GGUF card + llama.cpp repeat-penalty discussion (avoid “repetition trap”) (Hugging Face)
- LMArena Creative Writing leaderboard (human preference signal for writing) (lmarena.ai)
- Open LLM Leaderboard “About” (what it actually measures) (Hugging Face)
Summary
- Qwen3-Omni is impressive but multimodal-first and heavier to run. Dia-1.6B is TTS-first. (Hugging Face)
- For beginner-friendly short copy, start with LM Studio + Qwen3-4B-Instruct-2507, then try Mistral-7B-Instruct-v0.3 if you want richer tone. (LM Studio)
- Non-robotic captions come more from prompt constraints + a humanize rewrite pass than from chasing model hype.
- If you go very small (Gemma 2B), repetition controls matter a lot. (Hugging Face)