Your updated SMML draft is now “research-shaped.” It has (1) a minimal language, (2) an IR step, (3) a layered execution plan, and (4) evaluation metrics. That matches how controllable generation work is actually built in 2023–2025 research and tooling: plan first, render second, measure third. (arXiv)
What matters most to professionals is not whether the tags look like HTML. What matters is whether the tags have clear semantics and compile into concrete controls (boxes, masks, pose maps, panel compositing rules, identity adapters) that today’s diffusion pipelines can obey. (arXiv)
Below is a clear, beginner-friendly way to understand what you have, what researchers will praise, what they will critique, and how to present SMML v0.1 so it is maximally adoptable.
1) Background: what you are really proposing (in plain terms)
The “middle layer” has a standard name in engineering: IR + compiler
In many fields, humans write something readable, then a compiler turns it into something machines can execute.
This is not a new idea in principle. It is exactly what “layout planning with LLMs” papers have been pushing: structured representations make layout intent more reliable than raw natural language. LayoutGPT is a clear example: it uses a style-sheet-like language to help LLMs output plausible layouts. (arXiv)
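To make the IR + compiler idea concrete, here is a minimal sketch in Python. The tag names follow the draft's v0.1 vocabulary, and the JSON-IR shape is one possible assumption, not a fixed spec:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical SMML fragment; tag names follow the draft's v0.1 vocabulary.
SMML = """
<page direction="rtl">
  <panel order="1" role="main">
    <frame x="0.55" y="0.05" w="0.40" h="0.90"/>
  </panel>
  <panel order="2">
    <frame x="0.05" y="0.05" w="0.40" h="0.90"/>
  </panel>
</page>
"""

def compile_to_ir(smml: str) -> dict:
    """Compile SMML markup into one possible JSON-IR shape."""
    page = ET.fromstring(smml)
    ir = {"direction": page.get("direction", "rtl"), "panels": []}
    for panel in page.findall("panel"):
        frame = panel.find("frame")
        ir["panels"].append({
            "order": int(panel.get("order")),
            "role": panel.get("role", "minor"),
            "frame": {k: float(frame.get(k)) for k in ("x", "y", "w", "h")},
        })
    ir["panels"].sort(key=lambda p: p["order"])  # declared order is authoritative
    return ir

print(json.dumps(compile_to_ir(SMML), indent=2))
```

The point is only the separation of concerns: humans edit the markup, tools consume the dict.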
Why manga is a perfect case for IR
Manga pages are strongly “document-like”:
- panels are discrete units of time,
- gutters (negative space) carry pacing (“Ma”),
- reading order can be culturally specific (RTL vs LTR),
- page hierarchy matters (main panel vs minor beats).
A single diffusion pass is not naturally a document renderer. So a document-like IR makes sense.
2) What existing research says about your “Brain–Hand disconnect”
You are not alone. The field already treats “multi-panel manga” as a special problem.
DiffSensei: explicit “bridge” between an MLLM and diffusion for manga
DiffSensei frames manga generation as needing:
- multi-character control,
- layout control,
- sequential coherence across panels,
and it explicitly links a multimodal model with diffusion in a coordinated system. (arXiv)
MangaDiffusion: layout-controllable manga pages from plain text
MangaDiffusion emphasizes:
- generating multi-panel pages,
- reasonable and diverse page layouts,
- intra-panel and inter-panel interaction,
and it introduces Manga109Story as a dataset for “plain text story → manga page.” (arXiv)
Why this matters for SMML: these papers show the field already agrees that “prompt-only” is insufficient for full manga pages. Your contribution is not “a new diffusion model.” Your contribution is a user-facing “bridge format” that can ride on top of these ideas.
3) Background: what diffusion can and cannot reliably obey
To keep SMML credible, it helps to separate:
Deterministic things (machines can guarantee)
These should be done outside diffusion whenever possible:
- panel rectangles and gutters,
- reading order metadata,
- page composition and whitespace,
- placing dialogue text as a typeset layer.
This aligns with your “Ma” argument: the cleanest “white-out” is literal whiteness in the compositor, not a probabilistic generation request.
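As a concrete illustration, here is a compositor sketch (using Pillow; the panel dicts follow the JSON-IR shape assumed earlier) that produces the page, gutters, and panel borders deterministically:

```python
from PIL import Image, ImageDraw

def compose_page(panels, page_w=1200, page_h=1800, border=3):
    """panels: dicts with a normalized 'frame' and an optional rendered 'image'."""
    page = Image.new("RGB", (page_w, page_h), "white")  # gutters are literal whiteness
    draw = ImageDraw.Draw(page)
    for p in panels:
        f = p["frame"]
        x0, y0 = int(f["x"] * page_w), int(f["y"] * page_h)
        x1, y1 = int((f["x"] + f["w"]) * page_w), int((f["y"] + f["h"]) * page_h)
        if p.get("image") is not None:
            page.paste(p["image"].resize((x1 - x0, y1 - y0)), (x0, y0))
        draw.rectangle([x0, y0, x1, y1], outline="black", width=border)
    return page
```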
Probabilistic things (diffusion can obey, but only with the right controls)
Diffusion is much more reliable when you give it structure beyond text:
- ControlNet adds spatial conditioning inputs like edges, depth, segmentation, and human pose. This is the canonical mechanism for “spatial grammar” in modern diffusion pipelines. (arXiv)
- Grounded placement uses boxes or masks so the model knows “what goes where.”
  - GLIGEN is a known grounding approach using boxes as conditions (note: the Diffusers GLIGEN pipeline is marked deprecated, which is a practical maintenance warning). (GitHub)
  - BoxDiff is a training-free way to enforce box constraints during diffusion steps. (GitHub)
  - MultiDiffusion is a broader framework that can bind multiple diffusion processes with shared constraints and can use masks/boxes as guiding signals. (arXiv)
- Identity consistency across panels is its own hard problem.
  - IP-Adapter is a lightweight adapter for image prompting and can help anchor identity/style. (GitHub)
  - InstantID targets identity-preserving generation from a single reference face image. (GitHub)
  - StoryDiffusion focuses on long-range character consistency via “consistent self-attention,” and explicitly mentions comics creation and consistent attire. (GitHub)
Implication for SMML: <character id="C1"> cannot just be a string. It must compile to one of these identity mechanisms, or it stays “wishful metadata.”
4) What professionals will think about your Phase roadmap
Your phases are reasonable. Researchers will mostly agree with the sequence “spec → IR → pipeline → metrics → expansion.” They will also want sharper boundaries between phases.
Phase 1: Minimal spec (SMML v0.1)
Good. Researchers like minimality because it becomes testable.
The most important improvement: define what is binding versus non-binding.
- Binding in v0.1: panels, frames, gutters, reading order.
- Non-binding in v0.1 (annotations): camera, effects, dialogue style.
This avoids overpromising before controls exist.
Phase 2: JSON-IR
Very good. Professionals prefer JSON-IR because it can be consumed by tooling.
This also aligns with LayoutGPT-style thinking: LLM outputs structured layout representations more reliably than prose. (arXiv)
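One way tooling can consume the JSON-IR is schema validation. A minimal sketch with the jsonschema package follows; the schema shape is an assumption, not part of any published SMML spec:

```python
# pip install jsonschema
from jsonschema import validate

IR_SCHEMA = {
    "type": "object",
    "required": ["direction", "panels"],
    "properties": {
        "direction": {"enum": ["rtl", "ltr"]},
        "panels": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["order", "frame"],
                "properties": {
                    "order": {"type": "integer", "minimum": 1},
                    "frame": {
                        "type": "object",
                        "required": ["x", "y", "w", "h"],
                        "properties": {k: {"type": "number", "minimum": 0, "maximum": 1}
                                       for k in ("x", "y", "w", "h")},
                    },
                },
            },
        },
    },
}

ir = {"direction": "rtl",
      "panels": [{"order": 1, "frame": {"x": 0.05, "y": 0.05, "w": 0.6, "h": 0.45}}]}
validate(ir, IR_SCHEMA)  # raises ValidationError on malformed IR
```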
Phase 3: Layered generation
Good. Layer separation matches the reality that:
- backgrounds, characters, and text have different constraints,
- corrections are commonly done by inpainting and compositing.
Even if the “perfect layer decomposition” is hard, the direction is correct.
Phase 4: Automated metrics
Excellent. This is where “cool demo” becomes “research project.”
Phase 5: SVML and SRML expansion
Fine as a vision, but professionals will ask you to keep scope tight until SMML is proven.
A practical note: timeline/sequence control already has interchange formats in video editing ecosystems, which supports your broader “semantic bridge” thesis (structure separate from render). If you ever write SVML seriously, aligning with existing editorial timeline concepts is the credibility move.
5) Your v0.1 tag set: what works, what needs tightening
Your minimal tags are sensible. The main missing piece is a stricter definition of geometry and order.
A) <frame x y w h> is the right core primitive
Relative coordinates are a strong choice because:
- they are model-agnostic,
- they are easy to normalize across page sizes,
- they map naturally to cropping, masks, and compositing.
Researchers will still require you to define:
- coordinate origin (typically top-left),
- whether x,y are panel top-left,
- whether panels can overlap,
- whether gutters are explicit rectangles or implied by separation.
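A sketch of what such a validator could check, under assumed v0.1 rules (top-left origin, no overlap, a minimum gutter separation; the 0.02 threshold is arbitrary):

```python
def rects_overlap(a, b):
    """Axis-aligned overlap test for two normalized frames."""
    return not (a["x"] + a["w"] <= b["x"] or b["x"] + b["w"] <= a["x"] or
                a["y"] + a["h"] <= b["y"] or b["y"] + b["h"] <= a["y"])

def gutter_gap(a, b):
    """Axis-aligned separation between two frames (0 if they touch or overlap)."""
    dx = max(a["x"] - (b["x"] + b["w"]), b["x"] - (a["x"] + a["w"]), 0.0)
    dy = max(a["y"] - (b["y"] + b["h"]), b["y"] - (a["y"] + a["h"]), 0.0)
    return max(dx, dy)

def validate_geometry(frames, min_gutter=0.02, allow_overlap=False):
    errors = []
    for f in frames:
        if (not all(0.0 <= f[k] <= 1.0 for k in ("x", "y", "w", "h"))
                or f["x"] + f["w"] > 1.0 or f["y"] + f["h"] > 1.0):
            errors.append(f"frame outside the unit page: {f}")
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            if rects_overlap(frames[i], frames[j]):
                if not allow_overlap:
                    errors.append(f"panels {i} and {j} overlap")
            elif gutter_gap(frames[i], frames[j]) < min_gutter:
                errors.append(f"gutter between panels {i} and {j} under {min_gutter}")
    return errors
```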
B) Reading order must be explicit and authoritative
You already have order="1", which is good.
This matters because reading order is not always a simple “Z-path,” especially when layouts get dynamic. Reading research shows layout affects viewing patterns and that deviations from regular grids can change scan paths. (Frontiers)
So professionals will like:
- order fields as the source of truth,
- or an explicit <reading_order> list for complex pages.
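A sketch of how an order field can be checked against geometry: derive a naive RTL Z-path (rows top-to-bottom, right-to-left within a row) and compare it with the declared order. The row-grouping tolerance is an assumption:

```python
def rtl_z_path(panels, row_tol=0.05):
    """Return declared order values in naive RTL traversal sequence."""
    rows = []
    for p in sorted(panels, key=lambda p: p["frame"]["y"]):
        if rows and abs(p["frame"]["y"] - rows[-1][0]["frame"]["y"]) < row_tol:
            rows[-1].append(p)  # same visual row
        else:
            rows.append([p])    # new row below
    path = []
    for row in rows:
        # right-to-left: sort by the right edge, descending
        path.extend(sorted(row, key=lambda p: -(p["frame"]["x"] + p["frame"]["w"])))
    return [p["order"] for p in path]

def order_is_consistent(panels):
    return rtl_z_path(panels) == sorted(p["order"] for p in panels)
```

When the declared order and the geometric path disagree, that is exactly the case where an explicit <reading_order> list should win.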
C) Dialogue as data, not pixels
Separating <dialogue> from final rendering is correct.
There is an entire subfield around speech balloon detection, OCR, and speaker association in comics because it is structurally difficult. Manga109Dialog exists specifically to link speakers and texts, and it even notes reading order features are relevant for performance. (arXiv)
Keeping dialogue typeset as a layer makes SMML outputs editable and robust.
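A sketch of the typeset layer, using Pillow; the balloon geometry and the default font are placeholders:

```python
from PIL import Image, ImageDraw, ImageFont

def typeset_dialogue(page_size, dialogues):
    """dialogues: dicts like {"text": ..., "x": 0.1, "y": 0.1} (normalized coords)."""
    overlay = Image.new("RGBA", page_size, (0, 0, 0, 0))  # transparent text layer
    draw = ImageDraw.Draw(overlay)
    font = ImageFont.load_default()  # swap in a real manga font in practice
    for d in dialogues:
        px = (int(d["x"] * page_size[0]), int(d["y"] * page_size[1]))
        box = draw.textbbox(px, d["text"], font=font)
        # crude elliptical balloon around the text bounds
        draw.ellipse([box[0] - 20, box[1] - 14, box[2] + 20, box[3] + 14],
                     fill="white", outline="black", width=2)
        draw.text(px, d["text"], fill="black", font=font)
    return overlay  # composite over the rendered page with Image.alpha_composite
```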
D) Camera and effect tags: keep them as annotations until you define compilation
Your <camera> and <effect> tags are fine as a “future-proof vocabulary.” The only risk is implying they enforce outcomes today.
Professionals will accept them if you state:
- v0.1 camera/effect tags are non-binding hints,
- later versions compile them into control signals (pose maps, depth maps, edge guidance, etc.). (arXiv)
6) The key missing piece: compilation semantics (tag → control)
Researchers will ask: “What does this tag become at runtime?”
A simple, easy-to-read mapping (conceptual, not code):
Page-level compilation
- <page direction="rtl"> becomes: a default reading-order convention plus an ordering validator. It does not mean diffusion understands RTL text.
- <panel role="main"> becomes: a higher compute budget (resolution, steps, refinement passes) and stricter control strength.
Panel geometry compilation
- <frame x y w h> becomes: a crop region and mask for compositing, and possibly region constraints for generation.
Subject placement compilation
- a subject's box or mask becomes: a grounding constraint for box- or mask-conditioned generation (GLIGEN-, BoxDiff-, or MultiDiffusion-style). (GitHub)
Identity compilation
- <character id="C1"> becomes: a reference to a concrete identity mechanism (an IP-Adapter image prompt, an InstantID face reference, or a StoryDiffusion-style consistency pass). (GitHub)
This is the “nervous system” in technical form.
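A sketch of the same mapping expressed as data, with hypothetical control-artifact names; a real backend would route these to ControlNet inputs, box/mask grounding, and identity adapters:

```python
def compile_panel_controls(panel, characters):
    """Map one IR panel to control artifacts (field names here are hypothetical)."""
    controls = {
        "crop": panel["frame"],         # deterministic: compositor crop region
        "region_mask": panel["frame"],  # probabilistic: region constraint for generation
        "steps": 40 if panel.get("role") == "main" else 25,  # compute budget by role
        "grounding_boxes": [],          # box/mask grounding (GLIGEN/BoxDiff-style)
        "identity_refs": [],            # identity adapters (IP-Adapter/InstantID-style)
    }
    for s in panel.get("subjects", []):
        controls["grounding_boxes"].append({"box": s["box"], "phrase": s["desc"]})
        if s.get("character_id"):
            # "C1" stops being wishful metadata only when it resolves to an asset
            controls["identity_refs"].append(characters[s["character_id"]])
    return controls
```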
7) Why “Ma” and gutters belong in layout, not diffusion (with research context)
Your “gutter as time” intuition has empirical support.
- Research on comic page layout reading order discusses how factors like separation and proximity influence navigation, and that larger gutters can push readers to navigate differently. (Visual Language Lab)
- Eye-tracking studies also emphasize that readers use expected order but can deviate depending on page design. (Wiley Online Library)
So a strong professional framing is:
- “SMML treats gutters as first-class layout constraints.”
- “Diffusion renders panel contents, not page whitespace.”
That is exactly how you preserve “silence beats” and “white-out” effects reliably.
8) Your evaluation metrics: how to make them concrete and publishable
Your metrics are good. Making them concrete means tying them to existing datasets and evaluation tasks.
A) Layout correctness
If you generate layout from SMML deterministically, you can evaluate:
- exact match of panel rectangles,
- overlap violations,
- gutter width constraints,
- reading order consistency (order list vs geometry).
If you allow LLMs to propose layouts, then compare predicted frames against targets.
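A sketch of the comparison, using per-panel IoU plus an exact-match rate; the 0.95 threshold is an arbitrary assumption:

```python
def iou(a, b):
    """Intersection-over-union of two normalized frames."""
    ix0, iy0 = max(a["x"], b["x"]), max(a["y"], b["y"])
    ix1 = min(a["x"] + a["w"], b["x"] + b["w"])
    iy1 = min(a["y"] + a["h"], b["y"] + b["h"])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    union = a["w"] * a["h"] + b["w"] * b["h"] - inter
    return inter / union if union > 0 else 0.0

def layout_scores(pred_frames, target_frames, match_thresh=0.95):
    ious = [iou(p, t) for p, t in zip(pred_frames, target_frames)]
    return {
        "mean_iou": sum(ious) / len(ious),
        "exact_match_rate": sum(i >= match_thresh for i in ious) / len(ious),
    }
```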
B) Dialogue and OCR metrics
Separate the pipeline:
- detect balloons/text regions,
- OCR,
- speaker association.
Manga109Dialog exists for speaker-to-text pairs and gives you a standard evaluation target. (arXiv)
C) Character consistency metrics
You can measure:
- face embedding similarity across panels (when faces visible),
- body/clothing similarity (shape or segmentation overlap),
- identity classifier consistency.
New segmentation resources for Manga109 are directly relevant because they include categories like frames, text/dialog, onomatopoeia, faces, bodies, balloons. (CVF Open Access)
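A sketch of the embedding-similarity measurement; the embedding model itself is assumed (any face recognizer returning fixed-size vectors would do):

```python
import itertools
import numpy as np

def identity_consistency(embeddings):
    """Mean pairwise cosine similarity over per-panel appearance embeddings."""
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in itertools.combinations(embeddings, 2)]
    return sum(sims) / len(sims) if sims else 1.0
```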
D) Flow metrics (proxy for eye tracking)
If you cannot run eye tracking:
- measure whether the intended reading order is consistent with a rule-based traversal,
- measure “jump penalties” when gutters are large (a proxy for “don’t jump across silence”).
This is grounded in reading order research and gestalt grouping discussions in comics. (Frontiers)
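A sketch of the jump-penalty proxy; the gutter weighting is an assumption standing in for real eye-tracking data:

```python
import math

def _gap(a, b):
    # same axis-aligned separation helper as in the geometry validator sketch
    dx = max(a["x"] - (b["x"] + b["w"]), b["x"] - (a["x"] + a["w"]), 0.0)
    dy = max(a["y"] - (b["y"] + b["h"]), b["y"] - (a["y"] + a["h"]), 0.0)
    return max(dx, dy)

def jump_penalty(ordered_frames, gutter_weight=2.0, wide_gutter=0.05):
    """Sum of center-to-center distances in reading order, weighted across wide gutters."""
    total = 0.0
    for a, b in zip(ordered_frames, ordered_frames[1:]):
        ax, ay = a["x"] + a["w"] / 2, a["y"] + a["h"] / 2
        bx, by = b["x"] + b["w"] / 2, b["y"] + b["h"] / 2
        dist = math.hypot(bx - ax, by - ay)
        total += dist * (gutter_weight if _gap(a, b) > wide_gutter else 1.0)
    return total
```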
9) Practical ecosystem integration: why ComfyUI is a natural “execution backend”
You do not need SMML to be married to ComfyUI, but ComfyUI is a realistic first target because it already speaks “graph + compositing + masks + control nodes.”
A) Panel layout nodes exist
- comfyui_panels is explicitly for comics/manga-like panel generation and organization. (GitHub)
- CR Comic Panel Templates generates structured layouts with rows/columns and auto sizing. (RunComfy)
- comfyui-panelforge is an extension aimed at comic panel creation, including layout and speech bubbles. (GitHub)
This supports your core thesis: page grammar can be deterministic.
B) Workflow-as-JSON matches your JSON-IR phase
ComfyUI workflows are JSON graphs. That matches the idea of compiling SMML into an executable IR.
- ComfyUI documentation explains workflows as JSON graphs and emphasizes controllable generation through node composition. (ComfyUI)
- ComfyUI Cloud docs describe submitting a workflow via POST /api/prompt using “API format” workflows. (ComfyUI)
- Other integration docs describe exporting “API JSON” for POST requests to a /prompt-style endpoint. (docs.nebius.com)
So your Phase 2 (JSON-IR) can map naturally to “workflow JSON.”
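A minimal submission sketch against a stock local ComfyUI server (the default port 8188 is an assumption of a standard install); the workflow dict must already be in API format:

```python
import json
import urllib.request

def submit_workflow(workflow: dict, host: str = "http://127.0.0.1:8188"):
    """POST an API-format workflow graph to a local ComfyUI /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{host}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # includes a prompt_id you can poll for results
```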
10) Context from older “markup for comics” work: why SMML feels natural
SMML is not the first time comics were represented as markup.
- CBML (Comic Book Markup Language) is an XML vocabulary for encoding comics structure (panels, balloons, captions). It exists for analysis and archiving, not generation, but it proves “comics as markup” is legitimate and useful. (dcl.luddy.indiana.edu)
- There are also hobbyist “comic markup → SVG rendering” projects, showing that markup-driven panel composition is intuitive even outside ML. (GitHub)
This gives you a strong way to position SMML:
- CBML: markup for describing comics.
- SMML: markup for describing and generating comics, via compilation to control signals.
11) The main pitfalls professionals will warn about
These are not arguments against SMML. They are design constraints SMML should acknowledge.
Pitfall 1: Ambiguous reading order in complex layouts
Solution: make order explicit or provide a <reading_order> list. Do not rely purely on geometry. Reading order research shows layout can violate simple paths. (Frontiers)
Pitfall 2: “Camera tags” do nothing unless you bind them to controls
Solution: treat camera tags as annotations until you have compilation targets (pose/depth/edges). ControlNet is the standard mechanism for this. (arXiv)
Pitfall 3: Character IDs are meaningless without identity conditioning
Solution: define an “identity contract” that compiles to IP-Adapter or InstantID or a sequence method like StoryDiffusion. (GitHub)
Pitfall 4: Tool churn and deprecations
Example: the Diffusers GLIGEN pipeline is explicitly labeled deprecated in the docs, which matters if you target it as a backend. (Hugging Face)
Solution: SMML should target capabilities (boxes, masks, pose control), not one fragile implementation.
Pitfall 5: Text-in-image quality and OCR are structurally hard
Solution: keep dialogue as typeset overlay. Use datasets like Manga109Dialog and segmentation annotations as evaluation targets, not as something diffusion must “just handle.” (arXiv)
12) A clean way to present SMML v0.1 to researchers (ready-to-post structure)
If you want maximum constructive feedback, a professional-friendly post usually includes:
- One-sentence claim: “SMML is a manga page DSL that compiles to layout + control artifacts for diffusion workflows.”
- Non-goals for v0.1: no claims about perfect artistry, perfect identity, or perfect text rendering.
- Core v0.1 guarantees: panels, frames, gutters, reading order.
- Execution compatibility targets: ControlNet-style structure control, box/mask grounding, identity adapters. (arXiv)
- A small test pack: 10 SMML pages + expected JSON-IR + expected panel geometry (an example entry appears below).
- A metrics plan: layout correctness, speaker-text association metrics (Manga109Dialog), segmentation-based character measurements (Manga109 segmentation annotations). (arXiv)
This turns SMML into “a spec + a benchmark seed,” which is what researchers can adopt.
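For concreteness, here is one possible test-pack entry; the field names and file layout are illustrative assumptions:

```python
# One illustrative test-pack entry: SMML in, expected IR and constraints out.
TEST_CASE = {
    "name": "two_panel_rtl_basic",
    "smml": ('<page direction="rtl">'
             '<panel order="1"><frame x="0.55" y="0.05" w="0.40" h="0.90"/></panel>'
             '<panel order="2"><frame x="0.05" y="0.05" w="0.40" h="0.90"/></panel>'
             '</page>'),
    "expected_ir": {
        "direction": "rtl",
        "panels": [
            {"order": 1, "frame": {"x": 0.55, "y": 0.05, "w": 0.40, "h": 0.90}},
            {"order": 2, "frame": {"x": 0.05, "y": 0.05, "w": 0.40, "h": 0.90}},
        ],
    },
    "min_gutter": 0.02,  # binding v0.1 constraint the compiled page must satisfy
}
```

A test harness then asserts that the compiler's IR matches expected_ir and that the geometry validator reports no violations.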
Curated online resources that directly support SMML’s framing
- ControlNet paper and implementations: the reference mechanism for spatial conditioning. (arXiv)
- LayoutGPT: structured “style sheet” output for layout planning by LLMs. (arXiv)
- DiffSensei: explicitly bridges multimodal reasoning and diffusion for customized manga generation. (CVF Open Access)
- MangaDiffusion: layout-controllable manga generation and Manga109Story dataset framing. (arXiv)
- Grounding and box constraints: GLIGEN, BoxDiff, MultiDiffusion. (GitHub)
- Identity consistency: IP-Adapter, InstantID, StoryDiffusion. (GitHub)
- Manga analysis datasets for evaluation: Manga109Dialog and Manga109 segmentation annotations (CVPR 2025). (arXiv)
- Reading order and gutter effects: comics reading order and eye-tracking literature. (Frontiers)
- Comics markup precedent: CBML as proof that comics-as-XML is a stable concept. (dcl.luddy.indiana.edu)
- ComfyUI panel/layout ecosystem and API-style execution: comfyui_panels, CR Comic Panel Templates, ComfyUI workflow JSON submission patterns. (GitHub)
Summary bullets
- SMML is strongest as DSL + JSON-IR + compilation semantics, not “tags as prompts.” (arXiv)
- v0.1 should guarantee geometry, gutters, and reading order, and treat camera/effects as annotations until they compile to controls. (Frontiers)
- Your metrics plan is realistic because manga datasets exist for speaker-text links and segmentation, and reading research supports studying flow and gutter effects. (arXiv)
- A practical backend target exists today (ComfyUI workflows + panel layout nodes), and it matches your JSON-IR phase. (ComfyUI)