Proposal: Overcoming the "Brain-Hand Disconnect" in Manga Generation via Semantic Layout Tags (SMML)

“I am a manga enthusiast with no coding background, but through deep analysis of Yoshihiro Togashi’s (Hunter x Hunter) layouts, I found a fatal disconnect between AI’s ‘Brain’ (LLM) and ‘Hands’ (Diffusion). I propose ‘SMML’ — a way to control AI using HTML-like tags. What do you think, experts?”

[Context: Why do we need SMML? The limitation of “Prompting”]

The current paradigm of “Natural Language Prompting” is reaching its limit for complex storytelling. Here is why the “Prompt-only” approach fails for serious Manga/Visual creators:

The “Flat Composition” Problem: Current Diffusion models process prompts as a “bag of words.” They lack a Spatial Grammar. This leads to repetitive, grid-based layouts (the “9-square trap”) because the AI doesn’t understand the dramatic hierarchy of a page.

The Loss of “Ma” (Negative Space/Timing): In Manga, the space between panels (Gutters) and the reading order (RTL vs LTR) define time and pacing. Natural language cannot precisely command the “absence of information” or “directional flow.”

The Brain-Hand Disconnect: We have LLMs that can analyze Yoshihiro Togashi’s complex layouts (The Brain), and we have Diffusion models that can draw anything (The Hands). But we lack a Nervous System (The Middle Layer) to translate high-level structural intent into pixel-level constraints.

This is why I propose SMML. It is not just a “new way to prompt”; it is a Logical Bridge designed to enforce structural integrity and cultural context (like Japanese Manga flow) that natural language alone cannot sustain.

Draft Specification: Semantic Manga Markup Language (SMML) for Controllable Image Generation

1. Abstract

Current image generation models (Diffusion models) lack the ability to understand logical layouts and cultural reading orders (e.g., Right-to-Left in Japanese Manga). This draft proposes SMML, a markup-based orchestration layer that bridges Large Language Models (LLMs) and Image Generation Models. By defining page structures through semantic tags similar to HTML/CSS, users can enforce dramatic pacing, focal points, and cultural context.

2. Core Concept: “Thinking in Layers & Logic”

Instead of “Black Box” prompting, SMML defines a page as a hierarchy of objects:

  • Logical Layer (The Brain): Defines “Why” and “Where” (e.g., <main>, <panel>, <reading-order>).

  • Execution Layer (The Hands): Renders pixels within the defined coordinates.

3. Proposed Tag Syntax (Examples)

  • <page direction="rtl" layout="dynamic">: Sets the global reading flow (Right-to-Left).

  • <panel id="P1" coord="0,0,200,340" focus="low">: A small introductory panel.

  • <panel id="P2" coord="220,0,750,700" role="main" depth="extreme">: The “Main” panel. Triggers higher sampling density and prioritizes lighting/composition here.

  • <character id="C1" ref="link_to_lora">: Ensures consistent character identity across panels.

  • <gutter style="narrow">: Controls the “Ma” (space/time) between actions.

4. Why this Solves the “Stiffness” of AI Manga

  • Spatial Intelligence: Prevents the AI from defaulting to equal-sized grids (The “9-Square Trap”).

  • Intentionality: Allows the LLM to act as a “Storyboard Artist” that gives coordinate-level instructions to the Generative AI.

  • Cultural Adaptability: Automatically flips the Z-pattern of eye movement based on the direction tag.

==================================================

“I realized that this ‘Brain-Hand Disconnect’ isn’t just a problem for Manga. It’s a fundamental challenge for Video (Temporal Logic) and Robotics (Physical Logic). Here is how we can expand the SMML concept into SVML and SRML to create a unified ‘Semantic Commander’ for AI.”

Extended Proposal: Expanding SMML to Video (SVML) and Robotics (SRML)

  1. Concept Expansion: From Static Panels to Dynamic Actions
    The core philosophy of Semantic Markup remains the same: Bridging the “Reasoning Brain” (LLM) and the “Executive Hands” (Diffusion/Actuator models) through structured intent.

  2. SVML (Semantic Video Markup Language)
    SVML moves beyond text-to-video by providing a Chronological Logic Layer. Current video AIs struggle with specific timing and complex camera choreography.

Example Syntax:

XML

<video_sequence fps="24">

</video_sequence>

  3. SRML (Semantic Robotic Markup Language)
    SRML acts as a Behavioral Orchestrator for Physical AI (Robotics). It translates high-level moral or strategic commands into physical constraints.

Example Syntax:

XML

<robot_action_group priority="safety_first">
  <motor_control>
    <arm_movement trajectory="smooth" force_limit="5.0N">
    </arm_movement>
  </motor_control>
</robot_action_group>

  4. The Vision: One Unified Semantic Architecture
    By implementing these “Semantic Bridges,” we can solve the fundamental issue of “AI Hallucination in Motion.”

In Video: It prevents characters from morphing randomly by locking their identity via a character-reference tag.

In Robotics: It prevents dangerous movements by wrapping raw motor data in a <safety_context> layer.


Subject: Use Case Study - Applying SMML to a Dramatic Scene (Subordinate vs. Boss)

To demonstrate the power of SMML, I have designed a specific use case. This example shows how the markup translates a “psychological confrontation” into precise visual instructions that an AI can follow without hallucinating the layout.

[SMML Prototype Code: “The Snap of a Pen”]

XML

<page direction="rtl" style="psychological_thriller">
  <panel_group height="20%">
    <panel id="P1" camera="extreme_closeup" focus="eyes_reflecting_monitor">
      <content>Reflection of dull Excel numbers in the subordinate's glasses. Expressing emptiness.</content>
    </panel>
    <panel id="P2" camera="low_angle" focus="boss_mouth_smirking">
      <content>The distorted smirk of the boss. The word "DEADLINE" blurs in the background.</content>
    </panel>
  </panel_group>

  <panel id="P3" role="main" height="55%" padding="large">
    <composition type="vanishing_point_left">
      <subject id="subordinate" pos="bottom_right" pose="standing_back" intensity="90">
        Standing back with resolve. Carrying the weight of piles of documents like a tombstone.
      </subject>
      <subject id="boss" pos="top_left" pose="looming" opacity="0.8">
        The boss depicted as a giant shadow looming over the office. City lights outside the window.
      </subject>
    </composition>
    <visual_effect type="speed_lines_inner" target="subordinate" />
    <bubble type="monologue" pos="center">"...Is this the 'responsibility' you demanded?"</bubble>
  </panel>

  <panel id="P4" height="25%" background="white_out">
    <camera focus="hand_action" type="zoom_in">
      <action id="pen_snap">
        A trembling hand clicking a ballpoint pen. A ritual of defiance, replacing Netero's prayer.
      </action>
    </camera>
    <visual_effect type="high_contrast" />
  </panel>
</page>

Why this works:

  1. RTL Flow Control: Forces the eye movement from the hollow eyes (Right) to the boss’s smirk (Left).

  2. Spatial Hierarchy: By assigning height="55%" and role="main", the AI knows to allocate the highest detail and focal depth to the middle panel.

  3. White-out Effect: The background="white_out" tag in the final panel creates a sudden drop in visual information, forcing the reader to “hear” the small sound of the pen click in their mind—much like the activation of “Hyakushiki Kannon.”

This is the “Neural Bridge” we need. Instead of guessing the layout, the AI follows the SMML blueprint to achieve professional-grade manga pacing.


I think it’s truly a developing field.


What experts will say about your SMML idea

Experts will mostly agree with your diagnosis, then immediately ask “what does it compile to.”

Your core point is correct:

  • LLMs can plan and explain layouts (story logic, pacing, composition intent).
  • Diffusion can render images (pixels).
  • The missing piece is a structured intermediate representation that turns “intent” into explicit constraints.

In 2023–2025 research and tooling, this “middle layer” is exactly where progress has happened: layout planning, spatial conditioning, grounded placement, panel/page composition, and identity consistency modules. Your SMML is a clean way to package that into a creator-facing interface. (arXiv)

The expert critique will be:

  • “Tags alone don’t change the model.”
  • “So SMML must be a DSL + compiler that outputs the right control signals and workflow steps.”

That critique is fair. It also points to the strongest version of SMML.


Background: why “prompt-only” struggles for manga layouts

1) Diffusion does not natively represent page grammar

A manga page is not “one image.” It is a structured document:

  • panels (units of time)
  • gutters (timing and silence)
  • reading order (culture-specific flow)
  • focal hierarchy (which panel carries the emotional peak)

Text prompts condition diffusion, but they do not directly define a page-level grammar or enforce a reading path. So the model drifts toward generic compositions unless you add structure.

2) “Ma” is easiest when it is deterministic, not generated

Your point about “absence of information” is important.

A gutter is literally “nothing,” but it has meaning. The best practical solution is: don’t ask diffusion to draw gutters. Reserve whitespace in a layout compositor. This is exactly what real comic layout tooling does, and modern ComfyUI workflows can do the same with panel-layout nodes. (GitHub)

3) “Brain–Hand disconnect” is a known pattern in the literature

Manga-specific systems like DiffSensei and MangaDiffusion already formalize the gap you’re describing: story → structured control/layout → images, with special attention to multi-panel coherence and layout reasonableness. (arXiv)

So your proposal matches a real research direction. Your unique value is “make this controllable by creators through a stable language.”


What exists today that is “SMML in pieces”

Think of SMML as a “wrapper” that unifies these already-proven components.

A) Layout planning as structured text (LLM-friendly)

  • LayoutGPT: uses a style-sheet-like language so LLMs can output plausible layouts and spatial relations more reliably than free-form prompting. (LayoutGPT)
  • PlanGen: explicitly “pre-plans” layout conditions before image generation. (360cvgroup.github.io)

This supports your thesis: structured layout representations help.

B) Spatial conditioning (how you force composition)

  • ControlNet: adds spatial conditioning (pose, edges, depth, segmentation) to guide diffusion. (arXiv)

This is the “nervous system” mechanism that turns structure into pixels.

C) Grounded placement (boxes, regions, constraints)

  • GLIGEN: grounded generation using bounding boxes as condition inputs. (Gligen)
  • BoxDiff: training-free box-constrained diffusion (compile tags to boxes without needing extra training). (arXiv)
  • MultiDiffusion: binds multiple diffusion processes under shared constraints, useful for region-based control and multi-area generation. (GitHub)

These are strong “compiler targets” for <subject pos="...">.

D) Page/panel composition (your gutters and reading order)

  • ComfyUI panels tooling exists for comics/manga-like panel layouts. (GitHub)
  • Template-based panel layout nodes exist (example: “CR Comic Panel Templates”). (RunComfy)
  • Other panel/page layout plugins exist (example: panelforge). (GitHub)

This is how you make “Ma” real: deterministic layout first, generation second.

E) Character identity consistency (the “same character across panels” problem)

  • IP-Adapter: lightweight image-prompt adapter to keep identity/style anchored. (GitHub)
  • InstantID: tuning-free identity preserving from a single reference image. (GitHub)
  • StoryDiffusion: consistency mechanism (consistent self-attention) for long sequences and comics-like outputs. (arXiv)

This is what your <character ref="..."> must compile into, otherwise it becomes “just a name token.”

F) Comics markup precedent (proof that comics want a schema)

  • CBML is an XML vocabulary for encoding comics structure (panels, balloons, captions). It is not for generation, but it proves the “markup for comics” concept is stable and useful. (dhq-static.digitalhumanities.org)

This helps you frame SMML as “CBML + execution semantics.”


The strongest version of SMML

SMML is not a prompt format

SMML is best framed as:

A Domain Specific Language (DSL) for manga pages that compiles into:

  1. deterministic layout artifacts (panels/gutters/order) and
  2. diffusion control artifacts (boxes/masks/control maps/identity adapters),
    executed as a reproducible workflow.

That is exactly how “real” pipelines are designed: planning artifacts are separate from rendering.

Two-layer design (recommended)

To avoid becoming “a pile of tool knobs,” split SMML into:

  1. Semantic layer (creator intent, stable)
  • pacing, beats, roles, flow, “Ma,” reading direction
  • camera intent (close-up, low-angle) as semantics
  2. Execution layer (backend-specific, pluggable)
  • which engine, which model, steps, ControlNet strengths
  • identity method (IP-Adapter vs InstantID)
  • how many refinement passes

This is how you keep SMML future-proof when tools change.


A clear SMML mental model (easy to explain)

Step 1: SMML describes a page as a document

Like HTML describes a webpage’s structure.

  • Panels are “containers”
  • Gutters are “spacing rules”
  • Reading order is global metadata

Step 2: The compiler turns tags into artifacts

  • A page layout (SVG/JSON)
  • Per-panel prompts and budgets
  • Per-panel control inputs (pose/edge/depth maps, boxes)
  • Identity bindings shared across panels

Step 3: The renderer executes the plan

  • Render each panel
  • Fix mistakes by inpainting/refinement
  • Compose panels with gutters deterministically
  • Typeset bubbles as a separate layer

This is how you get professional pacing without fighting the model.
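
To make the “compiler” idea concrete, here is a minimal Python sketch (illustrative only): it parses a tiny SMML-like page into pixel-space panel rectangles, an authoritative reading order, and per-role compute budgets. The frame attribute format, the budget numbers, and the output structure are assumptions for demonstration, not a spec.

Python

# Minimal sketch: compile an SMML-like page into per-panel render jobs.
# Tag names, attributes, and budget numbers are illustrative assumptions.
import xml.etree.ElementTree as ET

SMML = """
<page direction="rtl" width="1200" height="1800">
  <panel id="P1" order="1" role="sub"  frame="0.52,0.00,0.48,0.20"/>
  <panel id="P2" order="2" role="sub"  frame="0.00,0.00,0.48,0.20"/>
  <panel id="P3" order="3" role="main" frame="0.00,0.22,1.00,0.55"/>
  <panel id="P4" order="4" role="sub"  frame="0.00,0.79,1.00,0.21"/>
</page>
"""

BUDGETS = {"main": {"steps": 40, "scale": 1.5}, "sub": {"steps": 25, "scale": 1.0}}

def compile_page(smml):
    page = ET.fromstring(smml)
    pw, ph = int(page.get("width")), int(page.get("height"))
    jobs = []
    for panel in page.findall("panel"):
        x, y, w, h = (float(v) for v in panel.get("frame").split(","))
        jobs.append({
            "id": panel.get("id"),
            "order": int(panel.get("order")),
            "role": panel.get("role"),
            "rect_px": (round(x * pw), round(y * ph), round(w * pw), round(h * ph)),  # deterministic layout artifact
            "budget": BUDGETS[panel.get("role")],  # compute budget per role
        })
    jobs.sort(key=lambda j: j["order"])  # declared reading order is authoritative
    return {"direction": page.get("direction"), "size": (pw, ph), "panels": jobs}

print(compile_page(SMML))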


Your “Snap of a Pen” example, reframed as an executable plan

Your SMML example is good because it includes:

  • hierarchy (panel group + main panel)
  • role (main)
  • camera intent
  • white-out / silence beat

Here is how an expert would expect it to compile:

A) Page compiler output (deterministic)

  • Page direction RTL
  • Panel rectangles with a top strip of 2 small panels, a dominant middle, a bottom strip
  • Gutters: narrow or wide rules
  • Reading order list: P1 → P2 → P3 → P4 (or the RTL variant you define)

This part should not depend on diffusion at all.

B) Panel render plan (probabilistic but constrained)

For each panel:

P1: “eyes reflecting monitor”

  • prompt: close-up glasses reflection, sterile office mood
  • Control: lineart/edge or pose map if needed
  • budget: low-to-medium

P2: “boss smirk low angle”

  • prompt: low-angle mouth smirk, “deadline” blurred background
  • Control: face/pose guidance if available
  • budget: low-to-medium

P3: main confrontation

  • prompt: two subjects, subordinate bottom-right, boss top-left shadow

  • Compile <composition type="vanishing_point_left"> into either:

    • a depth/line guide ControlNet input, or
    • box constraints + negative regions
  • budget: high, plus optional refinement/inpaint pass for hands and faces

P4: white-out pen click

  • Layout gives the white background deterministically
  • Panel render focuses only on the hand and pen
  • High-contrast effect is a post-process (or style prompt), not a vague instruction

This “compiles cleanly” because it maps to known controllability methods: ControlNet for structure, boxes for placement, identity adapters for consistency. (arXiv)
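
As a rough illustration, the compiled render job for P3 might look like the following Python dict. Every field name and value here is an assumption about what a compiler could emit; the point is that composition, placement, identity, and budget become explicit, machine-readable constraints rather than prose.

Python

# Illustrative render job for panel P3 only; field names and values are
# assumptions about what a compiler could emit, not a fixed schema.
p3_job = {
    "panel_id": "P3",
    "rect_px": (0, 396, 1200, 990),   # from the deterministic layout, assuming a 1200x1800 page
    "prompt": "office at night, subordinate bottom-right, boss as a looming shadow top-left",
    "controls": [
        # compiled from <composition type="vanishing_point_left">
        {"kind": "depth_or_line_guide", "strength": 0.7},
    ],
    "regions": [
        # compiled from subject positions; boxes normalized to the panel
        {"subject": "subordinate", "box": (0.60, 0.55, 0.95, 0.98)},
        {"subject": "boss",        "box": (0.05, 0.02, 0.55, 0.60)},
    ],
    "identity": {"subordinate": "ref_image_01.png", "boss": "ref_image_02.png"},
    "budget": {"steps": 40, "scale": 1.5, "refine_faces_hands": True},
}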


What you should adjust in the written proposal (so experts take it seriously)

1) Replace “higher sampling density in a region” with a realistic mechanism

Diffusion doesn’t natively allocate “sampling density” inside one image region.

Write instead:

  • “Main panels receive higher resolution, more steps, and optional refinement passes.”
    That is how real workflows do it.

2) Make gutters and text bubbles explicitly “non-diffusion”

If you want “Ma” and readability:

  • Gutters are composited whitespace.
  • Speech bubbles and typography are vector layers (or editor overlays).
    This is both more controllable and more editable.

3) Be precise about what “direction=rtl” means

It should mean:

  • a reading-order solver (list of panels in order)
  • optional flow visualization (arrows)
    It should not mean “the model reads right-to-left.”

4) Name the compiler targets up front

Experts relax when they see you know the targets:

  • ControlNet-style conditioning for structure (arXiv)
  • GLIGEN/BoxDiff-style boxes for placement (Gligen)
  • IP-Adapter/InstantID/StoryDiffusion for identity consistency (GitHub)
  • Deterministic panel layout via ComfyUI panel tools (GitHub)

A practical “no-coding” path to demonstrate SMML today

If you want credibility quickly, do a demo in a toolchain that already supports the building blocks:

  1. Panel layout: use ComfyUI panels or template nodes to create the page structure first. (GitHub)
  2. Panel-by-panel generation: render each panel as a separate image with its own prompt and controls.
  3. Structure control: use ControlNet conditioning for poses/edges/depth. (arXiv)
  4. Identity lock: use IP-Adapter or InstantID for consistent characters. (GitHub)
  5. Sequence consistency (optional but strong): test StoryDiffusion for multi-panel continuity. (arXiv)
  6. Automation-ready (if you later add a compiler): ComfyUI server accepts workflows posted to /prompt, and you can export API workflow JSON from some hosted UIs. (ComfyUI)

This produces the key artifact experts want: “here is the plan, here is the deterministic layout, here is the controlled render.”


The “expert-grade” one-paragraph summary of SMML

SMML is a manga-specific DSL that represents pages as structured documents (panels, gutters, flow, hierarchy) and compiles into reproducible generation workflows: deterministic layout composition plus diffusion control artifacts (ControlNet conditions, box/mask grounding, identity adapters) executed panel-by-panel for pacing and consistency. Existing work in layout planning (LayoutGPT, PlanGen), manga generation pipelines (DiffSensei, MangaDiffusion), and controllable diffusion (ControlNet, GLIGEN, BoxDiff, MultiDiffusion) shows that the missing “middle layer” is real; SMML proposes a creator-friendly standard for it. (LayoutGPT)


Summary bullets

  • Your diagnosis matches current research: manga generation needs a middle layer between LLM planning and diffusion rendering. (arXiv)
  • SMML is strongest as a DSL + compiler, not “XML prompts.” The compiler targets are ControlNet, boxes/masks (GLIGEN/BoxDiff), and identity adapters. (arXiv)
  • Make “Ma” real by keeping gutters deterministic and bubbles editable, not generated. Use panel layout tools to enforce this. (GitHub)
  • A no-coding proof is feasible today using ComfyUI layout + per-panel generation + identity locking, then later a compiler can automate it through /prompt. (ComfyUI)

Thank you for your insightful feedback and for sharing those relevant papers. The “Intermediate Representation” you mentioned is exactly the kind of “bridge” I was hoping for. It’s exciting to see that my intuition as a manga fan aligns with current technical research. I’m strictly an “idea person” and a manga enthusiast, not a developer or researcher. I don’t have the technical background to run experiments or write code myself. So I’m putting this concept out there for the community. Please feel free to take this SMML idea and use it in your professional research however you like. I’d be happy if my “user-side perspective” helps experts like you build the next generation of creative AI. I’ll be cheering you on from the sidelines!


[Update: Strategic Roadmap for SMML Implementation]

To help researchers and developers visualize how SMML can be integrated into the current AI ecosystem, I’ve collaborated with AI assistants to draft a realistic Implementation Roadmap. This breaks down the concept from “Vision” to “Deployment.”

Phase 1: Minimal Specification (SMML v0.1)
Goal: Define the “Minimum Viable Language” for layout testing.

Core Tags: <page direction="rtl">, <panel role="main">, <character id="">.

Focus: Establishing structural integrity rather than artistic quality.

Phase 2: JSON-IR (Intermediate Representation)
Goal: Convert SMML into a machine-friendly format.

Mechanism: LLM acts as the Architect (outputting JSON), while the Diffusion model acts as the Executive (rendering the JSON).

Benefit: Standardizing the bridge between Reasoning and Execution.

Phase 3: Layered Generation & ControlNet Integration
Goal: Separate concerns to prevent visual hallucinations.

Execution: Generate Backgrounds, Characters, and Dialogue on separate layers/masks.

Advantage: Allows professional-grade editing and consistent character identity across panels.

Phase 4: Automated Evaluation Metrics
Goal: Move from subjective “coolness” to objective “correctness.”

Metrics: OCR success rates for dialogue, character ID consistency across 20+ pages, and “eye-tracking” flow analysis.

Phase 5: Domain Expansion (SVML & SRML)
Goal: Scaling the “Semantic Bridge” to other industries.

SVML (Video): Controlling temporal pacing and camera choreography.

SRML (Robotics): Mapping high-level strategic intent to physical motor constraints.

Final thoughts: I’m just an idea person who loves manga, so I’ll leave the “heavy lifting” to you experts! Even if the name SMML disappears, I would be happy if this “Layered Semantic Approach” helps the community in some small way.

I’m very curious to hear what professional researchers think about this draft!


[Update: Detailed Draft Specification for SMML v0.1]

With the help of AI assistants (ChatGPT/Gemini), I have organized the SMML concept into a more specific, minimal draft.

I’m sharing this v0.1 Draft as a humble starting point. It focuses purely on “layout and structural integrity” rather than artistic style or storytelling, to make it easier for developers to test.


SMML v0.1: Core Specification (Draft)

Concept: Separation of “Design” and “Execution.”

  • LLM (Brain): Generates the SMML blueprint.

  • Renderer (Hands): Interprets the tags to maintain spatial and character consistency.

1. Minimal Tag Set (Proposed)

  1. <page>: Defines reading flow.

    • Attributes: direction (rtl/ltr), layout_mode (grid/free)
  2. <panel>: Defines a single frame.

    • Attributes: id, order, role (main/sub/focal)
  3. <frame>: Defines panel geometry (essential for layout).

    • Attributes: x, y, w, h (Relative coordinates 0.0–1.0)
  4. <character>: For identity consistency.

    • Attributes: id, focus_point (face/hand/back)
  5. <dialogue>: Text data (should be generated on a separate layer).

    • Attributes: id, speaker, type (shout/whisper/thought)
  6. <camera>: Spatial perspective.

    • Attributes: type (close/mid/wide), angle (low/eye/high)
  7. <effect>: Visual cues.

    • Attributes: type (speed/impact/inner_emotion)
  8. <meta>: For experimental LoRA or specific prompt hints.

2. Sample Code: “The Awakening”

XML

<page direction="rtl">
  <panel id="p1" order="1" role="focal">
    <frame x="0.0" y="0.0" w="1.0" h="0.6" />
    <camera type="mid" angle="low" />
    <character id="C1" focus_point="face" />
    <effect type="impact" />
    <dialogue id="d1" speaker="C1" type="shout">
      "Now, my time begins."
    </dialogue>
  </panel>

  <panel id="p2" order="2" role="sub">
    <frame x="0.0" y="0.65" w="0.45" h="0.3" />
    <camera type="close" angle="eye" />
    <character id="C2" />
  </panel>
</page>

3. Why this approach?

Current models struggle with “hallucinating” layouts. By using relative coordinates (<frame>) and identity IDs (<character>), we can move away from “prompt gambling” and toward “structural control.”

Also, these tags (like <camera angle>) are designed with future scalability in mind, potentially bridging the gap between Manga, Video (SVML), and Robotics (SRML).

Closing Thoughts: I am not an expert in coding, just a manga fan dreaming of a better creative tool. Please feel free to take this draft, tear it apart, or improve it in any way you see fit. I would be honored if this “user-side dream” helps the professional community in some small way.


Updated.


Your updated SMML draft is now “research-shaped.” It has (1) a minimal language, (2) an IR step, (3) a layered execution plan, and (4) evaluation metrics. That matches how controllable generation work is actually built in 2023–2025 research and tooling: plan first, render second, measure third. (arXiv)

What matters most to professionals is not whether the tags look like HTML. What matters is whether the tags have clear semantics and compile into concrete controls (boxes, masks, pose maps, panel compositing rules, identity adapters) that today’s diffusion pipelines can obey. (arXiv)

Below is a clear, beginner-friendly way to understand what you have, what researchers will praise, what they will critique, and how to present SMML v0.1 so it is maximally adoptable.


1) Background: what you are really proposing (in plain terms)

The “middle layer” has a standard name in engineering: IR + compiler

In many fields, humans write something readable, then a compiler turns it into something machines can execute.

  • HTML describes structure. The browser turns it into pixels.

  • SMML describes manga page intent. A “compiler” would turn it into:

    • a page layout (panels and gutters),
    • per-panel generation jobs,
    • per-panel control signals (pose/edges/masks/boxes),
    • identity bindings (character consistency),
    • and then a compositor assembles the page.

This is not a new idea in principle. It is exactly what “layout planning with LLMs” papers have been pushing: structured representations make layout intent more reliable than raw natural language. LayoutGPT is a clear example: it uses a style-sheet-like language to help LLMs output plausible layouts. (arXiv)

Why manga is a perfect case for IR

Manga pages are strongly “document-like”:

  • panels are discrete units of time,
  • gutters (negative space) carry pacing (“Ma”),
  • reading order can be culturally specific (RTL vs LTR),
  • page hierarchy matters (main panel vs minor beats).

A single diffusion pass is not naturally a document renderer. So a document-like IR makes sense.


2) What existing research says about your “Brain–Hand disconnect”

You are not alone. The field already treats “multi-panel manga” as a special problem.

DiffSensei: explicit “bridge” between an MLLM and diffusion for manga

DiffSensei frames manga generation as needing:

  • multi-character control,
  • layout control,
  • sequential coherence across panels,
    and it explicitly links a multimodal model with diffusion in a coordinated system. (arXiv)

MangaDiffusion: layout-controllable manga pages from plain text

MangaDiffusion emphasizes:

  • generating multi-panel pages,
  • reasonable and diverse page layouts,
  • intra-panel and inter-panel interaction,
    and it introduces Manga109Story as a dataset for “plain text story → manga page.” (arXiv)

Why this matters for SMML: these papers prove professionals agree that “prompt-only” is insufficient for full manga pages. Your contribution is not “a new diffusion model.” Your contribution is a user-facing “bridge format” that can ride on top of these ideas.


3) Background: what diffusion can and cannot reliably obey

To keep SMML credible, it helps to separate:

Deterministic things (machines can guarantee)

These should be done outside diffusion whenever possible:

  • panel rectangles and gutters,
  • reading order metadata,
  • page composition and whitespace,
  • placing dialogue text as a typeset layer.

This aligns with your “Ma” argument: the cleanest “white-out” is literal whiteness in the compositor, not a probabilistic generation request.

Probabilistic things (diffusion can obey, but only with the right controls)

Diffusion is much more reliable when you give it structure beyond text:

  • ControlNet adds spatial conditioning inputs like edges, depth, segmentation, human pose. This is the canonical mechanism for “spatial grammar” in modern diffusion pipelines. (arXiv)

  • Grounded placement uses boxes or masks so the model knows “what goes where.”

    • GLIGEN is a known grounding approach using boxes as conditions (note: the Diffusers GLIGEN pipeline is marked deprecated, which is a practical maintenance warning). (GitHub)
    • BoxDiff is a training-free way to enforce box constraints during diffusion steps. (GitHub)
    • MultiDiffusion is a broader framework that can bind multiple diffusion processes with shared constraints and can use masks/boxes as guiding signals. (arXiv)
  • Identity consistency across panels is its own hard problem.

    • IP-Adapter is a lightweight adapter for image-prompting and can help anchor identity/style. (GitHub)
    • InstantID targets identity-preserving generation from a single reference face image. (GitHub)
    • StoryDiffusion focuses on long-range character consistency via “consistent self-attention,” and explicitly mentions comics creation and consistent attire. (GitHub)

Implication for SMML: <character id="C1"> cannot just be a string. It must compile to one of these identity mechanisms, or it stays “wishful metadata.”


4) What professionals will think about your Phase roadmap

Your phases are reasonable. Researchers will mostly agree with the sequence “spec → IR → pipeline → metrics → expansion.” They will also want sharper boundaries between phases.

Phase 1: Minimal spec (SMML v0.1)

Good. Researchers like minimality because it becomes testable.

The most important improvement: define what is binding versus non-binding.

  • Binding in v0.1: panels, frames, gutters, reading order.
  • Non-binding in v0.1 (annotations): camera, effects, dialogue style.

This avoids overpromising before controls exist.

Phase 2: JSON-IR

Very good. Professionals prefer JSON-IR because it can be consumed by tooling.

This also aligns with LayoutGPT-style thinking: LLM outputs structured layout representations more reliably than prose. (arXiv)

Phase 3: Layered generation

Good. Layer separation matches the reality that:

  • backgrounds, characters, and text have different constraints,
  • corrections are commonly done by inpainting and compositing.

Even if the “perfect layer decomposition” is hard, the direction is correct.

Phase 4: Automated metrics

Excellent. This is where “cool demo” becomes “research project.”

Phase 5: SVML and SRML expansion

Fine as a vision, but professionals will ask you to keep scope tight until SMML is proven.

A practical note: timeline/sequence control already has interchange formats in video editing ecosystems, which supports your broader “semantic bridge” thesis (structure separate from render). If you ever write SVML seriously, aligning with existing editorial timeline concepts is the credibility move.


5) Your v0.1 tag set: what works, what needs tightening

Your minimal tags are sensible. The main missing piece is a stricter definition of geometry and order.

A) <frame x y w h> is the right core primitive

Relative coordinates are a strong choice because:

  • they are model-agnostic,
  • they are easy to normalize across page sizes,
  • they map naturally to cropping, masks, and compositing.

Researchers will still require you to define:

  • coordinate origin (typically top-left),
  • whether x,y are panel top-left,
  • whether panels can overlap,
  • whether gutters are explicit rectangles or implied by separation.
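
A tiny sketch of one possible convention (assumed, not part of any spec): top-left origin, x/y as the panel's top-left corner, and a validator that rejects overlapping frames.

Python

# Sketch of an assumed geometry convention: origin at top-left,
# (x, y) is the panel's top-left corner, all values relative (0.0-1.0).
def frame_to_pixels(frame, page_w, page_h):
    return (round(frame["x"] * page_w), round(frame["y"] * page_h),
            round(frame["w"] * page_w), round(frame["h"] * page_h))

def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

rects = [frame_to_pixels(f, 1200, 1800) for f in (
    {"x": 0.0, "y": 0.0,  "w": 1.0,  "h": 0.6},
    {"x": 0.0, "y": 0.65, "w": 0.45, "h": 0.3},
)]
assert not overlaps(rects[0], rects[1])  # a v0.1 validator could enforce "no overlap"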

B) Reading order must be explicit and authoritative

You already have order="1", which is good.

This matters because reading order is not always a simple “Z-path,” especially when layouts get dynamic. Reading research shows layout affects viewing patterns and that deviations from regular grids can change scan paths. (Frontiers)

So professionals will like:

  • order fields as the source of truth,
  • or an explicit <reading_order> list for complex pages.

C) Dialogue as data, not pixels

Separating <dialogue> from final rendering is correct.

There is an entire subfield around speech balloon detection, OCR, and speaker association in comics because it is structurally difficult. Manga109Dialog exists specifically to link speakers and texts, and it even notes reading order features are relevant for performance. (arXiv)

Keeping dialogue typeset as a layer makes SMML outputs editable and robust.

D) Camera and effect tags: keep them as annotations until you define compilation

Your <camera> and <effect> tags are fine as a “future-proof vocabulary.” The only risk is implying they enforce outcomes today.

Professionals will accept them if you state:

  • v0.1 camera/effect tags are non-binding hints,
  • later versions compile them into control signals (pose maps, depth maps, edge guidance, etc.). (arXiv)

6) The key missing piece: compilation semantics (tag → control)

Researchers will ask: “What does this tag become at runtime?”

A simple, easy-to-read mapping (conceptual, not code):

Page-level compilation

  • <page direction="rtl">

    • becomes: a reading order convention default + an ordering validator.
    • does not mean: diffusion understands RTL text.
  • <panel role="main">

    • becomes: higher compute budget (resolution, steps, refinement passes) and stricter control strength.

Panel geometry compilation

  • <frame x y w h>

    • becomes: crop region and mask for compositing, and possibly region constraints for generation.

Subject placement compilation

  • <character id="C1"> + optional position attributes (if you add them later)

    • becomes: a spatial region reserved for that character (box or mask).
    • can be enforced using grounding methods like GLIGEN or BoxDiff-style constraints. (GitHub)

Identity compilation

  • <character id="C1" ref="...">

    • becomes: an identity-conditioning mechanism:

      • IP-Adapter reference embedding, or
      • InstantID face identity module, or
      • StoryDiffusion sequence consistency module. (GitHub)

This is the “nervous system” in technical form.
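
Conceptually, the “identity contract” can be as simple as a registry that refuses to render unbound character IDs. The method names, file paths, and weights below are placeholders, a sketch of the idea rather than a real integration.

Python

# Sketch of an "identity contract": each <character id="..."> compiles to a
# concrete conditioning config. Method names, paths, and weights are assumptions.
IDENTITY_REGISTRY = {
    "C1": {"method": "ip_adapter", "reference": "refs/c1_face_sheet.png", "weight": 0.8},
    "C2": {"method": "instantid",  "reference": "refs/c2_face.png",       "weight": 0.7},
}

def resolve_identity(character_id):
    try:
        return IDENTITY_REGISTRY[character_id]
    except KeyError:
        # Without a binding, the id is "wishful metadata": fail loudly.
        raise ValueError(f"character '{character_id}' has no identity binding")

print(resolve_identity("C1"))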


7) Why “Ma” and gutters belong in layout, not diffusion (with research context)

Your “gutter as time” intuition has empirical support.

  • Research on comic page layout reading order discusses how factors like separation and proximity influence navigation, and that larger gutters can push readers to navigate differently. (Visual Language Lab)
  • Eye-tracking studies also emphasize that readers use expected order but can deviate depending on page design. (Wiley Online Library)

So a strong professional framing is:

  • “SMML treats gutters as first-class layout constraints.”
  • “Diffusion renders panel contents, not page whitespace.”

That is exactly how you preserve “silence beats” and “white-out” effects reliably.
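
A minimal compositor sketch makes this concrete: the page canvas starts white, panels are pasted into their rectangles, and everything not covered by a panel stays white by construction. File names and rectangles here are placeholders.

Python

# Minimal compositor sketch: gutters are literal whitespace on the canvas,
# never generated. Panel images and rectangles are placeholders.
from PIL import Image

PAGE_W, PAGE_H = 1200, 1800
panel_rects = {            # (x, y, w, h) in pixels; gutters are the gaps between rects
    "P3": (0, 396, 1200, 990),
    "P4": (0, 1422, 1200, 378),
}

page = Image.new("RGB", (PAGE_W, PAGE_H), "white")   # "Ma" is guaranteed here
for panel_id, (x, y, w, h) in panel_rects.items():
    art = Image.open(f"renders/{panel_id}.png").resize((w, h))
    page.paste(art, (x, y))
page.save("composited_page.png")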


8) Your evaluation metrics: how to make them concrete and publishable

Your metrics are good. Making them concrete means tying them to existing datasets and evaluation tasks.

A) Layout correctness

If you generate layout from SMML deterministically, you can evaluate:

  • exact match of panel rectangles,
  • overlap violations,
  • gutter width constraints,
  • reading order consistency (order list vs geometry).

If you allow LLMs to propose layouts, then compare predicted frames against targets.
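
One possible (and deliberately naive) consistency check, assuming relative coordinates and a simple top-to-bottom, right-to-left traversal rule; real pages will need a smarter solver, but a check like this already catches obvious order/geometry mismatches.

Python

# Sketch of one layout metric: does the declared order match a naive RTL
# traversal (top-to-bottom, then right-to-left)? The rounding threshold is an assumption.
def naive_rtl_order(panels):
    # Sort by vertical band first, then right-to-left within the band.
    return [p["id"] for p in sorted(panels, key=lambda p: (round(p["y"], 2), -(p["x"] + p["w"])))]

declared = ["P1", "P2", "P3", "P4"]
panels = [
    {"id": "P1", "x": 0.55, "y": 0.0,  "w": 0.45, "h": 0.2},
    {"id": "P2", "x": 0.0,  "y": 0.0,  "w": 0.50, "h": 0.2},
    {"id": "P3", "x": 0.0,  "y": 0.22, "w": 1.0,  "h": 0.55},
    {"id": "P4", "x": 0.0,  "y": 0.79, "w": 1.0,  "h": 0.21},
]
print("order consistent with geometry:", naive_rtl_order(panels) == declared)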

B) Dialogue and OCR metrics

Separate the pipeline:

  1. detect balloons/text regions,
  2. OCR,
  3. speaker association.

Manga109Dialog exists for speaker-to-text pairs and gives you a standard evaluation target. (arXiv)

C) Character consistency metrics

You can measure:

  • face embedding similarity across panels (when faces visible),
  • body/clothing similarity (shape or segmentation overlap),
  • identity classifier consistency.

New segmentation resources for Manga109 are directly relevant because they include categories like frames, text/dialog, onomatopoeia, faces, bodies, balloons. (CVF Open Access)
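
A simple way to score this, assuming you already have per-panel face or identity embeddings from any off-the-shelf embedder, is mean pairwise cosine similarity:

Python

# Sketch: character consistency as mean pairwise cosine similarity of per-panel
# embeddings. The embedder is out of scope; random vectors stand in for it here.
import itertools
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_score(embeddings):
    pairs = list(itertools.combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

rng = np.random.default_rng(0)
print(consistency_score([rng.normal(size=512) for _ in range(4)]))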

D) Flow metrics (proxy for eye tracking)

If you cannot run eye tracking:

  • measure whether the intended reading order is consistent with a rule-based traversal,
  • measure “jump penalties” when gutters are large (a proxy for “don’t jump across silence”).

This is grounded in reading order research and gestalt grouping discussions in comics. (Frontiers)


9) Practical ecosystem integration: why ComfyUI is a natural “execution backend”

You do not need SMML to be married to ComfyUI, but ComfyUI is a realistic first target because it already speaks “graph + compositing + masks + control nodes.”

A) Panel layout nodes exist

  • comfyui_panels is explicitly for comics/manga-like panel generation and organization. (GitHub)
  • CR Comic Panel Templates generates structured layouts with rows/columns and auto sizing. (RunComfy)
  • comfyui-panelforge is an extension aimed at comic panel creation, including layout and speech bubbles. (GitHub)

This supports your core thesis: page grammar can be deterministic.

B) Workflow-as-JSON matches your JSON-IR phase

ComfyUI workflows are JSON graphs. That matches the idea of compiling SMML into an executable IR.

  • ComfyUI documentation explains workflows as JSON graphs and emphasizes controllable generation through node composition. (ComfyUI)
  • ComfyUI Cloud docs describe submitting a workflow via POST /api/prompt using “API format” workflows. (ComfyUI)
  • Other integration docs describe exporting “API JSON” for POST requests to a /prompt-style endpoint. (docs.nebius.com)

So your Phase 2 (JSON-IR) can map naturally to “workflow JSON.”
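
For the automation step, here is a sketch of what submitting a compiled workflow could look like against a locally running ComfyUI server (default local address assumed; the workflow file is a placeholder produced by a hypothetical SMML compiler).

Python

# Minimal sketch: submit an API-format workflow JSON to a local ComfyUI server.
# "smml_compiled_workflow.json" is a placeholder for compiler output.
import json
import urllib.request

with open("smml_compiled_workflow.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",              # assumed default local ComfyUI address
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))           # response includes the queued prompt id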


10) Context from older “markup for comics” work: why SMML feels natural

SMML is not the first time comics were represented as markup.

  • CBML (Comic Book Markup Language) is an XML vocabulary for encoding comics structure (panels, balloons, captions). It exists for analysis and archiving, not generation, but it proves “comics as markup” is legitimate and useful. (dcl.luddy.indiana.edu)
  • There are also hobbyist “comic markup → SVG rendering” projects, showing that markup-driven panel composition is intuitive even outside ML. (GitHub)

This gives you a strong way to position SMML:

  • CBML: markup for describing comics.
  • SMML: markup for describing and generating comics, via compilation to control signals.

11) The main pitfalls professionals will warn about

These are not arguments against SMML. They are design constraints SMML should acknowledge.

Pitfall 1: Ambiguous reading order in complex layouts

Solution: make order explicit or provide a <reading_order> list. Do not rely purely on geometry. Reading order research shows layout can violate simple paths. (Frontiers)

Pitfall 2: “Camera tags” do nothing unless you bind them to controls

Solution: treat camera tags as annotations until you have compilation targets (pose/depth/edges). ControlNet is the standard mechanism for this. (arXiv)

Pitfall 3: Character IDs are meaningless without identity conditioning

Solution: define an “identity contract” that compiles to IP-Adapter or InstantID or a sequence method like StoryDiffusion. (GitHub)

Pitfall 4: Tool churn and deprecations

Example: the Diffusers GLIGEN pipeline is explicitly labeled deprecated in the docs, which matters if you target it as a backend. (Hugging Face)
Solution: SMML should target capabilities (boxes, masks, pose control), not one fragile implementation.

Pitfall 5: Text-in-image quality and OCR are structurally hard

Solution: keep dialogue as typeset overlay. Use datasets like Manga109Dialog and segmentation annotations as evaluation targets, not as something diffusion must “just handle.” (arXiv)


12) A clean way to present SMML v0.1 to researchers (ready-to-post structure)

If you want maximum constructive feedback, a professional-friendly post usually includes:

  1. One-sentence claim
    “SMML is a manga page DSL that compiles to layout + control artifacts for diffusion workflows.”

  2. Non-goals for v0.1
    No claims about perfect artistry, perfect identity, or perfect text rendering.

  3. Core v0.1 guarantees
    Panels, frames, gutters, reading order.

  4. Execution compatibility targets
    ControlNet-style structure control, box/mask grounding, identity adapters. (arXiv)

  5. A small test pack
    10 SMML pages + expected JSON-IR + expected panel geometry.

  6. A metrics plan
    Layout correctness, speaker-text association metrics (Manga109Dialog), segmentation-based character measurements (Manga109 segmentation annotations). (arXiv)

This turns SMML into “a spec + a benchmark seed,” which is what researchers can adopt.


Curated online resources that directly support SMML’s framing

  • ControlNet paper and implementations: the reference mechanism for spatial conditioning. (arXiv)
  • LayoutGPT: structured “style sheet” output for layout planning by LLMs. (arXiv)
  • DiffSensei: explicitly bridges multimodal reasoning and diffusion for customized manga generation. (CVF Open Access)
  • MangaDiffusion: layout-controllable manga generation and Manga109Story dataset framing. (arXiv)
  • Grounding and box constraints: GLIGEN, BoxDiff, MultiDiffusion. (GitHub)
  • Identity consistency: IP-Adapter, InstantID, StoryDiffusion. (GitHub)
  • Manga analysis datasets for evaluation: Manga109Dialog and Manga109 segmentation annotations (CVPR 2025). (arXiv)
  • Reading order and gutter effects: comics reading order and eye-tracking literature. (Frontiers)
  • Comics markup precedent: CBML as proof that comics-as-XML is a stable concept. (dcl.luddy.indiana.edu)
  • ComfyUI panel/layout ecosystem and API-style execution: comfyui_panels, CR Comic Panel Templates, ComfyUI workflow JSON submission patterns. (GitHub)

Summary bullets

  • SMML is strongest as DSL + JSON-IR + compilation semantics, not “tags as prompts.” (arXiv)
  • v0.1 should guarantee geometry, gutters, and reading order, and treat camera/effects as annotations until they compile to controls. (Frontiers)
  • Your metrics plan is realistic because manga datasets exist for speaker-text links and segmentation, and reading research supports studying flow and gutter effects. (arXiv)
  • A practical backend target exists today (ComfyUI workflows + panel layout nodes), and it matches your JSON-IR phase. (ComfyUI)

Thank you for your incredible feedback, John6666.

To be honest, the technical depth of your suggestions—such as SVG integration and attention refinement—goes far beyond my own experience as a simple manga enthusiast. That said, it is truly rewarding to know that my conceptual idea resonated with someone who clearly has deep expertise in this field.

If you ever consider writing a script or a prototype based on this idea, I would be absolutely honored. While I cannot contribute on the coding side, I would be more than happy to offer user-side perspectives or creative brainstorming whenever that could be helpful.

Thank you again for taking SMML seriously and for helping push this vision forward. I’m very much looking forward to seeing what you (or others) might build from here.


Subject: SMML: user-side vision and illustrative IR draft

I continued to think more deeply about SMML with my AI assistant.

I took some time to further refine the user-side vision of SMML — mainly from a manga-creation and semantic-structure perspective. I’ve summarized these thoughts as a draft conceptual direction (tentatively calling it “SMML v0.2 – vision draft”), together with an illustrative JSON example (SMML-IR v0.1) to show what an internal data structure could look like in the future.

My intention here is not to propose a fixed specification, but rather to offer a kind of “dictionary of manga-specific needs” that you might find useful when thinking about future scalability or edge cases.


SMML – Draft Concepts for Future Scalability (Non-binding)

Standardized Coordinate System

Relative coordinates from top-left (0.0, 0.0) to bottom-right (1.0, 1.0), aligned with common digital canvas conventions.

From “Camera” to “Focus Level”

Instead of cinematic terms like wide/close, a continuous semantic value:

focus: 0.0 – 1.0

Lower values emphasize environment/context, higher values prioritize character detail.

Balloon Hierarchy

To support multiple speech bubbles per panel, a nested balloon structure:

  • ellipse (standard speech)

  • burst (shouting/emphasis)

  • thought (inner voice)

  • none (narration / monologue)


Illustrative SMML-IR v0.1 (JSON – example only)

{
  "page": {
    "direction": "rtl",
    "panels": [
      {
        "id": "p3",
        "order": 3,
        "role": "main",
        "layout": { "x": 0.0, "y": 0.28, "w": 1.0, "h": 0.45 },
        "focus": 0.3,
        "characters": [
          { "id": "N1", "label": "Netero" },
          { "id": "M1", "label": "Meruem" }
        ],
        "dialogue": [
          {
            "speaker": "N1",
            "balloons": [
              { "type": "speech", "content": "This is" },
              { "type": "speech", "content": "a graveyard" }
            ]
          }
        ]
      },
      {
        "id": "p4",
        "order": 4,
        "role": "sub",
        "layout": { "x": 0.0, "y": 0.75, "w": 1.0, "h": 0.25 },
        "focus": 0.8,
        "characters": [{ "id": "N1", "label": "Netero" }],
        "dialogue": [
          { "speaker": "N1", "balloons": [{ "type": "none", "content": "...yours." }] }
        ]
      }
    ]
  }
}

Again, there is absolutely no expectation to implement any of this in your current prototype. I’m sharing it purely as an ideal destination and a semantic reference.
