Abnormal generation after switching to multi-GPU inference

I implemented the code strictly following the official implementation, but the generated responses are still very strange. :pleading_face:

import math
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "models/InternVL2_5-8B"
device_map = split_model('InternVL2_5-8B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values



# set the max number of tiles in `max_num`
pixel_values = load_image('coco2014/val2014/COCO_val2014_000000000139.jpg', max_num=1).to(torch.bfloat16).to(model.vision_model.device)
generation_config = dict(max_new_tokens=128, do_sample=False)

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')


Here is a structured breakdown, starting with background and then the most likely (and relatively easy to check) causes:


Background: what you’re doing vs other “multi-GPU” modes

Your code uses model sharding via a device_map (weights split across multiple GPUs inside one Python process). That is different from distributed inference (many processes, usually one per GPU, each handling its own prompts/batches). Accelerate documents these as different approaches: device_map / big-model inference versus splitting prompts across processes. (Hugging Face)

This distinction matters because a common failure mode is mixing “sharded model” with “multi-process launch” incorrectly.
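
A quick way to tell which mode a given run is actually in (a small diagnostic sketch, nothing InternVL-specific):

import os
import torch

# Was this process started by a multi-process launcher (torchrun / accelerate launch)?
print('launcher rank vars set:', 'LOCAL_RANK' in os.environ)
print('visible GPUs          :', torch.cuda.device_count())
# After loading with a device_map, the sharding is recorded on the model itself:
# print(model.hf_device_map)   # only present for the device_map path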


What “gibberish generation” usually indicates

Output like repetitive, low-information characters (your screenshot) typically comes from numerical corruption during forward/generation (wrong device transfers, broken inter-GPU communication, NaNs/Infs, or a buggy kernel path), not from decoding parameters.

There are many public reports of “single GPU OK, multi-GPU gibberish” with device_map sharding. (GitHub)
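
One quick way to separate numerical corruption from decoding problems (a hedged sketch that assumes the sharded model and tokenizer loaded above): run a plain text-only forward pass and check the logits for NaN/Inf.

import torch

with torch.no_grad():
    # Embeddings are pinned to GPU 0 by the device_map above.
    ids = tokenizer('Hello, world.', return_tensors='pt').input_ids.to('cuda:0')
    logits = model.language_model(input_ids=ids).logits
print('any NaN:', torch.isnan(logits).any().item(),
      '| any Inf:', torch.isinf(logits).any().item())

If NaN/Inf already shows up here, the problem is upstream of the vision inputs and of model.chat().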


The most likely causes in your exact setup (InternVL2.5 + manual device_map)

1) Launching with torchrun / multi-process while also sharding with device_map

In a well-known Hugging Face thread showing the same symptom, the model is sharded with device_map="auto" and launched with torchrun; a Hugging Face maintainer states this cannot be run with torchrun in that configuration. (Hugging Face Forums)

Fix

  • Run single-process:

    CUDA_VISIBLE_DEVICES=0,1 python your_script.py
    
  • If using accelerate, ensure it’s one process:

    accelerate launch --num_processes 1 your_script.py
    

If you want multi-process throughput, do not shard with device_map; instead replicate the model per process.
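
A hedged sketch of that alternative, using Accelerate's PartialState to give each process a full replica and its own share of the prompts (the questions and script name are placeholders; each rank would pair its prompts with its own pixel_values, and each replica needs enough VRAM for the full model):

# Launch with:  accelerate launch --num_processes 2 replicate.py   (hypothetical script name)
import torch
from accelerate import PartialState
from transformers import AutoModel, AutoTokenizer

path = 'models/InternVL2_5-8B'
state = PartialState()                      # one process per GPU

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True,
).eval().to(state.device)                   # full copy on this rank's GPU
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

questions = ['<image>\nDescribe image 1.', '<image>\nDescribe image 2.']
with state.split_between_processes(questions) as my_questions:
    for q in my_questions:
        # each rank would call model.chat(...) here with its own image tensors
        print(f'rank {state.process_index} handles: {q!r}')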


2) Transformers version mismatch (InternVL2.5 explicitly requires a minimum)

InternVL2.5’s model card explicitly says: “Please use transformers>=4.37.2 to ensure the model works normally.” (Hugging Face)

Fix

pip install -U "transformers>=4.37.2" accelerate
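
A quick check of what is actually installed in the environment you run from:

import torch
import transformers
import accelerate

print('transformers:', transformers.__version__)   # should be >= 4.37.2
print('accelerate  :', accelerate.__version__)
print('torch       :', torch.__version__, '| CUDA:', torch.version.cuda)
print('GPUs        :', torch.cuda.device_count())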

3) Inter-GPU transport problems (PCIe ACS / P2P / NCCL)

This is the most common root cause when:

  • single GPU is fine,
  • multi-GPU “runs” but produces nonsense.

In the same Hugging Face “gibberish” thread, the original poster later reports the issue was NCCL, fixed by deactivating ACS because it interfered with GPU communication. (Hugging Face Forums)

NVIDIA’s NCCL docs include explicit instructions for disabling ACS (via setpci) when it breaks P2P/GPU Direct behavior. (NVIDIA Docs)
NVIDIA also notes that P2P not being functional is usually tied to ACS being enabled (and gives BIOS/kernel mitigation suggestions). (GitHub)

Fast diagnostic
Run once with P2P disabled:

NCCL_P2P_DISABLE=1 python your_script.py
  • If the output becomes normal → you’ve almost certainly hit a P2P/ACS/IOMMU/topology issue.
  • Next step is to follow your platform’s recommended way to disable ACS/IOMMU (often BIOS + kernel params) or use NCCL’s documented ACS procedure. (NVIDIA Docs)
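
The same toggle can also be set from inside Python, together with a simple peer-access probe (a convenience sketch; the environment variable must be set before CUDA/NCCL is initialized in the process):

import os
os.environ.setdefault('NCCL_P2P_DISABLE', '1')     # must happen before CUDA init
os.environ.setdefault('NCCL_DEBUG', 'WARN')        # surfaces transport warnings

import torch
# Report whether each GPU pair advertises direct peer-to-peer access.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f'P2P {i} -> {j}:', torch.cuda.can_device_access_peer(i, j))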

4) FlashAttention path issues (use_flash_attn=True)

InternVL examples enable use_flash_attn=True, but if your FlashAttention build does not match your CUDA / driver stack, it can produce numerical instability that looks like garbage output.

Fix / isolation test
Load with FlashAttention off:

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,      # test
    trust_remote_code=True,
    device_map=device_map
).eval()

If this fixes it, keep FlashAttention off until you align flash-attn + CUDA + driver versions.
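
To rule out an obviously broken FlashAttention install first, a minimal environment check (a sketch; whether the model silently falls back depends on InternVL's remote code):

import torch

try:
    import flash_attn
    print('flash-attn:', flash_attn.__version__)
except ImportError:
    print('flash-attn not importable; use_flash_attn=True cannot take effect')

print('GPU capability:', torch.cuda.get_device_capability(0),
      '| torch CUDA:', torch.version.cuda)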


5) Your split_model() can create invalid layer keys (important to harden)

InternVL’s published split_model() (the one you copied) does not stop when layer_cnt == num_layers; with enough GPUs it can assign non-existent layers (e.g., layers.32, layers.33 for a 32-layer model). The official snippet shows the same loop structure. (Hugging Face)

Depending on library versions, that can be harmless or can cause subtle dispatch issues.

Fix: make the mapping bounded

def split_model_safe(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {'InternVL2_5-8B': 32}[model_name]

    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    plan = [num_layers_per_gpu] * world_size
    plan[0] = math.ceil(plan[0] * 0.5)

    layer_cnt = 0
    for i, n in enumerate(plan):
        for _ in range(n):
            if layer_cnt >= num_layers:
                break
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
        if layer_cnt >= num_layers:
            break

    # Keep entry/exit + vision on GPU0 (matches InternVL rationale)
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map['language_model.output'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
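
A quick sanity check on the map it produces (a small sketch; the expected 32 layers are specific to InternVL2_5-8B, and the result depends on torch.cuda.device_count() on the target machine):

dm = split_model_safe('InternVL2_5-8B')
layer_ids = sorted(int(k.rsplit('.', 1)[-1]) for k in dm if '.layers.' in k)
assert layer_ids == list(range(32)), 'every layer 0..31 should be mapped exactly once'
print('GPUs used:', sorted(set(dm.values())))
print('first / last layer on:', dm['language_model.model.layers.0'], dm['language_model.model.layers.31'])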

Concrete debugging checklist (quick → deep)

A) Confirm you’re in the “supported” config

  1. Single-process run (no torchrun with multiple procs). (Hugging Face Forums)

  2. Transformers >= 4.37.2. (Hugging Face)

  3. Print device map after load:

    print(model.hf_device_map)
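
    To see at a glance how many modules landed on each GPU, the same map can be summarized (assuming the load succeeded):

    from collections import Counter
    print(Counter(model.hf_device_map.values()))   # module count per device index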
    

B) Isolate kernel vs comms vs mapping

Run these toggles one at a time:

  1. Disable FlashAttention (use_flash_attn=False)

  2. Disable NCCL P2P

    NCCL_P2P_DISABLE=1 python your_script.py
    
  3. Use split_model_safe() (bounded layers)

C) If P2P disable fixes it

You’re in the “ACS/P2P topology” bucket.

  • Follow NVIDIA NCCL troubleshooting guidance for checking/disabling ACS. (NVIDIA Docs)
  • Consider running NCCL performance tests; the HF thread explicitly recommends them for diagnosing interconnect problems (Hugging Face Forums). A lighter-weight pure-PyTorch check is sketched below.
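
Before reaching for the nccl-tests binaries, a simple cross-GPU copy round trip already reveals grossly broken transfers (a sketch, assuming at least two visible GPUs):

import torch

if torch.cuda.device_count() >= 2:
    x = torch.randn(1 << 20, device='cuda:0')
    y = x.to('cuda:1').to('cuda:0')
    # A healthy interconnect gives exactly 0.0 here.
    print('max abs diff after round trip:', (x - y).abs().max().item())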

Why this fits your symptom better than “prompt / decoding”

  • Your generation is deterministic (do_sample=False), and the prompt is simple.

  • Similar “gibberish” reports happen even with plain text-only LLMs when sharded across GPUs. (GitHub)

  • The strongest real-world fixes reported are systems-level: disabling ACS / NCCL P2P, keeping to a single-process launch, and upgrading the library stack, rather than any prompt or decoding change.


If you want the fastest “most likely fix”

  1. Upgrade Transformers to >=4.37.2. (Hugging Face)
  2. Ensure you’re running one process (plain python, not torchrun). (Hugging Face Forums)
  3. Try NCCL_P2P_DISABLE=1. If it fixes output, pursue ACS/P2P remediation per NCCL docs. (Hugging Face Forums)
  4. If not, disable FlashAttention and use the bounded split_model_safe().
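
Putting those together, a known-good baseline might look like the sketch below. It reuses split_model_safe() and load_image() from above, the image path is a placeholder, and it should be launched as a single process, e.g. NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 python baseline.py:

import torch
from transformers import AutoModel, AutoTokenizer

path = 'models/InternVL2_5-8B'
device_map = split_model_safe('InternVL2_5-8B')    # bounded map from above

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,          # keep the kernel path simple while debugging
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Placeholder image path; reuse load_image() from the original script.
pixel_values = load_image('example.jpg', max_num=1).to(torch.bfloat16).to('cuda:0')
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, dict(max_new_tokens=128, do_sample=False))
print(f'Assistant: {response}')

If this baseline is clean, reintroduce FlashAttention, P2P, and multi-process launch one change at a time.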

This is a great breakdown. We’ve seen “gibberish” generation show up most often when a configuration crosses an unsupported boundary, especially mixing device_map sharding with multi-process launch or subtle inter-GPU transport issues.

In practice, validating a known-good baseline (single process, deterministic settings) before scaling out tends to surface whether the issue is model logic vs. synchronization / topology. Once that baseline diverges, it’s almost always a systems problem rather than decoding or prompting.
