Inference Endpoint re-initializes during active API requests causing 500 errors (custom Docker, multi-model pipeline)

I am facing a critical issue with a dedicated Inference Endpoint: the container unexpectedly re-initializes while the endpoint is in the RUNNING state and actively processing API requests.

Problem description:

  • The endpoint shows status RUNNING

  • An API request is sent and processing starts

  • During processing, the endpoint restarts / re-initializes automatically

  • The request fails with a 500 Internal Server Error (a client-side illustration follows this list)

  • After re-initialization, the endpoint becomes RUNNING again
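
For illustration, this is the shape of the client call that fails; the endpoint URL, token, and payload below are placeholders, not the real values:

```python
# Illustrative client request; URL, token, and payload are hypothetical.
import requests

API_URL = "https://my-endpoint.endpoints.huggingface.cloud"  # placeholder
HEADERS = {
    "Authorization": "Bearer hf_xxx",  # placeholder token
    "Content-Type": "application/json",
}

resp = requests.post(
    API_URL,
    headers=HEADERS,
    json={"inputs": "example payload"},
    timeout=600,  # generous client timeout; the request still fails mid-processing
)
print(resp.status_code)  # 500 when the container re-initializes mid-request
print(resp.text)
```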

Additional observations:

  • The endpoint logs do not show any error, crash, or exception before the restart

  • There is no warning or failure message prior to entering the Initializing state

  • GPU and CPU utilization remain well below maximum (verified via analytics)

  • Memory usage also appears stable (see the in-container check sketched after the next list)

Based on this, it does not seem to be related to:

  • OOM kills

  • memory limits

  • GPU/CPU resource exhaustion
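
To double-check the analytics, headroom can also be confirmed from inside the container. A minimal sketch of such a check, assuming PyTorch with CUDA and psutil are available in the image (the actual logging differs):

```python
# Minimal sketch of an in-container headroom check confirming that GPU
# memory, system RAM, and CPU stay well below their limits.
import psutil
import torch

def log_headroom() -> None:
    free_b, total_b = torch.cuda.mem_get_info()  # GPU memory, in bytes
    print(f"GPU free: {free_b / 1e9:.1f} / {total_b / 1e9:.1f} GB")
    print(f"RAM used: {psutil.virtual_memory().percent}%")
    print(f"CPU load: {psutil.cpu_percent(interval=1.0)}%")

log_headroom()
```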

Deployment details:

  • This endpoint uses a custom Docker image

  • The inference system is a custom pipeline

  • Multiple models are involved in the workflow

  • Models are dynamically loaded and offloaded (weights moved between GPU and CPU memory) depending on the requested operation, since different models are used sequentially within a single request (see the simplified sketch after this list)
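
To make the load/offload pattern concrete, here is a simplified sketch of the flow; the model names and structure are illustrative, and the real pipeline is more involved:

```python
# Simplified sketch of the sequential multi-model flow described above:
# each stage's weights are moved onto the GPU only while that stage runs,
# then offloaded back to CPU memory. Stage names are hypothetical.
import torch

class Pipeline:
    def __init__(self, models: dict[str, torch.nn.Module]):
        # All models start resident in CPU memory.
        self.models = {name: m.to("cpu").eval() for name, m in models.items()}

    def _run_stage(self, name: str, inputs):
        model = self.models[name].to("cuda")      # load weights onto the GPU
        try:
            with torch.inference_mode():
                return model(inputs)
        finally:
            self.models[name] = model.to("cpu")   # offload back to CPU memory
            torch.cuda.empty_cache()              # release cached GPU blocks

    def __call__(self, inputs):
        # Different models run sequentially within a single request.
        x = self._run_stage("stage_a", inputs)
        x = self._run_stage("stage_b", x)
        return self._run_stage("stage_c", x)
```

Because several such load/offload cycles run back-to-back, a single request can occupy the worker for a relatively long time.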

Impact:

  • Ongoing inference jobs are terminated mid-execution

  • API clients receive 500 errors

  • Production usage becomes unreliable despite stable traffic patterns

Questions:

  1. Under what conditions can an inference endpoint automatically re-initialize while in the RUNNING state?

  2. Could this be related to:

    • platform-level container restarts, or

    • autoscaling mechanisms (e.g., scale-to-zero)?

  3. Are there recommended configurations or best practices to prevent restarts / re-initialization during long-running inference requests, especially when using:

    • custom Docker images

    • multi-model pipelines

    • dynamic model loading/offloading?

This issue is currently blocking stable production usage, so any guidance or investigation would be greatly appreciated.

Regards

Hey @roll-ai, thanks for posting and emailing support! Please feel free to follow up with support via email if you’re still running into issues so that we can investigate further.
