I am facing a critical issue with a dedicated Inference Endpoint where the container unexpectedly re-initializes while it is already in the RUNNING state and actively processing API requests.
Problem description:
- The endpoint shows status RUNNING
- An API request is sent and processing starts
- During processing, the endpoint restarts / re-initializes automatically
- The request fails with 500 Internal Server Error
- After re-initialization, the endpoint becomes RUNNING again
Additional observations:
- The endpoint logs do not show any error, crash, or exception before the restart
- There is no warning or failure message prior to entering the Initializing state
- GPU and CPU utilization remain well below maximum (verified via analytics)
- Memory usage also appears stable
Based on this, it does not seem to be related to:
- OOM kills
- memory limits
- GPU/CPU resource exhaustion
Deployment details:
- This endpoint uses a custom Docker image
- The inference system is a custom pipeline
- Multiple models are involved in the workflow
- Models are dynamically loaded and offloaded (weights moved in/out of GPU/CPU memory) depending on the requested operation, since different models are used sequentially within a single request (a simplified sketch of this pattern follows below)
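For context, the load/offload logic inside the pipeline roughly follows the pattern below. This is a heavily simplified sketch: the placeholder `nn.Linear` modules and the function names stand in for the actual (much larger) models and pipeline code.

```python
import torch
import torch.nn as nn

# Placeholder models; the real pipeline uses several large pretrained models.
MODELS = {
    "stage_a": nn.Linear(1024, 1024),
    "stage_b": nn.Linear(1024, 256),
}

def run_stage(name: str, inputs: torch.Tensor) -> torch.Tensor:
    """Move one model's weights onto the GPU, run it, then offload again."""
    model = MODELS[name].to("cuda")      # load weights into GPU memory
    try:
        with torch.no_grad():
            return model(inputs.to("cuda")).to("cpu")
    finally:
        model.to("cpu")                  # offload weights back to CPU memory
        torch.cuda.empty_cache()         # release cached GPU allocations

def handle_request(batch: torch.Tensor) -> torch.Tensor:
    # Different models are used sequentially within a single request.
    hidden = run_stage("stage_a", batch)
    return run_stage("stage_b", hidden)
```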
Impact:
- Ongoing inference jobs are terminated mid-execution
- API clients receive 500 errors
- Production usage becomes unreliable despite stable traffic patterns
Questions:
- Under what conditions can an inference endpoint automatically re-initialize while in the RUNNING state?
- Could this be related to:
  - platform-level container restarts?
  - autoscaling mechanisms?
- Are there recommended configurations or best practices to prevent restarts / re-initialization during long-running inference requests (a simplified sketch of the serving entrypoint follows below), especially when using:
  - custom Docker images
  - multi-model pipelines
  - dynamic model loading/offloading?
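For reference, the serving entrypoint inside the custom image roughly follows the shape below. This is only a simplified illustration: FastAPI, the route names, and `run_pipeline` are placeholders for the actual server code.

```python
# Simplified stand-in for the custom container's server; the framework,
# routes, and pipeline call are placeholders for the real implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferenceRequest(BaseModel):
    operation: str   # selects which models the pipeline loads/offloads
    payload: dict

def run_pipeline(operation: str, payload: dict) -> dict:
    # Stand-in for the multi-model pipeline sketched above.
    return {"operation": operation, "echo": payload}

@app.get("/health")
def health():
    # Lightweight liveness/readiness check; returns immediately.
    return {"status": "ok"}

@app.post("/predict")
def predict(request: InferenceRequest):
    # A single request can take a while here, since several models are
    # loaded onto the GPU, run, and offloaded again before responding.
    return {"result": run_pipeline(request.operation, request.payload)}
```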
This issue is currently blocking stable production usage, so any guidance or investigation would be greatly appreciated.
Regards