How do you deal with missing or incomplete datasets in computer vision?

Hey everyone!
I’m curious how people here handle dataset shortages for object detection / segmentation projects (YOLO, Mask R-CNN, etc.).

A few quick questions:

  1. How often do you run into a lack of good labeled data for your models?

  2. What do you usually do when there’s no dataset that fits — collect real data, label manually, or use synthetic/simulated data?

  3. Have you ever tried generating synthetic data (Unity, Unreal, etc.) — did it actually help?

Would love to hear how different teams or researchers deal with this.


The Hugging Face Discord has several channels dedicated to datasets, and if your field is science there's also the Hugging Science Discord — you'll likely get more reliable answers there.

It’s rare for datasets to be sufficiently complete from the start, so synthetic datasets are usually a valid approach.

Real-world data is irreplaceable, of course. Synthetic data can help, but in practice it works best as a complement rather than a full replacement, especially for detection tasks where real-world context and noise matter.

For missing or sparse labels, one practical middle ground is to bootstrap annotations using open-vocabulary / language-conditioned detectors, then refine a small subset manually. Tools like Grounding-DINO, OWL-ViT, or services such as Detect Anything can generate rough boxes from free-form prompts, which is often enough to get an initial dataset before investing in full labeling.
This way you still train on real images, just with less upfront annotation cost.
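To make the bootstrap step concrete, here is a minimal sketch of the filtering-and-export stage, assuming the zero-shot detector (e.g. Grounding-DINO or OWL-ViT) has already returned pixel-space `(x_min, y_min, x_max, y_max)` boxes with per-box scores. The function name, the YOLO txt export format target, and the 0.35 confidence threshold are illustrative choices, not part of any specific tool's API:

```python
# Sketch: convert rough zero-shot detector output into YOLO-format label lines,
# dropping low-confidence boxes so only plausible pseudo-labels reach manual review.

def boxes_to_yolo_lines(boxes, scores, class_ids, img_w, img_h, conf_thresh=0.35):
    """Keep boxes above conf_thresh and emit YOLO txt lines:
    'class x_center y_center width height', normalized to [0, 1]."""
    lines = []
    for (x1, y1, x2, y2), score, cls in zip(boxes, scores, class_ids):
        if score < conf_thresh:
            continue  # low-confidence pseudo-label: skip rather than pollute the set
        xc = (x1 + x2) / 2.0 / img_w
        yc = (y1 + y2) / 2.0 / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines

# Example: two candidate boxes on a 640x480 image; only the confident one survives.
labels = boxes_to_yolo_lines(
    boxes=[(100, 100, 300, 200), (10, 10, 20, 20)],
    scores=[0.82, 0.12],
    class_ids=[0, 1],
    img_w=640, img_h=480,
)
print(labels)
```

The threshold is the main knob: set it high for the fully automatic pass, then hand-check a sample of what it kept before training.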


This seems to come up in almost every real-world project. In my experience there's always a tradeoff between waiting for perfect data and just moving forward with what you have. Interesting to hear how people decide where to draw that line.


I think it depends on the project's purpose and importance. For my first school project for a class, the dataset had major issues. I did my best to fix the worst of them in the time available, then reported the dataset's remaining quality issues before proceeding, because the alternative was being unable to work in that area of science at all — dirty data is common there.
