If you’ve seen an AI demo that draws boxes around people, cars, and objects in a live video — instantly, as the video plays — you’ve almost certainly seen YOLO. It’s the most popular real-time object detection system in computer vision, and it powers everything from security cameras to robotics. This guide explains what YOLO is, how it works, and how to start using it.
Key takeaways
- YOLO (“You Only Look Once”) detects and locates multiple objects in an image in a single pass.
- That single pass is why it’s fast enough for real-time video.
- It has evolved through many versions, with newer releases generally improving on both speed and accuracy.
- It’s beginner-accessible — modern YOLO tools let you run detection in a few lines of code.
What is object detection?
First, the task YOLO solves. Object detection answers two questions about an image at once:
- What objects are present? (classification)
- Where is each one? (localization — a bounding box around it)
This is harder than plain image classification, which only says “this image contains a dog.” Object detection says “there’s a dog here, a person there, and two cars over there” — identifying and locating every object, often many at once.
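The difference shows up in the shape of the output. The sketch below is illustrative only: the `Detection` class, the `count_by_label` helper, and all the numbers are made up for this example and do not come from any particular YOLO library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """One detected object: what it is and where it is."""
    label: str                               # classification: what the object is
    confidence: float                        # score between 0 and 1
    box: Tuple[float, float, float, float]   # localization: (x_min, y_min, x_max, y_max) in pixels


# Image classification returns one label for the whole image.
classification_result = "dog"

# Object detection returns a list with one entry per object found.
# These values are invented for illustration.
detection_result: List[Detection] = [
    Detection("dog", 0.92, (48.0, 210.0, 310.0, 540.0)),
    Detection("person", 0.88, (330.0, 80.0, 520.0, 560.0)),
    Detection("car", 0.81, (560.0, 300.0, 760.0, 420.0)),
    Detection("car", 0.77, (780.0, 310.0, 960.0, 430.0)),
]


def count_by_label(detections: List[Detection]) -> Dict[str, int]:
    """Count how many objects of each class were detected."""
    counts: Dict[str, int] = {}
    for det in detections:
        counts[det.label] = counts.get(det.label, 0) + 1
    return counts
```

Calling `count_by_label(detection_result)` gives one dog, one person and two cars, which is the “a dog here, a person there, and two cars over there” answer in data form.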
What is YOLO?
YOLO stands for “You Only Look Once.” The name captures its key innovation. Earlier detection systems were slow because they worked in stages: first propose many regions that might contain an object, then examine each region separately. Looking at thousands of regions one by one takes time — too much for live video.
YOLO does it differently. It looks at the entire image just once and predicts all the objects and all their boxes in a single pass through one neural network. One look, all the answers.
That design is why YOLO is fast. Real-time detection means processing many frames per second, and YOLO’s single-pass approach makes that achievable even on modest hardware — which is exactly why it became the default choice for real-time applications.
How YOLO works
The simplified version of what happens inside:
- Divide the image into a grid. YOLO conceptually splits the image into a grid of cells.
- Each cell makes predictions. Every cell predicts bounding boxes for objects centered in it, a confidence score for each box, and what class of object it is.
- Combine everything. All predictions across the whole grid are gathered together.
- Clean up overlaps. The same object often gets predicted by several nearby cells. A step called non-maximum suppression removes the duplicates, keeping only the best box for each object.
The result: one neural network, one pass, a complete set of labeled boxes — fast.
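The clean-up step is the easiest part to show in code. What follows is a minimal sketch of greedy non-maximum suppression in plain Python, assuming boxes are `(x1, y1, x2, y2)` tuples and that all boxes belong to the same class. The function names and the 0.5 overlap threshold are choices made for this example; real implementations usually run per class and on the GPU.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first.

    A box is kept only if it does not overlap an already kept,
    higher-scoring box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical boxes on one object, plus one separate object.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)  # the duplicate at index 1 is dropped
```

The first two boxes overlap almost completely, so only the higher-scoring one survives. The third box does not overlap either of them and is kept.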
The evolution of YOLO
YOLO is not a single fixed model. It is a family that has improved steadily since its first release. Each new version (the series has run well into the double digits, including v9 and beyond) has pushed the same two goals: higher accuracy and greater speed, while staying efficient enough for real-time use.
For practical purposes, the lesson is simple: use a recent, well-supported version. The newer releases are generally faster and more accurate than older ones, and they come with mature, easy-to-use tooling. Don’t agonize over the exact version number; pick a current one with good documentation.
What YOLO is used for
Real-time detection is useful almost everywhere:
- Security and surveillance — detecting people, vehicles, or unattended objects in camera feeds.
- Autonomous vehicles — spotting cars, pedestrians, and obstacles, part of the wider self-driving perception system.
- Retail — counting customers, analyzing foot traffic, monitoring shelves.
- Manufacturing — spotting defects and missing parts on production lines.
- Agriculture — counting crops, livestock, or detecting pests from drone footage.
- Sports analytics — tracking players and the ball in real time.
- Robotics — letting robots see and respond to objects around them.
Anywhere a machine needs to understand what’s in a video as it happens, YOLO is a strong fit.
YOLO’s strengths and limits
| Strengths | Limitations |
|---|---|
| Very fast — runs in real time | Can struggle with very small objects |
| Good accuracy for its speed | Densely packed objects can be missed |
| Sees the whole image — fewer false positives on background | Slightly less accurate than the slowest, heaviest detectors |
| Mature, beginner-friendly tooling | Best results still need task-specific training data |
The overarching trade-off: YOLO optimizes for the balance of speed and accuracy. A few research models score marginally higher on accuracy, but they’re too slow for real-time use. For the vast majority of practical applications, YOLO’s balance is exactly right.
How to get started with YOLO
The barrier to entry is low in 2026:
- Use a modern YOLO library. Current YOLO tooling is well-packaged — you can install it and run detection with a recent pre-trained model in just a few lines of Python.
- Start with a pre-trained model. These already recognize dozens of common object types out of the box. Run one on your own images or webcam to see detection working immediately.
- Train on your own data when needed. To detect something specific — a particular product, a custom category — you collect and label example images and fine-tune YOLO on them. Mature tools make this process straightforward.
- Mind your hardware. YOLO runs on a regular computer, but a GPU makes both training and high-frame-rate detection much faster.
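As one example of “a few lines of Python”, here is a minimal sketch that assumes the Ultralytics package (`pip install ultralytics`) as the YOLO library. The weight file name `yolov8n.pt` is an assumption picked for illustration, and `describe` is a made-up helper. Other YOLO libraries have different APIs.

```python
def detect(image_path, weights="yolov8n.pt", min_confidence=0.25):
    """Run a pre-trained YOLO model on one image.

    Assumes the `ultralytics` package is installed. The import lives inside
    the function so the rest of this snippet works without it.
    Returns a list of (label, confidence, [x1, y1, x2, y2]) tuples.
    """
    from ultralytics import YOLO

    model = YOLO(weights)                                # weights download on first use
    result = model(image_path, conf=min_confidence)[0]   # one result per input image
    detections = []
    for box in result.boxes:
        class_id = int(box.cls[0])
        detections.append((
            result.names[class_id],            # class id -> readable label
            float(box.conf[0]),                # confidence score
            [float(v) for v in box.xyxy[0]],   # pixel corner coordinates
        ))
    return detections


def describe(detections):
    """Turn (label, confidence, box) tuples into short readable strings."""
    return [f"{label} ({confidence:.0%})" for label, confidence, _ in detections]
```

With the package installed, `describe(detect("street.jpg"))` would list what the model found in a local image. The path `street.jpg` is hypothetical.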
What hardware do you need to run YOLO in real time?
“Real time” has a concrete meaning: the model must process each video frame in under roughly 33 milliseconds, the budget you get at 30 frames per second. Hit that and detections keep pace with a live camera; miss it and the feed stutters or drops frames. Whether you clear that bar depends almost entirely on the hardware underneath, and this is where most beginner projects go wrong.
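The arithmetic behind that budget is simple enough to check yourself. The helper names below are made up for this sketch, and `measure_latency_ms` accepts any callable, so it does not depend on a particular YOLO library.

```python
import time


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def achieved_fps(latency_ms):
    """Frame rate you can sustain if each frame takes latency_ms to process."""
    return 1000.0 / latency_ms


def is_real_time(latency_ms, target_fps=30):
    """True if per-frame processing fits inside the budget for target_fps."""
    return latency_ms <= frame_budget_ms(target_fps)


def measure_latency_ms(process_frame, frames):
    """Average wall-clock time per frame for any callable, in milliseconds."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(len(frames), 1)
```

For example, a model that needs 50 ms per frame sustains only 20 FPS and misses the 30 FPS bar, while one that needs 25 ms clears it.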
The single biggest factor is the GPU. On a CPU, even a small YOLO model usually runs well below 30 FPS on video, which is fine for processing a folder of images but not for a live stream. Move the same model to an NVIDIA GPU and inference typically runs 10 to 50 times faster, comfortably clearing real time. For training or running the Ultralytics toolchain you want a CUDA-capable NVIDIA card (Compute Capability 6.0 or newer) with at least 8 GB of VRAM; 12–16 GB gives you headroom for larger models and bigger training batches.
Three practical tiers cover almost every project:
| Setup | Best for | Real-time video? |
|---|---|---|
| CPU only (laptop) | Learning, batch image processing, prototyping | Rarely — small models only, low resolution |
| Desktop NVIDIA GPU (RTX-class, 8 GB+) | Training custom models, high-FPS streams | Yes — small models often exceed 60 FPS |
| Edge board (e.g. Jetson Orin Nano) | Deployed cameras, robotics, on-site inference | Yes — roughly 30–60 FPS with TensorRT optimisation |
A few things move the needle more than buying a bigger card. Model size matters most: the nano and small variants are designed to hit real time on modest hardware, while the largest variants trade speed for accuracy and demand a stronger GPU. Optimisation is not optional on the edge: exporting to TensorRT with FP16 precision can roughly double throughput on Jetson devices versus running raw PyTorch, which is often the difference between 20 and 40 FPS. And input resolution is a direct lever: compute scales roughly with the number of pixels, so halving both width and height cuts it to about a quarter.
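Two of those levers can be sketched in a few lines. The first function is a rough estimate only: it assumes compute scales with the number of input pixels and takes a 640-pixel square input as the baseline. The second assumes the Ultralytics package plus a working TensorRT install on an NVIDIA device, and the weight file name is an assumption for illustration.

```python
def relative_compute(new_size, base_size=640):
    """Approximate compute relative to the baseline input size.

    Assumes cost scales with pixel count, so it grows with the square
    of the side length. Real speedups vary by model and hardware.
    """
    return (new_size / base_size) ** 2


def export_fp16_engine(weights="yolov8n.pt"):
    """Export a model to a TensorRT engine with FP16 precision.

    Assumes `ultralytics` and TensorRT are installed on a machine with an
    NVIDIA GPU. The import is local so relative_compute works without them.
    """
    from ultralytics import YOLO

    return YOLO(weights).export(format="engine", half=True)
```

Under that assumption, `relative_compute(320)` is 0.25: dropping from 640 to 320 pixels per side leaves about a quarter of the work per frame, at the cost of making small objects harder to detect.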
The honest takeaway: you do not need a data-centre GPU to use YOLO in real time. A mid-range gaming GPU handles training and high-FPS inference, and a sub-$300 edge board handles deployment. Match the model variant to your hardware before you start, not after.
Frequently asked questions
What is YOLO in object detection?
YOLO (“You Only Look Once”) is a real-time object detection system. It identifies multiple objects in an image and draws a bounding box around each one — telling you both what objects are present and where they are — using a single pass through one neural network.
Why is YOLO so fast?
YOLO analyzes the entire image in a single pass through one neural network, predicting all objects and boxes at once. Older detection systems examined thousands of image regions separately, which was slow. YOLO’s single-look design is what makes real-time detection possible.
Is YOLO good for beginners?
Yes. Modern YOLO libraries are well-documented and easy to use — you can run detection with a pre-trained model in just a few lines of Python. It’s one of the most accessible ways to get started with practical computer vision.
What can YOLO detect?
A YOLO model can detect whatever it was trained on. Pre-trained models recognize dozens of common object types — people, vehicles, animals, everyday items — out of the box. To detect specific or custom objects, you fine-tune YOLO on your own labeled images.
Which version of YOLO should I use?
Use a recent, well-supported version. YOLO has evolved through many releases, with newer ones generally faster and more accurate and backed by mature tooling. Rather than focusing on the exact version number, choose a current release with good documentation.
Can I use YOLO in a commercial product for free?
Not automatically: licensing is the most overlooked trap. The original YOLOv9 repository is released under GPL-3.0, and the popular Ultralytics implementations (used to run many YOLO versions) are AGPL-3.0. Both are copyleft: if you distribute a product built on that code or those weights (or, under AGPL-3.0, offer it to users over a network), you are generally required to release your application’s source under the same licence. To keep your code closed and proprietary, you need a paid Ultralytics Enterprise License. Do not assume internal R&D or customer-facing tools are exempt; check the licence terms before you build, not after.
How many labelled images do I need to train YOLO on my own objects?
Far fewer than training from scratch, thanks to transfer learning. Starting from pretrained COCO weights, a usable prototype is often possible with a few hundred well-labelled images per class. For a robust production model, Ultralytics suggests aiming for around 1,500 images and roughly 10,000 labelled instances per class. Label quality and diversity — varied lighting, angles, backgrounds and occlusion — matter more than raw count, and built-in augmentation stretches a modest dataset further.
Do I need to know deep learning to fine-tune YOLO?
No. Fine-tuning on a custom dataset is mostly data preparation and a few commands, not neural-network theory. The harder work is collecting and accurately annotating images; the training step itself is largely automated. A basic grasp of Python and the command line is enough to get a custom detector running.
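To give a feel for the data preparation side, here is a minimal sketch of writing one annotation in the plain-text YOLO label format: a class id, then the box centre, width and height, all normalised to the range 0 to 1. The function names are made up for this example. The `fine_tune` helper assumes the Ultralytics package, and `my_dataset.yaml` is a hypothetical dataset description file you would write yourself.

```python
def to_yolo_label(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def fine_tune(data_yaml="my_dataset.yaml", weights="yolov8n.pt", epochs=50):
    """Fine-tune pre-trained weights on a custom dataset.

    Assumes `ultralytics` is installed and that data_yaml points at your
    own images and label files. The file names and epoch count here are
    placeholders, not recommendations.
    """
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, epochs=epochs, imgsz=640)


# A box covering the middle of a 640x480 image, labelled as class 0.
line = to_yolo_label(0, (160, 120, 480, 360), 640, 480)
```

In practice annotation tools write these label files for you, one line per object and one file per image, which is why the effort goes into labelling carefully and not into code.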
Conclusion
YOLO made real-time object detection practical by replacing slow, multi-stage pipelines with a single, fast look at the whole image. That one idea — “you only look once” — is why it powers security systems, autonomous vehicles, retail analytics, robotics, and countless other applications.
It isn’t the single most accurate detector in existence, but it offers the best balance of speed and accuracy, and that balance is what real applications need. Best of all, it’s genuinely accessible — pick a recent version, start with a pre-trained model, and you can have object detection running today. For the wider field, see how detection fits into computer vision for self-driving cars.
