Building a Real-Time Industrial Inspection System with YOLOv11 and Intel RealSense

How computer vision, depth imaging, and disciplined engineering delivered 96.9% segmentation accuracy in a real manufacturing environment.

The gap between a computer vision model that performs well on a benchmark dataset and one that performs reliably in a factory is larger than most engineers expect. Industrial environments are adversarial by default: inconsistent lighting, vibration, reflective metal surfaces, and a cycle time requirement that doesn't care about your model's complexity.

When I built an automated toolkit inspection system for a manufacturing client — detecting sockets and deep sockets and measuring their dimensions in real time — those constraints shaped every decision I made.

Final results: 96.9% segmentation accuracy, 93.2% detection accuracy, and a 10% increase in production throughput.

Getting there required more engineering than machine learning.

The Problem

Manual inspection of industrial toolkits is slow, expensive, and inconsistent. A human inspector checking socket dimensions across hundreds of toolkit assemblies per day will introduce variance through fatigue, distraction, and measurement drift.

The client's goal was automation: a system that could identify each socket in an assembly, classify it as a standard or deep socket, and verify its dimensions met specification — all within the production line's cycle time.

The core technical requirements were:

Detect individual sockets in a toolkit tray, correctly segmented from each other
Classify each as socket or deep socket (visually similar; the distinction is internal depth)
Measure outer diameter to within an acceptable tolerance
Complete inference within the time window available in the production cycle
Operate reliably under factory lighting conditions

The last two requirements were not negotiable. A system that was 99% accurate but too slow to keep up with the production line was useless. A system that was fast but degraded under factory lighting was dangerous.

Why RGB-D, Not Just RGB

The first architecture decision was the most consequential: what kind of camera?

A standard RGB camera can detect and segment sockets, but it cannot measure them directly. You can estimate size from a known camera distance and a calibrated reference, but this is fragile: any drift in camera position changes the pixel-to-millimeter scale and introduces measurement error. For a system with dimensional tolerances, that wasn't acceptable.

Intel RealSense cameras output depth maps that can be aligned pixel-for-pixel with the RGB frames. After alignment, every RGB pixel with a valid depth reading has a corresponding distance in real-world units. Once you've segmented a socket instance, you can project its pixel coordinates through the depth map and the camera intrinsics to get real-world dimensions: actual millimeters, not pixel counts normalized against an assumed depth.
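To make the difference concrete, here is a minimal sketch of that projection, assuming an ideal pinhole model with no lens distortion. The intrinsics (FX, FY, CX, CY) and the distances are made-up illustration values, not the calibration of the deployed camera; in practice the RealSense SDK supplies the intrinsics and its own deprojection routine.

```python
import numpy as np

def deproject(u, v, depth_mm, fx, fy, cx, cy):
    """Map pixel (u, v) with a depth reading in mm to camera-frame XYZ in mm."""
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return np.array([x, y, depth_mm], dtype=float)

# Hypothetical intrinsics for a 640x480 stream (assumption, not real calibration)
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def span_mm(depth_mm):
    """Real-world width of the same 60-pixel span seen at a given distance."""
    left = deproject(290, 240, depth_mm, FX, FY, CX, CY)
    right = deproject(350, 240, depth_mm, FX, FY, CX, CY)
    return float(np.linalg.norm(right - left))

near = span_mm(300.0)  # camera at its nominal height
far = span_mm(330.0)   # camera has drifted 30 mm further away
print(f"60 px at 300 mm = {near:.1f} mm, at 330 mm = {far:.1f} mm")
```

The same 60 pixels correspond to 30 mm at one distance and 33 mm at the other. An RGB-only system with a fixed pixel-to-millimeter scale would report the same size in both cases; with a per-pixel depth reading the scale follows the actual distance.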

For classifying deep sockets versus standard sockets, depth data was particularly critical. The visual difference between a standard and deep socket from above is subtle — primarily a slight difference in the shadow profile inside the socket opening. The depth profile of the socket interior, however, is unambiguous. A deep socket has measurably greater interior depth. RGB-only classification would have required a much more complex model to catch what depth data makes obvious.
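A minimal sketch of how interior depth can separate the two classes, assuming a top-down camera, a rim mask and an interior mask already derived from the instance mask, and a hypothetical 25 mm decision threshold. None of these names or numbers come from the deployed system; the synthetic patches exist only to exercise the function.

```python
import numpy as np

DEEP_THRESHOLD_MM = 25.0  # hypothetical cut-off; would be tuned per socket family

def interior_depth_mm(depth_mm, rim_mask, interior_mask):
    """Median drop from the rim to the visible interior, ignoring depth holes (zeros)."""
    valid = depth_mm > 0
    rim = depth_mm[rim_mask & valid]
    interior = depth_mm[interior_mask & valid]
    if rim.size == 0 or interior.size == 0:
        return float("nan")  # not enough valid depth to decide
    return float(np.median(interior) - np.median(rim))

def synthetic_patch(rim_mm, drop_mm, size=21):
    """Toy top-down depth patch: tray at 400 mm, raised rim, recessed interior."""
    yy, xx = np.mgrid[:size, :size]
    r = np.hypot(yy - size // 2, xx - size // 2)
    interior_mask = r < 5
    rim_mask = (r >= 5) & (r < 8)
    depth = np.full((size, size), 400.0)
    depth[rim_mask] = rim_mm
    depth[interior_mask] = rim_mm + drop_mm
    depth[size // 2, size // 2] = 0.0  # simulate one missing depth reading
    return depth, rim_mask, interior_mask

standard = interior_depth_mm(*synthetic_patch(rim_mm=375.0, drop_mm=15.0))
deep = interior_depth_mm(*synthetic_patch(rim_mm=350.0, drop_mm=40.0))
standard_is_deep = standard > DEEP_THRESHOLD_MM
deep_is_deep = deep > DEEP_THRESHOLD_MM
```

Using medians rather than means keeps a few bad depth pixels from moving the estimate. A NaN result compares as "not deep" here, so a real system would want to route that case to manual review instead.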

Alternatives I considered:

Structured light systems — more depth accuracy, but expensive and fragile under factory conditions
Stereo cameras — cheaper, but more computational overhead and lower near-range accuracy than RealSense for our working distance

RealSense struck the right balance for the deployment context.

Choosing YOLOv11

Model selection was driven by two constraints: accuracy and inference speed.

Instance segmentation — not object detection — was required. Bounding boxes around sockets don't give you the precise pixel mask needed to extract depth measurements for the socket boundary. You need per-pixel instance masks.

I evaluated several approaches:

Detectron2 (Mask R-CNN). Strong segmentation accuracy, but inference time was too high for real-time use on the target hardware. Optimized versions exist, but the operational complexity of tuning Detectron2 for production wasn't justified.
YOLOv8 with segmentation head. Viable, and I built an initial prototype on it. Accuracy on the specific domain — close-range, top-down, reflective metal objects — was lower than I wanted. The model struggled with partially overlapping sockets in crowded trays.
YOLOv11. Improvements in YOLOv11's segmentation head — particularly better small-object handling and tighter mask boundaries — were directly relevant. Inference time on GPU remained within the cycle time budget.

I chose YOLOv11-medium over YOLOv11-large after profiling: the accuracy difference on the dataset was marginal (~1.2%) but the inference speed difference was meaningful at scale.

The model was not used off-the-shelf. It was trained on factory-collected data — images captured in the actual production environment with actual sockets, under actual lighting conditions. Domain specificity is not optional in industrial vision.

The Measurement Pipeline

Once the model returns instance masks, the measurement pipeline works as follows:

1. Frame acquisition: capture synchronized RGB and depth frames from RealSense.
2. Inference: run YOLOv11 segmentation on the RGB frame. Get per-instance masks and class labels.
3. Mask projection: for each instance mask, extract corresponding depth values. Filter pixels where depth is unavailable (RealSense has depth holes in highly reflective areas).
4. Dimension computation: use RealSense intrinsic calibration to project pixel coordinates and depth values into real-world 3D coordinates. Compute socket diameter from the projected point cloud.
5. Deep socket classification: analyze the depth profile within the socket interior. Deep sockets show a characteristic depth gradient at the interior rim boundary that standard sockets don't.
6. Result output: pass measurements to the PLC interface for pass/fail decision.
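The mask projection and dimension computation steps can be sketched as follows. This is an illustration under stated assumptions, not the production code: it uses a distortion-free pinhole model, hypothetical intrinsics, a synthetic circular mask, and a simple estimator (the mean of the X and Y extents of the projected points) where a real system might fit a circle instead.

```python
import numpy as np

def mask_diameter_mm(mask, depth_mm, fx, fy, cx, cy):
    """Estimate outer diameter from an instance mask and an aligned depth map."""
    valid = mask & (depth_mm > 0)  # drop pixels with no depth reading
    if not valid.any():
        return float("nan")
    v, u = np.nonzero(valid)       # row (v) and column (u) indices
    z = depth_mm[v, u]
    x = (u - cx) * z / fx          # pinhole deprojection to millimeters
    y = (v - cy) * z / fy
    return float(((x.max() - x.min()) + (y.max() - y.min())) / 2.0)

# Synthetic example: a disc of radius 30 px seen from 300 mm
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0  # hypothetical intrinsics
yy, xx = np.mgrid[:480, :640]
mask = np.hypot(yy - 240, xx - 320) <= 30
depth = np.zeros((480, 640))
depth[mask] = 300.0
depth[238:243, 318:323] = 0.0  # a small depth hole inside the mask

diameter = mask_diameter_mm(mask, depth, FX, FY, CX, CY)
print(f"estimated diameter: {diameter:.1f} mm")
```

The extent-based estimate depends on the outermost mask pixels having valid depth, which is one reason hole handling matters so much for this pipeline.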

The most brittle part of this pipeline was step 3 — depth hole handling. RealSense depth maps have missing values in highly reflective areas, and sockets are made of polished metal. I addressed this with a combination of temporal averaging across frames and inpainting for persistent holes. Neither is perfect; together they were sufficient.
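A minimal sketch of the two ideas, assuming depth frames arrive as arrays where zero means "no reading". The temporal step here is a per-pixel median over valid samples, and the fill step is a simple neighbor-averaging stand-in for inpainting, chosen so the example needs only NumPy; it is not the inpainting method used in the deployed system.

```python
import warnings
import numpy as np

def temporal_median(frames):
    """Per-pixel median over a stack of depth frames, ignoring zeros (holes)."""
    stack = np.stack(frames).astype(float)
    stack[stack == 0] = np.nan
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)  # all-NaN pixels
        fused = np.nanmedian(stack, axis=0)
    return np.nan_to_num(fused, nan=0.0)  # persistent holes stay at zero

def fill_holes(depth, max_iters=10):
    """Fill remaining zeros with the mean of their valid 4-neighbors, working inward."""
    out = depth.astype(float).copy()
    for _ in range(max_iters):
        holes = out == 0
        if not holes.any():
            break
        padded = np.pad(out, 1, mode="constant")
        neighbors = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                              padded[1:-1, :-2], padded[1:-1, 2:]])
        valid = neighbors > 0
        count = valid.sum(axis=0)
        total = (neighbors * valid).sum(axis=0)
        fillable = holes & (count > 0)
        out[fillable] = total[fillable] / count[fillable]
    return out

# Three noisy frames: one transient hole at (0, 0), one persistent hole at (2, 2)
frames = [np.full((5, 5), value) for value in (300.0, 302.0, 298.0)]
frames[0][0, 0] = 0.0
for frame in frames:
    frame[2, 2] = 0.0

fused = temporal_median(frames)
filled = fill_holes(fused)
```

Transient holes disappear in the temporal step because other frames supply a reading. Only pixels that are missing in every frame reach the fill step, which keeps the amount of invented depth small.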

The Hard Parts Nobody Writes About

Lighting calibration was manual and fragile. Industrial fluorescent lighting creates reflections that shift as the light source flickers or as different numbers of lights are active in the cell. The model had to be trained with exposure variation to generalize. I collected images across multiple days at different times of day to capture lighting diversity.

The dataset labeling problem. You cannot use open-source socket images for a production industrial inspection system. Every image in the training set had to be captured in the actual factory, on actual toolkits, and manually labeled. The dataset was approximately 2,400 images with applied augmentation (rotation, brightness variation, synthetic occlusion). This was time-consuming. There is no shortcut.
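The three augmentations named above can be sketched in a few lines. This is an illustration only: rotation is limited to quarter turns so the example needs nothing beyond NumPy (an image library would handle arbitrary angles), the gain range and patch size are arbitrary, and removing the occluded area from the mask is an assumed labeling policy.

```python
import numpy as np

def augment(image, mask, rng):
    """Apply rotation, brightness variation, and a synthetic occluder to one sample."""
    # Rotation: quarter turns only, applied identically to image and mask
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k).copy()
    mask = np.rot90(mask, k).copy()

    # Brightness variation: random gain, clipped back to the 8-bit range
    gain = rng.uniform(0.7, 1.3)
    image = np.clip(image.astype(np.float32) * gain, 0, 255).astype(np.uint8)

    # Synthetic occlusion: a gray rectangle covering part of the frame
    h, w = mask.shape
    ph, pw = h // 5, w // 5
    top = int(rng.integers(0, h - ph + 1))
    left = int(rng.integers(0, w - pw + 1))
    image[top:top + ph, left:left + pw] = 128
    mask[top:top + ph, left:left + pw] = False  # assumed policy: occluded pixels are unlabeled
    return image, mask

rng = np.random.default_rng(0)
image = np.full((40, 60, 3), 200, dtype=np.uint8)
mask = np.zeros((40, 60), dtype=bool)
mask[10:30, 20:40] = True
aug_image, aug_mask = augment(image, mask, rng)
```

Geometric transforms must be applied to the image and the mask together, otherwise the labels no longer match the pixels.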

Confidence threshold calibration. In manufacturing, a false negative — missing a defective or misidentified socket — is more costly than a false positive. I calibrated confidence thresholds to bias toward false positives, accepting more manual review flags in exchange for near-zero missed detections on defects.
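One way to turn that preference into a number is to pick the threshold from a target recall on a labeled validation set. The sketch below assumes a per-item "defect" score and ground-truth labels; the function name, the scores, and the targets are invented for illustration and are not the calibration procedure used on the line.

```python
import numpy as np

def threshold_for_recall(scores, is_defect, target_recall):
    """Highest score threshold whose recall on known defects meets the target."""
    defect_scores = np.sort(scores[is_defect])[::-1]  # descending
    needed = max(1, int(np.ceil(target_recall * defect_scores.size)))
    return float(defect_scores[needed - 1])

def review_stats(scores, is_defect, threshold):
    """Recall on defects and count of good items flagged at a given threshold."""
    flagged = scores >= threshold
    recall = float((flagged & is_defect).sum() / is_defect.sum())
    false_positives = int((flagged & ~is_defect).sum())
    return recall, false_positives

# Hypothetical validation scores: first four items are true defects
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])
is_defect = np.array([True, True, True, True, False, False, False, False])

strict = threshold_for_recall(scores, is_defect, target_recall=1.0)
relaxed = threshold_for_recall(scores, is_defect, target_recall=0.75)
strict_stats = review_stats(scores, is_defect, strict)
relaxed_stats = review_stats(scores, is_defect, relaxed)
```

In this toy data, demanding full recall lowers the threshold to 0.3 and sends two good items to manual review; accepting 75% recall raises it to 0.6 and flags only one. Which point to operate at depends on what a missed defect costs relative to a review.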

This is an engineering judgment, not a model parameter.

Inference optimization under a real cycle time budget. The production line had a fixed cycle time. Inference had to complete within it or the system would become a bottleneck. I profiled the full pipeline: not just model inference, but also frame capture latency, depth map processing, and result transmission to the PLC. The model itself was fast; the surrounding overhead required careful implementation. Python's GIL was a constraint; I moved depth processing to a separate thread with a shared memory buffer, an approach that pays off when the heavy array operations run in native code that releases the GIL.
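A minimal sketch of the thread-plus-buffer pattern, under assumptions of my own: the "shared memory buffer" is modeled as a single preallocated NumPy array guarded by a condition variable, the class and function names are hypothetical, and the per-frame work is a placeholder median. The worker always processes the newest frame and silently skips any it was too slow to see.

```python
import threading
import numpy as np

class LatestFrameBuffer:
    """Single-slot buffer: the writer overwrites, the reader gets the newest frame."""

    def __init__(self, shape, dtype=np.uint16):
        self._frame = np.zeros(shape, dtype=dtype)  # allocated once, reused
        self._seq = 0
        self._cond = threading.Condition()

    def publish(self, frame):
        with self._cond:
            np.copyto(self._frame, frame)
            self._seq += 1
            self._cond.notify_all()

    def wait_for_new(self, last_seq, timeout=0.5):
        """Block until a frame newer than last_seq exists; return (seq, copy) or (last_seq, None)."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._seq > last_seq, timeout=timeout):
                return last_seq, None
            return self._seq, self._frame.copy()

def depth_worker(buf, results, last_expected_seq):
    """Consume frames until the final expected one has been processed."""
    seq = 0
    while seq < last_expected_seq:
        seq, frame = buf.wait_for_new(seq)
        if frame is None:
            break  # capture stalled; stop rather than wait forever
        results.append((seq, float(np.median(frame[frame > 0]))))

buf = LatestFrameBuffer((4, 4))
results = []
worker = threading.Thread(target=depth_worker, args=(buf, results, 3))
worker.start()
for i in range(3):  # stand-in for the capture loop
    buf.publish(np.full((4, 4), 300 + i, dtype=np.uint16))
worker.join()
```

Dropping stale frames is a deliberate choice here: for inspection against a cycle time, a late result for an old frame is worth less than a timely result for the current one.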

Edge Deployment Considerations

The system ran on an industrial PC with a discrete GPU — not a cloud endpoint. Latency to a cloud API would have made real-time operation impossible, and factory network connectivity couldn't be relied upon for mission-critical inspection.

This meant the entire model and inference pipeline had to run on local hardware, and that hardware had to be maintainable by factory staff who were not ML engineers. The deployment had to be durable: no Python environment management issues, no dependency conflicts that broke on OS updates.

I containerized the inference service with Docker, using a pinned CUDA + PyTorch base image. The container started automatically on system boot via systemd. Updates were pushed as new image versions.

Results and Where the Numbers Come From

96.9% segmentation accuracy: mean IoU across all socket instances in a held-out test set, collected separately from the training data in the same factory environment.
93.2% detection accuracy: mAP@0.5 on detection, slightly lower because some overlapping sockets in dense trays challenged the model.
10% production throughput increase: the system eliminated a manual inspection step that previously required a human operator to sample and verify toolkit completeness.
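For readers unfamiliar with the segmentation metric, mean IoU over instances can be computed as sketched below. The sketch assumes each predicted mask has already been matched to its ground-truth mask, and the two toy masks are invented for illustration; it does not reproduce the evaluation script behind the numbers above.

```python
import numpy as np

def mask_iou(pred, truth):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both empty: treat as perfect agreement
    return float(np.logical_and(pred, truth).sum() / union)

def mean_iou(pairs):
    """Average IoU over (predicted, ground-truth) mask pairs."""
    return float(np.mean([mask_iou(pred, truth) for pred, truth in pairs]))

truth = np.zeros((20, 20), dtype=bool)
truth[0:10, 0:10] = True

exact = truth.copy()                      # perfect prediction
shifted = np.zeros((20, 20), dtype=bool)  # prediction offset by 5 px
shifted[0:10, 5:15] = True

score = mean_iou([(exact, truth), (shifted, truth)])
print(f"mean IoU: {score:.3f}")
```

The shifted mask overlaps half of the ground truth yet scores only one third, which shows how strongly IoU penalizes boundary errors. That sensitivity is what makes it a useful proxy when the mask edge feeds a dimensional measurement.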

What I'd Build Differently

Uncertainty estimation. The model outputs a confidence score, but this isn't the same as calibrated uncertainty. For a production system, I'd want the system to say "I'm not sure about this instance, flag for human review" rather than committing to a potentially wrong classification at a confidence that still exceeds the threshold.
Automated re-labeling pipeline. As socket designs change, the model needs retraining. A pipeline where edge cases flagged by the system automatically become training candidates — with human review — would make the system self-improving.
Structured light supplement. For the small percentage of sockets where RealSense depth is unreliable due to reflection, a supplementary structured light scan could provide ground truth. Expensive, but would push measurement reliability from "very good" to "production-certified."
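On the uncertainty point, a first step would be to measure how far the model's confidence scores are from calibrated probabilities. The sketch below computes expected calibration error (ECE) on a labeled validation set; the data is synthetic and the ten-bin layout is just a common default, so treat it as an illustration of the gap rather than something measured on this system.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_index = np.digitize(conf, edges[1:-1])  # values 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by share of samples in the bin
    return float(ece)

conf = np.full(10, 0.9)  # model reports 90% confidence on every instance

# Well calibrated: 9 of 10 predictions are actually correct
correct_good = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=float)
# Overconfident: only 5 of 10 are correct at the same reported confidence
correct_bad = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

ece_good = expected_calibration_error(conf, correct_good)
ece_bad = expected_calibration_error(conf, correct_bad)
```

Both cases clear a 0.5 confidence threshold equally well, but only the first deserves the trust the score implies. A large ECE is a signal that a fixed threshold is hiding instances that should be flagged for review.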

Key Takeaways

Depth imaging (RGB-D) enables capabilities that RGB alone cannot — specifically real-world dimensional measurement. Choose your sensor with the task in mind.
Industrial AI systems require domain-specific data. Models trained on open-source datasets rarely generalize to a specific production cell. Invest in data collection in the actual deployment environment.
Model selection should be driven by the trade-off between accuracy and inference speed within the cycle time budget, not by accuracy alone.
Confidence threshold calibration is a business decision as much as a technical one. In manufacturing, the cost of a false negative typically outweighs the cost of a false positive.
Containerize edge AI deployments. Factory IT staff should not need to manage Python virtual environments.

Conclusion

Building an industrial AI vision system isn't primarily a machine learning problem. It's an engineering problem where machine learning is one of several components.

The accuracy numbers matter, but so do the frame capture latency, the depth hole handling strategy, the deployment containerization, and the threshold calibration that reflects the actual cost asymmetry of false positives versus false negatives in the factory context.

The 96.9% segmentation accuracy came from careful engineering of the full pipeline, not from running a state-of-the-art model on clean benchmark data. That's the version of AI engineering that ships to production.