Nikhil Reddy

Depth-Driven Vision: With Object Detection and Depth Estimation

Introduction

In computer vision, combining object detection with depth estimation has significantly improved how accurately we can analyze video. This blog walks through a project that uses two powerful tools: YOLO for detecting objects and Depth Anything for estimating depth. By blending these models, the project not only detects objects reliably but also estimates how far they are from the camera's reference point in each video frame.

Imagine a continuous stream of video frames being processed, with YOLO detecting specific objects and Depth Anything overlaying depth information on each frame. This combination does more than identify objects: it also tells us how far they are from the camera, giving valuable insight into the spatial layout of every frame.

YOLO: You Only Look Once

YOLO, short for “You Only Look Once,” is a pioneering object detection algorithm known for its speed and accuracy in real-time applications. It revolutionized the field by introducing a single neural network that predicts bounding boxes and class probabilities directly from full images in one evaluation, making it incredibly efficient. The evolution of the YOLO series has seen significant advancements from YOLOv1 to the latest YOLOv8. Each iteration has brought improvements in speed, accuracy, and model architecture. Notable versions include:

  • YOLOv4: Introduced "bag-of-freebies" and "bag-of-specials" training techniques, achieving impressive results in both speed and accuracy.
  • YOLOv5: Developed using PyTorch, it focused on user-friendliness for training and deployment, achieving high accuracy on the MS COCO dataset.
  • Scaled-YOLOv4: Introduced scaling-up and scaling-down techniques for improved accuracy and speed.
  • YOLOv8: The latest version released in January 2023 by Ultralytics, featuring an anchor-free approach, faster Non-maximum Suppression (NMS), and various enhancements for improved performance.

YOLO-World

The YOLO-World model, based on YOLOv8, is a real-time open-vocabulary object detection system that excels in detecting any object within an image based on descriptive texts. It offers a versatile tool for vision-based applications with lower computational demands while maintaining competitive performance. The model is user-friendly, easy to integrate into Python applications, and provides pre-trained weights for efficient deployment.

Key Features:

  • Real-time, open-vocabulary object detection driven by a simple text prompt (see the short example after this list).
  • Code and pre-trained weights available in the YOLO-World GitHub repository.
  • Built on the CNN-based YOLO architecture for high speed and accuracy.
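
As a quick illustration of prompt-driven detection, here is a minimal sketch using the Ultralytics API. The image path street.jpg and the prompt classes are placeholders for this example, not part of the original project.

from ultralytics import YOLO

# Load the pre-trained YOLO-World weights
model = YOLO('yolov8s-world.pt')

# "Prompt" the detector with the vocabulary you care about
model.set_classes(["person", "bus", "traffic light"])

# Run detection on a single image (street.jpg is a placeholder path)
results = model.predict('street.jpg')
results[0].show()  # display boxes and labels for the prompted classes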

Architecture:

  • YOLO-World's architecture fuses image features with text embeddings using a YOLO detector backbone, a text encoder, and a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) for multi-level cross-modality fusion.
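
The details of RepVL-PAN are beyond this post, but the core idea behind open-vocabulary detection, scoring image region features against text embeddings of the prompted classes, can be shown with a simplified, hypothetical sketch (random tensors stand in for real detector and text-encoder outputs):

import torch
import torch.nn.functional as F

# Hypothetical stand-ins: in the real model these come from the image backbone
# and the CLIP text encoder respectively.
region_features = torch.randn(100, 512)   # 100 candidate regions, 512-dim features
text_embeddings = torch.randn(3, 512)     # embeddings for 3 prompt classes

# Normalize and compute region-text similarity: each region gets a score per prompt class.
region_features = F.normalize(region_features, dim=-1)
text_embeddings = F.normalize(text_embeddings, dim=-1)
class_scores = region_features @ text_embeddings.T   # shape: (100, 3)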

GitHub: AILab-CVC/YOLO-World ([CVPR 2024] Real-Time Open-Vocabulary Object Detection), by Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan (Tencent AI Lab; ARC Lab, Tencent PCG; Huazhong University of Science and Technology).


Depth Anything: Revolutionizing Depth Estimation in Computer Vision

Depth Anything is a model that represents a significant advancement in monocular depth estimation. Trained on a vast dataset comprising 1.5 million labeled images and over 62 million unlabeled images, Depth Anything stands out as a robust foundation model for Monocular Depth Estimation (MDE), offering impressive features and capabilities.

Key Features of Depth Anything:

  • Zero-Shot Relative Depth Estimation: Outperforming MiDaS v3.1 (BEiTL-512), Depth Anything excels in zero-shot relative depth estimation.
  • Zero-Shot Metric Depth Estimation: Surpassing ZoeDepth, this model showcases superior zero-shot metric depth estimation capabilities.
  • Optimal In-Domain Fine-Tuning: Through fine-tuning and evaluation on datasets like NYUv2 and KITTI, Depth Anything demonstrates exceptional performance.
  • Enhanced ControlNet: The model enhances a depth-conditioned ControlNet based on its architecture, surpassing previous versions based on MiDaS.
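
To get a quick feel for the model before wiring it into the full pipeline, one option is the Hugging Face transformers depth-estimation pipeline. This is only a sketch; it assumes the LiheYoung/depth-anything-small-hf checkpoint and a local image demo.jpg, neither of which comes from the original project code.

from PIL import Image
from transformers import pipeline

# Load a small Depth Anything checkpoint via the depth-estimation pipeline
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")

# Run inference on a placeholder image
image = Image.open("demo.jpg")
result = pipe(image)

# result["depth"] is a PIL image of the relative depth map;
# result["predicted_depth"] is the raw tensor
result["depth"].save("demo_depth.png")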

Project Architecture

This project builds a system that integrates YOLO-World and Depth Anything to significantly enhance video analysis. By combining state-of-the-art object detection, depth estimation, and distance calculation, it provides a holistic view of each video frame and produces meaningful visual output.

Components:

Input Processing:

  • The video input undergoes frame-by-frame processing using OpenCV (cv2) to extract visual data for analysis.

Object Detection with YOLO-World:

  • Utilizing the YOLO-World model, specific classes are identified within each frame.
  • Detected objects are visually highlighted with bounding boxes and class labels for easy identification.
  • The results we get for each detection are the bounding-box coordinates, a confidence score, and a class ID, extracted as shown below.
from collections import defaultdict
import cv2
from ultralytics import YOLO

# Initialize YOLO model
model = YOLO('yolov8s-world.pt')
model.set_classes(["car"])  # Define custom classes

# Open video stream
cap = cv2.VideoCapture('road_test.mp4') 
# Store the track history
track_history = defaultdict(lambda: [])

# Loop through the video frames
while cap.isOpened():
    # Read a frame from the video
    success, frame = cap.read()

    if success:
        # Run YOLO-World tracking on the frame, persisting tracks between frames
        results = model.track(frame, persist=True)

        # Visualize the results on the frame
        annotated_frame = results[0].plot()

        # Check for detections and extract the bounding box, confidence score and class ID.
        # Using the Boxes accessors avoids unpacking errors when tracking adds a track-ID column.
        if len(results[0].boxes) > 0:
            box = results[0].boxes[0]  # first detected object
            xmin, ymin, xmax, ymax = box.xyxy[0].tolist()
            conf = float(box.conf[0])
            class_id = int(box.cls[0])
            print("xmin=", round(xmin), "ymin=", round(ymin),
                  "xmax=", round(xmax), "ymax=", round(ymax), "conf=", conf)
        # Display the annotated frame
        cv2.imshow("Tracking", annotated_frame)

        # Break the loop if 'q' is pressed
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    else:
        # Break the loop if the end of the video is reached
        break

# Release the video capture object and close the display window
cap.release()
cv2.destroyAllWindows()

Depth Estimation with Depth Anything:

  • The pre-trained Depth Anything model generates a detailed depth map for each processed frame.
  • These depth maps provide the spatial information needed for accurate distance calculation.
  • Using the bounding-box coordinates, we read the pixel values of the target object from the depth map, as in the snippet below.
# Apply the Depth Anything model to each frame to produce a depth map.
# Assumes `depth_anything`, `transform`, `DEVICE`, `raw_frame`, `frame_height` and
# `frame_width` are set up as in the Depth-Anything repository's inference script.
import cv2
import numpy as np
import torch
import torch.nn.functional as F

frame = cv2.cvtColor(raw_frame, cv2.COLOR_BGR2RGB) / 255.0
frame = transform({'image': frame})['image']
frame = torch.from_numpy(frame).unsqueeze(0).to(DEVICE)
with torch.no_grad():
    depth = depth_anything(frame)

# Resize the prediction to the original frame size and normalize it to 0-255.
depth = F.interpolate(depth[None], (frame_height, frame_width), mode='bilinear', align_corners=False)[0, 0]
depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
depth = depth.cpu().numpy().astype(np.uint8)

# Colorize the depth map for visualization.
depth_color = cv2.applyColorMap(depth, cv2.COLORMAP_INFERNO)

Distance Calculation:

  • Using the depth-map pixel value at the center of the bounding box, the system estimates the relative distance of each detected object from the camera reference point.
  • The distance is measured along a single axis, from the camera reference point to the center point of the object.
  • The change in distance across frames is recorded, offering insight into object movement and spatial dynamics.
# The Depth Anything output is a relative (inverse-depth-like) map: larger pixel values
# correspond to closer points, so distance is taken as inversely proportional to the
# pixel value at the center of the bounding box.
center_y = round((round(ymin) + round(ymax)) / 2)
center_x = round((round(xmin) + round(xmax)) / 2)
pixel_value = float(depth[center_y][center_x])
if pixel_value > 0:  # guard against division by zero
    dist = 1 / pixel_value
    dist_arr.append(dist)
# The distance obtained here is unitless. It is later converted to meters using a
# relative-distance technique; any other calibration technique can be used instead.

Visualization and Output:

  • Visual representations include masked color maps depicting depth information and annotated frames showcasing detected objects along with their respective distances.
# Original-color frame with a bounding box around the target object.
# `class_name` is the prompted class label (here, "car" from model.set_classes).
out_frame = cv2.rectangle(raw_frame.copy(), (int(xmin), int(ymin)), (int(xmax), int(ymax)), (0, 255, 0), 2)
out_frame = cv2.putText(out_frame, f'{class_name} {conf:.2f} Dist: {dist:.3f}', (int(xmin), int(ymin) - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
# Depth-map (masked) frame with a bounding box around the target object.
masked_frame = cv2.rectangle(depth_color.copy(), (int(xmin), int(ymin)), (int(xmax), int(ymax)), (0, 255, 0), 2)
masked_frame = cv2.putText(masked_frame, f'{class_name} Distance: {dist:.3f}', (int(xmin), int(ymin) - 10),
                           cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

Results & Analysis

In this project, we investigated the variation in distance (depth) of a car (the detected object) over time (number of frames). We achieved this by combining YOLO for object detection with a depth estimation model to track the car's distance from the camera across individual video frames.

Variation in distance (depth) of a car (object) over time (number of frames)


The graph depicts a clear trend: the car’s distance steadily increased with each frame, signifying a movement away from the camera. This implies the car traversed the field of view throughout the video. The distance fluctuated slightly at times, but the overall trajectory indicated a receding motion.

The minimum distance was observed in the very first frame, placing the car in the camera's immediate vicinity. This could suggest the car was either stationary or approaching the camera initially. Conversely, the maximum distance captured the car at its furthest point from the camera within the video sequence.

It's important to acknowledge that the analysis captured the car's distance along a single dimension. Movement in other directions, such as lateral movement across the frame, wouldn't be reflected in this specific data.

This project demonstrates the effectiveness of YOLO and depth estimation models in analyzing car movement within a video. The generated distance data offers valuable insights for tasks like motion tracking and speed estimation.
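
As a hypothetical follow-up, the per-frame distances can be turned into a rough speed estimate once they have been converted to meters. The sketch below assumes a list dist_m of per-frame distances in meters and a known video frame rate; neither comes from the project code, and only the component of motion along the camera axis is captured.

# Rough speed estimate from consecutive per-frame distances (camera-axis component only).
fps = 30.0                      # assumed video frame rate
dist_m = [12.0, 12.4, 12.9]     # hypothetical per-frame distances in meters

# meters per frame * frames per second = meters per second
speeds = [round(abs(d2 - d1) * fps, 2) for d1, d2 in zip(dist_m, dist_m[1:])]
print(speeds)  # [12.0, 15.0]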
