4. Visual Intelligence: Computer Vision Tasks
Computer Vision (CV) enables computers to “see” and interpret visual information from images and videos. Transformers.js brings powerful CV models directly to the browser, allowing for client-side image processing, analysis, and understanding. This chapter explores common CV tasks.
4.1. Image Classification
Image classification involves assigning a label (or class) to an entire image, determining what the main subject of the image is.
4.1.1. Detailed Explanation
An image classification pipeline takes an image (as a URL, File object, or HTMLImageElement) and outputs a list of predicted labels with confidence scores. Models are trained on vast datasets like ImageNet, learning to recognize patterns associated with thousands of different categories.
4.1.2. Code Examples: Image Classifier
Let’s create an app that classifies uploaded images.
<!-- index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Image Classifier</title>
  <style>
    body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; margin: 20px; background-color: #f4f7f6; color: #333; display: flex; flex-direction: column; align-items: center; }
    .container { background-color: #ffffff; padding: 30px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); width: 100%; max-width: 700px; text-align: center; }
    h1 { color: #2c3e50; margin-bottom: 20px; }
    input[type="file"] { margin-bottom: 15px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; width: calc(100% - 20px); }
    #imagePreview { max-width: 100%; height: 200px; object-fit: contain; margin-bottom: 15px; border: 1px dashed #ccc; display: block; margin-left: auto; margin-right: auto; }
    button { background-color: #009688; color: white; padding: 12px 25px; border: none; border-radius: 5px; cursor: pointer; font-size: 17px; transition: background-color 0.3s ease; margin-bottom: 15px; }
    button:hover:not(:disabled) { background-color: #00796b; }
    button:disabled { background-color: #cccccc; cursor: not-allowed; }
    #output { margin-top: 25px; padding: 20px; border: 1px solid #e0e0e0; border-radius: 8px; background-color: #e0f2f7; text-align: left; }
    /* Spinner is hidden by default; app.js switches it to inline-block while busy.
       (The rule previously declared display twice — inline-block then none — so
       only the final 'none' ever applied; the dead declaration is removed.) */
    #loadingSpinner { border: 4px solid #f3f3f3; border-top: 4px solid #009688; border-radius: 50%; width: 20px; height: 20px; animation: spin 1s linear infinite; margin-right: 10px; vertical-align: middle; display: none; }
    @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } }
    .classification-item { margin-bottom: 5px; }
    .classification-item span { font-weight: bold; color: #00796b; }
  </style>
</head>
<body>
  <div class="container">
    <h1>Image Classifier</h1>
    <!-- File picker restricted to image MIME types -->
    <input type="file" id="imageUpload" accept="image/*">
    <!-- Preview stays hidden until app.js loads a data URL into it -->
    <img id="imagePreview" src="" alt="Image Preview" style="display: none;">
    <button id="classifyButton">
      <span id="loadingSpinner"></span> Classify Image
    </button>
    <div id="output">
      <h3>Classification Results:</h3>
      <p id="classificationResult">Upload an image and click "Classify Image".</p>
    </div>
  </div>
  <script type="module" src="./app.js"></script>
</body>
</html>
// app.js — client-side image classification with Transformers.js.
// Loads the model once on startup, previews the selected file, and renders
// label/score pairs for the uploaded image.
import { pipeline } from "https://esm.sh/@huggingface/transformers";

document.addEventListener('DOMContentLoaded', async () => {
  const imageUpload = document.getElementById('imageUpload');
  const imagePreview = document.getElementById('imagePreview');
  const classifyButton = document.getElementById('classifyButton');
  const classificationResult = document.getElementById('classificationResult');
  const loadingSpinner = document.getElementById('loadingSpinner');

  classificationResult.textContent = "Loading image classification model...";
  classifyButton.disabled = true;
  loadingSpinner.style.display = 'inline-block';

  let imageClassifier = null;
  try {
    // MobileNetV2: good balance of download size, speed, and accuracy for the web.
    // For higher accuracy (and a larger download), consider 'Xenova/vit-base-patch16-224'.
    imageClassifier = await pipeline(
      'image-classification',
      'Xenova/mobilenet-v2',
      {
        device: 'webgpu', // hardware acceleration where available
        dtype: 'q8',      // 8-bit quantization keeps the model small
      }
    );
    classificationResult.textContent = "Model loaded. Upload an image and classify!";
    classifyButton.disabled = false;
  } catch (error) {
    console.error("Failed to load image classification model:", error);
    classificationResult.textContent = "Error loading model. Check console.";
  } finally {
    // Hide the spinner on both the success and failure paths.
    loadingSpinner.style.display = 'none';
  }

  // Preview the selected file as a data URL so the user can confirm the input.
  imageUpload.addEventListener('change', (event) => {
    const file = event.target.files[0];
    if (file) {
      const reader = new FileReader();
      reader.onload = (e) => {
        imagePreview.src = e.target.result;
        imagePreview.style.display = 'block';
        classificationResult.textContent = "Image ready for classification.";
      };
      reader.readAsDataURL(file);
    } else {
      imagePreview.src = '';
      imagePreview.style.display = 'none';
      classificationResult.textContent = "No image selected.";
    }
  });

  classifyButton.addEventListener('click', async () => {
    const file = imageUpload.files[0];
    if (!file) {
      classificationResult.textContent = "Please upload an image first.";
      return;
    }
    // BUGFIX: if model loading failed above, imageClassifier is still null;
    // bail out with a clear message instead of throwing on a null call.
    if (!imageClassifier) {
      classificationResult.textContent = "Model is not loaded. Reload the page and try again.";
      return;
    }
    classifyButton.disabled = true;
    loadingSpinner.style.display = 'inline-block';
    classificationResult.textContent = "Classifying image...";
    try {
      // BUGFIX: pass the preview's data URL (a documented pipeline input type)
      // rather than the HTMLImageElement itself, which the pipeline does not accept.
      const output = await imageClassifier(imagePreview.src);

      // Build the result list with DOM APIs instead of innerHTML string
      // concatenation, so label text can never be interpreted as HTML.
      const list = document.createElement('ul');
      for (const { label, score } of output) {
        const item = document.createElement('li');
        item.className = 'classification-item';
        const labelSpan = document.createElement('span');
        labelSpan.textContent = label;
        item.appendChild(labelSpan);
        item.appendChild(document.createTextNode(`: ${(score * 100).toFixed(2)}%`));
        list.appendChild(item);
      }
      classificationResult.replaceChildren(list);
    } catch (error) {
      console.error("Error during image classification:", error);
      classificationResult.textContent = "Error during classification. Please try again.";
    } finally {
      classifyButton.disabled = false;
      loadingSpinner.style.display = 'none';
    }
  });
});
4.1.3. Exercises/Mini-Challenges: Image Classification Enhancements
- Top-K Results: The model often returns multiple predictions. Modify the display to show only the top 3 or top 5 predictions, sorted by confidence score.
- Webcam Input: Instead of file upload, modify the app to take a live stream from the user’s webcam as input. You’ll need to use `navigator.mediaDevices.getUserMedia` to access the camera, display the video stream on a `<video>` element, and then capture frames (e.g., draw them to a `<canvas>` and then call `toDataURL` or `toBlob`) to feed into the pipeline.
- Zero-Shot Image Classification: Research “zero-shot image classification” with Transformers.js. This allows you to classify images into categories that the model was not explicitly trained on, by providing text labels. Find a compatible model (e.g., `Xenova/clip-vit-base-patch32`) and adapt your application to allow the user to input custom labels (e.g., “dog”, “cat”, “chair”) to classify the image against.
4.2. Object Detection
Object detection involves identifying the presence and location of multiple objects within an image, drawing bounding boxes around them, and labeling each object.
4.2.1. Detailed Explanation
An object detection pipeline takes an image and outputs an array of objects, where each object describes a detected item. Each detection typically includes:
- `box`: Coordinates (xmin, ymin, xmax, ymax) for the bounding box.
- `label`: The class of the detected object (e.g., “car”, “person”).
- `score`: The confidence score for that detection.
4.2.2. Code Examples: Real-time Object Detector (with static image)
Due to the visual nature, we’ll draw bounding boxes on a canvas.
<!-- index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Object Detector</title>
  <style>
    body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; margin: 20px; background-color: #f4f7f6; color: #333; display: flex; flex-direction: column; align-items: center; }
    .container { background-color: #ffffff; padding: 30px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); width: 100%; max-width: 800px; text-align: center; }
    h1 { color: #2c3e50; margin-bottom: 20px; }
    input[type="file"] { margin-bottom: 15px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; width: calc(100% - 20px); }
    /* position: relative so the placeholder text can be absolutely centered over the canvas */
    #canvasContainer { position: relative; width: 100%; max-width: 700px; margin: 0 auto 15px auto; border: 1px solid #ccc; background-color: #eee; }
    #displayCanvas { max-width: 100%; height: auto; display: block; }
    button { background-color: #ffc107; color: #333; padding: 12px 25px; border: none; border-radius: 5px; cursor: pointer; font-size: 17px; transition: background-color 0.3s ease; margin-bottom: 15px; }
    button:hover:not(:disabled) { background-color: #e0a800; }
    button:disabled { background-color: #cccccc; cursor: not-allowed; }
    #output { margin-top: 25px; padding: 20px; border: 1px solid #e0e0e0; border-radius: 8px; background-color: #fffde7; text-align: left; }
    /* Spinner is hidden by default; app.js switches it to inline-block while busy.
       (The rule previously declared display twice — inline-block then none — so
       only the final 'none' ever applied; the dead declaration is removed.) */
    #loadingSpinner { border: 4px solid #f3f3f3; border-top: 4px solid #ffc107; border-radius: 50%; width: 20px; height: 20px; animation: spin 1s linear infinite; margin-right: 10px; vertical-align: middle; display: none; }
    @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } }
    .detection-item { margin-bottom: 5px; }
    .detection-item span { font-weight: bold; color: #ffc107; }
  </style>
</head>
<body>
  <div class="container">
    <h1>Object Detector</h1>
    <input type="file" id="imageUpload" accept="image/*">
    <div id="canvasContainer">
      <!-- app.js resizes this canvas to the uploaded image and draws boxes on it -->
      <canvas id="displayCanvas"></canvas>
      <p id="canvasPlaceholder" style="position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); color: #888; display: block;">Upload an image here</p>
    </div>
    <button id="detectButton">
      <span id="loadingSpinner"></span> Detect Objects
    </button>
    <div id="output">
      <h3>Detected Objects:</h3>
      <p id="detectionResult">Upload an image and click "Detect Objects".</p>
    </div>
  </div>
  <script type="module" src="./app.js"></script>
</body>
</html>
// app.js — client-side object detection with Transformers.js.
// Loads a DETR model once on startup, draws the uploaded image onto a canvas,
// then overlays bounding boxes and labels for each detection.
import { pipeline } from "https://esm.sh/@huggingface/transformers";

document.addEventListener('DOMContentLoaded', async () => {
  const imageUpload = document.getElementById('imageUpload');
  const displayCanvas = document.getElementById('displayCanvas');
  const canvasPlaceholder = document.getElementById('canvasPlaceholder');
  const detectButton = document.getElementById('detectButton');
  const detectionResult = document.getElementById('detectionResult');
  const loadingSpinner = document.getElementById('loadingSpinner');
  const ctx = displayCanvas.getContext('2d');

  detectionResult.textContent = "Loading object detection model...";
  detectButton.disabled = true;
  loadingSpinner.style.display = 'inline-block';

  let objectDetector = null;
  try {
    // DETR-based detection model.
    // 'Xenova/detr-resnet-50' is powerful but larger.
    // 'Xenova/detr-resnet-50-panoptic' adds segmentation on top of detection.
    // For a lighter alternative, consider 'Xenova/yolox-tiny' if available.
    objectDetector = await pipeline(
      'object-detection',
      'Xenova/detr-resnet-50', // good balance; may be slow on older devices
      {
        device: 'webgpu',
        dtype: 'q8',
      }
    );
    detectionResult.textContent = "Model loaded. Upload an image and detect objects!";
    detectButton.disabled = false;
  } catch (error) {
    console.error("Failed to load object detection model:", error);
    detectionResult.textContent = "Error loading model. Check console.";
  } finally {
    // Hide the spinner on both the success and failure paths.
    loadingSpinner.style.display = 'none';
  }

  // The decoded image currently shown on the canvas (null until an upload succeeds).
  let currentImage = null;

  imageUpload.addEventListener('change', (event) => {
    const file = event.target.files[0];
    if (file) {
      const reader = new FileReader();
      reader.onload = (e) => {
        const img = new Image();
        img.onload = () => {
          currentImage = img;
          // Size the canvas to the image so box coordinates map 1:1 to pixels.
          displayCanvas.width = img.width;
          displayCanvas.height = img.height;
          ctx.clearRect(0, 0, displayCanvas.width, displayCanvas.height);
          ctx.drawImage(img, 0, 0, img.width, img.height);
          canvasPlaceholder.style.display = 'none';
          detectionResult.textContent = "Image ready for object detection.";
        };
        img.src = e.target.result;
      };
      reader.readAsDataURL(file);
    } else {
      currentImage = null;
      ctx.clearRect(0, 0, displayCanvas.width, displayCanvas.height);
      canvasPlaceholder.style.display = 'block';
      detectionResult.textContent = "No image selected.";
    }
  });

  detectButton.addEventListener('click', async () => {
    if (!currentImage) {
      detectionResult.textContent = "Please upload an image first.";
      return;
    }
    // BUGFIX: if model loading failed above, objectDetector is still null;
    // bail out with a clear message instead of throwing on a null call.
    if (!objectDetector) {
      detectionResult.textContent = "Model is not loaded. Reload the page and try again.";
      return;
    }
    detectButton.disabled = true;
    loadingSpinner.style.display = 'inline-block';
    detectionResult.textContent = "Detecting objects...";
    try {
      const detections = await objectDetector(currentImage);

      // Redraw the clean image before overlaying the new boxes.
      ctx.clearRect(0, 0, displayCanvas.width, displayCanvas.height);
      ctx.drawImage(currentImage, 0, 0, displayCanvas.width, displayCanvas.height);

      // Set shared drawing state once; the font must be set BEFORE measureText.
      ctx.font = '14px sans-serif';
      ctx.lineWidth = 2;

      // Build the result list with DOM APIs instead of innerHTML string
      // concatenation, so label text can never be interpreted as HTML.
      const list = document.createElement('ul');
      for (const { box, label, score } of detections) {
        const { xmin, ymin, xmax, ymax } = box;
        const confidence = (score * 100).toFixed(2);

        const item = document.createElement('li');
        item.className = 'detection-item';
        const labelSpan = document.createElement('span');
        labelSpan.textContent = label;
        item.appendChild(labelSpan);
        item.appendChild(document.createTextNode(
          `: ${confidence}% (Box: x${xmin.toFixed(0)}, y${ymin.toFixed(0)} to x${xmax.toFixed(0)}, y${ymax.toFixed(0)})`
        ));
        list.appendChild(item);

        // Bounding box.
        ctx.strokeStyle = '#FFC107'; // yellow
        ctx.strokeRect(xmin, ymin, xmax - xmin, ymax - ymin);

        // Label tag: measure the real text width instead of guessing from
        // character count. BUGFIX: a box touching the top edge would place the
        // tag off-canvas at ymin - 20, so clamp it just inside the box instead.
        const tagText = `${label} (${confidence}%)`;
        const tagWidth = ctx.measureText(tagText).width + 10;
        const tagTop = ymin >= 20 ? ymin - 20 : ymin;
        ctx.fillStyle = '#FFC107';
        ctx.fillRect(xmin, tagTop, tagWidth, 20);
        ctx.fillStyle = 'black';
        ctx.fillText(tagText, xmin + 5, tagTop + 15);
      }

      if (detections.length > 0) {
        detectionResult.replaceChildren(list);
      } else {
        detectionResult.textContent = "No objects detected.";
      }
    } catch (error) {
      console.error("Error during object detection:", error);
      detectionResult.textContent = "Error during detection. Please try again.";
    } finally {
      detectButton.disabled = false;
      loadingSpinner.style.display = 'none';
    }
  });
});
4.2.3. Exercises/Mini-Challenges: Object Detection Interaction
- Confidence Threshold: Add an input field (e.g., a slider from 0.0 to 1.0) for a “confidence threshold.” Only display bounding boxes and results for detections with a score above this threshold.
- Color-Coded Labels: Assign a different color to the bounding box and label text for each unique `label` detected (e.g., red for “person”, blue for “car”). You can use a simple hash function or a predefined color map for this.
- Real-time Webcam Object Detection: This is a more challenging but highly rewarding exercise. Integrate webcam input as in the classification exercise. Instead of classifying a single image, process frames from the live video stream on the canvas and draw bounding boxes in real-time. This will require careful management of `requestAnimationFrame` and `setTimeout` to avoid overwhelming the browser and model. Consider reducing the video resolution or frame rate for performance.