8. Project 2: Interactive Image Captioning Tool
This project will challenge you to build an interactive web application that generates descriptive captions for uploaded images. The app relies on a multimodal AI model, one that can process both visual and textual information, to understand and describe an image.
8.1. Project Objective and Problem Statement
Objective: Develop a client-side web application where users can upload an image, and the application uses a Transformers.js model to automatically generate a human-readable caption describing the image’s content.
Problem Statement: Describing images accurately is a complex task that combines computer vision and natural language generation. Many solutions rely on server-side processing, which means sending users' images to a remote server. We aim to create a privacy-preserving, interactive tool that performs this multimodal task directly in the browser.
8.2. Project Setup
Start with a clean project folder:
mkdir image-captioner-app
cd image-captioner-app
npm init -y
npm i @huggingface/transformers
npm i -g serve # If you don't have it already
Create index.html and app.js in your image-captioner-app directory.
8.3. Step-by-Step Implementation
Step 1: Basic HTML Structure (index.html)
Create index.html with an input for image upload, an image preview area, a button to trigger captioning, and a display area for the generated caption.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Interactive Image Captioning</title>
<link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@400;700&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Montserrat', sans-serif;
margin: 0;
padding: 20px;
background-color: #e8f5e9;
color: #333;
display: flex;
flex-direction: column;
align-items: center;
min-height: 100vh;
}
.container {
background-color: #ffffff;
padding: 40px;
border-radius: 12px;
box-shadow: 0 10px 25px rgba(0, 0, 0, 0.15);
width: 100%;
max-width: 800px;
text-align: center;
margin-top: 20px;
}
h1 {
color: #1a237e;
margin-bottom: 30px;
font-weight: 700;
}
input[type="file"] {
margin-bottom: 20px;
padding: 12px;
border: 1px solid #c8e6c9;
border-radius: 8px;
font-size: 16px;
width: calc(100% - 20px);
background-color: #f7fcf8;
cursor: pointer;
}
#imagePreviewContainer {
width: 100%;
max-width: 600px;
margin: 0 auto 20px auto;
border: 2px dashed #a5d6a7;
border-radius: 10px;
padding: 15px;
min-height: 250px;
display: flex;
align-items: center;
justify-content: center;
background-color: #f0fdf0;
overflow: hidden;
position: relative;
}
#imagePreview {
max-width: 100%;
max-height: 250px;
display: none; /* Hidden by default */
border-radius: 8px;
box-shadow: 0 4px 10px rgba(0, 0, 0, 0.05);
}
#previewPlaceholder {
color: #81c784;
font-style: italic;
font-size: 18px;
display: block; /* Shown by default */
}
button {
background-color: #4CAF50;
color: white;
padding: 15px 35px;
border: none;
border-radius: 8px;
cursor: pointer;
font-size: 19px;
font-weight: 700;
transition: background-color 0.3s ease, transform 0.2s ease;
display: inline-flex;
align-items: center;
justify-content: center;
gap: 12px;
margin-bottom: 20px;
}
button:hover:not(:disabled) {
background-color: #388e3c;
transform: translateY(-3px);
}
button:disabled {
background-color: #cccccc;
cursor: not-allowed;
transform: translateY(0);
}
#output {
margin-top: 30px;
padding: 25px;
border: 1px solid #b2dfdb;
border-radius: 10px;
background-color: #e0f2f7;
text-align: left;
min-height: 100px;
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
}
#captionResult {
font-size: 20px;
font-weight: 500;
line-height: 1.5;
color: #1a237e;
}
#loadingSpinner {
border: 4px solid #f3f3f3;
border-top: 4px solid #4CAF50;
border-radius: 50%;
width: 24px;
height: 24px;
animation: spin 1s linear infinite;
display: none; /* Hidden by default */
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
</style>
</head>
<body>
<div class="container">
<h1>Interactive Image Captioner</h1>
<input type="file" id="imageUpload" accept="image/*">
<div id="imagePreviewContainer">
<img id="imagePreview" src="" alt="Image Preview">
<p id="previewPlaceholder">Upload an image to get started</p>
</div>
<button id="captionButton">
<span id="loadingSpinner"></span> Generate Caption
</button>
<div id="output">
<p id="captionResult">Waiting for model to load...</p>
</div>
</div>
<script type="module" src="./app.js"></script>
</body>
</html>
Step 2: Initialize Multimodal Pipeline (app.js)
For image captioning, we need a multimodal model that can both interpret an image and generate text. A great choice is a vision encoder-decoder model such as Xenova/vit-gpt2-image-captioning, which pairs a ViT image encoder with a GPT-2 text decoder. Note that app.js imports Transformers.js from a CDN, so no bundler or build step is required; if you prefer, you can instead bundle the npm package installed during setup.
// app.js
import { pipeline } from "https://esm.sh/@huggingface/transformers";
document.addEventListener('DOMContentLoaded', async () => {
const imageUpload = document.getElementById('imageUpload');
const imagePreview = document.getElementById('imagePreview');
const previewPlaceholder = document.getElementById('previewPlaceholder');
const captionButton = document.getElementById('captionButton');
const captionResult = document.getElementById('captionResult');
const loadingSpinner = document.getElementById('loadingSpinner');
let imageCaptioner = null;
let currentImage = null; // Store the current image element
// --- Loading Model ---
captionButton.disabled = true;
loadingSpinner.style.display = 'inline-block';
captionResult.textContent = "Loading image captioning model... (this may take a while on first load)";
try {
imageCaptioner = await pipeline(
'image-to-text', // The task for image captioning
'Xenova/vit-gpt2-image-captioning', // ViT encoder + GPT-2 decoder for captioning
{
device: 'webgpu', // GPU acceleration (requires a WebGPU-capable browser)
dtype: 'q8', // 8-bit quantization balances speed and accuracy
}
);
captionResult.textContent = "Model loaded! Upload an image and generate a caption.";
captionButton.disabled = false;
} catch (error) {
console.error("Error loading image captioning model:", error);
captionResult.textContent = `Error loading model: ${error.message}. Check console.`;
} finally {
loadingSpinner.style.display = 'none';
}
// --- Image Upload Handler ---
imageUpload.addEventListener('change', (event) => {
const file = event.target.files[0];
if (file) {
const reader = new FileReader();
reader.onload = (e) => {
imagePreview.src = e.target.result;
imagePreview.style.display = 'block'; // Show the image
previewPlaceholder.style.display = 'none'; // Hide the placeholder
currentImage = imagePreview; // Store the image element for the pipeline
captionResult.textContent = "Image ready for captioning.";
};
reader.readAsDataURL(file);
} else {
imagePreview.src = '';
imagePreview.style.display = 'none';
previewPlaceholder.style.display = 'block';
currentImage = null;
captionResult.textContent = "No image selected.";
}
});
// --- Captioning Button Event Listener ---
captionButton.addEventListener('click', async () => {
if (!imageCaptioner) {
captionResult.textContent = "Model not loaded yet or failed to load.";
return;
}
if (!currentImage || !currentImage.src) {
captionResult.textContent = "Please upload an image first!";
return;
}
captionButton.disabled = true;
loadingSpinner.style.display = 'inline-block';
captionResult.textContent = "Generating caption...";
try {
// Pass the image's data URL; the pipeline accepts image URLs (including data URLs)
const output = await imageCaptioner(currentImage.src);
const generatedCaption = output[0].generated_text;
captionResult.textContent = generatedCaption;
} catch (error) {
console.error("Error during image captioning:", error);
captionResult.textContent = `Caption generation failed: ${error.message}`;
} finally {
captionButton.disabled = false;
loadingSpinner.style.display = 'none';
}
});
});
Step 3: Serve the Application
Navigate to your image-captioner-app directory in the terminal and run:
serve .
Open your browser to the provided local address (e.g., http://localhost:3000).
Step 4: Test and Experiment
- Upload various images (landscapes, objects, people, animals).
- Observe the captions generated. How descriptive are they? Are there any biases or inaccuracies?
- Try uploading images with complex scenes or unusual subjects.
8.4. Encouraging Independent Problem-Solving
Now, let’s enhance this application with more advanced features. Try to implement these on your own:
- Multiple Caption Generation: Modify the imageCaptioner call to generate num_return_sequences: 3 (or more) different captions for the same image. Display all generated captions, perhaps allowing the user to pick the best one or just listing them. This usually also involves sampling parameters like do_sample: true and temperature (see the first sketch after this list).
- Caption Length Control: Add input fields (like sliders or number inputs) for min_new_tokens and max_new_tokens to let users specify the desired length of the generated caption (also shown in the first sketch below).
- Visualizing Confidence (Advanced): Image captioning models don't typically output token-level confidence scores in the pipeline output, but the idea is to display something that indicates the model's certainty.
  - Alternative Idea 1 (Simulated Confidence): You could simulate this by running the captioner multiple times with slight temperature variations. If the captions are very similar, assume high confidence; if they vary wildly, assume lower confidence. Display a qualitative indicator (e.g., "High Confidence" / "Low Confidence"). A rough sketch of this heuristic follows the list.
  - Alternative Idea 2 (Zero-Shot Image-to-Text with Score): If you can find an image-to-text model that does provide token probabilities or a single score for the entire caption (less common with direct captioning, more common with VQA), integrate that.
- Integration with Object Detection: This is a truly advanced challenge! Combine the image captioning with the object detection pipeline from Chapter 4. When an image is uploaded:
  - First, run object detection and display the bounding boxes and labels.
  - Then, generate a caption.
  - Goal: Try to make the caption incorporate some of the detected objects. This is difficult because the captioning model won't directly "know" about the objects detected by another model. You might need to experiment with prompting strategies (e.g., provide a prompt to the captioning model like "Describe the image, paying attention to the detected objects: [list of detected objects]"). This delves into the realm of prompt engineering for multimodal models. The final sketch after this list shows one way to wire the two pipelines together.
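To get you started on the first two ideas, here is a minimal sketch of a helper that requests several captions of a bounded length in one call. It assumes the image-to-text pipeline forwards standard generation options (do_sample, temperature, num_return_sequences, min_new_tokens, max_new_tokens) to the underlying model; the helper name generateCaptionVariants and its default values are illustrative, so verify the option names against the Transformers.js documentation for your version.
// sketch: multiple captions with length control (illustrative helper, not part of app.js)
async function generateCaptionVariants(imageCaptioner, dataUrl, { count = 3, minTokens = 5, maxTokens = 40 } = {}) {
const outputs = await imageCaptioner(dataUrl, {
do_sample: true, // Sampling makes repeated generations differ
temperature: 0.9, // Higher values increase variety
num_return_sequences: count, // Ask for several alternative captions
min_new_tokens: minTokens, // Rough lower bound on caption length
max_new_tokens: maxTokens, // Rough upper bound on caption length
});
// Each result contains a generated_text field
return outputs.map((o) => o.generated_text);
}
// Example usage inside the click handler:
// const captions = await generateCaptionVariants(imageCaptioner, currentImage.src, { count: 3 });
// captionResult.innerHTML = captions.map((c) => `<p>${c}</p>`).join('');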
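The "simulated confidence" idea might look something like the next sketch: generate a few captions at different temperatures and measure how much their wording overlaps. The word-overlap (Jaccard) heuristic and the thresholds are arbitrary choices for illustration, not a calibrated confidence score.
// sketch: crude confidence signal from caption agreement (illustrative)
async function estimateCaptionAgreement(imageCaptioner, dataUrl) {
const temperatures = [0.7, 1.0, 1.3];
const captions = [];
for (const temperature of temperatures) {
const [result] = await imageCaptioner(dataUrl, { do_sample: true, temperature });
captions.push(result.generated_text.toLowerCase());
}
// Jaccard similarity between the word sets of the first caption and each other caption
const words = (text) => new Set(text.split(/\s+/).filter(Boolean));
const jaccard = (a, b) => {
const intersection = [...a].filter((w) => b.has(w)).length;
const union = new Set([...a, ...b]).size;
return union === 0 ? 0 : intersection / union;
};
const base = words(captions[0]);
const scores = captions.slice(1).map((c) => jaccard(base, words(c)));
const average = scores.reduce((sum, s) => sum + s, 0) / scores.length;
// Arbitrary thresholds for a qualitative label
const label = average > 0.6 ? 'High Confidence' : average > 0.3 ? 'Medium Confidence' : 'Low Confidence';
return { captions, average, label };
}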
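Finally, for the object detection challenge, the sketch below loads a second pipeline and combines the two outputs in the UI rather than inside the model, since Xenova/vit-gpt2-image-captioning does not accept a text prompt. Xenova/detr-resnet-50 is a commonly used detection checkpoint for Transformers.js, but any detection model from Chapter 4 should work, and the threshold value is just a starting point.
// sketch: combining object detection and captioning (illustrative)
import { pipeline } from "https://esm.sh/@huggingface/transformers";
async function describeWithObjects(dataUrl) {
const detector = await pipeline('object-detection', 'Xenova/detr-resnet-50');
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');
// Keep only reasonably confident detections
const detections = await detector(dataUrl, { threshold: 0.9 });
const labels = [...new Set(detections.map((d) => d.label))];
// Generate the caption independently of the detections
const [caption] = await captioner(dataUrl);
// The captioner never sees the detections, so we merge the results in the displayed text;
// making the caption itself mention the objects would need a prompt-aware captioning model
return {
caption: caption.generated_text,
detectedObjects: labels,
combined: `${caption.generated_text} (detected: ${labels.join(', ') || 'nothing above threshold'})`,
};
}
Calling describeWithObjects(currentImage.src) from the click handler gives you the caption, the detected labels, and a combined string to display; drawing the bounding boxes over the preview is left as part of the challenge.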