4. Working with Data: tf.data API
Efficiently loading, preprocessing, and feeding data to your models is crucial for performance, especially with large datasets. TensorFlow’s tf.data API is designed to build high-performance input pipelines that are robust, flexible, and scalable.
4.1 Why tf.data?
Traditional data loading often involves reading all data into memory or iterating over files one by one. This can be slow and memory-intensive. The tf.data API solves this by:
- Streaming Data: It can process data elements one by one or in small batches, avoiding the need to load the entire dataset into memory.
- Performance Optimizations: It offers built-in optimizations like parallel processing, prefetching, and caching to keep your GPUs/CPUs busy.
- Flexibility: It supports various data sources (arrays, files, custom generators) and offers a rich set of chainable transformations, previewed in the short sketch below.
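As a quick preview of the style this enables (each step is covered in detail in sections 4.3 and 4.4; the data and the specific numbers here are arbitrary), a small pipeline chains these pieces together:
import tensorflow as tf
# Arbitrary in-memory data, purely to illustrate the chaining style
preview_features = tf.random.uniform((100, 8))
preview_labels = tf.random.uniform((100,), maxval=2, dtype=tf.int32)
preview_pipeline = (
    tf.data.Dataset.from_tensor_slices((preview_features, preview_labels))
    .shuffle(buffer_size=100)                      # randomize element order
    .map(lambda x, y: (x * 2.0, y),
         num_parallel_calls=tf.data.AUTOTUNE)      # element-wise preprocessing
    .batch(16)                                     # group elements into batches
    .prefetch(tf.data.AUTOTUNE)                    # overlap with model execution
)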
4.2 Creating a tf.data.Dataset
You can create a dataset from various sources.
From Tensors/NumPy Arrays
This is the simplest way for small datasets already in memory.
import tensorflow as tf
import numpy as np
# From a single tensor
dataset_single_tensor = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
for element in dataset_single_tensor:
    print(element.numpy())
print("\n---")
# From multiple tensors (features and labels)
features = tf.constant([[10, 20], [30, 40], [50, 60]])
labels = tf.constant([0, 1, 0])
dataset_multi_tensor = tf.data.Dataset.from_tensor_slices((features, labels)) # Tuple of tensors
for x, y in dataset_multi_tensor:
    print(f"Features: {x.numpy()}, Label: {y.numpy()}")
print("\n---")
# From NumPy arrays
numpy_features = np.array([[1.1, 1.2], [2.1, 2.2]])
numpy_labels = np.array([0, 1])
dataset_numpy = tf.data.Dataset.from_tensor_slices((numpy_features, numpy_labels))
for x, y in dataset_numpy:
    print(f"NumPy Features: {x.numpy()}, NumPy Label: {y.numpy()}")
From Generators
For custom data loading logic or when data doesn’t fit in memory, generators are useful.
import tensorflow as tf
# A simple Python generator function
def data_generator():
    for i in range(1, 6):
        yield i, i * 10  # Yield (feature, label) pairs
# Create a dataset from the generator
# output_types specifies the data types of the yielded elements
# output_shapes specifies the shape of the yielded elements (None for dynamic shapes)
dataset_from_generator = tf.data.Dataset.from_generator(
    data_generator,
    output_types=(tf.int32, tf.int32),
    output_shapes=(None, None)  # Or (tf.TensorShape([]), tf.TensorShape([])) for scalars
)
print("\n--- Dataset from Generator ---")
for x, y in dataset_from_generator:
    print(f"Generated X: {x.numpy()}, Generated Y: {y.numpy()}")
4.3 Dataset Transformations: Building the Pipeline
The real power of tf.data comes from its chainable transformations.
map(): Applying Transformations to Each Element
The map() transformation applies a function to each element of the dataset. The function is traced into a TensorFlow graph, so it should ideally be built from TensorFlow ops (or wrap Python code with tf.py_function) for performance. This is where you perform preprocessing, such as scaling, image resizing, or text tokenization.
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
def square_and_add_one(x):
    return tf.square(x) + 1
# Apply the function to each element
mapped_dataset = dataset.map(square_and_add_one)
print("\n--- Mapped Dataset (square_and_add_one) ---")
for element in mapped_dataset:
    print(element.numpy())
# With multiple inputs/outputs
features = tf.constant([1.0, 2.0, 3.0])
labels = tf.constant([0, 1, 0])
dataset_pair = tf.data.Dataset.from_tensor_slices((features, labels))
def preprocess_pair(x, y):
    x_scaled = x * 10  # Scale feature
    y_one_hot = tf.one_hot(y, depth=2)  # One-hot encode label (assuming 2 classes)
    return x_scaled, y_one_hot
processed_dataset = dataset_pair.map(preprocess_pair)
print("\n--- Mapped Dataset (processed pairs) ---")
for x, y in processed_dataset:
    print(f"Processed X: {x.numpy()}, Processed Y: {y.numpy()}")
# For images, map is used for resizing, augmentation, etc.
# For optimal performance, use num_parallel_calls=tf.data.AUTOTUNE with map()
# This parallelizes data processing across multiple CPU cores.
# processed_dataset = dataset_pair.map(preprocess_pair, num_parallel_calls=tf.data.AUTOTUNE)
filter(): Selecting Elements
Filters elements based on a boolean-returning function.
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
def is_even(x):
    return x % 2 == 0
filtered_dataset = dataset.filter(is_even)
print("\n--- Filtered Dataset (even numbers) ---")
for element in filtered_dataset:
    print(element.numpy())
shuffle(): Randomizing Order
Crucial for training to prevent models from learning the order of data.
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# buffer_size should be larger than your dataset for perfect shuffling,
# or large enough to provide a good level of randomness.
shuffled_dataset = dataset.shuffle(buffer_size=10)
print("\n--- Shuffled Dataset ---")
for element in shuffled_dataset:
    print(element.numpy())
batch(): Grouping Elements
Combines consecutive elements into batches. This is essential for mini-batch gradient descent, where each training step processes one batch of data.
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
batched_dataset = dataset.batch(batch_size=3)
print("\n--- Batched Dataset ---")
for batch in batched_dataset:
    print(batch.numpy())
# Handling remaining elements: drop_remainder=True will discard batches smaller than batch_size
batched_dataset_dr = dataset.batch(batch_size=3, drop_remainder=True)
print("\n--- Batched Dataset (drop_remainder=True) ---")
for batch in batched_dataset_dr:
    print(batch.numpy())
repeat(): Iterating Multiple Epochs
Repeats the dataset indefinitely or for a specified number of epochs.
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices([1, 2])
repeated_dataset = dataset.repeat(count=3) # Repeats 3 times
print("\n--- Repeated Dataset ---")
for element in repeated_dataset:
    print(element.numpy())
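Calling repeat() with no count repeats the dataset indefinitely, which is common when the number of training steps is controlled elsewhere. A small sketch that keeps the loop finite with take():
infinite_dataset = dataset.repeat()  # No count: repeats [1, 2] forever
print("\n--- Indefinitely Repeated Dataset (first 5 elements) ---")
for element in infinite_dataset.take(5):  # take() bounds the iteration
    print(element.numpy())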
prefetch(): Overlapping Preprocessing and Model Execution
This is a key optimization. It allows the data pipeline to fetch the next batch of data while the model is training on the current batch. This keeps your GPU (or CPU) busy and prevents idle time.
tf.data.AUTOTUNE lets tf.data pick the value dynamically at runtime: here it tunes the number of elements to prefetch, and it serves the same purpose for num_parallel_calls in map().
import tensorflow as tf
import time
# Simulate slow data loading and processing
def slow_data_generator():
    for i in range(5):
        time.sleep(0.1)  # Simulate I/O delay
        yield i
def slow_preprocessing(x):
    # Wrap the Python sleep in tf.py_function so the delay happens for every element at
    # runtime; a bare time.sleep() inside a mapped function would only run once, while
    # the function is being traced into a graph.
    def _slow(v):
        time.sleep(0.05)  # Simulate CPU processing delay
        return v * 2
    return tf.py_function(_slow, inp=[x], Tout=tf.int32)
# Without prefetching
dataset_no_prefetch = tf.data.Dataset.from_generator(
    slow_data_generator, output_types=tf.int32
).map(slow_preprocessing).batch(2)
start_time = time.time()
print("\n--- Without Prefetching ---")
for batch in dataset_no_prefetch:
    print(f"Processed batch: {batch.numpy()}")
print(f"Time without prefetch: {time.time() - start_time:.4f} seconds\n")
# With prefetching
dataset_with_prefetch = tf.data.Dataset.from_generator(
    slow_data_generator, output_types=tf.int32
).map(slow_preprocessing, num_parallel_calls=tf.data.AUTOTUNE).batch(2).prefetch(tf.data.AUTOTUNE)
start_time = time.time()
print("\n--- With Prefetching ---")
for batch in dataset_with_prefetch:
    print(f"Processed batch: {batch.numpy()}")
print(f"Time with prefetch: {time.time() - start_time:.4f} seconds")
You should observe that the version with prefetching (and parallel map calls) finishes faster, because data loading and preprocessing overlap with the loop that consumes the data. The gap is modest in this toy example; in a real training loop, where each step does substantial work on the accelerator, the savings are usually much larger.
4.4 Building an Optimized Input Pipeline: Best Practices
A typical, optimized tf.data pipeline often follows this order:
- Read/Load: tf.data.Dataset.from_tensor_slices(), tf.data.TFRecordDataset(), etc. (a short sketch of a file-based read step follows the full example below).
- Shuffle: dataset.shuffle(buffer_size). Do this early to ensure good randomness.
- Map: dataset.map(preprocessing_function, num_parallel_calls=tf.data.AUTOTUNE) for element-wise transformations.
- Cache (optional): dataset.cache(). If your data fits in memory and preprocessing is expensive, cache after map().
- Batch: dataset.batch(batch_size).
- Prefetch: dataset.prefetch(tf.data.AUTOTUNE). Always add this as the last step.
import tensorflow as tf
import numpy as np
# Simulate some large data (e.g., image paths and labels)
NUM_SAMPLES = 10000
image_paths = [f'/data/img_{i}.png' for i in range(NUM_SAMPLES)]
labels = np.random.randint(0, 10, size=NUM_SAMPLES) # 10 classes
# 1. Create a dataset of (image_path, label) pairs
dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
# 2. Shuffle early for better randomness
BUFFER_SIZE = 5000 # Larger buffer for better shuffling
dataset = dataset.shuffle(buffer_size=BUFFER_SIZE)
# Simulate image loading and preprocessing (this would be where you load from disk)
def load_and_preprocess_image(image_path, label):
    # In a real scenario, you'd use tf.io.read_file, tf.image.decode_jpeg, etc.
    # For this example, we just create a dummy tensor and (optionally) simulate
    # a small delay for I/O and computation.
    # tf.py_function wraps arbitrary Python functions for tf.data; it is useful
    # for logic that cannot be expressed purely in TensorFlow ops.
    def _load_and_preprocess_py(path, lbl):
        # time.sleep(0.01)  # Simulate I/O or heavy CPU work
        dummy_image = np.random.rand(64, 64, 3).astype(np.float32)  # Dummy 64x64 RGB image
        return dummy_image, lbl
    image, new_label = tf.py_function(
        _load_and_preprocess_py,
        inp=[image_path, label],
        Tout=[tf.float32, tf.int64]  # Specify output types
    )
    # Important: set the shape after py_function, as it loses static shape info
    image.set_shape([64, 64, 3])
    new_label.set_shape([])  # Scalar label
    return image, new_label
# 3. Map preprocessing function, parallelize with AUTOTUNE
dataset = dataset.map(load_and_preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
# 4. (Optional) Cache if data fits in memory and mapping is expensive
# dataset = dataset.cache()
# 5. Batch the data
BATCH_SIZE = 32
dataset = dataset.batch(BATCH_SIZE)
# 6. Prefetch to overlap data production and consumption
dataset = dataset.prefetch(tf.data.AUTOTUNE)
# Now you can iterate over this `dataset` in your training loop
print("\n--- Optimized Pipeline Example ---")
for i, (images, labels) in enumerate(dataset.take(3)):  # Take 3 batches for demonstration
    print(f"Batch {i+1}: Images shape={images.shape}, Labels shape={labels.shape}")
Exercise 4.1: Building a Custom Image Data Pipeline
- Objective: Construct an efficient tf.data pipeline for a simulated image dataset, including shuffling, preprocessing, and optimization.
- Instructions:
  - Simulate Data:
    - Create 1000 dummy image paths (e.g., f'img_{i}.jpg').
    - Create 1000 dummy labels (integers 0-9).
  - Preprocessing Function: Write a Python function process_image(image_path, label) that:
    - Simulates loading an image by creating a random tensor of shape (128, 128, 3) (e.g., using tf.random.uniform).
    - Normalizes the pixel values to be between 0 and 1.
    - Applies random horizontal flipping to the image (use tf.image.random_flip_left_right).
    - Returns the processed image and the label.
    - Crucially: Use tf.py_function to wrap your Python logic if it is not purely TensorFlow operations, and remember to set_shape after using tf.py_function.
  - Build the Pipeline:
    - Create a tf.data.Dataset from the simulated image paths and labels.
    - Apply shuffle with a buffer size of 500.
    - Apply your process_image function using map and num_parallel_calls=tf.data.AUTOTUNE.
    - Batch the dataset with a batch_size of 64.
    - Apply prefetch(tf.data.AUTOTUNE).
  - Verify: Iterate through 2-3 batches of your final dataset and print the shapes of the images and labels to ensure they are correct.
- Expected Output: Messages showing the shapes of images and labels for a few batches, confirming the pipeline is working as expected.
# Your solution for Exercise 4.1 here
import tensorflow as tf
import numpy as np
import time
# 1. Simulate Data
NUM_SAMPLES = 1000
image_paths_dummy = [f'dummy_path_{i}.jpg' for i in range(NUM_SAMPLES)]
labels_dummy = np.random.randint(0, 10, size=NUM_SAMPLES) # 10 classes
# Create a tf.data.Dataset from dummy paths and labels
raw_dataset = tf.data.Dataset.from_tensor_slices((image_paths_dummy, labels_dummy))
# 2. Preprocessing Function
# This function would typically load an actual image from disk
# and perform operations like resizing, augmentation, etc.
def process_image(image_path, label):
    # Simulate loading an image by creating a random tensor
    # In a real scenario:
    # img_raw = tf.io.read_file(image_path)
    # img_tensor = tf.image.decode_jpeg(img_raw, channels=3)
    img_tensor = tf.random.uniform(shape=(128, 128, 3), minval=0.0, maxval=255.0, dtype=tf.float32)
    # Normalize pixel values to 0-1
    img_normalized = img_tensor / 255.0
    # Apply random horizontal flipping
    img_augmented = tf.image.random_flip_left_right(img_normalized)
    return img_augmented, label
# Every operation in process_image is a TensorFlow op, so tf.py_function is not strictly
# necessary here; writing the function purely in TensorFlow gives the best performance.
# If you had non-TensorFlow Python logic, you would wrap it with tf.py_function and
# call set_shape afterwards, as noted in the exercise instructions.
# 3. Build the Pipeline
BUFFER_SIZE = 500
BATCH_SIZE = 64
# Shuffle early
dataset_pipeline = raw_dataset.shuffle(buffer_size=BUFFER_SIZE)
# Map preprocessing function, parallelize
dataset_pipeline = dataset_pipeline.map(process_image, num_parallel_calls=tf.data.AUTOTUNE)
# Batch the data
dataset_pipeline = dataset_pipeline.batch(BATCH_SIZE)
# Prefetch to overlap data production and consumption
dataset_pipeline = dataset_pipeline.prefetch(tf.data.AUTOTUNE)
# 4. Verify the pipeline
print("\n--- Verifying Custom Image Data Pipeline ---")
for i, (images, labels) in enumerate(dataset_pipeline.take(3)):  # Take 3 batches for verification
    print(f"Batch {i+1}:")
    print(f"  Images shape: {images.shape}")
    print(f"  Images dtype: {images.dtype}")
    print(f"  Labels shape: {labels.shape}")
    print(f"  Labels dtype: {labels.dtype}")
    # Assertions to ensure correct shapes and types
    assert images.shape == (BATCH_SIZE, 128, 128, 3), "Image batch shape is incorrect!"
    assert labels.shape == (BATCH_SIZE,), "Label batch shape is incorrect!"
    assert images.dtype == tf.float32, "Image dtype is incorrect!"
    assert labels.dtype == tf.int64 or labels.dtype == tf.int32, "Label dtype is incorrect!"
print("\nPipeline verified successfully for a few batches!")
You’ve now learned how to construct powerful and efficient data pipelines using tf.data, which is fundamental for working with real-world datasets in TensorFlow. Next, we’ll delve into more advanced training techniques.