TensorFlow Guide: Guided Project 2 - Text Generation with LSTMs

8. Guided Project 2: Text Generation with LSTMs

In this project, you’ll build a character-level text generation model using Long Short-Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN). The model will learn patterns in text and then be able to generate new sequences of characters, essentially writing new “sentences” based on what it learned.

Project Objective

Build an LSTM-based model to generate creative text, trained on a classic text dataset. We’ll use a portion of Shakespeare’s works.

Step 1: Download and Prepare the Text Data

First, we need to download a text file and preprocess it for our model.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import os
import time

# Download a text dataset (a portion of Shakespeare's plays)
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# Read the file and decode it as UTF-8 text
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# Length of text in characters
print(f'Length of text: {len(text)} characters')

# Take a look at the first 250 characters
print('\nFirst 250 characters:\n', text[:250])

# The unique characters in the file
vocab = sorted(set(text))
print(f'\n{len(vocab)} unique characters')

# Create a mapping from unique characters to indices
char_to_idx = {char: idx for idx, char in enumerate(vocab)}
idx_to_char = np.array(vocab)

# Convert the entire text to numbers
text_as_int = np.array([char_to_idx[char] for char in text])

print(f"\nText example (first 13 chars to int): {text_as_int[:13]}")
print(f"Text example (first 13 chars back to text): {''.join(idx_to_char[text_as_int[:13]])}")

Self-Check: Confirm the length of the text, number of unique characters, and that the character-to-index mapping works correctly.
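
If you want to verify the mapping programmatically, a minimal round-trip check like the one below (reusing the char_to_idx and idx_to_char objects defined above) should reproduce the original snippet exactly.

# Optional sanity check: encode a short snippet and decode it back
sample = text[:20]
encoded = np.array([char_to_idx[c] for c in sample])
decoded = ''.join(idx_to_char[encoded])
assert decoded == sample, "Round-trip encoding/decoding failed"
print("Character mapping round-trip OK:", repr(decoded))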

Step 2: Create Training Sequences

Our model will predict the next character given a sequence of preceding characters. We’ll divide the text into sequences of a fixed length.

Encourage Independent Problem-Solving: How would you create input-output pairs from text_as_int such that each input is a sequence of seq_length characters and the corresponding output is the next character?
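
Before looking at the tf.data pipeline below, it can help to see the idea on a tiny example. This is a toy sketch in plain Python (the string "Hello" is just an illustration, not part of the pipeline):

# Toy illustration of the input/target split on a short string
toy = "Hello"
chunk = toy[:5]                       # take seq_length + 1 = 5 characters
toy_input, toy_target = chunk[:-1], chunk[1:]
print(toy_input)   # "Hell"  (what the model sees)
print(toy_target)  # "ello"  (what the model should predict, shifted by one)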

# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text_as_int) // (seq_length + 1)

# Create training examples/targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# Use batch to create sequences of desired length
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

# For each sequence, duplicate and shift to form input and target text
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# Print a few examples of input and target sequences
print("\n--- Sample Input-Target Sequences ---")
for input_example, target_example in dataset.take(1):
    print('Input data:', ''.join(idx_to_char[input_example.numpy()]))
    print('Target data:', ''.join(idx_to_char[target_example.numpy()]))

# Batch the dataset
BATCH_SIZE = 64
BUFFER_SIZE = 10000 # Shuffle buffer size: tf.data shuffles within a buffer of this many elements

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

print(f"\nDataset prepared with batch size: {BATCH_SIZE}")
print(f"Sample batch from dataset (input shape): {next(iter(dataset))[0].shape}")
print(f"Sample batch from dataset (target shape): {next(iter(dataset))[1].shape}")

Self-Check: Verify that the input and target sequences are correctly shifted by one character. Input shape should be (BATCH_SIZE, seq_length) and target shape (BATCH_SIZE, seq_length).
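
If you prefer to check this in code rather than by eye, a short assertion block along these lines (reusing the dataset built above) should pass without errors:

# Optional programmatic check of the shapes and the one-character shift
for input_batch, target_batch in dataset.take(1):
    assert input_batch.shape == (BATCH_SIZE, seq_length)
    assert target_batch.shape == (BATCH_SIZE, seq_length)
    # Within a sequence, the target at position i equals the input at position i + 1
    assert np.array_equal(input_batch[0, 1:].numpy(), target_batch[0, :-1].numpy())
print("Input/target shapes and shift look correct.")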

Step 3: Build the Text Generation Model (RNN/LSTM)

We’ll use Keras to build an RNN model with LSTM layers.

Encourage Independent Problem-Solving: Design an LSTM-based model for character-level text generation. Consider these layers:

  1. An Embedding layer to convert character indices into dense vectors.
  2. One or more LSTM layers to process sequences.
  3. A Dense output layer with softmax activation to predict the next character’s probability distribution over the vocabulary.
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = keras.Sequential([
        # Embedding layer: Maps each character index to a dense vector
        layers.Embedding(vocab_size, embedding_dim,
                         batch_input_shape=[batch_size, None]), # None for dynamic sequence length

        # LSTM layer(s): Process sequences. return_sequences=True to stack LSTMs
        layers.LSTM(rnn_units,
                    return_sequences=True,
                    stateful=True, # Stateful LSTMs maintain their internal state across batches
                    recurrent_initializer='glorot_uniform'),
        layers.Dropout(0.2), # Add dropout for regularization

        # You can add more LSTM layers here
        # layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        # layers.Dropout(0.2),

        # Dense output layer: Predicts logits for each character in the vocabulary
        layers.Dense(vocab_size)
    ])
    return model

model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

model.summary()

Self-Check: Review the model summary. Ensure the Embedding output shape and the final Dense output shape are as expected. Note stateful=True, which is important for text generation where the state should carry over across calls.
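
One more optional check before training: feeding a single batch through the untrained model should give logits of shape (BATCH_SIZE, seq_length, vocab_size), and the initial loss should be close to ln(vocab_size), since an untrained model guesses roughly uniformly over the vocabulary. A sketch of that check:

# Optional: run one batch through the untrained model to inspect shapes and initial loss
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print("Prediction shape:", example_batch_predictions.shape)  # (BATCH_SIZE, seq_length, vocab_size)

    example_batch_loss = keras.losses.sparse_categorical_crossentropy(
        target_example_batch, example_batch_predictions, from_logits=True)
    print("Mean loss on the untrained model:", example_batch_loss.numpy().mean())
    print("ln(vocab_size) for comparison:   ", np.log(vocab_size))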

Step 4: Compile and Train the Model

We’ll compile the model and train it. For text generation, sparse_categorical_crossentropy is typically used because targets are integer indices of the next character, and the model outputs logits over the vocabulary.

# Custom loss function: the model outputs raw logits, so we set from_logits=True
def loss(labels, logits):
    return keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)

# ModelCheckpoint to save weights
checkpoint_dir = './text_generation_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

EPOCHS = 30 # Number of epochs, can be increased for better results

print("\nStarting text generation model training...\n")
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
print("\nText generation model training finished!")

Self-Check: Observe the loss decreasing over epochs. The model is learning!
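
If you have matplotlib installed (an assumption; it is not used elsewhere in this project), plotting the loss curve from history makes the trend easier to see than scanning the epoch logs:

# Optional: plot the training loss per epoch (assumes matplotlib is available)
import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training loss per epoch')
plt.show()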

Step 5: Generate Text

After training, the fun part: generating new text! We’ll use the trained model to predict one character at a time, feeding its output back as the next input.

Encourage Independent Problem-Solving: How would you write a function to generate text:

  1. Given a starting string.
  2. With a specified number of characters to generate.
  3. By converting the input string to numbers.
  4. Feeding it to the model to get predictions for the next character.
  5. Sampling the next character (instead of always picking the argmax, to introduce randomness).
  6. Repeating this process.
# To make predictions, the model needs to expect a batch size of 1.
# We'll rebuild the model with batch_size=1 and load the trained weights.
model_generation = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

# Load the weights from the latest checkpoint in the checkpoint directory
model_generation.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model_generation.build(tf.TensorShape([1, None]))

model_generation.summary()

# Function to generate text
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    # Evaluation step (generating text using the learned model)

    # Convert start string to numbers (vector of chars)
    input_eval = [char_to_idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0) # Add batch dimension

    # Empty string to store our results
    text_generated = []

    # Low temperatures result in more predictable text;
    # higher temperatures result in more surprising text.
    # Experiment with different temperatures to see the effects.

    # Here batch size is 1
    model.reset_states() # Reset RNN states before starting a new generation
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Scale the logits by the temperature and sample the next character from a categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx_to_char[predicted_id])

    return start_string + ''.join(text_generated)

# Generate text with different starting strings and temperatures
print("\n--- Generated Text (Temperature 1.0) ---")
print(generate_text(model_generation, start_string="ROMEO:", num_generate=500, temperature=1.0))

print("\n\n--- Generated Text (Temperature 0.5 - more conservative) ---")
print(generate_text(model_generation, start_string="JULIET:", num_generate=500, temperature=0.5))

print("\n\n--- Generated Text (Temperature 1.5 - more adventurous) ---")
print(generate_text(model_generation, start_string="LORD:", num_generate=500, temperature=1.5))

Self-Check: Observe the generated text. Does it somewhat resemble the style of Shakespeare? Do different temperatures produce different types of text (e.g., more coherent vs. more random)?
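
To build intuition for why temperature changes the output, here is a small self-contained sketch. The logits are made up for illustration; dividing them by a temperature before the softmax sharpens or flattens the distribution the characters are sampled from:

# Toy illustration: how temperature reshapes the sampling distribution (made-up logits)
toy_logits = tf.constant([[2.0, 1.0, 0.1]])
for t in [0.5, 1.0, 1.5]:
    probs = tf.nn.softmax(toy_logits / t).numpy()[0]
    print(f"temperature={t}: {np.round(probs, 3)}")
# Lower temperatures sharpen the distribution (more predictable choices);
# higher temperatures flatten it (more surprising choices).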

Experiment and Improve:

  • Try increasing the number of EPOCHS.
  • Experiment with different rnn_units or adding another LSTM layer in build_model.
  • Modify seq_length.
  • Use a different text dataset! One way to swap in your own text file is sketched below.
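
As one sketch of that last idea, assuming you have a local plain-text file (my_corpus.txt here is a placeholder path), you can replace the download step and leave the rest of the pipeline unchanged:

# Train on your own text: read a local UTF-8 file instead of downloading Shakespeare
# (my_corpus.txt is a placeholder path; replace it with your own file)
with open('my_corpus.txt', 'rb') as f:
    text = f.read().decode(encoding='utf-8')

# Everything downstream (vocab, char_to_idx, sequences, model) stays the same
vocab = sorted(set(text))
print(f'{len(text)} characters, {len(vocab)} unique')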

Congratulations! You’ve completed your second guided project, building a sequence model to generate text using LSTMs. This project gave you hands-on experience with preparing sequential data, designing RNN architectures, and implementing creative deep learning applications.