Text generation is the process of producing a new sequence of text based on a given text input. It can be achieved via several methods. In this tutorial we will explore how to generate text using three approaches, namely:
- Markov Chains
- Machine Learning
- Deep Learning
Text Generation using Markov Chains
We can use a Markov Chain to generate new text. To do so, we split the source text into individual words and then build a dictionary called chain that maps each word to a list of possible next words.
Next, we randomly select a starting word from the source text, and then use the Markov Chain to generate the specified number of words of new text. To do this, we repeatedly choose a random next word from the list of possible next words for the current word, and then add that word to the output text. We then update the current word to be the word we just added, and repeat the process until we’ve generated the desired amount of text.
Finally, we join the generated words back into a single string using the join() method, and return the resulting string.
import random

def generate_text(text, length):
    # Split the source text into individual words
    words = text.split()

    # Build the Markov chain: map each word to a list of possible next words
    chain = {}
    prev_word = words[0]
    for word in words[1:]:
        if prev_word not in chain:
            chain[prev_word] = []
        chain[prev_word].append(word)
        prev_word = word

    # Randomly select a starting word that has at least one successor
    word = random.choice(words)
    while word not in chain:
        word = random.choice(words)
    output = [word]

    # Repeatedly pick a random next word until the desired length is reached
    for i in range(length):
        if word not in chain:
            # Dead end: restart from a random word that has successors
            word = random.choice(list(chain.keys()))
        next_word = random.choice(chain[word])
        output.append(next_word)
        word = next_word

    return " ".join(output)
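A quick usage sketch of this function, with a short sample string as the source text (the sample is purely illustrative):

sample_text = "the quick brown fox jumps over the lazy dog and the quick grey cat watches the dog"
print(generate_text(sample_text, 10))  # prints 10 randomly chained words; output varies on each run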
This can be further optimised by using a defaultdict, choosing the starting word more carefully, and moving to a higher-order Markov chain that takes multiple previous words into account.
The reasoning behind each of these options is as follows:
Use a default dictionary: Instead of checking if a key exists in the dictionary before appending a new value, we can use the collections.defaultdict class to create a dictionary that automatically initializes new keys with an empty list:
from collections import defaultdict

def generate_text(text, length):
    words = text.split()
    chain = defaultdict(list)
    prev_word = words[0]
    for word in words[1:]:
        chain[prev_word].append(word)
        prev_word = word
    ...
Choose the starting word more intelligently: Instead of randomly selecting a starting word, we can choose a word that appears at the beginning of a sentence, since these are more likely to be good starting points for a coherent sentence:
def generate_text(text, length):
    sentences = text.split('.')
    sentence_starts = [s.strip().split()[0] for s in sentences if len(s.strip()) > 0]
    words = text.split()
    chain = defaultdict(list)
    for i in range(1, len(words)):
        chain[words[i-1]].append(words[i])
    prev_word = random.choice(sentence_starts)
    ...
Use higher-order Markov Chains: Instead of only considering the previous word when generating the next word, we can use a higher-order Markov Chain that takes into account multiple previous words. This can result in more coherent and meaningful generated text:
def generate_text(text, length, order=2):
    words = text.split()
    chain = defaultdict(list)
    for i in range(order, len(words)):
        prev_words = tuple(words[i-order:i])
        chain[prev_words].append(words[i])
    start_idxs = [i for i in range(len(words)) if words[i][0].isupper()]
    if len(start_idxs) == 0:
        start_idxs = [0]
    start_idx = random.choice(start_idxs)
    prev_words = tuple(words[start_idx:start_idx+order])
    output = list(prev_words)
    for i in range(length):
        if prev_words not in chain:
            prev_words = tuple(words[start_idx:start_idx+order])
        word = random.choice(chain[prev_words])
        output.append(word)
        prev_words = tuple(output[-order:])
    return " ".join(output)
In this optimized version of the function, we use a parameter called order to specify the order of the Markov Chain (i.e., the number of previous words to consider when generating the next word). We also choose the starting point more intelligently by selecting a random capitalized word, which is likely to begin a sentence, and we continue generating text by using the most recent order words as the key in the Markov Chain dictionary. This can result in more meaningful and coherent generated text, but it may require a larger source corpus to work effectively.
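As a quick usage sketch of the higher-order version (the sample text below is purely illustrative):

sample_text = ("The cat sat on the mat. The cat chased the mouse. "
               "The dog sat on the rug. The dog chased the cat.")
print(generate_text(sample_text, 15, order=2))  # output varies on each run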
We can convert all of this into a class for easy reuse in a package or a library.
import random

class MarkovChainGenerator:
    def __init__(self, order=1):
        self.order = order
        self.markov_dict = {}
        self.start_words = []
        self.end_words = []

    def _add_to_dict(self, words_list):
        for i in range(len(words_list) - self.order):
            key = tuple(words_list[i:i+self.order])
            value = words_list[i+self.order]
            if key not in self.markov_dict:
                self.markov_dict[key] = []
            self.markov_dict[key].append(value)
        self.start_words.append(tuple(words_list[:self.order]))
        self.end_words.append(words_list[-1])

    def add_text(self, text):
        words_list = text.split()
        self._add_to_dict(words_list)

    def add_corpus(self, corpus):
        for text in corpus:
            self.add_text(text)

    def generate_text(self, num_words=100):
        current_key = random.choice(self.start_words)
        text = list(current_key)
        while len(text) < num_words or text[-1] not in self.end_words:
            if current_key not in self.markov_dict:
                current_key = random.choice(self.start_words)
            next_word = random.choice(self.markov_dict[current_key])
            text.append(next_word)
            current_key = tuple(text[-self.order:])
        return " ".join(text)
In this implementation, the MarkovChainGenerator class has five methods:
- __init__: initializes the object with a specified order (which defaults to 1), an empty dictionary to hold the Markov chain data, and empty lists to hold the start and end words of each text added to the generator.
- _add_to_dict: a private helper method that takes a list of words, generates the corresponding Markov chain data, and adds it to the dictionary.
- add_text: takes a string of text, splits it into words, and adds it to the Markov chain data.
- add_corpus: takes a list of strings (a corpus), and adds each string to the Markov chain data using add_text.
- generate_text: takes an optional num_words argument (defaulting to 100), and generates a new text using the Markov chain data.
corpus = [
    "I am the very model of a modern major general",
    "I've information vegetable, animal, and mineral",
    "I know the kings of England, and I quote the fights historical",
    "From Marathon to Waterloo, in order categorical"
]

mcg = MarkovChainGenerator(order=2)
mcg.add_corpus(corpus)
print(mcg.generate_text())
We can further optimise it for larger texts by using a defaultdict:
from collections import defaultdict
import random

class MarkovChain:
    def __init__(self, order=1):
        self.order = order
        self.transitions = defaultdict(list)

    def add_text(self, text):
        words = text.split()
        for i in range(len(words) - self.order):
            key = tuple(words[i:i+self.order])
            value = words[i+self.order]
            self.transitions[key].append(value)

    def generate_text(self, length=100):
        current_state = random.choice(list(self.transitions.keys()))
        text = list(current_state)
        for i in range(length):
            if current_state not in self.transitions:
                # Dead end: restart from a random known state
                current_state = random.choice(list(self.transitions.keys()))
            next_word = random.choice(self.transitions[current_state])
            text.append(next_word)
            current_state = tuple(text[-self.order:])
        return ' '.join(text)
In this implementation, a defaultdict is used to store the transitions, with the key being a tuple of words of length order and the value being a list of words that follow the key in the input text.

The add_text method takes a string input, splits it into words, and then adds the transitions to the dictionary.

The generate_text method randomly chooses a starting state from the available keys in the dictionary, then generates text by randomly selecting a word from the list of possible values for the current state. If the current state is not in the dictionary, a new state is chosen at random from the available keys.

Using defaultdict simplifies the implementation by eliminating the need to check whether a key exists in the dictionary before appending to its list value. This can provide performance improvements for larger inputs where there may be a large number of unique keys.
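A brief usage sketch of this class, using a small illustrative string:

mc = MarkovChain(order=2)
mc.add_text("the cat sat on the mat and the cat chased the mouse around the mat")
print(mc.generate_text(length=20))  # output varies on each run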
Text Generation using Machine Learning
In the machine learning approach, a model trained on a large corpus generates new text based on a given input prompt. An example is calling one of OpenAI's GPT models (here the davinci engine, a GPT-3 model) through the OpenAI API:
import openai
openai.api_key = "YOUR_API_KEY" # replace with your OpenAI API key
# define the prompt for text generation
prompt = "The quick brown fox"
# set the temperature and max_tokens parameters for text generation
temperature = 0.7
max_tokens = 100
# use the GPT model to generate text based on the prompt
response = openai.Completion.create(
    engine="davinci",  # replace with the name of the model you want to use
    prompt=prompt,
    temperature=temperature,
    max_tokens=max_tokens,
)
# extract the generated text from the response object
generated_text = response.choices[0].text
# print the generated text
print(generated_text)
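If you prefer to run GPT-2 itself locally instead of calling a hosted API, a minimal sketch using the Hugging Face transformers library (assuming it is installed) looks like this:

from transformers import pipeline

# Download and load a pretrained GPT-2 model
generator = pipeline("text-generation", model="gpt2")

# Generate up to 50 tokens that continue the prompt
result = generator("The quick brown fox", max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])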
Text Generation Using Transformers
We can also use Transformer-based models for text generation with TensorFlow and PyTorch via the steps below:
- Import the necessary libraries and modules
- Define the hyperparameters for the model
- Define the input and output sequences
- Define the encoder and decoder layers
- Define the transformer model
- Define the loss function and optimizer
- Train the model
- Generate new text using the trained model
import numpy as np
import tensorflow as tf
# Define hyperparameters
max_seq_length = 50
batch_size = 32
embedding_dim = 128
num_heads = 4
dff = 512
num_layers = 4
dropout_rate = 0.1
vocab_size = 10000
epochs = 10
# Define input and output sequences
inputs = tf.keras.layers.Input(shape=(max_seq_length,))
targets = tf.keras.layers.Input(shape=(max_seq_length,))
# Define encoder layer
encoder = tf.keras.layers.Embedding(vocab_size, embedding_dim)(inputs)
encoder = tf.keras.layers.Dropout(dropout_rate)(encoder)
for i in range(num_layers):
    encoder = tf.keras.layers.MultiHeadAttention(num_heads, dff)(encoder, encoder)
    encoder = tf.keras.layers.Dropout(dropout_rate)(encoder)
    encoder = tf.keras.layers.LayerNormalization(epsilon=1e-6)(encoder)
# Define decoder layer
decoder = tf.keras.layers.Embedding(vocab_size, embedding_dim)(targets)
decoder = tf.keras.layers.Dropout(dropout_rate)(decoder)
for i in range(num_layers):
    decoder = tf.keras.layers.MultiHeadAttention(num_heads, dff)(decoder, decoder)
    decoder = tf.keras.layers.Dropout(dropout_rate)(decoder)
    decoder = tf.keras.layers.LayerNormalization(epsilon=1e-6)(decoder)
    decoder = tf.keras.layers.MultiHeadAttention(num_heads, dff)(decoder, encoder)
    decoder = tf.keras.layers.Dropout(dropout_rate)(decoder)
    decoder = tf.keras.layers.LayerNormalization(epsilon=1e-6)(decoder)
# Define output layer
outputs = tf.keras.layers.Dense(vocab_size, activation='softmax')(decoder)
# Define the model
model = tf.keras.models.Model(inputs=[inputs, targets], outputs=outputs)
# Define the loss function and optimizer
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)  # the Dense output layer already applies softmax
optimizer = tf.keras.optimizers.Adam()
# Compile the model
model.compile(optimizer=optimizer, loss=loss)
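# NOTE: `dataset` and `tokenizer` are not defined in the original snippet. The lines
# below are an assumed, minimal sketch of how they might be built from a hypothetical
# list of raw strings called `texts`.
texts = ["the quick brown fox jumps over the lazy dog"]  # placeholder corpus
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_seq_length, padding='post')
# Feed the padded sequences as encoder input, decoder input, and labels in this
# simplified sketch (a real setup would shift the decoder inputs/labels by one step).
dataset = tf.data.Dataset.from_tensor_slices(((padded, padded), padded)).batch(batch_size)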
# Train the model
model.fit(dataset, epochs=epochs)
# Generate new text using the trained model
def generate_text(model, start_string):
    # Tokenize the input string
    input_eval = tokenizer.texts_to_sequences([start_string])
    input_eval = tf.keras.preprocessing.sequence.pad_sequences(input_eval, maxlen=max_seq_length, padding='post')
    # Initialize the output string
    output_string = start_string
    # Generate new text word by word (stop one step early so the update below stays in bounds)
    for i in range(max_seq_length - 1):
        # Generate the predictions
        predictions = model.predict([input_eval, input_eval])
        # Get the predicted next word index
        predicted_index = np.argmax(predictions[:, i, :])
        # Convert the index to its corresponding word
        predicted_word = tokenizer.index_word[predicted_index]
        # Add the predicted word to the output string
        output_string += ' ' + predicted_word
        # Update the input sequence with the predicted token
        input_eval[0, i + 1] = predicted_index
    return output_string
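A quick usage sketch, assuming the model has been trained and the tokenizer above has been fitted on the training corpus (the seed string is illustrative):

seed = "the quick brown"
print(generate_text(model, seed))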
Text Generation Using PyTorch Transformer
import torch
import torch.nn as nn

# Device for model parameters and inputs
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class TransformerModel(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim, n_layers, n_heads, pf_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, hidden_dim)
        self.pos_embedding = nn.Embedding(1000, hidden_dim)
        self.encoder_layer = nn.TransformerEncoderLayer(hidden_dim, n_heads, pf_dim, dropout)
        self.encoder = nn.TransformerEncoder(self.encoder_layer, n_layers)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([hidden_dim])).to(device)

    def forward(self, src, src_mask):
        # src = [src len, batch size]
        # src_mask = [batch size, src len] (True where a position is padding)
        src = self.dropout((self.embedding(src) * self.scale)
                           + self.pos_embedding(torch.arange(0, src.shape[0]).unsqueeze(1).to(device)))
        # src = [src len, batch size, hid dim]
        src = self.encoder(src, src_key_padding_mask=src_mask)
        # src = [src len, batch size, hid dim]
        output = self.fc_out(src[-1, :, :])
        # output = [batch size, output dim]
        return output
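A quick forward-pass sketch with purely illustrative hyperparameters (none of these values come from the original):

# Instantiate the model with hypothetical dimensions
model = TransformerModel(input_dim=10000, output_dim=10000, hidden_dim=256,
                         n_layers=3, n_heads=8, pf_dim=512, dropout=0.1).to(device)
# A dummy batch: sequences of 20 token ids, batch size of 4, no padded positions
src = torch.randint(0, 10000, (20, 4)).to(device)
src_mask = torch.zeros(4, 20, dtype=torch.bool).to(device)
logits = model(src, src_mask)
print(logits.shape)  # torch.Size([4, 10000]) -- one score per vocabulary entry for each sequence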
There are many other ways to generate text in Python apart from Markov chains and GPT. These include:
- Recurrent Neural Networks (RNNs) – RNNs are a type of neural network that can be used for sequence prediction and generation, including text generation.
- Long Short-Term Memory (LSTM) Networks – LSTMs are a type of RNN that are particularly good at processing and generating sequences of data (a minimal sketch is shown after this list).
- Neural Machine Translation (NMT) – NMT models are used for machine translation tasks, but can also be used for text generation.
- Variational Autoencoders (VAEs) – VAEs are a type of generative model that can be used for text generation.
- Transformer-based Models – Transformer models, like GPT, are a type of neural network that uses attention mechanisms to generate text.
- Neural Dialogue Models – Dialogue models are designed specifically for generating conversational text, and can be based on RNNs or transformer models.
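As a minimal, hypothetical sketch of the LSTM approach mentioned above (the layer sizes are illustrative, and vocab_size and max_seq_length are assumed to be defined as earlier):

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128, input_length=max_seq_length),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(vocab_size, activation='softmax'),
])
lstm_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Trained on (sequence, next word) pairs, the model predicts the next word, which can be
# appended to a seed text and fed back in to generate longer passages.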
There are many other techniques for text generation in Python, and the choice of which one to use depends on the specific task and the requirements of the project.
We have seen how to generate text using several methods in Python. Thank you for your attention.
Jesus Saves
By Jesse E. Agbe (JCharis)