A Practical Guide to Text Modeling with PyTorch

Natural Language Processing (NLP) is at the heart of many AI-powered applications—sentiment analysis, chatbots, document summarizers, and much more. In this guide, we’ll explore how to build powerful NLP models using PyTorch, one of the most popular deep learning frameworks. We’ll focus on four core tasks, providing code samples and showing how they’re used in real-world projects.

  • Text Classification
  • Text Generation
  • Summarization/Translation (with Transformers)
  • Transfer Learning (with Pretrained Models)

Let’s roll up our sleeves and dive in!


1. Text Classification: Predicting Sentiment with PyTorch

Use-case: “Is this movie review positive or negative?”
Popular dataset: IMDB Movie Reviews

How does it work?
We preprocess reviews by tokenizing and converting words to indices, pad sequences to a uniform length, and batch our data. The model learns from batches, predicting class labels (e.g., positive=1, negative=0).

Sample dataset entry:

"review": "The plot was dull and predictable.", "label": 0  # 0=negative
"review": "A wonderful movie with excellent performances.", "label": 1  # 1=positive

PyTorch Model (LSTM-based):

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        _, (h_n, _) = self.lstm(x)
        out = self.dropout(h_n[-1])
        return self.fc(out)

# Example usage:
# model = TextClassifier(vocab_size=10000, embed_dim=128, hidden_dim=64, num_classes=2)

Key steps in training:

  • Tokenize and numerically encode reviews.
  • Batch and pad sequences.
  • Use nn.CrossEntropyLoss() to train (a full loop sketch follows this list).
  • Predict with:
  outputs = model(X_batch)        # forward pass
  preds = outputs.argmax(dim=1)   # predicted class
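
Putting those steps together, a bare-bones training loop might look like the sketch below. Here train_loader is an assumed DataLoader that yields batches of padded index tensors and integer labels; the names are illustrative, not part of any library:

import torch
import torch.nn as nn

model = TextClassifier(vocab_size=10000, embed_dim=128, hidden_dim=64, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    model.train()
    for X_batch, y_batch in train_loader:   # assumed: batches of (padded indices, labels)
        optimizer.zero_grad()
        outputs = model(X_batch)            # logits of shape (batch, num_classes)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()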

2. Text Generation: Creating Natural Language with PyTorch

Use-case: “Autocomplete this sentence: ‘The weather today is…’”
Popular dataset: WikiText-2

How does it work?
The model is trained to predict the next word, given a sequence. To generate text, we input a prompt, then repeatedly sample the next word using the model’s predictions.

Sample data:

Input: "The weather today is"
Output (model): "The weather today is sunny and warm."

PyTorch Model (LSTM-based language model):

class TextGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        x = self.embedding(x)
        x, hidden = self.lstm(x, hidden)
        out = self.fc(x)
        return out, hidden

# Example usage:
# model = TextGenerator(vocab_size=10000, embed_dim=128, hidden_dim=256)

Training steps:

  • Create input-target pairs: e.g., input=[‘the’,‘weather’,‘today’], target=[‘weather’,‘today’,‘is’].
  • Use teacher forcing during training (model gets correct token as next input).
  • For generation, sample one token at a time using the trained model, as sketched after this list.
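
Here is what that token-by-token sampling can look like, as a minimal sketch. The trained model is the TextGenerator above; itos is an assumed index-to-word list built alongside the vocabulary:

import torch

@torch.no_grad()
def generate(model, prompt_ids, itos, max_new_tokens=20, temperature=1.0):
    model.eval()
    ids = torch.tensor(prompt_ids).unsqueeze(0)           # shape (1, prompt_len)
    hidden = None
    generated = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits, hidden = model(ids, hidden)               # logits: (1, seq_len, vocab_size)
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()   # sample the next token
        generated.append(next_id)
        ids = torch.tensor([[next_id]])                   # feed only the new token back in
    return " ".join(itos[i] for i in generated)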

3. Summarization or Translation: Transformers to the Rescue

Use-case: Summarize or translate long texts (“NASA launches new Mars rover to search for signs of life.” → “NASA sends rover to find life on Mars.”)
Popular datasets: CNN/DailyMail (summarization), WMT (translation)

How does it work?
Transformers process the entire source text in parallel, using self-attention to model long-distance dependencies. Both input and output are sequences.

Sample data:

Input:  "NASA launches new Mars rover to search for signs of life."
Target: "NASA sends rover to find life on Mars."

PyTorch Model (Toy Transformer):

class TransformerSeq2Seq(nn.Module):
    def __init__(self, vocab_size, embed_dim, nhead, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.transformer = nn.Transformer(
            d_model=embed_dim,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True
        )
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, src, tgt):
        src_emb = self.embedding(src)
        tgt_emb = self.embedding(tgt)
        # Causal mask so the decoder cannot peek at future target tokens
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src_emb, tgt_emb, tgt_mask=tgt_mask)
        return self.fc(out)
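
One caveat: nn.Transformer does not add positional information by itself, so a real (non-toy) model would add positional encodings to the embeddings before the transformer call. A sketch of the standard sinusoidal version from the original Transformer paper could look like this:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)      # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)      # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))       # shape (1, max_len, d_model)

    def forward(self, x):                                 # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]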

Workflow:

  • Tokenize and index both source and target texts.
  • During training, shift the target sequence by one for teacher forcing.
  • During inference, generate the output sequence token by token (see the sketch after this list).
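
To make the shift and the decoding loop concrete, here is a hedged sketch of one training step and a greedy decoder. PAD_ID, BOS_ID, and EOS_ID are assumed special-token indices from your vocabulary, and the targets are assumed to start with BOS and end with EOS:

import torch
import torch.nn as nn

PAD_ID, BOS_ID, EOS_ID, VOCAB_SIZE = 0, 1, 2, 10000       # hypothetical ids/sizes

model = TransformerSeq2Seq(vocab_size=VOCAB_SIZE, embed_dim=256, nhead=8, num_layers=3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(src, tgt):
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]             # decoder input vs. prediction targets
    logits = model(src, tgt_in)                           # (batch, tgt_len - 1, vocab_size)
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), tgt_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def greedy_decode(src, max_len=50):
    ys = torch.full((src.size(0), 1), BOS_ID, dtype=torch.long)
    for _ in range(max_len):
        next_tok = model(src, ys)[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        if (next_tok == EOS_ID).all():                    # stop once every sequence emits EOS
            break
    return ys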

4. Transfer Learning: Fine-tuning Pretrained Transformers (e.g., BERT)

Use-case: Achieve state-of-the-art performance on text classification with only a small amount of labeled data.
Popular datasets: SST-2, IMDB, AG News

How does it work?
Pre-trained language models like BERT have already “read” the internet (or lots of books, etc.) and learned rich language representations. With transfer learning, you can quickly fine-tune these models on your own data, often achieving outstanding results with minimal effort.

Sample from SST-2

"sentence": "An exhilarating experience!", "label": 1
"sentence": "Not my cup of tea.", "label": 0

Typical fine-tuning workflow:

  1. Load a pre-trained model (e.g., BERT) and its tokenizer.
  2. Tokenize your text to get input IDs and attention masks compatible with the pre-trained model.
  3. Attach a new classification head (a simple linear/fully connected layer).
  4. Train only the classification head (optionally, some top Transformer layers) on your labeled dataset, as sketched below.
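
In practice, step 4 is just a matter of toggling requires_grad. A minimal sketch, assuming the BertClassifier model defined in the next code block (bert-base has 12 encoder layers, so unfreezing the top two is one reasonable choice):

import torch

model = BertClassifier(num_classes=2)          # defined in the next code block

# Freeze the entire pretrained encoder, then train only the new classification head.
for param in model.bert.parameters():
    param.requires_grad = False

# Optionally unfreeze the top encoder layers as well.
for param in model.bert.encoder.layer[-2:].parameters():
    param.requires_grad = True

# Only parameters with requires_grad=True are handed to the optimizer.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=2e-5)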

PyTorch Model Example (using Hugging Face Transformers):

from transformers import BertTokenizer, BertModel
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output
        out = self.dropout(pooled)
        return self.fc(out)

Typical usage:
You tokenize with BertTokenizer. For each text, you get input_ids, attention_mask (handled automatically by tokenizer(..., padding=True, truncation=True, return_tensors='pt')). You pass those to your model; during training, you compute a loss (usually cross-entropy) against your actual labels and optimize the model.
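
Concretely, one training step with that pipeline might look like the following sketch (the sentences and labels are just illustrative):

import torch
import torch.nn as nn
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertClassifier(num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

texts = ["An exhilarating experience!", "Not my cup of tea."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
logits = model(batch['input_ids'], batch['attention_mask'])   # (batch, num_classes)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()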


Summary Table: Tasks, Workflows, and Datasets

| Task | Workflow | Popular Dataset | Example Input/Output |
| --- | --- | --- | --- |
| Text Classification | Tokenize → Index → Pad & Batch → Model → Predict | IMDB, AG News, SST-2 | Review → Positive/Negative |
| Text Generation | Tokenize → Index → Batch → Model → Sample Output | WikiText-2, Penn Treebank | Prompt → Model completes the sentence |
| Summarization/Translation (Transformer) | Tokenize Source/Target → Seq2Seq Model | CNN/DailyMail, WMT | Article/English → Summary/French |
| Transfer Learning | Tokenize → Pretrained Model + Head → Fine-tune | SST-2, IMDB, custom datasets | Text → Class (e.g., sentiment/intent/NER) |

Getting Started

  • Data loading: Use torchtext, datasets (Hugging Face), or write your own code to preprocess tokenized text into numerical format, pad sequences (torch.nn.utils.rnn.pad_sequence), and batch; a collate_fn sketch follows this list.
  • Model training: Standard PyTorch routines: instantiate the model, set up optimizer (e.g., Adam), train with batches, evaluate accuracy or generation quality.
  • Inference: Pass your new text through the preprocessing pipeline, predict categories/sequences, and interpret results.
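
If you write your own loader, the padding and batching step usually lives in a collate_fn, roughly as sketched here (my_dataset is an assumed Dataset that yields (token_id_list, label) pairs):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_fn(batch):
    # batch is a list of (list_of_token_ids, label) pairs from the Dataset
    sequences = [torch.tensor(ids) for ids, _ in batch]
    labels = torch.tensor([label for _, label in batch])
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)   # pad with index 0
    return padded, labels

# loader = DataLoader(my_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)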

Here’s an overview of how the example PyTorch models are actually used in practice, along with concrete sample datasets for each. You’ll find a brief workflow for each use case, a typical dataset example, and a corresponding sample data snippet.


  • Text Classification
    • In practice: predicts the category or sentiment of input text (e.g., positive/negative review, topic classification). Preprocess (tokenize, lower-case, convert words to indices), build a vocabulary, use dataloaders to batch the indexed sequences, pass them through an embedding layer, an LSTM/GRU/Transformer, and an output layer, and train with cross-entropy loss. At inference, feed in user text and output the predicted class.
    • Typical dataset: IMDB Movie Review Dataset[1][2]
    • Sample data: "review": "A wonderful movie with excellent performances.", "label": 1 and "review": "The plot was dull and predictable.", "label": 0 (1=positive, 0=negative)
  • Text Generation
    • In practice: learns to generate new, syntactically/semantically correct text. Preprocessing is similar; the model is usually trained as a language model (predict the next word given the previous words). During inference, input a prompt and sample the next word repeatedly. Used in chatbots, story generators, and auto-complete.
    • Typical datasets: Penn Treebank, WikiText-2
    • Sample data: Input: "The weather today is"; the model might generate: "The weather today is sunny and clear."
  • Summarization / Translation (Transformer)
    • In practice: turns a source text into a target summary or translation using a sequence-to-sequence Transformer with attention. Tokenize source and target texts, split into train/val/test, use DataLoaders, and train the output to match the target summary/translation. At inference, input new source text and generate the output sequence.
    • Typical datasets: CNN/DailyMail (summarization), WMT (translation)
    • Sample data: Summarization: "NASA launches new Mars rover to search for signs of life." → "NASA launches rover to Mars to search for life." Translation: "Hello, how are you?" → "Bonjour, comment ça va ?"
  • Transfer Learning (BERT/XLM-R, etc.)
    • In practice: fine-tunes a pre-trained model on a smaller, task-specific corpus. Use the tokenizer that ships with the pretrained model, preprocess so tokens match the pre-trained vocabulary, and usually freeze the lower layers while training a classifier head. Used for quick, highly accurate adaptation to tasks like sentiment, QA, and NER.
    • Typical datasets: SST-2, IMDB, AG News[4]
    • Sample data (SST-2, Stanford Sentiment Treebank v2): "sentence": "The movie was captivating!", "label": 1 and "sentence": "Not my cup of tea.", "label": 0

Additional Details

  • Text Classification Sample Workflow (IMDB):
    1. Tokenization: 'I loved the plot.' → ['i', 'loved', 'the', 'plot']
    2. Vocabulary Mapping: {'i':1, 'loved':2, ...}
    3. Sequence Padding/Truncation: [1, 2, 3, 4] (padded as needed)
    4. Label Mapping: positive → 1, negative → 0
    5. Batched Data Sample: X_batch = [[1,2,3,4],[...]], y_batch = [1,0]
    6. Model Forward Pass: outputs = model(X_batch)
    7. Loss Computation: loss = criterion(outputs, y_batch)
  • Text Generation Sample Workflow (WikiText-2):
    • Input: “Once upon a”
    • Model predicts probability distribution over vocabulary for each next token, sampled sequentially to build a sentence.
  • Summarization/Translation:
    • Inputs and targets are both sequences; preprocessing ensures both are indexed, and sentences are batched.
    • During inference, models typically generate one token at a time, feeding the previous output back in as input (autoregressive sampling).
  • Transfer Learning Example (SST-2):
    • Data: Each instance is a sentence and a label (binary sentiment).
    • Tokenizer matches pre-trained model’s vocabulary (e.g., WordPiece for BERT).
    • Fine-tune new classifier head with a small, labeled corpus—fast and effective[4].

For datasets:

  • IMDB for sentiment (positive/negative)—classic for text classification[1][2].
  • Penn Treebank, WikiText-2 for language modeling / text generation.
  • CNN/DailyMail for summarization.
  • WMT for translation.
  • SST-2 for transfer learning and binary sentiment[4].

These datasets are widely available and easy to start with using PyTorch and/or Hugging Face data loaders.


Conclusion

Whether you’re classifying movie reviews, generating stories, summarizing articles, or packing state-of-the-art performance into your app with transfer learning, PyTorch lets you build and experiment with cutting-edge NLP models easily. Try out the examples above, plug in your own data, and see the power of modern deep learning for text unfold!


Feel free to leave comments or questions below, and happy building with PyTorch! 🚀

Jesus Saves

By JCharisAI
