PyTorch Beginner's Tutorial (7) - Text Binary Classification with BERT




Overview

This project explores an introductory NLP task on Kaggle (link), focusing on binary classification of text. In this challenge, participants classify Twitter posts as either relating to a disaster event or not. For instance, a tweet like “The White House is on fire, the flames are massive” is classified as a disaster-related tweet. Meanwhile, a post like “That cloud looks like it’s burning” contains related keywords but doesn’t refer to an actual disaster. The task, therefore, is to distinguish between these two cases.

The dataset can be downloaded from Kaggle (link) or via Baidu Netdisk (link).

You can upload your predictions to Kaggle to see how your model scores (link).

Environment Setup

This project uses the following library versions:

```
python==3.8.5
pandas==1.3.5
torch==1.11.0
transformers==4.21
```

To get started, import all the required libraries:

```python
import os
import pandas
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
# For loading the BERT model’s tokenizer
from transformers import AutoTokenizer
# For loading the BERT model
from transformers import AutoModel
from pathlib import Path
from tqdm.notebook import tqdm
```

Global Configuration

```python
batch_size = 16
# Maximum length of text sequences
text_max_length = 128
# Total number of training epochs (example value)
epochs = 100
# Portion of the training set to use as the validation set
validation_ratio = 0.1
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Log loss after every specified number of steps
log_per_step = 50

# Dataset directory path
dataset_dir = Path("./dataset")
os.makedirs(dataset_dir, exist_ok=True)

# Model storage path
model_dir = Path("./drive/MyDrive/model/bert_checkpoints")
# Create the model directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

print("Device:", device)
```
```
Device: cuda
```

Data Processing

Loading the Dataset

First, download the dataset and extract it into the dataset directory. The directory should contain three files: train.csv, test.csv, and sample_submission.csv.
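
If you want to make sure everything was extracted to the right place, a quick optional sanity check like the following works (it simply verifies the three filenames mentioned above):

```python
# Optional sanity check: make sure all three CSV files were extracted
# into the dataset directory before loading anything.
for name in ('train.csv', 'test.csv', 'sample_submission.csv'):
    assert (dataset_dir / name).exists(), f"Missing {name} in {dataset_dir}"
```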

We'll use pandas to load the training data. For our purposes, we only need the text and target columns from the training data:

```python
pd_data = pandas.read_csv(dataset_dir / 'train.csv')[['text', 'target']]
```

After loading, let's take a look at the data content:

```python
pd_data
```
```
                                                   text  target
0     Our Deeds are the Reason of this #earthquake M...       1
1                Forest fire near La Ronge Sask. Canada       1
2     All residents asked to 'shelter in place' are ...       1
3     13,000 people receive #wildfires evacuation or...       1
4     Just got sent this photo from Ruby #Alaska as ...       1
...                                                 ...     ...
7608  Two giant cranes holding a bridge collapse int...       1
7609  @aria_ahrary @TheTawniest The out of control w...       1
7610  M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...       1
7611  Police investigating after an e-bike collided ...       1
7612  The Latest: More Homes Razed by Northern Calif...       1

[7613 rows x 2 columns]
```

The text column contains the tweet text, while target indicates whether the tweet describes a real disaster (1 for yes, 0 for no).
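
Before splitting the data, it can also help to glance at how balanced the two classes are. This optional check isn't needed for the rest of the pipeline:

```python
# Optional: count how many tweets are labeled 1 (disaster) vs. 0 (not a disaster).
print(pd_data['target'].value_counts())
```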

Dataset and Dataloader

We split the training data randomly into training and validation sets according to a specified ratio:

```python
pd_validation_data = pd_data.sample(frac=validation_ratio)
pd_train_data = pd_data[~pd_data.index.isin(pd_validation_data.index)]
```

Once the data is loaded, we can create the Dataset class, which will return each tweet along with its target label:

```python
class MyDataset(Dataset):

    def __init__(self, mode='train'):
        super(MyDataset, self).__init__()
        self.mode = mode
        # Select the corresponding data based on the mode
        if mode == 'train':
            self.dataset = pd_train_data
        elif mode == 'validation':
            self.dataset = pd_validation_data
        elif mode == 'test':
            # For test mode, return the tweet along with its id. Using the id as the target here simplifies the process of saving results later.
            self.dataset = pandas.read_csv(dataset_dir / 'test.csv')[['text', 'id']]
        else:
            raise Exception("Unknown mode {}".format(mode))

    def __getitem__(self, index):
        # Retrieve the item at the given index
        data = self.dataset.iloc[index]
        # Get the tweet text, applying basic cleaning
        source = data['text'].replace("#", "").replace("@", "")
        # Get the corresponding target
        if self.mode == 'test':
            # In test mode, use id as the target
            target = data['id']
        else:
            target = data['target']
        # Return the tweet and its target
        return source, target

    def __len__(self):
        return len(self.dataset)
```
```python
train_dataset = MyDataset('train')
validation_dataset = MyDataset('validation')
```

Let’s take a quick look at our data:

```python
train_dataset[0]
```
```
('Our Deeds are the Reason of this earthquake May ALLAH Forgive us all', 1)
```

After setting up our Dataset, we can proceed to build the Dataloader. Before that, however, we need to define a tokenizer:

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

Let’s try using the tokenizer:

```python
tokenizer("I'm learning deep learning", return_tensors='pt')
```
```
{'input_ids': tensor([[ 101, 1045, 1005, 1049, 4083, 2784, 4083,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
```

It works correctly. Here, 101 represents the "start" token ([CLS]), and 102 indicates the "end" of the sentence ([SEP]).
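
To see this mapping explicitly, we can convert the ids from the output above back into tokens:

```python
# Decode the ids from the example above back into their tokens.
# Expected: ['[CLS]', 'i', "'", 'm', 'learning', 'deep', 'learning', '[SEP]']
print(tokenizer.convert_ids_to_tokens([101, 1045, 1005, 1049, 4083, 2784, 4083, 102]))
```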

Now, let’s build our Dataloader. We’ll define a collate_fn function to handle encoding, padding, and batching:

```python
def collate_fn(batch):
    """
    Converts a batch of text sentences to tensors and organizes them into a batch.
    :param batch: A batch of sentences, e.g., [('text', target), ('text', target), ...]
    :return: The processed result, e.g.:
             src: {'input_ids': tensor([[ 101, ..., 102, 0, 0, ...], ...]), 'attention_mask': tensor([[1, ..., 1, 0, ...], ...])}
             target: [1, 1, 0, ...]
    """
    text, target = zip(*batch)
    text, target = list(text), list(target)

    # `src` will be fed into BERT, so no special processing is needed; we can directly use the tokenizer output.
    # padding='max_length' pads to a fixed length
    # truncation=True truncates if the length exceeds the limit
    src = tokenizer(text, padding='max_length', max_length=text_max_length, return_tensors='pt', truncation=True)

    return src, torch.LongTensor(target)
```
```python
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
```

Let's take a look at the data in train_loader:

```python
inputs, targets = next(iter(train_loader))
print("inputs:", inputs)
print("targets:", targets)
```
```
inputs: {'input_ids': tensor([[  101,  4911,  1024,  ...,     0,     0,     0],
        [  101, 19387, 11113,  ...,     0,     0,     0],
        [  101,  2317,  2111,  ...,     0,     0,     0],
        ...,
        [  101, 25595, 10288,  ...,     0,     0,     0],
        [  101,  1037, 14700,  ...,     0,     0,     0],
        [  101, 12361,  2042,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
targets: tensor([1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0])
```
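
Because padding='max_length' pads every sentence to text_max_length, each tensor in the batch has shape (batch_size, text_max_length). A quick shape check (optional) confirms this:

```python
# Both should print torch.Size([16, 128]) with the settings above.
print(inputs['input_ids'].shape)
print(inputs['attention_mask'].shape)
```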

Building the Model

```python
class MyModel(nn.Module):

    def __init__(self):
        super(MyModel, self).__init__()

        # Load the BERT model
        self.bert = AutoModel.from_pretrained("bert-base-uncased")

        # Define the final prediction layer
        self.predictor = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, src):
        """
        :param src: Tokenized tweet data
        """

        # Feed src directly into BERT. Since BERT and the tokenizer work as a pair, we can use this approach.
        # Retrieve the encoder output and use the [CLS] token's output as input to the final linear layer.
        outputs = self.bert(**src).last_hidden_state[:, 0, :]

        # Use the linear layer to make the final prediction
        return self.predictor(outputs)
```
```python
model = MyModel()
model = model.to(device)
```
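
bert-base-uncased alone has roughly 110 million parameters, so almost all of the model's capacity comes from the pretrained encoder rather than the small prediction head. If you're curious, you can count them (optional):

```python
# Count the trainable parameters (BERT encoder + prediction head).
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```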
```python
model(inputs.to(device))
```
```
tensor([[0.5121],
        [0.5032],
        [0.5032],
        [0.4913],
        ...
        [0.5333],
        [0.4967],
        [0.4951]], device='cuda:0', grad_fn=<SigmoidBackward0>)
```

Training the Model

Now, let's start training the model by defining the loss function and the optimizer. Since this is a binary classification task, Binary Cross Entropy (BCE) is appropriate:

```python
criteria = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
```

I found this learning rate through testing. Initially, I tried 3e-4, but it wouldn’t converge. This really highlights the importance of choosing the right learning rate.
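
As an aside, BCELoss expects probabilities, which is why the model ends with nn.Sigmoid. An equivalent and usually more numerically stable setup, not used in this post, is to drop the final Sigmoid and let the loss apply it internally:

```python
# Alternative (not used here): remove nn.Sigmoid() from the predictor and use
# BCEWithLogitsLoss, which fuses the sigmoid into the loss for better stability.
alternative_criteria = nn.BCEWithLogitsLoss()
```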

```python
# Since `inputs` is a dictionary type, define a helper function to transfer tensors to the device
def to_device(dict_tensors):
    result_tensors = {}
    for key, value in dict_tensors.items():
        result_tensors[key] = value.to(device)
    return result_tensors
```

Define a validation function to calculate the accuracy and loss on the validation set.

```python
def validate():
    model.eval()
    total_loss = 0.
    total_correct = 0
    # Gradients are not needed during validation
    with torch.no_grad():
        for inputs, targets in validation_loader:
            inputs, targets = to_device(inputs), targets.to(device)
            outputs = model(inputs)
            loss = criteria(outputs.view(-1), targets.float())
            total_loss += float(loss)

            # Threshold the probabilities at 0.5 and count correct predictions
            correct_num = ((outputs >= 0.5).float().flatten() == targets).sum()
            total_correct += correct_num

    return total_correct / len(validation_dataset), total_loss / len(validation_dataset)
```

Start training:

```python
# Set the model to training mode first
model.train()

# Clear CUDA cache if available
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Define variables to help print the loss
total_loss = 0.
# Track the number of steps
step = 0

# Keep record of the best accuracy on the validation set
best_accuracy = 0

# Begin training
for epoch in range(epochs):
    model.train()
    for i, (inputs, targets) in enumerate(train_loader):
        # Get training data from the batch
        inputs, targets = to_device(inputs), targets.to(device)
        # Pass inputs through the model (forward pass)
        outputs = model(inputs)
        # Calculate the loss
        loss = criteria(outputs.view(-1), targets.float())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += float(loss)
        step += 1

        # Log progress at regular intervals
        if step % log_per_step == 0:
            print("Epoch {}/{}, Step: {}/{}, total loss: {:.4f}".format(epoch+1, epochs, i, len(train_loader), total_loss))
            total_loss = 0

        # Free up memory for inputs and targets
        del inputs, targets

    # After each epoch, validate on the validation set
    accuracy, validation_loss = validate()
    print("Epoch {}, accuracy: {:.4f}, validation loss: {:.4f}".format(epoch+1, accuracy, validation_loss))
    torch.save(model, model_dir / f"model_{epoch}.pt")

    # Save the best-performing model
    if accuracy > best_accuracy:
        torch.save(model, model_dir / f"model_best.pt")
        best_accuracy = accuracy
```
```
Epoch 1/100, Step: 49/429, total loss:28.4544
Epoch 1/100, Step: 99/429, total loss:22.8545
Epoch 1/100, Step: 149/429, total loss:21.7493
...
Epoch 10/100, Step: 288/429, total loss:3.1754
Epoch 10/100, Step: 338/429, total loss:3.3069
Epoch 10/100, Step: 388/429, total loss:1.8836
Epoch 10, accuracy: 0.8292, validation loss: 0.0561
```

Model Usage

Load the best model, then assemble the CSV file according to Kaggle’s requirements.

```python
model = torch.load(model_dir / f"model_best.pt")
model = model.eval()
```
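
Note that torch.save stored the whole model object, so torch.load unpickles it and the MyModel class must be defined in the session doing the loading. If you load the checkpoint on a machine without a GPU, you can pass map_location (an optional variation on the line above):

```python
# Optional variation: map a GPU-trained checkpoint onto whatever device is available.
model = torch.load(model_dir / "model_best.pt", map_location=device)
model = model.eval()
```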

Construct a dataloader for the test dataset. Note that the test set does not include targets.

```python
test_dataset = MyDataset('test')
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
```

Pass the test data through the model to get the results and assemble them in the required Kaggle format:

```python
results = []
for inputs, ids in tqdm(test_loader):
    # No gradients are needed at inference time
    with torch.no_grad():
        outputs = model(inputs.to(device))
    # Threshold the probabilities at 0.5 to get 0/1 predictions
    outputs = (outputs >= 0.5).int().flatten().tolist()
    ids = ids.tolist()
    # Pair each tweet id with its predicted label
    results = results + [(id, result) for result, id in zip(outputs, ids)]
```
```python
with open(dataset_dir / 'results.csv', 'w', encoding='utf-8') as f:
    f.write('id,target\n')
    for id, result in results:
        f.write(f"{id},{result}\n")
print("Finished!")
```
```
Finished!
```
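
Equivalently, the submission file can be written with pandas instead of a manual loop:

```python
# Alternative: build a DataFrame from the (id, prediction) pairs and let pandas
# handle the CSV formatting.
submission = pandas.DataFrame(results, columns=['id', 'target'])
submission.to_csv(dataset_dir / 'results.csv', index=False)
```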

Now, upload your results to Kaggle and see your score. After 10 epochs, I scored 0.83573, which isn’t too bad.
