PyTorch Beginner's Tutorial (7) - Textual Metaphor Binary Classification with BERT
Overview
This project explores an introductory NLP task on Kaggle (link), focusing on binary classification of text. In this challenge, participants classify Twitter posts as either relating to a disaster event or not. For instance, a tweet like “The White House is on fire, the flames are massive” is classified as a disaster-related tweet. Meanwhile, a post like “That cloud looks like it’s burning” contains related keywords but doesn’t refer to an actual disaster. The task, therefore, is to distinguish between these two cases.
The dataset can be downloaded from Kaggle (link) or via Baidu Netdisk (link).
You can upload your predictions to Kaggle to see how your model scores (link).
Environment Setup
This project uses the following library versions:
```
python==3.8.5
pandas==1.3.5
torch==1.11.0
transformers==4.21
```
To get started, import all the required libraries:
```python
import os

import pandas
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
# For loading the BERT model's tokenizer
from transformers import AutoTokenizer
# For loading the BERT model
from transformers import AutoModel
from pathlib import Path
from tqdm.notebook import tqdm
```
Global Configuration
```python
batch_size = 16
# Maximum length of text sequences
text_max_length = 128
# Total number of training epochs (example value)
epochs = 100
# Portion of the training set to use as the validation set
validation_ratio = 0.1
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Log loss after every specified number of steps
log_per_step = 50

# Dataset directory path
dataset_dir = Path("./dataset")
os.makedirs(dataset_dir) if not os.path.exists(dataset_dir) else ''

# Model storage path
model_dir = Path("./drive/MyDrive/model/bert_checkpoints")
# Create the model directory if it doesn't exist
os.makedirs(model_dir) if not os.path.exists(model_dir) else ''

print("Device:", device)
```
```
Device: cuda
```
Data Processing
Loading the Dataset
First, download the dataset and extract it into the `dataset` directory. The directory should contain three files: `train.csv`, `test.csv`, and `sample_submission.csv`.

We'll use `pandas` to load the training data. For our purposes, we only need the `text` and `target` columns from the training data:
```python
pd_data = pandas.read_csv(dataset_dir / 'train.csv')[['text', 'target']]
```
After loading, let's take a look at the data content:
```python
pd_data
```
| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... |
| 7608 | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | Police investigating after an e-bike collided ... | 1 |
| 7612 | The Latest: More Homes Razed by Northern Calif... | 1 |
The `text` column contains the tweet text, while `target` indicates whether the tweet describes a disaster event (`1` for yes, `0` for no).
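Before going further, it can be helpful to check how balanced the two classes are. This quick check is not part of the original pipeline, just a sanity check with pandas:

```python
# Sanity check (optional): inspect the class balance of the training data
print(pd_data['target'].value_counts())
print(pd_data['target'].value_counts(normalize=True))
```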
Dataset and DataLoader
We split the training data randomly into training and validation sets according to a specified ratio:
```python
pd_validation_data = pd_data.sample(frac=validation_ratio)
pd_train_data = pd_data[~pd_data.index.isin(pd_validation_data.index)]
```
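Note that `DataFrame.sample` draws a different random split on every run. If you want the split to be reproducible, you can pass a fixed seed, for example:

```python
# Optional: fix the random seed so the train/validation split is reproducible
pd_validation_data = pd_data.sample(frac=validation_ratio, random_state=42)
pd_train_data = pd_data[~pd_data.index.isin(pd_validation_data.index)]
```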
Once the data is loaded, we can create the Dataset class, which will return each tweet along with its target label:
```python
class MyDataset(Dataset):

    def __init__(self, mode='train'):
        super(MyDataset, self).__init__()
        self.mode = mode
        # Select the corresponding data based on the mode
        if mode == 'train':
            self.dataset = pd_train_data
        elif mode == 'validation':
            self.dataset = pd_validation_data
        elif mode == 'test':
            # For test mode, return the tweet along with its id. Using the id as the
            # target here simplifies the process of saving results later.
            self.dataset = pandas.read_csv(dataset_dir / 'test.csv')[['text', 'id']]
        else:
            raise Exception("Unknown mode {}".format(mode))

    def __getitem__(self, index):
        # Retrieve the item at the given index
        data = self.dataset.iloc[index]
        # Get the tweet text, applying basic cleaning
        source = data['text'].replace("#", "").replace("@", "")
        # Get the corresponding target
        if self.mode == 'test':
            # In test mode, use id as the target
            target = data['id']
        else:
            target = data['target']
        # Return the tweet and its target
        return source, target

    def __len__(self):
        return len(self.dataset)
```
```python
train_dataset = MyDataset('train')
validation_dataset = MyDataset('validation')
```
Let’s take a quick look at our data:
```python
train_dataset.__getitem__(0)
```
```
('Our Deeds are the Reason of this earthquake May ALLAH Forgive us all', 1)
```
After setting up our `Dataset`, we can proceed to build the `DataLoader`. Before that, however, we need to define a tokenizer:
```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
Let’s try using the tokenizer:
```python
tokenizer("I'm learning deep learning", return_tensors='pt')
```
```
{'input_ids': tensor([[ 101, 1045, 1005, 1049, 4083, 2784, 4083,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
```
It works correctly. Here, `101` is the id of the "start" token (`[CLS]`), and `102` marks the "end" of the sentence (`[SEP]`).
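You can confirm this mapping by decoding the ids back into tokens with `convert_ids_to_tokens`:

```python
# Map the ids back to tokens to verify the special [CLS]/[SEP] tokens
print(tokenizer.convert_ids_to_tokens([101, 1045, 1005, 1049, 4083, 2784, 4083, 102]))
# Expected output (roughly): ['[CLS]', 'i', "'", 'm', 'learning', 'deep', 'learning', '[SEP]']
```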
Now, let's build our `DataLoader`. We'll define a `collate_fn` function to handle encoding, padding, and batching:
```python
def collate_fn(batch):
    """
    Converts a batch of text sentences to tensors and organizes them into a batch.
    :param batch: A batch of sentences, e.g., [('text', target), ('text', target), ...]
    :return: The processed result, e.g.:
             src: {'input_ids': tensor([[ 101, ..., 102, 0, 0, ...], ...]),
                   'attention_mask': tensor([[1, ..., 1, 0, ...], ...])}
             target: [1, 1, 0, ...]
    """
    text, target = zip(*batch)
    text, target = list(text), list(target)

    # `src` will be fed into BERT, so no special processing is needed;
    # we can directly use the tokenizer output.
    # padding='max_length' pads to a fixed length
    # truncation=True truncates if the length exceeds the limit
    src = tokenizer(text, padding='max_length', max_length=text_max_length,
                    return_tensors='pt', truncation=True)

    return src, torch.LongTensor(target)
```
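As a quick sanity check (not part of the original pipeline), you can call `collate_fn` on a tiny handmade batch and look at the resulting shapes:

```python
# Try collate_fn on a small handmade batch and inspect the output shapes
sample_batch = [("Forest fire near La Ronge Sask. Canada", 1),
                ("I'm learning deep learning", 0)]
src, target = collate_fn(sample_batch)
print(src['input_ids'].shape)       # torch.Size([2, 128]) with text_max_length = 128
print(src['attention_mask'].shape)  # torch.Size([2, 128])
print(target)                       # tensor([1, 0])
```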
```python
train_loader = DataLoader(train_dataset, batch_size=batch_size,
                          shuffle=True, collate_fn=collate_fn)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size,
                               shuffle=False, collate_fn=collate_fn)
```
Let's take a look at the data in `train_loader`:
```python
inputs, targets = next(iter(train_loader))
print("inputs:", inputs)
print("targets:", targets)
```
```
inputs: {'input_ids': tensor([[  101,  4911,  1024,  ...,     0,     0,     0],
        [  101, 19387, 11113,  ...,     0,     0,     0],
        [  101,  2317,  2111,  ...,     0,     0,     0],
        ...,
        [  101, 25595, 10288,  ...,     0,     0,     0],
        [  101,  1037, 14700,  ...,     0,     0,     0],
        [  101, 12361,  2042,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
targets: tensor([1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0])
```
Building the Model
```python
class MyModel(nn.Module):

    def __init__(self):
        super(MyModel, self).__init__()

        # Load the BERT model
        self.bert = AutoModel.from_pretrained("bert-base-uncased")

        # Define the final prediction layer
        self.predictor = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, src):
        """
        :param src: Tokenized tweet data
        """
        # Feed src directly into BERT. Since BERT and the tokenizer work as a pair,
        # we can use this approach.
        # Retrieve the encoder output and use the [CLS] token's output as input
        # to the final linear layers.
        outputs = self.bert(**src).last_hidden_state[:, 0, :]
        # Use the linear layers to make the final prediction
        return self.predictor(outputs)
```
```python
model = MyModel()
model = model.to(device)
```
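Before running anything, it can be reassuring to check how many parameters the model has; the BERT encoder accounts for almost all of them (roughly 110M for BERT-base). This is just an optional check:

```python
# Optional: count the trainable parameters of the model
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {total_params:,}")
```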
```python
model(inputs.to(device))
```
```
tensor([[0.5121],
        [0.5032],
        [0.5032],
        [0.4913],
        ...
        [0.5333],
        [0.4967],
        [0.4951]], device='cuda:0', grad_fn=<SigmoidBackward0>)
```
Training the Model
Now, let's start training the model by defining the loss function and the optimizer. Since this is a binary classification task, Binary Cross Entropy (BCE) is appropriate:
```python
criteria = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
```
I found this learning rate through testing. Initially, I tried 3e-4, but it wouldn’t converge. This really highlights the importance of choosing the right learning rate.
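A common alternative, not used in this tutorial, is to drop the final `nn.Sigmoid()` from the model and use `nn.BCEWithLogitsLoss` instead, which combines the sigmoid and the BCE loss in a more numerically stable way. A minimal sketch of that variant:

```python
# Alternative sketch (not used here): have the model output raw logits,
# i.e. remove nn.Sigmoid() from the predictor, and let the loss apply the sigmoid.
criteria = nn.BCEWithLogitsLoss()
# At prediction time, apply the sigmoid explicitly:
# probs = torch.sigmoid(model(inputs))
```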
```python
# Since `inputs` is a dictionary type, define a helper function
# to transfer its tensors to the device
def to_device(dict_tensors):
    result_tensors = {}
    for key, value in dict_tensors.items():
        result_tensors[key] = value.to(device)
    return result_tensors
```
Define a validation function to calculate the accuracy and loss on the validation set.
```python
def validate():
    model.eval()
    total_loss = 0.
    total_correct = 0
    # Disable gradient tracking during validation to save memory
    with torch.no_grad():
        for inputs, targets in validation_loader:
            inputs, targets = to_device(inputs), targets.to(device)
            outputs = model(inputs)
            loss = criteria(outputs.view(-1), targets.float())
            total_loss += float(loss)

            correct_num = (((outputs >= 0.5).float() * 1).flatten() == targets).sum()
            total_correct += correct_num

    return total_correct / len(validation_dataset), total_loss / len(validation_dataset)
```
Start training:
```python
# Set the model to training mode first
model.train()

# Clear CUDA cache if available
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Define variables to help print the loss
total_loss = 0.
# Track the number of steps
step = 0

# Keep record of the best accuracy on the validation set
best_accuracy = 0

# Begin training
for epoch in range(epochs):
    model.train()
    for i, (inputs, targets) in enumerate(train_loader):
        # Get training data from the batch
        inputs, targets = to_device(inputs), targets.to(device)
        # Pass inputs through the model (forward pass)
        outputs = model(inputs)
        # Calculate the loss
        loss = criteria(outputs.view(-1), targets.float())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += float(loss)
        step += 1

        # Log progress at regular intervals
        if step % log_per_step == 0:
            print("Epoch {}/{}, Step: {}/{}, total loss: {:.4f}".format(
                epoch + 1, epochs, i, len(train_loader), total_loss))
            total_loss = 0

        # Free up memory for inputs and targets
        del inputs, targets

    # After each epoch, validate on the validation set
    accuracy, validation_loss = validate()
    print("Epoch {}, accuracy: {:.4f}, validation loss: {:.4f}".format(
        epoch + 1, accuracy, validation_loss))
    torch.save(model, model_dir / f"model_{epoch}.pt")

    # Save the best-performing model
    if accuracy > best_accuracy:
        torch.save(model, model_dir / f"model_best.pt")
        best_accuracy = accuracy
```
```
Epoch 1/100, Step: 49/429, total loss:28.4544
Epoch 1/100, Step: 99/429, total loss:22.8545
Epoch 1/100, Step: 149/429, total loss:21.7493
...
Epoch 10/100, Step: 288/429, total loss:3.1754
Epoch 10/100, Step: 338/429, total loss:3.3069
Epoch 10/100, Step: 388/429, total loss:1.8836
Epoch 10, accuracy: 0.8292, validation loss: 0.0561
```
Model Usage
Load the best model, then assemble the CSV file according to Kaggle’s requirements.
```python
model = torch.load(model_dir / f"model_best.pt")
model = model.eval()
```
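Note that `torch.save(model, ...)` / `torch.load(...)` pickles the entire model object, which ties the checkpoint to the exact class definition. An optional, more portable variant (the file name below is just an example) is to save only the weights:

```python
# Optional alternative: save/load only the state dict instead of the whole model
torch.save(model.state_dict(), model_dir / "model_best_weights.pt")

model = MyModel()
model.load_state_dict(torch.load(model_dir / "model_best_weights.pt"))
model = model.to(device).eval()
```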
Construct a dataloader for the test dataset. Note that the test set does not include targets.
```python
test_dataset = MyDataset('test')
test_loader = DataLoader(test_dataset, batch_size=batch_size,
                         shuffle=False, collate_fn=collate_fn)
```
Pass the test data through the model to get the results and assemble them in the required Kaggle format:
```python
results = []
# Disable gradient tracking during inference
with torch.no_grad():
    for inputs, ids in tqdm(test_loader):
        outputs = model(inputs.to(device))
        outputs = (outputs >= 0.5).int().flatten().tolist()
        ids = ids.tolist()
        results = results + [(id, result) for result, id in zip(outputs, ids)]
```
```python
with open(dataset_dir / 'results.csv', 'w', encoding='utf-8') as f:
    f.write('id,target\n')
    for id, result in results:
        f.write(f"{id},{result}\n")
print("Finished!")
```
```
Finished!
```
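Equivalently, you could let pandas write the submission file, which avoids the manual string formatting:

```python
# Equivalent way to write the submission file with pandas
submission = pandas.DataFrame(results, columns=['id', 'target'])
submission.to_csv(dataset_dir / 'results.csv', index=False)
```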
Now, upload your results to Kaggle and see your score. After 10 epochs, I scored 0.83573, which isn’t too bad.