Based on the fastai tutorial on transformers: Documentation

What is a transformer?

A transformer is a deep learning model built around self-attention, a mechanism that lets the model learn how much weight to give each part of the input data relative to the others.
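
To make the idea of weighting the input concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: real transformers such as GPT2 add learned query, key, and value projections, several attention heads, and many stacked layers, and the toy_inputs vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each output vector is a weighted average of all input vectors, where
    the weights come from how similar the inputs are to each other.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Mix the input vectors according to the attention weights
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors (illustration only)
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(toy_inputs)
```

Each row of weights sums to 1, so every output is a blend of the inputs, with more similar inputs contributing more.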

Importing a transformers pretrained model

Install the transformers library:

!pip install -Uq transformers

Then import GPT2LMHeadModel and GPT2TokenizerFast:

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

We’re using the basic version of the GPT2 model:

pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)

Prior to moving to the fine-tuning part, look at the tokenizer and the model. Tokenizers in HuggingFace usually do tokenization and numericalization in a single step:

ids = tokenizer.encode('This is an example of text, and')
ids
tokenizer.decode(ids)
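
To show what doing tokenization and numericalization in a single step means, here is a toy sketch that splits the work into its two parts and then combines them. The vocabulary and the whitespace splitting are assumptions for illustration only; the real GPT2 tokenizer uses byte-pair encoding with a vocabulary of about 50,000 entries.

```python
# Hypothetical toy vocabulary (illustration only, not GPT2's vocabulary)
toy_vocab = {"this": 0, "is": 1, "an": 2, "example": 3, "<unk>": 4}
id_to_token = {i: tok for tok, i in toy_vocab.items()}

def tokenize(text):
    """Tokenization: split the text into tokens (here, lowercase words)."""
    return text.lower().split()

def convert_tokens_to_ids(tokens):
    """Numericalization: map each token to its id, unknown tokens to <unk>."""
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]

def encode(text):
    """Both steps at once, in the spirit of tokenizer.encode."""
    return convert_tokens_to_ids(tokenize(text))

def decode(ids):
    """Map ids back to tokens and join them into a string."""
    return " ".join(id_to_token[i] for i in ids)

toy_ids = encode("This is an example")
toy_text = decode(toy_ids)
```

Here toy_ids is a list of integers and toy_text is the lowercased sentence recovered from them.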

Our model can be used to generate predictions. It has a generate method that expects a batch of prompts, so we can feed it our ids after adding a batch dimension of size one:

import torch

t = torch.LongTensor(ids)[None]
preds = model.generate(t)

Predictions, by default, are of length 20:

preds.shape,preds[0]
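
As a rough picture of what happens inside generate, here is a toy sketch of greedy decoding over a batch of prompts. It assumes the default behaves as a cap of 20 on the total length, prompt included, and that the most likely token is picked at each step; the scoring function toy_next_token_scores is a hypothetical stand-in for the model, and the real method also handles things like end-of-sequence tokens.

```python
def toy_next_token_scores(ids, vocab_size=5):
    """Hypothetical stand-in for a language model.

    It always favors the token (last id + 1) % vocab_size.
    """
    favored = (ids[-1] + 1) % vocab_size
    return [1.0 if tok == favored else 0.0 for tok in range(vocab_size)]

def greedy_generate(prompt_batch, max_length=20):
    """Extend every sequence in the batch until it holds max_length ids."""
    results = []
    for ids in prompt_batch:
        ids = list(ids)  # copy so the prompt is left untouched
        while len(ids) < max_length:
            scores = toy_next_token_scores(ids)
            # Greedy step: append the highest-scoring token
            ids.append(scores.index(max(scores)))
        results.append(ids)
    return results

# A batch holding a single prompt, like adding [None] to a tensor of ids
prompt_batch = [[0, 1, 2]]
toy_preds = greedy_generate(prompt_batch)
```

The result keeps the batch dimension: one output sequence per prompt, each starting with the prompt itself.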

We can use the decode method with a numpy array:

tokenizer.decode(preds[0].numpy())

Bridging the gap with fastai

Now, let’s see how to use fastai to fine-tune the model on wikitext-2, utilizing all training utilities:

from fastai.text.all import *

Preparing the data

Download the dataset as two csv files:

path = untar_data(URLs.WIKITEXT_TINY)
path.ls()

Now look at what those csv files are:

df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
df_train.head()

We want to gather all of our text in a single numpy array:

all_texts = np.concatenate([df_train[0].values, df_valid[0].values])

To process the data for training, we’ll build a Transform that is applied lazily, that is, only when an item is requested. In this case we could also do the pre-processing once and for all and use the transform only to decode.

With a fastai Transform, you can define:

  • an encodes method that is applied when you call the transform (a bit like the forward method in a nn.Module)
  • a decodes method that is applied when you call the decode method of the transform, if you need to decode anything for showing purposes (like converting ids to a text here)
  • a setups method that sets some inner state of the Transform (not needed here so we skip it)

class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        # Split the text into tokens, then map the tokens to ids
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    # Convert ids back to text for showing purposes
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))
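
To illustrate the encodes/decodes contract and lazy application without needing fastai, here is a simplified stand-in in plain Python. ToyTransform, ToyTokenizerTransform, and LazyItems are hypothetical names invented for this sketch; the real fastai Transform does more, such as dispatching on the type of its input.

```python
class ToyTransform:
    """Simplified stand-in for a fastai Transform (not the real class)."""
    def __call__(self, x):
        # Calling the transform applies encodes
        return self.encodes(x)
    def decode(self, x):
        # Calling decode applies decodes
        return self.decodes(x)

class ToyTokenizerTransform(ToyTransform):
    """Encodes text to ids and decodes ids back to text."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.inverse = {i: tok for tok, i in vocab.items()}
    def encodes(self, x):
        return [self.vocab[tok] for tok in x.split()]
    def decodes(self, x):
        return " ".join(self.inverse[i] for i in x)

class LazyItems:
    """Applies the transform only when an item is requested."""
    def __init__(self, items, tfm):
        self.items, self.tfm = items, tfm
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.tfm(self.items[i])

# Hypothetical two-word vocabulary and raw texts
tfm = ToyTokenizerTransform({"hello": 0, "world": 1})
lazy = LazyItems(["hello world", "world hello"], tfm)
first = lazy[0]            # encoding happens here, not at construction
shown = tfm.decode(first)  # back to text for showing purposes
```

Nothing is tokenized until an item is indexed, which is the sense in which the transform is applied lazily.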