Based on the fastai tutorial on transformers: Documentation
What is a transformer?
A transformer is a deep learning model built around self-attention: instead of treating every part of the input equally, it learns to weight the relevance of each part of the input data relative to the others.
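Under the hood, that weighting is done by self-attention. Here is a rough, minimal sketch of scaled dot-product attention (a toy illustration only; GPT-2's real implementation adds learned query/key/value projections, multiple heads, and masking):
import torch, math

def attention(q, k, v):
    # Score every token against every other token, scaled by sqrt of the dimension
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = scores.softmax(dim=-1)  # each row sums to 1: the attention weights
    return weights @ v                # weighted mix of the value vectors

x = torch.randn(4, 8)      # 4 tokens, each an 8-dimensional embedding
attention(x, x, x).shape   # torch.Size([4, 8])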
Importing a pretrained model from transformers
First, install the transformers library:
!pip install -Uq transformers
Then import GPT2LMHeadModel and GPT2TokenizerFast:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
We’ll use the base version of the GPT-2 model:
pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)
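As a quick optional check, num_parameters (a method available on all HuggingFace pretrained models) shows the size of what we just loaded; base GPT-2 is roughly 124 million parameters:
model.num_parameters()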
Before moving on to the fine-tuning part, let’s take a look at the tokenizer and the model. Tokenizers in HuggingFace usually do the tokenization and the numericalization in a single step:
ids = tokenizer.encode('This is an example of text, and')
ids
tokenizer.decode(ids)
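To see that single step split in two, we can tokenize and numericalize separately; tokenize and convert_tokens_to_ids are both part of the HuggingFace tokenizer API, and for GPT-2 (whose tokenizer adds no special tokens by default) the result should match encode:
toks = tokenizer.tokenize('This is an example of text, and')
tokenizer.convert_tokens_to_ids(toks) == ids  # True: same ids as encode above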
Our model can be used to generate predictions: it has a generate method that expects a batch of prompts, so we can feed it our ids after adding one batch dimension:
import torch
t = torch.LongTensor(ids)[None]
preds = model.generate(t)
Predictions, by default, are of length 20:
preds.shape,preds[0]
We can use the decode method with a numpy array:
tokenizer.decode(preds[0].numpy())
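The 20-token default can be overridden. As a sketch, max_length, do_sample, and top_k are all standard arguments of generate (sampling makes the output random, so your results will vary); pad_token_id is set to silence the pad-token warning GPT-2 otherwise emits:
preds = model.generate(t, max_length=40, do_sample=True, top_k=50,
                       pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(preds[0].numpy())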
Bridging the gap with fastai
Now, let’s see how to use fastai to fine-tune this model on wikitext-2, using all of fastai’s training utilities:
from fastai.text.all import *
Preparing the data
Download the dataset as two csv files:
path = untar_data(URLs.WIKITEXT_TINY)
path.ls()
Now let’s look at what those csv files contain:
df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
df_train.head()
We want to gather all of our text in a single numpy array:
all_texts = np.concatenate([df_train[0].values, df_valid[0].values])
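A quick peek at what we gathered (the exact count depends on the dataset splits):
len(all_texts), all_texts[0][:100]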
To process this data for training, we’ll need to build a Transform that will be applied lazily. In this case we could do the pre-processing once and for all and use the transform only for decoding, but HuggingFace’s fast tokenizer is quick enough that tokenizing on the fly doesn’t noticeably hurt performance.
With a fastai Transform, you can define:
- an encodes method that is applied when you call the transform (a bit like the forward method in a nn.Module)
- a decodes method that is applied when you call the decode method of the transform, if you need to decode anything for showing purposes (like converting ids to a text here)
- a setups method that sets some inner state of the Transform (not needed here so we skip it)
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))
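A quick round-trip check: calling the transform runs encodes, and its decode method runs decodes (standard fastai Transform behavior):
tfm = TransformersTokenizer(tokenizer)
out = tfm('This is an example of text, and')  # tensor of token ids
tfm.decode(out)                               # back to the original text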