Natural Language Processing (NLP) is the biggest topic of discussion when it comes to AI these days. With modern NLP tools we can generate, translate, classify, and analyze texts. In this blog post we will look at some of the fundamental ideas behind the new and shiny “Large Language Models” that have been breaking the internet and, to some extent, the AI community itself with several discussions on safety and regulations.

A language model is a model whose purpose is to guess the next word in a sequence. It is a form of self-supervised learning which is a learning task where the data is not labeled and instead labels are automatically extracted from the data. Of course, NLP is not the only domain where such a technique is used but for now we will focus on self-supervised learning for NLP.

The chapter is dedicated to classifying IMDB movie reviews. However, doing that well involves fine-tuning a language model that was trained to guess the next word in Wikipedia articles. The fine-tuning is done on an IMDB review corpus to ensure that our model knows some of the movie-specific lingo used in reviews that may have been missing from the Wikipedia dataset.

So, the steps are as follows:
* Train a language model to guess the next word in Wikipedia articles (this step is already done; we are given this pretrained language model from the beginning).
* Fine-tune the language model to guess the next word in another dataset - in this case the IMDB dataset.
* Use the fine-tuned language model and the language it has learned to carry out a classification task.

This approach is named Universal Language Model Fine-Tuning (ULMFiT) and was state of the art when it first came out. Of course, now there are fancier toys we can play with.

We now have a plan which is great but don’t Deep Learning models work on numbers like floats? How would we ever feed it actual language? Well, that’s the rest of the blog so let’s get started.

Text Preprocessing

Training a model on text involves many non-trivial steps to turn the text into data that computers can process, i.e. numbers. These preprocessing steps strongly affect how the model performs on a given task and are therefore an important area of discussion.

Here are the steps we will need:
* Create a vocabulary. The vocabulary will consist of words or tokens in the corpus given certain constraints.
* Replace each word in the corpus with its index in the vocabulary.
* Create an embedding matrix containing a row for each token in the vocab.
* Use this embedding matrix as the first layer of a neural network.

To train the language model where our goal is to predict the next word in a sequence, first we concatenate all of the documents in our dataset into one big long string and split it into words (or tokens). Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second to last, and our dependent variable will be the sequence of words starting with the second word and ending with the last word.
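
The shift between the independent and dependent variable can be sketched in plain Python. This is only an illustration on a made-up two-document corpus (the `docs` list is an assumption, and splitting on spaces stands in for a real tokenizer):

```python
# Toy corpus standing in for the concatenated documents (hypothetical data).
docs = ["the movie was great", "the plot was thin"]

# Concatenate everything into one long stream and split it into tokens.
stream = " ".join(docs).split()

# Independent variable: every token except the last one.
x = stream[:-1]
# Dependent variable: every token except the first one (x shifted by one).
y = stream[1:]

# Each input token is paired with the token that follows it.
for inp, target in list(zip(x, y))[:3]:
    print(f"{inp!r} -> {target!r}")
```

Every position in `x` is paired with the token that comes right after it, which is exactly the "guess the next word" task.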

The vocab will consist of the vocab from the pretrained language model AND the tokens of the new corpus it is fine-tuned on. As a result, the embedding matrix reuses the rows of the pretrained embedding matrix for the tokens that the pretrained model already knows, while the rows corresponding to the new tokens are initialized with random numbers.

Let’s look at the two most important steps in training a language model next - tokenization and numericalization.

Tokenization

This is the process of breaking a text down into tokens, which can be words, subwords, or even characters. Tokenization is a very challenging process as we have to worry about how we represent certain words as well as punctuation and the many other complexities of natural languages. Not all languages behave the same, e.g. Japanese and Chinese don’t have a well-defined sense of a “word”. In certain languages you can combine different words in different ways to create a new “word”. Thus, we need different methods of tokenization. Tokenization is an active field of research, and new and improved tokenizers are coming out all the time. Let’s look at a few of the most common ones provided by fastai.

Word Tokenizer

Using a library comes with a lot of abstractions that we do not have to think about too much. That’s especially true for a layered API library like fastai. While we do not really have to understand what is happening when we call certain functions, it is still a good idea to have some idea in case we need to debug. Fastai relies on external libraries like spaCy to handle a lot of text preprocessing. Let’s see how we would tokenize a long text into words (and punctuation) using fastai.

First we need to download some data. In this case we are using the IMDB dataset that’s provided by the library.

from fastai.text.all import *

path = untar_data(URLs.IMDB)

Once the data is available, we need to somehow read it. The way you read these text files is very similar to how we read all the images. First we need to get a hold of all the names. Then we can open one of the files to inspect.

files = get_text_files(path, folders = ["train", "test", "unsup"])

# display the first 75 characters of the first text file 
txt = files[0].open().read(); txt[:75]

Now the next step is to use the following tools that fastai provides.
* WordTokenizer() - Fastai’s default word tokenizer
* coll_repr() - a function that, when given a collection and an integer n, shows the first n items of the collection.

wordtokenizer = WordTokenizer()
tokens = first(wordtokenizer([txt])) # first() gets the first item
print(coll_repr(tokens, 30))

The WordTokenizer() instance will handle a lot of the heavy lifting for us. For example, it splits contractions such as “it’s” into the separate tokens “it” and “’s”, and it understands when a period “.” represents the end of a sentence and when it is part of a token like “U.S.” In addition to the default functionality of the external library that WordTokenizer() calls, fastai does some extra things when we wrap the word tokenizer in its Tokenizer class. It adds some special tokens that signal certain things about the text. E.g. “xxbos” is a token that marks the beginning of a stream (here, a new review). This start token tells the model to “forget” what was said previously and focus on upcoming words.

tkn = Tokenizer(wordtokenizer) 
print(coll_repr(tkn(txt), 30))

Some of the special tokens are:

xxbos: Marks the beginning of a text (here, a review)
xxmaj: Specifies that the next word begins with an uppercase letter
xxunk: Stands in for a token that is not in the vocab

Certain rules are in place to deal with the presence of these tokens. The default ones can be viewed using the following:

defaults.text_proc_rules

These are some of the default rules:

replace_rep: Replaces a character repeated three or more times with a special token for repetition (xxrep), the number of times it’s repeated, then the character
replace_maj: Replaces a capitalized word with the xxmaj token followed by the word in lowercase
lowercase: Converts all text to lowercase
fix_html: Replaces special HTML characters with a readable version
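
To make two of these rules concrete, here is a simplified sketch written with plain regular expressions. It is not fastai’s actual implementation: the function names `replace_rep_sketch` and `replace_maj_sketch` and the exact patterns are assumptions chosen for illustration.

```python
import re

# Any non-space character followed by at least two more copies of itself.
_repeated = re.compile(r"(\S)\1{2,}")
# A word that starts with one uppercase letter followed by lowercase letters.
_capitalized = re.compile(r"\b[A-Z][a-z]+\b")

def replace_rep_sketch(text):
    """Replace 'cccc' with ' xxrep 4 c ' (simplified illustration)."""
    def _sub(match):
        char = match.group(1)
        count = len(match.group(0))
        return f" xxrep {count} {char} "
    return _repeated.sub(_sub, text)

def replace_maj_sketch(text):
    """Replace 'Word' with 'xxmaj word' (simplified illustration)."""
    return _capitalized.sub(lambda m: "xxmaj " + m.group(0).lower(), text)

print(replace_rep_sketch("so goood!!!!").split())
print(replace_maj_sketch("The Movie"))
```

The model never sees the raw repetition or capital letter; it sees a marker token plus a normalized form, which keeps the vocab smaller.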

That’s all for word tokenizers. They are fairly simple to use but not as universally effective as the next one we will see - subword tokenizer.

Subword Tokenizer

Word tokenizers work under the assumption that whitespace is a meaningful delimiter between words. However, this is not true for all languages, e.g. Japanese. Therefore, we need a tokenizer that handles those cases. This is a two-step process:

Create a vocab from the most commonly occurring groups of letters - this is done using the setup function.
Tokenize using the vocab created.

Let’s look at these two steps in action.

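# grab the corpus to train the tokenizer on to create the vocab
files = get_text_files(path, folders = ["train", "test", "unsup"])
txts = L(o.open().read() for o in files[:200])

def subword(vocab_sz):
  # vocab_sz must be passed by keyword; the first positional argument is the language
  subword_tokenizer = SubwordTokenizer(vocab_sz=vocab_sz)
  # learn the subword vocab from the sample texts
  subword_tokenizer.setup(txts)
  # tokenize the review loaded earlier (txt) and show its first 40 tokens
  return ' '.join(first(subword_tokenizer([txt]))[:40])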

The vocab_sz is a hyperparameter that will define the granularity of the tokenization process. The smaller the vocab size, the fewer characters are represented by each token and as a result, the more tokens we will need to represent a sentence and vice versa. The two extremes are therefore word-level tokenization and character-level tokenization. There is also a trade-off in choosing the extremes: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.

With our sentences broken into tokens, we now need to turn them into numbers using some strategy. That’s what we will look at next.

Numericalization

In order to pass the tokens we just created into a model we need to turn them into numbers. That’s what the instances of Numericalize() class do. Let’s try to use it on a subset of the texts showing all the steps we have done so far as well.

from fastai.text.all import * 
# get files
path = untar_data(URLs.IMDB)
files = get_text_files(path, folders=["train", "test", "unsup"])

# select a subset 
txts = L(o.open().read() for o in files[:200])

# tokenize
word_toknzr = WordTokenizer()
tokenizer = Tokenizer(word_toknzr)
tokens = txts.map(tokenizer)

# numericalize
num = Numericalize()
num.setup(tokens)

# display the vocab of the Numericalize object
coll_repr(num.vocab,20)

The vocab will first contain all the special tokens (xxbos, xxmaj, etc) and then each token once in order of frequency. The defaults to Numericalize are min_freq=3 and max_vocab=60000. max_vocab=60000 results in fastai replacing all words other than the most common 60,000 with a special unknown word token, xxunk, so we don’t end up creating an embedding matrix that is too large and takes too long to train. Another advantage of replacing uncommon words with xxunk is that we don’t take up too much memory to represent words that almost never show up. The min_freq parameter sets a second threshold: with min_freq=3, any word that appears fewer than three times is also replaced with xxunk.
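
The effect of min_freq and max_vocab can be sketched without fastai. The helper names `build_vocab` and `numericalize` and the shortened list of special tokens below are assumptions for illustration, not the library’s code:

```python
from collections import Counter

# A shortened, illustrative list of special tokens.
SPECIAL = ["xxunk", "xxpad", "xxbos"]

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Special tokens first, then frequent tokens in order of frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common(max_vocab)
                if n >= min_freq and tok not in SPECIAL]
    return SPECIAL + frequent

def numericalize(tokens, vocab):
    """Map each token to its index; unseen or rare tokens map to xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unk = index["xxunk"]
    return [index.get(tok, unk) for tok in tokens]

docs = [
    ["xxbos", "the", "film", "was", "good"],
    ["xxbos", "the", "film", "was", "bad"],
    ["xxbos", "the", "ending", "was", "odd"],
]
vocab = build_vocab(docs, min_freq=3)
ids = numericalize(docs[0], vocab)
print(vocab)
print(ids)
```

Here "film" appears only twice, so with min_freq=3 it falls out of the vocab and is numericalized as the index of xxunk.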

To look at the indices of each token in a text, we can map the Numericalize instance we just set up over the tokenized texts.

nums = tokens.map(num)
nums

Our data preprocessing pipeline is almost complete. All we need now is to create batches of data and then we can train!

Creating Batches in NLP

As usual our goal is to create a huge tensor instead of many smaller tensors in order to harness the power of modern GPUs. Just like in CV, where images might be different sizes, different sentences are also of different lengths. This process will not be as simple as just resizing our sentences. We have to be careful so we do not lose information about the sentence but also preserve the order of the sentences.

Given that we have N tokens and we want batches of size bs, we need to break the tokens into bs lists of length N/bs. As a more concrete example, if we have 90 tokens and we want a batch size of 6, we will break the 90 tokens into 6 lists of length 90/6 = 15.

This is a small example but most of our texts might be huge; too big to load into a GPU in one go. Therefore, we have to come up with a way of dividing them into sub-sequences of fixed length. So, let’s now further divide the length 15 sequences into subsequences of length 5.

Let’s summarize what I just said using the IMDB dataset.

Step 1: Concatenate the individual reviews into one single stream. At the beginning of each epoch randomize the order of the different reviews before concatenating them.
Step 2: Divide the long stream into batches while preserving the order of the streams. For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. In order to ensure the model knows when a review ended and a new one started an xxbos token is added at the start of each text during preprocessing.
Step 3: Divide the minibatch streams further into mini-streams of fixed lengths and feed the mini-streams in order to the model.
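
The splitting in Steps 2 and 3 can be sketched with plain lists, reusing the small example from above (90 tokens, a batch size of 6, subsequences of length 5). The function name `split_stream` is an assumption, and the sketch only produces the inputs; the targets would be the same slices shifted by one token.

```python
def split_stream(stream, bs, seq_len):
    """Cut one long token stream into batches of shape (bs, seq_len)."""
    row_len = len(stream) // bs
    # Step 2: bs contiguous mini-streams, each of length row_len.
    rows = [stream[i * row_len:(i + 1) * row_len] for i in range(bs)]
    # Step 3: walk along all mini-streams in parallel, seq_len tokens at a time.
    return [[row[start:start + seq_len] for row in rows]
            for start in range(0, row_len, seq_len)]

stream = list(range(90))  # integers stand in for numericalized tokens
batches = split_stream(stream, bs=6, seq_len=5)

print(len(batches))    # number of batches
print(batches[0][0])   # first row of the first batch
print(batches[1][0])   # first row of the second batch continues the first
```

Row i of every batch continues where row i of the previous batch stopped, which is what lets the model carry its state from one batch to the next.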

The fastai library does all of this under the hood when we use an LMDataLoader, which takes the numericalized texts as its input.

nums = tokens.map(num)
dl = LMDataLoader(nums)

We can inspect what the dl object holds using

x,y = first(dl)
print(' '.join(num.vocab[o] for o in x[0][:20]))
print(' '.join(num.vocab[o] for o in y[0][:20]))

The two outputs should be the same except y should be offset by one token.

This is all the preprocessing that is involved. Now that we have our preprocessing steps settled, we need to train a model!

Training a Text Classifier with fastai

The classifier training will be a two step process:

Fine-tune the pretrained model (on wikipedia texts) using the IMDB dataset
Use the fine-tuned language model to then create a classifier for the IMDB dataset.

Using `TextBlock` and the `DataBlock` API

All the preprocessing tasks we have talked about so far are handled by the TextBlock when passed to DataBlock. As a result all the arguments to Tokenizer and Numericalize can be passed to the TextBlock. The arguments to create a language model are slightly different from creating a classifier model. Let’s go through with the language model for the first step of fine-tuning.

# the DataBlock class only passes along one argument which is the path so all the others have to be
# passed before
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
  blocks = TextBlock.from_folder(path, is_lm=True),
  get_items=get_imdb,
  splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

You’ll notice a few differences between this and the dataloaders we created before with ImageBlock. The first obvious one is that we don’t have a separate block for the output. That’s because there are no separate labels: the targets are the input tokens shifted by one. The second is that we aren’t passing the TextBlock class itself but the result of one of its class methods, .from_folder, which tells fastai that the documents are text files stored in folders.

Another difference is that you also pass the sequence length, seq_len, to the dataloaders method. We did not do that for the image dataloaders since there is no sense of “sequence”.

This is a slow process as all the documents have to be loaded, read, and then a vocab can be built. As a result fastai parallelizes this process and also saves the preprocessed info in a temporary folder.

As usual we can call show_batch on a dataloaders object.

dls_lm.show_batch(max_n=2)

In this case we should get 2 examples, each showing an input text next to its target, which is the same text offset by one token.

Fine-tuning the language model

We have our dataloaders in place. All we have to do now is fine-tune. Let’s create a learner first of course.

learn = language_model_learner(
  dls_lm, AWD_LSTM, drop_mult = 0.3,
  metrics=[accuracy, Perplexity()]
  ).to_fp16()

We have not talked about architectures for sequence modeling yet. That’s a later topic. All we need to know is that there are special architectures for language models and we are using the AWD_LSTM model in this case. drop_mult is a multiplier that scales all the dropout probabilities in the model. The default loss function for the language_model_learner is cross-entropy loss, as this is essentially a classification problem where we try to find the most likely next token. Finally, one of the metrics here is also new - Perplexity - which is just the exponential of the loss.
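
The link between cross-entropy loss and perplexity can be checked by hand. The probabilities below are made up for illustration: they stand for the probability a hypothetical model assigned to the correct next token at three positions.

```python
import math

# Hypothetical probabilities given to the correct next token at three positions.
probs = [0.25, 0.5, 0.125]

# Cross-entropy loss: the mean negative log-probability of the correct token.
loss = -sum(math.log(p) for p in probs) / len(probs)

# Perplexity is the exponential of that loss.
perplexity = math.exp(loss)

print(f"loss = {loss:.4f}, perplexity = {perplexity:.4f}")
```

A perplexity of 4 here can be read as the model being, on average, as uncertain as if it were choosing uniformly among 4 tokens.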

Training a language model takes a while, especially with such a big dataset. Therefore, as we fine-tune the model we will save it after each epoch. To do that, we opt for the learn.fit_one_cycle() method instead of .fine_tune(). language_model_learner automatically freezes a pretrained model, so this first epoch only trains the embeddings, the only part of the model that contains randomly initialized weights (the rows for tokens that are in the IMDB vocab but not in the pretrained vocab).

learn.fit_one_cycle(1, 2e-2)

Saving and Loading models

To save the state of a model we use:

learn.save('1epoch')

and to load we just use:

learn = learn.load('1epoch')

Once the first run of the fit_one_cycle is completed, we will unfreeze the entire model and run it again a few more times.

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

Then after we are done, we will want to save everything in the model EXCEPT the last layer which is a task-specific layer - in this case the task is to choose the most likely word. In CV, this last layer is called the head and the rest is the body. In NLP, however, the rest of the model is called the encoder. Therefore, to save the encoder we use:

learn.save_encoder('finetuned')

Now we have a language model that is fine-tuned on the IMDB dataset and contains IMDB-specific vocab alongside vocab from Wikipedia.

Aside: Generating synthetic reviews

Our model right now guesses the next word. This means we can actually use it to make some synthetic reviews.

TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]
print("\n".join(preds))

N_WORDS is how many words into the future we want to generate. We generate N_SENTENCES reviews. temperature is a parameter that controls how random the sampling of the next word is: lower values make the model favor its most likely next words more strongly, while higher values spread the probability over more words and give more varied text.
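
One common way to apply a temperature is to divide the model’s raw scores (logits) by it before the softmax. The sketch below shows that formulation on made-up logits; it illustrates the idea and is not necessarily how fastai implements it internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after scaling them by 1/temperature."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate words

sharp = softmax_with_temperature(logits, temperature=0.5)
base = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)

for name, dist in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

The lower the temperature, the more probability mass lands on the top candidate, so sampled text becomes more predictable.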

While this was not our original goal, it is definitely fascinating to be able to do this and do this well enough as this model does!

Let’s get back to our original task of classification.

Creating a classifier using the `DataBlock` API

Since our goal is classification, we need to build a dataloaders for that.

dls_classifier = DataBlock(
  blocks = (TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
  get_items = partial(get_text_files, folders=['train', 'test']),
  get_y = parent_label,
  splitter = GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

dls_classifier.show_batch(max_n = 3)

Let’s talk about the new things in our DataBlock that we haven’t seen yet.

TextBlock: We no longer pass is_lm=True (default is False) because now we want to tell the model that we have the usual labeled data. We also pass in the vocabulary that we want the model to use which in this case is the vocab from the fine-tuned IMDB language model.
GrandparentSplitter: So far we have used a RandomSplitter which splits the data itself randomly. Here, we are using a splitter that splits based on the directory. The grandparent of each file is the directory named “train” or “test”. Here we specify as you see that the validation set has the directory name of “test”.
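
What “parent” and “grandparent” mean for a file can be shown with pathlib alone. The file path below is an assumed example that follows the train/test and pos/neg folder layout of the dataset; it is only for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical file path following the dataset's folder layout.
review = PurePosixPath("imdb/train/pos/0_9.txt")

# The parent directory holds the label (what parent_label reads).
label = review.parent.name
# The grandparent directory says which split the file belongs to
# (what GrandparentSplitter looks at).
split = review.parent.parent.name

print(label, split)
```

Any file whose grandparent folder is named “test” ends up in the validation set, and its label comes from the folder directly above it.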

Dynamic padding of sentences

In PyTorch DataLoaders, all the items in a batch are collated into a single tensor with a fixed shape. We resized our images for this reason. We need to find a way to do this for our texts as well. Resizing or cropping is really not an option for texts, because we would lose valuable information about our sentence, so all we are left with is padding. But in doing so we need to answer the following questions:

How do we pad?
What do we pad with?

The answers to these questions are non-trivial, and the padding needs to be done carefully to ensure that we can train our models efficiently and accurately. In order to be efficient, we dynamically pad the sentences in each batch. That is to say, we do the padding after we break the data into batches, and every sentence is padded to the length of the longest sentence in its batch. To optimize further, we can sort the documents/texts by length and then batch them so we don’t have to pad any sentence by too much relative to the others in a batch.

In order to ensure we are not padding with some token that might show up in some other context, we use yet another special token. This allows the model to know that seeing that token will mean we are only padding it and it contains no real info.
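
A minimal sketch of sorting by length and padding per batch, using plain lists. The padding index `PAD_IDX`, the function name `padded_batches`, and the choice to pad at the end are assumptions for illustration; the library’s own batching has more to it.

```python
PAD_IDX = 1  # hypothetical index of the padding token in the vocab

def padded_batches(docs, bs, pad_idx=PAD_IDX):
    """Sort documents by length, then pad each batch to its own max length."""
    ordered = sorted(docs, key=len)
    for start in range(0, len(ordered), bs):
        batch = ordered[start:start + bs]
        width = max(len(doc) for doc in batch)
        yield [doc + [pad_idx] * (width - len(doc)) for doc in batch]

# Numericalized documents of different lengths (made-up indices).
docs = [[5, 6, 7, 8, 9, 10], [5, 6], [5, 6, 7], [5, 6, 7, 8, 9]]
batches = list(padded_batches(docs, bs=2))

for batch in batches:
    print(batch)
```

Because similar lengths are grouped together, the short documents are padded to length 3 rather than to the global maximum of 6.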

Gradual Fine-Tuning in NLP

Let’s create a learner.

learn = text_classifier_learner(
  dls_classifier, AWD_LSTM, drop_mult = 0.5, 
  metrics = [accuracy]
).to_fp16()

We then load the encoder weights into the learner.

learn = learn.load_encoder('finetuned')

We use load_encoder instead of load because we have only pretrained weights available for the encoder; load by default raises an exception if an incomplete model is loaded.

In CV applications, fine-tuning can be done all at once. However, it has been found that in NLP gradual unfreezing combined with discriminative learning rates performs better. Let’s try that.

learn.fit_one_cycle(1, 2e-2)

This trains only the last layer as we know that by default the rest are frozen.

Then we unfreeze the last two layers.

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

Then we unfreeze the last three layers.

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

Finally we unfreeze everything and train for a few more epochs.

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
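
To see what a slice of learning rates amounts to, the sketch below spreads the rates geometrically between the two ends of the slice. Both the geometric spacing and the number of parameter groups (5) are assumptions made for illustration; `spread_lrs` is a hypothetical helper, not a fastai function.

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically spaced learning rates from lo (earliest group) to hi (last group)."""
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

# Same bounds as the final training call above, assuming 5 parameter groups.
lrs = spread_lrs(1e-3 / (2.6 ** 4), 1e-3, n_groups=5)

for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.2e}")
```

Under these assumptions each group trains 2.6 times faster than the one before it, which is where the 2.6**4 in the lower bound comes from: four steps of 2.6 separate the first group from the last.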

And there we have it: a classifier that can classify reviews with 94.3% accuracy! Some more clever tricks can push the accuracy to 95.1%.

Putting all the code together for reference

# import
from fastai.text.all import *

# download data
path = untar_data(URLs.IMDB)

### Language model 
get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])
dls_lm = DataBlock(
  blocks = TextBlock.from_folder(path, is_lm=True),
  get_items = get_imdb,
  splitter = RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

learn = language_model_learner(
  dls_lm, AWD_LSTM, drop_mult = 0.3,
  metrics = [accuracy, Perplexity()]
).to_fp16()


learn.fit_one_cycle(1, 2e-2)
learn.unfreeze() # unfreeze all the layers
learn.fit_one_cycle(10, 2e-3)

learn.save_encoder('finetuned')

### Classifier model
dls_classifier = DataBlock(
  blocks = (TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
  get_items = partial(get_text_files, folders=["train", "test"]),
  get_y = parent_label,
  splitter = GrandparentSplitter(valid_name="test")
).dataloaders(path, path=path, bs=128, seq_len=72)

learn = text_classifier_learner(
  dls_classifier, AWD_LSTM, drop_mult = 0.5,
  metrics = [accuracy]
).to_fp16()

learn = learn.load_encoder('finetuned')

learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

learn.save('imdb_classifier')

Training a Text Classifier with Huggingface Transformers

So far we have looked at how we can use the fastai library for NLP tasks. However, there is one more library which is arguably THE library for NLP - Huggingface. In lesson-4, Jeremy shows us a very simple example of getting started with Huggingface Transformers. This section is entirely dedicated to that. We will follow a similar workflow to what we have so far to showcase that this is a universal workflow for NLP tasks.

Getting the data

We use the Kaggle dataset US Patent Phrase-To-Phrase Matching. Jeremy gives us a template to either use kaggle datasets on our local machine or on their kernels. Here’s the template:

import os
# checks if the environment variable exists 
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

!pip install kaggle

# put your credentials here
creds = '{"username":"xxx","key":"xxx"}'

from pathlib import Path

# write the credentials to ~/.kaggle/kaggle.json if that file does not exist yet
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

# add a path to your dataset
path = Path('us-patent-phrase-to-phrase-matching')

# download and extract the dataset
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

We now have access to the data.

Inspecting the data

The data here consists of short documents, where each row in a CSV is one text. For larger documents, each document is usually stored as its own file, and the different document categories are stored in different directories.

The path contains three files - ‘train.csv’, ‘test.csv’, and ‘sample_submission.csv’. Let’s open up the training csv and look at the data.

import pandas as pd

df = pd.read_csv(path/'train.csv')
df.describe(include='object')

We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with “component composite coating” for instance appearing 152 times.

Whether two phrases are related depends on three things: the two phrases themselves of course and also the context. Therefore, before we build our tokens and vocab let’s create a text that concatenates all three.

df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

df.input.head()

On to text preprocessing for NLP.

Tokenization, Numericalization and Batching

Transformers uses a Dataset object for storing a dataset. This is how we create one from a pandas dataframe.

!pip install datasets

from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)
ds

> Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

The Dataset object contains a list of the features and the number of rows.

To tokenize and numericalize the texts we first have to decide on which model to use. This is what we did previously as well. The reason why we need to first and foremost choose a model is that we cannot tokenize texts that use some vocabulary that the model was not pretrained on; that would make no sense. So every step of the preprocessing is model dependent when we use a pretrained model (which should more or less be always if possible).

model_nm = 'microsoft/deberta-v3-small'

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_nm)
tokenizer.tokenize("G'day folks, I'm Jeremy from fast.ai!")

> ['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']

The start of a new word is represented by the ▁ character.
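
Because the ▁ marker records where every word starts, the original sentence can be rebuilt from the tokens with plain string operations. This is only a sketch on the token list printed above; a real tokenizer provides its own decoding method.

```python
# The tokens printed above for the example sentence.
tokens = ['▁G', "'", 'day', '▁folks', ',', '▁I', "'", 'm',
          '▁Jeremy', '▁from', '▁fast', '.', 'ai', '!']

# Glue the pieces together, then turn each word-start marker into a space.
text = ''.join(tokens).replace('▁', ' ').strip()

print(text)
```

Pieces without the marker (such as 'day' or 'ai') attach directly to the previous piece, which is how subwords recombine into full words.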

Let’s create a function that we can map in parallel over the dataset, ds.

def tokenize_func(x):
  return tokenizer(x['input'])

# batched=True passes batches of rows to tokenize_func, which speeds up tokenization
tokenized_ds = ds.map(tokenize_func, batched=True)

This adds a new item to our dataset called input_ids. These input_ids are the indices in the vocab corresponding to the different tokens. We can take a look at the first row.

row = tokenized_ds[0]
row['input'], row['input_ids']

You can also look at the index of a word.

tokenizer.vocab['▁of']
> 265

Finally, we need to prepare our labels. Transformers always assumes that your labels have the column name labels, but in our dataset it’s currently score. Therefore, we need to rename it:

tokenized_ds = tokenized_ds.rename_columns({'score':'labels'})

That is all about the preprocessing. Now let’s work on creating a validation set as well as applying the same preprocessing rules to that data.

Splitting the data

dds = tokenized_ds.train_test_split(0.25, seed=42)
dds

As usual, we want a validation set to know how well we are ACTUALLY performing and whether we are over- or underfitting. Using .train_test_split() creates a DatasetDict object whose keys are the different splits: train and validation (or, as they call it, test). Because we split the already tokenized dataset, both splits have been preprocessed in the same way.

Let’s also have the test set ready.

eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor

eval_ds = Dataset.from_pandas(eval_df)
eval_ds = eval_ds.map(tokenize_func, batched=True)

That’s it. It was that simple.

Metrics and Training

Training using Huggingface requires two things: (1) Pre-defining the arguments of the training (2) Setting up a trainer (much like a learner).

We will do those, but first and foremost we need to define the metric we will use to understand how well the model is doing. The competition wants us to use the Pearson correlation coefficient. So, let’s create a custom metric function.

import numpy as np

def corr_d(eval_preds):
  # np.corrcoef returns the 2x2 correlation matrix of the predictions and the labels.
  # We just need one of the off-diagonal values here.
  pearson_r = np.corrcoef(*eval_preds)[0][1]

  # huggingface wants us to return a dictionary for the metric
  return {
    'pearson': pearson_r
  }
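
To see what the off-diagonal entry of np.corrcoef holds, the sketch below compares it with the Pearson formula written out by hand. The prediction and label values are made up for illustration.

```python
import numpy as np

# Made-up predictions and labels, for illustration only.
preds = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0.0, 0.5, 0.25, 1.0])

# np.corrcoef returns a 2x2 matrix; entry [0][1] is the correlation
# between the first argument and the second.
matrix = np.corrcoef(preds, labels)
r_numpy = matrix[0][1]

# The same quantity from the definition of the Pearson coefficient.
dp = preds - preds.mean()
dl = labels - labels.mean()
r_manual = (dp * dl).sum() / np.sqrt((dp ** 2).sum() * (dl ** 2).sum())

print(matrix.shape, r_numpy, r_manual)
```

The diagonal entries are the correlation of each array with itself (always 1), which is why only an off-diagonal value is of interest.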

Now, we can set up training arguments and trainer to fine-tune our deberta-v3-small model.

from transformers import TrainingArguments, Trainer

bs = 128
lr = 8e-5
epochs = 4

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokenizer, compute_metrics=corr_d)
trainer.train();
preds = trainer.predict(eval_ds).predictions.astype(float)

# we want the predictions to be between 0 and 1 so let's clip the extremes.
preds = np.clip(preds, 0, 1)

preds

The TrainingArguments class takes in a lot of parameters but usually we only fiddle with the batch size, learning rate, and the number of epochs. Another thing you might’ve noticed is num_labels, which is set to 1. For classification tasks we usually have more than one label to predict. In this case, since we are predicting a single number, we use the classification model as a regression model (setting num_labels = 1 tells huggingface to treat it as a regression problem).

Finally to submit to kaggle we do:

submission = Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)

Conclusion

This blog was just an introduction to the world of NLP, which is growing at an extremely fast pace. There’s a lot to learn, especially about the Huggingface Transformers library. Huggingface offers a course on NLP that is probably a good next step. Another very important next step is understanding sequence modeling architectures like RNN, LSTM, GRU, and of course Transformer. All that is to come in the future!