fastText


FastText is an open-source, free, lightweight library that allows users to learn text/word representations and text classifiers.

The major benefits of using fastText are that it works on standard, generic hardware and that the resulting models can later be reduced in size to fit even on mobile devices.

Introduction

Most word-representation techniques assign a distinct vector to each word of the vocabulary, i.e. without any parameters shared between words. In other words, they ignore the internal structure of words, which hurts learning for morphologically rich languages.

Thus, Enriching Word Vectors with Subword Information proposes an alternative approach: learn representations for character n-grams and represent each word as the sum of its n-gram vectors.

Experimental setup

Subword model

The proposed model sisg (Subword Information Skip Gram) is based on the continuous skipgram model introduced by Mikolov et al. (2013b).

Since the base skipgram model ignores the internal structure of words by using a distinct representation for each word, sisg proposes a different scoring function s that takes this internal structure into account:

s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c

where each word w is represented as a bag of character n-grams, \mathcal{G}_w \subset \{1, \ldots, G\} is the set of n-grams appearing in w, G is the size of the n-gram dictionary, z_g is the vector representation of n-gram g, and v_c is the context vector.

For example, with n = 3 the word where is represented by the character n-grams <wh, whe, her, ere, re> plus the special sequence <where>. Here < and > are added as boundary symbols to distinguish prefixes and suffixes from other character sequences.
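
To make this concrete, here is a minimal Python sketch of the n-gram extraction for a single n (in practice fastText extracts all n-grams with 3 ≤ n ≤ 6 and hashes them into a fixed number of buckets):

>>> def char_ngrams(word, n=3):
...     padded = "<" + word + ">"   # add boundary symbols
...     grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
...     return grams + [padded]     # plus the special whole-word sequence
...
>>> char_ngrams("where")
['<wh', 'whe', 'her', 'ere', 're>', '<where>']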

Optimization

The optimization problem is solved using stochastic gradient descent on the negative log-likelihood. Training is carried out in parallel: all threads share the parameters and update the vectors asynchronously.
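
Following the paper, the loss uses negative sampling: for a target word w_t, an observed context word w_c, and a set \mathcal{N}_{t,c} of negatives sampled from the vocabulary, the per-pair objective is

\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{\,s(w_t, n)}\right)

which is summed over all target-context pairs in the corpus.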

Implementation details

As for implementation details, sisg uses word vectors of dimension 300, with 5 negatives sampled at random for each positive example. The size c of the context window is sampled uniformly between 1 and 5. The step size is set to 0.05, the default value in the word2vec package, which works well for the sisg model too.

Also, while building the word dictionary, only words that appear at least 5 times in the training set are kept.
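
For reference, roughly equivalent settings can be expressed with fastText's Python API (the paper used its own training code, so treat this as an approximation; "data.txt" is a placeholder path to a plain-text corpus, and the training output is omitted):

>>> import fasttext
>>> model = fasttext.train_unsupervised(
...     "data.txt",           # placeholder: plain-text training corpus
...     model="skipgram",
...     dim=300,              # word vector dimension
...     neg=5,                # negatives per positive example
...     ws=5,                 # maximum context window size
...     lr=0.05,              # step size
...     minCount=5,           # discard words seen fewer than 5 times
...     epoch=5,              # passes over the data
...     minn=3, maxn=6,       # character n-gram sizes
... )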

Dataset

The sisg model was trained on Wikipedia dumps in nine languages. The raw Wikipedia data was pre-processed with a Perl script, and all datasets were shuffled and used to train the model over 5 passes.

Because of this simplicity, the sisg model trains quickly and does not require heavy preprocessing or supervision.

Text classification with fastText

Text classification is a core problem in many applications, and the fastText tool makes it easy to solve.

Installation

Download and unzip the most recent fastText release:

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip

Move to the fastText directory and install as follows:

$ cd fastText-0.9.2
 
# to install using the command-line tool
$ make
 
# to install via python bindings (we select this approach)
$ pip install .

We check the installation by importing fastText in a Python console:

>>> import fasttext
>>> help(fasttext.FastText)
Help on module fasttext.FastText in fasttext:
 
NAME
    fasttext.FastText
 
DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.
 
FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.
 
    read_args(arg_list, arg_dict, arg_names, default_values)
 
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
 
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
 
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might want to consult standard preprocessing scripts such
        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
 
        The input file must must contain at least one label per line. For an
        example consult the example datasets which are part of the fastText
        repository such as the dataset pulled by classification-example.sh.
 
    train_unsupervised(*kargs, **kwargs)
        Train an unsupervised model and return a model object.
 
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might want to consult standard preprocessing scripts such
        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
 
        The input field must not contain any labels or use the specified label prefix
        unless it is ok for those words to be ignored. For an example consult the
        dataset pulled by the example script word-vector-example.sh, which is
        part of the fastText repository.

Dataset

Let’s download example questions from the Stackexchange:

$ wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

Before training a text classifier we need to split the dataset into training and validation sets. We use the wc command to check the number of lines in the dataset:

$ wc cooking.stackexchange.txt
15404  169582 1401900 cooking.stackexchange.txt

The dataset contains 15404 lines i.e. 15404 examples which we split into a training set of 12404 examples and a validation set of 3000 examples:

$ head -n 12404 cooking.stackexchange.txt > cooking.train
$ tail -n 3000 cooking.stackexchange.txt > cooking.valid

Train

To train the text classifier, we import fastText and call the train_supervised function with the training set as its input parameter.

>>> import fasttext
>>> model = fasttext.train_supervised(input="cooking.train")
Read 0M words
Number of words:  8974
Number of labels: 735
Progress: 100.0% words/sec/thread:   77120 lr:  0.000000 avg.loss:  9.961853 ETA:   0h 0m 0s

Save

We save the model with save_model so that we can load it later with the load_model function:

>>> model.save_model("model_cooking.bin")
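
Later, the saved model can be loaded back as follows:

>>> model = fasttext.load_model("model_cooking.bin")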

Test

We can test the model as follows:

>>> model.predict("Which baking dish is best to bake a banana bread ?")
(('__label__baking',), array([0.21342881]))

The predict method predicts the baking tag for the given text input, which is relevant. Let's look at another example:

>>> model.predict("Why not put knives in the dishwasher?")
(('__label__food-safety',), array([0.09138963]))

The label predicted in this case is food-safety, which is not relevant to the given input. To get a better understanding, let's test the model on the validation set:

>>> model.test("cooking.valid")
(3000, 0.172, 0.07438373936860314)

The output contains the number of samples (3000), the precision at one (0.172), and the recall at one (0.074).

What is precision?

Precision is the fraction of the labels predicted by the model that are actually correct, i.e. out of everything predicted positive, how many are truly positive.

What is recall?

Recall is the fraction of the actual (true) labels that the model successfully predicts.
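
Concretely, for one example with true label set L and top-k predictions P_k, the usual definitions are

\text{precision@}k = \frac{|P_k \cap L|}{k}, \qquad \text{recall@}k = \frac{|P_k \cap L|}{|L|}

and fastText's test aggregates these counts over the whole validation set. For instance, if a question carries three true labels and the single top prediction is one of them, precision@1 = 1.0 and recall@1 = 1/3 ≈ 0.33.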

We can also compute the precision and recall at k (here we use k=5) as follows:

>>> model.test("cooking.valid", k=5)
(3000, 0.07286666666666666, 0.1575609052904714)
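
We can also inspect the top-k predictions for a single example with the k argument of predict; it returns a tuple of labels and their probabilities (the exact values depend on the trained model, so the output is omitted here):

>>> model.predict("Why not put knives in the dishwasher?", k=5)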

Optimization

We can improve the performance of the model through the steps described below.

Preprocessing the dataset

The raw dataset contains elements such as uppercase letters and punctuation glued to words, which add noise to the features; normalizing the text usually helps the classifier. We can normalize the dataset using command-line tools such as sed and tr:

$ cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
$ head -n 12404 cooking.preprocessed.txt > cooking.train
$ tail -n 3000 cooking.preprocessed.txt > cooking.valid

Now, we retrain our model on the preprocessed dataset:

>>> model = fasttext.train_supervised(input="cooking.train")
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   46336 lr:  0.000000 avg.loss: 10.019582 ETA:   0h 0m 0s
 
>>> model.test("cooking.valid")
(3000, 0.17466666666666666, 0.07553697563788381)

We observe a slight improvement in the results here; on other datasets the gain from preprocessing can be much larger.

Tweaking number of epochs and learning rate

By default, fastText sees each training example only 5 times (epoch=5), which may be too few depending on the size of the dataset. We can change this with the epoch option while training.

Also, the learning rate corresponds to how much the model changes after processing each example; we can tweak it with the lr option.

>>> model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25)
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   60929 lr:  0.000000 avg.loss:  4.399605 ETA:   0h 0m 0s
 
>>> model.test("cooking.valid")
(3000, 0.585, 0.25299120657344676)

The results improve drastically, so it is evident that experimenting with hyperparameters such as the learning rate and the number of epochs can significantly improve a model's performance.

n-grams

So far we have trained the model on unigrams only, i.e. single words as features, which ignores word order. Instead, we can add word bigrams, i.e. pairs of consecutive words, as features; these let the model capture local word order, which often helps classification.
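
For illustration only, this is what the bigram features of a sentence look like; when we pass wordNgrams=2, fastText extracts and hashes these features internally, so there is no need to do it ourselves:

>>> def word_bigrams(tokens):
...     return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
...
>>> word_bigrams("Why not put knives in the dishwasher".split())
['Why not', 'not put', 'put knives', 'knives in', 'in the', 'the dishwasher']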

>>> model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   66974 lr:  0.000000 avg.loss:  3.152711 ETA:   0h 0m 0s
 
>>> model.test("cooking.valid")
(3000, 0.6083333333333333, 0.2630820239296526)

The results have further improved with just a single easy step.

Hierarchical Softmax

Finally, we replace the regular softmax loss with the hierarchical softmax, an approximation of the full softmax that makes training much faster when the number of output labels is large.

>>> model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:  899564 lr:  0.000000 avg.loss:  2.271247 ETA:   0h 0m 0s

Here, bucket sets the number of buckets used to hash the n-gram features, and dim is the dimension of the word vectors.

Autotune

We have seen that finding good hyperparameters is crucial for building efficient models, but searching for them by hand is tedious. This is where fastText's autotune feature comes in.

fastText's autotune feature performs hyperparameter optimization automatically when you provide a validation file via the autotuneValidationFile parameter.

>>> model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')
Progress: 100.0% Trials:   12 Best score:  0.335514 ETA:   0h 0m 0s
Training again with best arguments
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   66732 lr:  0.000000 avg.loss:  4.540132 ETA:   0h 0m 0s
 
>>> model.test("cooking.valid")
(3000, 0.5583333333333333, 0.24145884388064004)

Autotune searches for the hyperparameters that give the best F1-score on the validation file; the search runs for 5 minutes by default, which can be changed with the autotuneDuration parameter (in seconds).
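
For example, to let the search run for 10 minutes instead (output omitted, since it depends on the trials):

>>> model = fasttext.train_supervised(
...     input="cooking.train",
...     autotuneValidationFile="cooking.valid",
...     autotuneDuration=600,    # in seconds
... )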

What is F1-score?

The F1-score combines precision and recall into a single number, as shown below. It is a useful measure when we want a proper balance between precision and recall.
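
The standard definition is the harmonic mean of the two:

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}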

Reference

  1. Text classification
