We Can Hear Languages Different as a Baby

BERT Text Classification in a different language

Build a non-English (German) BERT multi-class text classification model with HuggingFace and Simple Transformers.

Philipp Schmid

Photo by Ivan Shilov on Unsplash

Originally published at https://www.philschmid.de on May 22, 2020.

Introduction

Currently, about 7.5 billion people live in roughly 200 nations around the world. Only about 1.2 billion of them speak English, and far fewer speak it natively. This leads to a lot of unstructured non-English textual data.

Most tutorials and blog posts demonstrate how to build text classification, sentiment analysis, question-answering, or text generation models with BERT-based architectures in English. To fill this gap, I am going to show you how to build a non-English multi-class text classification model.

World of native English speakers

Before we start, I assume it is safe to say that you have heard of BERT. If you haven't, or if you'd like a refresher, I recommend reading the BERT paper.

In deep learning, there are currently two options for how to build language models. You can build either monolingual models or multilingual models.

"multilingual, or not multilingual, that is the question" — as Shakespeare would have said

Multilingual models are machine learning models that can understand several languages. An example is mBERT from Google Research, which supports 104 languages. Monolingual models, as the name suggests, understand one language.

Multilingual models already achieve good results on certain tasks. But these models are bigger and need more data and more training time, which leads to higher costs.

For this reason, I am going to show you how to train a monolingual non-English BERT-based multi-class text classification model. Wow, that was a long sentence!

BERT — GOT Meme

Tutorial

We are going to use Simple Transformers — an NLP library based on the Transformers library by HuggingFace. Simple Transformers allows us to fine-tune Transformer models in a few lines of code.

As the dataset, we are going to use GermEval 2019, which consists of German tweets. We are going to detect and classify abusive language in these tweets. The tweets are categorized into 4 classes: PROFANITY, INSULT, ABUSE, and OTHER. The highest score achieved on this dataset is 0.7361.

We are going to:

  • install Simple Transformers library
  • select a pre-trained monolingual model
  • load the dataset
  • train/fine-tune our model
  • evaluate the results of training
  • save the trained model
  • load the model and predict a real example

I am using Google Colab with a GPU runtime for this tutorial. If you are not sure how to use a GPU runtime, take a look here.

Install Simple Transformers library

First, we install simpletransformers with pip. If you are not using Google Colab, you can check out the installation guide here.
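In a Colab cell this step is usually the one-liner `!pip install simpletransformers`. Outside a notebook, the same thing can be sketched in plain Python as below. The helper names `pip_install_command` and `ensure_installed` are made up for this illustration and are not part of any library.

```python
import importlib.util
import subprocess
import sys
from typing import List, Optional


def pip_install_command(package: str) -> List[str]:
    """Build the pip command for the interpreter that is currently running."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(package: str, module_name: Optional[str] = None) -> bool:
    """Install `package` with pip unless its module can already be imported.

    Returns True when an installation was triggered, False otherwise.
    """
    module_name = module_name or package
    if importlib.util.find_spec(module_name) is not None:
        return False
    subprocess.check_call(pip_install_command(package))
    return True


# Usage (downloads packages, so it is left commented out here):
# ensure_installed("simpletransformers")
```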

Select a pre-trained monolingual model

Next, we select the pre-trained model. As mentioned above, the Simple Transformers library is based on the Transformers library from HuggingFace. This enables us to use every pre-trained model provided in the Transformers library and all community-uploaded models. For a list that includes all community-uploaded models, see https://huggingface.co/models.

We are going to use the distilbert-base-german-cased model, a smaller, faster, cheaper version of BERT. It uses 40% fewer parameters than bert-base-uncased and runs 60% faster while still preserving over 95% of BERT's performance.

Load the dataset

The dataset is stored in two text files that we can retrieve from the competition page. One option to download them is to use two simple wget commands.
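For readers who prefer to stay in Python, the download can be sketched with the standard library. The function name `download_file` is hypothetical, and the URLs are deliberately left as placeholders because the exact file locations are not given here; take them from the competition page.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def download_file(url: str, target_dir: str = ".") -> Path:
    """Download `url` into `target_dir` and return the local file path."""
    # Reuse the last path segment of the URL as the local file name.
    file_name = Path(urlparse(url).path).name
    target = Path(target_dir) / file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading it into memory.
    with urlopen(url) as response, open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target


# Placeholder URLs (assumptions): replace with the links from the competition page.
# train_path = download_file("<URL of the training file>")
# gold_path = download_file("<URL of the gold-label file>")
```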

Afterward, we use some pandas magic to create a dataframe.
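A minimal sketch of this step follows. It assumes that each line of the files holds the tweet text, a coarse label, and a fine-grained label separated by tabs; check this against the downloaded files. The integer ids assigned to the four classes and the function names are arbitrary choices for this sketch. Simple Transformers expects a DataFrame with a `text` column and a `labels` column.

```python
import csv

# The four classes named in the article, mapped to integer ids.
# The order is an arbitrary choice for this sketch.
LABEL_TO_ID = {"OTHER": 0, "PROFANITY": 1, "INSULT": 2, "ABUSE": 3}


def read_rows(path):
    """Parse a tab-separated file into a list of [text, label_id] rows.

    Assumed layout (verify against your files): tweet text, coarse label,
    fine-grained label, one example per line.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside tweets as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for record in reader:
            if len(record) < 3:
                continue  # skip blank or malformed lines
            text, fine_label = record[0], record[2].strip()
            rows.append([text, LABEL_TO_ID[fine_label]])
    return rows


def to_dataframe(rows):
    """Build the two-column DataFrame; pandas is imported lazily."""
    import pandas as pd

    return pd.DataFrame(rows, columns=["text", "labels"])


# Usage with hypothetical file names:
# df = to_dataframe(read_rows("train.txt") + read_rows("gold.txt"))
```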

Since we don't have a test dataset, we split our dataset into train_df and test_df. We use 90% of the data for training (train_df) and 10% for testing (test_df).
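One way to sketch the 90/10 split is to shuffle the row indices with a fixed seed and cut them once. The function name `split_indices` and the seed value are illustrative assumptions. In practice, `train_test_split` from scikit-learn is a common alternative that does the same job.

```python
import random


def split_indices(n_rows, test_fraction=0.10, seed=42):
    """Return (train_idx, test_idx) lists that partition range(n_rows)."""
    indices = list(range(n_rows))
    # A seeded shuffle makes the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# With a pandas DataFrame `df` this would be applied as:
# train_idx, test_idx = split_indices(len(df))
# train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```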

Load pre-trained model

The next step is to load the pre-trained model. We do this by creating a ClassificationModel instance called model. This instance takes the parameters of:

  • the architecture (in our case "bert")
  • the pre-trained model ("distilbert-base-german-cased")
  • the number of class labels (4)
  • and our hyperparameters for training (train_args).

Simple Transformers exposes a wide range of hyperparameters. For a detailed description of each attribute, please refer to the documentation.
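The four parameters listed above can be sketched as follows. The values in `train_args` are illustrative assumptions, not necessarily the settings used for the reported results, and `create_model` is a hypothetical wrapper name. The `ClassificationModel` import is done inside the function so the configuration can be read without the library installed.

```python
MODEL_TYPE = "bert"  # architecture string as given in the article
MODEL_NAME = "distilbert-base-german-cased"
NUM_LABELS = 4

# Illustrative hyperparameters (assumptions for this sketch).
train_args = {
    "num_train_epochs": 4,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "fp16": False,
}


def create_model(model_name=MODEL_NAME, model_type=MODEL_TYPE, use_cuda=True):
    """Create a ClassificationModel for the 4-class task."""
    # Imported lazily so this module loads without simpletransformers.
    from simpletransformers.classification import ClassificationModel

    # Simple Transformers also accepts "distilbert" as a model type,
    # which may be worth trying with this checkpoint.
    return ClassificationModel(
        model_type,
        model_name,
        num_labels=NUM_LABELS,
        args=train_args,
        use_cuda=use_cuda,
    )


# model = create_model()  # downloads the pre-trained weights on first use
```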

Train/fine-tune our model

To train our model, we only need to run model.train_model() and pass it the dataset to train on.
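As a small sketch, the call can be wrapped so that the training time is measured as well. The wrapper name `fine_tune` is hypothetical; `model` is assumed to be a Simple Transformers `ClassificationModel` and `train_df` a DataFrame with `text` and `labels` columns.

```python
import time


def fine_tune(model, train_df):
    """Fine-tune `model` on `train_df` and return the elapsed seconds."""
    start = time.perf_counter()
    model.train_model(train_df)
    return time.perf_counter() - start


# elapsed = fine_tune(model, train_df)
# print(f"training took {elapsed:.0f} s")
```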

Evaluate the results of training

After training our model, we can evaluate it. For this, we create a simple helper function f1_multiclass(), which calculates the F1 score. The F1 score is the harmonic mean of precision and recall and is a common measure of classification quality. More on that here.
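A dependency-free sketch of such a helper is shown below. Which averaging mode to use is an assumption here, so the `average` parameter supports both "micro" and "macro". In practice, `sklearn.metrics.f1_score(labels, preds, average=...)` is the usual way to compute the same quantity.

```python
from collections import Counter


def f1_multiclass(labels, preds, average="micro"):
    """Micro- or macro-averaged F1 score for single-label classification."""
    labels, preds = list(labels), list(preds)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(labels, preds):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when nothing is counted.
        denominator = 2 * t + p + n
        return 2 * t / denominator if denominator else 0.0

    if average == "micro":
        return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    if average == "macro":
        classes = set(labels) | set(preds)
        if not classes:
            return 0.0
        return sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    raise ValueError("average must be 'micro' or 'macro'")


# eval_model accepts extra metrics as keyword arguments and calls each one
# with (true labels, predictions):
# result, model_outputs, wrong_predictions = model.eval_model(
#     test_df, f1=f1_multiclass
# )
```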

We achieved an f1_score of 0.6895. Initially, this seems rather low, but keep in mind that the highest submission at GermEval 2019 was 0.7361. We would have achieved a top 20 rank without tuning the hyperparameters. This is pretty impressive!

In a future post, I am going to show you how to achieve a higher f1_score by tuning the hyperparameters.

Save the trained model

Simple Transformers saves the model automatically every 2000 steps and at the end of the training process. The default directory is outputs/, but output_dir is a hyperparameter and can be overridden. I created a helper function pack_model(), which we use to pack all required model files into a tar.gz file for deployment.
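One possible shape for such a helper is sketched below. The signature and default values are assumptions for this sketch. It packs only the files directly inside the model directory and skips sub-directories such as intermediate checkpoints.

```python
import tarfile
from pathlib import Path


def pack_model(model_path="outputs", file_name="model"):
    """Pack the files directly inside `model_path` into `<file_name>.tar.gz`."""
    archive_path = Path(f"{file_name}.tar.gz")
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for item in sorted(Path(model_path).iterdir()):
            if item.is_file():  # skip sub-directories such as checkpoints
                # arcname stores the file without its parent directories.
                archive.add(str(item), arcname=item.name)
    return archive_path


# pack_model("outputs", "germeval-distilbert-german")  # hypothetical archive name
```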

Load the model and predict a real example

As a final step, we load the model and predict a real example. Since we packed our files a step earlier with pack_model(), we have to unpack them first. For this, I wrote another helper function, unpack_model(), to unpack our model files.
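A sketch of what an unpacking helper could look like follows; the signature is an assumption. Where the running Python version supports tarfile extraction filters, the sketch uses the "data" filter, which rejects unsafe archive members. Even so, only unpack archives from a trusted source.

```python
import tarfile
from pathlib import Path


def unpack_model(archive_path, target_dir="."):
    """Extract a .tar.gz model archive into `target_dir` and return that path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse absolute paths, links out of tree, etc.
            archive.extractall(str(target), filter="data")
        else:
            archive.extractall(str(target))
    return target


# unpack_model("germeval-distilbert-german.tar.gz", "model_files")  # hypothetical names
```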

To load a saved model, we only need to provide the path to our saved files and initialize it the same way as in the training step. Note: you will need to specify the correct args (usually the same ones used in training) when loading the model.

After initializing it, we can use the model.predict() function to classify a given input. In this example, we take two tweets from the GermEval 2018 dataset.
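Loading and predicting can be sketched as below. The function names and the `ID_TO_LABEL` mapping are assumptions for this sketch; the mapping must match whatever integer ids were assigned to the classes when the training labels were built. `model.predict()` takes a list of strings and returns the predicted class ids together with the raw model outputs.

```python
# Assumed id-to-class mapping; it must mirror the one used for training.
ID_TO_LABEL = {0: "OTHER", 1: "PROFANITY", 2: "INSULT", 3: "ABUSE"}


def load_model(model_dir, train_args=None, use_cuda=False):
    """Load a fine-tuned model from the directory with the unpacked files."""
    # Imported lazily so this module loads without simpletransformers.
    from simpletransformers.classification import ClassificationModel

    return ClassificationModel(
        "bert",  # same architecture string as in the training step
        model_dir,
        num_labels=len(ID_TO_LABEL),
        args=train_args,
        use_cuda=use_cuda,
    )


def classify(model, texts):
    """Return the predicted class name for each input text."""
    predictions, _raw_outputs = model.predict(list(texts))
    return [ID_TO_LABEL[int(p)] for p in predictions]


# model = load_model("model_files")            # hypothetical directory name
# print(classify(model, ["<tweet 1>", "<tweet 2>"]))
```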

Our model predicted the correct classes, OTHER and INSULT.

Conclusion

In conclusion, we achieved our goal of creating a non-English BERT-based text classification model.

Our example used German, but the approach can easily be transferred to another language. HuggingFace offers a lot of pre-trained models for languages like French, Spanish, Italian, Russian, Chinese, …

Thanks for reading. You can find the colab notebook with the complete code here.

If you have any questions, feel free to contact me.

Source: https://towardsdatascience.com/bert-text-classification-in-a-different-language-6af54930f9cb
