Sequence-to-Sequence Modelling for English-to-French Translation (NLP)

Link to Colab Notebook (hosted on GitHub)

All the code needed to preprocess the data and train the model is available at the link above. Please run it on Colab to gain a better understanding of each step.

Model and Dataset Information

Translating from one language to another is a sequence-to-sequence task, i.e., mapping one sequence to another.

There are two approaches to training a translation model:

  1. Train from scratch – if you have a large enough corpus to train the model
  2. Fine-tune an existing model – faster and uses fewer resources

Model: We will use a Marian model that is already pre-trained to translate from English to French.
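
As a minimal sketch, the pre-trained checkpoint can be loaded with the Auto classes. The checkpoint name `Helsinki-NLP/opus-mt-en-fr` is an assumption here; the notebook may use a different Marian checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint: a Marian model pre-trained on English -> French
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Quick sanity check of the pre-trained model before any fine-tuning
inputs = tokenizer("Default to expanded threads", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```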

Dataset: To fine-tune the model we will use the KDE4 dataset of localized KDE application files.
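
Loading KDE4 might look like the sketch below: the English-French pair is selected with the `lang1`/`lang2` arguments, and since the corpus ships as a single train split we carve out a test set ourselves (the 90/10 ratio and seed are illustrative):

```python
from datasets import load_dataset

# KDE4 English-French parallel corpus from the Hugging Face Hub
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")

# The corpus has only a "train" split, so create our own train/test split
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
print(split_datasets)
```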


The Colab notebook walks through the following steps to train the model (condensed sketches of each step follow the list):

  1. Preparing the dataset – splitting, padding, and the train/test split
  2. Tokenizer – instantiating the tokenizer for this particular model; we also have to tell the tokenizer the target language for the translation
  3. Data collation – pads the data when we use dynamic batching; labels are padded with -100 so that the padded positions are ignored in the loss computation. We use a special data collator, DataCollatorForSeq2Seq
  4. Evaluation metrics – the SacreBLEU metric is used to evaluate the model's French translations; the score ranges from 0 to 100, the higher the better
  5. Fine-tuning (training) the model – we pass the following to the trainer to start training:
    • model
    • training arguments
    • train dataset
    • eval dataset
    • data collator
    • tokenizer
    • compute metrics
  6. Evaluating the model – we use trainer.evaluate to check the metrics and see how well the model performs
  7. Saving the model – push the model to the Hugging Face Hub after each epoch
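
For steps 1 and 2, a rough sketch of how the tokenizer can prepare the English inputs and French targets in one pass; `max_length` and the function name `preprocess_function` are illustrative choices, not necessarily the notebook's:

```python
max_length = 128

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    # text_target makes the tokenizer encode the French side as labels
    return tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )

tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
```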
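For step 3, the data collator is a one-liner; inspecting a small batch shows the -100 padding on the labels:

```python
from transformers import DataCollatorForSeq2Seq

# Pads inputs with the pad token and labels with -100,
# so padded label positions are ignored by the loss
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
print(batch["labels"])  # padded positions appear as -100
```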
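For step 4, one common way to wire SacreBLEU into a compute_metrics function via the Evaluate library (a sketch; the -100 label padding has to be undone before decoding):

```python
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace the -100 padding in the labels before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = metric.compute(
        predictions=decoded_preds,
        references=[[label] for label in decoded_labels],
    )
    return {"bleu": result["score"]}  # 0-100, higher is better
```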
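Steps 5 to 7 come together in a Seq2SeqTrainer. The repo name and hyperparameters below are assumptions for illustration, not necessarily the notebook's settings:

```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",  # output dir / Hub repo name (assumed)
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,  # generate translations during evaluation
    push_to_hub=True,            # push a checkpoint to the Hub at each save
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate(max_length=max_length))  # SacreBLEU after fine-tuning
trainer.push_to_hub()  # final upload to the Hugging Face Hub
```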

Accelerate – using this library we can write a custom training loop instead of relying on the Trainer API (see the sketch below).
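
A condensed sketch of such a loop, reusing the objects defined in the sketches above (batch size and epoch count are illustrative):

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Accelerator handles device placement and multi-GPU/TPU setup
accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

model.train()
for epoch in range(3):
    for batch in train_dataloader:
        outputs = model(**batch)  # loss is computed from the -100-padded labels
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()
```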