Bidirectional Language Modeling: A Systematic Literature Review

In transfer learning, two major activities, i.e., pretraining and fine-tuning, are carried out to perform downstream tasks. The
advent of the transformer architecture and bidirectional language models, e.g., bidirectional encoder representations from transformers
(BERT), enables the functionality of transfer learning. Moreover, BERT overcomes the limitations of unidirectional language models by
removing the dependency on the recurrent neural network (RNN). BERT also uses the attention mechanism to read input
from either direction and better understand sentence context. The analysis shows that the performance of downstream tasks in transfer
learning depends on factors such as dataset size, step size, and the number of selected parameters. In the state of the art, various
research studies have produced efficient results by contributing to the pretraining phase. However, a comprehensive investigation and
analysis of these research studies is not yet available. Therefore, this article presents a systematic literature review (SLR)
investigating thirty-one (31) influential research studies published during 2018–2020. The following contributions are made in this
paper: (1) Thirty-one (31) models inspired by BERT are extracted. (2) Every model in this paper is compared with RoBERTa
(replicated BERT model), which has a large dataset and batch size but a small step size. It is concluded that seven (7) out of thirty-one
(31) models in this SLR outperform RoBERTa, of which three were trained on a larger dataset while the other four were
trained on a smaller dataset. Moreover, among these seven models, six share both feedforward network (FFN) and attention parameters
across the layers. The remaining twenty-four (24) models are also studied in this SLR with different parameter settings. Furthermore, it
has been concluded that a pretrained model with a large dataset, hidden layers, attention heads, and a small step size, combined with
parameter sharing, produces better results. This SLR will help researchers to pick a suitable model based on their requirements.
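
As a rough illustration of what sharing the FFN and attention across layers means for model size, the sketch below counts encoder weights with and without cross-layer sharing. The function names and the example sizes (12 layers, hidden size 768, FFN size 3072) are assumptions chosen for illustration and are not taken from any specific model in this review; embeddings and layer normalization are left out for simplicity.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Weights and biases of one transformer encoder layer.

    Counts only the attention projections and the feedforward network;
    layer normalization and embeddings are deliberately omitted.
    """
    # Query, key, value, and output projections: four hidden x hidden matrices plus biases.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear maps: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    return attention + feed_forward


def encoder_params(layers: int, hidden: int, ffn: int, share_across_layers: bool) -> int:
    """Total encoder parameters, with or without cross-layer sharing."""
    per_layer = encoder_layer_params(hidden, ffn)
    # With sharing, every layer reuses the same attention and FFN weights,
    # so the stack stores only one layer's worth of parameters.
    return per_layer if share_across_layers else layers * per_layer


if __name__ == "__main__":
    # Hypothetical sizes, assumed for illustration only.
    unshared = encoder_params(layers=12, hidden=768, ffn=3072, share_across_layers=False)
    shared = encoder_params(layers=12, hidden=768, ffn=3072, share_across_layers=True)
    print(f"unshared: {unshared:,} parameters")
    print(f"shared:   {shared:,} parameters")
```

Under these assumptions the shared stack stores one layer's parameters regardless of depth, so the saving grows linearly with the number of layers; the computation per forward pass is unchanged.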