Excellent document about BERT transformers / models and their parameters: - L=number of layers. - H=size of the hidden layer = number of vectors for each word in the sentence. - A = Number of self-attention heads - Total parameters.
Repository of all Bert models, including small. Start using this model for testing.
Jupyter notebook to test Bert
Best summary of Natural Language Processing and terms - model (a language model - e.g. BertModel, defines encoder and decoder and their properties), transformer (a specific neural network based on attention paper), encoder (series of transformers on input), decoders (series of transformers on output). Bert does NOT use decoder. TensorFlow and PyTorch are possible backends to Transformers (NN). Summary: BERT is a highly complex and advanced language model that helps people automate language understanding.
Use the Bert model to train on Yelp dataset