Cross-Attention in Transformer Architecture
[https://vaclavkosar.com/ml/cross-attention-in-transformer-architecture] - - public:isaac
Cross-attention merges two embedding sequences regardless of modality, e.g., image latents with text embeddings in the Stable Diffusion U-Net, analogous to encoder-decoder attention.
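A minimal numpy sketch of the idea (toy dimensions and random weights, not the actual Stable Diffusion code): queries come from one sequence, keys and values from the other, so the first sequence attends over the second.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Queries from one sequence (e.g. image latents),
    keys/values from another (e.g. text-encoder output)."""
    Q = queries @ Wq                      # (n_q, d)
    K = context @ Wk                      # (n_ctx, d)
    V = context @ Wv                      # (n_ctx, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)    # each query attends over the context
    return weights @ V                    # (n_q, d)

# toy example: 4 "image" tokens attend over 3 "text" tokens
rng = np.random.default_rng(0)
d_img, d_txt, d = 8, 6, 4
img = rng.normal(size=(4, d_img))
txt = rng.normal(size=(3, d_txt))
out = cross_attention(img, txt,
                      rng.normal(size=(d_img, d)),
                      rng.normal(size=(d_txt, d)),
                      rng.normal(size=(d_txt, d)))
print(out.shape)  # (4, 4)
```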
Optimum
[https://huggingface.co/docs/optimum/index] - - public:mzimmerm
Optimum is an extension of Transformers that provides performance-optimization tools for training and running models on targeted hardware with maximum efficiency. It also hosts small, mini, and tiny model variants.
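A minimal usage sketch, assuming `optimum[onnxruntime]` is installed; the model id is only an example checkpoint, not something prescribed by the docs page.

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX for ONNX Runtime inference
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimum makes deployment on targeted hardware easier."))
```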
google-research/bert: TensorFlow code and pre-trained models for BERT
[https://github.com/google-research/bert]
BERT Transformers – How Do They Work? | Exxact Blog
[https://www.exxactcorp.com/blog/Deep-Learning/how-do-bert-transformers-work] - - public:mzimmerm
Excellent document about BERT transformer models and their parameters: L = number of layers; H = hidden size, i.e., the dimensionality of the vector representing each token; A = number of self-attention heads; plus the total parameter count.
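A back-of-the-envelope Python sketch of how L and H determine the total parameter count (vocabulary and position sizes are those of the original BERT configurations; A does not change the count, since head dimension = H / A).

```python
def bert_param_count(L=12, H=768, A=12, vocab=30522, max_pos=512, types=2, ffn_mult=4):
    """Approximate encoder parameter count; A is listed only for reference."""
    embed = (vocab + max_pos + types) * H + 2 * H            # embeddings + LayerNorm
    attn = 4 * (H * H + H)                                    # Q, K, V, O projections
    ffn = H * (ffn_mult * H) + ffn_mult * H + (ffn_mult * H) * H + H
    per_layer = attn + ffn + 2 * (2 * H)                      # + two LayerNorms
    pooler = H * H + H
    return embed + L * per_layer + pooler

print(f"{bert_param_count():,}")                       # ~109.5M for BERT-base (L=12, H=768)
print(f"{bert_param_count(L=24, H=1024, A=16):,}")     # ~335M for BERT-large
```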
Solving Transformer by Hand: A Step-by-Step Math Example | by Fareed Khan | Level Up Coding
[https://levelup.gitconnected.com/understanding-transformers-from-start-to-end-a-step-by-step-math-example-16d4e64e6eb1] - - public:mzimmerm
Works through the transformer's computations by hand, step by step.
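A tiny scaled dot-product self-attention example in numpy, in the same by-hand spirit (toy numbers and weights chosen only for illustration, not taken from the article).

```python
import numpy as np

X = np.array([[1.0, 0.0],     # token 1 embedding
              [0.0, 1.0],     # token 2
              [1.0, 1.0]])    # token 3
Wq = np.array([[1.0, 0.0], [0.0, 1.0]])
Wk = np.array([[0.0, 1.0], [1.0, 0.0]])
Wv = np.array([[1.0, 1.0], [0.0, 1.0]])

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(Q.shape[-1])                                 # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                                                       # one output row per token

print(weights.round(3))
print(out.round(3))
```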