EKOAHAM Research · Vol. 7, Issue 2
Research Article

Transformer Architectures: Origins, Foundations, and the State of the Art


Executive Summary: The Transformer architecture, introduced by Vaswani et al. (2017), revolutionised sequence modeling by replacing recurrence with self-attention, enabling much greater parallelism and efficiency in training. This design – a stack of encoder and decoder layers built from multi-head self-attention, feed-forward sublayers, residual connections and layer normalisation – rapidly became the basis of state-of-the-art models in NLP and beyond. Pioneering Transformer-derived models like BERT (2018) and GPT-3 (2020) demonstrated that large pretrained Transformers could generalise across tasks via fine-tuning or few-shot learning. Subsequent research (e.g. Kaplan et al., 2020) quantified “scaling laws” showing predictable gains as model size, data and compute grow. Transformers now underlie modern large language models and multimodal systems because they scale efficiently on modern hardware, support transfer learning, and can be extended (e.g. to sparse or cross-modal attention) in many ways. However, they impose vast compute/data costs and still struggle with biases, factual correctness and true reasoning (e.g. “hallucinations” noted by Lewis et al. (2020)). This article traces the Transformer’s history (from RNN/LSTM predecessors to the breakthrough of “Attention Is All You Need”), details its mathematics and components (self-attention, positional encodings, multi-head layers, etc.), and surveys its dominance (parallelism, pretraining, scaling) and contemporary frontiers (long-context models, multimodality, retrieval augmentation, sparsity, adapters, safety/interpretability). We conclude with open problems and future research directions, supported by rigorous comparison tables and timeline summaries of key milestones.


History and Origins

Early sequence models used recurrent neural networks (RNNs) and their gated variants (LSTM, GRU) to process tokens sequentially. However, by the mid-2010s it was clear that fixed-length “bottleneck” encodings limited performance on long sequences. In 2014–15, Bahdanau et al. introduced a soft attention mechanism for machine translation: the decoder dynamically “searches” for relevant parts of the source sentence when generating each target word, instead of encoding the entire input into one vector. This improved translation quality dramatically (matching state-of-the-art statistical systems) and effectively formed alignment weights between source and target (the “(soft-)alignments” matched human intuition). Luong et al. (2015) further refined attention variants, but these still sat atop RNN encoder–decoder backbones.

The true paradigm shift came with “Attention Is All You Need” (Vaswani et al., 2017). This work proposed the Transformer, an architecture built entirely from attention mechanisms and feed-forward layers, with no recurrence or convolution at all. As the authors explain, RNNs and CNNs in sequence models require sequential computation or large kernels, whereas a pure attention model “draws global dependencies between input and output” in a single layer. Critically, each Transformer layer applies multi-headed self-attention (so every token attends to all others) in parallel, followed by a position-wise feed-forward network, with residual connections and layer normalisation wrapping each sublayer. The original Transformer quickly achieved state-of-the-art BLEU scores on translation benchmarks while training much faster than recurrent models.

Timeline of Key Transformer Milestones. Each row is supported by primary literature or official reports.

| Year | Milestone | Reference |
| --- | --- | --- |
| 2014 | Soft-attention for NMT (Bahdanau et al., ICLR 2015) – introduced alignment-based attention, improving translation quality. | Bahdanau et al. (2015) |
| 2017 | Transformer (Vaswani et al., NeurIPS 2017) – proposed an encoder–decoder solely using self-attention and feed-forward layers (no RNN/CNN). Faster training and SOTA on WMT’14 EN–DE translation. | Vaswani et al. (2017) |
| 2018 | BERT (Devlin et al., NAACL 2019) – pretrained deep bidirectional encoder on BooksCorpus+Wiki; fine-tuned to many NLP tasks (QA, NLI, etc.). Achieved large gains across GLUE/SQuAD benchmarks. | Devlin et al. (2019) |
| 2018 | GPT-1 (Radford et al., OpenAI blog) – introduced the first large pretrained decoder-only Transformer for language modeling. Used unsupervised pretraining followed by supervised fine-tuning, ushering in the GPT series. | Radford et al. (2018) |
| 2019 | GPT-2 (Radford et al., 2019) – a 1.5B-param decoder-only Transformer trained on 40GB of WebText (web pages linked from Reddit). Demonstrated strong zero-shot abilities in text generation and NLP tasks. | Radford et al. (2019) |
| 2020 | GPT-3 (Brown et al., 2020) – scaled up to 175B parameters, trained on diverse web corpora. Achieved unprecedented few-shot performance across many NLP tasks. (In parallel, BERT-era variants proliferated: RoBERTa, XLNet, ALBERT.) | Brown et al. (2020) |
| 2020 | T5 (Raffel et al., 2020) – the “Text-to-Text” unified model. A seq2seq Transformer (up to 11B parameters) pretrained on C4 (Colossal Clean Crawled Corpus); reframes all tasks as text generation (translation, summarization, QA, etc.). Achieved SOTA on numerous benchmarks. | Raffel et al. (2020) |
| 2020 | Vision Transformer (ViT) (Dosovitskiy et al., ICLR 2021) – showed that a pure Transformer (with input as 16×16 image patches) can match or exceed CNNs on vision tasks when pretrained on large image datasets. | Dosovitskiy et al. (2021) |
| 2020–21 | Sparse & Long-Context Models: e.g. Longformer (Beltagy et al., 2020) – uses windowed + global attention to scale linearly for thousands of tokens; BigBird (2020) – random+global sparse attention; Linformer/Reformer – low-rank or hashing attention. These allow processing much longer sequences than the O(n²) naïve self-attention. | Beltagy et al. (2020) |
| 2021 | Mixture-of-Experts: Switch Transformer (Fedus et al., 2021) – a sparse-MoE variant of T5. By routing each token to a single “expert” feed-forward layer, it scaled to over 1T parameters, reporting up to ~7× pre-training speedup over T5-Base at the same computational cost (and ~4× over T5-XXL for its largest models). | Fedus et al. (2021) |
| 2022 | PaLM (Chowdhery et al., 2022) – a 540B-parameter fully dense Transformer trained on 780B tokens. Achieved new state-of-the-art results on language understanding and multi-step reasoning tasks, demonstrating further benefits of massive scale (e.g. outperforming average human performance on BIG-bench). | Chowdhery et al. (2022) |
| 2023 | LLaMA (Touvron et al., 2023) – Meta’s “LLaMA” suite of foundation models (7B–65B parameters) trained on 1–1.4T tokens. Showed that relatively smaller models can approach prior large-LM performance when trained on more data. (Update: LLaMA 2 in mid-2023.) | Touvron et al. (2023) |
| 2023 | GPT-4 (OpenAI, 2023) – a large multimodal Transformer. Achieved human-level performance on many academic/professional benchmarks (e.g. top ~10% on a simulated bar exam). Leverages a Transformer backbone with extensive pretraining plus alignment fine-tuning. | OpenAI (2023) |

This rapid succession of milestones underscores how the Transformer concept – self-attention plus scalable feed-forward layers – became the default architecture. A schematic flow of this evolution is shown below:

flowchart LR
    LSTM["1997: LSTM (Hochreiter & Schmidhuber)"] --> Attention["2014: Attention (Bahdanau et al.)"]
    Attention --> Transformer["2017: Transformer (Vaswani et al.)"]
    Transformer --> BERTGPT["2018: BERT (Devlin) & GPT (Radford)"]
    BERTGPT --> GPT2T5["2019–20: GPT-2, T5, XLNet, etc."]
    GPT2T5 --> SparseEff["2021: Sparse/Long models (Longformer, BigBird), Vision (ViT)"]
    SparseEff --> MoE["2021: Switch Transformer (MoE)"]
    MoE --> PaLM["2022: PaLM (540B)"]
    PaLM --> LLaMA["2023: LLaMA (7B–65B)"]
    LLaMA --> GPT4["2023: GPT-4 (multimodal)"]

Technical Foundations

At the heart of the Transformer is self-attention, a mechanism that computes interactions among all positions in an input sequence. Given input embeddings (or previous-layer features) \(X = [x_1, \dots, x_n]\), each is linearly projected to queries \(Q\) and keys \(K\) of dimension \(d_k\) and values \(V\) of dimension \(d_v\). The scaled dot-product attention is computed as: \[ \text{Attention}(Q,K,V) = \text{softmax}\!\Bigl(\frac{Q K^T}{\sqrt{d_k}}\Bigr)\,V, \] where \(d_k\) is the key dimension. In practice, this means each output at position \(i\) is a weighted sum of all value vectors, with weights given by the softmax of the dot-products between query \(q_i\) and all keys. (Dividing by \(\sqrt{d_k}\) keeps large dot products from pushing the softmax into regions with very small gradients.) Vaswani et al. also compared this to additive (“Bahdanau”) attention; they note that the two have similar theoretical complexity, but that dot-product attention is much faster and more space-efficient in practice because it can use highly optimised matrix-multiplication code. Without the scaling factor, they observe that additive attention outperforms dot-product attention for larger \(d_k\).
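The formula above can be made concrete with a small pure-Python sketch. It is only an illustration under simplifying assumptions: matrices are plain lists of rows, there is no batching or masking, and the function names (`softmax`, `scaled_dot_product_attention`) and toy inputs are ours rather than anything from the original paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n x d_k, K is m x d_k, V is m x d_v (lists of rows).

    Returns (output, weights): output is n x d_v, weights is n x m.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Dot product of this query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is the attention-weighted sum of the value rows.
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights

# Toy example: 2 queries, 3 keys/values (hypothetical numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` sums to 1, and the first query puts equal weight on the first and third keys because both have the same dot product with it.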

To enrich representations, multi-head attention performs multiple attentions in parallel. Concretely, we project the input \(X\) by \(h\) different learnable linear mappings \(\{W_i^Q, W_i^K, W_i^V\}_{i=1}^h\), compute attention in each of the \(h\) subspaces, then concatenate and linearly project the result. This allows the model to attend to information from different representation subspaces at different positions. Formally: \[ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)\,W^O,\quad \text{where }\text{head}_i=\text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V). \] Each Transformer layer contains one multi-head self-attention block (where \(Q\), \(K\) and \(V\) all come from the same sequence) and a feed-forward sublayer. The feed-forward network is the same for every position and consists of two linear layers with a ReLU nonlinearity (GELU in many later models): \[ \text{FFN}(x) = \max(0,\, x W_1 + b_1)\,W_2 + b_2. \]
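As a rough illustration of the multi-head formula, the sketch below projects a toy input with one set of matrices per head, runs attention in each subspace, concatenates the results and applies an output projection. The random weights, the dimensions, and helper names such as `rand_matrix` are assumptions made for the example; practical implementations fuse the per-head projections into single batched matrix multiplies.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for list-of-rows matrices."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for (W_Q, W_K, W_V) in heads
    ]
    # Concatenate the per-head outputs position by position.
    concat = [
        [v for head in head_outputs for v in head[i]]
        for i in range(len(X))
    ]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d_model, h = 3, 8, 2          # toy sizes, chosen for illustration
d_k = d_model // h               # per-head dimension
X = rand_matrix(n, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(h * d_k, d_model, rng)
Y = multi_head_attention(X, heads, W_O)   # n x d_model
```

Choosing the per-head dimension as `d_model // h` keeps the total cost close to that of a single full-width head, which is the choice described in the original paper.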

Every sublayer is wrapped by a residual (skip) connection followed by layer normalisation. Concretely, if the sublayer implements a function \(F(\cdot)\), the output is \(\text{LayerNorm}(x + F(x))\). This design (borrowed from He et al.’s ResNets and Ba et al.’s layer norm) stabilises training even for very deep stacks.
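A minimal sketch of this "Add & Norm" wrapper for a single token vector follows. It assumes the post-norm arrangement written above and, to stay short, omits the learned gain and bias that layer normalisation normally carries; the `add_and_norm` name and the toy sublayer are illustrative only.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and unit variance.

    The learned scale and shift parameters are omitted (assumption).
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + F(x)) for a sublayer function F."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])

# Toy sublayer standing in for attention or the feed-forward network.
def toy_sublayer(x):
    return [2.0 * v for v in x]

y = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

Because the residual path adds `x` back before normalising, a sublayer that outputs zeros leaves the (normalised) input unchanged, which is part of why deep stacks remain trainable.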

Finally, since pure self-attention has no built-in notion of word order, Transformers add positional encodings to the input embeddings. The original Transformer used sinusoids: each position \(pos\) is encoded as \(\sin(pos/10000^{2i/d})\) in the even dimensions \(2i\) and \(\cos(pos/10000^{2i/d})\) in the odd dimensions \(2i+1\), where \(d\) is the model dimension. The authors hypothesised that these functions make it easy for the model to attend by relative position, and that they may allow extrapolation to sequences longer than those seen in training. They also report that learned absolute position embeddings gave nearly identical results; what matters is that each input token’s representation includes its position information.
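The sinusoidal scheme is easy to write down directly. The sketch below builds the encoding table in pure Python; the function name and the handling of an odd model dimension are our own choices, not something specified in the source.

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings.

    Even columns hold sines and the following odd columns hold cosines
    at the same frequency, as in the formula above.
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `col` runs over the even column indices, i.e. col = 2i.
        for col in range(0, d_model, 2):
            angle = pos / (10000 ** (col / d_model))
            pe[pos][col] = math.sin(angle)
            if col + 1 < d_model:
                pe[pos][col + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=4, d_model=6)
```

Position 0 always encodes to alternating 0 and 1 values, and each later position rotates every sine/cosine pair by a frequency that shrinks geometrically across columns. The table is added elementwise to the token embeddings before the first layer.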

The figure below illustrates one encoder layer of a Transformer, showing the multi-head self-attention sublayer with its residual/normalisation and the feed-forward sublayer:

flowchart LR
    Input["Input (embeddings + positional)"] --> MHSA["Multi-Head\nSelf-Attention"]
    MHSA --> AddNorm1["Add & Norm"]
    AddNorm1 --> FFN["Feed-Forward"]
    FFN --> AddNorm2["Add & Norm"]

A full encoder (or decoder) is a stack of such layers. In the original Transformer, the encoder comprises 6 identical layers, and the decoder also 6 layers (each decoder layer has an extra encoder–decoder attention sublayer). Decoders use masked self-attention (preventing tokens from attending to future positions) to enforce auto-regressive generation. Thus, Transformers admit three main variants: (1) Encoder-only (like BERT, for understanding tasks); (2) Decoder-only (like GPT, for text generation); and (3) Encoder–Decoder (like original Transformer or T5, for seq2seq tasks).
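The masked self-attention used in decoders can be sketched as a change to the softmax input: scores for future positions are set to negative infinity so that they receive zero weight. The function below is an illustrative pure-Python version that operates on a precomputed square score matrix; its name is ours.

```python
import math

def causal_attention_weights(scores):
    """Softmax over each row of an n x n score matrix with a causal mask.

    Position i may only attend to positions j <= i.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf, which becomes exactly 0 after exp().
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With uniform scores, row i spreads its weight evenly over positions 0..i.
w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

The resulting weight matrix is lower-triangular, so during training all positions can still be processed in parallel while each prediction depends only on earlier tokens.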

Because attention computes interactions for all token pairs, its complexity per layer is \(O(n^2 d)\) for sequence length \(n\) and hidden size \(d\). By contrast, an RNN layer costs \(O(n d^2)\), and a convolutional layer with kernel size \(k\) costs \(O(k n d^2)\). Vaswani et al. observe that for typical sequence lengths (often \(n<d\)), self-attention is cheaper per layer than recurrence and, unlike recurrence, needs only a constant number of sequential operations. However, the quadratic \(n^2\) factor becomes costly for very long sequences, motivating sparse or local attention variants (see “Longer Contexts and Efficiency” below). In practice, training large Transformers scales well on GPU/TPU hardware because all positions in a layer are processed concurrently (unlike the sequential dependency of RNNs).
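To get a feel for where the quadratic term starts to dominate, the sketch below plugs numbers into the per-layer cost expressions, using the convolutional cost \(O(k n d^2)\) given in the complexity table of Vaswani et al. Constant factors are ignored, so the results are only order-of-magnitude operation counts, and the function name and example sizes are illustrative assumptions.

```python
def per_layer_cost(n, d, k=3):
    """Leading-order operation counts per layer, ignoring constants.

    n: sequence length, d: hidden size, k: convolution kernel width.
    """
    return {
        "self_attention": n * n * d,   # every token pair, d-dimensional dot products
        "recurrent": n * d * d,        # one d x d matrix multiply per step
        "convolutional": k * n * d * d,
    }

short = per_layer_cost(n=100, d=512)    # sentence-length input
long = per_layer_cost(n=4096, d=512)    # long-document input

for name in ("self_attention", "recurrent", "convolutional"):
    print(f"{name:>15}: n=100 -> {short[name]:,}   n=4096 -> {long[name]:,}")
```

Under these expressions self-attention is cheaper than a recurrent layer whenever \(n < d\) and more expensive once \(n > d\); the two coincide at \(n = d\).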

Layer normalisation is applied after each residual addition in the original design, which helps keep training of deep stacks stable. Empirically, Transformers learn powerful representations: the original paper notes that individual attention heads “learn to perform different tasks” (some appear to track syntactic or semantic structure), so inspecting attention patterns offers a partial window into the model’s behaviour. The Transformer’s success owes both to its mathematical elegance and to these practical design choices.


Why Transformers Remain Dominant

Transformers now dominate NLP (and increasingly other AI domains) for several intertwined reasons:

  • Massive Parallelism: Unlike RNNs, whose hidden state must be updated sequentially, a self-attention layer connects all positions with a constant number of sequentially executed operations. Vaswani et al. highlight that Transformer layers allow much more parallel computation per layer. In practice, this means the work within a layer can be spread across many accelerators, so wall-clock training time falls as hardware is added. For example, Vaswani et al. report training their big model (213M parameters) in ~3.5 days on 8 NVIDIA P100 GPUs, at a fraction of the training cost of earlier RNN/CNN models on the same task. Modern accelerators (GPU/TPU) are highly optimised for matrix multiplies and attention’s batched operations, further amplifying this advantage.
  • Scalability and Transfer: Transformers scale remarkably with data and compute. Early work (Kaplan et al., 2020) demonstrated clear power-law scaling laws for language model performance: larger models trained on more data uniformly yield lower loss. This underpins the “bigger is better” trend. In practice, Transformers have been scaled from millions of parameters (original Transformer) to hundreds of billions (GPT-3, PaLM) and beyond, with consistent improvements in benchmarks. Pretraining on vast unlabeled corpora (Wikipedia, web crawl, books, code, etc.) imbues the model with broad knowledge. One can then fine-tune or use few-shot prompting on downstream tasks, often with minimal additional training. BERT (110M–340M params) fine-tuned on tasks achieved new SOTA on GLUE/SQuAD with just a task-specific output layer. GPT-3 (175B params) showed strong zero/few-shot learning without any fine-tuning. This transfer paradigm – learning a general “foundation model” and adapting it – has proven more efficient than training task-specific architectures from scratch. The unified architectures mean improvements (like bigger models or better data) benefit many tasks simultaneously.
  • Expressiveness: Self-attention can model long-range dependencies more directly than RNN gates or convolution. Each layer allows any token to attend to any other, learning context-dependent relationships in one step. The multi-head mechanism lets the model capture diverse linguistic or visual relations. Combined with deep stacks, Transformers can approximate very complex functions over sequences. The empirical success of models like BERT and GPT-3 suggests that the Transformer’s representational capacity is adequate for a vast range of problems, from core NLP to knowledge retrieval.
  • Transfer Learning Ecosystem: A wealth of research has built around Transformers. Pretrained weights and codebases are publicly released for models such as BERT, GPT-2, T5, and vision-language models like CLIP. This ecosystem accelerates research and applications. Mainstream tooling (e.g. the Hugging Face Transformers library, built on PyTorch and TensorFlow) is organised around Transformer models. Moreover, emerging paradigms (adapters, prompt tuning) layer on top of Transformers to enable efficient domain adaptation without full retraining.
  • Hardware Trends: Modern training hardware (GPUs/TPUs) favors dense matrix operations and moderate sequence lengths. The quadratic attention fits well up to a few thousand tokens, and typical language tasks (sentences, paragraphs) work well within that. Specialized AI chips (like Google’s TPUv4) are designed with matrix operations (e.g. for attention and feed-forward nets) in mind. This synergy of architecture and hardware keeps Transformers competitive.

Overall, the Transformer’s combination of parallelism and modularity has let NLP research convert growth in available compute directly into better models. As the scaling-law results indicate, the compute-optimal path is often to train larger models on more data. Transformers unlock this path more readily than RNNs did.
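The power-law scaling mentioned above can be illustrated with a one-line formula. The sketch assumes the parameter-count form \(L(N) = (N_c/N)^{\alpha_N}\) and uses approximate constants of the kind reported by Kaplan et al. (2020) for non-embedding parameters; treat the default values of `n_c` and `alpha` here as illustrative assumptions rather than authoritative fits.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted loss as a power law in (non-embedding) parameter count.

    n_c and alpha are approximate, illustrative constants (assumption).
    """
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  ->  predicted loss {power_law_loss(n):.3f}")

# Doubling the parameter count multiplies the predicted loss by 2 ** -alpha.
ratio = power_law_loss(2e9) / power_law_loss(1e9)
```

With an exponent this small, each doubling of model size lowers the predicted loss by only about 5%, which is why the observed gains have required order-of-magnitude increases in scale.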


Current Trends and Directions

With the Transformer paradigm now established, research is exploring many extensions, optimisations, and applications:

  • Continued Scaling: The obvious trend is ever-larger models. PaLM (540B) pushed dense models to hundreds of billions of parameters, and GPT-4 (2023; size undisclosed) continued the trend, with abilities (e.g. complex reasoning, coding, multilingual competence) reported that smaller models lacked. New architectures like Mixture-of-Experts (MoE) enable extreme scale with controlled compute: Switch Transformer (1.6T params) and GLaM (1.2T, Du et al., 2022) use sparse routing so each token only engages a fraction of the model. These models reach much higher parameter counts at similar FLOPs per token and show that effective capacity can grow without a proportional increase in cost.
  • Longer Contexts and Efficiency: Classical self-attention is quadratic in sequence length, yet many tasks (document QA, code, genomics) involve many thousands of tokens or more. New “efficient Transformer” variants address this. Sparse attention methods (Longformer, BigBird) restrict each token to attend within a window or via random/global patterns, reducing complexity to \(O(n)\) (or \(O(n\sqrt{n})\) for some factorised sparse patterns). Low-rank or kernel attention (Linformer, Performer) approximates the full attention matrix with projections or random features to achieve linear scaling. Reformer uses locality-sensitive hashing to limit attention candidates, giving \(O(n \log n)\). These methods remain competitive on long sequences while using far less memory. In parallel, hardware improvements (e.g. 3D-stacked memory) and library optimisations (FlashAttention kernels) extend the feasible context length of full attention.
  • Multimodality: Transformers now power vision, audio, and cross-modal models. After ViT’s success in vision, Transformers became core in models like CLIP (contrastive text-image), DALL·E (text-to-image), and vision-language Transformers (e.g. Flamingo, PaLI). These often use a shared or paired Transformer encoder on image patches and text tokens. GPT-4 and others natively accept both text and images. Beyond vision, Transformers are used in audio/speech (Whisper by OpenAI), time series, and even biological data (AlphaFold). The unifying theme is treating any modality as a sequence of tokens (image patches, spectrogram frames, etc.) and applying attention.
  • Retrieval-Augmented Models: A major direction is combining large Transformers with external knowledge. Retrieval-Augmented Generation (RAG) models (Lewis et al., 2020) integrate a dense retriever over a text index (e.g. Wikipedia) with a seq2seq Transformer. At inference, relevant documents are fetched for a query and provided as extra context. This improves factuality and enables dynamic updating of knowledge without re-training. Similar retrieval-based grounding appears in deployed assistants that offer search or browsing features. In effect, the Transformer becomes a controller over an external, non-parametric memory, mitigating issues of “stale” knowledge and hallucinations by grounding answers in retrieved text. Ongoing work refines retrieval (bi-encoder retrievers, dense vs sparse indexing) and reading strategies (one-pass vs iterative retrieval per token).
  • Sparsity and Parameter-Efficient Tuning: In addition to MoE sparsity, there’s interest in sparse or low-rank weights (pruning, quantisation) and in adapter modules. Adapters are small bottleneck layers inserted into a pretrained Transformer (Houlsby et al., 2019). They allow fine-tuning on a new task by learning only a few extra parameters, keeping the main weights fixed. This greatly reduces training/storage cost for multiple tasks. Houlsby et al. report adapters coming within about 0.4% of full fine-tuning on GLUE while adding only a few percent of parameters per task, and later work (e.g. Pfeiffer et al., 2020) builds on this result. Similarly, prompt tuning freezes the model and learns only a short prefix of “soft prompts” in input space. These techniques make it cheaper to specialise large Transformers without retraining them fully.
  • Continual and Lifelong Learning: Deploying LMs over time raises the question of how to update them with new data without forgetting old knowledge. Researchers are exploring continual learning algorithms for Transformers (e.g. elastic weight consolidation, replay of samples, dynamic expansions). Adapter and modular approaches dovetail here (new tasks get new adapter modules). This remains an active area of research.
  • Robustness, Safety, and Interpretability: As Transformers power real applications, ensuring they behave reliably is crucial. Studies of adversarial attacks on attention, of bias in generated text, and of alignment (ensuring outputs meet user intent) are thriving. For example, GPT-3 studies note that it still makes factual and ethical errors and may reflect biases in web data. In response, techniques like RL with human feedback (RLHF, used in GPT-4) are employed. Meanwhile, interpretability research examines attention patterns: Vaswani et al. observe that many heads align with syntax or coreference, hinting at how the network structures information. Other methods (saliency, concept activation) are applied to understand Transformers.
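The saving from windowed attention, as described under “Longer Contexts and Efficiency” above, can be illustrated by counting how many query-key pairs are actually scored. The sketch below builds a simple symmetric sliding-window mask; it is a simplification in the spirit of Longformer-style local attention and leaves out the global tokens and dilation those models also use. Function names and sizes are illustrative assumptions.

```python
def sliding_window_mask(n, window):
    """n x n boolean mask: True where |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def sliding_window_pairs(n, window):
    """Number of (query, key) pairs kept by the sliding-window mask."""
    return sum(
        min(n - 1, i + window) - max(0, i - window) + 1
        for i in range(n)
    )

def full_attention_pairs(n):
    """Number of (query, key) pairs in dense self-attention."""
    return n * n

n, window = 4096, 256  # toy long-document setting
local = sliding_window_pairs(n, window)
dense = full_attention_pairs(n)
print(f"dense pairs: {dense:,}  windowed pairs: {local:,}  "
      f"fraction kept: {local / dense:.3f}")
```

The windowed count is bounded by `n * (2 * window + 1)`, so for a fixed window it grows linearly in sequence length while the dense count grows quadratically.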

In summary, current trends push scaling (size, data, compute), efficiency (long-context, sparsity), and multifunctionality (multimodal, retrieval support). These advances aim to retain Transformers’ flexibility while mitigating cost and expanding capability.
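To make the parameter-efficiency argument for adapters concrete, here is a small sketch of a bottleneck adapter applied to one token vector, together with its parameter count. It assumes a down-projection, a ReLU, an up-projection and a residual connection; the exact nonlinearity and placement vary between papers, and the function names and sizes are ours.

```python
def adapter_param_count(d_model, bottleneck):
    """Weights and biases of a down-projection plus an up-projection."""
    return 2 * d_model * bottleneck + d_model + bottleneck

def adapter_forward(x, W_down, b_down, W_up, b_up):
    """x: length-d vector; W_down: d x m; W_up: m x d (lists of rows)."""
    # Down-project to the bottleneck and apply ReLU (assumed nonlinearity).
    h = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*W_down), b_down)
    ]
    # Up-project back to the model dimension.
    up = [
        sum(hi * w for hi, w in zip(h, col)) + b
        for col, b in zip(zip(*W_up), b_up)
    ]
    # Residual connection: the adapter only adds a correction to x.
    return [xi + ui for xi, ui in zip(x, up)]

# With a zero up-projection the adapter is the identity, so inserting it
# does not disturb the pretrained model before training starts.
x = [1.0, -2.0, 3.0]
W_down = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
identity_out = adapter_forward(x, W_down, [0.0, 0.0],
                               [[0.0] * 3, [0.0] * 3], [0.0] * 3)

params = adapter_param_count(d_model=768, bottleneck=64)
```

For a hidden size of 768 and a bottleneck of 64, one adapter has on the order of 10^5 parameters, which is small next to the millions of weights in a single Transformer layer of that width.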


Limitations and Open Problems

Despite their success, Transformers face significant challenges:

  • Compute and Data Costs: Training state-of-the-art Transformers demands enormous resources. For context, Brown et al. (2020) report that training GPT-3 used several thousand petaflop/s-days of compute, which corresponds to hundreds of GPU-years and, by third-party estimates, millions of dollars. Even inference with huge models is costly and has high latency. This raises environmental and economic concerns (“Green AI”). Moreover, the need for vast high-quality data is acute; biases and gaps in web corpora propagate into model outputs. While scaling laws suggest larger models are more sample-efficient (reaching a given loss with fewer training tokens), the absolute costs remain prohibitive for all but major labs.
  • Bias and Fairness: Transformers ingest human-generated text and thus can encode societal biases (gender, race, ideology, etc.) present in the data. Without careful filtering and mitigation, models may produce stereotyped or offensive content. Large-scale audits (e.g. of GPT-3 or PaLM) reveal that toxic and biased outputs can emerge, especially in adversarial settings. Designing objective metrics for bias and mechanisms to suppress it is an ongoing difficulty.
  • Hallucination and Factuality: Even state-of-the-art models occasionally generate false or nonsensical statements (“hallucinations”), especially when asked for knowledge beyond their training or reasoning beyond their statistical grasp. For instance, large language models can invent fictitious references or answer trivia questions incorrectly, and Lewis et al. (2020) note that purely parametric models may produce such hallucinations. Despite retrieval augmentation or fine-tuning, guaranteeing factual accuracy in open-ended generation is unsolved. Tighter integration with symbolic knowledge or logical constraints is an open research path.
  • Long-Range and Complex Reasoning: Pure attention does not inherently encode logical or world-knowledge rules. Transformers excel at pattern extrapolation from training data, but struggle with problems requiring multi-step logical reasoning, planning, or genuine understanding. While some reasoning abilities emerge at scale (e.g. few-shot math problem solving), failure modes remain. Bridging this gap likely requires architectural innovations (e.g. explicit memory modules or hybrid neural-symbolic systems).
  • Sample Efficiency and Adaptability: Current large LMs are sample-inefficient learners: fine-tuning or few-shot adaptation still requires significant data or effort. When adapting to novel domains or languages, catastrophic forgetting or poor generalisation can occur. Techniques like adapters and prompt engineering help, but fully solving continual learning (adapting smoothly to new tasks without retraining from scratch) is an open challenge.
  • Interpretability and Verification: Understanding how Transformers make decisions is still limited. Although attention provides some transparency, there is no guarantee that paying attention to a word means it directly influenced the output in a human-interpretable way. Formal verification of Transformer behavior (e.g. verifying they satisfy certain invariances or safety constraints) is largely unexplored due to model size and complexity. Methods from explainable AI are being adapted, but the field lags behind the technology’s capability.

In short, while powerful, Transformers raise profound open problems at the intersection of ML, ethics, and systems engineering. A responsible future research agenda must address these limitations.


Future Outlook

Looking forward, research on Transformer-based systems has many exciting directions:

  • Architectural Innovations: We may see new architectures that retain the strengths of Transformers while improving efficiency or expressivity. For example, SpiNN or Neural Memory architectures could be hybridized with attention. The success of MoE suggests other sparse/dynamic architectures could be fruitful. There is ongoing work on “universal transformers” (with iterative refinement loops) or adding recurrence at higher levels.
  • Cross-Modal and Embodied AI: Transformers are spreading into robotics and control, processing sequences of sensory inputs or actions. Combining Transformers with reinforcement learning (e.g. Decision Transformers, which model trajectories as sequences) is a nascent but growing area. Integration of language and vision with actionable interfaces (instructable robots, for instance) is on the horizon.
  • Efficient Continual Learning: Future models may better learn incrementally. Techniques like Dynamic Sparse Attention (learning which tokens to attend) or lifelong learning algorithms could allow a Transformer to adapt to streams of data. Research into Meta-learning and adapters may yield models that quickly specialize to new tasks with minimal retraining.
  • Algorithmic and Linguistic Reasoning: Some work aims to give Transformers more robust reasoning skills. For example, “Chain-of-thought” prompting lets models generate intermediate reasoning steps. Another line is combining Transformers with formal reasoning engines or external calculators (e.g. models that call Python). Architectures that explicitly model causality or maintain internal symbolic structures could bridge the gap to human-like reasoning.
  • Smaller Models with Big Capabilities: LLaMA and other compact or distilled models hint that training on more data with better recipes can yield smaller models that rival much larger ones. Research on more data-centric scaling (better curation, synthetic data generation, multilingual balancing) may produce leaner yet capable systems. This is important for democratization: if high performance can be achieved with 10–100B parameters instead of 1000B, many more researchers can contribute.
  • Safety, Alignment, and Governance: With greater deployment of AI, research into aligning Transformer behaviors with human values will intensify. This includes adversarial robustness, detection of malicious uses, and interpretability tools. We expect more formal frameworks to verify model properties or to guarantee that updates (e.g. via RLHF) do not introduce regressions. The community is also exploring legal and policy frameworks (model cards, auditing standards, regulatory oversight) for these powerful models.
  • Quantum and Neuromorphic Accelerators: Though speculative, novel hardware could influence architecture design. Quantum computing or neuromorphic chips might one day handle certain parts of attention or memory differently, potentially leading to new forms of Transformer-like networks. Keeping an eye on hardware trends (e.g. attention accelerators, optics-based computation) will be important.
  • Theoretical Understanding: Finally, there is room for deeper theory. Why exactly do Transformers scale so well? What are the limits of attention for learning linguistic structure? Recent mechanistic analyses of Transformers hint at formal properties, but a complete theory of how self-attention implements algorithms (sorting, graph problems, etc.) is lacking.

Comparison of Major Models

The table below summarises representative Transformer-based models in research and industry, their scale, data, tasks, and key innovations.

| Model (Origin) | Year | Parameters | Training Data | Main Tasks/Use-case | Innovations |
| --- | --- | --- | --- | --- | --- |
| BERT (Google) | 2018 | 110M (base), 340M (large) | BooksCorpus (800M words), English Wikipedia (2.5B words) | NLP understanding: QA, NLI, NER, etc. via fine-tuning | Deep bidirectional encoder pretraining (Masked LM + NSP tasks); first single-model SOTA on diverse tasks. |
| GPT-2 (OpenAI) | 2019 | 1.5B | WebText (~40 GB of text from 8M web pages) | Language modeling and zero-shot generation | Unsupervised generative pretraining at large scale; emergent zero-shot capabilities; showed that larger LMs yield stronger zero-shot performance. |
| GPT-3 (OpenAI) | 2020 | 175B | Common Crawl (filtered), WebText2, Wikipedia, Books, etc. (about 300B training tokens) | Few-shot NL tasks, text generation | Massive scale: few-shot learning; no fine-tuning needed for many tasks; novel emergent behavior as model size grew. |
| GPT-4 (OpenAI) | 2023 | Undisclosed | Publicly available and licensed data (details undisclosed); accepts text and image inputs | Multimodal generation & reasoning | Multimodal Transformer with advanced alignment; human-level on many benchmarks; improved factuality and instruction-following (via RLHF). |
| T5 (Google) | 2020 | 220M (base); 770M (large); 11B max | C4 (Colossal Clean Crawled Corpus, filtered Common Crawl) | Text-to-text tasks: translation, summarization, QA, etc. | Unified “text-to-text” framework: any problem (classification, generation) cast as sequence generation; massive pretraining on large web corpus with standardised tasks. |
| ViT (Google) | 2020 (ICLR 2021) | 86M (Base), 307M (Large) | ImageNet-21k (14M images) or JFT-300M (300M images) | Image classification (and feature extraction) | Pure Transformer on image patches; no convolution layers. When pretrained on enough data, matches or exceeds CNN accuracy with simpler architecture. |
| Switch Transformer (Google) | 2021 | 1.6T (sparse) | C4 (like T5) | Language modeling / multilingual NLP | Mixture-of-experts (MoE): each token is routed to one of many “expert” feed-forward networks. Achieves trillion-parameter scale with up to ~7× pre-training speedup (sparse compute) vs dense T5-Base. |
| PaLM (Google) | 2022 | 540B | 780B tokens (web pages, code, multilingual text, etc.) | Language understanding and generation; few-shot reasoning | Extremely large dense Transformer. Demonstrated breakthrough few-shot learning and reasoning (outperforming fine-tuned SOTA on several multi-step reasoning tasks). |
| LLaMA (Meta) | 2023 | 7B–65B | 1–1.4T tokens (filtered Common Crawl, code, books, etc.) | General-purpose LLM for research and fine-tuning | Smaller models trained on massive data; showed competitive performance with far fewer parameters. Released with focus on democratization and research accessibility. |

Each of these models built on Transformer blocks but introduced new ideas (bidirectionality, scale, multi-modality, sparsity, etc.) that expanded the frontier of what these architectures can do. Notably, the GPT series emphasised decoder-only models for generation; BERT and T5 emphasised encoder or encoder-decoder structures for understanding and seq2seq tasks; ViT demonstrated that even vision can be done with Transformers.
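The mixture-of-experts routing summarised for Switch Transformer in the table can be sketched for a single token. The example below assumes top-1 routing with a linear router and scales the chosen expert’s output by its gate probability; load balancing, capacity limits and batching are omitted, and the toy experts and router weights are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token, router_weights, experts):
    """Route one token to its single highest-probability expert.

    router_weights: one row of length d per expert.
    experts: list of callables mapping a vector to a vector.
    Returns (expert_index, gated_output).
    """
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    expert_out = experts[idx](token)
    # Scaling by the gate probability keeps the router differentiable.
    return idx, [probs[idx] * v for v in expert_out]

# Two toy experts standing in for separate feed-forward networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-1.0 * v for v in x],
]
router_weights = [[2.0, 0.0], [0.0, 2.0]]

idx, out = switch_route([1.0, 0.0], router_weights, experts)
```

Only the selected expert runs for each token, so adding experts increases the parameter count without increasing the per-token computation of the feed-forward sublayer.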

Complexity: For reference, a single Transformer layer processes \(n\) tokens of dimension \(d\) in \(O(n^2 d)\) time and \(O(n^2)\) memory (due to the \(n\times n\) attention matrix). In contrast, a recurrent layer costs \(O(n d^2)\) (because each of \(n\) steps does a matrix multiply of size \(d\times d\)). Thus for sentences (\(n\) in the hundreds) with \(d\) also in the hundreds, the costs are comparable, but Transformers get much more GPU parallelism (since all positions are processed together). In practice, self-attention is often faster on modern hardware for typical sequence lengths, as noted by Vaswani et al.


References

  • Vaswani et al., “Attention Is All You Need,” NeurIPS 2017 (Transformer architecture).
  • Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” NAACL 2019 (BERT details).
  • Radford et al., “Language Models are Unsupervised Multitask Learners,” OpenAI technical report, 2019 (GPT-2).
  • Brown et al., “Language Models are Few-Shot Learners,” NeurIPS 2020 (GPT-3).
  • Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” JMLR 2020 (T5).
  • Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.
  • Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,” JMLR 2022.
  • Kaplan et al., “Scaling Laws for Neural Language Models,” arXiv 2020 (scaling laws).
  • Beltagy et al., “Longformer: The Long-Document Transformer,” arXiv 2020.
  • Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020 (RAG).
  • Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” ICLR 2015.
  • Chowdhery et al., “PaLM: Scaling Language Modeling with Pathways,” arXiv 2022.
  • Touvron et al., “LLaMA: Open and Efficient Foundation Language Models,” arXiv 2023.