

Image by Author | Canva
We use large language models in many of our daily tasks. These models have been trained on billions of online documents and diverse datasets, making them capable of understanding and responding in human-like language. However, not all LLMs are built the same way. While the core idea remains similar, they differ in their underlying architectures, and these differences have a significant impact on their capabilities. For example, as seen across various benchmarks, DeepSeek excels at reasoning tasks, Claude performs well in coding, and ChatGPT stands out in creative writing.
In this article, I'll walk you through 7 popular LLM architectures to give you a clear overview, all in just as many minutes. So, let's get started.
1. BERT
Paper Link: https://arxiv.org/pdf/1810.04805
Developed by Google in 2018, BERT marked a significant shift in natural language understanding by introducing deep bidirectional attention in language modeling. Unlike earlier models that read text left-to-right or right-to-left, BERT uses a Transformer encoder to consider both directions simultaneously. It is trained on two objectives: masked language modeling (predicting randomly masked words) and next-sentence prediction (determining whether one sentence logically follows another). Architecturally, BERT comes in two sizes: BERT Base (12 layers, 110M parameters) and BERT Large (24 layers, 340M parameters). Its structure relies solely on encoder stacks and includes special tokens such as [CLS], which represents the entire sentence, and [SEP], which separates two sentences. You can fine-tune it for tasks like sentiment analysis, question answering (such as SQuAD), and more. It was the first of its kind to truly capture the full meaning of a sentence from both directions.
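To get a concrete feel for the masked-language-modeling objective, here is a minimal sketch using the Hugging Face transformers library (not part of the original paper); the bert-base-uncased checkpoint and the example sentence are purely illustrative choices:

```python
from transformers import pipeline

# Load BERT together with its masked-language-modeling head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context on BOTH sides of [MASK] to rank candidate tokens.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```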
2. GPT
Paper Link (GPT-4): https://arxiv.org/pdf/2303.08774
The GPT (Generative Pre-trained Transformer) family was introduced by OpenAI. The series began with GPT-1 in 2018 and had evolved to GPT-4 by 2023, with GPT-4o, released in May 2024, adding multimodal capabilities for handling both text and images. The models are pre-trained on very large text corpora with a standard next-token prediction language modeling objective: at each step, the model predicts the next word in a sequence given all previous words. After this unsupervised pre-training stage, the same model can be fine-tuned on specific tasks or used in a zero-/few-shot manner with minimal additional parameters. The decoder-only design means GPT attends only to previous tokens, unlike BERT's bidirectional encoder. What was notable at introduction was the sheer scale and capability of GPT: as each successive generation (GPT-2, GPT-3) grew larger, the models demonstrated remarkably fluent text generation and few-shot learning abilities, establishing the "pre-train and prompt/fine-tune" paradigm for large language models. However, these models are proprietary, with access typically provided through APIs, and their exact architectures, especially for recent versions, are not fully disclosed.
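Since GPT-4's weights are not public, the next-token objective is easiest to demonstrate with an open predecessor; the sketch below uses GPT-2 via Hugging Face transformers purely as a stand-in for the decoder-only, autoregressive setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
# Each new token is predicted from the previous tokens only (causal attention).
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```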
3. LLaMA
LLaMA 4 Blog Link: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Paper Link (LLaMA 3): https://arxiv.org/abs/2407.21783
LLaMA, developed by Meta AI and first released in February 2023, is a family of open-source decoder-only Transformer models. It ranges from 7 billion to 70 billion parameters, with the latest version, Llama 4, released in April 2025. Like GPT, LLaMA uses a decoder-only Transformer architecture (each model is an autoregressive Transformer) but with some architectural tweaks. For example, the original LLaMA models used the SwiGLU activation instead of GeLU, rotary positional embeddings (RoPE) instead of fixed ones, and RMSNorm in place of layer normalization. The LLaMA family was released in several sizes, from 7B up to 65B parameters in LLaMA 1, and later even larger in LLaMA 3, to make large-scale models more accessible. Notably, despite relatively modest parameter counts, these models performed competitively with much larger contemporaries: Meta reported that LLaMA's 13B model outperformed OpenAI's 175B GPT-3 on many benchmarks, and that its 65B model was competitive with models like Google's PaLM and DeepMind's Chinchilla. LLaMA's open (though research-restricted) release spawned extensive community use; its key novelty was combining efficient training at scale with more open access to model weights.
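To make those tweaks concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block; it mirrors the ideas rather than Meta's exact implementation, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch: scale by the root-mean-square of the features,
    with a learned gain and no mean subtraction or bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLUFeedForward(nn.Module):
    """Minimal SwiGLU sketch: a SiLU-gated linear unit in place of the usual GeLU MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 8, 512)                       # (batch, sequence, hidden)
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))
print(y.shape)                                   # torch.Size([2, 8, 512])
```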
4. PaLM
PaLM 2 Technical Report: https://arxiv.org/abs/2305.10403
Paper Link (PaLM): https://arxiv.org/pdf/2204.02311
PaLM (Pathways Language Model) is a series of large language models developed by Google Research. The original PaLM (announced in 2022) was a 540-billion-parameter, decoder-only Transformer and is part of Google's Pathways system. It was trained on a high-quality corpus of 780 billion tokens across thousands of TPU v4 chips in Google's infrastructure, using parallelism to achieve high hardware utilization. The model also uses multi-query attention to reduce memory bandwidth requirements during inference. PaLM is known for its few-shot learning capabilities, performing well on new tasks with minimal examples thanks to its massive and diverse training data, which includes webpages, books, Wikipedia, news, GitHub code, and social media conversations. PaLM 2, announced in May 2023, further improved multilingual, reasoning, and coding capabilities, powering applications such as Google Bard and Workspace AI features.
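The multi-query idea is straightforward to sketch: all query heads share a single key head and a single value head, which shrinks the key/value cache that must be read at inference time. The PyTorch snippet below is a simplified illustration, not PaLM's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Simplified multi-query attention: many query heads, one shared K/V head."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, self.head_dim)   # one shared key head
        self.v_proj = nn.Linear(dim, self.head_dim)   # one shared value head
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).unsqueeze(1)               # broadcast over query heads
        v = self.v_proj(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)

x = torch.randn(1, 16, 512)
print(MultiQueryAttention(512, 8)(x).shape)   # torch.Size([1, 16, 512])
```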
5. Gemini
Gemini 2.5 Blog: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
Paper Link (Gemini 1.5): https://arxiv.org/abs/2403.05530
Paper Link (Gemini): https://arxiv.org/abs/2312.11805
Gemini is Google's next-generation LLM family (from Google DeepMind and Google Research), introduced in late 2023. Gemini models are natively multimodal, meaning they are designed from the ground up to handle text, images, audio, video, and even code in a single model. Like PaLM and GPT, Gemini is based on the Transformer, but its key features include massive scale, support for extremely long contexts, and (in Gemini 1.5) a Mixture-of-Experts (MoE) architecture for efficiency. For example, Gemini 1.5 Pro uses sparsely activated expert layers (hundreds of expert sub-networks, with only a few active per input) to boost capacity without a proportional compute cost. The Gemini 2.5 series, released in March 2025, built on this foundation with even deeper "thinking" capabilities. In June 2025, Google released Gemini 2.5 Flash and Pro as stable models and previewed Flash-Lite, its most cost-efficient and fastest version yet, optimized for high-throughput tasks while still supporting the million-token context window and tool integrations such as search and code execution. The Gemini family comes in several sizes (Ultra, Pro, Nano), so it can run anywhere from cloud servers down to mobile devices. The combination of multimodal pretraining and MoE-based scaling makes Gemini a flexible, highly capable foundation model.
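The MoE idea behind Gemini 1.5 can be illustrated with a toy routing layer: a small router scores the experts and only the top-k run for each token. This is a generic sketch, not Gemini's (undisclosed) implementation, and the sizes are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparsely activated MoE layer: route each token to its top-k experts,
    so only a fraction of the parameters is used per input."""
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )

    def forward(self, x):
        # x: (tokens, dim); pick top-k experts per token and mix their outputs.
        weights, indices = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(SparseMoELayer(64)(tokens).shape)   # torch.Size([10, 64])
```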
6. Mistral
Paper Link (Mistral 7B): https://arxiv.org/abs/2310.06825
Mistral is a French AI startup that released its first LLMs in 2023. Its flagship model, Mistral 7B (September 2023), is a 7.3-billion-parameter decoder-only Transformer. Architecturally, Mistral 7B is similar to a GPT-style model but includes optimizations for inference: it uses grouped-query attention (GQA) to speed up self-attention and sliding-window attention to handle longer contexts more efficiently. In terms of performance, Mistral 7B outperformed Meta's Llama 2 13B and even gave strong results against 34B models, while being much smaller. Mistral AI released the model under an Apache 2.0 license, making it freely available for use. Its next major release was Mixtral 8×7B, a sparse Mixture-of-Experts (MoE) model featuring eight 7B-parameter expert networks per layer. This design helped Mixtral match or beat GPT-3.5 and LLaMA 2 70B on tasks such as mathematics, coding, and multilingual benchmarks. In May 2025, Mistral released Mistral Medium 3, a proprietary mid-sized model aimed at enterprises. The model delivers over 90% of the score of pricier models like Claude 3.7 Sonnet on standard benchmarks, while reducing per-token cost dramatically (roughly $0.40 for input versus $3.00 for Sonnet). It supports multimodal tasks (text + images) and expert reasoning, and it is available through an API or for on-premises deployment on as few as four GPUs. However, unlike earlier models, Medium 3 is closed-source, prompting community criticism that Mistral is moving away from its open-source ethos. Shortly after, in June 2025, Mistral released Magistral, its first model dedicated to explicit reasoning. The small version is open under Apache 2.0, while Magistral Medium is enterprise-only. Magistral Medium scored 73.6% on AIME 2024, with the small version scoring 70.7%, demonstrating strong math and logic skills across multiple languages.
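Sliding-window attention is easy to visualize as a mask: each token may attend only to itself and a fixed number of preceding tokens, instead of the full prefix. The snippet below is a minimal illustration of that mask, not Mistral's actual kernel; the window size of 3 is just for display:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal sliding-window mask: position i may attend to positions
    (i - window + 1) .. i, keeping attention cost roughly linear in length."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # True where attention is allowed

print(sliding_window_mask(6, 3).int())
# tensor([[1, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [0, 1, 1, 1, 0, 0],
#         [0, 0, 1, 1, 1, 0],
#         [0, 0, 0, 1, 1, 1]])
```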
7. DeepSeek
Paper Link (DeepSeek-R1): https://arxiv.org/abs/2501.12948
DeepSeek is a Chinese AI company (a spin-off of High-Flyer AI, founded in 2023) that develops large LLMs. Its recent models (such as DeepSeek-V3 and DeepSeek-R1) employ a highly sparsely activated Mixture-of-Experts Transformer architecture. In DeepSeek-V3/R1, each Transformer layer has hundreds of expert sub-networks, but only a few are activated per token. Instead of running every part of the model at once, the model maintains hundreds of expert networks and activates only a handful (such as 9 out of 257) depending on what each input needs. This allows DeepSeek to have an enormous total model size (over 670 billion parameters) while using only about 37 billion for each response, making it much faster and cheaper to run than a dense model of comparable size. Like other modern LLMs, it uses SwiGLU activations, rotary embeddings (RoPE), and advanced optimizations (including experimental FP8 precision during training) to improve efficiency. This aggressive MoE design lets DeepSeek achieve very high capability (comparable to much larger dense models) at a lower compute cost. DeepSeek's models (released under open licenses) attracted attention for rivaling leading models like GPT-4 in multilingual generation and reasoning, all while significantly reducing training and inference resource requirements.
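As a rough illustration of this routing pattern (one always-active shared expert plus a top-k subset of routed experts, i.e., 9 of 257), here is a toy sketch; the random router weights and the sizes are assumptions for demonstration only, not DeepSeek's code:

```python
import torch

def route_tokens(hidden: torch.Tensor, n_routed: int = 256, top_k: int = 8):
    """Toy routing sketch: score 256 routed experts per token and keep the top 8;
    a shared expert (not shown) always runs, giving 9 of 257 active experts."""
    router_weights = torch.randn(hidden.shape[-1], n_routed)  # stand-in router
    top_scores, top_experts = (hidden @ router_weights).topk(top_k, dim=-1)
    gates = torch.softmax(top_scores, dim=-1)                 # mixing weights
    return top_experts, gates

hidden = torch.randn(4, 128)                 # 4 tokens, hidden size 128
experts, gates = route_tokens(hidden)
print(experts.shape, gates.shape)            # torch.Size([4, 8]) torch.Size([4, 8])
```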
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.