Some interesting numbers about GPT and LLMs

  • ~$25 million is the cost to train a 70 billion parameter model on 1.4 trillion tokens.
  • The cost ratio of fine-tuning vs training from scratch (pre-training) is less than 0.001.
  • The cost ratio of GPT-4 to GPT-3.5 Turbo is approximately 25.
  • On average, there are 1.3 tokens per word.
  • The average human reading speed is between 3 and 5 words per second.
  • GPU memory capacities for common NVIDIA GPUs: the V100 has 16 GB, the A10G has 24 GB, the A100 has 40/80 GB, and the L40 has 48 GB.
  • The typical GPU memory requirement for serving an LLM is about twice the parameter count in GB, at 16-bit precision. (For instance, a 7 billion parameter model requires approx. 14 GB of VRAM in total; see the VRAM sketch after this list.)
  • ChatGPT (GPT-3.5) can emit about 40 tokens per second; GPT-4 can emit about 15 tokens per second (see the throughput sketch after this list).
  • An LLM with 13 billion parameters running llama.cpp on a high-end modern CPU can emit about 10 tokens per second.
  • An LLM with 13 billion parameters on a newer GPU with CUDA can emit about 85 tokens per second.
  • LLaMa2 with 70 Billion parameters used about 1720320 hours of GPU hours to do the pre-training
  • In LLaMa2 close to 90% of the dataset used to do pretraining is English, the remaining parts is programming languages and other languages.
  • A LLM with 70 Billion parameters using a newer GPU with CUDA can issue about 15 tokens per second
  • ChatGPT costs an estimated $700,000/day to operate.
  • ChatGPT has over 180 million users.
  • GPT-3 (which preceded GPT-4) had 175 billion parameters.
    • This was a significant increase from GPT-2, which had 1.5 billion parameters.
  • Training GPT-3 was estimated to cost tens of millions of dollars in computing resources.
  • GPT models have a “knowledge cutoff” date.
    • For GPT-3.5, training data goes up to September 2021.
    • For GPT-4, training data goes up to January 2022.
  • For the GPT-3 model, a single forward pass (producing one token of output) takes about 3.14 milliseconds on an NVIDIA V100 GPU.
  • The main hardware factors for running LLMs locally are VRAM size, memory bandwidth, and FP32 performance.
  • Processing the entire “Lord of the Rings” trilogy (576,458 words) page by page to produce a summary would cost about $90 using GPT-4 (see the cost sketch after this list).
  • Processing the Lord of the Rings books more than 75 times with GPT-4 would cost about as much as an NVIDIA A100.
  • Claude 2 can fit far larger chunks of the book into its context window at once.
    • Claude 2 supports 100,000 tokens per prompt, and processing the full text would cost about $40.
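
To make the VRAM rule of thumb above concrete, here is a minimal Python sketch. It assumes 16-bit weights (2 bytes per parameter) and ignores the extra memory needed for the KV cache and activations, so real deployments need headroom beyond these numbers:

```python
# Rough VRAM estimate for serving an LLM: at 16-bit precision the
# weights alone take ~2 bytes per parameter, i.e. about twice the
# parameter count in GB (KV cache and activations not included).

def serving_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM (GB) needed just to hold the model weights."""
    return params_billions * bytes_per_param

# GPU memory sizes from the list above (GB).
gpus = {"V100": 16, "A10G": 24, "L40": 48, "A100": 80}

for params in (7, 13, 70):
    need = serving_vram_gb(params)
    fits = [name for name, mem in gpus.items() if mem >= need]
    print(f"{params}B model: ~{need:.0f} GB -> fits on {', '.join(fits) or 'no single GPU here'}")
```

This reproduces the 7B-needs-~14-GB figure, and shows why a 70B model (~140 GB at 16-bit) has to be sharded across several GPUs or quantized.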
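
The throughput bullets are easier to interpret next to human reading speed: at ~1.3 tokens per word, reading 3-5 words per second works out to roughly 4-6.5 tokens per second. A small sketch comparing the figures from the list (all numbers are the rough estimates quoted above):

```python
# Compare LLM generation throughput against human reading speed,
# using the ~1.3 tokens-per-word average from the list above.

TOKENS_PER_WORD = 1.3
FAST_READER_TPS = 5 * TOKENS_PER_WORD  # 5 words/s -> ~6.5 tokens/s

throughputs_tps = {  # tokens per second, from the bullets above
    "ChatGPT (GPT-3.5)": 40,
    "GPT-4": 15,
    "13B model, high-end CPU (llama.cpp)": 10,
    "13B model, newer CUDA GPU": 85,
    "70B model, newer CUDA GPU": 15,
}

for name, tps in throughputs_tps.items():
    print(f"{name}: {tps} tokens/s (~{tps / FAST_READER_TPS:.1f}x a fast reader)")
```

Anything above ~6.5 tokens per second already streams faster than most people read, which is why even the CPU-only llama.cpp figure is usable interactively.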
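
The “Lord of the Rings” figures can be sanity-checked from the word count and the tokens-per-word average. The per-1K-token prices below are assumptions standing in for 2023 list prices, and the sketch only counts input tokens; summary output (and any chunk overlap) is what pushes the real bill up toward the quoted $90 and $40:

```python
# Back-of-the-envelope token count and input cost for feeding
# "The Lord of the Rings" through an API, using figures from the list.
# The per-1K-token prices are assumptions; check current pricing.

TOKENS_PER_WORD = 1.3
LOTR_WORDS = 576_458

tokens = LOTR_WORDS * TOKENS_PER_WORD
print(f"~{tokens:,.0f} tokens")  # ~749,395 tokens

def input_cost_usd(num_tokens: float, usd_per_1k_tokens: float) -> float:
    """Cost of sending num_tokens as prompt input at a given per-1K price."""
    return num_tokens / 1000 * usd_per_1k_tokens

print(f"GPT-4 input at an assumed $0.06/1K:     ~${input_cost_usd(tokens, 0.06):.0f}")
print(f"Claude 2 input at an assumed $0.011/1K: ~${input_cost_usd(tokens, 0.011):.0f}")
```

Note also that ~749,000 tokens is several times Claude 2's 100,000-token window, so even with the larger context the text still has to be processed in chunks.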
