My notes from the "Future of AI Compute" interview
The story behind Groq's ultra-fast AI chips and what it could mean for NVIDIA
With stunning 265% revenue growth[1], a 76% gross margin, and a $2.3T market cap[2], NVIDIA has been in the spotlight as a key winner in the AI boom.
However, does NVIDIA have any serious competition? A startup called Groq recently went viral for demoing the fastest LLM responses with their proprietary AI chips.
Below are my takeaways from an interview with the founder of Groq earlier this year[3].
The most interesting part of this discussion focused on how the chip market evolves from here. TL;DR: the players who provide the entire AI stack could benefit more than pure-play chip vendors, and it appears that Groq, NVIDIA, and the cloud hyperscalers (GCP, AWS, Azure) are all heading in that direction.
Background story for Groq
Jonathan Ross, founder of Groq, was one of the creators of the TPU project at Google. Compared to GPUs, TPUs are more specialized chips for machine learning, optimized for matrix multiplication. Google built its own TPUs to handle machine learning workloads more efficiently.
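To make concrete why matrix multiplication is the operation worth specializing for, here's a minimal sketch (my own illustration, not from the interview) of a neural-network layer in Python, which is essentially one big matmul:

```python
import numpy as np

# A single dense (fully connected) layer is, at its core, a matrix multiply:
# activations (batch x in_features) @ weights (in_features x out_features).
batch, in_features, out_features = 32, 4096, 4096

x = np.random.randn(batch, in_features).astype(np.float32)         # input activations
w = np.random.randn(in_features, out_features).astype(np.float32)  # learned weights
b = np.zeros(out_features, dtype=np.float32)                       # bias

y = x @ w + b   # the matmul dominates the work: ~2 * batch * in * out FLOPs
print(y.shape)  # (32, 4096) -> feeds the next layer, which is yet another matmul
```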
Groq built the LPU, or Language Processing Unit, “specifically for very large scale inference … [what] we never expected was that large language models would be the thing that would be running on them.”
Why has NVIDIA been so successful?
The challenge with using GPUs or TPUs is the developer experience. To leverage these chips for AI, developers need to write low-level code, called kernels, that runs directly on the hardware.
Developers have been writing CUDA kernels for NVIDIA chips for a long time.
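For a concrete sense of what “writing a kernel” looks like, here's a tiny elementwise-add kernel written with Numba's CUDA bindings (my own illustrative sketch, not from the interview; it assumes an NVIDIA GPU and the CUDA toolkit are available):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    # Each GPU thread handles one element of the arrays.
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] + y[i]

n = 1 << 20
x = np.ones(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)  # Numba copies host arrays to/from the GPU
print(out[:5])  # [2. 2. 2. 2. 2.]
```

Production kernels for attention, matmuls, and quantized ops are far more involved than this toy example, which is why a mature kernel ecosystem like CUDA's is such a moat.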
“CUDA is the best we’ve had … you got this double-sided market where people are building CUDA kernels for NVIDIA GPUs when a new model comes out.”
What’s unique about computation for LLMs? What’s special about Groq?
Compared to traditional machine learning, “in generative AI, there's a sequential nature to it. For example, you can't predict the 100th token or word until you've predicted the 99th token or word.” Groq’s LPU chips are specifically designed to be fast at sequential processing.
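A minimal sketch of that sequential dependency (my own illustration; toy_next_token is a hypothetical stand-in for a real model's forward pass):

```python
# Autoregressive decoding: token N can only be computed after token N-1 exists,
# so generation is an inherently serial loop of model forward passes.

def toy_next_token(tokens: list[int]) -> int:
    # Stand-in for an LLM forward pass; a real model would run a full
    # transformer over `tokens` to produce the next-token distribution.
    return (sum(tokens) * 31 + len(tokens)) % 50_000

def generate(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(toy_next_token(tokens))  # step N depends on steps 1..N-1
    return tokens

print(generate([101, 2023, 2003], n_new=10))
```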
Groq’s LPUs provide the fastest tokens per second for LLM inference. Prior to Groq, Artificial Analysis said the record was 120 tokens / second. Groq performs at 240 tokens / second.
Why is speed important? A better UX and potentially new use cases.
On the web, “every 100 milliseconds of latency impacts conversion rates by about 8%”. Faster response times from LLMs could also enable novel use cases like real-time voice-based customer support.
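As a rough back-of-the-envelope sketch (my own numbers, except the 120 vs. 240 tokens/second figures quoted above), here's what those decode rates mean for end-to-end response time:

```python
# Rough time to stream a response of a given length at different decode rates.
def response_time_s(num_tokens: int, tokens_per_second: float) -> float:
    return num_tokens / tokens_per_second

for rate in (120, 240):  # prior record vs. Groq's reported rate
    t = response_time_s(num_tokens=300, tokens_per_second=rate)
    print(f"{rate} tok/s -> ~{t:.1f}s for a 300-token reply")
# 120 tok/s -> ~2.5s; 240 tok/s -> ~1.2s. For an agent that chains several
# LLM calls, that per-call difference compounds quickly.
```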
We’ve all heard the buzz around AI agents, like Devin. These systems, and more complex ones that use multi-agent interaction, will suddenly become much more performant and viable for real-world use.
Why did the market allow vendor lock-in from NVIDIA?
Or in other words, “Why have people been so supportive of [NVIDIA’s] CUDA?”
NVIDIA has massive demand for their chips, and exercises discretion on who gets delivery. Hence, there’s “fear” from customers about NVIDIA not fulfilling their orders if they’re working with a competitor or creating one themselves.
Future of competition? Are NVIDIA and the hyperscalers on a collision course?
“NVIDIA is already building their own cloud in a sneaky way … hosting the LLMs themselves. And so they are attempting to do the entire stack themselves”
“Amazon has spun their own silicon (Inferentia). Google has TPU and whatever derivatives they built since then. Meta said in their quarterly update that they're spinning their own inference silicon. So it just seems like everybody is going to compete with everybody.”
How is the market for AI compute going to evolve?
“What we've learned is you can't sell a chip into this space. It's too complex. You need your own networking, software stack, compiler stack … you need everything.”
“Even AMD isn't being trusted right now … even though they can build a GPU, even though they can put it in a system, they don't have the software. They don't have all the rest of the stack. And that gap is only going to widen over time. So anyone who's building a chip is building the wrong thing.”
“Now the way that I see it … anyone who can build the entire stack top to bottom has a shot and anyone who can't is going to struggle. So what you're going to see is instead of chip companies challenging NVIDIA, you're going to see people offering services where they deploy the whole thing themselves end to end.”
“A ton of cloud vendors competing, and the ones that have their own proprietary hardware have a better shot.”
What does the future for Groq look like?
“If you want to build a modern GPU, you build it on what's called the four or three nanometer process. It's incredibly expensive. It costs tens of billions of dollars to build these fabs to do it. And there's only a couple of geographies in the world that do this. Groq is actually built on a 14 nanometer process.”
Since Groq is on an older fab process, they’re not supply constrained. When they implement their chips on a more modern fab process, they’ll have a significant boost in performance and lower energy consumption.
What’s special about Groq’s LPU architecture?
“We did everything different from the bottom up. It's going to get a bit technical, but we made a deterministic and synchronous chip. Another way to put it: it's sort of like we enabled calendar scheduling for the chip.
When you're using a GPU or CPU, there's a bunch of queues, and data gets stuck in a queue and has to wait until there's availability of resources to process it. In our chips, even when we're running 640 different chips, everything is scheduled in advance to the clock cycle. It'd be as if you had 640 people working on a project and everyone handed over their task to the next person at exactly the right time.
And that's a totally different paradigm. We had to create our own networking to make that work. We had to create our own chips, our own compiler, our own software to make all of that work. It's a huge lift.”
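Here's a toy sketch of that difference as I understand it (my own illustration in Python, not Groq's actual compiler or scheduler): in a queue-based model, an operation waits until its inputs and a free execution unit happen to be available, while in a statically scheduled model the compiler pins every operation to a specific unit and clock cycle ahead of time, so nothing waits at runtime.

```python
# Toy "calendar schedule": every op is pinned to (cycle, unit) at compile time,
# so execution is just replaying the table with no runtime queueing or arbitration.

ops = ["load_a", "load_b", "matmul", "activation", "store"]

def compile_static_schedule(ops: list[str], num_units: int) -> dict[tuple[int, int], str]:
    """Assign each op a (clock_cycle, execution_unit) slot ahead of time."""
    schedule = {}
    for i, op in enumerate(ops):
        cycle, unit = divmod(i, num_units)
        schedule[(cycle, unit)] = op
    return schedule

def run(schedule: dict[tuple[int, int], str]) -> None:
    last_cycle = max(cycle for cycle, _ in schedule)
    for cycle in range(last_cycle + 1):
        fired = [f"unit{u}:{op}" for (c, u), op in schedule.items() if c == cycle]
        print(f"cycle {cycle}: {', '.join(fired)}")  # deterministic, same every run

run(compile_static_schedule(ops, num_units=2))
```

The payoff described in the quote is that this determinism extends across chips: if every chip knows exactly when its neighbor will hand over data, hundreds of chips can cooperate without queueing delays.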
“That must also be the limiting factor in scale - just getting that scheduler to be able to scale across multiple chips and racks and systems?”
“No, the scheduling is what allows us to bring so many chips together. Typically with a GPU solution, you have about 8 GPUs when you're doing inference, because of the speed at which you have to give an answer: you have to be very fast.
If you're not scheduling, then you're losing a lot of that time waiting for stuff to show up. It's actually the scheduling that allows us to scale. That model is currently running on 640 chips, and we're planning to move that up to 2,560 by the end of this year, because we're actually going to be much faster and lower cost by scaling.”
What are the economics of using Groq compared to the alternatives?
According to an anecdote from an AI engineer, purchasing Groq’s hardware outright is currently cost-prohibitive. However, Groq claims their API pricing is highly competitive.
Disclaimer: The information provided in this post is for informational purposes only and is written in a personal capacity. The views expressed herein are my own and do not reflect the views of any employer or organization I am associated with. The content in this post does not constitute financial or investment advice.
[1] From NVIDIA’s latest earnings release on Feb 21st, 2024 (https://investor.nvidia.com/news/press-release-details/2024/NVIDIA-Announces-Financial-Results-for-Fourth-Quarter-and-Fiscal-2024/)
[2] Based on the market price as of May 17th, 2024.
[3] Quotes in this post are pulled from an AI-generated transcript of the interview on Feb 5th, 2024, linked below. The AI-generated transcript may contain transcription errors and may not reflect the source content word-for-word.