The GPU fable
Would you use a 10-ton truck to transport groceries from the supermarket to your car trunk?
Numbers, times and facts
During inference, the following occurs:
A ‘user’ composes a prompt on the local machine. Say, 20 words. It takes maybe a minute.
This prompt is sent to some remote GPU. 50 ms.
The previous content of this ‘session’ is prepended to the prompt, making a total of, say, 4000 words.
(Let’s assume, for simplicity’s sake, that 1 word = 1 token, a 200B model, and a single 80 GB GPU.)
These 4000 tokens are processed on the GPU, all in parallel, through, say, 100 layers, in 4 cycles.
(Let’s assume the GPU has 1000 tensor units, so it can process the 4000 tokens in 4 cycles.)
This is called ‘prefill’, and takes, say, 5 seconds. It builds the KV cache, used in the next phase (generation).
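A back-of-the-envelope sketch of the prefill step, using only the numbers assumed above (nothing here is measured):

    import math

    # Numbers assumed in the walkthrough above, not measurements
    prompt_tokens = 4000     # 20 new words + prepended session history
    tensor_units  = 1000     # tokens the GPU can push through a layer at once
    layers        = 100
    prefill_s     = 5.0      # stated prefill time

    cycles = math.ceil(prompt_tokens / tensor_units)          # 4 cycles
    print(f"prefill: {cycles} cycles of {tensor_units} tokens through {layers} layers")
    print(f"         {prefill_s:.1f} s total, {prefill_s / cycles * 1000:.0f} ms per cycle")
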
The result of the prefill is the first token of the inference (the ‘answer’).
Then this first token is run, sequentially, through all 100 layers, to generate the second token of the inference.
This can keep only one tensor unit, out of 1000, busy, and it takes around 30 ms.
Then the resulting second token is processed in the same way, through the 100 layers, and so on. This process, running single successive tokens through the layers, is called ‘generation’.
Let’s assume that the answer has 200 tokens. How long does it take? 200 * 0.03 = 6 seconds.
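A minimal sketch of why this phase cannot be parallelized: step N needs the token produced by step N-1, so the 200 steps run strictly one after another (forward_pass below is a dummy stand-in for one pass through all 100 layers, not a real model call):

    import time

    def forward_pass(token, kv_cache):
        # Dummy stand-in for one token through all 100 layers (~30 ms on the GPU)
        time.sleep(0.030)
        kv_cache.append(token)          # each step also extends the KV cache
        return token + 1                # pretend next-token prediction

    def generate(first_token, kv_cache, answer_tokens=200):
        token, answer = first_token, []
        for _ in range(answer_tokens):              # strictly sequential loop
            token = forward_pass(token, kv_cache)   # needs the previous step's output
            answer.append(token)
        return answer      # ~200 * 30 ms = 6 s, however many tensor units exist
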
The GPU burns about 600 W on average.
The answer is then transmitted back to the user: 50 ms.
The KV cache is about 18 GB.
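Where the ~18 GB comes from is not spelled out; one plausible reconstruction, assuming a GPT-3-like hidden size of 12288 and fp16 keys/values (both of these are my assumptions, not figures from the text):

    # Assumed model shape: 100 layers, hidden size 12288, fp16 (2 bytes per value)
    layers, hidden, bytes_per_value = 100, 12288, 2
    context_tokens = 4000                 # prompt + prepended session history
    kv_bytes = 2 * layers * context_tokens * hidden * bytes_per_value   # 2 = keys + values
    print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")                      # ~18.3 GiB
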
The Consequences
How long did it take for the user to see the answer appear? About 11 seconds at best.
This means that for 5 seconds (45% of the time), 1000 tensor units (100%) were used. Then for 6 seconds (55% of the time), 1 tensor unit (0.1%) was used. The average utilization is about 45%.
If a user generates the next prompt in about one minute, then to keep the 11-second response time, no more than about 6 users can be served per GPU. That’s 100 W per user.
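The same consequences, as plain arithmetic over the figures above:

    # Figures assumed in the walkthrough above
    prefill_s, decode_s, net_s = 5.0, 6.0, 0.1
    latency_s = net_s + prefill_s + decode_s                  # ~11.1 s seen by the user

    # 1000/1000 tensor units busy during prefill, ~1/1000 during generation
    utilization = (prefill_s * 1.0 + decode_s * 0.001) / (prefill_s + decode_s)

    think_time_s  = 60.0                                      # one prompt per minute per user
    gpu_busy_s    = prefill_s + decode_s                      # 11 s of GPU time per request
    users_per_gpu = think_time_s / gpu_busy_s                 # ~5.5, call it 6
    print(f"latency {latency_s:.1f} s, utilization {utilization:.1%}, "
          f"~{users_per_gpu:.1f} users per GPU, ~{600 / 6:.0f} W per user")
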
There are tricks to make this a bit better, but just a bit. There is no workaround for the sequential nature of the generation phase.
If the context is much smaller, the answers much shorter, and the model much smaller (smaller KV cache), then things get closer to the marketing PR, and eventually ‘100 concurrent users’ can be achieved (presumably asking “is 1 + 1 two, answer yes or no”).
The Alternative
Now imagine the same user, running the same model, on a local device with enough RAM (64-128 GB).
As the user types the input, it is tokenized word by word, streamed, and immediately processed in the forward pass, token by token.
No need to process anything in parallel: the machine can run each token through the 100 layers faster than the user can type.
After the user types the last word of the prompt, only that last token still needs a forward pass, so the first token of the answer is ready almost immediately. TTFT (time to first token) ≈ 0.
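A minimal sketch of this streaming idea, with hypothetical stand-ins for the local runtime (tokenize and forward_pass here are not a real library API):

    # Hypothetical stand-ins for a local model runtime
    def tokenize(word):
        return [word]                       # keep the "1 word = 1 token" simplification

    def forward_pass(token, kv_cache):
        kv_cache.append(token)              # prefill the KV cache incrementally
        return f"<token-after-{token}>"     # dummy next-token prediction

    def stream_prompt(typed_words, kv_cache):
        """Run prefill word by word, while the user is still typing."""
        first_answer_token = None
        for word in typed_words:            # words arrive at human typing speed
            for token in tokenize(word):
                first_answer_token = forward_pass(token, kv_cache)
        # By the time the last word lands, prefill is already done: TTFT ~ 0
        return first_answer_token

    kv_cache = []
    print(stream_prompt("is 1 + 1 two".split(), kv_cache))
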
The KV cache is never deleted; it is stored in a file for each session.
For a preloaded session the startup time is 0. Switching to another session takes a few seconds (loading its KV cache from disk).
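A sketch of what per-session persistence could look like, assuming the local runtime can serialize its KV cache; the file layout and helpers below are hypothetical, not an existing tool:

    import pickle
    from pathlib import Path

    SESSIONS = Path.home() / ".local-llm" / "sessions"   # hypothetical on-disk layout

    def save_session(session_id: str, kv_cache) -> None:
        SESSIONS.mkdir(parents=True, exist_ok=True)
        (SESSIONS / f"{session_id}.kv").write_bytes(pickle.dumps(kv_cache))

    def load_session(session_id: str):
        path = SESSIONS / f"{session_id}.kv"
        # Reading a cache this size (tens of GB) from a fast SSD is
        # what makes switching sessions take a few seconds.
        return pickle.loads(path.read_bytes()) if path.exists() else []
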
About 20 W per active session (1 prompt per minute). Five times less than the GPU’s 100 W per user.
No Internet needed.
No subscription.
No data ever leaves the device.
Better performance (TTFT ≈ 0, no network round trips).
Some execs are having nightmares.

