Some folks from Cerebras were on the most recent episode of the Oxide and Friends podcast. I’d not heard of Cerebras before, but they’ve developed custom silicon for doing AI inference called the WSE-3. It takes up an entire silicon wafer, and has 900,000 cores and 44 GB of on-die SRAM:
> This gives every core single-clock-cycle access to fast memory at extremely high bandwidth – 21 PB/s.
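For a sense of scale, here’s a quick back-of-the-envelope split of those headline figures across the cores (my own arithmetic from the numbers above, not a figure Cerebras quotes):

```python
# Rough per-core figures, assuming the headline numbers:
# 900,000 cores, 44 GB of on-die SRAM, 21 PB/s aggregate bandwidth.
cores = 900_000
sram_bytes = 44e9          # 44 GB on-die SRAM
bandwidth_bytes_s = 21e15  # 21 PB/s aggregate memory bandwidth

print(f"SRAM per core:      {sram_bytes / cores / 1e3:.0f} KB")          # ~49 KB
print(f"Bandwidth per core: {bandwidth_bytes_s / cores / 1e9:.0f} GB/s") # ~23 GB/s
```

So each core gets only a small slice of SRAM, but with local, single-cycle access rather than going out to shared DRAM.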
Those are some pretty fun figures. All of this aims to make AI inference fast. You can try it out with a small selection of models at inference.cerebras.ai. For the couple of prompts I tried, the responses were nearly instantaneous, which is mighty impressive. Of course, their table of comparative figures against the Nvidia H100 does not include power consumption, but I imagine it could still come out ahead of a cluster of individual machines.