Post by Etched

We're coming out of stealth. We've built our first racks after a successful A0 tapeout, $1B+ in customer contracts, and $800M raised. Early customer tests show us achieving SOTA throughput, latency, and power efficiency on inference workloads. Our first racks ship this summer.

We're a team of 400+ engineers from NVIDIA, Google TPUs, Broadcom, SK Hynix, TSMC, & more. We're backed by Jane Street, HRT, Two Sigma, and Jump, with strategic investment from VentureTech Alliance. We're excited to deepen our partnership with the world's leading semiconductor manufacturer. Our Series B was led by Stripes, with participation from Ribbit Capital, Radical Ventures, Positive Sum, Primary, & Argo.

Our inference systems are built to push the entire Pareto curve on frontier models, including many-trillion-parameter MoEs, long context, and agentic workloads. Today, we're sharing two breakthroughs to make this happen:

Low-Voltage Inference (LVI) for high-throughput workloads. Today, AI chips can't scale FLOPs without thermal throttling. As FLOPs utilization increases, AI chips draw more power and downregulate clock speed. This often results in sustained inference throughput under half of peak FLOPs. Chips in other industries solve the power problem by running at lower voltages. Bitcoin miners run at under one-third the voltage of AI chips! We’ve designed a new architecture to run our chip’s math blocks at under half the voltage of most AI chips. This enables multiple times the FLOPs density of AI chips today.

Cluster-Scale Memory (CSM) for low-latency workloads. Today's AI chips using HBM can’t achieve SRAM-level decode speeds due to memory subsystem and interconnect bottlenecks. SRAM-only chips have lower FLOPs density and memory capacity, sacrificing throughput. You’re forced to make a tradeoff: serve at much slower speeds, or run at low batch sizes and suffer from higher costs.
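The tradeoff above can be made concrete with a rough, memory-bound model of decode. This is an illustrative sketch, not Etched's performance model: it assumes each decode step reads the active weights once plus a per-sequence KV cache, and it ignores compute time, interconnect time, and memory capacity limits. Every constant is a hypothetical placeholder, not a measured figure for any chip.

```python
# Hypothetical placeholder numbers, chosen only to illustrate the shape of the tradeoff.
ACTIVE_WEIGHT_BYTES = 40e9      # bytes of weights read per decode step (assumed)
KV_BYTES_PER_SEQUENCE = 1e9     # bytes of KV cache read per sequence per step (assumed)
MEM_BANDWIDTH = 4e12            # bytes/second of memory bandwidth (assumed)


def decode_step_seconds(batch_size: int) -> float:
    """Memory-bound time for one decode step that produces one token per sequence."""
    bytes_read = ACTIVE_WEIGHT_BYTES + batch_size * KV_BYTES_PER_SEQUENCE
    return bytes_read / MEM_BANDWIDTH


def decode_rates(batch_size: int) -> tuple[float, float]:
    """Return (tokens/s seen by one user, tokens/s summed over the whole batch)."""
    step = decode_step_seconds(batch_size)
    return 1.0 / step, batch_size / step


if __name__ == "__main__":
    for batch_size in (1, 8, 64):
        per_user, aggregate = decode_rates(batch_size)
        print(f"batch={batch_size:3d}  per-user={per_user:7.1f} tok/s  "
              f"aggregate={aggregate:8.1f} tok/s")
```

In this toy model, small batches give each user faster tokens but low aggregate throughput per chip, which is the cost side of the tradeoff; large batches do the reverse.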
When running large MoE models, token routing across experts requires sending data through a deep memory hierarchy and a networking switch to reach a destination expert. Each memory layer inherently adds latency; thus, the best layer is no layer.

We’ve designed a new architecture that creates a shared low-latency memory pool across the entire scale-up domain. We use a proprietary ultra-low-latency, high-bandwidth interconnect to enable dramatically faster memory access across chips. Our HBM/SRAM hybrid design solves both memory capacity and mem2mem latency, enabling high throughput and interactivity simultaneously. CSM improves latency and avoids today's cost, reliability, yield, thermal, and compute tradeoffs of SRAM-only chips, 3D DRAM chips, or optics.

We're scaling production as fast as possible. We built a 2MW datacenter in our office and opened a Taiwan factory for 24/7 engineering. Performance, roadmap, and more updates are coming this summer.

If this excites you, join us to build the future of gigawatt-scale inference: https://lnkd.in/gaQBX-Fz.
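As a footnote to the Low-Voltage Inference claim earlier in this post, the standard first-order CMOS relation says dynamic switching energy per operation scales with capacitance times voltage squared. The sketch below applies only that textbook relation. It is not Etched's analysis: it holds capacitance fixed and ignores leakage, the lower clock speeds usually reachable at low voltage, and power drawn outside the math blocks. The function names are hypothetical.

```python
def relative_dynamic_energy_per_op(voltage_ratio: float) -> float:
    """Dynamic energy per operation relative to a baseline, using E ∝ C·V² with C fixed.

    voltage_ratio is new_voltage / baseline_voltage.
    """
    return voltage_ratio ** 2


def relative_ops_per_joule(voltage_ratio: float) -> float:
    """First-order gain in operations per joule of dynamic energy at the new voltage."""
    return 1.0 / relative_dynamic_energy_per_op(voltage_ratio)


if __name__ == "__main__":
    for ratio in (1.0, 0.5):
        print(f"V/V0={ratio:.2f}  energy/op={relative_dynamic_energy_per_op(ratio):.2f}x  "
              f"ops/joule={relative_ops_per_joule(ratio):.2f}x")
```

Under these assumptions, halving the voltage cuts dynamic energy per operation to a quarter, which is why a fixed power and thermal budget can in principle support several times more math per unit area.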