Post by Mani Narayanan

Senior Director Of Engineering at Medallia

𝐅𝐚𝐬𝐭𝐞𝐫 𝐢𝐬 𝐨𝐧𝐥𝐲 𝐟𝐚𝐬𝐭𝐞𝐫 𝐰𝐡𝐞𝐧 𝐭𝐡𝐞 𝐚𝐧𝐬𝐰𝐞𝐫 𝐬𝐭𝐚𝐲𝐬 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞.

The optimization moves were picked with the target hardware in mind. The DGX Spark hosts three models in unified memory. The L40S is the target for bandwidth on a single model. FP8 was the headline because it is native to both, so a config that holds on one migrates to the other.

The first move cut latency by forty-four percent and changed what the model produced. Sparse inputs picked up content the source did not have. Dense inputs lost what was there. The order of outputs changed between runs. That move was rejected. The wave kept the slower version that returns the same answer twice.

The path forward is hardware, not a quality trade:

1. DGX Spark today, twenty-seven seconds.
2. L40S, ten to twelve seconds with the bandwidth lift alone.
3. Future Blackwell hardware with NVFP4, under ten seconds within reach.

#AIArchitecture #EngineeringLeadership #HardwareArchitecture