Sub-Millisecond Inference Platforms
Why your AI is too slow to survive.
Welcome to Silicon Sands News—the go-to newsletter for investors, senior executives, and founders navigating the intersection of AI, deep tech, and innovation. Join ~35,000 industry leaders across all 50 U.S. states and 117 countries—including top VCs from Sequoia Capital, Andreessen Horowitz (a16z), Accel, NEA, Bessemer Venture Partners, Khosla Ventures, and Kleiner Perkins.
Our readership also includes decision-makers from Apple, Amazon, NVIDIA, and OpenAI, some of the most innovative companies shaping the future of technology. Subscribe to stay ahead of the trends defining the next wave of disruption in AI, enterprise software, and beyond.
This week, we explore inference platforms: why they exist, who the market leaders are, where the value lies, what the different types are, and when to use them.
Let's Dive Into It...
Key Takeaways
For VCs and LPs:
The AI inference market is projected to reach nearly $255 billion by 2030, representing a massive and largely untapped investment opportunity compared to the crowded training market.
Specialized inference hardware companies like Groq and Cerebras are demonstrating significant performance gains over traditional GPUs, creating a new class of high-growth startups.
Portfolio companies that rely on real-time AI applications will see significant margin improvement by adopting specialized inference platforms, boosting their own valuations.
The shift to inference-first infrastructure is creating a new wave of M&A opportunities as established players look to acquire cutting-edge technology.
Early-stage investment in startups building on top of these new inference platforms will capture the next wave of AI-native innovation.
For Senior Executives:
Adopting specialized inference platforms can cut AI operational costs to as little as one-tenth of traditional GPU-based serving, directly impacting your bottom line.
The consistent low latency of these platforms enables new real-time AI applications that can transform customer experiences and internal workflows.
On-premise and private cloud deployment options from providers like Cerebras offer enhanced data security and regulatory compliance for sensitive enterprise data.
The speed and efficiency of these platforms can accelerate your company’s AI-driven innovation, giving you a significant competitive advantage.
Partnering with these emerging leaders can de-risk your AI strategy and provide access to cutting-edge technology before it becomes mainstream.
For Founders:
Building on specialized inference platforms can give your startup a significant performance advantage, allowing you to build products that were previously impossible.
The lower cost of inference can dramatically improve your unit economics and extend your runway.
The ability to offer real-time AI features can be a powerful differentiator in a crowded market.
The developer-friendly APIs and tools offered by these platforms can accelerate your time to market.
The growing ecosystem around these platforms provides access to a new community of developers, partners, and investors.
The Real Work Has Begun
For the past few years, the AI world has been obsessed with training. Bigger models, more data, and eye-watering compute budgets have dominated the headlines. But the gold rush is over. The real work of putting AI to use in the real world has begun, and it’s happening in the often-overlooked world of inference.
Inference is where AI models go to work, making predictions, generating text, and powering the applications that are changing our lives. And while training is a periodic, high-intensity affair, inference is a constant, real-time demand. This is where the rubber meets the road, and where the actual cost and performance of AI are felt.
The numbers tell the story. The AI inference market is projected to explode from $106.15 billion in 2025 to a staggering $254.98 billion by 2030, a compound annual growth rate of nearly 20%. That trajectory represents the next wave of AI innovation and investment focus.
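For readers who want to check the math, the implied growth rate is straightforward to verify. A minimal sketch in Python, using the projected figures above:

```python
# Verify the implied CAGR of the AI inference market projection.
start, end, years = 106.15, 254.98, 5  # $B in 2025 -> $B in 2030

cagr = (end / start) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # -> Implied CAGR: 19.2%
```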
And a new breed of company is emerging to meet this demand, building specialized hardware and software platforms that are leaving traditional GPUs in the dust.
Two of the most prominent players in this new landscape are Groq and Cerebras. These companies are not just building faster chips; they are rethinking the entire AI infrastructure stack, from the silicon up to the software. And they are delivering performance gains that are not just incremental, but orders of magnitude better than what was previously possible. Groq alone has attracted over 2 million developers and teams to its platform.
A Tale of Two Titans
The performance advantages become even more apparent across the broader landscape: on published benchmarks, specialized inference platforms consistently outpace traditional GPU-based solutions.
The Need for Speed
Groq is a company that is obsessed with speed. Their LPU™ (Language Processing Unit) is a custom-built chip designed from the ground up for one purpose: to run AI inference as fast as physically possible. And the results are staggering. Groq claims sub-millisecond latency that remains consistent even as workloads scale. This is a game-changer for real-time applications like voice assistants, chatbots, and online gaming, where even a few hundred milliseconds of lag can be the difference between a seamless experience and a frustrating one.
The company’s deterministic, single-core architecture is a radical departure from the multi-core approach of traditional GPUs. This design eliminates the need for complex scheduling and synchronization, which can introduce unpredictable delays. The result is a level of performance and predictability that is not possible with off-the-shelf hardware.
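Those latency claims are easy to sanity-check from the outside. Because GroqCloud exposes an OpenAI-compatible API, a time-to-first-token measurement takes a few lines of Python. This is a minimal sketch, not an official benchmark harness: the model name is an assumption to verify against Groq's current catalog, a GROQ_API_KEY environment variable is required, and network round-trip time will dominate what you observe from outside Groq's data centers.

```python
import os
import time

from openai import OpenAI  # pip install openai

# GroqCloud is OpenAI-compatible: only the base URL and API key change.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model name; verify in the catalog
    messages=[{"role": "user", "content": "Say hello in five words."}],
    stream=True,
)
for _chunk in stream:
    # The first streamed chunk approximates user-perceived time to first token.
    print(f"First token after {time.perf_counter() - start:.3f}s")
    break
```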
Groq's impressive technology has attracted a growing list of high-profile customers, including Dropbox, Vercel, Canva, and Robinhood. These companies are using Groq's platform to power a new generation of AI-native products and services. The company has also been on a fundraising tear, raising $640 million in a Series D round in August 2024 at a $2.8 billion valuation. And if the rumors are true, they are in the process of raising another $600 million at a valuation of around $6 billion. This funding activity reflects broader investor confidence in the inference platform market.
The Power of Scale
While Groq is focused on speed, Cerebras is all about scale. Their Wafer Scale Engine (WSE) is a marvel of modern engineering, a single chip the size of a dinner plate that contains trillions of transistors. This allows Cerebras to pack an unprecedented amount of computing power into a single device, enabling them to train and run AI models on a scale that was previously unimaginable. The company's infrastructure is designed to serve over 40 million tokens per second by the end of 2025.
The company's WSE-3, unveiled in March 2024, is billed as the fastest AI chip on the planet, and it is powering a new generation of AI supercomputers. Cerebras is not just selling chips; they are providing a full-stack platform that allows customers to train, fine-tune, and serve models on a single, unified infrastructure. This dramatically simplifies the AI development process and enables companies to go from idea to production in record time.
Cerebras has also built an impressive roster of customers, including some of the biggest names in tech and life sciences. Perplexity, AlphaSense, Meta, GSK, and the Mayo Clinic are all using Cerebras' platform to push the boundaries of what is possible with AI. The company has also been successful in the public sector, with a significant partnership with the US Department of Energy. With over $720 million in funding, Cerebras is well-capitalized to continue its ambitious roadmap and challenge the dominance of NVIDIA in the AI market.
Why Speed Equals Money
The shift from training to inference is not just about technology; it's about economics. Training an AI model is largely a one-time cost, but inference is an ongoing expense that scales with usage. As AI applications become more popular and more widely deployed, the cost of inference can quickly dwarf the initial training costs. This is why companies are increasingly focused on optimizing their inference infrastructure.
Consider the economics of a popular AI-powered chatbot. Training the underlying language model might cost millions of dollars, but serving billions of queries per month can cost tens of millions more. Every millisecond of latency reduction and every percentage point of efficiency improvement translates directly to the bottom line. This is where specialized inference platforms like Groq and Cerebras shine.
Independent benchmarks from ArtificialAnalysis.ai, for example, have ranked Groq's platform among the lowest cost per token in the industry. The cost advantages are significant, with some platforms offering 6-10x lower cost per token than traditional cloud APIs.
This is not just about raw performance; it's about delivering that performance efficiently and cost-effectively. For a company serving millions of AI queries per day, the savings can be substantial.
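To make the economics concrete, here is a back-of-the-envelope model. Every number in it is an illustrative assumption, not a quoted rate from any provider; substitute your own traffic and pricing:

```python
# Back-of-the-envelope inference cost model. All prices are
# illustrative assumptions, not quotes from any provider.
queries_per_month = 1_000_000_000      # a popular consumer chatbot
tokens_per_query = 500                 # prompt + completion, averaged
gpu_cloud_price = 5.00                 # $ per 1M tokens (assumed)
specialized_price = 0.60               # $ per 1M tokens (assumed, ~8x cheaper)

tokens = queries_per_month * tokens_per_query
gpu_cost = tokens / 1e6 * gpu_cloud_price
fast_cost = tokens / 1e6 * specialized_price
print(f"GPU cloud:   ${gpu_cost:,.0f}/month")       # $2,500,000/month
print(f"Specialized: ${fast_cost:,.0f}/month")      # $300,000/month
print(f"Savings:     {gpu_cost / fast_cost:.1f}x")  # 8.3x
```

At that scale, a 6-10x per-token difference compounds into tens of millions of dollars a year.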
Cerebras takes a different approach, focusing on the total cost of ownership. By providing a full-stack platform that handles training, fine-tuning, and inference, they eliminate the need for complex multi-vendor integrations and reduce operational overhead. This can be particularly valuable for large enterprises that need to deploy AI at scale across multiple use cases.
Where Inference Platforms Make the Difference
The real test of any technology is how it performs in the real world. Both Groq and Cerebras have impressive customer lists, but what are these companies actually doing with these platforms?
Perplexity AI represents one of Cerebras's most significant partnerships and success stories. In February 2025, Perplexity launched Sonar, a groundbreaking search model powered by Cerebras infrastructure that processes an unprecedented 1,200 tokens per second. Built on Llama 3.3 70B, Sonar delivers near-instant answer generation that is transforming how users search and discover information. Denis Yarats, CTO of Perplexity, noted that "Cerebras' cutting-edge AI inference infrastructure has enabled us to achieve unprecedented speeds and efficiency, setting a new standard for search."
AlphaSense, a leading financial search and analytics company, is using Cerebras to power its AI-driven insights platform. According to Raj Neervannan, CTO and co-founder of AlphaSense, "By partnering with Cerebras, we are integrating cutting-edge AI infrastructure that allows us to deliver unprecedented speed and the most accurate and relevant insights available – helping our customers make smarter decisions with confidence." For a company that processes vast amounts of financial data in real time, the speed and scale of Cerebras's platform are a game-changer.
Meta is using Cerebras to accelerate their AI development process. Ahmad Al-Dahle, VP of GenAI at Meta, notes that "By delivering over 2,000 tokens per second for Scout – more than 30 times faster than closed models like ChatGPT or Anthropic, Cerebras is helping developers everywhere to move faster, go deeper, and build better than ever before." This kind of performance improvement allows Meta to iterate faster and deploy AI features more quickly.
In the healthcare sector, GSK is leveraging Cerebras to revolutionize drug discovery. Kim Branson, SVP of AI and ML at GSK, explains that "With Cerebras' inference speed, GSK is developing innovative AI applications, such as intelligent research agents, that will fundamentally improve the productivity of our researchers and drug discovery process." The ability to run complex molecular simulations in real-time is opening up new possibilities for pharmaceutical research.
On the Groq side, companies like Dropbox, Vercel, and Canva are using the platform to power real-time AI features in their products. The sub-millisecond latency of Groq's LPU™ enables these companies to offer AI-powered features that feel instantaneous to users, creating a competitive advantage in user experience.
Willow Voice, an AI-powered dictation tool, exemplifies the transformative impact of Groq's infrastructure. Lawrence Liu, CTO and Co-founder of Willow, explains how switching to Groq solved their reliability challenges: "Since switching to Groq, we've had zero downtime. That's been transformational for our users and our team." The performance improvements were equally impressive, with Liu noting that "We expected latency to increase linearly with longer token counts, but with Groq, it didn't. That was a huge win."
The PGA of America has also embraced Groq to power their "PGA Assistant," an AI tool that helps employees with routine tasks. Kevin Scott, CTO of the PGA of America, highlights Groq's developer-friendly approach: "Groq has become our go-to playground for developers – super useful and lightning-fast." He emphasizes the practical benefits: "If we have things where performance matters more, we come to Groq - you deliver real, working solutions, not just buzzwords."
When to Choose Inference Platforms Over Traditional Solutions
The decision to adopt a specialized inference platform is not always straightforward. Traditional GPU-based solutions from NVIDIA and cloud providers like AWS, Google, and Microsoft have their own advantages, including broad ecosystem support and proven reliability. When does it make sense to choose a specialized platform like Groq or Cerebras?
The answer depends on your specific use case and requirements. If you are building a real-time application where latency is critical, Groq's sub-millisecond performance can be a game-changer. If you need to train and deploy large-scale models quickly, Cerebras's full-stack platform can significantly reduce time to market. If cost optimization is your primary concern, the efficiency gains of these platforms can deliver substantial savings at scale.
For startups and smaller companies, the decision often comes down to developer experience and ease of use. Both Groq and Cerebras offer developer-friendly APIs and tools that can accelerate development. Groq's GroqCloud™ platform, for example, offers OpenAI-compatible APIs, making it easy for developers to switch from existing solutions. Cerebras offers similar compatibility and has recently made its platform available through AWS Marketplace, further simplifying adoption.
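In practice, "OpenAI-compatible" means that switching providers is a configuration change rather than a rewrite. A minimal sketch of what that looks like; the base URLs and environment-variable names below are assumptions to verify against each provider's documentation:

```python
import os

from openai import OpenAI  # pip install openai

# OpenAI-compatible endpoints make provider switching a config change.
# Base URLs are assumptions; confirm against each provider's docs.
PROVIDERS = {
    "openai":   ("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "groq":     ("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "cerebras": ("https://api.cerebras.ai/v1", "CEREBRAS_API_KEY"),
}

def make_client(provider: str) -> OpenAI:
    """Return a client for the given provider; application code is unchanged."""
    base_url, key_var = PROVIDERS[provider]
    return OpenAI(base_url=base_url, api_key=os.environ[key_var])

client = make_client("groq")  # swap the string to A/B-test providers
```

Because the request and response schemas match, comparing providers on latency and cost becomes a one-line change rather than a migration project.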
For large enterprises, the decision is often more complex, involving considerations around security, compliance, and integration with existing systems. Cerebras's on-premise and private cloud options can be particularly attractive for companies with strict data governance requirements. The company's SOC2 and HIPAA certifications also make it suitable for regulated industries like healthcare and finance.
A New Era of AI Infrastructure
The emergence of specialized inference platforms is reshaping the AI infrastructure landscape. Traditional players like NVIDIA, which has dominated the AI training market with its H100 and A100 GPUs, are now facing competition from a new generation of purpose-built solutions, each staking out a distinct position on the performance and cost spectrum.
Hardware-First Approaches
Beyond Groq and Cerebras, several other companies are taking hardware-first approaches to AI inference. SambaNova Systems has built its own Reconfigurable Dataflow Units (RDUs) and claims to deliver the world's fastest AI inference service, with over 100 tokens per second for Llama 3.1 405B. Their SambaCloud platform recently launched as a turnkey AI inference solution deployable in just 90 days.
Graphcore has developed Intelligence Processing Units (IPUs) explicitly designed for AI workloads, offering low-latency, high-performance inference solutions through their IPU Inference Toolkit.
Cloud Provider Custom Silicon
The major cloud providers are not sitting idle. Google's Tensor Processing Units (TPUs) have evolved from training-focused chips to powerful inference accelerators, with TPU v5e delivering 3x more inference throughput per dollar than previous generations.
Amazon Web Services has developed Inferentia chips specifically for machine learning inference, offering significantly lower costs for high-volume inference workloads compared to traditional GPU instances.
Software-First Platforms
A new category of companies is taking a software-first approach, optimizing inference performance through advanced algorithms and distributed computing rather than custom hardware. Together AI and Fireworks AI are leading this charge, offering rapid inference speeds and high uptime across broad catalogs of open-source models.
These platforms excel at making state-of-the-art models accessible through simple APIs, with Together AI focusing on fast, reliable hosted model inference and Fireworks AI delivering real-time performance with minimal latency.
Serverless and Developer-Focused Platforms
The serverless AI inference space has exploded with platforms designed for developer productivity. Modal offers high-performance AI infrastructure with auto-scaling capabilities, while RunPod provides GPU-enabled environments that can be spun up instantly.
Replicate has carved out a niche in making AI models accessible to developers through simple APIs. At the same time, platforms like Baseten and Hugging Face Inference Endpoints offer fast iteration and model hosting capabilities.
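To give a flavor of how lightweight these serverless platforms are, here is what calling a hosted open-source model on Replicate looks like. This is a minimal sketch: the model identifier is an assumption to check against Replicate's catalog, and a REPLICATE_API_TOKEN environment variable is required.

```python
import replicate  # pip install replicate; needs REPLICATE_API_TOKEN set

# One call runs a hosted model; the platform handles provisioning,
# scaling, and per-second billing behind the API.
output = replicate.run(
    "meta/meta-llama-3-70b-instruct",  # assumed model id; verify in the catalog
    input={"prompt": "In one sentence, why does inference latency matter?"},
)
print("".join(output))  # language models stream tokens back as an iterator
```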
This is not just about hardware; it's about rethinking the entire AI stack. The result is a rapidly evolving ecosystem where performance, cost, and ease of use are the key differentiators, with each platform targeting different segments of the market from enterprise-scale deployments to individual developer projects.
The market is also seeing the emergence of new business models. While traditional hardware vendors sell chips, companies like Groq and Cerebras are offering AI infrastructure as a service. This allows customers to access cutting-edge technology without the upfront capital investment and operational complexity of managing their own hardware.
The Next Wave of AI Innovation
For investors, the shift to inference represents a massive opportunity. As noted above, the market is projected to grow from $106.15 billion in 2025 to $254.98 billion by 2030, a compound annual growth rate of nearly 20%. This growth is being driven by the increasing deployment of AI applications across industries and the need for more efficient and cost-effective inference solutions.
The funding activity around companies like Groq and Cerebras is a clear indication of investor interest. Groq's recent $640 million Series D round and rumored $6 billion valuation demonstrate the market's confidence in the company's technology and business model. Similarly, Cerebras's ability to raise over $720 million in funding shows that investors are betting big on the future of specialized AI infrastructure.
But the opportunity extends beyond these two companies. The shift to inference is creating opportunities across the entire AI stack, from chip design to software optimization to application development. Startups that can leverage these new platforms to build innovative AI applications will have a significant advantage in the market.
Let's Wrap This Up
The AI landscape is shifting. The era of training-at-all-costs is giving way to a new focus on efficient, real-time inference. This is where the actual value of AI will be unlocked, and where the next generation of AI-native companies will be built. Groq and Cerebras are at the forefront of this shift, with their specialized hardware and software platforms that are delivering unprecedented levels of performance and scale.
For investors, the message is clear: the inference market represents a massive and largely untapped opportunity. For executives, the efficiency gains and cost savings of these platforms can provide a significant competitive advantage. For founders, the ability to build real-time AI applications that were previously impossible opens up entirely new categories of products and services.
The future of AI is not just about bigger models and more data; it's about putting AI to work in the real world, efficiently and cost-effectively. The companies that understand this shift and position themselves accordingly will be the winners in the next phase of the AI revolution.