Microsoft has revamped Bing’s search technology by integrating advanced language models, promising cost savings alongside quicker and more accurate search outcomes.
The latest updates feature a combination of large language models (LLMs), small language models (SLMs), and cutting-edge optimization strategies aimed at refining search performance.
Microsoft revealed these enhancements in a recent announcement, emphasizing its commitment to improving search. According to the company:
“At Bing, innovation drives our approach to search. By utilizing both Large Language Models (LLMs) and Small Language Models (SLMs), we’ve reached a pivotal moment in boosting search efficiency. While transformer models have been effective, evolving user demands require even more capable systems.”
Deploying LLMs at search scale raises challenges around speed and operational cost. To address this, Bing trained SLMs that it says deliver roughly 100 times the throughput of the LLMs they replace. The company elaborated:
“LLMs are resource-intensive and slow to execute. To enhance performance, we’ve trained SLMs, achieving approximately 100x throughput improvement over LLMs, resulting in faster and more precise query processing.”
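The scale of that gap is easier to appreciate with a rough benchmark. The sketch below, which assumes the Hugging Face transformers library and uses two arbitrary open models as stand-ins for Bing's proprietary ones (Microsoft has not disclosed which models it runs), times a 20-query batch through a small and a large causal language model:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def measure_throughput(model_name: str, prompts: list[str], max_new_tokens: int = 32) -> float:
    """Return queries per second for one batch of prompts on one model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Llama-family tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    start = time.perf_counter()
    model.generate(**batch, max_new_tokens=max_new_tokens)
    return len(prompts) / (time.perf_counter() - start)

# One 20-query batch, mirroring the batch size Bing reports below.
prompts = ["best hiking trails near seattle"] * 20

# Hypothetical stand-ins: any small/large causal-LM pair shows the same trend.
slm_qps = measure_throughput("TinyLlama/TinyLlama-1.1B-Chat-v1.0", prompts)
llm_qps = measure_throughput("meta-llama/Llama-2-70b-hf", prompts)
print(f"SLM: {slm_qps:.1f} qps | LLM: {llm_qps:.1f} qps | ratio: {slm_qps / llm_qps:.0f}x")
```

The exact ratio depends heavily on hardware, batch size, and output length; the point is simply that a model one or two orders of magnitude smaller can serve far more queries per second on the same GPU.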
Additionally, Bing employs NVIDIA’s TensorRT-LLM technology to optimize the performance of these models. TensorRT-LLM enables faster and more cost-effective execution of large models on NVIDIA GPUs.
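For context, TensorRT-LLM exposes a high-level Python API that compiles a model into an optimized engine for NVIDIA GPUs. The sketch below follows the pattern in the library's published quickstart; the model name is a hypothetical stand-in, and the exact API surface varies by release:

```python
# pip install tensorrt-llm  (requires a supported NVIDIA GPU and CUDA toolchain)
from tensorrt_llm import LLM, SamplingParams

def main() -> None:
    # On first load, the checkpoint is compiled into an optimized TensorRT engine.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # hypothetical stand-in model
    sampling = SamplingParams(max_tokens=64, temperature=0.0)  # greedy decoding, search-style answer
    prompts = ["Summarize the health benefits of intermittent fasting."]
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```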
Microsoft’s technical documentation highlights how integrating TensorRT-LLM has transformed Bing’s “Deep Search” functionality, which uses SLMs to deliver relevant results in real time.
Previously, Bing’s transformer model exhibited a latency of 4.76 seconds per batch (20 queries) and a throughput of 4.2 queries per second per instance. With TensorRT-LLM, latency has dropped to 3.03 seconds per batch, while throughput has increased to 6.6 queries per second per instance: a 36% reduction in latency and a 57% increase in throughput.
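Those percentages follow directly from the raw figures, since per-instance throughput is just batch size divided by batch latency. A few lines of arithmetic reproduce them:

```python
BATCH_SIZE = 20  # queries per batch, as reported

old_latency, new_latency = 4.76, 3.03  # seconds per batch
old_qps = BATCH_SIZE / old_latency     # ≈ 4.2 queries/sec per instance
new_qps = BATCH_SIZE / new_latency     # ≈ 6.6 queries/sec per instance

latency_reduction = (old_latency - new_latency) / old_latency  # ≈ 0.36 → 36%
throughput_gain = (new_qps - old_qps) / old_qps                # ≈ 0.57 → 57%

print(f"latency: -{latency_reduction:.0%}, throughput: +{throughput_gain:.0%}")
```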
Microsoft affirmed:
“Our mission is to deliver the best search experience without compromising quality. TensorRT-LLM allows us to reduce inference time, enhancing overall response speed while maintaining top-notch results.”
These advancements translate into several concrete benefits: faster delivery of search results, more accurate responses to complex queries, and lower operating costs on Microsoft’s side.
By adopting a hybrid approach that combines LLMs, SLMs, and TensorRT-LLM optimization, Bing is positioning itself as a leader in handling increasingly complex user queries.
As search engines strive to meet growing expectations, Bing’s use of smaller, highly optimized models demonstrates how modern search can balance speed, precision, and efficiency.