TorchServe GenAI use cases and showcase
This document showcases interesting use cases for TorchServe in GenAI deployments.
Enhancing LLM Serving with Torch Compiled RAG on AWS Graviton
In this blog post, we show how to deploy a RAG endpoint with TorchServe, increase its throughput with torch.compile,
and use the retrieved context to improve the responses generated by the Llama endpoint. We also show how the RAG endpoint can be deployed on CPU using AWS Graviton while the Llama endpoint remains on a GPU. This microservices-based RAG solution uses compute resources efficiently, which can lower costs for customers.
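The two-endpoint flow described above can be sketched from the client side as follows. This is a minimal illustration, not the blog's code. It assumes both services expose the standard TorchServe prediction route (`/predictions/<model_name>` on the inference port), that each handler accepts and returns plain text, and that the host names, the model names `rag` and `llama`, and the prompt template are hypothetical placeholders.

```python
import json
from typing import Callable
from urllib import request

# Hypothetical endpoint URLs and model names; adjust to your deployment.
RAG_URL = "http://rag-host:8080/predictions/rag"    # CPU instance (e.g. AWS Graviton)
LLM_URL = "http://llm-host:8080/predictions/llama"  # GPU instance


def post_text(url: str, text: str, timeout: float = 60.0) -> str:
    """POST a UTF-8 text body to a prediction URL and return the reply as text."""
    req = request.Request(url, data=text.encode("utf-8"), method="POST")
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def build_prompt(question: str, context: str) -> str:
    """Combine retrieved context and the question (assumed prompt template)."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question: str, post: Callable[[str, str], str] = post_text) -> str:
    """Two-hop flow: fetch context from the RAG endpoint, then query the LLM."""
    context = post(RAG_URL, question)         # hop 1: retrieval on CPU
    prompt = build_prompt(question, context)  # augment the question with context
    return post(LLM_URL, prompt)              # hop 2: generation on GPU
```

Because the transport is passed in as `post`, the flow can be exercised without a running server by supplying a stub. Keeping retrieval and generation behind separate URLs is what lets each service be placed and scaled on different hardware.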