Trustpilot has built a real-time architecture that processes millions of user reviews under strict latency and cost constraints. The work is part of the company's transition of its core technology stack towards generative AI, and centers on a high-volume streaming pipeline built on fine-tuned Gemma models.
Deep Review Intelligence
At the heart of Trustpilot's operations is the need to deliver actionable review intelligence. The platform, which emphasizes transparency and authentic feedback, prioritizes data integrity and extracting value from incoming reviews. Large language models (LLMs) have proven effective for tasks such as named entity recognition (NER), sentiment scoring, and customer intent analysis.
Benefits of Fine-Tuning Open Models
Trustpilot opted to fine-tune open-weight models like Gemma instead of relying on off-the-shelf models. This approach offers several advantages:
- Model Independence: Trustpilot maintains control over the retraining lifecycle, avoiding dependency on third-party updates.
- Cost Predictability: Moving to fixed infrastructure costs keeps high-volume prediction financially viable as traffic scales.
- MLOps Expansion: Developing models in-house builds Trustpilot's MLOps capabilities and lets the models incorporate its own review intelligence.
- Architectural Continuity: Standardizing on open-weight models enables seamless upgrades and performance improvements.
Instead of deploying a single large model, Trustpilot developed a suite of specialized models based on the lightweight google/gemma-2-9b. This strategy involved creating high-quality training datasets through consensus annotation of a diverse review corpus.
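Consensus annotation can be illustrated with a minimal sketch. The source does not describe how agreement was measured, so the majority-vote rule, the 0.6 agreement threshold, and the function names below are all assumptions made for illustration, not Trustpilot's method.

```python
from collections import Counter
from typing import Dict, List, Optional


def consensus_label(labels: List[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority label if enough annotators agree, else None.

    Assumption: simple majority vote with a hypothetical agreement threshold.
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None


def build_training_set(
    annotations: Dict[str, List[str]], min_agreement: float = 0.6
) -> Dict[str, str]:
    """Keep only the reviews whose annotators reached consensus."""
    dataset = {}
    for review_id, labels in annotations.items():
        label = consensus_label(labels, min_agreement)
        if label is not None:
            dataset[review_id] = label
    return dataset


if __name__ == "__main__":
    # Hypothetical annotations: review id -> labels from three annotators.
    raw = {
        "r1": ["positive", "positive", "negative"],
        "r2": ["positive", "negative", "neutral"],
    }
    print(build_training_set(raw))
```

Reviews without sufficient agreement are dropped rather than guessed, which trades dataset size for label quality.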
System Architecture Overview
The architecture pairs Dataflow with Gemini Enterprise Agent Platform Endpoints, using Apache Beam's VertexAIModelHandlerJSON to call the endpoints from the pipeline. Trustpilot established two distinct endpoints:
- Classifier: A FastAPI-based endpoint for handling preprocessing and chaining tasks.
- LLM: A dedicated endpoint for serving the Gemma model via vLLM.
This separation keeps preprocessing and chaining logic apart from model serving and allows each endpoint to scale independently based on traffic demands.
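As a rough sketch of how a Dataflow pipeline can call such an endpoint, the Apache Beam snippet below wires VertexAIModelHandlerJSON into RunInference. The record fields (id, text), the output row shape, and the project, location, and endpoint ID arguments are hypothetical placeholders; the source does not show Trustpilot's actual pipeline code or say which endpoint the pipeline calls directly.

```python
from typing import Any, Dict


def to_instance(review: Dict[str, Any]) -> Dict[str, Any]:
    """Shape a raw review record into a JSON instance (field names are assumed)."""
    return {"review_id": review["id"], "text": review["text"].strip()}


def to_row(result: Any) -> Dict[str, Any]:
    """Flatten a Beam PredictionResult (example, inference) into an output row."""
    return {"review_id": result.example["review_id"], "prediction": result.inference}


def attach_inference(reviews, project: str, location: str, endpoint_id: str):
    """Apply remote inference to a PCollection of review dicts.

    Beam is imported lazily so the helpers above work without it installed.
    Running this step needs Google Cloud credentials and a deployed endpoint.
    """
    import apache_beam as beam
    from apache_beam.ml.inference.base import RunInference
    from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON

    handler = VertexAIModelHandlerJSON(
        endpoint_id=endpoint_id,  # hypothetical ID of the deployed endpoint
        project=project,
        location=location,
    )
    return (
        reviews
        | "ToInstance" >> beam.Map(to_instance)
        | "CallEndpoint" >> RunInference(handler)
        | "ToRow" >> beam.Map(to_row)
    )
```

Keeping the request shaping and result flattening in plain functions makes them easy to unit test without deploying anything.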
Performance Optimization
To maximize the efficiency of the vLLM-based endpoints, Trustpilot focused on optimizing the backend configuration, particularly for A2 VMs with A100 GPUs. Key strategies included:
- Adjusting engine parameters to prevent bottlenecks.
- Selecting appropriate data types and enabling prefix caching.
- Implementing a reusable load testing framework to determine optimal server capacity.
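A reusable load test of the kind listed above might look like the following sketch. It sweeps concurrency levels against an injected async send function and reports throughput and p95 latency per level. The function names, the concurrency levels, and the fake_send stand-in are assumptions for illustration; a real test would replace fake_send with a request to the serving endpoint.

```python
import asyncio
import math
import time
from typing import Awaitable, Callable, Dict, List


async def run_level(
    send: Callable[[], Awaitable[None]], concurrency: int, requests_per_worker: int
) -> Dict[str, float]:
    """Run `concurrency` workers in parallel and collect per-request latencies."""
    latencies: List[float] = []

    async def worker() -> None:
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            await send()
            latencies.append(time.perf_counter() - start)

    started = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    elapsed = time.perf_counter() - started

    latencies.sort()
    # Nearest-rank p95, clamped to the last index.
    idx = min(len(latencies) - 1, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "concurrency": concurrency,
        "requests": len(latencies),
        "throughput_rps": len(latencies) / elapsed,
        "p95_latency_s": latencies[idx],
    }


async def sweep(
    send: Callable[[], Awaitable[None]], levels: List[int], requests_per_worker: int = 20
) -> List[Dict[str, float]]:
    """Measure each concurrency level in turn, lowest to highest as given."""
    return [await run_level(send, level, requests_per_worker) for level in levels]


async def fake_send() -> None:
    """Stand-in for a real request to the endpoint (hypothetical 10 ms latency)."""
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    for row in asyncio.run(sweep(fake_send, [1, 4, 16])):
        print(row)
```

The capacity of a server can then be read off as the highest concurrency level at which p95 latency stays within the target while throughput is still increasing.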
Challenges Faced
During the architecture's development, Trustpilot encountered several challenges:
- Private Networking: Achieving full isolation through private endpoints was hindered by a lack of native support for direct communication.
- Deployment Observability: Slow or opaque deployments occasionally required additional troubleshooting.
- GPU Scarcity: A100 GPUs were difficult to obtain in the EU region, which complicated resource allocation.
Achieving Results
Working with Google Cloud, Trustpilot runs its fine-tuned Gemma models on the Gemini Enterprise Agent Platform and processes millions of reviews daily in near real time. The models delivered performance comparable to Gemini models at a significantly reduced cost, turning customer reviews into immediate, actionable insights.