Organizations often struggle to process vast amounts of data while ensuring low latency. Smartsheet, a work management platform, faced this challenge as it processed hundreds of thousands of events per second for features like live collaboration and notifications. The existing system required significant resources, leading to inefficiencies and high costs.
To address these issues, Smartsheet developed a Real-time Dynamic Filtering (RDF) system using Amazon Managed Service for Apache Flink. The new system cut messaging costs by more than $40,000 per month and improved live collaboration latency by a factor of 1.8.
Challenges with Existing Architecture
Smartsheet's event-driven architecture relied on Amazon Simple Notification Service (SNS) to publish events, with teams subscribing via Amazon Simple Queue Service (SQS) using static filter policies. These policies dictated the types of events received, such as updates or deletions of sheet rows.
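To make the static model concrete, here is a minimal sketch of how a filter policy selects events. It is a simplified model of attribute matching, not the SNS implementation: it handles only exact string matches, and the attribute names (`event_type`, `sheet_id`) and event values are hypothetical.

```python
import json

# Hypothetical filter policy: attribute name -> list of accepted values.
# Real SNS policies support more operators (prefix, numeric ranges, and so on).
FILTER_POLICY = json.loads('{"event_type": ["row_updated", "row_deleted"]}')


def matches(policy: dict, attributes: dict) -> bool:
    """Return True when every policy key is present and its value is accepted."""
    return all(attributes.get(key) in accepted for key, accepted in policy.items())


# Illustrative events only; these are not taken from Smartsheet's system.
events = [
    {"event_type": "row_updated", "sheet_id": "s1"},
    {"event_type": "comment_added", "sheet_id": "s1"},
    {"event_type": "row_deleted", "sheet_id": "s2"},
]

# Only events accepted by the policy reach the subscriber's queue.
delivered = [e for e in events if matches(FILTER_POLICY, e)]
```

The policy is fixed at subscription time, so changing what a team receives means changing the policy itself.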
However, changes to these filter policies could take up to 15 minutes to propagate, creating delays that hindered real-time collaboration. As a workaround, teams subscribed to all events, resulting in over 90% of these events being discarded after processing, which was both costly and inefficient.
Implementing Real-time Dynamic Filtering
To enhance efficiency, Smartsheet's RDF system moved filtering logic directly into the stream processing layer. Using Flink's KeyedCoProcessFunction, the system keeps dynamic filter policies in Flink keyed state, backed by the RocksDB state backend.
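The pattern can be illustrated without Flink. The sketch below is a plain-Python stand-in for a keyed two-input operator, not the Flink API and not Smartsheet's code: a dictionary plays the role of keyed state, and the key (`sheet_id`), the message fields, and the consumer name are all assumptions made for illustration.

```python
from collections import defaultdict


class DynamicFilter:
    """Plain-Python stand-in for a keyed two-input operator.

    `self._state` plays the role of keyed state: one entry per key
    (here a hypothetical sheet_id), written by the policy stream and
    read by the event stream.
    """

    def __init__(self):
        self._state = defaultdict(set)  # sheet_id -> interested consumers

    def process_policy(self, update: dict) -> None:
        """Policy stream: add or remove a consumer's interest in a sheet."""
        consumers = self._state[update["sheet_id"]]
        if update["action"] == "subscribe":
            consumers.add(update["consumer"])
        else:
            consumers.discard(update["consumer"])
            if not consumers:
                # Drop empty entries so state does not grow without bound.
                del self._state[update["sheet_id"]]

    def process_event(self, event: dict) -> list:
        """Event stream: emit one (consumer, event) pair per interested consumer."""
        consumers = self._state.get(event["sheet_id"], ())
        return [(consumer, event) for consumer in sorted(consumers)]


rdf = DynamicFilter()
rdf.process_policy({"sheet_id": "s1", "consumer": "live-collab", "action": "subscribe"})
out_match = rdf.process_event({"sheet_id": "s1", "event_type": "row_updated"})
out_drop = rdf.process_event({"sheet_id": "s2", "event_type": "row_updated"})
rdf.process_policy({"sheet_id": "s1", "consumer": "live-collab", "action": "unsubscribe"})
out_after = rdf.process_event({"sheet_id": "s1", "event_type": "row_updated"})
```

Because the policy is looked up in local state for the event's own key, an event with no interested consumer is dropped inside the operator instead of being delivered and discarded downstream.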
The RDF application processes two streams, the events themselves and the filter policy updates, so policy changes take effect as they arrive and events are filtered against the current set of active collaborators. For consumers that need all events, Flink's broadcast state feature replicates the necessary policies across tasks while keeping memory overhead low.
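The broadcast case can be sketched the same way. This is again a plain-Python simulation under stated assumptions, not the Flink broadcast state API: each `FilterTask` stands for one parallel task, and the consumer name, event types, and CRC-based partitioning are hypothetical.

```python
import zlib


class FilterTask:
    """One parallel task holding its own full copy of the broadcast policies."""

    def __init__(self):
        self.broadcast_state = {}  # consumer -> accepted event types

    def process_broadcast(self, policy: dict) -> None:
        self.broadcast_state[policy["consumer"]] = set(policy["event_types"])

    def process_event(self, event: dict) -> list:
        return sorted(
            consumer
            for consumer, types in self.broadcast_state.items()
            if event["event_type"] in types
        )


tasks = [FilterTask() for _ in range(3)]


def broadcast(policy: dict) -> None:
    """Deliver a policy update to every task, as a broadcast stream would."""
    for task in tasks:
        task.process_broadcast(policy)


def route(event: dict) -> list:
    """Send an event to exactly one task, chosen by a stable hash of its key."""
    index = zlib.crc32(event["sheet_id"].encode()) % len(tasks)
    return tasks[index].process_event(event)


broadcast({"consumer": "audit-log", "event_types": ["row_updated", "row_deleted"]})
results = [route({"sheet_id": s, "event_type": "row_updated"}) for s in ("s1", "s2", "s3")]
ignored = route({"sheet_id": "s1", "event_type": "comment_added"})
```

Whichever task an event lands on, the policy is already there, so a consumer that is interested in every sheet needs one policy entry per task instead of one per key.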
Immediate Impact and Benefits
With the RDF system, filter policy changes take effect within one second, and because policies are held in Flink state, events no longer require a costly DynamoDB lookup each. This transformation has led to significant improvements:
- Reduction of over $40,000 in monthly messaging costs.
- 1.8x improvement in live collaboration latency.
- A scalable platform adopted by multiple internal teams.
Future Expansion and Applications
Initially adopted by the live collaboration team, the RDF architecture is now being expanded to support workflow automation and notification routing. Smartsheet is also exploring automatic scaling policies to optimize costs during off-peak hours.
Conclusion
Smartsheet's implementation of the Real-time Dynamic Filtering system shows how moving filtering into the stream processing layer can yield substantial cost savings and performance improvements. Organizations facing similar event processing challenges may find this approach useful for reducing costs and improving real-time capabilities.