October 26, 2022

How we deliver complex model insights in under 100ms

Julien Van Beveren
Full Stack Developer

We at tekst.ai are training custom AI models that automatically detect the topic, sentiment and language of incoming customer support tickets. Using this information, we can predict the category and the right support agent for a given ticket and route it directly to them, saving companies a lot of money.

Whilst routing tickets, we collect large amounts of data to track the performance of our model. This data is stored in an Amazon Timestream database, and we needed an efficient, fast and cheap way to generate analytics from it and show them to our end users.

Although generating the analytics from our raw data takes up to 5 seconds, we managed to deliver them in less than 100ms while keeping costs low. This is how we did it!

Receiving analytics on the front-end consists of multiple steps, but our biggest time saver is caching. We cache in two places: on the front-end as well as on the backend.

After the first time a user visits their insights, all the analytics are cached in localStorage, allowing us to render a slightly outdated version of the stats within 20ms while requesting fresher analytics in the background.

But it doesn't end there. On the backend we never wait for the analytics generation to finish before returning stats. This sounds counterintuitive, but in practice we use two endpoints for analytics: one that actually returns the analytics and another that generates them. When a request hits /analytics, the endpoint takes an already generated analytics JSON from DynamoDB and checks how old it is; if it is older than the allowed age specified in the request, it triggers a new analytics generation. No matter the age of the data, the endpoint instantly returns whatever it got from DynamoDB, and the regeneration happens in the background. If the data happens to be outdated, we attach a stale flag to it, notifying the front-end that newer data will soon be available in the backend. (A sketch of this endpoint follows the summary below.)
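
Here is a minimal sketch of that front-end flow in TypeScript. The cache key, the /analytics endpoint, the response shape ({ data, stale }) and the render function are illustrative placeholders, not our actual implementation.

```typescript
// Minimal sketch of the front-end caching flow. The cache key, the
// /analytics endpoint, the response shape and renderAnalytics() are
// illustrative placeholders.

type Analytics = Record<string, unknown>;

interface AnalyticsResponse {
  data: Analytics;
  stale?: boolean; // set by the backend when the returned data is outdated
}

const CACHE_KEY = "analytics-cache";

function renderAnalytics(data: Analytics): void {
  // Replace with the real rendering logic (charts, tables, ...).
  console.log("rendering analytics", data);
}

async function loadAnalytics(maxAgeSeconds = 3600): Promise<void> {
  // 1. Immediately render whatever is already in localStorage (~20ms).
  const cached = localStorage.getItem(CACHE_KEY);
  if (cached) {
    renderAnalytics(JSON.parse(cached) as Analytics);
  }

  // 2. In the background, ask the backend for (possibly newer) analytics.
  const res = await fetch(`/analytics?maxAge=${maxAgeSeconds}`);
  const body = (await res.json()) as AnalyticsResponse;

  // 3. Update the cache and re-render with the fresher data.
  localStorage.setItem(CACHE_KEY, JSON.stringify(body.data));
  renderAnalytics(body.data);

  // 4. If the backend flagged the data as stale, it has already kicked off
  //    a regeneration; poll again after a short delay to pick it up.
  if (body.stale) {
    setTimeout(() => void loadAnalytics(maxAgeSeconds), 5_000);
  }
}
```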

Here is a quick summary of our approach:

  • cache analytics in front-end localStorage (20ms response time)
  • cache analytics in backend DynamoDB (100ms response time)
  • always return the analytics you have, but refetch in the background if they are outdated
  • only request new data once a certain age has passed
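
To make the backend side concrete, here is a rough sketch of what the /analytics endpoint could look like, assuming DynamoDB stores one pre-generated analytics item per tenant with a generatedAt timestamp. The table name, item shape and the triggerGeneration() helper are assumptions for illustration, not our actual code.

```typescript
// Sketch of the /analytics handler: return cached analytics from DynamoDB
// instantly and only trigger regeneration (without awaiting it) when the
// data is older than the caller allows.

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Kicks off the expensive (~5s) generation, e.g. by calling the second
// endpoint or pushing a message onto a queue. Deliberately not awaited below.
// The URL is a placeholder.
async function triggerGeneration(tenantId: string): Promise<void> {
  await fetch(`https://internal.example/generate-analytics?tenant=${tenantId}`, {
    method: "POST",
  });
}

export async function getAnalytics(tenantId: string, maxAgeSeconds: number) {
  const { Item } = await ddb.send(
    new GetCommand({
      TableName: "analytics-cache",
      Key: { tenantId },
    })
  );

  if (!Item) {
    // Nothing cached yet: start a generation and tell the caller to retry.
    triggerGeneration(tenantId).catch(console.error);
    return { data: null, stale: true };
  }

  // generatedAt is assumed to be stored as epoch milliseconds.
  const ageSeconds = (Date.now() - Item.generatedAt) / 1000;
  const stale = ageSeconds > maxAgeSeconds;

  if (stale) {
    // Fire-and-forget: the caller still gets the old data instantly.
    triggerGeneration(tenantId).catch(console.error);
  }

  return { data: Item.analytics, stale };
}
```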

Another approach you could take is to mirror the expensive Timestream data in an S3 bucket, saving only the data needed for the analytics (a fraction of the total). That way every piece of data only has to be queried from Timestream once, and further requests can be served from cheap S3. This could easily be done with a cron job plus triggers.
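
A rough sketch of that alternative, assuming a scheduled job (for example an EventBridge-triggered Lambda) that queries Timestream once a day and writes only the needed columns to S3; the query, bucket and key names are placeholders.

```typescript
// Sketch of the mirroring job: query Timestream once and store only the
// columns the analytics need in S3. Query, bucket and key are placeholders;
// result pagination is omitted for brevity.

import { TimestreamQueryClient, QueryCommand } from "@aws-sdk/client-timestream-query";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const timestream = new TimestreamQueryClient({});
const s3 = new S3Client({});

export async function mirrorAnalyticsData(): Promise<void> {
  // Only pull the columns the analytics actually need.
  const result = await timestream.send(
    new QueryCommand({
      QueryString:
        'SELECT time, topic, sentiment FROM "tickets"."routing" WHERE time > ago(1d)',
    })
  );

  // Store the (much smaller) result set in cheap S3 for further requests.
  await s3.send(
    new PutObjectCommand({
      Bucket: "analytics-mirror",
      Key: `daily/${new Date().toISOString().slice(0, 10)}.json`,
      Body: JSON.stringify(result.Rows ?? []),
      ContentType: "application/json",
    })
  );
}
```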
