Search Auto-Complete

The search auto-complete system is composed of the pieces below; each component is covered in more detail in its own section:

  • Requirements: The requirements the service must satisfy
  • Storage Estimate: A back-of-the-envelope estimate of the storage the system needs
  • Search Logs: Search logs from applications, recording search words and times
  • Log Aggregator: Takes the search logs, aggregates them into counts by time duration, and stores the results in a database
  • Log Aggregated Data Store: Storage for the aggregated log data
  • Shard Manager: Distributes shards based on query frequencies
  • Trie Builders: Take data from the aggregated data store, merge it with real-time data, and construct a trie, which they write into an in-memory distributed DB as binary data
  • Search Nodes: Load the binary trie data from the in-memory distributed DB into process memory and serve query requests
  • Query Web Server: Uses ZooKeeper to find the search nodes responsible for a query and round-robins requests between them
  • Periodic Processing: Periodically rebuilds the trie, merging real-time query data with historical data
  • Weights for Query Words: Describes how weights are assigned to queries based on frequency and time
  • Memory Usage: Discusses memory usage based on the design
  • Trie Data Structure: Discusses the trie data structure used for search auto-complete
  • Pros and Cons of this design
  • References

Requirements

Build a system that provides search auto-complete suggestions. The system should take a near-real-time approach to building the data model that serves the suggestions. It should respond with latency under 100ms and handle up to 10K concurrent auto-complete requests.

Storage Estimate

Let's assume that we have 100M queries daily and that each search query contains around 25 bytes of data. This amounts to 2.5GB of data per day. Since the daily volume is fairly low, we can choose to store a month of data: about 75GB of disk space, or roughly 225GB once replicated 3x.
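The monthly figure follows from simple arithmetic:

```
100M queries/day x 25 bytes ≈ 2.5 GB/day
2.5 GB/day x 30 days        ≈ 75 GB/month  (single copy)
75 GB x 3 replicas          ≈ 225 GB/month (replicated)
```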

We will utilize a distributed in-memory database to store real-time stream logs. The in-memory space needed to save the real-time logs is less than 1GB (3GB replicated 3x). See the Memory Usage section below for more detail.

We will also store the binary representation of the current trie in a distributed in-memory DB, with a physical key-value data store behind it. As we will see in the design below, each trie is sized to occupy less than 1GB, so even if we create 26 trie instances, they will require less than 32GB of memory and 32GB of disk space (replicated 3x, total memory will be less than 100GB, and similarly total disk space will be less than 100GB).

Search Logs

Applications processing search requests will write search words and times to a log, which we will use to build our search auto-complete system. The search logs take the form below:

Query Word           Time
Leet code            2021-04-20 10:01:32
Chat System design   2021-04-20 10:05:50
Weather              2021-04-20 11:01:12

Log Aggregator

The log aggregator aggregates the logs into buckets based on a time duration (e.g., a day) and writes the aggregated frequencies for each search word into a database table.
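A minimal sketch of this step; the LogEntry shape and the composite (word, day) key are assumptions for illustration:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogAggregator {

    record LogEntry(String queryWord, LocalDateTime time) {}

    /** Buckets raw search-log entries into (word, day) -> frequency counts. */
    static Map<String, Long> aggregateByDay(List<LogEntry> entries) {
        Map<String, Long> counts = new HashMap<>();
        for (LogEntry e : entries) {
            LocalDate day = e.time().toLocalDate();
            String key = e.queryWord() + "|" + day; // composite (word, day) key
            counts.merge(key, 1L, Long::sum);
        }
        return counts; // a real implementation would write these rows to the DB
    }
}
```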

Log Aggregated Data Store

We can store this data in an RDBMS or in a NoSQL database such as Cassandra.

Query Word   Day          Frequency
Chat         2021-04-19   4000
Chat         2021-04-20   5500
Green        2021-04-18   100000
Green        2021-04-19   202020
Green        2021-04-20   303031

Shard Manager

A naïve but straightforward implementation would be to shard the data per starting character and dedicate each trie builder process to building the trie for one character. A better strategy, however, is to analyze the data distribution from historical queries and construct shards accordingly. For example, if the query frequency for words starting with 'a' equals that of words starting with 's', 't', and 'u' combined, then one shard can be dedicated to 'a' while another covers 's', 't', and 'u'.
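One way to realize this is greedy bin packing of starting characters onto shards by historical frequency. A sketch follows, in which the input map and shard count are illustrative assumptions:

```java
import java.util.*;

public class ShardManager {

    /** Assigns starting characters to shards so total query frequency is balanced. */
    static Map<Character, Integer> assignShards(Map<Character, Long> freqByFirstChar,
                                                int shardCount) {
        // Min-heap of {shardId, currentLoad}, ordered by current total load.
        PriorityQueue<long[]> shards = new PriorityQueue<>(Comparator.comparingLong(s -> s[1]));
        for (int i = 0; i < shardCount; i++) shards.add(new long[] {i, 0});

        Map<Character, Integer> assignment = new HashMap<>();
        freqByFirstChar.entrySet().stream()
            .sorted((a, b) -> Long.compare(b.getValue(), a.getValue())) // heaviest first
            .forEach(e -> {
                long[] lightest = shards.poll();            // shard with the least load
                assignment.put(e.getKey(), (int) lightest[0]);
                lightest[1] += e.getValue();
                shards.add(lightest);
            });
        return assignment; // e.g. 'a' alone on one shard, 's','t','u' together on another
    }
}
```

Placing the heaviest characters first keeps the resulting shard loads close to even.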

Trie Builder

The trie builder takes historical data from the log aggregated data store and real-time data from the Redis in-memory cache, merges the two, and constructs one or more tries for the starting characters it is responsible for. For more on handling and merging real-time data, please see the section on real-time data processing below. The trie is written to a write-through cache backed by a distributed in-memory DB such as Redis, with a physical database such as Cassandra behind it acting as a document store. Once the trie is built (a minimal publish sketch follows the list below):

  • The process registers itself with ZooKeeper to indicate that it is now responsible for processing queries for that character.
  • The process also sends a notification message to the search nodes previously registered with ZooKeeper as providers of query results for that character, indicating that a new trie has been built. See more on coordination between the trie builder and search nodes below.
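A minimal sketch of the publish step, assuming the Jedis Redis client and Apache Curator for ZooKeeper; the Redis key and znode path are made up for illustration:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;
import redis.clients.jedis.Jedis;

public class TrieBuilderPublisher {

    void publish(byte[] serializedTrie, char shardChar,
                 Jedis redis, CuratorFramework zk) throws Exception {
        // 1. Write the freshly built trie into the write-through in-memory store.
        redis.set(("trie:" + shardChar).getBytes(), serializedTrie);

        // 2. Register this process with ZooKeeper as a search node for the shard.
        //    An ephemeral node auto-unregisters if the process dies.
        zk.create().creatingParentsIfNeeded()
          .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
          .forPath("/autocomplete/search-nodes/" + shardChar + "/node-");

        // 3. A "new trie built" message would then go on the shard's queue so
        //    existing search nodes roll over one at a time (see Search Nodes).
    }
}
```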

Search Nodes

In the trie builder section above, we saw that a trie builder builds a trie and then registers itself as a search node. However, before the new trie was built, other search nodes were already serving results for the same characters, and the trie builder notifies them to load the new trie from the data store. This notification is transmitted on a queue with a counter starting at 0. Because the message travels on a queue, only one search node picks it up at a time. That node unregisters itself from ZooKeeper, loads the new trie from Redis, registers itself back with ZooKeeper, and then retransmits the message on the queue with the counter incremented. The last node to receive the message (it knows it is last when the counter reaches a threshold) unregisters itself as a search node and changes roles to become the next trie builder. This design lets the search nodes gradually roll over to the new trie and thereby continue to provide query results without interruption. See more about real-time trie building in the real-time section below.
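The hand-off can be sketched as below; the message format, node-count threshold, and the small interfaces standing in for the queue, Redis, and ZooKeeper are assumptions:

```java
public class RolloverHandler {

    interface Registry { void unregister(); void register(); }
    interface TrieStore { byte[] loadLatestTrie(); }
    interface Queue { void send(int count); }

    void onRolloverMessage(int count, int nodeThreshold,
                           Registry zk, TrieStore redis, Queue queue) {
        if (count >= nodeThreshold) {
            // Last node: step out of serving and become the next trie builder.
            zk.unregister();
            becomeTrieBuilder();
            return;
        }
        zk.unregister();                       // stop taking queries for this shard
        byte[] trie = redis.loadLatestTrie();  // fetch the newly built trie
        installTrie(trie);                     // swap it into process memory
        zk.register();                         // resume serving with the new trie
        queue.send(count + 1);                 // pass the baton to the next node
    }

    void becomeTrieBuilder() { /* switch roles */ }
    void installTrie(byte[] trie) { /* deserialize into process memory */ }
}
```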

Query Web Server

The query web server uses ZooKeeper to find the search nodes registered to process queries for words starting with a given character and round-robins search queries between them.
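A sketch of the node selection, with a NodeDirectory interface standing in for the ZooKeeper lookup:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class QueryRouter {

    interface NodeDirectory { List<String> nodesFor(char firstChar); }

    private final AtomicLong counter = new AtomicLong();

    /** Picks the next search node for this query in round-robin order. */
    String pickNode(String query, NodeDirectory zk) {
        List<String> nodes = zk.nodesFor(Character.toLowerCase(query.charAt(0)));
        int idx = (int) (counter.getAndIncrement() % nodes.size());
        return nodes.get(idx); // the request is then forwarded to this node
    }
}
```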

Periodic Processing of Real-Time Data

Deleting stale data

Notice above that the log data store records word frequencies by day. This can be used to delete data older than a chosen cutoff date. For example, if we are interested in keeping query aggregations for only the last two weeks, a job that runs once a day deletes data older than two weeks.
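A sketch of such a retention job; the AggregateDao interface is an assumption standing in for the data-store access layer:

```java
import java.time.LocalDate;

public class RetentionJob {

    interface AggregateDao { void deleteOlderThan(LocalDate cutoff); }

    /** Runs once a day; keeps only the last `retentionDays` of (word, day) rows. */
    void runDaily(AggregateDao dao, int retentionDays) {
        dao.deleteOlderThan(LocalDate.now().minusDays(retentionDays));
    }
}
```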

Streaming queries

As clients issue queries, each query word and its time are streamed over a transport such as Kafka, so that fresh queries can be accounted for in trie construction and thus in the query search results.
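A minimal producer sketch using the standard Kafka client; the topic name "search-queries" and the record layout are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QueryStreamer {

    private final KafkaProducer<String, String> producer;

    QueryStreamer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    void record(String queryWord, long epochMillis) {
        // Key by query word so counts for the same word land on one partition.
        producer.send(new ProducerRecord<>("search-queries", queryWord,
                                           queryWord + "," + epochMillis));
    }
}
```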

Real-time processor

The real-time processor takes the streaming queries and periodically writes them to an in-memory distributed data store such as Redis so that the trie builder can include the recent queries.
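Sketched below with the Jedis client, assuming one Redis hash per word with "current" and "prior" fields; the key and field names are illustrative and match the table in the next section:

```java
import java.util.Map;
import redis.clients.jedis.Jedis;

public class RealTimeProcessor {

    /** Flushes counts accumulated from the stream since the last flush. */
    void flush(Map<String, Long> recentCounts, Jedis redis) {
        for (Map.Entry<String, Long> e : recentCounts.entrySet()) {
            // Increment the "current count" for each word seen since last flush.
            redis.hincrBy("rt:" + e.getKey(), "current", e.getValue());
        }
    }
}
```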

Real-time distributed in-memory database

The real-time distributed in-memory database is a sharded data store containing the table below.

Word    Current Count   Prior Count
Mango   20              5000
Train   10              700

The prior count holds the number of queries since the start of the data stream that have not yet been accounted for in the log aggregated data store. The current count holds new queries for a word since the trie covering that word was last built. When the trie builder builds a trie for a given set of words, it moves the current count into the prior count, adding the current count to whatever value, if any, already existed in the prior count, and resets the current count.
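A sketch of that rotation with Jedis; key and field names are assumptions, and a real implementation would make the read-modify-write atomic, e.g., via a Lua script or MULTI/EXEC:

```java
import redis.clients.jedis.Jedis;

public class CountRotator {

    /** prior += current, then current resets to 0, per the table above. */
    void rotate(String word, Jedis redis) {
        String key = "rt:" + word;
        String current = redis.hget(key, "current");
        long cur = current == null ? 0 : Long.parseLong(current);
        redis.hincrBy(key, "prior", cur); // fold current into prior
        redis.hset(key, "current", "0");  // new queries start accumulating again
    }
}
```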

Periodic Trie Reconstruction

Once a new trie is built, its construction time is registered with a coordination service. After a configured time threshold passes, the service publishes a trie reconstruction message to the next trie builder, which has been waiting for this event.

Weights for query words

Part of the motivation for processing real-time queries is to give recent queries more weight. Since we store word searches per day in the log aggregated data store, we can build an algorithm that gives more recent searches a higher weight. In addition, we have access to the real-time data, which above we split into two blocks: the current count (very recent data) and the prior count (recent data not yet reflected in the aggregated store). We can use this information to further tune word weights based on query times. We can also divide the search counts into more granular time buckets, such as an hour of the day, to tune word weights even more accurately based on recent usage.
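One simple scheme consistent with the above is exponential day-based decay; a sketch, where the decay factor is an arbitrary assumption to be tuned:

```java
public class QueryWeights {

    /**
     * countsByAge[i] = query count i days ago (index 0 = today, including
     * the real-time current/prior counts folded into today's bucket).
     */
    static double weight(long[] countsByAge, double dailyDecay) {
        double w = 0;
        double factor = 1.0;
        for (long count : countsByAge) {
            w += count * factor;   // today's counts weigh fully, older days less
            factor *= dailyDecay;  // e.g. 0.9 => a week-old count weighs ~0.48
        }
        return w;
    }
}
```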

Memory usage

Storing one or more tries entirely in memory may seem very memory intensive, but let's calculate how much memory we will really need. Typically, search queries are not very long; to avoid storing huge tries in memory, we can cap the maximum query length we process at a small number of characters, e.g., 50.

Assume the average query consists of five words, each five characters long, so each query amounts to 25 bytes of data. In addition, assume we have 10M users issuing ten queries per day, i.e., 100M queries a day, and that 20% of the daily queries consist of new words. That amounts to 0.5GB of daily new query data that needs to be saved, out of roughly 2.5GB of total daily data (0.5GB being 20% of the total). If we divide the nodes keeping the trie cache in memory evenly across starting characters, the data can easily fit in process memory. If the data doesn't fit into a single process's memory, we can further divide the trie shards based on the first two characters instead of one, which significantly reduces the amount of data each process must hold.

Trie Data Structure

See more about the trie data structure on Wikipedia here: https://en.wikipedia.org/wiki/Trie

See my implementation of a trie data structure for search queries on GitHub here: https://github.com/zcoderz/leetcode/blob/main/src/main/java/trie/AutoCompleteSystem.java

My implementation above stores, for each node in the trie, a "tree set" containing the words reachable from that node, sorted by their frequency. Allocating a tree set per trie node is a very memory-heavy approach.

Alternatively, we could search the subtree under the prefix to find all words and then sort them to find the top K. However, that costs O(p) to traverse the prefix, plus O(c) to traverse the child subtree, plus O(N log N) to sort the N words found in the subtree. That is a fairly CPU-heavy operation, and since we are interested in returning search results very quickly, it will not work.

An alternative approach is, during construction of the trie, to walk the child nodes and calculate the top K words per node in a bottom-up fashion, storing those words at each node in a pre-sorted array. While this costs a lot of extra memory, the advantage is that query results can be returned in almost O(1) time.
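A sketch of that bottom-up precomputation, simplified to lowercase a-z words with frequencies stored on their terminal nodes (K = 5 is an arbitrary choice):

```java
import java.util.ArrayList;
import java.util.List;

public class TopKTrie {

    static final int K = 5;

    static class Node {
        Node[] children = new Node[26];
        String word;            // non-null if a word ends at this node
        long frequency;
        List<Node> topK = new ArrayList<>(); // best suggestions under this node
    }

    /** Post-order walk: merge children's top-K lists plus this node's own word. */
    static void computeTopK(Node node) {
        List<Node> candidates = new ArrayList<>();
        if (node.word != null) candidates.add(node);
        for (Node child : node.children) {
            if (child == null) continue;
            computeTopK(child);               // children are finished first
            candidates.addAll(child.topK);
        }
        candidates.sort((a, b) -> Long.compare(b.frequency, a.frequency));
        node.topK = new ArrayList<>(candidates.subList(0, Math.min(K, candidates.size())));
    }
}
```

After this pass, answering a query is just walking the prefix and returning the node's precomputed, already-sorted list.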

Pros & Cons of this design

  • Pros:
    • The system is built via a series of services where each service is small and simple
    • The design incorporates processing real-time query updates
    • Real-time processing leverages the same layers as batch processing to build and serve queries, so there is no code duplication between the batch and real-time paths
    • Design is scalable and resilient
    • We can process requests very efficiently due to our trie design discussed above
  • Cons:
    • Trie builders and search nodes utilize a large amount of in-process memory to return query search results quickly
    • The system is technically not real-time; it processes new queries at a periodic interval
    • The trie construction process is slow

References

  • Prefixy Team, "How We Built Prefixy: A Scalable Prefix Search Service for Powering Autocomplete": https://medium.com/@prefixyteam/how-we-built-prefixy-a-scalable-prefix-search-service-for-powering-autocomplete-c20f98e2eff1
  • Alex Xu, System Design Interview – An Insider's Guide