Google - Design Search Autocomplete System
1. Think Process
1.1. Problem Goal
The autocomplete system must:
-
Quickly find all sentences that start with a user’s current prefix (like “i l” → “i love”, “i like”, …)
-
Rank them by frequency (most used first), and lexicographically for ties.
So we need both:
-
Fast prefix lookup
-
Efficient ranking by frequency
1.2. Data structure choice
1.2.1. Trie
Root
├── 'i' → "i"
│ ├── ' ' → "i "
│ │ ├── 'l' → "i l"
│ │ │ ├── 'o' → "i lo"
1.2.2. Min-Heap at Each Node
Each Trie node stores a min-heap (priority queue) of size ≤ k, containing the top k most frequent sentences passing through that node.
Each heap entry = (-freq, sentence)
-
-freq → using negative frequency so the smallest heap value is least frequent.
-
This ensures the heap root is the lowest-ranked among top k, making it easy to maintain.
1.3. Operations
1.3.1. Insertion (Build/Update)
For every sentence:
-
Walk through the Trie for each character in the sentence.
-
Update frequency count in a global dictionary: count[sentence] += 1
For every Trie node along the path:
-
Push (-freq, sentence) into its heap.
-
If heap size > k → pop the smallest (to keep only top k).
Complexity:
- Each character processed once → O(L log k) per sentence.
1.3.2. Query (User Typing)
-
As user types each character, move through the Trie.
-
If at any point a node doesn’t exist → return empty list.
-
When reaching the node for current prefix:
- Retrieve its heap.
- Convert heap to a sorted list (descending frequency, lexicographically).
-
Complexity: O(L) to traverse + O(k log k) to sort small heap.
2. Code Walk-through
import heapq
class TrieNode:
def __init__(self):
self.children = {}
self.heap = [] # stores (-freq, sentence)
class AutocompleteSystem:
def __init__(self, sentences, times, k=3):
self.root = TrieNode()
self.k = k
self.count = {} # global frequency map
self.prefix = ""
self.curr = self.root
for s, t in zip(sentences, times):
self.count[s] = t
self.insert(s, t)
def insert(self, sentence, freq):
node = self.root
for c in sentence:
if c not in node.children:
node.children[c] = TrieNode()
node = node.children[c]
# push into node heap
heapq.heappush(node.heap, (-freq, sentence))
# trim to top k
if len(node.heap) > self.k:
heapq.heappop(node.heap)
def input(self, c):
# Case 1: when user finishes a sentence
if c == '#':
self.count[self.prefix] = self.count.get(self.prefix, 0) + 1
self.insert(self.prefix, self.count[self.prefix])
self.prefix = ""
self.curr = self.root
return []
self.prefix += c
# Case 2: node do not exists
if c not in self.curr.children:
self.curr.children[c] = TrieNode()
self.curr = self.curr.children[c]
return []
# Case 3: Happy case return top k items.
self.curr = self.curr.children[c]
return self._get_top(self.curr)
def _get_top(self, node):
# Convert heap to sorted list
result = sorted(node.heap, key=lambda x: (x[0], x[1])) # because freq is negative
return [s for _, s in result[::-1]] # reverse for descending frequency
-
Code first.
-
Explain the logic later.
- Time Complexity:
- Insert: O(L log k)
- Search: O(L + k log k)
- Space Complexity:
- Insert: O(total_characters + total_heaps)
- Search: O(L + k)