4968 views
# Word suggestion algorithm improvements **Status**: Draft | 2025 **Author**: Victor Ma ## Terminology | Term | Definition | | -------------------------- | ---------- | | Current cell | The cell that the cursor is on. | Current slot | The slot that the cursor is on. | Current filter | The filter for the current slot. | Secondary slot | The slot that intersects the current slot, at the current cell. | Secondary filter | The filter for the secondary slot. | Intersecting slot | A slot that intersects a given slot. | Intersecting slots | All the slots that intersect a given slot. | Intersecting filters | All the filters for the intersecting slots of a given slot. | Word suggestion list | The UI element that shows a list of words that fit the current slot. | Word suggestion algorithm | An algorithm that generates word suggestions for the word suggestions list. | Dead end | An unfillable grid. | Dead end word | A word that seemingly fits in a given slot, but that eventually leads to a dead end. | Grid fill | A one-to-one mapping from the slots of a grid to words. ![Word suggestion list](word-suggestion-list.png)<br> *A screenshot that shows the word suggestion list.* ## Our current algorithm At a high level, our current word suggestion algorithm works like this, where $n$ is the offset of the current cell in the current slot, and $m$ is the offset of the current cell in the secondary slot: 1. **(Phase 1)** Get the set of possible letters for the current cell, constrained by the current filter. 1. Get the set of words that match the current filter. 1. Return the set of letters that appear as the $n^{\text{th}}$ letter, in the set of words. 1. **(Phase 2)** Get the set of possible words for the secondary slot, and the set of possible letters for the current cell---both constrained by the secondary filter and the current filter. 1. Get the set of words that satisfy both these constraints: * The word matches the secondary filter. * The $m^{\text{th}}$ letter of the word belongs to the set of letters from Step 1. 1. Get the set of letters that appear as the $m^{\text{th}}$ letter, in the set of words. 1. Return the set of words and the set of letters. 1. **(Phase 3)** Get the set of possible words for the current slot, constrained by the current filter and secondary filter. 1. Get the set of words that satisfy both these constraints: * The word matches the primary filter. * The $n^{\text{th}}$ letter of the word belongs to the set of letters from Step 2. 1. Return the set of words. 1. **(Return)** Return the set of words from Phase 2 and Phase 3. The [`word_list_find_intersection ()`](https://gitlab.gnome.org/jrb/crosswords/-/blob/0.3.15/src/word-list.c?ref_type=tags#L1372) function implements our current word suggestion algorithm. ### Limitations Our word suggestion algorithm only considers two constraints: * The filter for the current slot. * The filter for the secondary slot. This means that the algorithm doesn't know about any of the other intersecting slots---or what constraints they impose on the current slot. For example, consider the following grid: ``` +---+---+---+---+ | | | | Z | +---+---+---+---+ | | | | E | +---+---+---+---+ | | | | R | +---+---+---+---+ | W | O | R | | < current slot +---+---+---+---+ ``` The 4-Down slot begins with *ZER*, so the only word it can be is *ZERO*. This means that the cell in the bottom-right corner must be the letter *O*. The 4-Across slot starts with *WOR*. We know that the bottom-right cell is the letter *O*, so 4-Across must be *WORO*. But *WORO* is not a word. So both 4-Down and 4-Across are unfillable. Suppose the current slot is 4-Across. If the cursor is on the bottom-right cell, then our word suggestion algorithm correctly realizes that the two slots are unfillable. But if the cursor is on one of the other cells in 4-Across. Then, the algorithm has no idea about 4-Down and the constraint it imposes. So, the algorithm returns all words that match the filter *WOR?*, like *WORD* and *WORM*---even though these aren't valid word suggestions. ### Consequences There are two undesirable consequences of this. First, our algorithm can generate dead end words. These can be frustrating for the user. From their perspective: 1. They enter a word from the word suggestion list. 1. They continue working on the grid, building off of the initial word. 1. Eventually, they reach a dead end. 1. They have to undo all the work they did since picking the word. If they're lucky, maybe they can find replacements for some of the words and not have to erase everything. And second, our algorithm's output varies based on the current cell. This is semantically incorrect, because a word suggestion algorithm should operate on a slot level. The current cell should not change the output. And so, we should change our algorithm to fix these problems. ## Lookahead-based algorithm The solution is to add *lookahead* to our word suggestion algorithm. With lookahead, our algorithm *looks ahead* at the intersecting filters of the current slot. And it takes those filters into account when generating the word suggestions. Here's what our word suggestion algorithm looks like with lookahead: 1. Iterate through the intersecting filters of the current slot. * For each intersecting filter, determine the constraint it imposes on the current slot. 1. Return the set of words that meet these constraints: * The word meets the constraints imposed by the intersecting filters. * The word matches the current filter. Consider the example from before: ``` +---+---+---+---+ | | | | Z | +---+---+---+---+ | | | | E | +---+---+---+---+ | | | | R | +---+---+---+---+ | W | O | R | | < current slot +---+---+---+---+ ^ current cell ``` With lookahead, our word fill algorithm checks the filter for 4-Down, even if the cursor isn't on the bottom-right corner. The algorithm sees that the bottom-right cell can only be the letter *O*, and so the current filter must be `WORO`. This doesn't match any word. So, the algorithm returns the empty set for 4-Across. ### Problem solved? This approach solves the problem of our word suggestion algorithm producing different outputs for the same slot, based on the current cell. Now, the current cell has no impact on the output. This approach also significantly reduces the amount of dead end words. However, it does not prevent them altogether. ### Still broken Consider this grid: ``` +---+---+---+---+ | | | | N | +---+---+---+---+ | Q | U | I | | +---+---+---+---+ | | | | X | +---+---+---+---+ | W | E | S | | < current slot +---+---+---+---+ ``` The 2-Across slot starts with *QUI*, so the only word it can be is *QUIZ*. So, the filter for 4-Down is *NZX?*. However, no such word exists. So, 4-Down is unfillable. This, in turn, means that 4-Across is also unfillable. Now, suppose that the cursor is on 4-Across. Then, our algorithm with lookahead (as currently defined) does not realize that the current slot is unfillable. Our algorithm sees that 4-Down has the filter *N?X?*. But it does not see that 2-Across forces 4-Down to be *NZX?*. So, the algorithm thinks that 4-Down can be *NEXT*, and that 4-Across can be *WEST*---which is incorrect. This failure is because our algorithm checks the intersecting filters of the current slot---but it does not check the intersecting filters of the intersecting slots of the current slot. In other words, it only uses a single level of lookahead. ### Additional levels of lookahead To handle this grid appropriately, we need to add another level of lookahead. To add additional levels of lookahead, we recursively perform the lookahead process. For each additional level of lookahead we want, we add another level of recursion. To perform the lookahead process recursively, we visit all the intersecting slots of all the slots from the previous level of lookahead. And for each slot we visit, we take their filters into account when calculating the filters for the previous level of lookahead. | Level | Constraints considered | | ----- | ---------------------- | | 0 | Current filter. | 1 | Current filter, intersecting filters of the current slot. | 2 | Current filter, intersecting filters of the current slot, intersecting filters of the intersecting slots of the current slot. | ... | Our word suggestion algorithm with two levels of lookahead does handle this example grid properly. But it's clear that the algorithm can always generate a dead end word, given a worst-case grid---unless the number of levels is high enough that the algorithm visits every slot on the grid. ### Unsolveable? Consider the following example. Every slot on this grid is unfillable. This is because 4-Across (*ZXCVBN?*) is unfillable, and the "unfillability" propagates to every other slot. ``` +---+---+---+---+---+---+---+ | | U | A | L | I | T | | +---+---+---+---+---+---+---+ | U | ■ | ■ | ■ | ■ | ■ | O | +---+---+---+---+---+---+---+ | I | ■ | A | X | | ■ | G | +---+---+---+---+---+---+---+ | E | ■ | ■ | ■ | B | ■ | U | +---+---+---+---+---+---+---+ | | H | U | M | | ■ | R | +---+---+---+---+---+---+---+ | ■ | ■ | ■ | ■ | ■ | ■ | T | +---+---+---+---+---+---+---+ | Z | X | C | V | B | N | | +---+---+---+---+---+---+---+ ``` Now, suppose our cursor is on 2-Across (*AX?*). Then, our algorithm needs at least six levels of lookahead to properly detect that the grid is unfillable. Otherwise, it thinks that the grid can be filled with *AXE*, *EBB*, *THUMB*, *QUIET*, *QUALITY*, and *YOGURTS* (starting with 2-Across and spiralling out). However, it is not practical to add more than one or two levels of lookahead. The number of intersections that the algorithm has to check is roughly $s^n$, where $s$ is the average slot size, and $n$ is the number of levels. So realistically, the limit is one or two levels. This means that we cannot use the lookahead approach to check every slot in the grid and completely prevent dead end words. ## AC-3-based algorithm In order for our word suggestion algorithm to visit all the slots on the grid---and thus eliminate the possibility of dead end words---we need to use a completely different algorithm. For this, we look to the field of [constraint satisfaction problems (CSPs)](https://en.wikipedia.org/wiki/Constraint_satisfaction_problem). ### Express grid filling as a CSP For a given grid and word list, we can express the problem, *Find a valid grid fill for this grid*, as a CSP. What follows is one possible CSP model, though [other models are possible](https://cs.uwaterloo.ca/~vanbeek/Publications/cai01a.pdf). Let $V$ be the set of variables for the CSP, and $$ V = \{a_1, a_2, \dots, a_m\} \cup \{d_1, d_2, \dots, d_n\}, $$ where $a_i$ is the word in the $i$-Across slot, and $d_j$ is the word in the $j$-Down slot. So there is one variable for each slot, and each variable is set to the word chosen for that slot. Let $D$ be the set of domains for the CSP, and $$ D = \{D_v \mid v \in V\}. $$ Let $F_v$ be the set of words that match the filter for the slot that $v$ represents, and that are in the word list. Then, $$ D_v = F_v. $$ In other words, the domain of a variable is the set of word-list words that match the variable's slot's filter. Let $I$ be the set of intersection constraints for the CSP, and $$ I = \{I_{v_1v_2} \mid \text{The slots that $v_1$ and $v_2$ represent intersect each other.}\}, $$ where $$ I_{v_1v_2} = \text{$v_1$ and $v_2$ have the same letter at the offsets where their slots intersect} $$ Let $Q$ be the set of unique word constraints, and $$ Q = \{Q_{v_1v_2} \mid (|v_1| = |v_2|) \land (v_1 \neq v_2) \}, $$ where $$ Q_{v_1v_2} = v_1 \neq v_2. $$ Let $C$ be the set of constraints for the CSP, and $$ C = I \cup Q. $$ ### Solution to the CSP A solution to this CSP is a valid grid fill for the given grid and word list. Suppose the CSP has a standard word list, and the following grid: ``` +---+---+---+---+ | ■ | ■ | ■ | N | +---+---+---+---+ | T | I | M | | +---+---+---+---+ | ■ | ■ | ■ | X | +---+---+---+---+ | W | E | S | | +---+---+---+---+ ``` Then, the unique solution is \begin{align} a_1 &= \text{TIME}\\ a_2 &= \text{WEST}\\ d_1 &= \text{NEXT}.\\ \end{align} ### Arc consistency However, our goal is not to find a solution to the CSP. Our goal is to get the set of possible words for a slot. To do this, we make every variable *[arc consistent](https://en.wikipedia.org/wiki/Local_consistency#Arc_consistency)* with every other variable. To do this, we can use the [AC-3 algorithm](https://en.wikipedia.org/wiki/AC-3_algorithm). ### Resources #### Learning resources Resources for learning about CSPs and the AC-3 algorithm: * [UIUC lecture slides](https://courses.grainger.illinois.edu/ece448/sp2022/slides/lec18.pdf) * [FIT lecture slides](https://cs.fit.edu/~dmitra/ArtInt/lectures/constraint.pdf) * [UWaterloo lecture slides](https://cs.uwaterloo.ca/~jhoey/teaching/cs486/lecture4-nup.pdf) * [Textbook chapters on arc consistency](https://www.sciencedirect.com/topics/computer-science/arc-consistency) #### Other resources Miscellaneous resources related to CSPs and crosswords: * [Paper that compares crossword CSP models](https://cs.uwaterloo.ca/~vanbeek/Publications/cai01a.pdf) * [Paper on solving crossword CSP](https://web.archive.org/web/20230202063946/https://web.stanford.edu/~jduchi/projects/crossword_writeup.pdf) * [Crossword construction tool](https://www.crosswordconstruction.com/static/files/poster.pdf) ## Algorithms comparison The following is a comparison between the lookahead-based word suggestion algorithm, and the AC-3-based word suggestion algorithm. Assume that the lookahead-based algorithm uses a single level of lookahead. <!-- | | Lookahead | AC-3 | | - | --------- | ---- | | **Dead end words** | Uncommon. | None. | **Crossword aids** | Possible, but requires conversion to grid-level algorithm. | Fully compatible. | **Benefit to autofill** | Possible, but requires conversion to grid-level algorithm. | Yes. | **Speed** | Fast. | Slow. | **Implementation difficulty** | Simple | Complex. --> ### Dead end words The AC-3-based algorithm eliminates all dead end words. The lookahead-based algorithm does not eliminate all dead end words. However, the lookahead-based algorithm does eliminate most of the dead end words. The reasoning for this is: 1. An algorithm with one or two levels of lookahead accounts for the slots that are closest to the current slot. 1. The slots closest to the current slot impose the most constraints on the current slot. 1. Therefore, an algorithm with one or two levels of lookahead eliminates most of the dead end words. #### Closer slots impose more constraints Closer slots impose more constraints on the current slot, because they are fewer steps removed from the current slot. Suppose $i_1$ is an intersecting slot of the current slot. Then, $i_1$ has a big impact on the current slot, because $i_1$ directly constrains the character set for one of the current slot's cells. Now, suppose $i_2$ is an intersecting slot of $i_1$. Then, the only way that $i_2$ can constrain the current slot is by constraining $i_1$. And the only way it can constrain $i_1$ is by constraining the character set for one of $i_1$'s cells. So, $i_2$ must constrain one of $i_1$'s cells to the point that $i_1$ becomes constrained enough to add an additional constraint on the current slot. So, the impact that a slot has on the current slot grows weaker, the further away it is. This means that an algorithm with one or two levels of lookahead catches most of the dead end words. ### Crossword aids The AC-3-based word suggestion algorithm functions on the grid level. Its input is the entire grid, and its output is the set of possible words for every slot on the grid. This lets us implement some useful crossword aids: | Crossword aid | Description | | ------------- | ----------- | | Fillability indicator | Indicates if the grid is fillable or not. If at least one slot is unfillable, then the grid is unfillable. | Cell constraint heat map | A heat map overlaid on top of the grid that indicates how constrained each cell is. A cell that can be any letter from *A* to *Z* is completely unconstrained. A cell that can only be one specific letter is maximally constrained. | Most-constrained-slot button | A button that moves the cursor to the most constrained slot. The most constrained slot is the slot that has the smallest set of possible words---and that is not fully filled. With the lookahead-based algorithm, it is possible to implement a less accurate version of all these aids. To do this, we need to turn the lookahead-based algorithm into a grid-level algorithm. We can do it like this: ``` For each slot in the grid: Run the lookahead-based algorithm ``` There are two limitations of this: * The lookahead-based algorithm's outputs may contain dead ends. The more dead ends there are, the less accurate the crossword aids become. * Running the lookahead-based algorithm on every cell may be too slow. <!-- ### Benefit to autofill The AC-3-based algorithm can benefit the autofill algorithm by providing it with the possible word sets for every slot. The lookahead-based algorithm can also do this, if we transform it into a grid-level algorithm. ==manual button== ==hybrid (switch automatically when enough filled)== ==use lookahead while ac3 runs== --> ### Speed The lookahead-based algorithm takes [$n$ times longer to run](#Implementation) than our current word suggestion algorithm, where $n$ is the average slot size. Our current algorithm takes less than 16 ms to run (because we are able to hit 60 fps). So, even for the maximum slot size of 21 (extra-large grid), the lookahead-based algorithm runs in under a second. <!-- To estimate the speed of the AC-3-based algorithm, we can look at existing crossword editors that use a grid-level word suggestion algorithm. For these editors, their algorithms' speed looks like this: * If most of the grid is filled, the algorithm runs in under a second ==regen each time?== --> ### Implementation difficulty The lookahead-based algorithm is simple to implement. It builds off of our existing word suggestion algorithm. The AC-3-based algorithm is difficult to implement. It is completely different from our current code. ## Which algorithm? We should implement the lookahead-based algorithm. It significantly reduces dead end words, it runs quickly, and it's easy to implement. <!-- The AC-3-based algorithm has the potential to improve our editor a lot--- ### Both algorithms? button, option --> ## Prior Art The following is true about most crossword editors: * They use a grid-level word suggestion algorithm. * It takes a while for the word suggestion list to populate. * They have a fillability indicator. * They have a cell constraint heat map. * They have a most-constrained-slot button. Open source editors: * [Ingrid's word suggestion algorithm](https://github.com/rf-/ingrid_core/blob/f00da637452fff46e1b6255a29a553aa8f2da5c2/src/backtracking_search.rs#L1-L5) is well-documented, but complex. * Exet's word suggestion algorithm simply looks for words that match the current filter. ## Implementation The following is a possible implementation for the lookahead-based word suggestion algorithm. It runs the intersection function on each cell of the current slot, and then returns the set intersection of all the outputs. ``` lookahead_based_algorithm (current_slot) { final_words_set = NULL; for (cell : current_slot) { if (cell != "") continue; intersecting_slot = get_int_slot (current_slot, cell); words_set = intersection_func (current_slot, intersecting_slot); if (final_words_set == NULL) final_words_set = words_set; else final_words_set = (final_words_set) ∩ (words_set); if (final_words_set == {}) return final_words_set; } return final_words_set; } ``` ### Must be asynchronous This algorithm runs the intersection function $n$ times, where $n$ is the size of the current slot. The intersection function itself takes a few milliseconds to run. This means that we cannot hit our target of 16 ms per frame, if we run the lookahead-based algorithm in the main thread. So, we must implement the algorithm asynchronously. <!-- ### ==only run if less than N open slots to avoid lag on open grid...=== ==future async== autofill csp --> <!--TODO: add section about caching somewhere, probably before the implementation section.--> ## Next steps * [ ] Finish the WIP parts of this design doc. * [ ] Implement lookahead-based algorithm synchronously. * [ ] Discuss and decide next steps: * Implement lookahead-based algorithm asynchronously? * Remove lookahead-based algorithm and implement AC-3-based algorithm? * Hybrid approach? ## Open Questions ## Areas For Improvement