Hello! Today, we are going to solve Minimum Genetic Mutation on leetcode.
This challenge uses some concepts we've already talked in other posts: BFS (Breadth-First Search), brute force, sets, and finding the minimum using BFS.
The Challenge
Given a gene string that is 8 characters long, containing only the letters A, C, G, and T, find the minimum mutations needed to change the initial gene into a target gene.
For example, changing "AACCGGTT" to "AACCGGTA" is one mutation.
One of the constraints of the exercise is the bank. The bank is a list of valid mutations. To go from the start gene to the end gene and count how many mutations are needed, we must ensure every mutation is valid.
Note that the starting point is assumed to be valid, so it might not be included in the bank.
Our first approach
Seems easy, right? We just need to check the difference in both genes. We don't need this bank.
Let's implement this approach:
def count_mutations(str1, str2):
mutations = sum(1 for a, b in zip(str1, str2) if a != b)
return mutations
This code:
- Iterates through both strings using
zip
; - Returns 1 if a char in startGene differs from endGene;
- Sum all the ones.
Try this approach and run the tests on leetcode. It passes!
Now, submit the code. It didn't run, right? You will try to cover edge cases until you have no energy and understand that this challenge is not about differences; it is about paths.
Take a look at one of the examples where our simple code breaks:
startGene = "AACCTTGG"
endGene = "AATTCCGG"
bank = ["AATTCCGG","AACCTGGG","AACCCCGG","AACCTACC"]
Output: 4
Expected: -1
It fails because there is no way to change one letter at a time from start gene to end gene while using only valid mutations.
Let's visualize the problem with a simple diagram:
A better approach
Now we know the bank must be used, let's try a correct approach to solve the problem.
Read the challenge description again. Focus on the highlighted parts like minimum, valid, and starting point.
They all suggest that we need a code that:
- Ends as soon as we reach the target;
- Checks if our moves are valid;
- Has a starting point to iterate from to reach the target.
Bread-First Search (BFS) is excelent for finding the shortest or minimum path needed to go from one position to another.
Implementation
Here is the code we are going to explain in detail:
from collections import deque
class Solution:
def minMutation(self, startGene: str, endGene: str, bank: List[str]) -> int:
bank = set(bank)
queue = deque([(startGene, 0)]) # gene, changes
if endGene not in bank:
return -1
while queue:
gene, changes = queue.popleft()
if gene == endGene:
return changes
for i in range(len(gene)):
for c in "ACGT":
attempt_gene = gene[:i] + c + gene[i+1:]
if attempt_gene in bank:
queue.append((attempt_gene, changes + 1))
bank.remove(attempt_gene)
return -1
First, we convert our bank in a set to perform O(1) lookups:
bank = set(bank)
Since we need to check if every mutation is valid, turning the bank into a set is the best approach.
Then, we create our queue. In Python, we can initialize the queue with some values. Our starting point is startGene
with a total of mutations that starts from zero.
queue = deque([(startGene, 0)])
Remember I said this challenge is about paths? With this in mind, we know we need to keep track of the paths we visited, or the genes we already checked, so we don't end up in an infinite loop.
And now the most complex part:
while queue:
gene, changes = queue.popleft()
if gene == endGene:
return changes
for i in range(len(gene)):
for c in "ACGT":
attempt_gene = gene[:i] + c + gene[i+1:]
if attempt_gene in bank:
queue.append((attempt_gene, changes + 1))
bank.remove(attempt_gene)
while queue
ensures we iterate our queue until it has genes.
Everytime we pop one node from the queue, we check if it is equal to our target, endGene
.
gene, changes = queue.popleft()
if gene == endGene:
return changes
Then we have two nested for
loops. They are necessary because we want to replace one letter at a time and check if the mutation is valid. The first loop iterates through the current string, and the second loop tries to replace the current index with A, C, G, or T.
for i in range(len(gene)):
for c in "ACGT":
attempt_gene = gene[:i] + c + gene[i+1:]
if attempt_gene in bank:
queue.append((attempt_gene, changes + 1))
bank.remove(attempt_gene)
If the gene is in the bank we add it to the queue as a valid mutation (or a path).
We also remove it from the bank, so we don't need to check it again to avoid adding it back to the queue, which would cause an infinite loop.
Complexity
Time: it is O(b) where b is the bank size.
It is a little tricky, but the reason is that a gene string has a limit of 8 characters. When measuring time complexity we need to drop the constant, no matter how big they are. In our case, the constants are 8 and 4: genes are 8 characters long and we might have ending up replacing each letter by A, C, G or T. 8 * 4 = 32
Space: it is also O(b) where b is the bank size.
We use a queue that can store, in the worst case, all genes from the bank.
There is also the set we use to keep track of visited genes that can contain up to b genes.
Top comments (2)
nice!
This content is pure gold! Please, keep going! I recommend solving the "Sudoku solver" problem, I did it a few interviews ago and was very stuck on him. It would be a very nice article!