Biology is an interesting world and bioinformatic is where computer science meets biology,
Today I will describe a simple yet interesting bioinformatic problem from an algorithmic prospective.: Calculate the reverse complement of DNA bases using Javascript.
In this article, I am using a bioinformatic problem because is fun and interesting, but I will be mostly talking about JavaScript performance.
We will
- Start describing how DNA works (with some big simplifications… Eih! I am not a biologist!),
- Propose some implementations, and then
- try to archive the best time performance, comparing the time for completing the task.
heads-up: A basic knowledge of JavaScript language is required to follow along.
What’s the reverse complement?
Before explaining it, bear with me for a small tour of how DNA looks like.
The DNA helix is composed of two strands like in the image above.
A strand is a long sequence of this for letters ATGC ( each letter is a specific nucleotide Adenine, Thymidine, Guanidine, Cytidine ) in some order.
There is a specific relation between what is the first strand and what there is on the second strand: for each A in the first sequence there is a T on the other strand and vice versa, and for each G a C will be on the other strand.
The conversion from map DNA strand to complementary strang would be something like:
'A': 'T',
'G': 'C',
'T': 'A',
'C': 'G'
Here is an example:
I often hear these two sequences named 5’ to 3’’ ( 3’ end) and the second string is named 3’ to 5’ (5’’ end). The direction of reading is in both from 5’ to 3’’ and this means that a sequence is read from left to right but the other one (the complementary) is read from right to left.
In most of the file formats and web APIs that I worked with since the complementary DNA strand can be calculated from the first strand sequence, only one DNA strand is provided ( 3’ end) and is up to us to calculate the complementary.
Now, we have enough for our small challenge:
How can I generate a complementary strand?
Given an input:
TTATACGACTCACTATAGGGAGACTTAAGAAG
The expected output should look like this:
CTTCTTAAGTCTCCCTATAGTGAGTCGTATAA
Remember: we are reading the complementary in reverse order so the DNA sequence starts TT the complementary will end with AA.
Input:
TT ATACGACTCACTATAGGGAGACTTAAGAAG
Output:
CTTCTTAAGTCTCCCTATAGTGAGTCGTAT AA
Ok, let make the code talks for us:
Let start with a modern approach, a map
const map = (sequence) => {
const map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
return sequence.split('').reverse().map(bp => map[bp]).join("")
}
This is “easy” to read and the steps are :
We take the input
“TACGA”
We separate each char and create an array
[ ‘T’ , ’A’ , ’C’ , ’G’ , ’A’]
Then map each char into his complementary
[ ‘A’ , ’T’ , ’G’ , ’C’ , ’T’]
We reverse
[ ‘T’ , ’C’ , ’G’ , ’T’ , ’A’]
And the join in a string
“TCGTA”
That’s it… right?
In most cases, yes, but today we are a bit more stubborn and we will try to find the best performance time for this job.
Why? Well even a small bacterial DNA can range in size anywhere from 130 kbp to over 14 Mbp ( a bp is a single letter/Nucleotide) so being fast could be important.
Ok, we have the rules, now let me present our players :
Player1:
We just saw the map implementation, let call map,
const map = (sequence) => {
const map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
return sequence.split('')
.reverse()
.map(bp => map[bp])
.join("")
}
Player2: for loop and with if:
const forLoop = (sequence) => {
let complement = ''
for (let idx = 0; idx < sequence.length; idx++) {
if (sequence[idx] === 'A') {
complement = 'T' + complement
} else if (sequence[idx] === 'T') {
complement = 'A' + complement
} else if (sequence[idx] === 'G') {
complement = 'C' + complement
} else if (sequence[idx] === 'C') {
complement = 'G' + complement
}
}
return complement
}
Player3: A for with a switch case:
const forSwitch = (sequence) => {
let complement = '';
for (let idx = 0, sL = sequence.length; idx < sL; idx++) {
switch (sequence[idx]) {
case 'A':
complement = 'T' + complement
break;
case 'T':
complement = 'A' + complement
break;
case 'G':
complement = 'C' + complement
break;
case 'C':
complement = 'G' + complement
break;
}
}
return complement
}
We will run these implementations ( and some small variations, github for more details ), 10000 times on a 35752 long DNA sequence and record the best time, the worst time, and the overall average time.
Ready go!
This graph is not that easy to read, let me provide a table ordered by
Code | Average | Best | Worst |
---|---|---|---|
For (optimized) with switch case | 0.9446 | 0.4836 | 99258,00 |
For with multiple if | 21564,00 | 0.5540 | 867263,00 |
For (optimized) with each if | 11737,00 | 0.6480 | 98886,00 |
For with dictionary | 15038,00 | 11097,00 | 83742,00 |
ForEach with dictionary | 23381,00 | 17202,00 | 70510,00 |
Big Map with regular expression | 29884,00 | 23477,00 | 103878,00 |
Map with dictionary | 34595,00 | 26937,00 | 137978,00 |
Replace with dictionary | 237074,00 | 51751,00 | 3951461,00 |
It looks like “replace with dictionary” is the worst in timing, and “optimized switch case” is the best.
Wrapping up,
In this implementation I can see that:
- The regular expression and the dictionary are slower than if and switch case
- For is the faster loop
- switch case wins on if else if
- The optimized of ‘for loop’ give some small improvements
Bonus, (what optimized for means):
Maybe you already noted the ‘switch case’ implementation. During my review of this topic, I fell on this website ( https://browserdiet.com/ ) and learned something interesting about the for loop that I didn’t know.
for ([initialization]; [condition]; [final-expression]){
Statement
}
Every time a ‘statement’ gets executed the condition block runs again.
This sounds clear, but also sequence.length will re-calculate each interaction, consuming more time, and this is bad!
And there is a simple solution,
We can instance a variable with the value of sequence.length
in the initialization block:
for (let idx = 0; idx < sequence.length; idx++) {
// sequence.length is calculated every interaction
}
for (let idx = 0, sL = sequence.length; idx < sL; idx++) {
// sequence.length is calculated only 1 time
}
Thanks for taking the time to read this article and let me know any feedback, have a great day!
References:
https://en.wikipedia.org/wiki/Bacterial_genome
https://en.wikipedia.org/wiki/DNA
https://leanylabs.com/blog/js-forEach-map-reduce-vs-for-for_of/
DNA image from https://commons.wikimedia.org/wiki/File:0321_DNA_Macrostructure.jpg
Top comments (1)
This is an interesting point of view. Thanks!
My main concern on this idea is how hard/easy you can to find repetition in a DNA sequence.
BTW I think is worth trying to run some comparisons.
Thanks for sharing