Representing sequences in pydna
# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    %%capture
    # Install the current development version of pydna (comment to install pip version)
    !pip install git+https://github.com/BjornFJohansson/pydna@dev_bjorn
    # Install pip version instead (uncomment to install)
    # !pip install pydna
Pydna contains classes to represent double stranded DNA sequences that can:
Be linear
Be circular
Contain overhangs (sticky ends).
These sequences can be used to simulate molecular biology methods such as cloning and PCR. The main classes used to represent sequences are Dseq and Dseqrecord.
Dseq represents the sequence only. Think of it as a FASTA file.
Dseqrecord can contain sequence features and other information such as publication, authors, etc. Think of it as a GenBank file.
Dseq Class
We can create a Dseq object in different ways.
For a linear sequence without overhangs, we create a Dseq object by passing a string with the sequence. For example:
from pydna.dseq import Dseq
my_seq = Dseq("aatat")
my_seq
Dseq(-5)
aatat
ttata
In the console representation above, there are three lines:
Dseq(-5) indicates that the sequence is linear and has 5 base pairs.
aatat is the top / sense / watson strand, referred to from now on as the watson strand.
ttata is the bottom / anti-sense / crick strand, referred to from now on as the crick strand.
Now, let’s create a circular sequence:
my_seq = Dseq("aatat", circular=True)
my_seq
Dseq(o5)
aatat
ttata
Note how o5 indicates that the sequence is circular and has 5 base pairs.
One way to represent a linear sequence with overhangs is to instantiate Dseq with the following arguments:
The watson strand as a string in the 5’-3’ direction.
The crick strand as a string in the 5’-3’ direction.
The 5’ overhang ovhg (overhang), an integer that describes how the two strands are staggered on the left side. A negative value is the number of bases by which the watson strand extends beyond the crick strand, a positive value is the number of bases by which the crick strand extends beyond the watson strand, and zero means a blunt left end.
Dseq("actag", "ctag", -1)
Dseq(-5)
actag
 gatc
Note how the bottom strand is passed in the 5’-3’ direction, but it is represented in the 3’-5’ direction in the console output.
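The change of orientation can be illustrated without pydna. The following is a minimal sketch, not pydna's implementation: render_duplex is a hypothetical helper that takes both strands in the 5'-3' direction plus ovhg, reverses the crick strand, and pads whichever strand starts further to the right.

```python
def render_duplex(watson: str, crick: str, ovhg: int) -> str:
    """Return a two-line drawing of a duplex (hypothetical helper, not pydna API).

    Both strands are given 5'-3'. The crick strand is reversed so that it
    reads 3'-5' under the watson strand, as in the console output.
    """
    top = " " * max(ovhg, 0) + watson           # watson starts later when ovhg > 0
    bottom = " " * max(-ovhg, 0) + crick[::-1]  # crick starts later when ovhg < 0
    return top + "\n" + bottom


print(render_duplex("actag", "ctag", -1))
```

The printed drawing has the same two strand lines as the console output above, without the Dseq(-5) header.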
If you omit the ovhg argument, pydna will try to find the value that makes the watson and crick strands complementary.
Dseq("actag", "ctag")
Dseq(-5)
actag
 gatc
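To make the idea of "finding the value that makes the strands complementary" concrete, here is a brute-force sketch. It is an illustration only and is not claimed to be the algorithm pydna uses: infer_ovhg is a hypothetical function that tries every possible stagger, keeps the ones where all overlapping bases pair, and returns the one with the longest paired region. It only understands the four unambiguous bases.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def infer_ovhg(watson: str, crick: str) -> int:
    """Guess ovhg by brute force (illustrative only, not pydna's algorithm)."""
    w = watson.lower()
    b = crick.lower()[::-1]  # crick strand read 3'-5', as drawn under watson
    best_ovhg, best_overlap = None, 0
    for ovhg in range(-(len(w) - 1), len(b)):
        # Column where each strand starts in the two-line drawing.
        w_start, b_start = max(ovhg, 0), max(-ovhg, 0)
        lo = max(w_start, b_start)
        hi = min(w_start + len(w), b_start + len(b))
        paired = all(
            COMPLEMENT.get(w[i - w_start]) == b[i - b_start] for i in range(lo, hi)
        )
        if paired and hi - lo > best_overlap:
            best_ovhg, best_overlap = ovhg, hi - lo
    if best_ovhg is None:
        raise ValueError("strands share no complementary region")
    return best_ovhg


print(infer_ovhg("actag", "ctag"))  # -1
```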
The best way to get a feeling for the meaning of ovhg is to visualise the possible scenarios:
dsDNA       overhang

  nnn...    2
nnnnn...

 nnnn...    1
nnnnn...

nnnnn...    0
nnnnn...

nnnnn...   -1
 nnnn...

nnnnn...   -2
  nnn...
Of note, the DNA sequence can be passed in lower case or upper case, and it is not restricted to the conventional ATCG nucleotides: the class supports the IUPAC ambiguous nucleotide code.
Dseq("Actag", "Ctag", -1)
Dseq(-5)
Actag
 gatC
Another way to pass the overhangs is to use the from_full_sequence_and_overhangs classmethod, which only needs the watson/sense strand. This is useful if you can only store the entire sequence (e.g. in a FASTA file), or if you want to specify overhangs on both sides of the double-stranded DNA when you create the object.
Both watson_ovhg and crick_ovhg follow the same rules as above. Specifically, the crick_ovhg argument is identical to the conventional ovhg argument, and the watson_ovhg argument is the ovhg argument applied to the reverse complement of the sequence.
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
my_seq
Dseq(-8)
aaatta
   aattt
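The rules above can be sketched in plain Python. This is a hedged illustration of how the two strands could be derived from the full sequence and the two overhang values; strands_from_full and reverse_complement are hypothetical helpers written for this example, not functions taken from pydna, and they only handle lowercase unambiguous bases.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous lowercase DNA string."""
    return seq.translate(str.maketrans("acgt", "tgca"))[::-1]


def strands_from_full(full: str, crick_ovhg: int, watson_ovhg: int) -> tuple[str, str, int]:
    """Return (watson, crick, ovhg), with both strands written 5'-3'."""
    n = len(full)
    # The watson strand loses bases on the left when crick_ovhg > 0
    # and on the right when watson_ovhg < 0.
    watson = full[max(crick_ovhg, 0): n - max(-watson_ovhg, 0)]
    # The crick strand loses bases at its 5' end (right side of the drawing)
    # when watson_ovhg > 0 and at its 3' end (left side) when crick_ovhg < 0.
    crick = reverse_complement(full)[max(watson_ovhg, 0): n - max(-crick_ovhg, 0)]
    return watson, crick, crick_ovhg


print(strands_from_full("aaattaaa", crick_ovhg=-3, watson_ovhg=-2))
```

For the example above this gives the watson strand aaatta and the crick strand tttaa (5'-3'), which reads aattt when drawn 3'-5' under the watson strand.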
The possible scenarios, applying positive and negative crick_ovhg and watson_ovhg values to a Dseq object, are visualised in the output of the code below:
for crick_ovhg in [-2, 2]:
    for watson_ovhg in [-3, 3]:
        print("watson_ovhg is " + str(watson_ovhg) + ", crick_ovhg is " + str(crick_ovhg))
        my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg, watson_ovhg)
        print(my_seq.__repr__() + "\n")
watson_ovhg is -3, crick_ovhg is -2
Dseq(-8)
aaatt
  taattt

watson_ovhg is 3, crick_ovhg is -2
Dseq(-8)
aaattaaa
  taa

watson_ovhg is -3, crick_ovhg is 2
Dseq(-8)
  att
tttaattt

watson_ovhg is 3, crick_ovhg is 2
Dseq(-8)
  attaaa
tttaa
The drawing below can help visualize the meaning of the overhangs.
  (-3)--(-2)--(-1)--(x)--(x)--(x)--(-1)--(-2)

5'( a)--( a)--( a)--(t)--(t)--(a)--( a)--( a)3'
3'( t)--( t)--( t)--(a)--(a)--(t)--( t)--( t)5'

5'( a)--( a)--( a)--(t)--(t)--(a)--(  )--(  )3'
3'(  )--(  )--(  )--(a)--(a)--(t)--( t)--( t)5'
To check the overhangs of a Dseq object, call the five_prime_end and three_prime_end methods. They describe the left end (the 5’ end of the watson strand) and the right end (the 3’ end of the watson strand), respectively. Each returns a tuple with the type of overhang and its sequence. For example:
my_seq = Dseq("aatat", "ttata", ovhg=-2)
print(my_seq.__repr__())
print(my_seq.five_prime_end())
print(my_seq.three_prime_end())
Dseq(-7)
aatat
  atatt
("5'", 'aa')
("5'", 'tt')
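The tuples printed above can be reproduced from the two strands and ovhg alone. The sketch below follows the convention visible in that output; describe_ends is a hypothetical helper, and the "blunt" label used for ends without an overhang is an assumption of this example rather than something shown in the output above.

```python
def describe_ends(watson: str, crick: str, ovhg: int):
    """Return (left_end, right_end) as (type, sequence) tuples (illustrative)."""
    # Left end: ovhg < 0 means watson sticks out (5' overhang),
    # ovhg > 0 means crick sticks out (3' overhang).
    if ovhg < 0:
        left = ("5'", watson[:-ovhg])
    elif ovhg > 0:
        left = ("3'", crick[-ovhg:])
    else:
        left = ("blunt", "")
    # Right end: positive when watson sticks out on the right side.
    right_ovhg = len(watson) - len(crick) + ovhg
    if right_ovhg > 0:
        right = ("3'", watson[-right_ovhg:])
    elif right_ovhg < 0:
        right = ("5'", crick[:-right_ovhg])
    else:
        right = ("blunt", "")
    return left, right


left, right = describe_ends("aatat", "ttata", -2)
print(left)
print(right)
```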
If you now want to join your sequence’s sticky ends to make a circular sequence (e.g. a plasmid), you can use the looped method. The sticky ends must be compatible to do so.
my_seq = Dseq("aatat", "ttata", ovhg=-2)
my_seq.looped()
Dseq(o5)
aatat
ttata
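What "compatible" means can be sketched as follows, under the assumption that two ends can be joined when they are overhangs of the same type and one is the reverse complement of the other. ends_are_compatible is a hypothetical helper that takes tuples in the format printed by five_prime_end and three_prime_end; it is not part of pydna.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string, returned in lowercase."""
    return seq.lower().translate(str.maketrans("acgt", "tgca"))[::-1]


def ends_are_compatible(left_end: tuple[str, str], right_end: tuple[str, str]) -> bool:
    """True when both ends have the same type and complementary sequences."""
    left_type, left_seq = left_end
    right_type, right_seq = right_end
    return left_type == right_type and left_seq.lower() == reverse_complement(right_seq)


print(ends_are_compatible(("5'", "aa"), ("5'", "tt")))  # True
print(ends_are_compatible(("5'", "aa"), ("5'", "aa")))  # False
```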
To change the origin of a circular sequence/plasmid, use the shifted method, passing the number of bases between the original origin and the new origin:
my_seq = Dseq("aatat", circular=True)
my_seq.shifted(2)
Dseq(o5)
tataa
atatt
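On the watson strand, shifting the origin amounts to rotating a string. A minimal sketch with a hypothetical rotate function (not pydna code):

```python
def rotate(seq: str, shift: int) -> str:
    """Move the origin of a circular sequence `shift` bases to the right."""
    shift %= len(seq)
    return seq[shift:] + seq[:shift]


print(rotate("aatat", 2))  # tataa
```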
__getitem__, __repr__, and __str__ methods
Slicing sequences (__getitem__)
__getitem__ is the method that is called when you use square brackets [] after a Python object. Below is an example with the built-in Python list:
my_list = [1, 2, 3]
print('using square brackets:', my_list[1:])
print('is the same as using __getitem__:', my_list.__getitem__(slice(1, None)))
using square brackets: [2, 3]
is the same as using __getitem__: [2, 3]
The __getitem__ method is implemented in pydna so that slicing a Dseq object returns a new Dseq object covering the range defined by a start value and a stop value, similarly to string slicing. Note that __getitem__ (and, consequently, []) uses zero-based indexing.
my_seq = Dseq("aatataa")
my_seq[2:5]
Dseq(-3)
tat
ata
__getitem__ respects overhangs.
my_seq = Dseq.from_full_sequence_and_overhangs("aatataa", crick_ovhg=0, watson_ovhg=-1)
my_seq[2:]
Dseq(-5)
tata
atatt
Note that index zero corresponds to the leftmost base of the sequence, which might not necessarily be on the watson strand. Let’s create a sequence that has an overhang on the left side.
sequence_with_overhangs = Dseq.from_full_sequence_and_overhangs("aatacgttcc", crick_ovhg=3, watson_ovhg=0)
sequence_with_overhangs
Dseq(-10)
   acgttcc
ttatgcaagg
When we index starting from 2, we don’t start counting on the watson strand, but on the crick strand, since it is the leftmost one.
sequence_with_overhangs[2:]
Dseq(-8)
 acgttcc
atgcaagg
Slicing circular sequences
When slicing circular Dseq objects we get linear sequences.
circular_seq = Dseq("aatctaa", circular=True)
circular_seq[1:5]
Dseq(-4)
atct
taga
We can slice circular sequences across the origin (where index is zero) if the first index is bigger than the second index. This is demonstrated in the example below:
circular_seq[5:2]
Dseq(-4)
aaaa
tttt
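The wrap-around behaviour can be mimicked on a plain string. This is only a sketch of the idea for the watson strand, with a hypothetical circular_slice function; it covers non-negative indices without a step and is not how pydna is implemented.

```python
def circular_slice(seq: str, start: int, stop: int) -> str:
    """Slice a circular sequence, wrapping past the origin when start >= stop."""
    if start < stop:
        return seq[start:stop]
    # Take everything up to the end, then continue from index zero.
    return seq[start:] + seq[:stop]


print(circular_slice("aatctaa", 1, 5))  # atct
print(circular_slice("aatctaa", 5, 2))  # aaaa
```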
Printing sequences to the console: __repr__ and __str__
__repr__ and __str__ are methods present in all Python classes that return a string representation of an object. __str__ is called by the print function, and __repr__ is used by the console or notebook to display the value of an expression that is not assigned to a variable. Below is an example with a date object:
import datetime
my_date = datetime.date(2023, 8, 15)
print('> print statement:', my_date)
print('> repr:', repr(my_date))
print('> repr from class method:', my_date.__repr__())
print()
print('> console output:')
my_date
> print statement: 2023-08-15
> repr: datetime.date(2023, 8, 15)
> repr from class method: datetime.date(2023, 8, 15)
> console output:
datetime.date(2023, 8, 15)
In a similar way, the __repr__ and __str__ methods are used by pydna to represent sequences as strings for different purposes:
__repr__ is used to make a figure-like representation that shows both strands and the overhangs.
__str__ is used to return the entire sequence as a string of characters (from the left-most to the right-most base of both strands), the way we would store it in a FASTA file.
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
print('> figure-like representation:\n', my_seq.__repr__())
print()
print('> string representation:\n', my_seq)
> figure-like representation:
Dseq(-8)
aaatta
   aattt
> string representation:
aaattaaa
Note that in the string representation, the bases correspond to the entire sequence provided, even when they are only present on either the watson or the crick strand. In the example above, the last two bases (aa) are missing from the watson strand, and only the crick strand covers those positions.
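One way to picture how the full string is assembled: walk the columns of the two-line drawing from left to right, take the watson base where there is one, and otherwise take the complement of the crick base. The sketch below is an assumption-laden illustration, not pydna's code; full_sequence is a hypothetical function that expects lowercase unambiguous bases and strands that overlap.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def full_sequence(watson: str, crick: str, ovhg: int) -> str:
    """Rebuild the left-to-right sequence from both strands (given 5'-3')."""
    bottom = crick[::-1]  # crick read 3'-5', left to right as in the drawing
    w_start, b_start = max(ovhg, 0), max(-ovhg, 0)
    length = max(w_start + len(watson), b_start + len(bottom))
    bases = []
    for i in range(length):
        if w_start <= i < w_start + len(watson):
            bases.append(watson[i - w_start])
        else:
            # Column covered only by the crick strand: use its complement.
            bases.append(bottom[i - b_start].translate(COMPLEMENT))
    return "".join(bases)


print(full_sequence("aaatta", "tttaa", -3))  # aaattaaa
```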
Edge cases
You can create arbitrary double-stranded sequences that are not complementary if you specify both strands and an overhang, but you won’t be able to use them for molecular biology simulations. For example:
Dseq("xxxx", "atat", ovhg=2)
Dseq(-6)
  xxxx
tata