Representing sequences in pydna

Visit the full library documentation here

%%capture
# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    # Install the current development version of pydna (comment to install pip version)
    !pip install git+https://github.com/BjornFJohansson/pydna@dev_bjorn
    # Install pip version instead (uncomment to install)
    # !pip install pydna

Pydna contains classes to represent double-stranded DNA sequences that can:

Be linear
Be circular
Contain overhangs (sticky ends).

These sequences can be used to simulate molecular biology methods such as cloning and PCR. The main classes used to represent sequences are Dseq and Dseqrecord.

Dseq represents the sequence only. Think of it as a FASTA file.
Dseqrecord can contain sequence features and other info such as publication, authors, etc. Think of it as a Genbank file.

NOTE: The Dseq class is a subclass of biopython’s Seq, whose documentation can be found here. Dseqrecord is a subclass of biopython’s SeqRecord, whose documentation can be found here.

Dseq Class

We can create a Dseq object in different ways.

For a linear sequence without overhangs, we create a Dseq object passing a string with the sequence. For example:

from pydna.dseq import Dseq
my_seq = Dseq("aatat")
my_seq

Dseq(-5)
aatat
ttata

In the console representation above, there are three lines:

Dseq(-5) indicates that the sequence is linear and has 5 basepairs.
aatat, the top / sense / watson strand, referred to from now on as the watson strand.
ttata, the bottom / anti-sense / crick strand, referred to from now on as the crick strand.

Now, let’s create a circular sequence:

my_seq = Dseq("aatat", circular=True)
my_seq

Dseq(o5)
aatat
ttata

Note how o5 indicates that the sequence is circular and has 5 basepairs.

One way to represent a linear sequence with overhangs is to instantiate Dseq with the following arguments:

The watson strand as a string in the 5’-3’ direction.
The crick strand as a string in the 5’-3’ direction.
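The overhang ovhg at the left side of the sequence (the 5’ end of the watson strand), which can be positive or negative. Its absolute value is the number of bases by which one strand extends beyond the other: ovhg is negative when the watson strand extends beyond the crick strand, and positive when the crick strand extends beyond the watson strand.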

Dseq("actag", "ctag", -1)

Dseq(-5)
actag
 gatc

Note how the bottom strand is passed in the 5’-3’ direction, but it is represented in the 3’-5’ direction in the console output.

If you omit the ovhg argument, pydna will try to find the value that makes the watson and crick strands complementary.

Dseq("actag", "ctag")

Dseq(-5)
actag
 gatc
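To make the idea of "finding the value that makes the strands complementary" concrete, here is a minimal pure-Python sketch. It is not pydna's actual algorithm: the function name infer_ovhg is hypothetical, and the brute-force search (try every offset, keep the longest complementary overlap) is an assumption chosen for readability.

```python
# Illustrative sketch only: NOT pydna's implementation.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def infer_ovhg(watson: str, crick: str) -> int:
    """Return the ovhg giving the longest complementary overlap.

    Both strands are given 5'-3', as in the Dseq constructor.
    """
    # Reverse the crick strand (3'-5', as drawn) and complement it, so that
    # annealed positions should read the same as the watson strand.
    expected = crick[::-1].translate(COMPLEMENT).lower()
    watson = watson.lower()
    best_overlap, best_ovhg = 0, None
    for ovhg in range(-(len(watson) - 1), len(crick)):
        w_off, c_off = max(ovhg, 0), max(-ovhg, 0)  # left padding of each strand
        start = max(w_off, c_off)
        end = min(w_off + len(watson), c_off + len(expected))
        overlap = end - start
        if overlap <= best_overlap:
            continue  # cannot beat the best candidate found so far
        if watson[start - w_off:end - w_off] == expected[start - c_off:end - c_off]:
            best_overlap, best_ovhg = overlap, ovhg
    if best_ovhg is None:
        raise ValueError("the strands have no complementary overlap")
    return best_ovhg


print(infer_ovhg("actag", "ctag"))  # prints -1
```

The result, -1, agrees with the value passed explicitly in the earlier example.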

The best way to get a feeling for the meaning of ovhg is to visualise the possible scenarios as such:

dsDNA       overhang

  nnn...    2
nnnnn...

  nnnn...   1
nnnnn...

nnnnn...    0
nnnnn...

nnnnn...   -1
  nnnn...

nnnnn...   -2
  nnn...

Note that the DNA sequence can be passed in lower case or upper case, and is not restricted to the conventional ATCG nucleotides. The class supports the IUPAC ambiguous nucleotide code.

Dseq("Actag", "Ctag", -1)

Dseq(-5)
Actag
 gatC

Another way to pass the overhangs is to use the from_full_sequence_and_overhangs classmethod, which only needs the watson/sense strand. This is useful if you can only store the entire sequence (e.g. in a FASTA file), or if you want to specify overhangs on both sides of the double-stranded DNA when you create the object.

Both the watson_ovhg and crick_ovhg can be passed following the same rules as above. Specifically, the crick_ovhg argument is identical to the conventional ovhg argument. The watson_ovhg argument is the ovhg argument applied to the reverse complementary sequence.

my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
my_seq

Dseq(-8)
aaatta
   aattt

The possible scenarios, combining positive and negative values of crick_ovhg and watson_ovhg, are visualised in the output of the code below:

for crick_ovhg in [-2, 2]:
    for watson_ovhg in [-3, 3]:
        print("watson_ovhg is " + str(watson_ovhg) + ", crick_ovhg is " + str(crick_ovhg))
        my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg, watson_ovhg)
        print(my_seq.__repr__() + "\n")

watson_ovhg is -3, crick_ovhg is -2
Dseq(-8)
aaatt
  taattt

watson_ovhg is 3, crick_ovhg is -2
Dseq(-8)
aaattaaa
  taa

watson_ovhg is -3, crick_ovhg is 2
Dseq(-8)
  att
tttaattt

watson_ovhg is 3, crick_ovhg is 2
Dseq(-8)
  attaaa
tttaa
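The four outputs above follow a simple rule: a positive crick_ovhg trims the watson strand on the left, a negative one trims the crick strand on the left, and watson_ovhg does the same on the right with the roles swapped. The sketch below reproduces the drawings with plain strings; split_strands is a hypothetical helper written for illustration and is not how pydna builds the object.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def split_strands(full: str, crick_ovhg: int, watson_ovhg: int):
    """Return the two drawn lines (watson, crick 3'-5') for a full sequence."""
    n = len(full)
    bottom = full.translate(COMPLEMENT)  # crick strand as drawn (3'-5')
    w_start = max(crick_ovhg, 0)         # watson missing on the left
    w_end = n - max(-watson_ovhg, 0)     # watson missing on the right
    c_start = max(-crick_ovhg, 0)        # crick missing on the left
    c_end = n - max(watson_ovhg, 0)      # crick missing on the right
    top_line = " " * w_start + full[w_start:w_end]
    bottom_line = " " * c_start + bottom[c_start:c_end]
    return top_line, bottom_line


for crick_ovhg in [-2, 2]:
    for watson_ovhg in [-3, 3]:
        top_line, bottom_line = split_strands("aaattaaa", crick_ovhg, watson_ovhg)
        print(f"watson_ovhg is {watson_ovhg}, crick_ovhg is {crick_ovhg}")
        print(top_line)
        print(bottom_line + "\n")
```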

The drawing below can help visualize the meaning of the overhangs.

  (-3)--(-2)--(-1)--(x)--(x)--(x)--(-1)--(-2)

5'( a)--( a)--( a)--(t)--(t)--(a)--( a)--( a)3'
3'( a)--( a)--( a)--(t)--(t)--(a)--( a)--( a)5'

5'( a)--( a)--( a)--(t)--(t)--(a)--(  )--(  )3'
3'(  )--(  )--(  )--(t)--(t)--(a)--( a)--( a)5'

To check the overhangs of a Dseq object, call the methods five_prime_end and three_prime_end, which describe the left and right ends of the sequence, respectively. Each returns a tuple with the type of overhang and its sequence. For example:

my_seq = Dseq("aatat", "ttata", ovhg=-2)
print(my_seq.__repr__())
print(my_seq.five_prime_end())
print(my_seq.three_prime_end())

Dseq(-7)
aatat
  atatt
("5'", 'aa')
("5'", 'tt')
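The following pure-Python sketch shows one way such end descriptions can be derived from the two strands and ovhg. The helpers left_end and right_end are hypothetical names, and the "blunt" label for an end without overhang is an assumption; the sketch reproduces the output above but should not be read as pydna's source code.

```python
def left_end(watson: str, crick: str, ovhg: int):
    """Describe the left end; both strands are given 5'-3'."""
    if ovhg < 0:   # watson protrudes: its 5' end is single stranded
        return "5'", watson[:-ovhg].lower()
    if ovhg > 0:   # crick protrudes: its 3' end is single stranded
        return "3'", crick[-ovhg:].lower()
    return "blunt", ""


def right_end(watson: str, crick: str, ovhg: int):
    """Describe the right end; both strands are given 5'-3'."""
    # Positive when the watson strand protrudes on the right.
    right = len(watson) - len(crick) + ovhg
    if right < 0:  # crick protrudes: its 5' end is single stranded
        return "5'", crick[:-right].lower()
    if right > 0:  # watson protrudes: its 3' end is single stranded
        return "3'", watson[-right:].lower()
    return "blunt", ""


print(left_end("aatat", "ttata", -2))
print(right_end("aatat", "ttata", -2))
```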

If you now want to join your sequence’s sticky ends to make a circular sequence (e.g. a plasmid), you can use the looped method. The sticky ends must be compatible to do so.

my_seq = Dseq("aatat", "ttata", ovhg=-2)
my_seq.looped()

Dseq(o5)
aatat
ttata
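As a rough illustration of what "compatible" means here, the sketch below checks two end descriptions of the form (type, sequence), like those shown earlier for five_prime_end and three_prime_end. The function can_loop is hypothetical, and the rule it encodes (same overhang type, and one overhang equal to the reverse complement of the other) is an assumption for illustration rather than a statement about pydna's internals.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def can_loop(left, right) -> bool:
    """Check whether two (type, sequence) end descriptions can anneal."""
    left_type, left_seq = left
    right_type, right_seq = right
    # Both ends must have the same kind of overhang, and the two
    # single-stranded stretches must base-pair with each other.
    return left_type == right_type and left_seq == reverse_complement(right_seq)


print(can_loop(("5'", "aa"), ("5'", "tt")))  # prints True
print(can_loop(("5'", "aa"), ("5'", "ag")))  # prints False
```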

If you want to change the origin of a circular sequence/plasmid, use the shifted method, passing the number of bases between the original origin and the new origin:

my_seq = Dseq("aatat", circular=True)
my_seq.shifted(2)

Dseq(o5)
tataa
atatt
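Shifting the origin amounts to rotating the sequence. A minimal string-based sketch, assuming a hypothetical helper rotate that is not part of pydna:

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def rotate(seq: str, shift: int) -> str:
    """Move the origin of a circular sequence `shift` bases to the right."""
    shift %= len(seq)  # a full turn brings the origin back to where it was
    return seq[shift:] + seq[:shift]


top = rotate("aatat", 2)
print(top)                        # prints tataa
print(top.translate(COMPLEMENT))  # prints atatt
```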

__getitem__, __repr__, and __str__ methods

Slicing sequences (`__getitem__`)

__getitem__ is the method that is called when you use the square brackets [] after a python object. Below is an example of the builtin python list:

my_list = [1, 2, 3]

print('using square brackets:', my_list[1:])
print('is the same as using __getitem__:', my_list.__getitem__(slice(1, None)))

using square brackets: [2, 3]
is the same as using __getitem__: [2, 3]

The __getitem__ method is modified in pydna to deal with Dseq objects: it returns a slice of the Dseq object, defined by a start value and a stop value, similarly to string slicing. In other words, __getitem__ indexes Dseq. Note that __getitem__ (and, consequently, []) uses zero-based indexing.

my_seq = Dseq("aatataa")
my_seq[2:5]

Dseq(-3)
tat
ata

__getitem__ respects overhangs.

my_seq = Dseq.from_full_sequence_and_overhangs("aatataa", crick_ovhg=0, watson_ovhg=-1)
my_seq[2:]

Dseq(-5)
tata
atatt

Note that index zero corresponds to the leftmost base of the sequence, which might not necessarily be on the watson strand. Let’s create a sequence that has an overhang on the left side.

sequence_with_overhangs = Dseq.from_full_sequence_and_overhangs("aatacgttcc", crick_ovhg=3, watson_ovhg=0)
sequence_with_overhangs

Dseq(-10)
   acgttcc
ttatgcaagg

When we index starting from 2, we don’t start counting on the watson, but on the crick strand since it is the leftmost one.

sequence_with_overhangs[2:]

Dseq(-8)
 acgttcc
atgcaagg
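One way to picture this behaviour is to slice the two drawn lines with the same indices, counting columns from the leftmost base of either strand. The sketch below does that with plain strings; slice_duplex is a hypothetical helper for illustration only.

```python
def slice_duplex(top: str, bottom: str, start=None, stop=None):
    """Slice both drawn lines (watson, crick 3'-5') with the same indices."""
    # Pad to a common width so that index 0 is the leftmost base of the duplex.
    width = max(len(top), len(bottom))
    top, bottom = top.ljust(width), bottom.ljust(width)
    # Keep leading spaces (left overhang), drop trailing padding.
    return top[start:stop].rstrip(), bottom[start:stop].rstrip()


new_top, new_bottom = slice_duplex("   acgttcc", "ttatgcaagg", 2)
print(new_top)
print(new_bottom)
```

The two printed lines match the strands shown for sequence_with_overhangs[2:].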

Slicing circular sequences

When slicing circular Dseq objects we get linear sequences.

circular_seq = Dseq("aatctaa", circular=True)
circular_seq[1:5]

Dseq(-4)
atct
taga

We can slice circular sequences across the origin (where index is zero) if the first index is bigger than the second index. This is demonstrated in the example below:

circular_seq[5:2]

Dseq(-4)
aaaa
tttt
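On a plain string, slicing across the origin can be sketched as taking the tail of the sequence followed by its head. The helper circular_slice is hypothetical, and the case where both indices are equal is deliberately left out because its behaviour is not described here.

```python
def circular_slice(seq: str, start: int, stop: int) -> str:
    """Slice a circular sequence, wrapping around the origin if needed."""
    if start < stop:
        return seq[start:stop]             # ordinary slice
    if start > stop:
        return seq[start:] + seq[:stop]    # tail, then head across the origin
    raise ValueError("start == stop is not covered by this sketch")


print(circular_slice("aatctaa", 1, 5))  # prints atct
print(circular_slice("aatctaa", 5, 2))  # prints aaaa
```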

Printing sequences to the console: `__repr__` and `__str__`

__repr__ and __str__ are methods present in all python classes that return a string representation of an object. __str__ is called by the print function, and __repr__ is used by the console or notebook to display the value of an expression that is not assigned to a variable. Below is an example with a date object:

import datetime

my_date = datetime.date(2023, 8, 15)

print('> print statement:', my_date)
print('> repr:', repr(my_date))
print('> repr from class method:', my_date.__repr__())

print()
print('> console output:')
my_date

> print statement: 2023-08-15
> repr: datetime.date(2023, 8, 15)
> repr from class method: datetime.date(2023, 8, 15)

> console output:

datetime.date(2023, 8, 15)

In a similar way, __repr__ and __str__ methods are used by pydna to represent sequences as strings for different purposes:

__repr__ is used to make a figure-like representation that shows both strands and the overhangs.
__str__ is used to return the entire sequence as a string of characters (from the left-most to the right-most base of both strands), the way we would store it in a FASTA file.

my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
print('> figure-like representation:\n', my_seq.__repr__())
print()
print('> string representation:\n', my_seq)

> figure-like representation:
 Dseq(-8)
aaatta
   aattt

> string representation:
 aaattaaa

Note that in the string representation, the bases correspond to the entire sequence provided, even when they are only present on either the watson or the crick strand. In the example above, the last two aa bases are missing from the watson strand; only the crick strand covers those positions.
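The relationship between the two representations can be sketched by merging the two drawn lines column by column: take the watson base where there is one, and otherwise the complement of the crick base. The helper full_sequence is hypothetical and only illustrates the idea for a complementary duplex.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def full_sequence(top: str, bottom: str) -> str:
    """Merge the two drawn lines (watson, crick 3'-5') into one sequence."""
    width = max(len(top), len(bottom))
    top, bottom = top.ljust(width), bottom.ljust(width)
    return "".join(
        t if t != " " else b.translate(COMPLEMENT)  # fill gaps from the crick strand
        for t, b in zip(top, bottom)
    )


print(full_sequence("aaatta", "   aattt"))  # prints aaattaaa
```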

Edge cases

You can create arbitrary double-stranded sequences that are not complementary if you specify both strands and an overhang, but you won’t be able to use them for molecular biology simulations. For example:

Dseq("xxxx", "atat", ovhg=2)

Dseq(-6)
  xxxx
tata