pydna.seq

A subclass of the Biopython Seq class.

Has a number of extra methods and uses the pydna._pretty_str.pretty_str class instead of str for nicer output in the IPython shell.

class pydna.seq.Seq(data: str | bytes | bytearray | _SeqAbstractBaseClass | SequenceDataAbstractBaseClass | dict | None, length: int | None = None)[source]

Bases: Seq

docstring.

translate(*args, stop_symbol: str = '*', to_stop: bool = False, cds: bool = False, gap: str = '-', **kwargs) ProteinSeq[source]

Translate.

gc() float[source]

Return GC content.
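
As a rough illustration of what a GC-content calculation involves, the sketch below computes the fraction of G and C characters in a plain string. The helper name gc_fraction_sketch is hypothetical, and this page does not state whether gc() reports a fraction or a percentage, or how it treats ambiguous bases, so treat this as a sketch under those assumptions rather than the pydna implementation.

```python
def gc_fraction_sketch(sequence: str) -> float:
    """Return the fraction of G or C characters in ``sequence``.

    Hypothetical helper for illustration only; not part of pydna.
    """
    if not sequence:
        return 0.0  # assumption: an empty sequence is reported as 0.0
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return gc_count / len(upper)


print(gc_fraction_sketch("ATGC"))  # 0.5
```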

cai(organism: str = 'sce') float[source]

docstring.

rarecodons(organism: str = 'sce') List[slice][source]

docstring.

startcodon(organism: str = 'sce') float | None[source]

docstring.

stopcodon(organism: str = 'sce') float | None[source]

docstring.

express(organism: str = 'sce') PrettyTable[source]

docstring.

orfs2(minsize: int = 30) List[str][source]

docstring.

orfs(minsize: int = 100) List[Tuple[int, int]][source]

seguid() str[source]

URL-safe SEGUID [1] for the sequence.

This checksum is the same as seguid but with base64.urlsafe encoding instead of the normal base64. This means that the characters + and / are replaced with - and _ so that the checksum can be part of a URL.

Examples

>>> from pydna.seq import Seq
>>> a = Seq("aa")
>>> a.seguid()
'lsseguid=gBw0Jp907Tg_yX3jNgS4qQWttjU'
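
The difference between the two base64 alphabets can be seen with the standard library alone. The sketch below hashes a string with SHA-1 and encodes the digest both ways. The function name checksum_pair_sketch is hypothetical, and any normalisation or prefixing that seguid() applies to its input before hashing is not reproduced here, so the values are not expected to match the output above.

```python
import base64
import hashlib


def checksum_pair_sketch(sequence: str) -> tuple[str, str]:
    """Return (standard, URL-safe) base64 SHA-1 checksums without '=' padding.

    Hypothetical helper; it only illustrates the two encodings.
    """
    digest = hashlib.sha1(sequence.encode("ascii")).digest()
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    urlsafe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, urlsafe


standard, urlsafe = checksum_pair_sketch("AA")
# The two encodings differ only in '+' -> '-' and '/' -> '_'.
assert urlsafe == standard.replace("+", "-").replace("/", "_")
```

Because the URL-safe form contains neither + nor /, it can be placed in a URL path or query string without percent-encoding.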

References

reverse_complement()[source]

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.
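
The behaviour shown above (complement each base, read U as T, then reverse) can be sketched on plain strings as follows. The function name reverse_complement_sketch is hypothetical, and the sketch deliberately ignores IUPAC ambiguity codes, which the real method may handle.

```python
# Complement table for DNA output; U is read as T, so it pairs with A.
# Characters not listed here (e.g. ambiguity codes) are left unchanged.
_DNA_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement_sketch(sequence: str) -> str:
    """Complement every base, then reverse the result (hypothetical helper)."""
    return sequence.translate(_DNA_COMPLEMENT)[::-1]


print(reverse_complement_sketch("CGA"))    # TCG
print(reverse_complement_sketch("CGAUT"))  # AATCG
```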

rc()

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

class pydna.seq.ProteinSeq(data: str | bytes | bytearray | _SeqAbstractBaseClass | SequenceDataAbstractBaseClass | dict | None, length: int | None = None)[source]

Bases: Seq

docstring.

translate()[source]

Turn a nucleotide sequence into a protein sequence by creating a new sequence object.

This method will translate DNA or RNA sequences. It should not be used on protein sequences as any result will be biologically meaningless.

Parameters:
  • table – Which codon table to use. This can be either a name (string), an NCBI identifier (integer), or a CodonTable object (useful for non-standard genetic codes). This defaults to the “Standard” table.

  • stop_symbol – Single character string, what to use for terminators. This defaults to the asterisk, “*”.

  • to_stop – Boolean, defaults to False, meaning do a full translation continuing on past any stop codons (translated as the specified stop_symbol). If True, translation is terminated at the first in frame stop codon (and the stop_symbol is not appended to the returned protein sequence).

  • cds – Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.

  • gap – Single character string to denote the symbol used for gaps. Defaults to the minus sign.

A Seq object is returned if translate is called on a Seq object; a MutableSeq object is returned if translate is called on a MutableSeq object.

e.g. Using the standard table:

>>> coding_dna = Seq("GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna.translate()
Seq('VAIVMGR*KGAR*')
>>> coding_dna.translate(stop_symbol="@")
Seq('VAIVMGR@KGAR@')
>>> coding_dna.translate(to_stop=True)
Seq('VAIVMGR')

Now using NCBI table 2, where TGA is not a stop codon:

>>> coding_dna.translate(table=2)
Seq('VAIVMGRWKGAR*')
>>> coding_dna.translate(table=2, to_stop=True)
Seq('VAIVMGRWKGAR')

In fact, GTG is an alternative start codon under NCBI table 2, meaning this sequence could be a complete CDS:

>>> coding_dna.translate(table=2, cds=True)
Seq('MAIVMGRWKGAR')

It isn’t a valid CDS under NCBI table 1, due to both the start codon and also the in frame stop codons:

>>> coding_dna.translate(table=1, cds=True)
Traceback (most recent call last):
    ...
Bio.Data.CodonTable.TranslationError: First codon 'GTG' is not a start codon

If the sequence has no in-frame stop codon, then the to_stop argument has no effect:

>>> coding_dna2 = Seq("TTGGCCATTGTAATGGGCCGC")
>>> coding_dna2.translate()
Seq('LAIVMGR')
>>> coding_dna2.translate(to_stop=True)
Seq('LAIVMGR')

NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid or a stop codon. These are translated as “X”. Any invalid codon (e.g. “TA?” or “T-A”) will throw a TranslationError.

NOTE - This does NOT behave like the python string’s translate method. For that, use str(my_seq).translate(…) instead.
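
To make the stop_symbol and to_stop options concrete, here is a minimal sketch of codon-by-codon translation with the standard genetic code, written on plain strings. The names STANDARD_TABLE and translate_sketch are hypothetical. The sketch does not implement the table, cds or gap options, ambiguous codons, or RNA input, and it silently ignores trailing bases that do not form a full codon.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(dna: str, stop_symbol: str = "*", to_stop: bool = False) -> str:
    """Translate unambiguous DNA codon by codon (hypothetical helper)."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # drop an incomplete final codon
    for i in range(0, usable_length, 3):
        amino_acid = STANDARD_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            if to_stop:
                break  # stop at the first in-frame stop codon, symbol not appended
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate_sketch(coding_dna))                   # VAIVMGR*KGAR*
print(translate_sketch(coding_dna, stop_symbol="@"))  # VAIVMGR@KGAR@
print(translate_sketch(coding_dna, to_stop=True))     # VAIVMGR
```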

complement()[source]

Return the complement as a DNA sequence.

>>> Seq("CGA").complement()
Seq('GCT')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").complement()
Seq('GCTAA')

In contrast, complement_rna returns an RNA sequence:

>>> Seq("CGAUT").complement_rna()
Seq('GCUAA')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement()
MutableSeq('GCT')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement(inplace=True)
MutableSeq('GCT')
>>> my_seq
MutableSeq('GCT')

As Seq objects are immutable, a TypeError is raised if complement is called on a Seq object with inplace=True.

complement_rna()[source]

Return the complement as an RNA sequence.

>>> Seq("CGA").complement_rna()
Seq('GCU')

Any T in the sequence is treated as a U:

>>> Seq("CGAUT").complement_rna()
Seq('GCUAA')

In contrast, complement returns a DNA sequence by default:

>>> Seq("CGA").complement()
Seq('GCT')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement_rna()
MutableSeq('GCU')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement_rna(inplace=True)
MutableSeq('GCU')
>>> my_seq
MutableSeq('GCU')

As Seq objects are immutable, a TypeError is raised if complement_rna is called on a Seq object with inplace=True.

reverse_complement()[source]

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

rc()

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

reverse_complement_rna()[source]

Return the reverse complement as an RNA sequence.

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

Any T in the sequence is treated as a U:

>>> Seq("CGAUT").reverse_complement_rna()
Seq('AAUCG')

In contrast, reverse_complement returns a DNA sequence:

>>> Seq("CGA").reverse_complement()
Seq('TCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement_rna()
MutableSeq('UCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement_rna(inplace=True)
MutableSeq('UCG')
>>> my_seq
MutableSeq('UCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement_rna is called on a Seq object with inplace=True.

transcribe()[source]

Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.

Following the usual convention, the sequence is interpreted as the coding strand of the DNA double helix, not the template strand. This means we can get the RNA sequence just by switching T to U.

>>> from Bio.Seq import Seq
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> coding_dna.transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

The sequence is modified in-place and returned if inplace is True:

>>> sequence = MutableSeq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> sequence
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence.transcribe()
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence.transcribe(inplace=True)
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

As Seq objects are immutable, a TypeError is raised if transcribe is called on a Seq object with inplace=True.

Trying to transcribe an RNA sequence has no effect. If you have a nucleotide sequence which might be DNA or RNA (or even a mixture), calling the transcribe method will ensure any T becomes U.
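
Since the sequence is read as the coding strand, transcription amounts to a character substitution. The sketch below shows this on plain strings, together with the inverse substitution used for back-transcription. The function names are hypothetical and the sketch performs no check that the input is actually a nucleotide sequence.

```python
def transcribe_sketch(sequence: str) -> str:
    """Replace every T with U, preserving case (hypothetical helper)."""
    return sequence.replace("T", "U").replace("t", "u")


def back_transcribe_sketch(sequence: str) -> str:
    """Replace every U with T, preserving case (hypothetical helper)."""
    return sequence.replace("U", "T").replace("u", "t")


print(transcribe_sketch("ATGGCCATTGTA"))       # AUGGCCAUUGUA
print(transcribe_sketch("AUGGCCAUUGUA"))       # AUGGCCAUUGUA (RNA is unchanged)
print(back_transcribe_sketch("AUGGCCAUUGUA"))  # ATGGCCATTGTA
```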

Trying to transcribe a protein sequence will replace any T for Threonine with U for Selenocysteine, which has no biologically plausible rationale.

>>> from Bio.Seq import Seq
>>> my_protein = Seq("MAIVMGRT")
>>> my_protein.transcribe()
Seq('MAIVMGRU')

back_transcribe()[source]

Return the DNA sequence from an RNA sequence by creating a new Seq object.

>>> from Bio.Seq import Seq
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

The sequence is modified in-place and returned if inplace is True:

>>> sequence = MutableSeq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
>>> sequence
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence.back_transcribe()
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence.back_transcribe(inplace=True)
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

As Seq objects are immutable, a TypeError is raised if back_transcribe is called on a Seq object with inplace=True.

Trying to back-transcribe DNA has no effect. If you have a nucleotide sequence which might be DNA or RNA (or even a mixture), calling the back_transcribe method will ensure any U becomes T.

Trying to back-transcribe a protein sequence will replace any U for Selenocysteine with T for Threonine, which is biologically meaningless.

>>> from Bio.Seq import Seq
>>> my_protein = Seq("MAIVMGRU")
>>> my_protein.back_transcribe()
Seq('MAIVMGRT')

seguid() str[source]

URL-safe SEGUID [2] for the sequence.

This checksum is the same as seguid but with base64.urlsafe encoding instead of the normal base64. This means that the characters + and / are replaced with - and _ so that the checksum can be part of a URL.

Examples

>>> from pydna.seq import ProteinSeq
>>> a = ProteinSeq("aa")
>>> a.seguid()
'lsseguid=gBw0Jp907Tg_yX3jNgS4qQWttjU'

References

molecular_weight() float[source]

pI() float[source]

instability_index() float[source]

Instability index according to Guruprasad et al.

A value above 40 means the protein has a short half-life.

Guruprasad K., Reddy B.V.B., Pandit M.W. Protein Engineering 4:155-161(1990).