Working with Features using the Dseqrecord class

Before working with features, check how to import sequences from files in the Importing_Seqs notebook.

For full library documentation, visit here.

Some sequence file formats (like Genbank) include features, describing key biological properties of sequence regions. In Genbank, features “include genes, gene products, as well as regions of biological significance reported in the sequence.” (See here for a description of a Genbank file and associated terminologies/annotations) Examples include coding sequences (CDS), introns, promoters, etc.

pydna offers many ways to easily view, add, extract, and write features into a Genbank file via the Dseqrecord class. After reading a file into a Dseqrecord object, we can check out the list of features in the record using the following code. This example uses the sample record U49845.

# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    %%capture
    # Install the current development version of pydna (comment to install pip version)
    !pip install git+https://github.com/BjornFJohansson/pydna@dev_bjorn
    # Install pip version instead (uncomment to install)
    # !pip install pydna

from pydna.dseqrecord import Dseqrecord
from pydna.parsers import parse

#Import your file into python. 
file_path = "./U49845.gb"
records = parse(file_path)
sample_record = records[0]

# List all features
for feature in sample_record.features:
    print(feature)

type: source
location: [0:5028](+)
qualifiers:
    Key: chromosome, Value: ['IX']
    Key: db_xref, Value: ['taxon:4932']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Saccharomyces cerevisiae']

type: mRNA
location: [<0:>206](+)
qualifiers:
    Key: product, Value: ['TCP1-beta']

type: CDS
location: [<0:206](+)
qualifiers:
    Key: codon_start, Value: ['3']
    Key: product, Value: ['TCP1-beta']
    Key: protein_id, Value: ['AAA98665.1']
    Key: translation, Value: ['SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM']

type: gene
location: [<686:>3158](+)
qualifiers:
    Key: gene, Value: ['AXL2']

type: mRNA
location: [<686:>3158](+)
qualifiers:
    Key: gene, Value: ['AXL2']
    Key: product, Value: ['Axl2p']

type: CDS
location: [686:3158](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: gene, Value: ['AXL2']
    Key: note, Value: ['plasma membrane glycoprotein']
    Key: product, Value: ['Axl2p']
    Key: protein_id, Value: ['AAA98666.1']
    Key: translation, Value: ['MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML']

type: gene
location: [<3299:>4037](-)
qualifiers:
    Key: gene, Value: ['REV7']

type: mRNA
location: [<3299:>4037](-)
qualifiers:
    Key: gene, Value: ['REV7']
    Key: product, Value: ['Rev7p']

type: CDS
location: [3299:4037](-)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: gene, Value: ['REV7']
    Key: product, Value: ['Rev7p']
    Key: protein_id, Value: ['AAA98667.1']
    Key: translation, Value: ['MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESIFGSLF']

Additional ways to view and search for particular features are shown at the bottom of the page under “Other Methods to Viewing Features”

Adding Features and Qualifiers

To add new feature to describe a region of interest to a record, for instance a region that you would like to perform a PCR, you need to create a SeqFeature (sequence feature). The minimal information required is:

A FeatureLocation: position of the feature in the sequence.
The type of feature you want to add.

🚨🚨 VERY IMPORTANT 🚨🚨. Note that FeatureLocations are like python ranges (zero-based open intervals), whereas in GenBank files, locations are one-based closed intervals. For instance, the following code adds a new feature from the 2nd to the 5th nucleotide (FeatureLocation(3, 15)), of the gene type, but in the GenBank file will be represented as 4..15.

from Bio.SeqFeature import FeatureLocation, SeqFeature

# Create a dummy record
dummy_record = Dseqrecord("aaaATGCGTACGTGAacgt")

# Define the locations of a CDS
location = FeatureLocation(3, 15)

# Create a SeqFeature with the type mRNA
my_feature = SeqFeature(location=location, type="gene")

# Add my_feature to dummy_record with .append
dummy_record.features.append(my_feature)

# Confirm that my_feature has been added
print(dummy_record.features[-1])

# Print the feature in GenBank format (see how the location is `4..15`)
print(dummy_record.format("genbank"))

type: gene
location: [3:15]
qualifiers:

LOCUS       name                      19 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  description.
ACCESSION   id
VERSION     id
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     gene            4..15
ORIGIN
        1 aaaatgcgta cgtgaacgt
//

To give further information about a feature, we can add a qualifier using the qualifiers property of SeqFeature, which contains a dictionary of qualifiers. For instance, if I would like to note a new feature of type ‘domain’, between 3-9 bases as my region of interest, I can instantiate the SeqFeature class object as such.

Note that a new feature is always added to the last position of the features list.

location = FeatureLocation(3, 9)

# Create a SeqFeature with a qualifier
my_feature2 = SeqFeature(location=location, type="domain", qualifiers={"Note": ["Region of interest"]})

# Add my_feature to my_record with .append
dummy_record.features.append(my_feature2)

# Confirm that my_feature has been added
print('>> Feature was added:')
print(dummy_record.features[-1])
print()

# Print the feature in GenBank format
print('>> GenBank format:')
print(dummy_record.format("genbank"))

>> Feature was added:
type: domain
location: [3:9]
qualifiers:
    Key: Note, Value: ['Region of interest']


>> GenBank format:
LOCUS       name                      19 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  description.
ACCESSION   id
VERSION     id
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     gene            4..15
     domain          4..9
                     /Note="Region of interest"
ORIGIN
        1 aaaatgcgta cgtgaacgt
//

🤔 Best practices for qualifiers:

The values in the qualifiers dictionary should be lists. The reason for this is that in a GenBank file, a single feature can have multiple values for a single qualifier. Below is a real world of the ase1 CDS example from the S. pombe genome in EMBL format:

FT   CDS             join(1878362..1878785,1878833..1880604)
FT                   /colour=2
FT                   /primary_name="ase1"
FT                   /product="antiparallel microtubule cross-linking factor
FT                   Ase1"
FT                   /systematic_id="SPAPB1A10.09"
FT                   /controlled_curation="term=species distribution, conserved
FT                   in eukaryotes; date=20081110"
FT                   /controlled_curation="term=species distribution, conserved
FT                   in metazoa; date=20081110"
FT                   /controlled_curation="term=species distribution, conserved
FT                   in vertebrates; date=20081110"
FT                   /controlled_curation="term=species distribution,
FT                   predominantly single copy (one to one); date=20081110"
FT                   /controlled_curation="term=species distribution, conserved
FT                   in fungi; date=20081110"
FT                   /controlled_curation="term=species distribution, conserved
FT                   in eukaryotes only; date=20081110"

Note how there are several controlled_curation qualifiers, therefore it makes sense to store them as a list.

By default, you can add any type of object in the qualifiers dictionary, and most things will work if you add a string. However, you risk overwriting the existing value for a qualifier, so best practice is:

Check if the qualifier already exists using if "qualifier_name" in feature.qualifiers
If it exists, append to the existing list of values using feature.qualifiers["qualifier_name"].append("new_value")
If it does not exist, add it to the qualifiers dictionary using feature.qualifiers["qualifier_name"] = ["new_value"]

Note that Bio.SeqFeatures does not automatically assume a sequence strand for the feature. If you would like to refer to a feature on the positive or minus strand, you can add a parameter in FeatureLocation specifying strand=+1 or strand=-1.

#Create a location specifying the minus strand
location = FeatureLocation(15, 19, strand=-1)

my_feature3 = SeqFeature(location=location, type="domain", qualifiers={"gene":["example_domain"]})

dummy_record.features.append(my_feature3)

print(dummy_record.features[-1])

print(dummy_record.format("genbank"))

type: domain
location: [15:19](-)
qualifiers:
    Key: gene, Value: ['example_domain']

LOCUS       name                      19 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  description.
ACCESSION   id
VERSION     id
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     gene            4..15
     domain          4..9
                     /Note="Region of interest"
     domain          complement(16..19)
                     /gene="example_domain"
ORIGIN
        1 aaaatgcgta cgtgaacgt
//

Adding a Feature with Parts

To add a feature with parts, like a CDS with introns, we need to use a CompoundLocation object when creating a SeqFeature.

The example code below adds a CDS with two parts, between 3-9bp and 12-15bp, to my features list. In a real-world scenario this would represent a CDS with an intron that skips the ACG codon: ATGCGT~~ACG~~TGA

from Bio.SeqFeature import CompoundLocation

# Define the locations of the CDS
locations = [FeatureLocation(3, 9), FeatureLocation(12, 15)]

# Create a compound location from these parts
compound_location = CompoundLocation(locations)

# Create a SeqFeature with this compound location, including type and qualifiers. 
cds_feature = SeqFeature(location=compound_location, type="CDS", qualifiers={"gene": ["example_gene"]})

# Add the feature to the Dseqrecord
dummy_record.features.append(cds_feature)

print(dummy_record.features[-1])

print(dummy_record.format("genbank"))

type: CDS
location: join{[3:9], [12:15]}
qualifiers:
    Key: gene, Value: ['example_gene']

LOCUS       name                      19 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  description.
ACCESSION   id
VERSION     id
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     gene            4..15
     domain          4..9
                     /Note="Region of interest"
     domain          complement(16..19)
                     /gene="example_domain"
     CDS             join(4..9,13..15)
                     /gene="example_gene"
ORIGIN
        1 aaaatgcgta cgtgaacgt
//

We can even extract a protein record as follows (see how the protein sequence is MR, skipping the intron):

sub_record = dummy_record.features[-1].extract(dummy_record)

print(sub_record.translate())

ID: id
Name: name
Description: description
Number of features: 0
/molecule_type=DNA
ProteinSeq('MR')

Standard Feature Types and Qualifiers

pydna and Bio.SeqFeature suppports all the conventional feature types through the type parameters. A non-exhaustive list include gene, CDS, promoter, exon, intron, 5’ UTR, 3’ UTR, terminator, enhancer, and RBS. You can also define custom features, which could be useful for synthetic biology applications. For instance, you might want to have Bio_brick or spacer features to describe a synthetic standardised plasmid construct.

It is important to note that while pydna and Bio.SeqFeature does not restrict the feature types you can use, sticking to standard types helps maintain compatibility with other bioinformatics tools and databases. Please refer to the official GenBank_Feature_Table, that lists the standard feature types and their associated qualifiers.

Further documentation for SeqFeature, CompoundLocation, and FeatureLocation can be found in the SeqFeature module here.

Handling Origin Spanning Features

An origin spanning feature is a special type of feature that crosses over a circular sequence’s origin. In pydna, such a feature is represented as a feature with parts, joining the part of the sequence immediately before the origin and immediately after the origin. They can be added using CompoundLocation as normal.

An origin spanning feature, between base 19 to base 6, in a 25bp long circular sequence, is represented like so:

type: gene 
location: join{[19:25](+), [0:6](+)}  
qualifiers: gene, Value: example_gene  

This feature will be displayed as a single feature in SnapGene viewer and Benchling, since they support this convention.

circular_record = Dseqrecord('ACGTGAaaaaaaaaaaaaaATGCGT', circular=True)

location = [FeatureLocation(19,25), FeatureLocation(0, 6)]
ori_feat_location = CompoundLocation(location)
ori_feature = SeqFeature(location=ori_feat_location, type="misc", qualifiers={"gene": ["example origin spanning gene"]})
circular_record.features.append(ori_feature)

print('>> Feature:')
print(circular_record.features[-1])

# Note how the feature sequence is extracted properly across the origin.
print('>> Feature sequence:')
print(circular_record.features[-1].extract(circular_record).seq)
print()

print('>> GenBank format:')
print(circular_record.format("genbank"))

>> Feature:
type: misc
location: join{[19:25], [0:6]}
qualifiers:
    Key: gene, Value: ['example origin spanning gene']

>> Feature sequence:
ATGCGTACGTGA

>> GenBank format:
LOCUS       name                      25 bp    DNA     circular UNK 01-JAN-1980
DEFINITION  description.
ACCESSION   id
VERSION     id
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     misc            join(20..25,1..6)
                     /gene="example origin spanning gene"
ORIGIN
        1 acgtgaaaaa aaaaaaaaaa tgcgt
//

Other Methods to Viewing Features

pydna also provides the list_features method as a simple way to list all the features in a Dseqrecord object.

print(sample_record.list_features())

+-----+------------------+-----+-------+-------+------+--------+------+
| Ft# | Label or Note    | Dir | Sta   | End   |  Len | type   | orf? |
+-----+------------------+-----+-------+-------+------+--------+------+
|   0 | nd               | --> | 0     | 5028  | 5028 | source |  no  |
|   1 | nd               | --> | <0    | >206  |  206 | mRNA   |  no  |
|   2 | nd               | --> | <0    | 206   |  206 | CDS    |  no  |
|   3 | nd               | --> | <686  | >3158 | 2472 | gene   | yes  |
|   4 | nd               | --> | <686  | >3158 | 2472 | mRNA   | yes  |
|   5 | N:plasma membran | --> | 686   | 3158  | 2472 | CDS    | yes  |
|   6 | nd               | <-- | <3299 | >4037 |  738 | gene   | yes  |
|   7 | nd               | <-- | <3299 | >4037 |  738 | mRNA   | yes  |
|   8 | nd               | <-- | 3299  | 4037  |  738 | CDS    | yes  |
+-----+------------------+-----+-------+-------+------+--------+------+

This method is convenient for checking-out a brief overview of each feature, without reading through an entire sequence record.

Alternatively, we can look for specific features using their qualifiers. For instance:

# Filter based on feature type
print('Getting all CDS features:')
cds_features = [f for f in sample_record.features if f.type == "CDS"]
for feature in cds_features:
    print(feature)

Getting all CDS features:
type: CDS
location: [<0:206](+)
qualifiers:
    Key: codon_start, Value: ['3']
    Key: product, Value: ['TCP1-beta']
    Key: protein_id, Value: ['AAA98665.1']
    Key: translation, Value: ['SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM']

type: CDS
location: [686:3158](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: gene, Value: ['AXL2']
    Key: note, Value: ['plasma membrane glycoprotein']
    Key: product, Value: ['Axl2p']
    Key: protein_id, Value: ['AAA98666.1']
    Key: translation, Value: ['MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML']

type: CDS
location: [3299:4037](-)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: gene, Value: ['REV7']
    Key: product, Value: ['Rev7p']
    Key: protein_id, Value: ['AAA98667.1']
    Key: translation, Value: ['MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESIFGSLF']

# Find a particular feature by its qualifier (e.g. gene name)
rev7_cds_feature = next(f for f in sample_record.features if 
     f.type == "gene" and
     "gene" in f.qualifiers and "REV7" in f.qualifiers["gene"]
     )

print(rev7_cds_feature)

type: gene
location: [<3299:>4037](-)
qualifiers:
    Key: gene, Value: ['REV7']

If you would like to search for another type of features, simply replace the "gene" with your desired feature type in quotation marks.

Removing Features

In pydna, we can search for the feature that we would like to remove using the feature’s types or qualififers. For instance, we can modify the features list to exclude all CDS:

#Remove all CDS type features from my feature list
sample_record.features = [f for f in sample_record.features if not (f.type == "CDS")]

for feature in sample_record.features:
    print(feature)

type: source
location: [0:5028](+)
qualifiers:
    Key: chromosome, Value: ['IX']
    Key: db_xref, Value: ['taxon:4932']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Saccharomyces cerevisiae']

type: mRNA
location: [<0:>206](+)
qualifiers:
    Key: product, Value: ['TCP1-beta']

type: gene
location: [<686:>3158](+)
qualifiers:
    Key: gene, Value: ['AXL2']

type: mRNA
location: [<686:>3158](+)
qualifiers:
    Key: gene, Value: ['AXL2']
    Key: product, Value: ['Axl2p']

type: gene
location: [<3299:>4037](-)
qualifiers:
    Key: gene, Value: ['REV7']

type: mRNA
location: [<3299:>4037](-)
qualifiers:
    Key: gene, Value: ['REV7']
    Key: product, Value: ['Rev7p']

We can also modify the features list to exclude a specific gene:

#Exclude REV7 from my feature list
sample_record.features = [f for f in sample_record.features if not ('gene' in f.qualifiers and 'REV7' in f.qualifiers['gene'])]

for feature in sample_record.features:
    print(feature)

type: source
location: [0:5028](+)
qualifiers:
    Key: chromosome, Value: ['IX']
    Key: db_xref, Value: ['taxon:4932']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Saccharomyces cerevisiae']

type: mRNA
location: [<0:>206](+)
qualifiers:
    Key: product, Value: ['TCP1-beta']

type: gene
location: [<686:>3158](+)
qualifiers:
    Key: gene, Value: ['AXL2']

type: mRNA
location: [<686:>3158](+)
qualifiers:
    Key: gene, Value: ['AXL2']
    Key: product, Value: ['Axl2p']