Working with Features using the Dseqrecord class
Before working with features, check how to import sequences from files in the Importing_Seqs notebook.
For full library documentation, visit here.
Some sequence file formats (like Genbank) include features, describing key biological properties of sequence regions. In Genbank, features “include genes, gene products, as well as regions of biological significance reported in the sequence.” (See here for a description of a Genbank file and associated terminologies/annotations) Examples include coding sequences (CDS), introns, promoters, etc.
pydna offers many ways to easily view, add, extract, and write features into a Genbank file via the Dseqrecord
class. After reading a file into a Dseqrecord
object, we can check out the list of features in the record using the following code. This example uses the sample record U49845.
# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
%%capture
# Install the current development version of pydna (comment to install pip version)
!pip install git+https://github.com/BjornFJohansson/pydna@dev_bjorn
# Install pip version instead (uncomment to install)
# !pip install pydna
from pydna.dseqrecord import Dseqrecord
from pydna.parsers import parse
#Import your file into python.
file_path = "./U49845.gb"
records = parse(file_path)
sample_record = records[0]
# List all features
for feature in sample_record.features:
print(feature)
type: source
location: [0:5028](+)
qualifiers:
Key: chromosome, Value: ['IX']
Key: db_xref, Value: ['taxon:4932']
Key: mol_type, Value: ['genomic DNA']
Key: organism, Value: ['Saccharomyces cerevisiae']
type: mRNA
location: [<0:>206](+)
qualifiers:
Key: product, Value: ['TCP1-beta']
type: CDS
location: [<0:206](+)
qualifiers:
Key: codon_start, Value: ['3']
Key: product, Value: ['TCP1-beta']
Key: protein_id, Value: ['AAA98665.1']
Key: translation, Value: ['SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM']
type: gene
location: [<686:>3158](+)
qualifiers:
Key: gene, Value: ['AXL2']
type: mRNA
location: [<686:>3158](+)
qualifiers:
Key: gene, Value: ['AXL2']
Key: product, Value: ['Axl2p']
type: CDS
location: [686:3158](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: gene, Value: ['AXL2']
Key: note, Value: ['plasma membrane glycoprotein']
Key: product, Value: ['Axl2p']
Key: protein_id, Value: ['AAA98666.1']
Key: translation, Value: ['MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML']
type: gene
location: [<3299:>4037](-)
qualifiers:
Key: gene, Value: ['REV7']
type: mRNA
location: [<3299:>4037](-)
qualifiers:
Key: gene, Value: ['REV7']
Key: product, Value: ['Rev7p']
type: CDS
location: [3299:4037](-)
qualifiers:
Key: codon_start, Value: ['1']
Key: gene, Value: ['REV7']
Key: product, Value: ['Rev7p']
Key: protein_id, Value: ['AAA98667.1']
Key: translation, Value: ['MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESIFGSLF']
Additional ways to view and search for particular features are shown at the bottom of the page under “Other Methods to Viewing Features”
Adding Features and Qualifiers
To add new feature to describe a region of interest to a record, for instance a region that you would like to perform a PCR, you need to create a SeqFeature
(sequence feature). The minimal information required is:
A
FeatureLocation
: position of the feature in the sequence.The
type
of feature you want to add.
🚨🚨 VERY IMPORTANT 🚨🚨. Note that FeatureLocation
s are like python ranges (zero-based open intervals), whereas in GenBank files, locations are one-based closed intervals. For instance, the following code adds a new feature from the 2nd to the 5th nucleotide (FeatureLocation(3, 15)
), of the gene
type, but in the GenBank file will be represented as 4..15
.
from Bio.SeqFeature import FeatureLocation, SeqFeature
# Create a dummy record
dummy_record = Dseqrecord("aaaATGCGTACGTGAacgt")
# Define the locations of a CDS
location = FeatureLocation(3, 15)
# Create a SeqFeature with the type mRNA
my_feature = SeqFeature(location=location, type="gene")
# Add my_feature to dummy_record with .append
dummy_record.features.append(my_feature)
# Confirm that my_feature has been added
print(dummy_record.features[-1])
# Print the feature in GenBank format (see how the location is `4..15`)
print(dummy_record.format("genbank"))
type: gene
location: [3:15]
qualifiers:
LOCUS name 19 bp DNA linear UNK 01-JAN-1980
DEFINITION description.
ACCESSION id
VERSION id
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
gene 4..15
ORIGIN
1 aaaatgcgta cgtgaacgt
//
To give further information about a feature, we can add a qualifier using the qualifiers
property of SeqFeature
, which contains a dictionary of qualifiers. For instance, if I would like to note a new feature of type ‘domain’, between 3-9 bases as my region of interest, I can instantiate the SeqFeature
class object as such.
Note that a new feature is always added to the last position of the features list.
location = FeatureLocation(3, 9)
# Create a SeqFeature with a qualifier
my_feature2 = SeqFeature(location=location, type="domain", qualifiers={"Note": ["Region of interest"]})
# Add my_feature to my_record with .append
dummy_record.features.append(my_feature2)
# Confirm that my_feature has been added
print('>> Feature was added:')
print(dummy_record.features[-1])
print()
# Print the feature in GenBank format
print('>> GenBank format:')
print(dummy_record.format("genbank"))
>> Feature was added:
type: domain
location: [3:9]
qualifiers:
Key: Note, Value: ['Region of interest']
>> GenBank format:
LOCUS name 19 bp DNA linear UNK 01-JAN-1980
DEFINITION description.
ACCESSION id
VERSION id
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
gene 4..15
domain 4..9
/Note="Region of interest"
ORIGIN
1 aaaatgcgta cgtgaacgt
//
🤔 Best practices for qualifiers:
The values in the qualifiers
dictionary should be lists. The reason for this is that in a GenBank file, a single feature can have multiple values for a single qualifier. Below is a real world of the ase1 CDS example from the S. pombe genome in EMBL format:
FT CDS join(1878362..1878785,1878833..1880604)
FT /colour=2
FT /primary_name="ase1"
FT /product="antiparallel microtubule cross-linking factor
FT Ase1"
FT /systematic_id="SPAPB1A10.09"
FT /controlled_curation="term=species distribution, conserved
FT in eukaryotes; date=20081110"
FT /controlled_curation="term=species distribution, conserved
FT in metazoa; date=20081110"
FT /controlled_curation="term=species distribution, conserved
FT in vertebrates; date=20081110"
FT /controlled_curation="term=species distribution,
FT predominantly single copy (one to one); date=20081110"
FT /controlled_curation="term=species distribution, conserved
FT in fungi; date=20081110"
FT /controlled_curation="term=species distribution, conserved
FT in eukaryotes only; date=20081110"
Note how there are several controlled_curation
qualifiers, therefore it makes sense to store them as a list.
By default, you can add any type of object in the qualifiers dictionary, and most things will work if you add a string. However, you risk overwriting the existing value for a qualifier, so best practice is:
Check if the qualifier already exists using
if "qualifier_name" in feature.qualifiers
If it exists, append to the existing list of values using
feature.qualifiers["qualifier_name"].append("new_value")
If it does not exist, add it to the qualifiers dictionary using
feature.qualifiers["qualifier_name"] = ["new_value"]
Note that Bio.SeqFeatures
does not automatically assume a sequence strand for the feature. If you would like to refer to a feature on the positive or minus strand, you can add a parameter in FeatureLocation
specifying strand=+1
or strand=-1
.
#Create a location specifying the minus strand
location = FeatureLocation(15, 19, strand=-1)
my_feature3 = SeqFeature(location=location, type="domain", qualifiers={"gene":["example_domain"]})
dummy_record.features.append(my_feature3)
print(dummy_record.features[-1])
print(dummy_record.format("genbank"))
type: domain
location: [15:19](-)
qualifiers:
Key: gene, Value: ['example_domain']
LOCUS name 19 bp DNA linear UNK 01-JAN-1980
DEFINITION description.
ACCESSION id
VERSION id
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
gene 4..15
domain 4..9
/Note="Region of interest"
domain complement(16..19)
/gene="example_domain"
ORIGIN
1 aaaatgcgta cgtgaacgt
//
Adding a Feature with Parts
To add a feature with parts, like a CDS with introns, we need to use a CompoundLocation
object when creating a SeqFeature
.
The example code below adds a CDS with two parts, between 3-9bp and 12-15bp, to my features list. In a real-world scenario this would represent a CDS with an intron that skips the ACG
codon: ATGCGT~~ACG~~TGA
from Bio.SeqFeature import CompoundLocation
# Define the locations of the CDS
locations = [FeatureLocation(3, 9), FeatureLocation(12, 15)]
# Create a compound location from these parts
compound_location = CompoundLocation(locations)
# Create a SeqFeature with this compound location, including type and qualifiers.
cds_feature = SeqFeature(location=compound_location, type="CDS", qualifiers={"gene": ["example_gene"]})
# Add the feature to the Dseqrecord
dummy_record.features.append(cds_feature)
print(dummy_record.features[-1])
print(dummy_record.format("genbank"))
type: CDS
location: join{[3:9], [12:15]}
qualifiers:
Key: gene, Value: ['example_gene']
LOCUS name 19 bp DNA linear UNK 01-JAN-1980
DEFINITION description.
ACCESSION id
VERSION id
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
gene 4..15
domain 4..9
/Note="Region of interest"
domain complement(16..19)
/gene="example_domain"
CDS join(4..9,13..15)
/gene="example_gene"
ORIGIN
1 aaaatgcgta cgtgaacgt
//
We can even extract a protein record as follows (see how the protein sequence is MR
, skipping the intron):
sub_record = dummy_record.features[-1].extract(dummy_record)
print(sub_record.translate())
ID: id
Name: name
Description: description
Number of features: 0
/molecule_type=DNA
ProteinSeq('MR')
Standard Feature Types and Qualifiers
pydna
and Bio.SeqFeature
suppports all the conventional feature types through the type
parameters. A non-exhaustive list include gene, CDS, promoter, exon, intron, 5’ UTR, 3’ UTR, terminator, enhancer, and RBS. You can also define custom features, which could be useful for synthetic biology applications. For instance, you might want to have Bio_brick or spacer features to describe a synthetic standardised plasmid construct.
It is important to note that while pydna
and Bio.SeqFeature
does not restrict the feature types you can use, sticking to standard types helps maintain compatibility with other bioinformatics tools and databases. Please refer to the official GenBank_Feature_Table, that lists the standard feature types and their associated qualifiers.
Further documentation for SeqFeature
, CompoundLocation
, and FeatureLocation
can be found in the SeqFeature
module here.
Handling Origin Spanning Features
An origin spanning feature is a special type of feature that crosses over a circular sequence’s origin. In pydna, such a feature is represented as a feature with parts, joining the part of the sequence immediately before the origin and immediately after the origin. They can be added using CompoundLocation
as normal.
An origin spanning feature, between base 19 to base 6, in a 25bp long circular sequence, is represented like so:
type: gene
location: join{[19:25](+), [0:6](+)}
qualifiers: gene, Value: example_gene
This feature will be displayed as a single feature in SnapGene viewer and Benchling, since they support this convention.
circular_record = Dseqrecord('ACGTGAaaaaaaaaaaaaaATGCGT', circular=True)
location = [FeatureLocation(19,25), FeatureLocation(0, 6)]
ori_feat_location = CompoundLocation(location)
ori_feature = SeqFeature(location=ori_feat_location, type="misc", qualifiers={"gene": ["example origin spanning gene"]})
circular_record.features.append(ori_feature)
print('>> Feature:')
print(circular_record.features[-1])
# Note how the feature sequence is extracted properly across the origin.
print('>> Feature sequence:')
print(circular_record.features[-1].extract(circular_record).seq)
print()
print('>> GenBank format:')
print(circular_record.format("genbank"))
>> Feature:
type: misc
location: join{[19:25], [0:6]}
qualifiers:
Key: gene, Value: ['example origin spanning gene']
>> Feature sequence:
ATGCGTACGTGA
>> GenBank format:
LOCUS name 25 bp DNA circular UNK 01-JAN-1980
DEFINITION description.
ACCESSION id
VERSION id
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
misc join(20..25,1..6)
/gene="example origin spanning gene"
ORIGIN
1 acgtgaaaaa aaaaaaaaaa tgcgt
//
Other Methods to Viewing Features
pydna also provides the list_features
method as a simple way to list all the features in a Dseqrecord
object.
print(sample_record.list_features())
+-----+------------------+-----+-------+-------+------+--------+------+
| Ft# | Label or Note | Dir | Sta | End | Len | type | orf? |
+-----+------------------+-----+-------+-------+------+--------+------+
| 0 | nd | --> | 0 | 5028 | 5028 | source | no |
| 1 | nd | --> | <0 | >206 | 206 | mRNA | no |
| 2 | nd | --> | <0 | 206 | 206 | CDS | no |
| 3 | nd | --> | <686 | >3158 | 2472 | gene | yes |
| 4 | nd | --> | <686 | >3158 | 2472 | mRNA | yes |
| 5 | N:plasma membran | --> | 686 | 3158 | 2472 | CDS | yes |
| 6 | nd | <-- | <3299 | >4037 | 738 | gene | yes |
| 7 | nd | <-- | <3299 | >4037 | 738 | mRNA | yes |
| 8 | nd | <-- | 3299 | 4037 | 738 | CDS | yes |
+-----+------------------+-----+-------+-------+------+--------+------+
This method is convenient for checking-out a brief overview of each feature, without reading through an entire sequence record.
Alternatively, we can look for specific features using their qualifiers. For instance:
# Filter based on feature type
print('Getting all CDS features:')
cds_features = [f for f in sample_record.features if f.type == "CDS"]
for feature in cds_features:
print(feature)
Getting all CDS features:
type: CDS
location: [<0:206](+)
qualifiers:
Key: codon_start, Value: ['3']
Key: product, Value: ['TCP1-beta']
Key: protein_id, Value: ['AAA98665.1']
Key: translation, Value: ['SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM']
type: CDS
location: [686:3158](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: gene, Value: ['AXL2']
Key: note, Value: ['plasma membrane glycoprotein']
Key: product, Value: ['Axl2p']
Key: protein_id, Value: ['AAA98666.1']
Key: translation, Value: ['MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML']
type: CDS
location: [3299:4037](-)
qualifiers:
Key: codon_start, Value: ['1']
Key: gene, Value: ['REV7']
Key: product, Value: ['Rev7p']
Key: protein_id, Value: ['AAA98667.1']
Key: translation, Value: ['MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESIFGSLF']
# Find a particular feature by its qualifier (e.g. gene name)
rev7_cds_feature = next(f for f in sample_record.features if
f.type == "gene" and
"gene" in f.qualifiers and "REV7" in f.qualifiers["gene"]
)
print(rev7_cds_feature)
type: gene
location: [<3299:>4037](-)
qualifiers:
Key: gene, Value: ['REV7']
If you would like to search for another type of features, simply replace the "gene"
with your desired feature type in quotation marks.
Removing Features
In pydna, we can search for the feature that we would like to remove using the feature’s types or qualififers. For instance, we can modify the features list to exclude all CDS:
#Remove all CDS type features from my feature list
sample_record.features = [f for f in sample_record.features if not (f.type == "CDS")]
for feature in sample_record.features:
print(feature)
type: source
location: [0:5028](+)
qualifiers:
Key: chromosome, Value: ['IX']
Key: db_xref, Value: ['taxon:4932']
Key: mol_type, Value: ['genomic DNA']
Key: organism, Value: ['Saccharomyces cerevisiae']
type: mRNA
location: [<0:>206](+)
qualifiers:
Key: product, Value: ['TCP1-beta']
type: gene
location: [<686:>3158](+)
qualifiers:
Key: gene, Value: ['AXL2']
type: mRNA
location: [<686:>3158](+)
qualifiers:
Key: gene, Value: ['AXL2']
Key: product, Value: ['Axl2p']
type: gene
location: [<3299:>4037](-)
qualifiers:
Key: gene, Value: ['REV7']
type: mRNA
location: [<3299:>4037](-)
qualifiers:
Key: gene, Value: ['REV7']
Key: product, Value: ['Rev7p']
We can also modify the features list to exclude a specific gene:
#Exclude REV7 from my feature list
sample_record.features = [f for f in sample_record.features if not ('gene' in f.qualifiers and 'REV7' in f.qualifiers['gene'])]
for feature in sample_record.features:
print(feature)
type: source
location: [0:5028](+)
qualifiers:
Key: chromosome, Value: ['IX']
Key: db_xref, Value: ['taxon:4932']
Key: mol_type, Value: ['genomic DNA']
Key: organism, Value: ['Saccharomyces cerevisiae']
type: mRNA
location: [<0:>206](+)
qualifiers:
Key: product, Value: ['TCP1-beta']
type: gene
location: [<686:>3158](+)
qualifiers:
Key: gene, Value: ['AXL2']
type: mRNA
location: [<686:>3158](+)
qualifiers:
Key: gene, Value: ['AXL2']
Key: product, Value: ['Axl2p']