pydna.genbankfixer
This module provides the gbtext_clean() function, which can clean up broken GenBank files enough for them to pass the Biopython GenBank parser.
Almost all of this code was lifted from BioJSON (https://github.com/levskaya/BioJSON) by Anselm Levskaya. The original code was not accompanied by any software licence. This parser is based on pyparsing.
There are some modifications to deal with fringe cases.
The parser first produces JSON as an intermediate format which is then formatted back into a string in Genbank format.
The parser is not complete, so some fields do not survive the round trip (see below). This should not be difficult to fix. The returned result has two properties: .jseq, the intermediate JSON produced by the parser, and .gbtext, the formatted GenBank string.
- pydna.genbankfixer.parseGBLoc(s, l_, t)
Reworks parsed GenBank location strings. Assumes no joins that mix reverse-complement (RC) and forward (FWD) sequences.
- pydna.genbankfixer.concat_dict(dlist)
More or less dict(list of string pairs), but merges the values of pairs that share a key, so no duplicate keys occur.
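The difference from a plain dict() call can be shown with a small sketch. This is not the library's code: concat_pairs_sketch is a hypothetical name, and merging repeated keys by string concatenation is an assumption about what "merges" means here.

```python
def concat_pairs_sketch(pairs):
    """Build a dict from (key, value) string pairs, merging repeated keys."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            # Assumption: values of a repeated key are joined by concatenation.
            merged[key] += value
        else:
            merged[key] = value
    return merged


pairs = [("note", "first "), ("gene", "lacZ"), ("note", "second")]
merged = concat_pairs_sketch(pairs)
plain = dict(pairs)  # a plain dict() keeps only the last value for "note"
print(merged)
print(plain)
```

With a plain dict() the earlier "note" value is silently overwritten, which is the loss the merging step is meant to avoid.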
- pydna.genbankfixer.wrapstring(str_, rowstart, rowend, padfirst=True)
Wraps the provided string into lines of length rowend - rowstart, each padded on the left by rowstart spaces. If padfirst is False, the first line is not padded.
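The wrapping rule described for wrapstring can be sketched as follows. wrap_sketch is a hypothetical stand-in, not the pydna implementation; it assumes a hard wrap at a fixed width (no attempt to break on whitespace) and no trailing newline.

```python
def wrap_sketch(text, rowstart, rowend, padfirst=True):
    """Hard-wrap text into rows of width rowend - rowstart, left-padded by rowstart."""
    width = rowend - rowstart
    if width <= 0:
        raise ValueError("rowend must be greater than rowstart")
    pad = " " * rowstart
    # Cut the text into fixed-width chunks; keep one empty chunk for empty input.
    chunks = [text[i:i + width] for i in range(0, len(text), width)] or [""]
    lines = [pad + chunk for chunk in chunks]
    if not padfirst:
        # The first row starts at column 0, e.g. because a field name precedes it.
        lines[0] = chunks[0]
    return "\n".join(lines)


wrapped = wrap_sketch("abcdefgh", 2, 5)
unpadded_first = wrap_sketch("abcdefgh", 2, 5, padfirst=False)
print(wrapped)
print(unpadded_first)
```

Leaving the first row unpadded is useful when the caller has already written a keyword such as COMMENT at the start of that row.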
- pydna.genbankfixer.locstr(locs, strand)
Returns a GenBank-formatted location string. Assumes no joined combination of reverse and forward sequences.
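As an illustration of what a GenBank location string looks like, here is a minimal sketch. The input shapes are assumptions, not the documented signature: locs is taken to be a non-empty list of 1-based inclusive (start, end) pairs and strand to be 1 or -1. locstr_sketch is a hypothetical name.

```python
def locstr_sketch(locs, strand):
    """Format (start, end) pairs as a GenBank-style location string."""
    # Assumption: locs is non-empty and every pair lies on the same strand.
    parts = [f"{start}..{end}" for start, end in locs]
    text = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    if strand == -1:
        # A reverse-strand feature wraps the whole location in complement().
        text = f"complement({text})"
    return text


single = locstr_sketch([(1, 3)], 1)
joined_reverse = locstr_sketch([(1, 3), (7, 9)], -1)
print(single)
print(joined_reverse)
```

Because one strand value applies to the whole location, a join that mixes forward and reverse segments cannot be expressed, which matches the limitation stated above.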
- pydna.genbankfixer.originstr(sequence)
Formats a DNA sequence as broken, numbered lines à la GenBank.
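The numbered-line layout can be sketched like this, assuming the usual GenBank convention of 60 bases per row, split into blocks of 10, with the 1-based position of the first base right-aligned in a 9-character column. origin_lines_sketch is a hypothetical helper; whether originstr also emits the ORIGIN header and the closing // is not covered here.

```python
def origin_lines_sketch(sequence, per_line=60, block=10):
    """Format a sequence as numbered rows of space-separated blocks."""
    lines = []
    for start in range(0, len(sequence), per_line):
        row = sequence[start:start + per_line]
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        # Right-align the 1-based position of the row's first base in 9 columns.
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    return "\n".join(lines)


short = origin_lines_sketch("aaa")
long_rows = origin_lines_sketch("acgt" * 20).split("\n")  # 80 bases, two rows
print(short)
print("\n".join(long_rows))
```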
- pydna.genbankfixer.gbtext_clean(gbtext)
Takes a string containing one sequence in GenBank format and returns a named tuple with two fields: gbtext, a string with the corrected GenBank sequence, and jseq, the JSON intermediate.
Examples
>>> s = '''LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
... DEFINITION .
... ACCESSION
... VERSION
... SOURCE .
... ORGANISM .
... COMMENT
... COMMENT ApEinfo:methylated:1
... ORIGIN
... 1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)
/home/bjorn/anaconda3/envs/bjorn36/lib/python3.6/site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013\n'
  "correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 48, in read
    results = results.pop()
IndexError: pop from empty list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 50, in read
    raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS New_DNA 3 bp ds-DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION
VERSION
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS New_DNA 3 bp DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION New_DNA
VERSION New_DNA
KEYWORDS .
SOURCE
ORGANISM .
.
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
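The example unpacks the result as s2, j2, which works because the return value is a named tuple. The sketch below imitates that shape with the standard library only, so it runs without pydna. CleanResult is a hypothetical type name and the field values are placeholders; only the field names gbtext and jseq and their order come from the text above.

```python
from collections import namedtuple

# Hypothetical stand-in for the value returned by gbtext_clean().
CleanResult = namedtuple("CleanResult", ["gbtext", "jseq"])

# Placeholder values; the real fields hold the cleaned text and the JSON intermediate.
result = CleanResult(gbtext="<cleaned GenBank text>", jseq="<JSON intermediate>")

# Positional unpacking, as in the doctest above.
text, intermediate = result

# Attribute access returns the same objects and reads more clearly in longer code.
print(result.gbtext is text, result.jseq is intermediate)
```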