pydna.genbankfixer

This module provides the gbtext_clean() function which can clean up broken Genbank files enough to pass the BioPython Genbank parser

Almost all of this code was lifted from BioJSON (https://github.com/levskaya/BioJSON) by Anselm Levskaya. The original code was not accompanied by any software licence. This parser is based on pyparsing.

There are some modifications to deal with fringe cases.

The parser first produces JSON as an intermediate format which is then formatted back into a string in Genbank format.

The parser is not complete, so some fields do not survive the roundtrip (see below). This should not be a difficult fix. The returned result has two properties, .jseq which is the intermediate JSON produced by the parser and .gbtext which is the formatted genbank string.

pydna.genbankfixer.parseGBLoc(s, l_, t)[source]: retwingles parsed genbank location strings, assumes no joins of RC and FWD sequences

pydna.genbankfixer.strip_multiline(s, l_, t)[source]

pydna.genbankfixer.toInt(s, l_, t)[source]

pydna.genbankfixer.strip_indent(str)[source]

pydna.genbankfixer.concat_dict(dlist)[source]: more or less dict(list of string pairs) but merges vals with the same keys so no duplicates occur

pydna.genbankfixer.toJSON(gbkstring)[source]

pydna.genbankfixer.wrapstring(str_, rowstart, rowend, padfirst=True)[source]: wraps the provided string in lines of length rowend-rowstart and padded on the left by rowstart. -> if padfirst is false the first line is not padded

pydna.genbankfixer.locstr(locs, strand)[source]: genbank formatted location string, assumes no join’d combo of rev and fwd seqs

pydna.genbankfixer.originstr(sequence)[source]: formats dna sequence as broken, numbered lines ala genbank

pydna.genbankfixer.toGB(jseq)[source]: parses json jseq data and prints out ApE compatible genbank

pydna.genbankfixer.gbtext_clean(gbtext)[source]

This function takes a string containing one genbank sequence in Genbank format and returns a named tuple containing two fields, the gbtext containing a string with the corrected genbank sequence and jseq which contains the JSON intermediate.

Examples

>>> s = '''LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013
... DEFINITION  .
... ACCESSION
... VERSION
... SOURCE      .
...   ORGANISM  .
... COMMENT
... COMMENT     ApEinfo:methylated:1
... ORIGIN
...         1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)  
/home/bjorn/anaconda3/envs/bjorn36/lib/python3.6/site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013\n'
  "correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 48, in read
    results = results.pop()
IndexError: pop from empty list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 50, in read
    raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS       New_DNA                    3 bp ds-DNA     circular SYN 19-JUN-2013
DEFINITION  .
ACCESSION
VERSION
SOURCE      .
ORGANISM  .
COMMENT
COMMENT     ApEinfo:methylated:1
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS       New_DNA                    3 bp    DNA     circular SYN 19-JUN-2013
DEFINITION  .
ACCESSION   New_DNA
VERSION     New_DNA
KEYWORDS    .
SOURCE
  ORGANISM  .
            .
COMMENT
            ApEinfo:methylated:1
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//