WPDB

Wide Protein Data Bank Format

Work funded in part by the U.S. Department of Energy under award ER63601-1021466-0009501

Frances C. Bernstein, Bernstein + Sons, Bellport, NY, USA
Herbert J. Bernstein, Dowling College, Oakdale, NY, USA

Note: this web page, http://wpdb.arcib.org, was formerly hosted at http://biomol.dowling.edu/WPDB. Please change your bookmarks accordingly.

The Protein Data Bank is the world-wide repository for the results of macromolecular structure determination. The macromolecular CIF (mmCIF) format [5] was developed to capture all the detailed searchable relationships among data items and to overcome the field-width limitations of the old 80-column PDB format [2][4]. While mmCIF has been of great value to the internal operation of the PDB and has simplified the creation of powerful search engines, users of the PDB have been reluctant to make the transition from the easily-read fixed-field PDB format to mmCIF, and most existing macromolecular software remains unable to read mmCIF directly.

We propose a new fixed-field wide PDB format (WPDB) that will carry all the information provided by mmCIF using record formats very similar to those in the existing PDB format. By increasing the widths of many fields (especially atom number, atom name, chain identifier and coordinates), we should overcome the major deficiencies of the old PDB format, including the 99,999 atom limitation and the idiosyncratic atom naming forced by the 4-column field width, especially for hydrogens. In addition, we propose new record types to carry the new information from the mmCIF files (e.g. the more detailed information on secondary structure and biological function). We are creating simple external translation filter programs to go back and forth to mmCIF format and, where the data permits, to the existing PDB format. Software developers should find it completely straightforward to revise their input and output routines to process wide PDB (WPDB) and anyone will be able to use the filters.

This project is an extension of the DOE-funded BIOMOL project (ER63601-1021466-0009501) to enhance access to the contents of the PDB through improvements in software external to the database. Just as the base proposal works to enhance access by extending RasMol [3][7] to increase local functionality in management of biological units, the WPDB project works to enhance access by providing software tools that allow easy adaptation of applications to a format supporting the greater detail and size of structures not easily handled in the old PDB format. Indeed, we will demonstrate the ease of this adaptation by adding both WPDB input and WPDB output capabilities to RasMol.

The managers of the RCSB PDB have implicitly recognized the need for a format going beyond both the old PDB format and mmCIF. They have applied the mmCIF dictionary to XML format to produce a new composite format, PDBML [8]. Unfortunately, this format exacerbates the difficulties that application developers face with mmCIF by combining the order independence of mmCIF with the syntactic verbosity of XML. Its authors also propose a new atom record format that partially returns to the efficiency and fixed ordering of the old PDB format. It is time to take this idea to its logical conclusion and provide a fixed-field, easily parsed format for all PDB entries, including the newer ones.

WPDB Paradigm and Examples of WPDB format

The original design paradigm for the WPDB format was:

  1. WPDB is a fixed field format with records in a specified order.
  2. PDB record types and fields will appear in WPDB in their original order to simplify adaptation of existing software to the new format.
  3. Records and fields will be provided to handle data from all currently defined mmCIF data names.
  4. Fields within records will be made wide enough to accommodate at least 9,999,999 atoms and 9,999 chains.
  5. Where possible, faithful translations will be provided between PDB and WPDB, between mmCIF and WPDB and between PDBML and WPDB.

The last design requirement would seem to conflict with the requirement for a fixed field format, since mmCIF format allows for arbitrarily long chain identifiers. We originally planned to resolve this conflict by adding record types for translation between long and short chain identifiers, so that atom records could use short identifiers that had been linked by an earlier translation table to longer names. At present we are exploring a simpler solution: use of field-by-field continuation when needed.
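
The translation-table idea described above can be pictured with a short Python sketch. It illustrates only the originally planned approach: the function name `build_chain_table`, the 4-column width and the numeric short codes are assumptions made for this example, not part of any WPDB record definition.

```python
def build_chain_table(long_ids, width=4):
    """Map each distinct long chain identifier to a short fixed-width code.

    Hypothetical scheme: codes are sequence numbers, right-justified in
    `width` columns, assigned in order of first appearance.
    """
    table = {}
    for long_id in long_ids:
        if long_id not in table:
            code = str(len(table) + 1)
            if len(code) > width:
                raise ValueError("more chains than the short field can hold")
            table[long_id] = code.rjust(width)
    return table


# Made-up long identifiers, as an mmCIF file might carry them.
chains = ["heavy_chain_A", "light_chain_B", "heavy_chain_A"]
table = build_chain_table(chains)

# Atom records would then carry only the short, fixed-width codes.
short_ids = [table[c] for c in chains]
print(table)
print(short_ids)
```

A table like this would have to be written out ahead of the atom records so that a reader could recover the long names, which is the extra bookkeeping that field-by-field continuation avoids.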

The PDB format atom naming convention will be retained, using fixed column positions to distinguish carbon-alpha from calcium, but the field will be extended to at least 6 characters to allow for proper hydrogen naming without the need to wrap the field around. Consider the old PDB format CRYST1, ORIGX, SCALE and ATOM records from 3CRO [6]:

.........1.........2.........3.........4.........5.........6.........7
1234567890123456789012345678901234567890123456789012345678901234567890
CRYST1   49.200   47.600   61.700  90.00 109.50  90.00 P 21          2 
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.020325  0.000000  0.007198        0.00000
SCALE2      0.000000  0.021008  0.000000        0.00000
SCALE3      0.000000  0.000000  0.017194        0.00000
ATOM      5  O5*   A A   1     -16.851  -5.543  74.981  1.00 55.62
ATOM      6  C5*   A A   1     -18.254  -5.683  75.238  1.00 51.97
ATOM      7  C4*   A A   1     -18.600  -7.125  75.571  1.00 37.32
ATOM      8  O4*   A A   1     -19.740  -7.166  76.456  1.00 26.97
ATOM      9  C3*   A A   1     -18.978  -8.004  74.382  1.00 34.63

Larger cells argue for wider fields for the cell edges in the CRYST1 record, for at least the translations in the ORIGX records, for all the fields in the SCALE records and for the coordinates in the ATOM records. A wider atom serial number field is needed to avoid the repeated serial numbers already seen in some NMR entries, among others. A wider format, as below, would require only minor changes to the format edit descriptors (e.g. 'a10' or '3x') in applications.

.........1.........2.........3.........4.........5.........6.........7.........8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
CRYST1     49.200     47.600     61.700  90.00 109.50  90.00 P 21          2 
ORIGX1        1.000000      0.000000      0.000000          0.00000
ORIGX2        0.000000      1.000000      0.000000          0.00000
ORIGX3        0.000000      0.000000      1.000000          0.00000
SCALE1        0.020325      0.000000      0.007198          0.00000
SCALE2        0.000000      0.021008      0.000000          0.00000
SCALE3        0.000000      0.000000      0.017194          0.00000
ATOM         5  O5*     A     A     1       -16.851    -5.543    74.981  1.00 55.62
ATOM         6  C5*     A     A     1       -18.254    -5.683    75.238  1.00 51.97
ATOM         7  C4*     A     A     1       -18.600    -7.125    75.571  1.00 37.32
ATOM         8  O4*     A     A     1       -19.740    -7.166    76.456  1.00 26.97
ATOM         9  C3*     A     A     1       -18.978    -8.004    74.382  1.00 34.63
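
Both listings are meant to be read the same way: each field is sliced out of a fixed column range and converted. The Python sketch below shows this for the old-format ATOM record. The column ranges are those of the 80-column PDB format [4]; the names `PDB_ATOM`, `parse_fixed` and `element_hint` are illustrative, and no wide column table is given because the authoritative positions are those in the WPDB draft, not anything assumed here.

```python
# Column table for the old 80-column ATOM record (1-based, inclusive),
# as laid out in the PDB format description [4]. Names are illustrative.
PDB_ATOM = [
    ("serial",   7, 11, int),
    ("name",    13, 16, str),
    ("resname", 18, 20, str),
    ("chain",   22, 22, str),
    ("resseq",  23, 26, int),
    ("x",       31, 38, float),
    ("y",       39, 46, float),
    ("z",       47, 54, float),
    ("occ",     55, 60, float),
    ("bfac",    61, 66, float),
]


def parse_fixed(line, layout):
    """Slice each field out of its column range and convert it."""
    record = {}
    for name, first, last, convert in layout:
        text = line[first - 1:last].strip()
        record[name] = convert(text)
    return record


def element_hint(name_field):
    """Element symbol implied by the first two columns of the 4-column name.

    " CA " (alpha carbon) gives "C"; "CA  " (calcium) gives "CA".
    Wrapped hydrogen names such as "1HD1" defeat this simple rule.
    """
    return name_field[:2].strip()


# First ATOM record of the 3CRO excerpt shown above.
SAMPLE = "ATOM      5  O5*   A A   1     -16.851  -5.543  74.981  1.00 55.62"
atom = parse_fixed(SAMPLE, PDB_ATOM)
print(atom["serial"], atom["name"], atom["chain"], atom["x"], atom["bfac"])
print(repr(element_hint(SAMPLE[12:16])))
```

Supporting a wider layout would then mean supplying a second column table with the positions given in the WPDB draft; `parse_fixed` itself would not need to change, which is the same point as the remark about format edit descriptors.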

Since the original proposal, the format has evolved. It will now handle up to 999,999,999 atoms and 10-character chain names. The current strawman draft of the WPDB format is available as a 1.85 MB PDF here. The code used to produce the draft WPDB entries is available here. Comments and corrections are appreciated.

References

[1] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Research, 28:235–242, 2000. See http://www.rcsb.org

[2] F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112:535–542, 1977.

[3] H. J. Bernstein. Recent changes to RasMol, recombining the variants. Trends in Biochemical Sciences (TIBS), 25(9):453–455, September 2000.

[4] Protein data bank atomic coordinate and bibliographic entry format description. Technical report, Brookhaven National Laboratory, 1992. http://arcib.dowling.edu/~bernsteh/PDB_format_1992.pdf.

[5] P. E. Bourne, H. M. Berman, B. McMahon, K. D. Watenpaugh, J. Westbrook, and P. M. D. Fitzgerald. The macromolecular crystallographic information file (mmCIF). Methods in Enzymology, 277:571–590, 1997.

[6] A. Mondragon, C. Wolberger, and S. C. Harrison. Structure of phage 434 cro protein at 2.35 Å resolution. J. Mol. Biol., 205(1):179–188, 1989. PDB entry 3CRO, 1990.

[7] Roger Sayle and E. James Milner-White. RasMol: Biomolecular graphics for all. Trends in Biochemical Sciences (TIBS), 20(9):374, September 1995.

[8] J. Westbrook, N. Ito, H. Nakamura, K. Henrick, and H. M. Berman. PDBML: The representation of archival macromolecular structure data in XML. Bioinformatics, 21(7):988–992, 2005.