                             SITES documentation



CONTENTS

   1.0 SUMMARY
   2.0 INPUTS & OUTPUTS
   3.0 INPUT FILE FORMAT
   4.0 OUTPUT FILE FORMAT
   5.0 DATA FILES
   6.0 USAGE
   7.0 KNOWN BUGS & WARNINGS
   8.0 NOTES
   9.0 DESCRIPTION
   10.0 ALGORITHM
   11.0 RELATED APPLICATIONS
   12.0 DIAGNOSTIC ERROR MESSAGES
   13.0 AUTHORS
   14.0 REFERENCES

1.0 SUMMARY

   Generate residue-ligand CON files from CCF files

2.0 INPUTS & OUTPUTS

   SITES reads CCF files (clean coordinate files) and writes a CON file
   (contacts file) of residue-ligand contact data for the domains listed
   in a DCF file (domain classification file). The CON file contains
   contact data for all ligand-domain pairs (using domain definitions
   from the DCF file) found in the CCF files. The input and output files
   are specified by the user (file extensions are set in the ACD file). A
   log file is also written.

3.0 INPUT FILE FORMAT

   The format of the protein CCF file is described in the PDBPARSE
   documentation.

  Input files for usage example

  File: ../scopparse-structure/all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
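
   The DCF entries shown above can be read with a few lines of Python.
   This is a minimal sketch rather than the parser used by SITES: the
   function name parse_dcf is hypothetical, and it assumes only what the
   example shows (two-letter record codes, XX spacer lines, // as entry
   terminator, and SI items written as "number level;"). Repeated record
   codes within one entry would overwrite each other in this sketch.

```python
def parse_dcf(text):
    """Return one dict per DCF entry, keyed by two-letter record code."""
    entries, entry = [], {}
    for line in text.splitlines():
        code, body = line[:2], line[2:].strip()
        if code == "//":                 # end of the current entry
            entries.append(entry)
            entry = {}
        elif code == "SI":
            # "53931 CL; 54861 FO; ..." -> {"CL": 53931, "FO": 54861, ...}
            entry["SI"] = {}
            for item in body.split(";"):
                if item.strip():
                    number, level = item.split()
                    entry["SI"][level] = int(number)
        elif code.strip() and code != "XX":
            entry[code] = body           # plain text record (ID, EN, CL, ...)
    return entries


# First entry of the example DCF file above.
SAMPLE_DCF = """\
ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
"""

dcf_entries = parse_dcf(SAMPLE_DCF)
print(dcf_entries[0]["ID"], dcf_entries[0]["SI"]["DD"])
```

   In the example files the domain identifier is upper case in the DCF
   file (D1CS4A_) and lower case in the CON file (d1cs4a_), so any
   comparison between the two should ignore case.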

4.0 OUTPUT FILE FORMAT

   The CON format used for the contact files (Figure 1) is similar to EMBL
   format and is described in the CONTACTS documentation. A few of the
   records differ in the SITES output compared to the CONTACTS output,
   however, so for the sake of clarity all records are described below.
     * XX - used for spacing and comments. The first line is bibliographic
       information and contains the text "Residue-ligand contact data (for
       domains)".
     * TY - type of contact. For CON files generated by SITES, 'LIGAND' is
       always given.
     * EX - experimental information. The value of the threshold contact
       distance is given as a floating point number after 'THRESH'. For
       CON files generated by SITES, a '.' is given after 'IGNORE', 'NMOD'
       and 'NCHA' (these records are used by the CONTACTS and INTERFACE
       applications and can be disregarded).
     * NE - number of entries in the file. For CON files generated by
       SITES, this is the number of unique ligand:domain pairs. Following
       the NE record, the file has a section for each entry containing
       records for entry number (EN), identifier codes (ID), ligand
       description (DE), polypeptide chain-specific data (CN), chain
       sequence information (S1) and number of contacts (NC), that
       together define the ligand:domain pair and its contacts.
     * EN - entry number. The number in brackets indicates the start of an
       entry (ligand:domain pair).
     * ID - identifier codes: (1) PDB: 4-character PDB identifier code.
       (2) DOM: 7-character domain identifier code from SCOP or CATH. (3)
       LIG: Ligand identifier (an abbreviation of its full name).
     * DE - Full name of the ligand, see HETPARSE documentation.
     * CN - polypeptide chain-specific data. Tokens delimiting data items
       are as follows. (1) MO: The model number (from the PDB file). '1' is
       always given for CON files generated by using SITES (contacts were
       calculated from the coordinates for a single model from a domain
       CCF file). (2) CN1: Chain number. '1' is always given (domains from
       a domain CCF file are always listed as from a single chain only).
       (3) CN2: Not used by SITES, a '.' is given. (4) ID1: PDB chain
       identifier (a '.' is given in cases where a chain identifier was
       not specified in the original PDB file or, for domain CCF files,
       the domain from SCOP or CATH comprises more than one chain). (5)
       ID2: Not used by SITES, a '.' is given. (6) NRES1: number of
       residues in chain. (7) NRES2: Not used by SITES, a '.' is given.
     * S1 - polypeptide chain sequence for domain. The number of residues
       is given before AA on the first line. The sequence is given on
       subsequent lines.
     * NC - number of contacts: (1) SM: Not used by SITES, a '.' is given.
       (2) LI: Number of residue-ligand contacts; between side-chain or
       main-chain atoms of an amino acid residue and a ligand.
     * LI - Line of residue-ligand contact data. The amino acid identifier
       and residue number are given. Residue numbers are taken from the
       CCF file and give a correct index into the sequence (i.e. they are
       not necessarily the same as the original PDB file). This sequence
       is given in the CON file itself (S1 record).
     * // - delimiter for individual entries in the file and also given on
       the last line of the file.
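
   The record layout described above can be illustrated with a short
   Python reader. This is a sketch under stated assumptions, not the code
   used by SITES: the names parse_con and mismatched_contacts are
   hypothetical, sequence lines are assumed to start with blanks, and
   residue numbers in LI records are assumed to be 1-based indices into
   the S1 sequence. The second function checks each LI residue against
   the S1 sequence.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def parse_con(text):
    """Return one dict per ligand:domain entry (EN ... //) in CON text."""
    entries, entry, in_seq = [], None, False
    for line in text.splitlines():
        code, body = line[:2], line[2:].strip()
        if code == "EN":                       # start of a new entry
            entry = {"id": {}, "ligand": "", "seq": "", "contacts": []}
            in_seq = False
        elif entry is None:
            continue                           # file header (XX, TY, EX, NE)
        elif code == "ID":
            # "PDB 1cs4; DOM d1cs4a_; LIG 101;" -> {"PDB": "1cs4", ...}
            for item in body.split(";"):
                if item.strip():
                    key, value = item.split(None, 1)
                    entry["id"][key] = value
        elif code == "DE":
            entry["ligand"] = body
        elif code == "S1":
            in_seq = True                      # sequence follows on next lines
        elif code == "  " and in_seq:
            entry["seq"] += body.replace(" ", "")
        elif code == "LI":
            name, number = body.split()
            entry["contacts"].append((name, int(number)))
        elif code == "//":                     # end of entry
            entries.append(entry)
            entry, in_seq = None, False
        else:
            in_seq = False                     # XX, SI, CN, NC, ...
    return entries


def mismatched_contacts(entry):
    """Return LI contacts that do not match the S1 sequence (1-based)."""
    seq = entry["seq"]
    return [(name, n) for name, n in entry["contacts"]
            if not 1 <= n <= len(seq) or seq[n - 1] != THREE_TO_ONE.get(name)]


# First entry of the example output; NE is set to 1 for this excerpt.
SAMPLE_CON = """\
XX   Residue-ligand contact data (for domains).
XX
TY   LIGAND
XX
EX   THRESH 1.0; IGNORE .; NMOD .; NCHA .;
XX
NE   1
XX
EN   [1]
XX
ID   PDB 1cs4; DOM d1cs4a_; LIG 101;
XX
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  D8CCAE0E1FC0849A CRC64;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM .; LI 6
XX
LI   ASP 2
LI   PHE 6
LI   THR 7
LI   LEU 44
LI   GLY 45
LI   ASP 46
XX
//
"""

con_entries = parse_con(SAMPLE_CON)
print(con_entries[0]["id"], mismatched_contacts(con_entries[0]))
```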

  Output files for usage example

  File: SITES.con

XX   Residue-ligand contact data (for domains).
XX
TY   LIGAND
XX
EX   THRESH 1.0; IGNORE .; NMOD .; NCHA .;
XX
NE   11
XX
EN   [1]
XX
ID   PDB 1cs4; DOM d1cs4a_; LIG 101;
XX
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
XX
SI   SN 1; NS 2
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  D8CCAE0E1FC0849A CRC64;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM .; LI 6
XX
LI   ASP 2
LI   PHE 6
LI   THR 7
LI   LEU 44
LI   GLY 45
LI   ASP 46
XX
//
EN   [2]
XX
ID   PDB 1ii7; DOM d1ii7a_; LIG 101;
XX
DE   2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE
XX
SI   SN 2; NS 2
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 65; NRES2 .
XX
S1   SEQUENCE    65 AA;   7395 MW;  75FBE75B22FD3678 CRC64;
     MKFAHLADIH LGYEQFHKPQ REEEFAEAFK NALEIAVQEN VDFILIAGDL FHSSRPSPGT
     LKKAI
XX
NC   SM .; LI 2
XX
LI   HIS 10
LI   ASP 49
XX


  [Part of this file has been deleted for brevity]

NC   SM .; LI 3
XX
LI   ASP 8
LI   HIS 10
LI   ASP 49
XX
//
EN   [10]
XX
ID   PDB 2hhb; DOM .; LIG PO4;
XX
DE   PHOSPHATE ION
XX
SI   SN 1; NS 1
XX
CN   MO .; CN1 1; CN2 .; ID1 D; ID2 .; NRES1 146; NRES2 .
XX
S1   SEQUENCE   146 AA;  15867 MW;  EACBC707CFD466A1 CRC64;
     VHLTPEEKSA VTALWGKVNV DEVGGEALGR LLVVYPWTQR FFESFGDLST PDAVMGNPKV
     KAHGKKVLGA FSDGLAHLDN LKGTFATLSE LHCDKLHVDP ENFRLLGNVL VCVLAHHFGK
     EFTPPVQAAY QKVVAGVANA LAHKYH
XX
NC   SM .; LI 2
XX
LI   VAL 1
LI   LEU 81
XX
//
EN   [11]
XX
ID   PDB 1cs4; DOM d1cs4a_; LIG POP;
XX
DE   PYROPHOSPHATE 2-
XX
SI   SN 1; NS 1
XX
CN   MO .; CN1 1; CN2 .; ID1 A; ID2 .; NRES1 52; NRES2 .
XX
S1   SEQUENCE    52 AA;   5817 MW;  D8CCAE0E1FC0849A CRC64;
     ADIEGFTSLA SQCTAQELVM TLNELFARFD KLAAENHCLR IKILGDCYYC VS
XX
NC   SM .; LI 6
XX
LI   ASP 2
LI   ILE 3
LI   GLU 4
LI   GLY 5
LI   PHE 6
LI   THR 7
XX
//

  File: sites.log

CCF: /homes/user/test/qa/pdbplus-keep/1cs4.ccf HETS:YES NHETS:7 SCOP:YES NDOMS:1
CCF: /homes/user/test/qa/pdbplus-keep/1ii7.ccf HETS:YES NHETS:5 SCOP:YES NDOMS:1
CCF: /homes/user/test/qa/pdbplus-keep/2hhb.ccf HETS:YES NHETS:5 SCOP:NO NCHN:4

5.0 DATA FILES

   SITES uses a data file containing van der Waals radii for atoms in
   proteins (see the CONTACTS documentation). The file Evdw.dat is such a
   data file and is part of the EMBOSS distribution.
   SITES also uses a data file containing a dictionary of heterogen
   groups in PDB. The file Ehet.dat is such a data file; it is generated
   by using HETPARSE and is part of the EMBOSS distribution.

6.0 USAGE

  6.1 COMMAND LINE ARGUMENTS

Generate residue-ligand CON files from CCF files.
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-protpath]          dirlist    [./] This option specifies the location of
                                  the protein CCF files (clean coordinate
                                  files) (input). A 'clean coordinate file'
                                  contains protein coordinate and derived data
                                  for a single PDB file ('protein clean
                                  coordinate file') or a single domain from
                                  SCOP or CATH ('domain clean coordinate
                                  file'), in CCF format (EMBL-like). The
                                  files, generated by using PDBPARSE (PDB
                                  files) or DOMAINER (domains), contain
                                  'cleaned-up' data that is self-consistent
                                  and error-corrected. Records for residue
                                  solvent accessibility and secondary
                                  structure are added to the file by using
                                  PDBPLUS.
  [-domaindir]         directory  [./] This option specifies the location of
                                  the domain CCF files (clean coordinate
                                  files) (input). A 'clean coordinate file'
                                  contains protein coordinate and derived data
                                  for a single PDB file ('protein clean
                                  coordinate file') or a single domain from
                                  SCOP or CATH ('domain clean coordinate
                                  file'), in CCF format (EMBL-like). The
                                  files, generated by using PDBPARSE (PDB
                                  files) or DOMAINER (domains), contain
                                  'cleaned-up' data that is self-consistent
                                  and error-corrected. Records for residue
                                  solvent accessibility and secondary
                                  structure are added to the file by using
                                  PDBPLUS.
  [-dcffile]           infile     This option specifies the name of the DCF
                                  file (domain classification file) (input). A
                                  'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
   -threshold          float      [1.0] This option specifies the threshold
                                  contact distance. (Any numeric value)
  [-outfile]           outfile    [SITES.con] This option specifies the name
                                  of the output file.
   -logfile            outfile    [sites.log] This option specifies the name
                                  of the log file.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -dicfile            datafile   [Ehet.dat] This option specifies the
                                  dictionary of heterogen groups in PDB. This
                                  file is generated by using HETPARSE and is
                                  part of the EMBOSS distribution.
   -vdwfile            datafile   [Evdw.dat] This option specifies the name of
                                  the data file with van der Waals radii for
                                  atoms in amino acid residues. This file is
                                  part of the EMBOSS distribution.

   Associated qualifiers:

   "-protpath" associated qualifiers
   -extension1         string     Default file extension

   "-domaindir" associated qualifiers
   -extension2         string     Default file extension

   "-outfile" associated qualifiers
   -odirectory4        string     Output directory

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit
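
   The qualifiers listed above can be assembled into a command line from
   a script. The following Python sketch only builds and prints the
   argument list. It assumes the executable is installed under the name
   "sites" (the lower-case form of the application name); the function
   name build_sites_command and the directory names ./ccf/ and ./domains/
   are hypothetical placeholders. The default values are those shown in
   the qualifier listing.

```python
import shlex


def build_sites_command(protpath, domaindir, dcffile,
                        outfile="SITES.con", logfile="sites.log",
                        threshold=1.0):
    """Return an argument list for SITES using the qualifiers listed above."""
    return [
        "sites",                       # assumed executable name
        "-protpath", protpath,         # location of protein CCF files
        "-domaindir", domaindir,       # location of domain CCF files
        "-dcffile", dcffile,           # domain classification file
        "-threshold", str(threshold),  # threshold contact distance
        "-outfile", outfile,           # CON output file
        "-logfile", logfile,           # log file
        "-auto",                       # turn off prompts
    ]


# Hypothetical input locations; adjust to the local installation.
cmd = build_sites_command("./ccf/", "./domains/", "all.scop")
print(" ".join(shlex.quote(arg) for arg in cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

   Passing the arguments as a list avoids shell quoting problems when
   paths contain spaces.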


