ACMI v1.0 manual
(c) 2007 Frank DiMaio


1 INTRODUCTION
--------------

ACMI is a suite of programs designed to automate construction of a 
macromolecular model from an electron density map.  It is specifically
developed to handle low-resolution (3-4 Angstrom resolution) and 
poorly phased electron density maps.

Details of ACMI's algorithms are available from the publications listed at 
the end of this manual.  The manual is meant as an introduction to using ACMI
to construct protien models from electron density, and a reference to each
program's command-line parameters.

Generally, a user wanting to construct an all-atom protein model from an 
electron density map will want to use the following program pipeline:
     
    a) SHFEAR -- given a protein amino-acid sequence and a density map -- 
       searches the map for a set of templates corresponding to each 
       amino-acid residue in the protein.  It computes the prior probability
       distribution that of each amino-acid's presence at every location in
       the map.
     
    b) BP uses structural knowledge of the protein to refine the probabilities
       computed by SHFEAR.  Its probabilistic inference algorithm computes the
       posterior distribution of each amino acid's presence at every location
       in the map; that is, the probabilty taking into account the location
       of every other amino acid.
     
    c) SIDECHAIN_PF uses a Monte-Carlo sampling algorithm over BP's posterior 
       distributions to generate a _set_ of physically feasible protein 
       structures, that best explains the computed posterior distribution.

These three programs are connected using the idea of a working directory. Each
program writes its output to the working directory; the next program in the
pipeline reads in from this working directory, processes this input, then
writes its output to this same directory.


2 TUTORIAL
----------

A small, well-phased 2.5-Angstrom-resolution density map -- with corresponding
amino-acid sequence and crystal parameters -- is provided in the folder 
'test/data/'.  The density map is contained in the file 'test.fs4.map', the 
amino-acid sequence in 'test.seq', and the crystal parameters in 'test.cryst'.

The script 'do_test.sh' shows a simple pipeline for the provided data.  Below, 
we show each command in the pipeline, annotated with the output produced.


    ./shfear -w test/  -m test/data/test.fs4.map \
                       -c test/data/test.cryst \
                       -s test/data/test.seq \
                       -r 2.5 -o 0.9 -N 20 -p 1 1
					   
NOTE: Before running this command, make sure the the environment variable
"ACMI_NMERDB" is set to specify the folder containing the nmer database.
                       
This command runs SHFEAR's template searching over the density map.  For each
amino-acid position in test.seq, SHFEAR considers the 5-amino-acid sequence 
_centered_ at this position, and  searches ACMI's fragment database for 
structures with a 5-AA sequence with high local similarity.  It performs an 
exhaustive 6D search over the map for each matching structure, computing a 
correlation coefficient at every location and rotation.

The density map, crystal parameters (identical to the 'CRYST1' line in a 
PDB file) and sequence information (one-letter codes, each chain on a line, 
with multiple asymmetric copies indicated with multiple lines) are specified
with the '-m', '-c', and '-s' flags, respectively.  The '-r' flag optionally
provides the density map resolution.  

The '-o' flag says to ignore 90% of the map in this search -- that 
is, only look for fragments in the best 10% of the map.  We set this value 
higher than the default 0.8 (ignoring only 80% of the map) because this is a 
well-phased map; setting it to 0.9 should find as good a model in about half
the time.

The '-N' flag specifies the minimum number of templates for which to search
per residue.  The default value, 50, typically works well; we set it to
20 here for performance reasons.

Finally, the -p flag is used for parallelizing the algorithm.  The default,
'-p 1 1' (read 1-of-1) runs the entire job on a single processor.  If there
are two machines available, running two jobs, one specifying '-p 1 2', the 
other '-p 2 2' will finish about twice as fast.

The working directory is specified with the '-w' flag (here it is the folder 
'test/').  SHFEAR creates a subdirectory in the working directory named 
'priors'.  In this folder, for each 5-amino-acid sequence XXXXX, SHFEAR 
creates 5 files:

    match_XXXXX.cc.mat 
        - a map of correlation coefficients between the best matching
          templates (C-alpha-centered) and the density map.
          
    match_XXXXX.betaF.mat,  match_XXXXX.gammaF.mat
        - a map of the "preferred" location of the adjacent N-terminal
          C-alpha, specified in spherical coordinates.
        
    match_XXXXX.betaB.mat,  match_XXXXX.gammaB.mat
        - a map of the "preferred" location of the adjacent C-terminal
          C-alpha, specified in spherical coordinates.
        
Each of these maps is in MATLAB v6 format, and can be loaded and visualized 
in MATLAB.  Data is extended to cover the entire unit cell (not just the
asymmetric unit).

                       
    ./BP -w test/  -c test/data/test.cryst \
                   -s test/data/test.seq  
                   
This command runs ACMI's inference algorithm on the previously computed 
priors.  The arguments -c and -s are identical 

The working directory -- again indicated with '-w' -- must be the
same working directory in the previous step.  BP reads in the contents of the
'priors' folder, and creates an output directory, 'posteriors'.  

In this folder, for each residue X of chain Y, BP creates the output file 
'test_chnY_resX_marginal.mat'.  This file contains the posterior distribution
(that is, the distribution taking all other amino acid locations into account) 
of residue X's C-alpha location.  Each of these maps are also in MATLAB v6 
format, and can be loaded and visualized in MATLAB.

BP also creates a subfolder to the working directory, 'backbone_models'.  At
each iteration of the algorithm, BP computes a backbone (C-alpha-only) model
of the protein, and places it in this folder.
                  
                  
    ./sidechain_pf -w test/  -m test/data/test.fs4.map \
                             -c test/data/test.cryst \
                             -s test/data/test.seq \
                             -r 2.5 -b 1-1

This command uses the posterior disributions to guide a Monte Carlo sampling
algorithm, producing an ensemble of all-atom protein models.  The arguments
'-m', '-s', and '-c' indicate the map, crystal, and sequence files, as in the 
other core programs.  Resolution is indicated with -r, while an additional
argument, '-b' indicates a range of structures to generate.  In this case,
we generate just 1 structure.  

Allowing a range of indices makes parallelizing straightforward, e.g. one can 
compute structures 1-5 on one machine, and 6-10 on another (repeated runs with
a the same index yield the same structure.

The working directory, specified with '-w', is the same as before. 
SIDECHAIN_PF reads the contents of the 'posteriors' directory from this 
folder, and creats a new output folder, 'Nparticle_models' (where N is the 
number of particles per run, by default 100).  It outputs a set of all-atom
models, named prot_allatom_pf.K.pdb, where K is the structure's index
(corresponding to the range specified with '-b').

SIDECHAIN_PF also creates a directory 'fragment_cache'

This set of structures comprises the final model produced by ACMI.


3 CORE PROGRAMS
---------------

This section describes individual command line arguments to the core programs
SHFEAR, BP, and SIDECHAIN_PF.


3.1 SHFEAR 

Given the amino acid sequence and a map, this program looks up each 5-mer in the
database, finds a number of instances, and searches the map for these instances.
For each residue in the protein, it outputs a map of correlation coefficients
between the best-matching instance and the electron density map.

Usage: ./shfear [options [args]]

  --help,       -h          Displays this information
  --working,    -w <args>   [REQUIRED] Working Directory
        This is the top-level output directory, shared by all programs in ACMI
        SHFEAR writes its output maps to a subdirectory, 'priors'.
  
  --map,        -m <args>   [REQUIRED] Input fs4 map file
        Full path to the fs4-format density map.
        The CCP4 program 'map2fs' converts CCP4 maps to FS4 maps.
        
  --cryst,      -c <args>   [REQUIRED] Input crystal file
        A one-line text file that has the exact same syntax as the 'CRYST1'
        line in a PDB file.
  
  --seq,        -s <args>   [REQUIRED] Input sequence file
        The protein's sequence as a one-letter code ('_'s indicate gaps). 
        Each chain is specified on a separate line.
        NOTE! If there are multiple copies in the asymmetric unit, 
              you will need to enter a line for each.
        
  --reso,       -r <args>   Density map resolution (default 2.0)
  --bandwidth,  -B <args>   Spherical harmonic bandwidth (default 12)
  --radius,     -R <args>   Fragment radius (default 5)
  --nfrags,     -N <args>   Number of fragments considered per amino acid
                            (default 50)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases.  Higher values for 'bandwidth' and
        'nfrags' may yield higher-quality results at the cost of additional 
        runtime.
        
  --grid,       -x <args>   Approximate map grid spacing (default 1A)
        Grid spacing is now computed automatically, given an approximate
        grid spacing (works the same as 'fft' in CCP4).
        The default is reasonable in most cases.
  
  --omit,       -o <args>   Using density values, omit fraction of map 
                            (default 0.8)
        A first-pass filter, based on density values throws away this 
        fraction of points.  If the map is especially well-phased, you 
        may consider increasing this to 0.9 (reducing the runtime by half).
        
  --parallel,   -p <args>   Parallelize <job#> of <Njobs>
        This flag allows parallelism of shfear jobs.  Individual amino-acids
        are run on separate machines.
        
        Parallelization of jobs is done as 'k of N'; e.g., -p 1 12 indicates
        this job is the first of 12 parallel runs.
  
  --analogues,  -a <args>   List of structural analogues for fragment generation
        Allows specification of structural analogues.  The search procudure
        will use templates from this structure, and only go to the database if
        there are no instances in the analogue.  Specifiy a list of PDB files, 
        deliminated by spaces.
  
  --force,      -f          Force priors to be overwritten
        Normally, if an ouput file exists in 'priors' from a previous run of
        shfear, it will not be overwritten.  
        This flag forces overwriting in such a case.


3.2 BP

   Given the output of shfear, BP runs the inference algorithm to
   compute marginal distributions of each residue's position.

Usage: ./BP [options [args]]
  
  --help,       -h          Displays this information
  --working,    -w <args>   [REQUIRED] Working Directory
        These are the same as in shfear.  BP reads from the priors directory,
        and creates two directories for output, 'posteriors', and 
        'backbone_models'.
  
  --cryst,      -c <args>   [REQUIRED] Input crystal file
  --seq,        -s <args>   [REQUIRED] Input sequence file
        These are the same as in shfear.
  
  --sigma,      -o <args>   Average squared rotation error (default 1.0)
  --m-est,      -m <args>   Probability of incorrect rotational alignment
                            (default 0.1)
  --iters,      -i <args>   Iterations of BP (default 25)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases. 
          
  --seleniums,  -e <args>   PDB file containing locations of selenium atoms
  --se-prior,   -p <args>   If Se file specified, wgt on specified Se locations
                            (default 0.9)
        These two parameters allow you to specify the location of all the 
        selenium atoms (from phasing experiments) and a weight to apply to
        these locations.  The default weight should work fine in most cases.
                            
  --memory,     -x <args>   Maximum memory usage before paging (in MB)
  --scratch,    -d <args>   Scratch directory for paging 
                            (will use a unique subdir)
        With especially large maps, BP often need significant amounts of
        memory, which can lead to swapping and significant performance penalty.
        
        These flags allows specification of a maximum amount of memory to 
        allocate, and a scratch  directory, which will be used for temporary 
        storage, only if required memory is >= the memory  limit.  
        Default is 4000 MB and '/scratch', respectively.
 

3.3 SIDECHAIN_PF

   Given the posteriors from BP, use particle filtering to obtain a set of 
   all-atom models.
   
Usage: ./sidechain_pf [options [args]]

  --help,       -h          Displays this information
  --working,    -w <args>   [REQUIRED] Working Directory
        Same as in shfear.  sidechain_pf reads from 'posteriors', and writes to 
        a folder Nparticle_models, where 'N' is the # of particles.
  
  --map,        -m <args>   [REQUIRED] Input fs4 map file
  --cryst,      -c <args>   [REQUIRED] Input crystal file
  --seq,        -s <args>   [REQUIRED] Input sequence file
        The same as shfear/BP.
  
  --reso,       -r <args>   Density map resolution (default 2.0)
  --bandwidth,  -B <args>   Spherical harmonic bandwidth (default 16)
  --rad-align,  -g <args>   Fragment radius for alignment (default 5)
  --rad-mask,   -k <args>   Fragment radius for computing CC (default 8)
  --mask-dist,  -d <args>   Mask dist for computing CC if rad-mask > rad-align
                            (default 3.0)
  --nfrags,     -N <args>   Number of fragments considered per amino acid
                            (default 50)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases. Many are the same as in shfear.
            
                            
  --npart,      -p <args>   Number of particles (default 100)
        The number of point estimates used to model the posterior.
        The default is probably reasonable.
        
  --buckets,    -b <args>   Bucket number (or range) of this particle set 
                            (default 1-10)
        Range of buckets, or "populations" of particles over which to run 
        (e.g -b 3-6 runs buckets 3,4,5, and 6).  No resampling occurs
        between different buckets, so each bucket produces an independent
        interpretation.  Using '-b' it is possible to parallelize PF runs: 
        for example, running buckets 1-5 on 1 machine, and 6-10 simultaneously
        on another.        
  --b-min,      -y <args>   Min belief to continue a trace (default 0.001)
  --M-max,      -z <args>   Max entropy to start a trace (default 2.8)
        Parameters controlling when to abort chain growth and to 
        terminate interpretation early.  Defaults give reasonable results.
         
  --analogues,  -a <args>   A list of structural analogues to use for 
                            fragment generation
    

4 EXTRA UTILITIES
-----------------

Several additional programs are provided with the ACMI distribution.

The program GET_RMSD is used to evaluate an ACMI model against a 
crystallographer-built model.

The program SIDECHAIN_NAIVE takes the backbone model produced by
BP and stitches sidechains on this predicted backbone.

Finally, the program BUILD_NMER_DB is provided as a reference to users
wishing to construct their own 5-mer fragment library.


4.1 GET_RMSD

   A short program to analyze computed results.  Unless the 'verbose' option
   is specified, arguments are output in a single comma-separated line, in
   the following order:
   
       1) Number of chains in the model
       2) Percentage of "true" C-alphas found
       3) Percentage of "true" sidechains correctly identified
       4) Percentage of predicted structure corresponding to "true" structure
       5) Percentage of predicted structure corresponding to 
          correctly-identified "true" structure
       6-8) C-alpha, all-atom, and sidechain-only RMS error of 
            WHAT WAS CORRECTLY PREDICTED
    
    Values 2 and 3 correspond to the "recall" of the model, while
    values 4 and 5 correspond to the "precision".
    
    If 'nobreak' was not specified, 8 additional values provide the same 
    metrics for the model in which the predicted chain was optimally broken.
   
   
   Usage: ./get_RMSd [options [args]]

  --help,       -h          Displays this information
  --true,       -t <args>   [REQUIRED] True PDB
  --pred,       -p <args>   [REQUIRED] Predicted PDB
        Paths to the true and the predicted PDB files.
        
  --correct,    -c <args>   Consider AA correct if CA is within this distance 
                            of the true CA location (default=2A)
        Self-explanitory.
                            
  --outpdb,     -o          Write PDB in correct asymm unit to STDOUT
        Outputs to STDOUT the predicted PDB, but rotated to lie in the same 
        asymmetric unit as the true PDB.  Useful for displaying the results.
        
  --stats,      -s          Write matching statistics to STDOUT
        Prints a comma-seperated list comparing prediction confidence 
        (specified in the B-factor column of the predicted PDB), and the 
        distance from the true CA location.
  
  --nobreak,    -n          Don't try to break chains
        By default, get_RMSd will try to break false connections in the model.
        Use this flag to disable this behavior.

  --verbose,    -v          Use verbose output
        By default, get_RMSd will output on a single comma-separated line.
        This argument forces human-readable output.
        
        
4.2 SIDECHAIN_NAIVE

    Given a backbone model produced by BP, this program naively creates an
    all-atom model by stitching sidechains on this backbone-only model.
    
    The arguments are similar to SIDECHAIN_PF.  SIDECHAIN_NAIVE writes to 
    STDOUT.
    
Usage: ./sidechain_pf [options [args]]

  --help,       -h          Displays this information
  --working,    -w <args>   [REQUIRED] Working Directory
        Same as in shfear.  sidechain_pf reads from 'posteriors', and writes to 
        a folder Nparticle_models, where 'N' is the # of particles.
  
  --map,        -m <args>   [REQUIRED] Input fs4 map file
        The same as shfear/BP.
        
  --cryst,      -p <args>   [REQUIRED] Input backbone model
        A backbone model (C-alpha-only) of the protein of interest.
  
  --reso,       -r <args>   Density map resolution (default 2.0)
  --bandwidth,  -B <args>   Spherical harmonic bandwidth (default 16)
  --rad-align,  -g <args>   Fragment radius for alignment (default 5)
  --rad-mask,   -k <args>   Fragment radius for computing CC (default 8)
  --mask-dist,  -d <args>   Mask dist for computing CC if rad-mask > rad-align
                            (default 3.0)
  --nfrags,     -N <args>   Number of fragments considered per amino acid
                            (default 50)
        Search parameters.  These are optional, and defaults give
        reasonable results in most cases.
            
                            
  --analogues,  -a <args>   A list of structural analogues to use for 
                            fragment generation


4.3 BUILD_NMER_DB

This program allows one to construct a custom fragment library.  
The program takes two command line arguments: 
    
    a) A text file where the first four charaters of each line 
       give a PDB accession code, and the fifth gives a chain 
       identifier (with a single header line).
    
    b) The location of the fragment database.

Before building the database, the PDB files should be downloaded and placed in 
'nmerdb/fraglib', and named 'pdbXXXX.pdb', where XXXX is the accession code.

A sample such text file is given in 
'nmerdb/cullpdb_pc30_res3.0_R1.0_d050604_chains4257.txt' 
The intended use of the program is to use the Dunbrack lab's culled PDB lists
at 'http://dunbrack.fccc.edu/Guoli/pisces_download.php'.

The "live" database is always at 'nmerdb/nmers.db' (and ACMI always looks for
PDB files in 'nmerdb/fraglib'). 


5 PARALLELIZATION
-----------------

The utility SHFEAR is the component of ACMI that requires the most significant
computation time.  However, it is trivially parallelized using the '-p' flag. 
Inputs are specified in a k-of-n format, e.g. '-p 12 35' indicates that this run
of SHFEAR is to be job #12 after breaking the task into 35 jobs.  Jobs are split
at the amino-acid level; specifying more parallel jobs than amino acids in the
protein will not provide additional parallelism (extra jobs will immediately
halt).

SHFEAR's parallelism works extremely well with the CONDOR distributed computing
environment.  Details of CONDOR are available on the project's home page, at
http://www.cs.wisc.edu/condor/.

Included with this distribution is a sample SHFEAR condor-submission file,
'run.condor'.  In this file, the testset protein in the folder '_02b_70653' is
split into 323 jobs (one for each amino-acid).  These jobs are submitted to the job
queuing system with the command 'condor_submit run.condor'.


6 ACMI PUBLICATIONS
-------------------

SHFEAR is described in:

F. DiMaio, A. Soni, G. Phillips and J. Shavlik (2007).  Improved methods for 
template-matching in electron density maps using spherical harmonics.  Proc. BIBM.


BP is described in:

F. DiMaio, J. Shavlik, and G. Phillips (2006).  A probabilistic approach to
protein backbone tracing in electron density maps. Bioinformatics 22.  e81-e89.


SIDECHAIN_PF is described in:

F. DiMaio, D. Kondrashov, E. Bitto, C. Bingman, A. Soni,  G. Phillips, and 
J. Shavlik (under review). Creating All-Atom Protein Models from 
Electron-Density Maps using Particle-Filtering Methods.