ACMI v1.0 manual (c) 2007 Frank DiMaio 1 INTRODUCTION -------------- ACMI is a suite of programs designed to automate construction of a macromolecular model from an electron density map. It is specifically developed to handle low-resolution (3-4 Angstrom resolution) and poorly phased electron density maps. Details of ACMI's algorithms are available from the publications listed at the end of this manual. The manual is meant as an introduction to using ACMI to construct protien models from electron density, and a reference to each program's command-line parameters. Generally, a user wanting to construct an all-atom protein model from an electron density map will want to use the following program pipeline: a) SHFEAR -- given a protein amino-acid sequence and a density map -- searches the map for a set of templates corresponding to each amino-acid residue in the protein. It computes the prior probability distribution that of each amino-acid's presence at every location in the map. b) BP uses structural knowledge of the protein to refine the probabilities computed by SHFEAR. Its probabilistic inference algorithm computes the posterior distribution of each amino acid's presence at every location in the map; that is, the probabilty taking into account the location of every other amino acid. c) SIDECHAIN_PF uses a Monte-Carlo sampling algorithm over BP's posterior distributions to generate a _set_ of physically feasible protein structures, that best explains the computed posterior distribution. These three programs are connected using the idea of a working directory. Each program writes its output to the working directory; the next program in the pipeline reads in from this working directory, processes this input, then writes its output to this same directory. 2 TUTORIAL ---------- A small, well-phased 2.5-Angstrom-resolution density map -- with corresponding amino-acid sequence and crystal parameters -- is provided in the folder 'test/data/'. The density map is contained in the file 'test.fs4.map', the amino-acid sequence in 'test.seq', and the crystal parameters in 'test.cryst'. The script 'do_test.sh' shows a simple pipeline for the provided data. Below, we show each command in the pipeline, annotated with the output produced. ./shfear -w test/ -m test/data/test.fs4.map \ -c test/data/test.cryst \ -s test/data/test.seq \ -r 2.5 -o 0.9 -N 20 -p 1 1 NOTE: Before running this command, make sure the the environment variable "ACMI_NMERDB" is set to specify the folder containing the nmer database. This command runs SHFEAR's template searching over the density map. For each amino-acid position in test.seq, SHFEAR considers the 5-amino-acid sequence _centered_ at this position, and searches ACMI's fragment database for structures with a 5-AA sequence with high local similarity. It performs an exhaustive 6D search over the map for each matching structure, computing a correlation coefficient at every location and rotation. The density map, crystal parameters (identical to the 'CRYST1' line in a PDB file) and sequence information (one-letter codes, each chain on a line, with multiple asymmetric copies indicated with multiple lines) are specified with the '-m', '-c', and '-s' flags, respectively. The '-r' flag optionally provides the density map resolution. The '-o' flag says to ignore 90% of the map in this search -- that is, only look for fragments in the best 10% of the map. We set this value higher than the default 0.8 (ignoring only 80% of the map) because this is a well-phased map; setting it to 0.9 should find as good a model in about half the time. The '-N' flag specifies the minimum number of templates for which to search per residue. The default value, 50, typically works well; we set it to 20 here for performance reasons. Finally, the -p flag is used for parallelizing the algorithm. The default, '-p 1 1' (read 1-of-1) runs the entire job on a single processor. If there are two machines available, running two jobs, one specifying '-p 1 2', the other '-p 2 2' will finish about twice as fast. The working directory is specified with the '-w' flag (here it is the folder 'test/'). SHFEAR creates a subdirectory in the working directory named 'priors'. In this folder, for each 5-amino-acid sequence XXXXX, SHFEAR creates 5 files: match_XXXXX.cc.mat - a map of correlation coefficients between the best matching templates (C-alpha-centered) and the density map. match_XXXXX.betaF.mat, match_XXXXX.gammaF.mat - a map of the "preferred" location of the adjacent N-terminal C-alpha, specified in spherical coordinates. match_XXXXX.betaB.mat, match_XXXXX.gammaB.mat - a map of the "preferred" location of the adjacent C-terminal C-alpha, specified in spherical coordinates. Each of these maps is in MATLAB v6 format, and can be loaded and visualized in MATLAB. Data is extended to cover the entire unit cell (not just the asymmetric unit). ./BP -w test/ -c test/data/test.cryst \ -s test/data/test.seq This command runs ACMI's inference algorithm on the previously computed priors. The arguments -c and -s are identical The working directory -- again indicated with '-w' -- must be the same working directory in the previous step. BP reads in the contents of the 'priors' folder, and creates an output directory, 'posteriors'. In this folder, for each residue X of chain Y, BP creates the output file 'test_chnY_resX_marginal.mat'. This file contains the posterior distribution (that is, the distribution taking all other amino acid locations into account) of residue X's C-alpha location. Each of these maps are also in MATLAB v6 format, and can be loaded and visualized in MATLAB. BP also creates a subfolder to the working directory, 'backbone_models'. At each iteration of the algorithm, BP computes a backbone (C-alpha-only) model of the protein, and places it in this folder. ./sidechain_pf -w test/ -m test/data/test.fs4.map \ -c test/data/test.cryst \ -s test/data/test.seq \ -r 2.5 -b 1-1 This command uses the posterior disributions to guide a Monte Carlo sampling algorithm, producing an ensemble of all-atom protein models. The arguments '-m', '-s', and '-c' indicate the map, crystal, and sequence files, as in the other core programs. Resolution is indicated with -r, while an additional argument, '-b' indicates a range of structures to generate. In this case, we generate just 1 structure. Allowing a range of indices makes parallelizing straightforward, e.g. one can compute structures 1-5 on one machine, and 6-10 on another (repeated runs with a the same index yield the same structure. The working directory, specified with '-w', is the same as before. SIDECHAIN_PF reads the contents of the 'posteriors' directory from this folder, and creats a new output folder, 'Nparticle_models' (where N is the number of particles per run, by default 100). It outputs a set of all-atom models, named prot_allatom_pf.K.pdb, where K is the structure's index (corresponding to the range specified with '-b'). SIDECHAIN_PF also creates a directory 'fragment_cache' This set of structures comprises the final model produced by ACMI. 3 CORE PROGRAMS --------------- This section describes individual command line arguments to the core programs SHFEAR, BP, and SIDECHAIN_PF. 3.1 SHFEAR Given the amino acid sequence and a map, this program looks up each 5-mer in the database, finds a number of instances, and searches the map for these instances. For each residue in the protein, it outputs a map of correlation coefficients between the best-matching instance and the electron density map. Usage: ./shfear [options [args]] --help, -h Displays this information --working, -w [REQUIRED] Working Directory This is the top-level output directory, shared by all programs in ACMI SHFEAR writes its output maps to a subdirectory, 'priors'. --map, -m [REQUIRED] Input fs4 map file Full path to the fs4-format density map. The CCP4 program 'map2fs' converts CCP4 maps to FS4 maps. --cryst, -c [REQUIRED] Input crystal file A one-line text file that has the exact same syntax as the 'CRYST1' line in a PDB file. --seq, -s [REQUIRED] Input sequence file The protein's sequence as a one-letter code ('_'s indicate gaps). Each chain is specified on a separate line. NOTE! If there are multiple copies in the asymmetric unit, you will need to enter a line for each. --reso, -r Density map resolution (default 2.0) --bandwidth, -B Spherical harmonic bandwidth (default 12) --radius, -R Fragment radius (default 5) --nfrags, -N Number of fragments considered per amino acid (default 50) Search parameters. These are optional, and defaults give reasonable results in most cases. Higher values for 'bandwidth' and 'nfrags' may yield higher-quality results at the cost of additional runtime. --grid, -x Approximate map grid spacing (default 1A) Grid spacing is now computed automatically, given an approximate grid spacing (works the same as 'fft' in CCP4). The default is reasonable in most cases. --omit, -o Using density values, omit fraction of map (default 0.8) A first-pass filter, based on density values throws away this fraction of points. If the map is especially well-phased, you may consider increasing this to 0.9 (reducing the runtime by half). --parallel, -p Parallelize of This flag allows parallelism of shfear jobs. Individual amino-acids are run on separate machines. Parallelization of jobs is done as 'k of N'; e.g., -p 1 12 indicates this job is the first of 12 parallel runs. --analogues, -a List of structural analogues for fragment generation Allows specification of structural analogues. The search procudure will use templates from this structure, and only go to the database if there are no instances in the analogue. Specifiy a list of PDB files, deliminated by spaces. --force, -f Force priors to be overwritten Normally, if an ouput file exists in 'priors' from a previous run of shfear, it will not be overwritten. This flag forces overwriting in such a case. 3.2 BP Given the output of shfear, BP runs the inference algorithm to compute marginal distributions of each residue's position. Usage: ./BP [options [args]] --help, -h Displays this information --working, -w [REQUIRED] Working Directory These are the same as in shfear. BP reads from the priors directory, and creates two directories for output, 'posteriors', and 'backbone_models'. --cryst, -c [REQUIRED] Input crystal file --seq, -s [REQUIRED] Input sequence file These are the same as in shfear. --sigma, -o Average squared rotation error (default 1.0) --m-est, -m Probability of incorrect rotational alignment (default 0.1) --iters, -i Iterations of BP (default 25) Search parameters. These are optional, and defaults give reasonable results in most cases. --seleniums, -e PDB file containing locations of selenium atoms --se-prior, -p If Se file specified, wgt on specified Se locations (default 0.9) These two parameters allow you to specify the location of all the selenium atoms (from phasing experiments) and a weight to apply to these locations. The default weight should work fine in most cases. --memory, -x Maximum memory usage before paging (in MB) --scratch, -d Scratch directory for paging (will use a unique subdir) With especially large maps, BP often need significant amounts of memory, which can lead to swapping and significant performance penalty. These flags allows specification of a maximum amount of memory to allocate, and a scratch directory, which will be used for temporary storage, only if required memory is >= the memory limit. Default is 4000 MB and '/scratch', respectively. 3.3 SIDECHAIN_PF Given the posteriors from BP, use particle filtering to obtain a set of all-atom models. Usage: ./sidechain_pf [options [args]] --help, -h Displays this information --working, -w [REQUIRED] Working Directory Same as in shfear. sidechain_pf reads from 'posteriors', and writes to a folder Nparticle_models, where 'N' is the # of particles. --map, -m [REQUIRED] Input fs4 map file --cryst, -c [REQUIRED] Input crystal file --seq, -s [REQUIRED] Input sequence file The same as shfear/BP. --reso, -r Density map resolution (default 2.0) --bandwidth, -B Spherical harmonic bandwidth (default 16) --rad-align, -g Fragment radius for alignment (default 5) --rad-mask, -k Fragment radius for computing CC (default 8) --mask-dist, -d Mask dist for computing CC if rad-mask > rad-align (default 3.0) --nfrags, -N Number of fragments considered per amino acid (default 50) Search parameters. These are optional, and defaults give reasonable results in most cases. Many are the same as in shfear. --npart, -p Number of particles (default 100) The number of point estimates used to model the posterior. The default is probably reasonable. --buckets, -b Bucket number (or range) of this particle set (default 1-10) Range of buckets, or "populations" of particles over which to run (e.g -b 3-6 runs buckets 3,4,5, and 6). No resampling occurs between different buckets, so each bucket produces an independent interpretation. Using '-b' it is possible to parallelize PF runs: for example, running buckets 1-5 on 1 machine, and 6-10 simultaneously on another. --b-min, -y Min belief to continue a trace (default 0.001) --M-max, -z Max entropy to start a trace (default 2.8) Parameters controlling when to abort chain growth and to terminate interpretation early. Defaults give reasonable results. --analogues, -a A list of structural analogues to use for fragment generation 4 EXTRA UTILITIES ----------------- Several additional programs are provided with the ACMI distribution. The program GET_RMSD is used to evaluate an ACMI model against a crystallographer-built model. The program SIDECHAIN_NAIVE takes the backbone model produced by BP and stitches sidechains on this predicted backbone. Finally, the program BUILD_NMER_DB is provided as a reference to users wishing to construct their own 5-mer fragment library. 4.1 GET_RMSD A short program to analyze computed results. Unless the 'verbose' option is specified, arguments are output in a single comma-separated line, in the following order: 1) Number of chains in the model 2) Percentage of "true" C-alphas found 3) Percentage of "true" sidechains correctly identified 4) Percentage of predicted structure corresponding to "true" structure 5) Percentage of predicted structure corresponding to correctly-identified "true" structure 6-8) C-alpha, all-atom, and sidechain-only RMS error of WHAT WAS CORRECTLY PREDICTED Values 2 and 3 correspond to the "recall" of the model, while values 4 and 5 correspond to the "precision". If 'nobreak' was not specified, 8 additional values provide the same metrics for the model in which the predicted chain was optimally broken. Usage: ./get_RMSd [options [args]] --help, -h Displays this information --true, -t [REQUIRED] True PDB --pred, -p [REQUIRED] Predicted PDB Paths to the true and the predicted PDB files. --correct, -c Consider AA correct if CA is within this distance of the true CA location (default=2A) Self-explanitory. --outpdb, -o Write PDB in correct asymm unit to STDOUT Outputs to STDOUT the predicted PDB, but rotated to lie in the same asymmetric unit as the true PDB. Useful for displaying the results. --stats, -s Write matching statistics to STDOUT Prints a comma-seperated list comparing prediction confidence (specified in the B-factor column of the predicted PDB), and the distance from the true CA location. --nobreak, -n Don't try to break chains By default, get_RMSd will try to break false connections in the model. Use this flag to disable this behavior. --verbose, -v Use verbose output By default, get_RMSd will output on a single comma-separated line. This argument forces human-readable output. 4.2 SIDECHAIN_NAIVE Given a backbone model produced by BP, this program naively creates an all-atom model by stitching sidechains on this backbone-only model. The arguments are similar to SIDECHAIN_PF. SIDECHAIN_NAIVE writes to STDOUT. Usage: ./sidechain_pf [options [args]] --help, -h Displays this information --working, -w [REQUIRED] Working Directory Same as in shfear. sidechain_pf reads from 'posteriors', and writes to a folder Nparticle_models, where 'N' is the # of particles. --map, -m [REQUIRED] Input fs4 map file The same as shfear/BP. --cryst, -p [REQUIRED] Input backbone model A backbone model (C-alpha-only) of the protein of interest. --reso, -r Density map resolution (default 2.0) --bandwidth, -B Spherical harmonic bandwidth (default 16) --rad-align, -g Fragment radius for alignment (default 5) --rad-mask, -k Fragment radius for computing CC (default 8) --mask-dist, -d Mask dist for computing CC if rad-mask > rad-align (default 3.0) --nfrags, -N Number of fragments considered per amino acid (default 50) Search parameters. These are optional, and defaults give reasonable results in most cases. --analogues, -a A list of structural analogues to use for fragment generation 4.3 BUILD_NMER_DB This program allows one to construct a custom fragment library. The program takes two command line arguments: a) A text file where the first four charaters of each line give a PDB accession code, and the fifth gives a chain identifier (with a single header line). b) The location of the fragment database. Before building the database, the PDB files should be downloaded and placed in 'nmerdb/fraglib', and named 'pdbXXXX.pdb', where XXXX is the accession code. A sample such text file is given in 'nmerdb/cullpdb_pc30_res3.0_R1.0_d050604_chains4257.txt' The intended use of the program is to use the Dunbrack lab's culled PDB lists at 'http://dunbrack.fccc.edu/Guoli/pisces_download.php'. The "live" database is always at 'nmerdb/nmers.db' (and ACMI always looks for PDB files in 'nmerdb/fraglib'). 5 PARALLELIZATION ----------------- The utility SHFEAR is the component of ACMI that requires the most significant computation time. However, it is trivially parallelized using the '-p' flag. Inputs are specified in a k-of-n format, e.g. '-p 12 35' indicates that this run of SHFEAR is to be job #12 after breaking the task into 35 jobs. Jobs are split at the amino-acid level; specifying more parallel jobs than amino acids in the protein will not provide additional parallelism (extra jobs will immediately halt). SHFEAR's parallelism works extremely well with the CONDOR distributed computing environment. Details of CONDOR are available on the project's home page, at http://www.cs.wisc.edu/condor/. Included with this distribution is a sample SHFEAR condor-submission file, 'run.condor'. In this file, the testset protein in the folder '_02b_70653' is split into 323 jobs (one for each amino-acid). These jobs are submitted to the job queuing system with the command 'condor_submit run.condor'. 6 ACMI PUBLICATIONS ------------------- SHFEAR is described in: F. DiMaio, A. Soni, G. Phillips and J. Shavlik (2007). Improved methods for template-matching in electron density maps using spherical harmonics. Proc. BIBM. BP is described in: F. DiMaio, J. Shavlik, and G. Phillips (2006). A probabilistic approach to protein backbone tracing in electron density maps. Bioinformatics 22. e81-e89. SIDECHAIN_PF is described in: F. DiMaio, D. Kondrashov, E. Bitto, C. Bingman, A. Soni, G. Phillips, and J. Shavlik (under review). Creating All-Atom Protein Models from Electron-Density Maps using Particle-Filtering Methods.