POA installation notes Sept. 2001
Chris Lee
Dept. of Chemistry & Biochemistry
UCLA

I. COMPILATION
To compile this program, simply type 'make poa'.

This produces an executable for sequence alignment (poa) and also a linkable 
library, liblpo.a.  The software has been compiled and tested on Linux.

II. RUNNING POA

Poa has a variety of command line options.  Running poa without any arguments 
will print a list of the possible command line arguments.  POA may be used to
construct a PO-MSA, or to analyze a PO-MSA.   

A.  Constructing a PO-MSA
-------------------------

1.  Required Input:

i. An Alignment Score Matrix File:
A score matrix file is required, because poa uses it to get the residue
alphabet and indexing.  Even if poa is not being used to perform 
multiple sequence alignment, this file must be provided. Any basic alignment 
matrix which may be used with BLAST may be used here.  This file must be the 
first command line argument without a flag in order to be interpreted by poa 
as the score matrix file.  An example score matrix file, blosum80.mat, is 
provided in this directory. 

Note:  Poa is Case Sensitive.
Poa can align amino acid sequences to nucleotide sequences.  In order to
distinguish amino acid residues from nucleic acid residues, poa is case 
sensitive.  Residues that are upper case are interpreted as amino acids
while residues that are lower case are interpreted as nucleotides.  Poa
can handle mixed score matrices containing scores for aligning amino acid 
residues to nucleotide residues as long as the column and row labels of the 
matrix are case sensitive.  See blosum80.mat file for an example of just
such a mixed score matrix.

ii. A FASTA File:
A FASTA file is required only if poa is being used to construct a 
new PO-MSA from a list of sequences, or to align a list of sequences to
an already existing PO-MSA (see Analyzing a PO-MSA below).  This FASTA file 
should contain sequences to be aligned by poa.  The command line argument
to get poa to accept a FASTA file as input is '-read_fasta FILENAME'.  Poa 
will interpret FILENAME as the FASTA sequence file.  An example file, 
multidom.seq, is provided in this directory.  

Poa is case sensitive (see note above).  All residues in protein sequences 
in the FASTA file must be upper case to be interpreted as amino acids by poa, 
while all residues in nucleotide sequences in the FASTA file must be lower 
case to be interpreted as nucleotides by poa.  To switch the case of all of 
the letters in the FASTA file to upper case, use the '-toupper' command 
line argument.  To switch the case of all the letters in the FASTA file to 
lower case, use the '-tolower' command line argument.
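
The case convention above can be sketched in a few lines of Python.  This is
an illustrative preprocessing helper, not part of poa: the function names are
hypothetical, and leaving the '>' header lines untouched is an assumption
(these notes do not say how '-toupper' and '-tolower' treat headers).

```python
def residue_kind(ch):
    """Classify one residue by case, following the convention described above."""
    if ch.isupper():
        return "amino acid"
    if ch.islower():
        return "nucleotide"
    return "other"  # e.g. gap characters or digits


def normalize_fasta_case(text, to_upper=True):
    """Change the case of sequence lines only.

    Assumption: header lines (starting with '>') are kept exactly as they are.
    """
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.upper() if to_upper else line.lower())
    return "\n".join(out)


if __name__ == "__main__":
    print(normalize_fasta_case(">seq1 example\nacgt", to_upper=True))
    print(residue_kind("A"), "/", residue_kind("a"))
```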

2.  MSA Construction Options:

i.  Aggressive Fusion:
During the building up of a PO-MSA, if a node i with label 'a' is aligned
to an align ring which already contains a node j with label 'a', poa simply
adds the node to the align ring.  It is possible to force poa to do 
aggressive fusion, so that when a node i with label 'a' is aligned to
an align ring which already contains a node j with label 'a', node i is
fused to node j.  The command line argument for accomplishing this is 
'-fuse_all'.

3.  MSA Output Formats:
Poa can output a PO-MSA in several formats simultaneously, including CLUSTAL, 
PIR, and PO.  The PO format is the best format since it contains all of the 
information in the PO-MSA.  The other formats accurately represent the MSA, 
but since they are row-column (RC-MSA) formats, they may lose some of the 
information in the full PO-MSA.

i.  CLUSTAL format:
This format is the standard CLUSTAL format.  The command line argument to get
the MSA output in this format is '-clustal FILENAME'.

ii.  PIR format:
This format is the standard PIR format, which is like FASTA with a '.'
character representing gaps.  The command line argument to get the MSA
output in this format is '-pir FILENAME'.

iii.  PO format:
This format is the standard PO format.  It is described below in section III,
PO FILE FORMAT.  The command line argument to get the MSA output in this 
format is '-po FILENAME'.

Example:  Constructing an MSA of Four Protein Sequences
Running poa with the following statement will take the FASTA formatted 
sequences in the multidom.seq file, construct a PO-MSA using the scoring 
matrix in the file blosum80.mat, and then output the PO-MSA in CLUSTAL format
in the file multidom.aln.

poa -read_fasta multidom.seq -clustal multidom.aln blosum80.mat

The output should be identical to the results of figs. 6 & 7 in the paper.
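
When poa is driven from a script, the argument list can be assembled
programmatically.  The sketch below uses only the flags documented in these
notes ('-read_fasta', '-clustal', '-pir', '-po').  The helper name
build_poa_command is hypothetical, and it assumes an executable named 'poa'
is on the PATH.  The score matrix is appended as the only argument without a
flag, which makes it the first such argument.

```python
def build_poa_command(matrix, fasta=None, clustal=None, pir=None, po=None):
    """Return an argument list for poa (hypothetical helper, not part of poa)."""
    cmd = ["poa"]
    if fasta:
        cmd += ["-read_fasta", fasta]
    # Several output formats may be requested in the same run.
    for flag, path in (("-clustal", clustal), ("-pir", pir), ("-po", po)):
        if path:
            cmd += [flag, path]
    cmd.append(matrix)  # the only argument without a flag
    return cmd


if __name__ == "__main__":
    print(" ".join(build_poa_command("blosum80.mat",
                                     fasta="multidom.seq",
                                     clustal="multidom.aln",
                                     po="multidom.po")))
```

The returned list can be passed to subprocess.run() to launch the alignment.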

4.  Other Output:

i. Score Matrix
Poa can also print to stdout the score matrix stored in the '.mat' file.  
The command line argument to get poa to do this is '-printmatrix LETTERSET',
where LETTERSET is a string of letters to be printed with the score matrix.
For example, if the score matrix is designed for protein alignment, the 
letter set might be 'ARNDCQEGHILKMFPSTWYV'.

ii. Verbose Mode
Poa can run in verbose mode, printing additional information generated  
during the run to stdout.  The command line argument to run poa in 
verbose mode is '-v'.

B.  Analyzing a PO-MSA
-----------------------
Poa can also take a PO format file as input and rebuild the PO-MSA data
structure.  Once this data structure has been rebuilt, it may be analyzed
for features.  In 'liblpo.a', the linkable poa library, we have included the
functions necessary to do heaviest bundling and thereby find consensus
sequences in the PO-MSA (the details of the heaviest bundling algorithm 
are described elsewhere).  Poa has been written so that users may create 
their own functions for analyzing a PO-MSA.  The 'liblpo.a' library does 
not include the functions that we wrote to analyze PO-MSAs constructed 
with ESTs and genome sequence to find SNPs and alternative splice sites.  
However, it is possible to design modular library functions that look for 
highly specific biological features in any PO-MSA data structure.

1.  Required Input:

Before the PO-MSA data structure can be analyzed it must be built.  It can
be built either from a PO file or from a FASTA file, or from both a PO file 
and a FASTA file.

Note:  POA Requires Either A PO File Or A FASTA File
If neither file is read in, POA terminates early, since it has not 
received any sequence data.

i. A PO file:
Poa will read in a PO formatted file.  The command line argument to get
poa to read in a PO formatted file and rebuild the PO-MSA data structure
is '-read_po FILENAME'.  

It is possible to filter the PO-MSA data structure as it is being rebuilt.  In 
order to filter the PO-MSA in the PO file to include only a subset of sequences 
use the command line argument '-subset FILENAME', where the file named FILENAME 
contains the list of sequence names to be included in the new PO-MSA. In order 
to filter the PO-MSA in the PO file to exclude a subset of sequences use the 
command line argument '-remove FILENAME', where the file named FILENAME contains 
the list of sequences to be excluded from the new PO-MSA.  The names of 
sequences to be included or excluded should be in the format 'SOURCENAME= *'
as they are in the PO file.  Lists of sequence source names can be created by 
using the unix grep utility on the PO file.
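
As an alternative to grep, the name list can be collected with a short Python
script.  This is a minimal sketch: the function name and the file names in
the usage comment are hypothetical, and it assumes each name line in the PO
file begins with 'SOURCENAME=' as in the example in section III.  Lines are
copied verbatim so that they match the PO file exactly.

```python
def source_name_lines(po_text, keep=None):
    """Return the SOURCENAME lines of a PO file, unchanged.

    keep: optional predicate on the whole line, used to select a subset.
    """
    lines = [ln for ln in po_text.splitlines() if ln.startswith("SOURCENAME=")]
    if keep is not None:
        lines = [ln for ln in lines if keep(ln)]
    return lines


if __name__ == "__main__":
    example = "\n".join([
        "SOURCENAME=GRB2_HUMAN",
        "SOURCEINFO=217 10 0 3 GROWTH FACTOR RECEPTOR-BOUND PROTEIN 2",
        "SOURCENAME=CONSENS0",
        "SOURCEINFO=200 0 0 0 consensus",
    ])
    # Keep every source except the consensus sequences.
    for ln in source_name_lines(example, keep=lambda s: "CONSENS" not in s):
        print(ln)
    # The selected lines could then be written to a file (for example
    # 'names.txt') and passed to poa with '-subset names.txt'.
```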

ii. A FASTA File:
The FASTA file should contain sequences to be aligned by poa.  The command line 
argument to get poa to accept a FASTA file as input is '-read_fasta FILENAME'.  
Poa will interpret FILENAME as the FASTA sequence file.  An example file, 
multidom.seq, is provided in this directory.  (See note above on case sensitivity).

Note:  POA Can Take Both A PO File And A FASTA File As Input
If both the '-read_po FILENAME' argument and the '-read_fasta FILENAME' 
argument are given to poa on the command line, then poa will first rebuild the 
PO-MSA in the PO file, and then it will align the sequences in the FASTA file 
to this PO-MSA.   

2.  Additional PO Utilities:

i.  Consensus Generation Via Heaviest Bundling Algorithm:
The heaviest bundling algorithm finds consensus sequences in the
PO-MSA.  The command line argument for heaviest bundling is '-hb'.  This
function adds the new consensus sequences to the PO-MSA by storing new
consensus sequence indices in the PO-MSA nodes corresponding to 
the consensus sequence paths.  The sequence source names for consensus 
sequences generated by heaviest bundling are CONSENS'i', where 'i' is the
index of the bundle corresponding to the consensus sequence.

The heaviest bundling algorithm can also take as input a bundling 
threshold value.  The command line argument for setting a bundling 
threshold value for heaviest bundling is '-hbmin VALUE'.  This threshold 
is used during the process of associating sequences with bundles.  If a 
sequence has a percentage of nodes shared with bundle 'i' greater than this 
threshold value, it is associated with bundle 'i'.  Iterative heaviest 
bundling can also be affected by the bundling threshold.  A detailed 
description of heaviest bundling and heaviest bundling thresholds is given 
elsewhere.  The consensus sequences corresponding to bundles generated
by heaviest bundling are listed in the sequence source list.  Additionally,
the SOURCEINFO line for each sequence gives the index of the bundle to which
that sequence belongs.  Finally, using the command line argument 
'-best' restricts the MSA output to the consensus sequences generated 
by heaviest bundling.   
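
The thresholding rule described above can be illustrated as follows.  This is
only a sketch of the comparison step, not poa's implementation of heaviest
bundling.  It assumes that "nodes shared" means the sequence's nodes that lie
on the bundle's consensus path, and that the threshold is expressed as a
fraction between 0 and 1.  Both function names are hypothetical.

```python
def shared_fraction(seq_nodes, bundle_nodes):
    """Fraction of a sequence's nodes that also belong to the bundle."""
    seq_nodes = set(seq_nodes)
    if not seq_nodes:
        return 0.0
    return len(seq_nodes & set(bundle_nodes)) / len(seq_nodes)


def belongs_to_bundle(seq_nodes, bundle_nodes, threshold):
    """Apply the rule: associate the sequence if its share exceeds the threshold."""
    return shared_fraction(seq_nodes, bundle_nodes) > threshold


if __name__ == "__main__":
    seq = [1, 2, 3, 4]        # node indices visited by one sequence
    bundle = [2, 3, 4, 5]     # node indices on a consensus path
    print(shared_fraction(seq, bundle))          # 3 of 4 nodes are shared
    print(belongs_to_bundle(seq, bundle, 0.5))
    print(belongs_to_bundle(seq, bundle, 0.9))
```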

III.  PO FILE FORMAT

****************************HEADER****************************************
VERSION= ~Current version of poa~
NAME=  ~Name of PO-MSA.  Defaults to name of 1st sequence in PO-MSA~
TITLE=  ~Title of PO-MSA.  Defaults to title of 1st sequence in PO-MSA~
LENGTH=  ~Number of nodes in PO-MSA~
SOURCECOUNT=  ~Number of sequences in PO-MSA~

*********************SEQUENCE SOURCE LIST*********************************

/* For each sequence in the PO-MSA: */
SOURCENAME= ~Name of sequence taken from FASTA sequence header~
SOURCEINFO= ~Number of nodes in sequence~ 
            ~Index of first node containing sequence~
            ~Sequence weight~
            ~Index of bundle containing sequence~
            ~Title of sequence taken from FASTA sequence header~

/* Example: */
SOURCENAME=GRB2_HUMAN
SOURCEINFO=217 10 0 3 GROWTH FACTOR RECEPTOR-BOUND PROTEIN 2 (GRB2 ADAPTOR PROTEIN)(SH2)
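
/* Parsing sketch: */
A SOURCEINFO line can be split into the five fields listed above.  The Python
below is a minimal sketch, not code from poa.  The function name and the
dictionary keys are hypothetical, and it assumes the first four fields are
whitespace-separated integers (as in the example) with the title taking up
the rest of the line.

```python
def parse_sourceinfo(line):
    """Split one SOURCEINFO line into its fields (illustrative only)."""
    prefix = "SOURCEINFO="
    if not line.startswith(prefix):
        raise ValueError("not a SOURCEINFO line")
    # At most four splits: the fifth field (title) may contain spaces.
    fields = line[len(prefix):].split(None, 4)
    if len(fields) < 4:
        raise ValueError("expected at least four numeric fields")
    return {
        "node_count": int(fields[0]),
        "first_node": int(fields[1]),
        "weight": int(fields[2]),   # assumption: integer weight, as in the example
        "bundle": int(fields[3]),
        "title": fields[4] if len(fields) > 4 else "",
    }


if __name__ == "__main__":
    info = parse_sourceinfo(
        "SOURCEINFO=217 10 0 3 GROWTH FACTOR RECEPTOR-BOUND PROTEIN 2 "
        "(GRB2 ADAPTOR PROTEIN)(SH2)")
    print(info["node_count"], info["first_node"], info["bundle"])
    print(info["title"])
```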

********************PO-MSA DATA STRUCTURE*********************************

/* For each node in the PO-MSA:  */
~Residue label~:~'L' delimited index list of other nodes with edges into node~
                ~'S' delimited index list of sequences stored in each node~
                ~'A' index of next node in same align ring~ 
                     NB: align ring indices must form a cycle.
                     e.g. if two nodes 121 and 122 are aligned, then 
                     the line for node 121 indicates "A122", and
                     the line for node 122 indicates "A121".

/* Example: */
F:L156L155L22S2S3S7A158
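
/* Parsing sketch: */
A node line can be decoded with a small amount of Python.  This is a minimal
sketch under the format described above, not code from poa.  The function
names and dictionary keys are hypothetical, and it assumes a single-character
residue label followed by 'L', 'S' and 'A' entries with decimal indices.  The
second function checks the rule that align ring indices must form a cycle.

```python
import re

NODE_RE = re.compile(r"^(?P<label>[^:]):(?P<body>(?:[LSA]\d+)*)$")


def parse_node(line):
    """Decode one node line into label, incoming edges, sequences, align link."""
    m = NODE_RE.match(line.strip())
    if m is None:
        raise ValueError("not a node line: %r" % line)
    node = {"label": m.group("label"), "in_edges": [],
            "sequences": [], "align_next": None}
    for tag, num in re.findall(r"([LSA])(\d+)", m.group("body")):
        if tag == "L":
            node["in_edges"].append(int(num))
        elif tag == "S":
            node["sequences"].append(int(num))
        else:
            node["align_next"] = int(num)
    return node


def ring_is_cycle(align_next, start):
    """Follow 'A' links from start; True only if they lead back to start."""
    seen = set()
    node = start
    while node not in seen:
        seen.add(node)
        node = align_next.get(node)
        if node is None:
            return False
    return node == start


if __name__ == "__main__":
    print(parse_node("F:L156L155L22S2S3S7A158"))
    print(ring_is_cycle({121: 122, 122: 121}, 121))
```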

********************END***************************************************


For more information, see http://www.bioinformatics.ucla.edu/poa.

