LAST: Genome-Scale Sequence Comparison
======================================

Introduction
------------

LAST is software for comparing and aligning sequences, typically DNA
or protein sequences.  LAST is similar to BLAST, but it copes better
with huge amounts of sequence data.  It can also report probabilities
for every pair of aligned letters, indicating the reliability of each
pairing.


Requirements
------------

To handle mammalian genomes, you will need at least 2 gigabytes of
RAM, and a few tens of gigabytes of disk space.  To install the
software, you need a C++ compiler.

Optional: to run the scripts, you need a Unix-like environment with
Python.  To make dotplots, you need the Python Imaging Library.

Luxury: to handle mammalian genomes with maximum efficiency, it's good
to have about 16 gigabytes of RAM (and use it with "lastdb -s16G").


Installation
------------

Just go into the src directory and type 'make'.  This should make two
programs: lastdb and lastal.  (If you checked it out using subversion,
then type 'make' in the top-level directory, not the src directory.)
Run the programs without arguments to get usage messages.


Example 1: Compare the human and fugu mitochondrial genomes
-----------------------------------------------------------

You can find these sequences in the examples directory: humanMito.fa
and fuguMito.fa.  Firstly, make a LAST database of the human
sequence::

  lastdb -c -m110 humanMito humanMito.fa

This will make some new files whose names begin with "humanMito".
Here, we used "-c" to soft-mask lowercase letters.  (Lowercase
indicates repetitive sequence, and "soft-masking" helps to avoid
uninteresting repetitive alignments.)  We also used "-m110" to skip
every third position when matching: this makes the search more
sensitive for protein-coding DNA (and, to some extent, for non-coding
DNA).
Secondly, compare the fugu sequence to the human database::

  lastal -o myalns.maf humanMito fuguMito.fa

This will write alignments in a file called "myalns.maf".  To view the
alignments, you'll want to avoid text-wrapping, e.g. 'less -S
myalns.maf'.

For an example of aligning multiple mitochondrial genomes, see
multiMito.sh in the examples directory.


Example 2: Compare the cat and mouse genomes
--------------------------------------------

Let's assume you have the cat and mouse genomes in FASTA-format files:
cat/chr*.fa and mouse/chr*.fa.  We'll assume also that repetitive
regions are in lowercase.  We can compare them using the same steps as
above::

  lastdb -c -m110 -v mousedb mouse/chr*.fa
  lastal -o myalns.maf -v mousedb cat/chr*.fa

The "-v" (verbose) option just makes each program write progress
messages to the screen.  Next, we might want to remove paralogs or make
a dotplot: see
the accompanying document last-scripts.txt.


Example 3: Map short sequence tags to the human genome
------------------------------------------------------

Let's assume you have the human genome and tag sequences in
FASTA-format files: human/chr*.fa and tags.fa.  This time, we will not
mask repeats, because we want to map repetitive tags too::

  lastdb -v humandb human/chr*.fa
  lastal -o myalns.maf -a2 -e30 -v humandb tags.fa

Here, we used "-a2" to set the gap existence cost to 2, and "-e30" to
get alignments with score >= 30.  The appropriate score parameters
depend on how long the tags are and how many errors you want to allow:
the default scoring scheme assigns +1 to each match and -1 to each
mismatch.  For more ideas on tag mapping, see the accompanying
document tag-seeds.txt.
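
As a rough guide to choosing "-e", the following Python sketch (not
part of LAST) counts how many mismatches a tag can contain and still
reach a given score.  It assumes a gapless alignment that covers the
whole tag, and the +1/-1 scores mentioned above.  The function name and
the tag lengths are made up for illustration.  Real alignments are
local, so lastal may instead report a shorter alignment that trims
mismatches near the ends.

```python
def max_mismatches(tag_length, min_score, match_score=1, mismatch_cost=1):
    """Largest number of mismatches in a full-length gapless alignment
    that still scores >= min_score, or -1 if even a perfect match fails."""
    perfect = tag_length * match_score
    if perfect < min_score:
        return -1
    # Each mismatch loses one match score and also pays the mismatch cost.
    return (perfect - min_score) // (match_score + mismatch_cost)


# Hypothetical tag lengths, with the "-e30" threshold used above.
for length in (25, 30, 36, 50):
    print(length, max_mismatches(length, 30))
```

So with "-e30", a 30-base tag has to match perfectly, and tags shorter
than 30 bases can never be reported.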


Output Formats
--------------

lastal can write alignments in two formats: tabular and MAF.  MAF
format looks like this::

  a score=15
  s chr3L        19433515 23 + 24543557 TTTGGGAGTTGAAGTTTTCGCCC
  s H04BA01F1907        2 21 +       25 TTTGGGAGTTGAAGGTT--GCCC
  p 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.85 0.759 0.662 - - 0.533 0.574 0.593 0.564

Lines starting with "s" contain: the sequence name, the start position
of the alignment, the number of nucleotides in the alignment, the
strand, the total size of the sequence, and the aligned nucleotides.
If the alignment starts at the beginning of the sequence, the start
position is zero.  If the strand is "-", the start position is as if
we had used the reverse-complemented sequence.  The line starting with
"p" contains the probability of each pair of aligned letters.  The
same alignment in tabular format looks like this::

  15 chr3L 19433515 23 + 24543557 H04BA01F1907 2 21 + 25 17,2:0,4

The final column shows the sizes and offsets of gapless blocks in the
alignment.  In this case, we have a block of size 17, then an offset
of size 2 in the upper sequence and 0 in the lower sequence, then a
block of size 4.  Probabilities are not shown in this format.
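
To make the block notation concrete, here is a small Python sketch (not
part of LAST; the function names are made up for illustration).  It
expands the final column into the gapless blocks and their start
positions in both sequences, using the tabular line above as input.

```python
def parse_blocks(blocks):
    """Turn a block string such as "17,2:0,4" into a list of
    (kind, upper_size, lower_size) tuples."""
    parsed = []
    for item in blocks.split(","):
        if ":" in item:
            # An offset: unaligned letters in the upper and lower sequence.
            upper, lower = item.split(":")
            parsed.append(("offset", int(upper), int(lower)))
        else:
            # A gapless block covers the same number of letters in both.
            size = int(item)
            parsed.append(("block", size, size))
    return parsed


def gapless_blocks(upper_start, lower_start, blocks):
    """Yield (upper_start, lower_start, size) for each gapless block."""
    for kind, upper_size, lower_size in parse_blocks(blocks):
        if kind == "block":
            yield upper_start, lower_start, upper_size
        upper_start += upper_size
        lower_start += lower_size


line = "15 chr3L 19433515 23 + 24543557 H04BA01F1907 2 21 + 25 17,2:0,4"
fields = line.split()
for block in gapless_blocks(int(fields[2]), int(fields[7]), fields[11]):
    print(block)
```

The sizes in the block column add up to the alignment sizes given
earlier in the line: 17 + 2 + 4 = 23 for the upper sequence, and
17 + 0 + 4 = 21 for the lower one.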


Steps in lastal
---------------

1) Find initial matches:
     keep those with multiplicity <= m and length >= l.

2) Extend gapless alignments from the initial matches:
     keep those with score >= d.

3) Extend gapped alignments from the gapless alignments:
     keep those with score >= e.

4) Non-redundantize the gapped alignments:
     remove those that share an endpoint with a higher-scoring alignment.

5) Calculate probabilities (OFF by default).

6) Redo the gapped extensions using centroid alignment (OFF by default).
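
Step 4 can be illustrated with a small Python sketch.  This is only one
reading of the one-line description above, not lastal's actual code:
alignments are reduced to a score plus start and end coordinates in the
two sequences (a made-up representation), strands are ignored, and
alignments with equal scores are both kept.

```python
from collections import namedtuple

# Hypothetical minimal record: coordinates in sequence 1 and sequence 2.
Alignment = namedtuple("Alignment", "score start1 start2 end1 end2")


def non_redundantize(alignments):
    """Drop every alignment that shares its start point or its end point
    with a strictly higher-scoring alignment."""
    best_at_start = {}
    best_at_end = {}
    for aln in alignments:
        start = (aln.start1, aln.start2)
        end = (aln.end1, aln.end2)
        best_at_start[start] = max(best_at_start.get(start, aln.score), aln.score)
        best_at_end[end] = max(best_at_end.get(end, aln.score), aln.score)
    return [aln for aln in alignments
            if aln.score >= best_at_start[(aln.start1, aln.start2)]
            and aln.score >= best_at_end[(aln.end1, aln.end2)]]


alignments = [
    Alignment(50, 100, 200, 180, 280),
    Alignment(40, 100, 200, 150, 250),  # same start as the first one
    Alignment(35, 120, 230, 180, 280),  # same end as the first one
    Alignment(30, 300, 400, 340, 440),  # shares nothing
]
for aln in non_redundantize(alignments):
    print(aln)
```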


What the Probabilities Mean
---------------------------

The probabilities indicate the reliability of each pairing, *assuming
that the alignment is not wholly spurious*.  In more detail, gapped
alignments are extended (in step 3) from either side of a "core"
alignment (derived from the gapless alignment of step 2).  The
probabilities are contingent upon the core being correct, and pairings
within the core automatically get a probability of 1.  So the
probabilities give no indication of whether an alignment is wholly
spurious.  For that, please use the accompanying E-value tables.

The probabilities are calculated as follows.  We assume that each
gapped extension has probability proportional to: exp(lambda * score).
Here, lambda is the scale parameter of the scoring matrix (YK Yu et
al. 2003, PNAS 100(26):15688-93).  Then, the probability of each
letter-pair is the sum of the probabilities of all possible gapped
extensions that include this pairing.
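
The definition can be spelled out with a toy Python example.  It is a
sketch under stated assumptions, not lastal's method: the three
candidate extensions, their scores, and the value of lambda are all
invented, and the extensions are listed explicitly, which is only
feasible for tiny inputs.

```python
import math


def pair_probabilities(extensions, lam):
    """extensions: list of (score, pairs), where pairs is a set of (i, j)
    letter positions aligned to each other.  Returns {pair: probability}."""
    # Each extension gets a weight proportional to exp(lambda * score).
    weights = [math.exp(lam * score) for score, _ in extensions]
    total = sum(weights)
    probs = {}
    for weight, (_, pairs) in zip(weights, extensions):
        for pair in pairs:
            # Sum the normalized weights of all extensions with this pair.
            probs[pair] = probs.get(pair, 0.0) + weight / total
    return probs


# Made-up candidate extensions and a made-up lambda.
extensions = [
    (10, {(0, 0), (1, 1), (2, 2)}),
    (8, {(0, 0), (1, 1), (2, 3)}),
    (6, {(0, 0), (1, 2), (2, 3)}),
]
for pair, prob in sorted(pair_probabilities(extensions, 0.5).items()):
    print(pair, round(prob, 3))
```

A pairing that appears in every candidate extension, like (0, 0) here,
gets probability 1.  A pairing that appears only in low-scoring
extensions gets a small probability.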


Options for lastdb
------------------

Main Options
~~~~~~~~~~~~

-h  Show all options and their default settings.

-p  Interpret the sequences as proteins.  The default is to interpret
    them as DNA.

-c  Be case-sensitive: lowercase letters will then be forbidden in
    initial matches (even in skipped positions).

-m  Specify skipped positions in initial matches, e.g. "-m 110101". In
    this example, every third and fifth position out of six will be
    skipped.
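
The effect of a "-m" pattern can be shown with a toy Python sketch.  It
is not lastdb's code: the function names are invented, it assumes the
pattern simply repeats along the match (as "every third position"
suggests), and it ignores lowercase masking.

```python
def skipped_positions(pattern):
    """1-based positions within one copy of the pattern that are skipped."""
    return [i + 1 for i, symbol in enumerate(pattern) if symbol == "0"]


def seed_match_length(pattern, seq1, pos1, seq2, pos2):
    """Length of the match starting at pos1 in seq1 and pos2 in seq2,
    where letters under a '0' of the repeated pattern need not agree."""
    length = 0
    while pos1 + length < len(seq1) and pos2 + length < len(seq2):
        must_match = pattern[length % len(pattern)] == "1"
        if must_match and seq1[pos1 + length] != seq2[pos2 + length]:
            break
        length += 1
    return length


print(skipped_positions("110101"))

# Two made-up sequences that differ at every third position.
a = "ACGACTAGA"
b = "ACTACGACC"
print(seed_match_length("111", a, 0, b, 0))  # every position must match
print(seed_match_length("110", a, 0, b, 0))  # third positions are skipped
```

With the "110" pattern, the match continues through differences at
third positions, which is where protein-coding DNA tends to vary most.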


Advanced Options
~~~~~~~~~~~~~~~~

-s  Split large databases into "volumes" of at most the specified
    number of bytes (excluding buckets).  If a single sequence exceeds
    this amount, however, it is not split.  The default is tuned for 2
    gigabytes of RAM: if you have more, increase this to make lastal
    go faster.  You can use suffixes K, M, and G to specify kibibytes,
    mebibytes, and gibibytes.

-w  Allow initial matches to start only at every "w"th position in each
    database sequence.  This reduces time and storage requirements, at
    the expense of sensitivity.  To emulate BLAT, use "-w 11".

-u  Use a subset seed in the specified file.  The -m option will then
    be ignored.  For an example of the format, see yass.seed in the
    examples directory.

-a  Specify your own alphabet, e.g. "-a 0123".  The default (DNA)
    alphabet is equivalent to "-a ACGT".  The protein alphabet (-p) is
    equivalent to "-a ACDEFGHIKLMNPQRSTVWY".  Non-alphabet letters are
    allowed in sequences, but by default they are forbidden in initial
    matches (even in skipped positions) and get the mismatch score
    when aligned to anything.

-b  Specify the depth of "buckets" used to accelerate initial match
    finding.  The deeper the faster, but the more memory is needed.
    The default is to use the maximum depth that consumes at most one
    byte per possible match start position.  This option has no effect
    on the results.

-v  Be verbose: write messages about what lastdb is doing.
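
For reference, the size suffixes accepted by "-s" are binary multiples.
This Python helper (a hypothetical illustration, not lastdb's own
parser) converts such a value to bytes:

```python
def parse_size(text):
    """Convert a size such as "16G", "512M", "64K" or "1000" to bytes."""
    multipliers = {"K": 2**10, "M": 2**20, "G": 2**30}
    suffix = text[-1:]
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    return int(text)


for size in ("64K", "512M", "2G", "16G"):
    print(size, parse_size(size))
```

So "lastdb -s16G" asks for volumes of at most 16 * 2^30 bytes each.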


Options for lastal
------------------

Main Options
~~~~~~~~~~~~

-h  Show all options and their default settings.

-o  Write output to the specified file, instead of the screen.

-s  Specify which query strand should be used: 0 means reverse only, 1
    means forward only, and 2 means both.

-f  Choose the output format: 0 means tabular and 1 means MAF.


Score Parameters
~~~~~~~~~~~~~~~~

-r  Match score.

-q  Mismatch cost.

-p  Obtain match and mismatch scores from the specified file.  The -r
    and -q options will then be ignored.  For examples of the format,
    see HOXD70 and TiTv212 in the examples directory.  Any letters
    that aren't in the file will get the lowest score in the file when
    aligned to anything.

-a  Gap existence cost.

-b  Gap extension cost.  A gap of size k costs: a + b*k.

-c  This option allows use of "generalized affine gap costs" (SF
    Altschul 1998, Proteins 32(1):88-96).  Here, a "gap" may consist
    of unaligned regions of both sequences.  If these unaligned
    regions have sizes j and k, where j <= k, the cost is: a + b*(k-j)
    + c*j.  If c >= a + 2b (the default), it reduces to standard
    affine gaps.

-F  Align DNA queries to a protein database, using the specified
    frameshift cost.  A value of 15 seems to be reasonable.

-x  Maximum score dropoff for gapped alignments.  Gapped alignments
    are forbidden from having any internal region with score < -x.
    This serves two purposes: accuracy (avoid spurious internal
    regions in alignments) and speed (the smaller the faster).

-y  Maximum score dropoff for gapless alignments.

-d  Minimum score for gapless alignments.  For guidance on choosing
    this parameter, see the accompanying E-value tables.

-e  Minimum score for gapped alignments.  For guidance on choosing
    this parameter, see the accompanying E-value tables.
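
The two gap cost formulas above can be compared with a few lines of
Python.  This is a sketch, not lastal's code, and the parameter values
(a=7, b=1) are arbitrary illustrations rather than lastal's defaults.
It compares the generalized cost of an unaligned region with the cost
of treating it as two ordinary gaps, one in each sequence.

```python
def affine_gap_cost(k, a, b):
    """Cost of an ordinary gap of size k."""
    return a + b * k


def generalized_gap_cost(j, k, a, b, c):
    """Cost of a generalized gap whose unaligned regions have sizes j and k."""
    if j > k:
        j, k = k, j  # the formula assumes j <= k
    return a + b * (k - j) + c * j


a, b = 7, 1  # illustrative values only
for c in (2, a + 2 * b):
    for j, k in ((1, 1), (1, 4), (2, 3)):
        generalized = generalized_gap_cost(j, k, a, b, c)
        two_gaps = affine_gap_cost(j, a, b) + affine_gap_cost(k, a, b)
        print("c=%d j=%d k=%d generalized=%d two_gaps=%d"
              % (c, j, k, generalized, two_gaps))
```

When c >= a + 2b, the generalized form is never cheaper than two
ordinary gaps, so it gives no advantage.  A smaller c makes it cheaper
to leave stretches of both sequences unaligned.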


Miscellaneous Options
~~~~~~~~~~~~~~~~~~~~~

-u  Specify treatment of lowercase letters for gapless and gapped
    extensions.  0 means mask them for neither stage; 1 means mask
    them for gapless extensions but not for gapped extensions; 2 means
    mask them for both stages.  "Mask" means give them the worst
    mismatch score when aligned to anything.  Note that treatment of
    lowercase for initial matches is set by lastdb's -c option.

-m  Maximum multiplicity for initial matches.  Each initial match is
    lengthened until it occurs at most this many times in the database
    volume.

-l  Minimum length for initial matches.  (Skipped positions are
    included in the length.)

-k  Look for initial matches starting only at every "k"th position in
    the query.  This increases speed at the expense of sensitivity.

-i  Search queries in batches of at most this many bytes.  If a single
    sequence exceeds this amount, however, it is not split.  You can
    use suffixes K, M, and G to specify kibibytes, mebibytes, and
    gibibytes.  This option has no effect on the results (apart from
    their order).  Higher values can reduce disk reads.

-w  This option is a kludge to avoid catastrophic time and memory
    usage when self-comparing a large sequence.  If a large identical
    match is found, then gapped alignments will not be triggered from
    repeats (typically tandem repeats) within the identical match
    whose start positions are offset by this distance or less.  Use
    "-w 0" to turn this off.

-t  Temperature for calculating probabilities: make the probability
    of each gapped extension proportional to exp(score / t).

-g  This option allows use of "gamma-centroid alignment" (M Hamada et
    al. 2009, Bioinformatics 25(4):465-73).  Such alignments only
    include pairings with probability > 1/(1+g).  When g=1, this is
    the same as "centroid alignment" (LE Carvalho & CE Lawrence 2008,
    PNAS 105(9):3209-14).  When lastal does (gamma-)centroid
    alignment, it does not report the usual alignment score.  Instead,
    it reports: sum[prob * (1+g) - 1].

-G  Use an alternative genetic code in the specified file.  For an
    example of the format, see vertebrateMito.gc in the examples
    directory.  By default, the standard genetic code is used.  This
    option has no effect unless DNA-versus-protein alignment is
    selected with option -F.

-v  Be verbose: write messages about what lastal is doing.

-j  Output type: 0 means counts of initial matches (of all sizes); 1
    means gapless alignments; 2 means gapped alignments before
    non-redundantization; 3 means gapped alignments after
    non-redundantization; 4 means alignments with probabilities; 5
    means centroid alignments.  Match counts (-j 0) respect the
    minimum length option (-l) but not the maximum multiplicity option
    (-m).
    It's a bad idea to try -j 0 when comparing a large sequence to
    itself.

-Q  This option allows lastal to use sequence quality scores for the
    queries.  0 means read queries in FASTA format (without quality
    scores); 1 means FASTQ-Sanger format; 2 means FASTQ-Solexa format;
    3 means PRB format.  The FASTQ formats look like this::

      @mySequenceName
      TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
      +
      55555555*&5-/55*5//5(55,5#&$)$)*+$

    The "+" may optionally be followed by a name (ignored), and the
    sequence and quality codes are allowed to wrap onto more than one
    line.  For FASTQ-Sanger, the quality scores are obtained by
    subtracting 33 from the ASCII values of the characters below the
    "+", and for FASTQ-Solexa, they are obtained by subtracting 64.
    PRB format stores four quality scores (A, C, G, T) per position,
    with one sequence per line, like this::

      -40   40  -40  -40      -12    1  -12   -3      -10   10  -40  -40

    Since PRB does not store sequence names, lastal uses the line
    number (starting from 1) as the name.  In FASTQ-Sanger format, the
    quality scores are related to error probabilities like this:
    qScore = -10 * log10(p).  In FASTQ-Solexa and PRB, however, qScore =
    -10 * log10(p / (1-p)).  In lastal's MAF output, the quality scores are
    written on lines starting with "q".  For FASTQ, they are written
    with the same encoding as the input.  For PRB, they are written in
    the FASTQ-Solexa (ASCII-64) encoding.

    The quality scores influence alignment scores as follows.  Let Qiy
    be the probability that the base at position i is y (y = A, C, G,
    or T).  Let Sxy be the scoring matrix, and let T be the
    "temperature" parameter (by default 1/lambda).  Then, the score
    for aligning base x (A, C, G, or T) to position i is::

      Rix = T * ln[ sum(y){ Qiy * exp[ Sxy / T ] } ]
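
The quality-score formulas for option -Q can be tried out with the
following Python sketch.  It is an illustration under stated
assumptions, not lastal's code.  FASTQ gives only one error probability
per base, so the sketch assumes the error is spread evenly over the
three other bases.  It also uses a plain +1/-1 scoring matrix and an
arbitrary temperature of 1.  The function names are invented.

```python
import math

BASES = "ACGT"


def sanger_error_prob(char):
    """FASTQ-Sanger: qScore = ASCII - 33 = -10 * log10(p)."""
    q = ord(char) - 33
    return 10 ** (-q / 10)


def solexa_error_prob(char):
    """FASTQ-Solexa: qScore = ASCII - 64 = -10 * log10(p / (1-p))."""
    q = ord(char) - 64
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


def base_probs(called_base, error_prob):
    """Qiy for one position.  Assumption: errors are spread evenly."""
    return {y: (1 - error_prob if y == called_base else error_prob / 3)
            for y in BASES}


def quality_adjusted_score(x, probs, matrix, temperature):
    """Rix = T * ln[ sum(y){ Qiy * exp[ Sxy / T ] } ]"""
    total = sum(probs[y] * math.exp(matrix[x][y] / temperature) for y in BASES)
    return temperature * math.log(total)


# Illustrative +1/-1 matrix and temperature (not necessarily the defaults).
MATRIX = {x: {y: (1 if x == y else -1) for y in BASES} for x in BASES}
T = 1.0

for quality_char in "5*#":  # three quality codes from the example above
    probs = base_probs("A", sanger_error_prob(quality_char))
    print(quality_char,
          round(quality_adjusted_score("A", probs, MATRIX, T), 3),
          round(quality_adjusted_score("C", probs, MATRIX, T), 3))
```

As the error probability rises, the score for aligning to the called
base drops below the matrix value, and the penalty for aligning to a
different base shrinks.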


Credits & Citation
------------------

LAST was developed by Martin C. Frith, Michiaki Hamada, Toshiyuki
Sato, and Paul Horton in the Computational Biology Research Center.
Many thanks to Hajime Harada for setting up the repository and
website, and Takako Sugawara for making the logo.  LAST includes
public domain code kindly provided by Yi-Kuo Yu and Stephen Altschul
at the NCBI.  There is no journal publication yet, so please cite the
website: http://last.cbrc.jp/.


Questions, Comments, Problems
-----------------------------

Please email: last (ATmark) cbrc (dot) jp.  If reporting a problem,
please describe exactly how to trigger the problem.
