Frequently Asked Questions (FAQ) for mpiBLAST
- How is mpiBLAST-PIO different from mpiBLAST?
- How accurate are the E-value statistics?
- How does mpiBLAST output differ from NCBI blastall output?
- Does mpiBLAST support PSI-BLAST, PHI-BLAST, RPS-BLAST, etc.?
- Does mpiBLAST support Mega-BLAST?
- I have a cluster with yy processors. How many database fragments should I use?
- I have a cluster with yy processors. How many MPI processes should I start with mpirun -np?
- I have hyperthreaded processors. How many MPI processes should I start with mpirun -np?
- Can mpiBLAST run without local storage?
- Can mpiBLAST run without a shared filesystem?
- Can mpiBLAST be run on a single processor system for testing purposes?
- I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!
- Does mpiBLAST run on Mac OS X?
- Does mpiBLAST work with AIX 5.2 and the IBM VisualAge compiler?
- How do I compile mpiBLAST from CVS?
- Is LAM-MPI corrupting memory?
- How do I format a huge database?
How is mpiBLAST-PIO different from mpiBLAST?
mpiBLAST-PIO uses parallel I/O techniques to greatly speed up mpiBLAST. As of mpiBLAST-1.5.0-PIO, these techniques do not require a parallel filesystem. If you do have a parallel filesystem, mpiBLAST-PIO can still take advantage of it. For this reason, the non-PIO codebase is no longer being actively developed by the mpiBLAST team.
How accurate are the E-value statistics?
In mpiBLAST 1.3 they are exact for all supported search types, where "exact" means exactly the same as those generated by NCBI-BLAST. In versions 1.2.1 and earlier, E-values for blastn were loosely approximated using a linear equation, and E-values for blastp, blastx, tblastn, and tblastx were inaccurate. As of 2004, NCBI is still refining the E-value calculations in its BLAST implementation.
How does mpiBLAST output differ from NCBI blastall output?
As of mpiBLAST 1.3, the text, XML, and ASN.1 output formats are nearly identical to those of NCBI blastall. When an individual query has multiple database hits with the same e-value and bit score, mpiBLAST may report these hits in a different order than NCBI's blastall. Further, mpiBLAST does not report some search statistics, such as the number of hits to the database.
Does mpiBLAST support PSI-BLAST, PHI-BLAST, RPS-BLAST, etc.?
No. Although it may be possible to parallelize these search algorithms using database segmentation, our preliminary studies indicate they would not benefit as much as the other blast search types do from such a parallelization scheme.
Does mpiBLAST support Mega-BLAST?
No. We are focusing our efforts on blastn, blastp, blastx, tblastn, and tblastx.
I have a cluster with yy processors. How many database fragments should I use?
yy-1
I have a cluster with yy processors. How many MPI processes should I start with mpirun -np?
Start yy + 1 mpiblast processes. This will start one mpiBLAST worker per processor, plus one output and one scheduler process. The minimum value that can be specified to -np is 2.
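The two answers above reduce to simple arithmetic on the processor count. The shell sketch below assumes a hypothetical cluster of 8 processors. The variable names are illustrative only, and the script merely prints the resulting values rather than launching anything.

```shell
# Hypothetical processor count (the "yy" in the questions above).
NPROCS=8

# Database fragments: yy - 1, per the answer above.
FRAGMENTS=$((NPROCS - 1))

# Value for mpirun -np: yy + 1, per the answer above.
NP=$((NPROCS + 1))

echo "database fragments: $FRAGMENTS"
echo "mpirun -np $NP mpiblast ..."
```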
I have hyperthreaded processors. How many MPI processes should I start with mpirun -np?
Our experience indicates that BLAST jobs tend to be limited by memory bandwidth more than by CPU speed. Since the virtual CPUs in a hyperthreaded setup share the same memory bus, the benefit of running an additional mpiBLAST process for each virtual CPU is usually negligible.
Can mpiBLAST run without local storage?
Yes, set the local storage path to be identical to the shared storage path.
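As a sketch only: assuming your installation reads its storage paths from the Shared and Local keys of an [mpiBLAST] section in its configuration file (check the installation notes for your version, since these key names are an assumption here), the shell function below prints such a section with both keys pointing at the same directory. The function name and the example path are hypothetical.

```shell
# Print an [mpiBLAST] configuration section whose local storage path is
# identical to the shared storage path.
# ASSUMPTION: the key names Shared and Local match your installation.
print_mpiblast_section() {
    storage_dir=$1
    printf '[mpiBLAST]\n'
    printf 'Shared=%s\n' "$storage_dir"
    printf 'Local=%s\n' "$storage_dir"
}

# Example with a hypothetical shared directory; review the output before
# copying it into your configuration file.
print_mpiblast_section /path/to/shared/storage
```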
Can mpiBLAST run without a shared filesystem?
Yes, as of mpiBLAST 1.3.0. The database and query can be stored on a remotely accessible filesystem and copied via rcp or scp. If the database and query reside on the node with rank 0, mpiBLAST can distribute them directly. See the --copy-via option in the Usage section of this document for details.
Can mpiBLAST be run on a single processor system for testing purposes?
Yes, simply execute the desired number of MPI processes using the -np flag. The minimum is -np 2.
I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!
mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.
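A rough way to judge whether a search falls in this regime is to compare the formatted database size with the RAM on one node. The sketch below assumes that the benefit appears once each node's share of the database fits in its memory. It computes the smallest number of equal fragments for which that holds. The function name and the sizes are hypothetical, and the calculation ignores memory used by the operating system and by BLAST itself, so treat the result as a lower bound.

```shell
# min_fragments DB_MB RAM_MB
# Smallest fragment count such that DB_MB / count <= RAM_MB
# (integer ceiling division).
min_fragments() {
    db_mb=$1
    ram_mb=$2
    echo $(( (db_mb + ram_mb - 1) / ram_mb ))
}

# Hypothetical example: a 2000 MB formatted database and 640 MB of RAM
# per node.
min_fragments 2000 640
```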
Does mpiBLAST run on Mac OS X?
mpiBLAST versions prior to 1.3.0 are not supported on Mac OS X. For versions 1.3.0 and later, it might be necessary to add the --disable-dependency-tracking option to configure.
Does mpiBLAST work with AIX 5.2 and the IBM VisualAge compiler?
Our (limited) experience building mpiBLAST on AIX revealed C header conflicts in the C++ build system. In order to work around the header conflicts, we created a duplicate copy of the /usr/vacpp/include directory, minus all files ending in .h except ansic_aix.h, xlocinfo.h, and yvals.h. When running configure, the environment variables CFLAGS and CXXFLAGS should contain the compiler flags -qnostdinc -I/usr/include -I/usr/local/include -I/path/to/duplicate/vacpp_inc_dir in addition to any other custom compiler flags. These flags include the standard system headers without including the conflicting VisualAge C++ headers.
Furthermore, automake's dependency tracking appears broken on AIX, so it is necessary to run configure with the --disable-dependency-tracking option. Another snag is that autoconf thinks it must define _LARGE_FILES on these systems, which breaks the build (open64 is not defined). We hope to fix this behavior in a future release, but for now it can be worked around by editing src/config.h to comment out the line #define _LARGE_FILES 1.
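The header workaround and the configure invocation described above might be scripted roughly as follows. This is a sketch, not a tested AIX recipe: the function name is hypothetical, /path/to/duplicate/vacpp_inc_dir is the placeholder from the text and must be replaced, and any other configure options or compiler flags your build needs still have to be added. The src/config.h edit is not included and must be done by hand after configure runs.

```shell
# dup_vacpp_headers SRC DST
# Copy SRC to DST, then delete every *.h file in DST except the three
# headers named in the workaround above.
dup_vacpp_headers() {
    src=$1
    dst=$2
    mkdir -p "$dst" &&
    cp -R "$src"/. "$dst"/ &&
    find "$dst" -type f -name '*.h' \
        ! -name ansic_aix.h ! -name xlocinfo.h ! -name yvals.h \
        -exec rm -f {} +
}

# Placeholder path from the text above; replace with a real directory.
VACPP_COPY=/path/to/duplicate/vacpp_inc_dir
AIX_FLAGS="-qnostdinc -I/usr/include -I/usr/local/include -I$VACPP_COPY"

# Only runs where the VisualAge headers and a configure script exist.
if [ -d /usr/vacpp/include ] && [ -x ./configure ]; then
    dup_vacpp_headers /usr/vacpp/include "$VACPP_COPY"
    CFLAGS="$AIX_FLAGS" CXXFLAGS="$AIX_FLAGS" \
        ./configure --disable-dependency-tracking
fi
```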
If anybody else has experience compiling applications with the VisualAge C++ compiler, we would appreciate feedback on how to make the build process smoother.
How do I compile mpiBLAST from CVS?
Please see the instructions on the development page.
Is LAM-MPI corrupting memory?
The message pattern generated by mpiblast appears to cause memory corruption between LAM and the Linux kernel. A workaround suggested by Jason Gans is to pass the -ssi rpi lamd flag to mpirun when running mpiblast:
mpirun -np 10 -ssi rpi lamd mpiblast ...
How do I format a huge database?
Large databases like nt can consume several gigabytes of disk space, so it is preferable to store them in compressed form. Starting with mpiBLAST 1.4.0, it is possible to pipe FASTA-formatted sequence data into mpiformatdb. This makes it possible to format a compressed (gzip, bzip2, etc.) database directly, using command line syntax like:
zcat nt.gz | mpiformatdb -i stdin -N 100 --skip-reorder -t nt -p F
mpiformatdb needs the --skip-reorder, -t <title> and -p <T|F> options to format a database piped via standard input.