mpiBLAST: Open-Source Parallel BLAST

User's Guide

To perform a search with mpiBLAST, the target BLAST database must first be formatted and segmented using mpiformatdb. Then mpiexec can be used to run mpiblast in parallel on several cluster nodes.

Formatting a database

Before BLAST queries can be processed, the sequence database must be formatted with mpiformatdb. The command-line syntax looks like this:
mpiformatdb -N 16 -i nt -o T

The above command would format the nt database into 16 fragments. Note that currently mpiformatdb does not support multiple input files.

mpiformatdb places the formatted database fragments in the same directory as the FASTA database. To specify a different target location, use the "-n" option, which works as it does in NCBI formatdb.

Querying the database

The mpiblast command-line syntax is nearly identical to that of NCBI's blastall program. Running a query on 18 nodes would look like:
mpiexec -n 18 mpiblast -p blastn -d nt -i blast_query.fas -o blast_results.txt

The above command would query the sequences in blast_query.fas against the nt database and write the results to the blast_results.txt file in the current working directory. By default, mpiBLAST reads configuration information from ~/.ncbirc. mpiBLAST needs at least 3 processes to perform a search: two processes are dedicated to scheduling tasks and coordinating file output, and any additional processes perform the actual search tasks.
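The process accounting above can be sketched as a small shell function. This is only an illustration: mpiblast_worker_count is a hypothetical helper name, not part of mpiBLAST, and it assumes the default single-partition setup in which exactly two processes are reserved for scheduling and output.

```shell
# Hypothetical helper, not part of mpiBLAST: report how many processes are
# left for searching when $1 MPI processes are requested.
# Assumes the default single partition, where two processes are reserved
# for scheduling tasks and coordinating file output.
mpiblast_worker_count() {
    if [ "$1" -lt 3 ]; then
        echo "mpiBLAST needs at least 3 processes (got $1)" >&2
        return 1
    fi
    echo $(( $1 - 2 ))
}

mpiblast_worker_count 18   # prints 16
```

With "mpiexec -n 18", as in the example command, this leaves 16 processes for search work.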

Extra options to mpiblast

  • --partition-size=[integer]
    Enable hierarchical scheduling with multiple masters. The partition size equals the number of workers in a partition plus 1 (the master process). For example, a partition size of 17 creates partitions consisting of 16 workers and 1 master. An individual output file will be generated for each partition. By default, mpiBLAST uses one partition. This option is only available for version 1.6 or above.
  • --replica-group-size=[integer]
    Specify how database fragments are replicated within a partition. Let F be the total number of database fragments, N the number of MPI processes in a partition, and G the replica group size. Then (N-1)/G database replicas are distributed in the partition (the master process does not host any database fragments), and each worker process hosts F/G fragments. In other words, one database replica is distributed across every G MPI processes.
  • --query-segment-size=[integer]
    Specify the number of query sequences that are fetched from the supermaster to the master at a time. This parameter controls the granularity of load balancing between different partitions. The default value is 5. This option is only available for version 1.6 or above.
  • --use-parallel-write
    Enable the high-performance parallel output solution. Note that the current implementation of parallel write does not require a parallel file system.
  • --use-virtual-frags
    Enable workers to cache database fragments in memory instead of in local storage. This is recommended on diskless platforms where no local storage is attached to each processor. It is enabled by default on Blue Gene systems.
  • --predistribute-db
    Distribute database fragments to workers before the search begins. This is especially useful for reducing data input time when multiple database replicas need to be distributed to workers.
  • --output-search-stats
    Enable output of the search statistics in the pairwise and XML output format. This could cause performance degradation on some diskless systems such as Blue Gene.
  • --removedb
    Removes the local copy of the database from each node before terminating execution.
  • --copy-via=[cp|rcp|scp|mpi|none]
    Sets the method of copying files that each worker will use. Default = "cp"
    • cp : use standard file system "cp" command. Additional option is --concurrent.
    • rcp : use rsh "rcp" command. Additional option is --concurrent.
    • scp : use ssh "scp" command. Additional option is --concurrent.
    • mpi : use MPI_Send/MPI_Recv to copy files. Additional option is --mpi-size.
    • none : do not copy files, instead use shared storage as local storage.
  • --debug[=filename]
    Produces verbose debugging output for each node and optionally logs the output to a file.
  • --time-profile=[filename]
    Reports execution time profile.
  • --version
    Print the mpiBLAST version.
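
The formulas in the --replica-group-size description can be checked with a little shell arithmetic. The sketch below is only an illustration: replica_layout is a hypothetical helper, not an mpiBLAST command, and it uses plain integer division. This guide does not say how mpiBLAST behaves when the values do not divide evenly, so the example sticks to values that do.

```shell
# Hypothetical helper, not part of mpiBLAST: evaluate the formulas from the
# --replica-group-size description.
#   $1 = F, total number of database fragments
#   $2 = N, number of MPI processes in the partition (workers plus 1 master)
#   $3 = G, the replica group size
# Integer division is an assumption of this sketch.
replica_layout() {
    workers=$(( $2 - 1 ))   # the master hosts no database fragments
    echo "replicas=$(( workers / $3 )) fragments_per_worker=$(( $1 / $3 ))"
}

replica_layout 16 17 4   # prints: replicas=4 fragments_per_worker=4
```

Here a partition size of 17 gives 16 workers; with 16 fragments and a replica group size of 4, each group of 4 workers holds one full copy of the database, 4 fragments per worker.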

Please refer to the README file in the mpiBLAST package for a performance tuning guide.

Removing a database

The --removedb command line option will cause mpiBLAST to do all work in a temporary directory that will get removed from each node's local storage directory upon successful termination. For example:
mpiexec -n 18 mpiblast -p blastx -d yeast.aa -i ech_10k.fas -o results.txt --removedb

The above command would perform an 18-node (16-worker) search of the yeast.aa database, writing the output to results.txt. Upon completion, the worker nodes would delete the yeast.aa database fragments from their local storage.

Databases can also be removed without performing a search in the following manner:
mpiexec -n 18 mpiblast_cleanup

 
 