BLAST tutorial - Notes for the Computational Biophysics class

1Running locally¶

We first build the sequence database. Download and unzip the Reviewed Swiss-Prot database from UniProt:

wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz

Now we can build a BLAST database by using the makeblastdb command, as follows:

makeblastdb -in uniprot_sprot.fasta -input_type fasta -title swissprot -dbtype prot

where we specify that we want to build a protein database with the -dbtype prot parameter. It should be easy to guess what the other parameters do. The command will generate a number of files that will be used by BLAST for subsequence searchers. All these files have names that are prefixed with the name of the original filename, which is how BLAST will find them later: do not change them!

We now need a query sequence. You can make up your own, download a real protein sequence (in FASTA format), or use the following, which is the sequence of human BRCA1:

>AAC37594.1 BRCA1 [Homo sapiens]
MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITK
RSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKENNSPEHLKDEVSIIQSMGYRNRAKRLLQS
EPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYIELGSDSSEDTVNKATYCSVGDQELLQITPQ
GTRDEISLDSAKKAACEFSETDVTNTEHHQPSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHA
SSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCER
KEWNKQKLPCSENPRDTEDVPWITLNSSIQKVNEWFSRSDELLGSDDSHDGESESNAKVADVLDVLNEVD
EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTENLIIGAFVTEP
QIIQERPLTNKLKRKRRPTSGLHPEDFIKKADLAVQKTPEMINQGTNQTEQNGQVMNITNSGHENKTKGD
SIQNEKNPNPIESLEKESAFKTKAEPISSSISNMELELNIHNSKAPKKNRLRRKSSTRHIHALELVVSRN
LSPPNCTELQIDSCSSSEEIKKKKYNQMPVRHSRNLQLMEGKEPATGAKKSNKPNEQTSKRHDSDTFPEL
KLTNAPGSFTKCSNTSELKEFVNPSLPREEKEEKLETVKVSNNAEDPKDLMLSGERVLQTERSVESSSIS
LVPGTDYGTQESISLLEVSTLGKAKTEPNKCVSQCAAFENPKGLIHGCSKDNRNDTEGFKYPLGHEVNHS
RETSIEMEESELDAQYLQNTFKVSKRQSFAPFSNPGNAEEECATFSAHSGSLKKQSPKVTFECEQKEENQ
GKNESNIKPVQTVNITAGFPVVGQKDKPVDNAKCSIKGGSRFCLSSQFRGNETGLITPNKHGLLQNPYRI
PPLFPIKSFVKTKCKKNLLEENFEEHSMSPEREMGNENIPSTVSTISRNNIRENVFKEASSSNINEVGSS
TNEVGSSINEIGSSDENIQAELGRNRGPKLNAMLRLGVLQPEVYKQSLPGSNCKHPEIKKQEYEEVVQTV
NTDFSPYLISDNLEQPMGSSHASQVCSETPDDLLDDGEIKEDTSFAENDIKESSAVFSKSVQKGELSRSP
SPFTHTHLAQGYRRGAKKLESSEENLSSEDEELPCFQHLLFGKVNNIPSQSTRHSTVATECLSKNTEENL
LSLKNSLNDCSNQVILAKASQEHHLSEETKCSASLFSSQCSELEDLTANTNTQDPFLIGSSKQMRHQSES
QGVGLSDKELVSDDEERGTGLEENNQEEQSMDSNLGEAASGCESETSVSEDCSGLSSQSDILTTQQRDTM
QHNLIKLQQEMAELEAVLEQHGSQPSNSYPSIISDSSALEDLRNPEQSTSEKAVLTSQKSSEYPISQNPE
GLSADKFEVSADSSTSKNKEPGVERSSPSKCPSLDDRWYMHSCSGSLQNRNYPSQEELIKVVDVEEQQLE
ESGPHDLTETSYLPRQDLEGTPYLESGISLFSDDPESDPSEDRAPESARVGNIPSSTSALKVPQLKVAES
AQSPAAAHTTDTAGYNAMEESVSREKPELTASTERVNKRMSMVVSGLTPEEFMLVYKFARKHHITLTNLI
TEETTHVVMKTDAEFVCERTLKYFLGIAGGKWVVSYFWVTQSIKERKMLNEHDFEVRGDVVNGRNHQGPK
RARESQDRKIFRGLEICCYGPFTNMPTDQLEWMVQLCGASVVKELSSFTLGTGVHPIVVVQPDAWTEDNG
FHAIGQMCEAPVVTREWVLDSVALYQCQELDTYLIPQIPHSHY

Put your FASTA sequence in a query.fasta file. We are now ready to search the database, which can be done with the following command:

blastp -query query.fasta -db uniprot_sprot.fasta > blast_results

The command should take no more than a handful of seconds, which is rather impressive considering that the database contains more than 200 million letters! Nota Bene: here we are using blastp, which aligns protein sequences. Nucleotide sequences are aligned with the blastn command.

If you open the output file (blast_results), you will realise that is human readable. It first contains the list of matches, ordered by their $E$ -values, each of them associated to its score. Then, each alignment is presented in detail. The file ends by listing the parameters used in the search (substitution matrix, gap penalties, λ, $K$ and other statistical parameters, etc.).

The output type can be controlled by using the -outfmt switch (the meaning of this option, and of all other available parameters, can be read by using the blastp -help command). A simple way of extracting BLAST results in a programmatic way is to use the BLAST XML format, which can be done with the -outfmt 5 switch:

blastp -query query.fasta -db uniprot_sprot.fasta -outfmt 5 > blast_results.xml

XML files are not very human-readable, but there are plenty of tools that can be used to parse them. Any generic-purpose XML parser would do, but using a specialised one is definitely more conveniente. Here I show how to do it using Biopython, which can be installed with pip install biopython. Documentation for this module can be found here.

from Bio.Blast import NCBIXML

xml_file = open("query_blast.xml")
blast_record = NCBIXML.read(xml_file)
for alignment in blast_record.alignments:
     for hsp in alignment.hsps:
         if hsp.expect < 1e-10:
             print("Sequence IDs:", alignment.title)
             print("Alignment length:", alignment.length)
             print("E-value:", hsp.expect)
             print(f"Query:\n{hsp.query}")
             print(f"Subject:\n{hsp.sbjct}")
             print(f"Match:\n{hsp.match}")

Fetching long content....

Here we list all the HPSs with $E$ -values smaller than 10^-10, and for each alignment we print the “name” of the database protein match, the length of the associated alignment, the $E$ -value, and then the query and matched (subject) sequences, and the match itself, which directly shows the differences between the two.

The syntax of the match sequence is simple: each character is

an amino acid 1-letter code: the amino acid is conserved in the two sequences
a plus sign: the two amino acids at that positions are different, but the entry relative to their substitution in the score matrix is positive
a space: the amino acid is not conserved, either because it has mutated, or because a gap has been inserted in the query or subject sequence

2Running over the internet¶

We can use Biopython to run a search by using the NCBI webserver. A full tutorial can be found at this page. Here I show a simple code that produces the same alignments computed above.

We first of all import the module containing the qblast function which is used to query the webserver. We then call it with three mandatory arguments:

The tool we want to use. Here we use blastp, since we want to align a protein sequence.
The database we want to search against. Here we use nr (non-redundant protein database).
The sequence we want to align, which we read from the query.fasta file, which contains the same human BRCA1 sequence used above.

The function takes many other optional parameters. Use help(NCBIWWW.qblast) to get the complete list (or look at the same info online).

Nota Bene: This may take a while!

from Bio.Blast import NCBIWWW
fasta_string = open("query.fasta").read()
result_stream = NCBIWWW.qblast("blastp", "nr", fasta_string)

We saved the result in the result_stream object, so that we can analyse it. We first “rewind” the result with seek(0), then use it to initialise a Blast record (blast_record), as we did before, and the print the most significant alignments. If you look at some of the text printed, you’ll notice that the sequence IDs are somewhat different from those printed before, since we are using different databases. However, t

result_stream.seek(0)
blast_record = NCBIXML.read(result_stream)
for alignment in blast_record.alignments:
     for hsp in alignment.hsps:
         if hsp.expect < 1e-10:
             print("Sequence IDs:", alignment.title)
             print("Alignment length:", alignment.length)
             print("E-value:", hsp.expect)
             print(f"Query:\n{hsp.query}")
             print(f"Subject:\n{hsp.sbjct}")
             print(f"Match:\n{hsp.match}")

Fetching long content....