BLAST max-target-seq meets metabarcoding

11 months ago 20

This is my first blog post in years - primarily down to a second child who is now a toddler. And what better topic to return to than a mainstay of past content than NCBI BLAST? This time with a motivating example from my recent work, metabarcoding. This is term used for sequencing a diagnostic region of DNA using specific primers for a group of organisms of interest, and then matching that amplicon to a database of known species. Human interpretation of a BLAST search can generally put a good guess as the organism - weighing hits and annotated taxonomy (e.g. ignoring the odd suspicious uncultured "fungal" match). This post is about how sometimes BLAST on the NCBI website can miss 100% identical (albeit not full length) matches, returning instead lots of very good but longer matches. Basically the online defaults don't suit this use-case.   Background We're using a region of the ITS1 gene using primers which target Phytophthora and related organisms, many of which are important plant pathogens. Some of these are quarantine class organisms, so we want to avoid false positives so the default classifier we use in my metabarcoding pipeline THAPBI PICT only declares a species level match for a perfect match or something 1bp different (a single base substitution, deletion, or insertion). In BLAST terms, this means over 99% identical over the full query length. Now barcoding primers are of course designed to match conserved regions of a genome, yet span a variable region in order to get meaningfully different amplicons. It so happens that the first 32bp of our typical Phytophthora amplicons (immediately after the forward primer site) are also conserved - and importantly often missing in published ITS1 sequences. That means when using NCBI BLAST to check an amplicon, although we hope to see full length perfect matches, the most interesting sequences are often only about 85% of the query (due to the subject match in the database missing the first 32bp) but in the region of 99% to 100% identical. Importantly those hits may not be ranked first by the BLAST e-value or bitscore (but you can change the sort order online). The other handicap for using BLAST in these context is there are lots of very very similar ITS1 sequences in the database, and while the NCBI does do some de-duplicating, the BLAST NT database is still full of duplicates, or near duplicates, of common barcoding sequences. Readers of my past blog posts (e.g. the most recent) will anticipate how the -max-target-seqs setting now comes into play here. During the initial search, BLAST will find lots of good candidates, so many that the heuristic max-target-sequences will drop some. And this means it can sometimes drop the perfect but incomplete matches I am interested in, in favour of the imperfect full length matches (which in fairness do get a better overall score). i.e. The default -max-target-seqs value can be too low for this use case. Example I have two example query sequences, both observed from multiple UK samples - uncultured but from the sequence almost certainly from Phytophthora: >dfae766ff29a02c0521fea4ee7969dc2 PhytophthoraTTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCCTTAAAATTGGGGGCTTGCTCGGCGGCGTGCGTGCTGGCCTATAATGGGTTGGTGTGCTGCTGCTGGGCGGGCTCTATCATGGGCGAGCGTTTGGGCTTCGGCTCGAGCTAGTAGCTTTTTCTTTTAAACCCATTCTTTAATTACTGAAATACT>3d3321eed13dba60899edfbb40cb7629 PhytophthoraTTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCTTTTAAATTGGGGGCTTCCGTCTGGCCGGCCGGTTCTCGGCTGGCTGGGTGGCGGCTCTATCATGGCGACCGCCTGGGGCCTCGGCCTGGGCTAGTAGCGTATTTTTTAAACCATTCCTAATTACTGAAAAAACT They both start with TTTCCGTAGGTGAACCTGCGGAAGGATCATTA which is the typical conserved 32bp expected for Phytophthora amplicons, which we can remove giving two truncated queries: >dfae76-truncated PhytophthoraCCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCCTTAAAATTGGGGGCTTGCTCGGCGGCGTGCGTGCTGGCCTATAATGGGTTGGTGTGCTGCTGCTGGGCGGGCTCTATCATGGGCGAGCGTTTGGGCTTCGGCTCGAGCTAGTAGCTTTTTCTTTTAAACCCATTCTTTAATTACTGAAATACT>3d3321e-truncated PhytophthoraxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxCCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCTTTTAAATTGGGGGCTTCCGTCTGGCCGGCCGGTTCTCGGCTGGCTGGGTGGCGGCTCTATCATGGCGACCGCCTGGGGCCTCGGCCTGGGCTAGTAGCGTATTTTTTAAACCATTCCTAATTACTGAAAAAACT Truncated queries Putting those truncated queries into the NCBI BLAST website using BLASTN against the NT database, or the currently experimental Eukaryota NT (nt_euk) database, with default settings gives full length perfect matches. Specifically query dfae76 gives two perfect matches (shown as a single deuplicated alignment): KT337856.1 Phytophthora cf. inundata D0S1P25 isolate D0S1P25-3 KT337858.1 Phytophthora cf. inundata D0S1P25 isolate D0S1P25-5 While for query 3d3321e there are five perfect matches (show as five alignments): JF300264.1 Phytophthora drechsleri clone S7.Oak4-1F JF300263.1 Phytophthora drechsleri clone S7.Oak4-1B JF300257.1 Phytophthora drechsleri clone S7.Oak4-1G JF300256.1 Phytophthora drechsleri clone S7.Oak4-1D JF300255.1 Phytophthora drechsleri clone S7.Oak4-4A i.e. Leaving the 32bp leader aside, perfect matches of my environmental Phytophthora sequences have been published from cultured isolates/clones, and we can confidentally assign species. Full-length queries Now, repeating that but using the full queries this time (and again default settings, including specifically leaving "Max target sequences" at the current web-blast default of 100), those matches vanish. This time the top hit for query dfae76 by e-value, bitscore, or percentage identify is a a full length match but only 223/225 identical (99%): AY995392.1 Phytophthora inundata isolate P756 (BO-2)  And for query 3d3321e, this time a tied best match, full length but only 204/206 identical (99%): KJ507657.1 Phytophthora drechsleri isolate 1092RN GU111633.1 Phytophthora drechsleri strain TARI 98067  GU111632.1 Phytophthora drechsleri strain TARI 97098  GU111630.1 Phytophthora drechsleri strain TARI 28322  GU111629.1 Phytophthora drechsleri strain TARI 26082  GU111628.1 Phytophthora drechsleri strain TARI 27221  GU111627.1 Phytophthora drechsleri strain TARI 25210  EU194428.1 Phytophthora drechsleri isolate PS-43 That the best hit by bitscore or e-value changed is no surprise, these are longer matches.While the top matches are still from the same species (Phytophthora inundata and Phytophthora dreshsleri), it appears that our environmental sequences are at least 2bp different from anything published. The gotcha is that the shorter 100% identical matches are missing. Non-default settings To see the shorter but 100% identical matches in the results, we need to raise the "Max target sequences" setting. The web-blast interface currently offers values of 10, 50, 100 (default), 250, 500, 1000 and 5000. For these two queries, increasing that to 250 is enough - the missing hits are now present, although a long long way down the default sort order. Off-line testing If running BLASTN at the command line, the equivalent setting is -max-target-seqs and it defaults to 500. The online default is likely a more aggressive optimization to reduce computational load on the NCBI servers. I was able to reproduce this locally using a custom database of 17717 Phytophthora ITS1 sequences downloaded via Entrez Direct on 1 Feb 2024 as follows: $ esearch -db nuccore -sort accession \         -query "Phytophthora[organism] \         AND ((internal AND transcribed AND spacer) OR its1)\         AND 150:10000[sequence length]" > search.xml$ efetch -format fasta < search.xml > search.fasta$ makeblastdb -in search.fasta -dbtype nucl -out search$ for QUERY in dfae766.fasta 3d3321e.fasta; do    echo    echo "Query $QUERY"    for MAX in 100 250; do        echo        echo "Any perfect (partial) hits with -max_target_seqs $MAX?"        if blastn -db search -query $QUERY \            -outfmt "6 pident length stitle" \            -max_target_seqs $MAX \            | cut -c 1-64 | grep -E "^100\.0" ; then            echo "Yes"        else            echo "NO - expected perfect partial hits are missed!"        fi    donedone This uses the TSV output (format mode 6) with custom columns picking percentage identify first, which allows a simple search with grep to pull out any perfect (partial) matches. The output: Query dfae766.fastaAny perfect (partial) hits with -max_target_seqs 100?NO - expected perfect partial hits are missed!Any perfect (partial) hits with -max_target_seqs 250?100.000    193    KT337858.1 Phytophthora cf. inundata D0S1P25 isolate100.000    193    KT337856.1 Phytophthora cf. inundata D0S1P25 isolateYesQuery 3d3321e.fastaAny perfect (partial) hits with -max_target_seqs 100?NO - expected perfect partial hits are missed!Any perfect (partial) hits with -max_target_seqs 250?100.000    174    JF300264.1 Phytophthora drechsleri clone S7.Oak4-1F 100.000    174    JF300263.1 Phytophthora drechsleri clone S7.Oak4-1B 100.000    174    JF300257.1 Phytophthora drechsleri clone S7.Oak4-1G 100.000    174    JF300256.1 Phytophthora drechsleri clone S7.Oak4-1D 100.000    174    JF300255.1 Phytophthora drechsleri clone S7.Oak4-4A YesNote that the command line default of 500 is comfortably above what is needed for this example to "work", and it is unfortunate that the online default of 100 is too low for this niche use-case. Conclusion So, for this particular use case, given the first 32bp of my amplicons is highly conserved but often omitting in the ITS1 fragments in the public databases, I should consider running a second BLAST search with that removed - or remember to bump up the "Max target sequences" option - before concluding that I have a truely novel Phytophthora amplicon.


View Entire Post

Read Entire Article