Setup
Getting started.
MolluscaGenes is run from a terminal — a text-only window
that lets you type commands. On macOS that's the
Terminal app (/Applications/Utilities/Terminal.app),
on Linux it's any terminal emulator (GNOME Terminal, Konsole, xterm, …),
and on Windows we recommend the
Windows
Subsystem for Linux with Ubuntu, which gives you a real Linux shell.
The instructions below assume you have a Bash- or zsh-style shell open.
What's a "shell"? The shell is the program that
interprets the commands you type and dispatches them. Everywhere this
tutorial says "in the terminal", you can paste the line into
your shell and press Enter. A line beginning with
$ is a command you type — you don't paste the
$ itself.
One-time setup, in five minutes
-
Install conda if you don't already have it. We
recommend Miniforge;
follow the installer prompts and reopen your terminal afterward.
-
Clone the repository. This downloads the wrappers,
metadata, HMMs, and site code to a folder on your computer:
git clone https://github.com/invertome/molluscagenes
cd molluscagenes
git clone copies the GitHub repo to a new folder named
molluscagenes; cd ("change directory") moves
your terminal session into that folder. Every command in this tutorial
assumes you are inside it.
-
Create the conda environment. This installs every
external tool the wrappers call (BLAST, DIAMOND, HMMER, MAFFT,
ClipKit, IQ-TREE, TreeShrink, plus a few helpers) into an isolated
conda environment named
molluscagenes. It only needs to
be done once and takes 5–15 minutes:
conda env create -f environment.yml
conda activate molluscagenes
The activate step is something you'll need to repeat
every time you open a new terminal session.
-
Download the database from Zenodo. Pick a folder
with at least 20 GB of free disk for the BLAST
and DIAMOND files (these are too big for GitHub). The
mg_fetch.sh wrapper does the download, verifies SHA256
checksums, extracts the tarballs, and writes a configuration file
that tells the other wrappers where the data lives:
bash wrappers/mg_fetch.sh /path/to/storage
Replace /path/to/storage with the actual folder you
picked. If you want to see what would be downloaded without actually
downloading anything, add --dry-run.
-
Source the configuration file in every new
terminal session before running any wrapper:
source config.sh
This sets a handful of environment variables ($MG_BLAST_AA,
$MG_HMM, …) that the wrappers read. If you forget this
step you'll see an "no config.sh found" error.
Reference
Anatomy of a wrapper command.
Every MolluscaGenes wrapper follows the same shape. Once you can read
one, you can read them all. Take a typical BLAST search:
wrappers/mg_blast.sh -q my_query.fa -o my_results -d aa -e 1e-5 -t 8
wrappers/mg_blast.sh — the wrapper script. The path is
relative to the repo root, where you should be when you run it.
-q my_query.fa — the query: a FASTA file
on your disk containing the sequence(s) you want to look up.
-o my_results — the output directory.
The wrapper creates it if it doesn't exist.
-d aa — which database to search.
aa = the protein database (uses
blastp); mrna = the nucleotide database
(uses blastn).
-e 1e-5 — the e-value threshold (one in 100,000 by
chance). Lower values are stricter; higher values let in more
distant hits.
-t 8 — number of CPU threads. Use as many as your
machine has cores.
Three things every wrapper produces in its output directory:
run.log — the exact command, tool versions, start/end
times, and any warnings. Hand this to a reviewer asking
"what settings did you use?".
.done sentinel — a tiny marker file that records that
the run completed. If you re-run the same command, the wrapper
short-circuits and exits immediately. Pass --force to
override.
- The actual results, named according to the wrapper.
FASTA, briefly. A FASTA file is a plain text file
that holds one or more biological sequences. Each sequence starts with
a header line beginning with >, followed by the sequence
itself on the lines that follow. Example:
>my_protein descriptive text here
MDRSQRLGSPGGFSPRGGSPAPSPRRDSGSGSADGKGGKGIHATIQQ
LADEQERVQKKTFTNWINSYLVKHIPPVKVESL
You can copy a sequence from UniProt or NCBI into a text file with a
.fa extension, or extract one out of MolluscaGenes itself
using mg_extract.sh (Example 3 below).
Example 01 · characterize
What is this protein, and where does it occur in the Mollusca?
You have a protein sequence — maybe from a UniProt page, a paper supplement,
or your own assembly — and you want a fast read on what it is.
mg_characterize.sh runs three searches against MolluscaGenes
in parallel:
- BLAST against every protein in the database (sensitive, slower)
- DIAMOND against the same protein database (less sensitive but ~50× faster)
- hmmsearch against the 1,057 mollusc-revised Pfam HMMs (best at finding what family your protein belongs to)
The wrapper joins each method's hits to species metadata so you can see
the binomial, family, and class of each match without separate lookups.
- You'll need
- A protein FASTA file. We'll call it
my_query.fa; put it in the repo's root folder for these examples (or anywhere — just adjust the path).
- Time
- ~2–5 minutes for a short protein on 8 threads. BLAST dominates the wall time.
- Disk
- ~10 MB output.
1
Run the command.
From the repo root, with the conda env activated and config.sh sourced:
wrappers/mg_characterize.sh -q my_query.fa -o characterize_out -t 8
2
Watch what happens.
The terminal will print progress lines as each of the three searches runs. You'll see something like:
[2026-04-27T14:01:11Z] mg_characterize.sh: query=my_query.fa
[2026-04-27T14:01:11Z] running blastp...
[2026-04-27T14:01:11Z] running diamond blastp --more-sensitive...
[2026-04-27T14:01:11Z] running hmmsearch...
[2026-04-27T14:03:42Z] wrote summary.tsv
[2026-04-27T14:03:42Z] wrote report.html
If you don't want to watch and have a shell that supports it, you
can append & to run it in the background and reclaim
your prompt.
3
Inspect the report.
Open characterize_out/report.html in your browser. It's a self-contained file — you can email it, share it, or open it offline. It shows one row per query sequence, with the best BLAST/DIAMOND/HMM hit and the species each hit comes from.
If you'd rather work with raw data, the same information is in summary.tsv (open it in a spreadsheet or with pandas):
query blast_best_hit blast_best_evalue diamond_best_hit hmm_top_domain species_top_blast
my_protein OctbimEVm004812t1 1.34e-156 OctbimEVm004812t1 Neur_chan_LBD_REV Octopus bimaculoides
Per-method detail (every hit, not just the best one) is under the blast/, diamond/, and hmm/ subdirectories — open hits_with_species.tsv in any of them.
4
Try variations.
- Faster, less sensitive — drop BLAST and just use DIAMOND + HMM:
wrappers/mg_characterize.sh -q my_query.fa -o cf_fast -t 8 --skip blast
- HMM-only, e.g. when you suspect the protein is highly divergent:
wrappers/mg_characterize.sh -q my_query.fa -o cf_hmm -t 8 --skip blast --skip diamond
- Loosen the e-value to catch distant homologs:
wrappers/mg_characterize.sh -q my_query.fa -o cf_loose -t 8 -e 1
If something goes wrong
- "no config.sh found" — you forgot to
source config.sh, or you're in the wrong directory. cd back to the repo root and try again.
- "BLAST db not found" — the path in
config.sh doesn't point to a folder with the BLAST volumes. Re-run mg_fetch.sh.
- The HMM search reports "0 hits" — your protein may genuinely not match any of the 1,057 HMMs (we cover ~50 biological categories, not all of Pfam). Try lowering
--hmm-E first; if still nothing, the protein may be from a family we don't have an HMM for.
- It seems hung — open
characterize_out/run.log in another terminal (tail -f characterize_out/run.log) to see live progress.
Example 02 · phylogeny
Build a phylum-wide tree of a gene family from a few seed sequences.
This is the flagship analysis. Given one or more reference (seed)
sequences for a gene family — say, three nicotinic acetylcholine
receptor sequences from a model species — mg_place.sh
finds every homolog across MolluscaGenes, aligns and trims them,
infers a phylogenetic tree with bootstrap and aBayes branch support,
prunes long-branch outliers ("rogue tips"), re-aligns and re-infers
a final tree, and renames the tips from internal species codes to
binomials.
The whole pipeline is one command. Internally it runs nine steps
(four iterative searches, accession extraction, MAFFT, ClipKit,
IQ-TREE 2 round 1, TreeShrink, MAFFT again, IQ-TREE 2 round 2,
tip renaming) — each step writes a small .done
marker so you can re-run with new flags and only the affected steps
recompute.
- You'll need
- A small reference FASTA — typically 1–10 protein sequences for the family of interest. We'll call it
seeds.fa.
- Time
- ~2 hours wall time on 16 threads with DIAMOND-based search; ~6 hours with BLAST. The MAFFT and IQ-TREE rounds dominate.
- Disk
- ~1–5 GB of intermediate files. The final tree itself is small.
- Memory
- ~16 GB peak during IQ-TREE on a few-thousand-sequence alignment.
Heads-up. Iterative search at e ≤ 1e-96
is appropriately strict for most receptor / enzyme families. If you're
working with very short or rapidly-evolving proteins (e.g. defensins,
venom peptides), relax the e-value and lower the iteration count to
avoid pulling in non-homologous noise:
-e 1e-20 --iterations 2.
1
Run the command.
wrappers/mg_place.sh \
-q seeds.fa
-o nachr_tree
--search diamond
--iterations 4
--mafft-mode einsi
--clipkit-mode kpic-smart-gap
--bb-final 10000
-t 16
The trailing \ on each line is shell
convention for "this command continues on the next line". You can
also paste it all on a single long line if you prefer.
2
What's happening behind the scenes.
- Iterative search (rounds 1–4). Round 1 searches your
seeds against MolluscaGenes; round 2 uses the round-1 hits as a new
query set; rounds 3 and 4 do the same. Each round broadens the
recruited set, capturing increasingly divergent homologs.
- Extraction. All unique hits are pulled out as
FASTA via
blastdbcmd and concatenated with your seeds.
- Alignment. MAFFT's E-INS-i algorithm aligns the
full set, balancing speed against accuracy on protein families
with conserved domains separated by variable linkers.
- Trimming. ClipKit removes ambiguously aligned
positions while keeping parsimony-informative columns.
- Round-1 tree. IQ-TREE 2 picks an evolutionary
model (TESTNEW), infers a maximum-likelihood tree, and computes
ultra-fast bootstrap and aBayes branch support.
- TreeShrink. Long-branch outliers ("rogue tips")
are detected and removed.
- Re-alignment. The pruned set is realigned for
the final tree.
- Final tree. IQ-TREE 2 runs again, this time
using the model selected in round 1 and a higher number of
bootstrap replicates (
--bb-final 10000).
- Tip renaming. Every species-code prefix
(
OctbimEVm…) is replaced by a sed-driven substitution
with the binomial (Octopus_bimaculoides_EVm…),
yielding a tree that's human-readable in any visualizer.
3
What you'll find when it finishes.
The most useful files:
nachr_tree/final.names.contree — the final consensus tree, with binomial tip labels and bootstrap support values. Open this in iTOL, FigTree, or ete3.
nachr_tree/final.iqtree — IQ-TREE log: model selection summary, parameter estimates, support tests.
nachr_tree/treeshrink/run/output.fasta — the alignment after rogue-tip removal.
nachr_tree/iter1/hits.tsv through iter4/hits.tsv — every iterative round, fully traceable.
4
Iterate on flags without losing work.
If you decide partway through that you want a different alignment mode or trim setting, you don't have to start over. Edit the command and re-run; the wrapper will skip the steps whose output is already on disk and only redo the affected stages.
wrappers/mg_place.sh -q seeds.fa -o nachr_tree --clipkit-mode gappy
That re-runs ClipKit, both IQ-TREE rounds, and the rename — but skips the four iterative searches and the alignment, which are by far the slowest steps. Pass --force to redo everything.
5
If you want to skip TreeShrink.
If you'd rather see every recruited sequence in the final tree (e.g. to look at long-branch outliers explicitly), pass --skip-treeshrink. This roughly halves the total runtime.
If something goes wrong
- IQ-TREE runs out of memory. Drop
--bb-final back to 1000 first; if still failing, sub-sample your alignment or trim more aggressively (--clipkit-mode gappy).
- "too few sequences for tree inference" — your seeds are too restrictive. Loosen
-e or add more seed sequences.
- The final tree looks like spaghetti — the family is probably too divergent for one tree. Split your seeds by sub-family and run two separate placements, then concatenate the trees externally.
- You see "diamond: command not found" — the conda env isn't activated.
conda activate molluscagenes.
Example 03 · extract
Pull every protein from a class, family, or species set.
When you want to run an analysis on every protein from a particular
taxonomic group — say, all cephalopods, or every Octopus
species — without writing any glue code. mg_extract.sh
looks up the species you specify in the metadata table, then pulls
their protein and/or transcript sequences out of the BLAST databases.
- You'll need
- A selector: either a species code, a binomial, a comma-separated list, a file with one species per line, or a taxonomic rank (class / order / family / phylum).
- Time
- First run on a fresh database: ~2 minutes (the wrapper builds an accession-by-species cache, ~600 MB under
metadata/_cache/). Subsequent runs are I/O-bound and complete in seconds.
1
Pick a selector form.
Concrete forms, all equivalent in usage:
wrappers/mg_extract.sh -s Octbim -o out -d aa # one code
wrappers/mg_extract.sh -s "Octopus bimaculoides" --by binomial -o out -d aa
wrappers/mg_extract.sh -s Octbim,Sepoff,Eupber -o out -d aa # comma list
wrappers/mg_extract.sh -s @my_codes.txt -o out -d aa # file
wrappers/mg_extract.sh -s Cephalopoda --by class -o out -d aa # by rank
wrappers/mg_extract.sh -s Octopodidae --by family -o out -d aa
2
Run a real example: every cephalopod protein.
Here we extract every protein from every cephalopod species in v0.1
(28 species, ~2.8 M sequences total), with --merge to
concatenate them into a single FASTA file:
wrappers/mg_extract.sh \
-s Cephalopoda --by class \
-o cephalopoda_proteins \
-d aa --merge \
-t 8
The -d aa flag picks the protein database; use
-d mrna for transcripts or -d both for
parallel extraction of both.
3
What you get.
cephalopoda_proteins/extracted_aa.fa — single merged FASTA with ~2.8 M sequences (because of --merge; without it you'd get one FASTA per species).
cephalopoda_proteins/extracted_species.tsv — per-species count of what was actually extracted, plus binomial / class / order / family for each row.
Quick sanity-check from the terminal:
grep -c '^>' cephalopoda_proteins/extracted_aa.fa
2800153
Use the species browser to design your selector.
Browse all species, filter by class or
family, then export the filtered list as TSV — the first column is
the species code, ready to feed into -s @file.txt.
Example 04 · HMM scan
Scan a proteome with the mollusc-revised HMMs.
Given any protein FASTA — your own assembly, a set extracted with
mg_extract.sh, or any third-party proteome — annotate it
domain-by-domain using the 1,057 mollusc-revised HMMs. Because these
HMMs were re-trained against molluscan sequence diversity, they
recover divergent homologs that vertebrate-trained Pfam profiles
frequently miss.
- You'll need
- A protein FASTA. We'll generate one from Octopus bimaculoides using
mg_extract.
- Time
- ~10 minutes for an 80,000-protein proteome on 8 threads with the full bundle (1,057 HMMs). Single-HMM scans take seconds.
1
Get a proteome.
wrappers/mg_extract.sh -s "Octopus bimaculoides" --by binomial -o obim --merge
2
Scan with the full HMM bundle.
wrappers/mg_hmmsearch.sh -q obim/extracted_aa.fa -o obim_hmm -t 8
By default this uses the e-value threshold 1e-5. For
high-confidence calls compatible with how Pfam itself curates,
substitute Pfam's gathering thresholds:
wrappers/mg_hmmsearch.sh -q obim/extracted_aa.fa -o obim_hmm --cut-ga -t 8
3
Or scan with a single HMM.
If you only care about one domain family — e.g. the
ligand-binding domain of pentameric Cys-loop receptors — point
--hmm at the per-domain file:
wrappers/mg_hmmsearch.sh \
-q obim/extracted_aa.fa \
-o obim_neur_chan \
--hmm hmm/per_domain/Neur_chan_LBD_REVISION.hmm \
-t 8
This is much faster than the full scan (~30 seconds), and is
often the right approach when you're chasing a specific family
rather than annotating a whole proteome.
4
What you get.
obim_hmm/hits.tbl — the standard HMMER per-sequence tabular file. One line per (sequence, HMM) match.
obim_hmm/hits.domtbl — per-domain output: detailed coordinates of each domain hit on each sequence.
obim_hmm/hits_with_species.tsv — same as hits.tbl but with binomial / class / order / family appended for each row, so you don't need to cross-reference manually.
Pick HMMs interactively. The
HMMs (Browser) page lets you filter the 1,057 HMMs by
biological category (GPCR, VGIC, LGIC, …), tick a subset, and copy a
curl snippet or download a ZIP bundle. Useful when you
want to scan with a small custom subset rather than the full
collection.
Example 05 · custom flags
Reach a flag we don't expose by name.
Each wrapper exposes a curated set of flags by name (the most-used
settings) and an --extra "ARGS" passthrough so
anything we haven't named is still reachable. The string is split on
whitespace and appended verbatim to the underlying tool. Three
examples:
wrappers/mg_blast.sh -q q.fa -o out -d aa \
--extra "-gapopen 11 -gapextend 1 -seg yes"
wrappers/mg_diamond.sh -q q.fa -o out \
--extra "--masking 0 --comp-based-stats 1 --frameshift 15"
wrappers/mg_place.sh -q seeds.fa -o phylo \
--iqtree-extra "-alrt 1000 -lbp 1000" \
--search-extra "-soft_masking true"
Whatever you pass via --extra also lands verbatim in the
run.log, alongside the tool version that was used. A
reviewer asking "what settings did you use?" gets a single
deterministic answer.
Don't know the underlying tool's flags? Each wrapper
calls a familiar tool you can introspect directly:
blastp -help, diamond help blastp,
hmmsearch -h, mafft --help,
iqtree -h. Anything those tools accept can be passed
through --extra.
Example 06 · taxon-restricted search
BLAST, DIAMOND, or HMM-search a query against one molluscan class.
Sometimes you don't want hits from across the whole Mollusca — you
care only about gastropods, only about cephalopods, or only a single
family. Pass --taxon-filter <taxon> to any search
wrapper and the target database is automatically restricted to that
taxon. The first call builds and caches a per-taxon subset database
under metadata/_cache/subset_dbs/; every subsequent
call against the same taxon hits the cache and runs in seconds. The
source-DB SHA-256 is recorded in a manifest, so the subset
auto-invalidates and rebuilds if the underlying database changes —
you never have to refresh it manually.
Don't know which taxon names are valid? Open the
species browser, filter by class or
family, and copy a name straight off the page.
- You'll need
- A query FASTA, plus a taxon name — a class
(
Cephalopoda), order (Octopoda), family
(Octopodidae), binomial
(Octopus bimaculoides), species code
(Octbim), or a comma-separated union
(Gastropoda,Bivalvia).
- Time
- ~2 minutes on the first run (one-time subset DB build);
seconds on a cache hit thereafter.
1
Pick a taxon.
Use the species browser to filter
by class or family, then pick a name like
Cephalopoda, Octopodidae, or
Octopus bimaculoides. Names are case-sensitive and
must match the corresponding column in
metadata/species_metadata.tsv.
2
Run the search.
From the repo root, with the conda env activated and
config.sh sourced:
wrappers/mg_blast.sh \
-q my_query.fa
-o gastropoda_blast
-d aa
--taxon-filter Gastropoda
-t 8
The first run will pause for a minute or two while
the subset DB is materialized; identical
--taxon-filter values on later calls hit the cache
directly.
3
What you get.
gastropoda_blast/hits_with_species.tsv — BLAST
hits, all from Gastropoda, joined to species metadata.
gastropoda_blast/run.log — provenance, including
the resolved species count and the subset DB cache key.
First-call side effect:
metadata/_cache/subset_dbs/class_Gastropoda_<hash>/
is created and reused for subsequent calls. The directory contains
the per-taxon FASTA, the format-specific index files
(BLAST volumes or DIAMOND .dmnd), and a
manifest.json recording the source-DB SHA-256 for
auto-invalidation.
4
Swap wrappers — same flag, same cache.
The same --taxon-filter works on every search
wrapper:
wrappers/mg_diamond.sh -q my_query.fa -o gastropoda_diamond \
--taxon-filter Gastropoda
wrappers/mg_hmmsearch.sh --hmm hmm/per_domain/Neur_chan_LBD_REVISION.hmm \
-o gastropoda_hmm --taxon-filter Gastropoda
wrappers/mg_characterize.sh -q my_query.fa -o gastropoda_char \
--taxon-filter Gastropoda
Two notes:
- For
mg_hmmsearch.sh, -q is
omitted — the subset FASTA is the target. -q
and --taxon-filter are mutually exclusive.
- For
mg_characterize.sh, the flag restricts
the BLAST and DIAMOND steps only. The
hmmsearch step scans the user's query against the HMM library
to identify family IDs, so a target-DB filter doesn't apply.
If something goes wrong
- "Ambiguous taxon" — the name matches multiple
ranks (e.g. a genus name that also happens to be a family root).
Disambiguate with the
rank:value form:
--taxon-filter class:Gastropoda,
--taxon-filter order:Octopoda, etc.
- "No matches" — typo, or the taxon isn't
represented in this release. Browse species.html
to find a valid name.
- Zero-protein species — the species exists in
metadata but has no protein records (a placeholder for v1.0).
Pick a broader rank (family or class instead of binomial).
- Subset DB build seems slow — the first call
for a given taxon extracts and indexes that subset. Subsequent
calls reuse the cache and are I/O-bound. The one-time cost is
expected and visible in
run.log.
When NOT to use this. If you want hits from across
all molluscs — including legitimate cross-class homologs — omit the
flag and search the full database. Subset DBs are smaller (and
therefore faster) but by construction exclude every species outside
the requested taxon, even ones that would otherwise be your best
hits.
Troubleshooting
Common problems & fixes.
"command not found"
conda not found — install Miniforge and reopen the terminal.
blastp / diamond / iqtree not found — the conda environment isn't activated. Run conda activate molluscagenes.
- Wrapper scripts not found — you're not in the repo root.
cd to the folder you cloned with git clone.
"no config.sh found"
- Run
mg_fetch.sh first; it writes config.sh for you.
- Or copy
config.sh.template to config.sh and edit the paths by hand to point at your downloaded BLAST volumes / DIAMOND db / HMM file.
"BLAST db not found"
- The path in
config.sh is wrong — open it in a text editor and verify $MG_BLAST_DIR points to the folder containing mollusca_aa.* and mollusca_mrna.*.
- If
mg_fetch.sh failed midway, the tarball may not have been extracted. cd into your storage folder and check whether the .tar.gz files are present alongside the extracted volumes.
Out of disk during a run
- The BLAST/DIAMOND databases are ~13 GB compressed, ~25 GB extracted. Add
mg_place intermediates and you can need 30–40 GB.
- Move the storage folder to a larger drive and re-run
mg_fetch.sh /new/path.
The wrapper "skipped" my run
- The output directory contains a
.done sentinel from a previous run. Pass --force to override, or delete the directory first.
Permission denied
chmod +x wrappers/*.sh scripts/*.sh scripts/*.py to make sure everything is executable. (After git clone they should be — but some Windows file systems strip the permission bit.)
It's slow
- Increase threads with
-t N, where N = number of CPU cores.
- For exploratory work use
--search diamond rather than blastp; DIAMOND is ~50× faster at modest sensitivity loss.
- For very-long-running placements,
--skip-treeshrink halves the runtime.
Reference
Glossary.
- Bash / shell
- The text-based program in the terminal that interprets the commands you type.
bash and zsh are the most common; both work identically for these tutorials.
- BLAST
- The Basic Local Alignment Search Tool. The classic similarity-search algorithm — sensitive but slow on very large databases. Uses include
blastp (protein vs. protein) and blastn (nucleotide vs. nucleotide).
- DIAMOND
- A modern protein-aligner that runs ~50–500× faster than BLAST at comparable sensitivity, by indexing the query in a clever way. Protein-only; there is no DIAMOND mode for nucleotide-vs-nucleotide.
- HMM (hidden Markov model)
- A probabilistic model of a protein domain family. Each position in the model has its own preferences for which amino acids it accepts. HMMs are more sensitive than BLAST for finding distant family members, especially when the family has conserved positions interleaved with variable ones.
- Pfam
- The standard public collection of protein-domain HMMs. MolluscaGenes ships TIAMMAt-revised versions of 1,057 Pfam HMMs that have been retrained on molluscan sequence data.
- FASTA
- A plain-text file format for sequences. Each entry begins with a
> header line and is followed by the sequence on the next lines.
- e-value
- The expected number of hits at least as good as the observed one that you would see by chance. Lower is more stringent —
1e-50 is a near-certain homolog; 1e-5 may include a few spurious hits but catches more distant relatives.
- Bootstrap (UFBoot, aBayes)
- Statistical measures of confidence in tree branches. UFBoot ("ultrafast bootstrap") values ≥95 and aBayes ≥0.95 are typically considered well-supported.
- conda environment
- An isolated bundle of software dependencies. Activating one re-routes
blastp, iqtree, etc. to specific versions installed under ~/miniforge3/envs/molluscagenes, without affecting your system-wide programs.
- Sentinel file (
.done)
- A tiny marker file the wrappers leave behind to record which steps have completed. Lets you re-run a wrapper without redoing finished work.