Tutorials

Common analyses, explained step by step.

Five worked examples, written for readers who are not seasoned command-line users — every flag, file, and runtime expectation is spelled out. If you've never run a Bash script before, start with the Getting started walkthrough; otherwise jump straight to the example you need.

Setup

Getting started.

MolluscaGenes is run from a terminal — a text-only window that lets you type commands. On macOS that's the Terminal app (/Applications/Utilities/Terminal.app), on Linux it's any terminal emulator (GNOME Terminal, Konsole, xterm, …), and on Windows we recommend the Windows Subsystem for Linux with Ubuntu, which gives you a real Linux shell. The instructions below assume you have a Bash- or zsh-style shell open.

What's a "shell"? The shell is the program that interprets the commands you type and dispatches them. Everywhere this tutorial says "in the terminal", you can paste the line into your shell and press Enter. A line beginning with $ is a command you type — you don't paste the $ itself.

One-time setup, in five minutes

  1. Install conda if you don't already have it. We recommend Miniforge; follow the installer prompts and reopen your terminal afterward.
  2. Clone the repository. This downloads the wrappers, metadata, HMMs, and site code to a folder on your computer:
    git clone https://github.com/invertome/molluscagenes
    cd molluscagenes
    git clone copies the GitHub repo to a new folder named molluscagenes; cd ("change directory") moves your terminal session into that folder. Every command in this tutorial assumes you are inside it.
  3. Create the conda environment. This installs every external tool the wrappers call (BLAST, DIAMOND, HMMER, MAFFT, ClipKit, IQ-TREE, TreeShrink, plus a few helpers) into an isolated conda environment named molluscagenes. It only needs to be done once and takes 5–15 minutes:
    conda env create -f environment.yml
    conda activate molluscagenes
    The activate step is something you'll need to repeat every time you open a new terminal session.
  4. Download the database from Zenodo. Pick a folder with at least 20 GB of free disk for the BLAST and DIAMOND files (these are too big for GitHub). The mg_fetch.sh wrapper does the download, verifies SHA256 checksums, extracts the tarballs, and writes a configuration file that tells the other wrappers where the data lives:
    bash wrappers/mg_fetch.sh /path/to/storage
    Replace /path/to/storage with the actual folder you picked. If you want to see what would be downloaded without actually downloading anything, add --dry-run.
  5. Source the configuration file in every new terminal session before running any wrapper:
    source config.sh
    This sets a handful of environment variables ($MG_BLAST_AA, $MG_HMM, …) that the wrappers read. If you forget this step you'll see a "no config.sh found" error.
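The two easy-to-forget steps above — activating the environment and sourcing the config — can be checked before any run. Here is a hypothetical pre-flight function you could paste into your shell; conda does set CONDA_DEFAULT_ENV on activation, but the MG_BLAST_AA and MG_HMM variable names are the illustrative ones mentioned above, so adjust them to whatever your config.sh actually exports:

```shell
# Hypothetical pre-flight check: is the conda env active, and was config.sh sourced?
preflight() {
  if [ "${CONDA_DEFAULT_ENV:-}" != "molluscagenes" ]; then
    echo "run 'conda activate molluscagenes' first" >&2
    return 1
  fi
  for var in MG_BLAST_AA MG_HMM; do
    # look up each variable by name; empty means config.sh was not sourced
    if [ -z "$(eval "printf '%s' \"\${$var:-}\"")" ]; then
      echo "\$$var is unset - did you 'source config.sh'?" >&2
      return 1
    fi
  done
  echo "preflight OK"
}
```

Run `preflight` at the top of a session; a non-zero exit tells you which setup step was skipped.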
Reference

Anatomy of a wrapper command.

Every MolluscaGenes wrapper follows the same shape. Once you can read one, you can read them all. Take a typical BLAST search:

wrappers/mg_blast.sh -q my_query.fa -o my_results -d aa -e 1e-5 -t 8
  • wrappers/mg_blast.sh — the wrapper script. The path is relative to the repo root, where you should be when you run it.
  • -q my_query.fa — the query: a FASTA file on your disk containing the sequence(s) you want to look up.
  • -o my_results — the output directory. The wrapper creates it if it doesn't exist.
  • -d aa — which database to search. aa = the protein database (uses blastp); mrna = the nucleotide database (uses blastn).
  • -e 1e-5 — the e-value threshold: the expected number of equally good hits arising by chance, here one in 100,000. Lower values are stricter; higher values let in more distant hits.
  • -t 8 — number of CPU threads. Use as many as your machine has cores.

Three things every wrapper produces in its output directory:

  • run.log — the exact command, tool versions, start/end times, and any warnings. Hand this to a reviewer asking "what settings did you use?".
  • .done sentinel — a tiny marker file that records that the run completed. If you re-run the same command, the wrapper short-circuits and exits immediately. Pass --force to override.
  • The actual results, named according to the wrapper.
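The .done short-circuit described above is a common pattern; a minimal sketch of how such a step guard can work (illustrative only — the real wrapper scripts may differ in details) looks like this:

```shell
# Sketch of a .done sentinel guard: skip a step whose output directory is
# already marked complete, unless FORCE=1 asks for a redo.
run_step() {
  outdir=$1; shift
  if [ -e "$outdir/.done" ] && [ "${FORCE:-0}" != "1" ]; then
    echo "skipping: $outdir is already complete (set FORCE=1 to redo)"
    return 0
  fi
  mkdir -p "$outdir"
  # run the actual command; only mark done if it succeeded
  "$@" && touch "$outdir/.done"
}
```

The sentinel is written only on success, so a crashed run is retried rather than skipped.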

FASTA, briefly. A FASTA file is a plain text file that holds one or more biological sequences. Each sequence starts with a header line beginning with >, followed by the sequence itself on the lines that follow. Example:

>my_protein descriptive text here
MDRSQRLGSPGGFSPRGGSPAPSPRRDSGSGSADGKGGKGIHATIQQ
LADEQERVQKKTFTNWINSYLVKHIPPVKVESL

You can copy a sequence from UniProt or NCBI into a text file with a .fa extension, or extract one out of MolluscaGenes itself using mg_extract.sh (Example 3 below).
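Before feeding a hand-assembled FASTA to a wrapper, two one-liners catch the most common mistakes (a missing > header, or fewer sequences than you expected):

```shell
# Count sequences: each FASTA entry has exactly one '>' header line.
count_seqs() { grep -c '^>' "$1"; }

# Cheap format check: a FASTA file must begin with a '>' character.
looks_like_fasta() { head -c 1 "$1" | grep -q '>'; }
```

For example, `count_seqs my_query.fa` on the two-sequence file above prints 2.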


Example 01 · characterize

What is this protein, and where does it occur in the Mollusca?

You have a protein sequence — maybe from a UniProt page, a paper supplement, or your own assembly — and you want a fast read on what it is. mg_characterize.sh runs three searches against MolluscaGenes in parallel:

  • BLAST against every protein in the database (sensitive, slower)
  • DIAMOND against the same protein database (less sensitive but ~50× faster)
  • hmmsearch against the 1,057 mollusc-revised Pfam HMMs (best at finding what family your protein belongs to)

The wrapper joins each method's hits to species metadata so you can see the binomial, family, and class of each match without separate lookups.

You'll need
A protein FASTA file. We'll call it my_query.fa; put it in the repo's root folder for these examples (or anywhere — just adjust the path).
Time
~2–5 minutes for a short protein on 8 threads. BLAST dominates the wall time.
Disk
~10 MB output.
1

Run the command.

From the repo root, with the conda env activated and config.sh sourced:

wrappers/mg_characterize.sh -q my_query.fa -o characterize_out -t 8
2

Watch what happens.

The terminal will print progress lines as each of the three searches runs. You'll see something like:

[2026-04-27T14:01:11Z] mg_characterize.sh: query=my_query.fa
[2026-04-27T14:01:11Z] running blastp...
[2026-04-27T14:01:11Z] running diamond blastp --more-sensitive...
[2026-04-27T14:01:11Z] running hmmsearch...
[2026-04-27T14:03:42Z] wrote summary.tsv
[2026-04-27T14:03:42Z] wrote report.html

If you don't want to watch, append & to the command to run it in the background and reclaim your prompt (any Bash- or zsh-style shell supports this).
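If you want the prompt back but still care about the exit status, one pattern (a generic shell idiom, not part of the wrappers) is to background the job and wait on it later:

```shell
# Start a command in the background, then block until it finishes;
# 'wait' returns the backgrounded command's exit status.
run_in_background() {
  "$@" &
  pid=$!
  echo "started as PID $pid"
  wait "$pid"
}
```

Between the start and the `wait`, you can do other work in the same shell, or follow the run's log with `tail -f`.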

3

Inspect the report.

Open characterize_out/report.html in your browser. It's a self-contained file — you can email it, share it, or open it offline. It shows one row per query sequence, with the best BLAST/DIAMOND/HMM hit and the species each hit comes from.

If you'd rather work with raw data, the same information is in summary.tsv (open it in a spreadsheet or with pandas):

query              blast_best_hit       blast_best_evalue  diamond_best_hit     hmm_top_domain         species_top_blast
my_protein         OctbimEVm004812t1    1.34e-156          OctbimEVm004812t1    Neur_chan_LBD_REV      Octopus bimaculoides

Per-method detail (every hit, not just the best one) is under the blast/, diamond/, and hmm/ subdirectories — open hits_with_species.tsv in any of them.
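Because summary.tsv is tab-separated, awk is enough for quick filtering. This sketch keeps queries whose best BLAST e-value beats a cutoff; the column positions (query=1, blast_best_hit=2, blast_best_evalue=3) follow the header shown above, so confirm them against your own file with `head -1`:

```shell
# Print query, best hit, and e-value for rows whose best BLAST e-value
# is below 1e-50 (the +0 forces awk to compare numerically, so scientific
# notation like 1.34e-156 is handled correctly). Skips the header row.
strong_hits() {
  awk -F'\t' 'NR > 1 && $3 != "" && $3 + 0 < 1e-50 { print $1 "\t" $2 "\t" $3 }' "$1"
}
```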

4

Try variations.

  • Faster, less sensitive — drop BLAST and just use DIAMOND + HMM:
    wrappers/mg_characterize.sh -q my_query.fa -o cf_fast -t 8 --skip blast
  • HMM-only, e.g. when you suspect the protein is highly divergent:
    wrappers/mg_characterize.sh -q my_query.fa -o cf_hmm -t 8 --skip blast --skip diamond
  • Loosen the e-value to catch distant homologs:
    wrappers/mg_characterize.sh -q my_query.fa -o cf_loose -t 8 -e 1
If something goes wrong
  • "no config.sh found" — you forgot to source config.sh, or you're in the wrong directory. cd back to the repo root and try again.
  • "BLAST db not found" — the path in config.sh doesn't point to a folder with the BLAST volumes. Re-run mg_fetch.sh.
  • The HMM search reports "0 hits" — your protein may genuinely not match any of the 1,057 HMMs (we cover ~50 biological categories, not all of Pfam). Try loosening --hmm-E first (a larger value admits weaker matches); if still nothing, the protein may be from a family we don't have an HMM for.
  • It seems hung — open characterize_out/run.log in another terminal (tail -f characterize_out/run.log) to see live progress.
Example 02 · phylogeny

Build a phylum-wide tree of a gene family from a few seed sequences.

This is the flagship analysis. Given one or more reference (seed) sequences for a gene family — say, three nicotinic acetylcholine receptor sequences from a model species — mg_place.sh finds every homolog across MolluscaGenes, aligns and trims them, infers a phylogenetic tree with bootstrap and aBayes branch support, prunes long-branch outliers ("rogue tips"), re-aligns and re-infers a final tree, and renames the tips from internal species codes to binomials.

The whole pipeline is one command. Internally it runs nine steps (a four-round iterative search, accession extraction, MAFFT, ClipKit, IQ-TREE 2 round 1, TreeShrink, MAFFT again, IQ-TREE 2 round 2, tip renaming) — each step writes a small .done marker so you can re-run with new flags and only the affected steps recompute.

You'll need
A small reference FASTA — typically 1–10 protein sequences for the family of interest. We'll call it seeds.fa.
Time
~2 hours wall time on 16 threads with DIAMOND-based search; ~6 hours with BLAST. The MAFFT and IQ-TREE rounds dominate.
Disk
~1–5 GB of intermediate files. The final tree itself is small.
Memory
~16 GB peak during IQ-TREE on a few-thousand-sequence alignment.

Heads-up. Iterative search at e ≤ 1e-96 is appropriately strict for most receptor / enzyme families. If you're working with very short or rapidly-evolving proteins (e.g. defensins, venom peptides), relax the e-value and lower the iteration count to avoid pulling in non-homologous noise: -e 1e-20 --iterations 2.

1

Run the command.

wrappers/mg_place.sh \
    -q seeds.fa \
    -o nachr_tree \
    --search diamond \
    --iterations 4 \
    --mafft-mode einsi \
    --clipkit-mode kpic-smart-gap \
    --bb-final 10000 \
    -t 16

  • -q seeds.fa — your reference sequences.
  • -o nachr_tree — the output directory (created if it doesn't exist).
  • --search diamond — much faster than blastp; use --search blastp for maximum sensitivity.
  • --iterations 4 — four rounds of iterative recruitment.
  • --mafft-mode einsi — E-INS-i: best for proteins with conserved blocks plus variable regions.
  • --clipkit-mode kpic-smart-gap — the ClipKit trim mode.
  • --bb-final 10000 — UFBoot replicates for the final tree.
  • -t 16 — threads.

The trailing \ on each line is shell convention for "this command continues on the next line". You can also paste it all on a single long line if you prefer.

2

What's happening behind the scenes.

  1. Iterative search (rounds 1–4). Round 1 searches your seeds against MolluscaGenes; round 2 uses the round-1 hits as a new query set; rounds 3 and 4 do the same. Each round broadens the recruited set, capturing increasingly divergent homologs.
  2. Extraction. All unique hits are pulled out as FASTA via blastdbcmd and concatenated with your seeds.
  3. Alignment. MAFFT's E-INS-i algorithm aligns the full set, balancing speed against accuracy on protein families with conserved domains separated by variable linkers.
  4. Trimming. ClipKit removes ambiguously aligned positions while keeping parsimony-informative columns.
  5. Round-1 tree. IQ-TREE 2 picks an evolutionary model (TESTNEW), infers a maximum-likelihood tree, and computes ultra-fast bootstrap and aBayes branch support.
  6. TreeShrink. Long-branch outliers ("rogue tips") are detected and removed.
  7. Re-alignment. The pruned set is realigned for the final tree.
  8. Final tree. IQ-TREE 2 runs again, this time using the model selected in round 1 and a higher number of bootstrap replicates (--bb-final 10000).
  9. Tip renaming. Every species-code prefix (OctbimEVm…) is replaced with the corresponding binomial (Octopus_bimaculoides_EVm…) via a sed substitution, yielding a tree that's human-readable in any visualizer.
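Step 9's rename can be sketched in a few lines. The shape below — one sed substitution per species code, built from a two-column mapping — is illustrative; the wrapper's actual mapping file and substitution rule may differ:

```shell
# Build a sed script from a mapping file (code<TAB>binomial), one
# substitution per species code, and apply it to a Newick tree. The
# trailing underscore turns OctbimEVm... into Octopus_bimaculoides_EVm...
rename_tips() {
  map=$1; tree=$2
  script=$(awk -F'\t' '{ printf "s/%s/%s_/g\n", $1, $2 }' "$map")
  sed "$script" "$tree"
}
```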
3

What you'll find when it finishes.

The most useful files:

  • nachr_tree/final.names.contree — the final consensus tree, with binomial tip labels and bootstrap support values. Open this in iTOL, FigTree, or ete3.
  • nachr_tree/final.iqtree — IQ-TREE log: model selection summary, parameter estimates, support tests.
  • nachr_tree/treeshrink/run/output.fasta — the alignment after rogue-tip removal.
  • nachr_tree/iter1/hits.tsv through iter4/hits.tsv — every iterative round, fully traceable.
4

Iterate on flags without losing work.

If you decide partway through that you want a different alignment mode or trim setting, you don't have to start over. Edit the command and re-run; the wrapper will skip the steps whose output is already on disk and only redo the affected stages.

wrappers/mg_place.sh -q seeds.fa -o nachr_tree --clipkit-mode gappy

That re-runs ClipKit, both IQ-TREE rounds, and the rename — but skips the four iterative searches and the alignment, which are by far the slowest steps. Pass --force to redo everything.

5

If you want to skip TreeShrink.

If you'd rather see every recruited sequence in the final tree (e.g. to look at long-branch outliers explicitly), pass --skip-treeshrink. This roughly halves the total runtime.

If something goes wrong
  • IQ-TREE runs out of memory. Drop --bb-final back to 1000 first; if still failing, sub-sample your alignment or trim more aggressively (--clipkit-mode gappy).
  • "too few sequences for tree inference" — your seeds are too restrictive. Loosen -e or add more seed sequences.
  • The final tree looks like spaghetti — the family is probably too divergent for one tree. Split your seeds by sub-family and run two separate placements, then concatenate the trees externally.
  • You see "diamond: command not found" — the conda env isn't activated. conda activate molluscagenes.
Example 03 · extract

Pull every protein from a class, family, or species set.

Use this when you want to run an analysis on every protein from a particular taxonomic group — say, all cephalopods, or every Octopus species — without writing any glue code. mg_extract.sh looks up the species you specify in the metadata table, then pulls their protein and/or transcript sequences out of the BLAST databases.

You'll need
A selector: either a species code, a binomial, a comma-separated list, a file with one species per line, or a taxonomic rank (class / order / family / phylum).
Time
First run on a fresh database: ~2 minutes (the wrapper builds an accession-by-species cache, ~600 MB under metadata/_cache/). Subsequent runs are I/O-bound and complete in seconds.
1

Pick a selector form.

Concrete forms, all equivalent in usage:

wrappers/mg_extract.sh -s Octbim                          -o out -d aa  # one code
wrappers/mg_extract.sh -s "Octopus bimaculoides" --by binomial -o out -d aa
wrappers/mg_extract.sh -s Octbim,Sepoff,Eupber             -o out -d aa  # comma list
wrappers/mg_extract.sh -s @my_codes.txt                    -o out -d aa  # file
wrappers/mg_extract.sh -s Cephalopoda --by class           -o out -d aa  # by rank
wrappers/mg_extract.sh -s Octopodidae --by family          -o out -d aa
2

Run a real example: every cephalopod protein.

Here we extract every protein from every cephalopod species in v0.1 (28 species, ~2.8 M sequences total), with --merge to concatenate them into a single FASTA file:

wrappers/mg_extract.sh \
    -s Cephalopoda --by class \
    -o cephalopoda_proteins \
    -d aa --merge \
    -t 8

The -d aa flag picks the protein database; use -d mrna for transcripts or -d both for parallel extraction of both.

3

What you get.

  • cephalopoda_proteins/extracted_aa.fa — single merged FASTA with ~2.8 M sequences (because of --merge; without it you'd get one FASTA per species).
  • cephalopoda_proteins/extracted_species.tsv — per-species count of what was actually extracted, plus binomial / class / order / family for each row.

Quick sanity-check from the terminal:

grep -c '^>' cephalopoda_proteins/extracted_aa.fa
2800153

Use the species browser to design your selector. Browse all species, filter by class or family, then export the filtered list as TSV — the first column is the species code, ready to feed into -s @file.txt.
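If your browser export carries extra columns or a header row, a one-line awk pass (a generic TSV idiom, not part of the wrappers) turns it into a clean selector file:

```shell
# Keep only the first column (the species code) of a TSV export,
# dropping the header row, to produce a file suitable for -s @codes.txt.
tsv_to_codes() { awk -F'\t' 'NR > 1 { print $1 }' "$1"; }
```

Usage: `tsv_to_codes export.tsv > my_codes.txt`, then pass `-s @my_codes.txt`.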

Example 04 · HMM scan

Scan a proteome with the mollusc-revised HMMs.

Given any protein FASTA — your own assembly, a set extracted with mg_extract.sh, or any third-party proteome — annotate it domain-by-domain using the 1,057 mollusc-revised HMMs. Because these HMMs were re-trained against molluscan sequence diversity, they recover divergent homologs that vertebrate-trained Pfam profiles frequently miss.

You'll need
A protein FASTA. We'll generate one from Octopus bimaculoides using mg_extract.
Time
~10 minutes for an 80,000-protein proteome on 8 threads with the full bundle (1,057 HMMs). Single-HMM scans take seconds.
1

Get a proteome.

wrappers/mg_extract.sh -s "Octopus bimaculoides" --by binomial -o obim --merge
2

Scan with the full HMM bundle.

wrappers/mg_hmmsearch.sh -q obim/extracted_aa.fa -o obim_hmm -t 8

By default this uses the e-value threshold 1e-5. For high-confidence calls consistent with Pfam's own curation, use Pfam's gathering thresholds instead:

wrappers/mg_hmmsearch.sh -q obim/extracted_aa.fa -o obim_hmm --cut-ga -t 8
3

Or scan with a single HMM.

If you only care about one domain family — e.g. the ligand-binding domain of pentameric Cys-loop receptors — point --hmm at the per-domain file:

wrappers/mg_hmmsearch.sh \
    -q obim/extracted_aa.fa \
    -o obim_neur_chan \
    --hmm hmm/per_domain/Neur_chan_LBD_REVISION.hmm \
    -t 8

This is much faster than the full scan (~30 seconds), and is often the right approach when you're chasing a specific family rather than annotating a whole proteome.

4

What you get.

  • obim_hmm/hits.tbl — the standard HMMER per-sequence tabular file. One line per (sequence, HMM) match.
  • obim_hmm/hits.domtbl — per-domain output: detailed coordinates of each domain hit on each sequence.
  • obim_hmm/hits_with_species.tsv — same as hits.tbl but with binomial / class / order / family appended for each row, so you don't need to cross-reference manually.
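A first question after a full scan is usually "which domains are most common in this proteome?". Standard HMMER --tblout files have '#'-prefixed comment lines, and for hmmsearch the query (HMM) name is whitespace-separated column 3, so a short awk tally answers it:

```shell
# Count hits per HMM in a HMMER --tblout file and list the most
# frequent first. Skips '#' comment lines; column 3 is the HMM name.
hits_per_hmm() {
  awk '!/^#/ { n[$3]++ } END { for (h in n) print n[h], h }' "$1" | sort -rn
}
```

For example, `hits_per_hmm obim_hmm/hits.tbl | head` shows the ten most-hit families.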

Pick HMMs interactively. The HMMs (Browser) page lets you filter the 1,057 HMMs by biological category (GPCR, VGIC, LGIC, …), tick a subset, and copy a curl snippet or download a ZIP bundle. Useful when you want to scan with a small custom subset rather than the full collection.

Example 05 · custom flags

Reach a flag we don't expose by name.

Each wrapper exposes a curated set of flags by name (the most-used settings) and an --extra "ARGS" passthrough so anything we haven't named is still reachable. The string is split on whitespace and appended verbatim to the underlying tool's command line. Three examples:

wrappers/mg_blast.sh -q q.fa -o out -d aa \
    --extra "-gapopen 11 -gapextend 1 -seg yes"
wrappers/mg_diamond.sh -q q.fa -o out \
    --extra "--masking 0 --comp-based-stats 1 --frameshift 15"
wrappers/mg_place.sh -q seeds.fa -o phylo \
    --iqtree-extra "-alrt 1000 -lbp 1000" \
    --search-extra "-soft_masking true"

Whatever you pass via --extra also lands verbatim in the run.log, alongside the tool version that was used. A reviewer asking "what settings did you use?" gets a single deterministic answer.

Don't know the underlying tool's flags? Each wrapper calls a familiar tool you can introspect directly: blastp -help, diamond help blastp, hmmsearch -h, mafft --help, iqtree -h. Anything those tools accept can be passed through --extra.
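The whitespace split itself is ordinary shell behavior; a minimal sketch (not the wrappers' actual code) of how an --extra string becomes individual arguments:

```shell
# Demonstrate the whitespace split: the deliberately unquoted expansion
# word-splits the string, and "$@" then carries each token separately,
# exactly as they would be appended to the underlying tool.
split_extra() {
  set -- $1          # unquoted on purpose: split on whitespace
  printf '%s\n' "$@" # one token per line
}
```

So `split_extra "-gapopen 11 -gapextend 1"` yields four tokens, which is why quoting the whole --extra string on the wrapper's command line is required.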


Troubleshooting

Common problems & fixes.

"command not found"
  • conda not found — install Miniforge and reopen the terminal.
  • blastp / diamond / iqtree not found — the conda environment isn't activated. Run conda activate molluscagenes.
  • Wrapper scripts not found — you're not in the repo root. cd to the folder you cloned with git clone.
"no config.sh found"
  • Run mg_fetch.sh first; it writes config.sh for you.
  • Or copy config.sh.template to config.sh and edit the paths by hand to point at your downloaded BLAST volumes / DIAMOND db / HMM file.
"BLAST db not found"
  • The path in config.sh is wrong — open it in a text editor and verify $MG_BLAST_DIR points to the folder containing mollusca_aa.* and mollusca_mrna.*.
  • If mg_fetch.sh failed midway, the tarball may not have been extracted. cd into your storage folder and check whether the .tar.gz files are present alongside the extracted volumes.
Out of disk during a run
  • The BLAST/DIAMOND databases are ~13 GB compressed, ~25 GB extracted. Add mg_place intermediates and you can need 30–40 GB.
  • Move the storage folder to a larger drive and re-run mg_fetch.sh /new/path.
The wrapper "skipped" my run
  • The output directory contains a .done sentinel from a previous run. Pass --force to override, or delete the directory first.
Permission denied
  • chmod +x wrappers/*.sh scripts/*.sh scripts/*.py to make sure everything is executable. (After git clone they should be — but some Windows file systems strip the permission bit.)
It's slow
  • Increase threads with -t N, where N = number of CPU cores.
  • For exploratory work use --search diamond rather than blastp; DIAMOND is ~50× faster at modest sensitivity loss.
  • For very-long-running placements, --skip-treeshrink halves the runtime.

Reference

Glossary.

Bash / shell
The text-based program in the terminal that interprets the commands you type. bash and zsh are the most common; both work identically for these tutorials.
BLAST
The Basic Local Alignment Search Tool. The classic similarity-search algorithm — sensitive but slow on very large databases. Uses include blastp (protein vs. protein) and blastn (nucleotide vs. nucleotide).
DIAMOND
A modern protein aligner that runs ~50–500× faster than BLAST at comparable sensitivity, by double-indexing both the query and the database. Protein-only; there is no DIAMOND mode for nucleotide-vs-nucleotide.
HMM (hidden Markov model)
A probabilistic model of a protein domain family. Each position in the model has its own preferences for which amino acids it accepts. HMMs are more sensitive than BLAST for finding distant family members, especially when the family has conserved positions interleaved with variable ones.
Pfam
The standard public collection of protein-domain HMMs. MolluscaGenes ships TIAMMAt-revised versions of 1,057 Pfam HMMs that have been retrained on molluscan sequence data.
FASTA
A plain-text file format for sequences. Each entry begins with a > header line and is followed by the sequence on the next lines.
e-value
The expected number of hits at least as good as the observed one that you would see by chance. Lower is more stringent — 1e-50 is a near-certain homolog; 1e-5 may include a few spurious hits but catches more distant relatives.
Bootstrap (UFBoot, aBayes)
Statistical measures of confidence in tree branches. UFBoot ("ultrafast bootstrap") values ≥95 and aBayes ≥0.95 are typically considered well-supported.
conda environment
An isolated bundle of software dependencies. Activating one re-routes blastp, iqtree, etc. to specific versions installed under ~/miniforge3/envs/molluscagenes, without affecting your system-wide programs.
Sentinel file (.done)
A tiny marker file the wrappers leave behind to record which steps have completed. Lets you re-run a wrapper without redoing finished work.