MolluscaGenes: a transcriptomic database for the Mollusca.
A taxonomically comprehensive resource consolidating de-novo and previously published transcriptomes for ~300 mollusc species across all eight classes, paired with mollusc-revised Pfam HMMs and command-line wrappers for search, extraction, and iterative-BLAST phylogenetic placement.
An under-served phylum, finally consolidated.
Mollusca is the second-largest animal phylum (over 70,000 described species across eight classes), yet its genomic resources remain scattered across NCBI BioProjects, MolluscDB, MateDB, and individual lab repositories. Detection of divergent protein homologs is further constrained by the bias of public HMM resources toward vertebrate and ecdysozoan sequences — Pfam profiles can miss legitimate molluscan family members because of lineage-specific substitution patterns.
MolluscaGenes addresses both problems. It consolidates ~300 species' worth of transcriptomic data into a single searchable resource and ships mollusc-optimized HMMs: Pfam profiles iteratively re-trained against molluscan sequence diversity, yielding +11.7 % mean detection sensitivity over stock Pfam-A 36.0 across six mollusc proteomes.
v0.1 is the database used in the accompanying bioRxiv preprint. A full HPC rebuild (v1.0) is in progress and will supersede v0.1 under the same Zenodo concept DOI.
The release at a glance.
- ~300 species across all eight molluscan classes (Gastropoda, Bivalvia, Cephalopoda, Polyplacophora, Scaphopoda, Solenogastres, Caudofoveata, Monoplacophora).
- ~17 million transcripts (~16.8 Gb of nucleotide data).
- ~17 million predicted proteins (~3.3 Gb of amino-acid data).
- BLAST databases (protein + nucleotide) and a DIAMOND protein database built from the same source.
- 1,057 mollusc-optimized HMMs across 12 themes / 104 subcategories (909 TIAMMAt-revised, 148 original Pfam-A 36.0 fallback after per-HMM specificity QC) — +11.7 % mean detection sensitivity vs. stock Pfam-A 36.0 across six mollusc proteomes (see evaluation).
- Per-species & per-HMM metadata with NCBI taxonomy and WoRMS cross-links.
- CLI wrappers: mg_blast, mg_diamond, mg_hmmsearch, mg_characterize, mg_place, mg_extract.
Taxonomic coverage
| Class | Species (≈) |
|---|---|
| Gastropoda | 140 |
| Bivalvia | 62 |
| Solenogastres | 36 |
| Cephalopoda | 28 |
| Polyplacophora | 26 |
| Caudofoveata | 8 |
| Scaphopoda | 5 |
| Monoplacophora | 1 |
v0.1 includes ~299 species. The species browser has the full table, filterable by class and exportable to TSV.
+11.7 % detection sensitivity (vs. stock Pfam-A 36.0).
Drop-in for Pfam-A 36.0.
Six-proteome benchmark comparing 1,057 mollusc-optimized HMMs against original Pfam-A 36.0, with per-HMM specificity QC, a 12-theme curation taxonomy, and a specificity-decomposed, figure-rich report.
View evaluation report →
How v0.1 was built.
Raw reads were sourced from the NCBI Sequence Read Archive (paired-end Illumina,
≥20M read-pairs per sample, prioritizing tissue diversity and under-represented
classes), supplemented with pre-assembled transcriptomes from MateDB and MolluscDB.
The pipeline is the nf-core/denovotranscript multi-assembler workflow.
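The read-selection criteria described above (paired-end Illumina, ≥20M read-pairs per sample) can be approximated with a small filter over an SRA RunInfo CSV export. This is only a sketch, not the project's actual selection code: it assumes the standard RunInfo column names `Run`, `spots`, `LibraryLayout`, and `Platform`, treats one spot as one read pair for paired layouts, and does not model the prioritization of tissue diversity or under-represented classes. The example rows are placeholders, not real accessions.

```python
import csv
import io

MIN_READ_PAIRS = 20_000_000  # threshold stated in the text (>=20M read-pairs)


def select_runs(runinfo_csv: str) -> list[str]:
    """Return run IDs that are paired-end Illumina with enough read pairs."""
    selected = []
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        if row["LibraryLayout"] != "PAIRED":
            continue
        if row["Platform"] != "ILLUMINA":
            continue
        # Assumption: for paired layouts, one "spot" is one read pair.
        if int(row["spots"]) >= MIN_READ_PAIRS:
            selected.append(row["Run"])
    return selected


# Placeholder rows for illustration only; these are not real SRA runs.
example = """Run,spots,LibraryLayout,Platform
RUN_A,25000000,PAIRED,ILLUMINA
RUN_B,12000000,PAIRED,ILLUMINA
RUN_C,40000000,SINGLE,ILLUMINA
RUN_D,30000000,PAIRED,OXFORD_NANOPORE
"""

print(select_runs(example))
```

Only `RUN_A` passes all three checks here; the others fail on depth, layout, or platform respectively.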
Case study — phylum-wide nAChR phylogeny.
To demonstrate the resource on a real question, the accompanying bioRxiv preprint applies MolluscaGenes to a phylum-wide characterization of the nicotinic acetylcholine receptor (nAChR) superfamily, including the recently described chemotactile receptors (CRs) of cephalopods.
Among the findings: lineage-specific expansions in bivalves (Alpha-A10a, Alpha-A10b, AChRB) and cephalopods (Alpha-A10a, DopC, CR, CR-like); identification of novel clades with substitutions at canonical ligand-binding residues; and the placement of cephalopod chemotactile receptors and CR-like sequences as distinct lineages within the broader nAChR superfamily — with CR-like sequences detected across non-cephalopod molluscs, suggesting an ancient component of the molluscan repertoire.
The full analysis (alignment, tree, clade membership, motif features) is
reproducible from this repository: see the
phylogenetic placement tutorial for
the equivalent mg_place.sh command, and the manuscript
Methods §2.5 for the reference seed set.
Coming in v1.0.
A full HPC-driven rebuild (v1.0) is tracked under the project's issue tracker and will supersede the current release under the same Zenodo concept DOI. Planned changes:
- Larger taxonomic scope. Re-querying NCBI's SRA against an updated MolluscaBase target list, including taxa added since v0.1 was assembled.
- Snakemake pipeline. The new download + assembly workflow runs on an HPC cluster with explicit per-stage resume, disk watchdogs, and Kraken2 contamination screening.
- BUSCO completeness in metadata. Per-species BUSCO scores on Metazoa odb10 will populate species_metadata.tsv as a quality column.
- Source provenance. Per-species source_accession (BioProject / SRA run) and reference_citation_doi populated from NCBI BioProject linkage; empty for most rows in v0.1.
- Annotation add-on. A separate v1.1 workflow producing GO terms, KEGG pathways, and InterPro hits as a downloadable annotation TSV.
- Discovery site upgrade. Eleventy-driven static site with client-side MinHash search over species sets, replacing the present plain-HTML browsers.
- Expanded HMM coverage. Considering re-applying TIAMMAt across the full Pfam catalog rather than the current 1,057-domain biological-process subset.
Frequently asked.
What's the difference between v0.1 and v1.0?
v0.1 is the database used in the bioRxiv preprint: pre-existing EvidentialGene assemblies for ~300 species, manually curated. v1.0 will be the output of a from-scratch HPC rebuild using a new Snakemake pipeline, including BUSCO completeness scores and full provenance per species. Both share the same Zenodo concept DOI; v1.0 supersedes v0.1 when published.
Can I redistribute the data?
Yes. The Zenodo deposit is licensed CC-BY-4.0 — you can redistribute, modify, and build on it commercially or academically as long as you cite the deposit. Code in this repository is GPL-3.0; derivative software must remain open under a compatible license.
Do I need an NCBI API key?
No, not to use MolluscaGenes — searches run entirely against the
local database. Only the metadata-build script (build_metadata.py)
hits NCBI Entrez, and that runs once at release time. End users never need
a key.
How do I cite v0.1?
Cite both the Zenodo deposit (gives credit to the data) and the bioRxiv
preprint (gives credit to the analysis). Both DOIs will appear on this
site once the deposit is published; the machine-readable block is in
CITATION.cff.
How do I report a bug or request a species?
Open an issue on the GitHub tracker. Species requests for v1.0 inclusion are especially welcome — list the taxon and any preferred SRA accession(s).
What if a species I care about has 0 sequences in v0.1?
A subset of species in dict2.tsv are placeholders that will
be assembled in v1.0. The species browser
shows current sequence counts; species with zero counts are listed for
transparency but contribute nothing to v0.1 searches.
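To check which species in a TSV export are placeholders, a short script can split rows by sequence count. The sketch below assumes a tab-separated file with hypothetical column names `species` and `n_sequences`; the real export from the species browser (or dict2.tsv) may use different headers, so adjust accordingly. The rows shown are invented placeholders, not actual database entries.

```python
import csv
import io


def split_by_count(tsv_text: str, count_col: str = "n_sequences"):
    """Split species into (searchable, placeholders) by sequence count."""
    searchable, placeholders = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Zero-count rows are listed but contribute nothing to searches.
        target = searchable if int(row[count_col]) > 0 else placeholders
        target.append(row["species"])
    return searchable, placeholders


# Invented rows for illustration; column names are assumptions.
example_tsv = (
    "species\tn_sequences\n"
    "Genus_a species_1\t52000\n"
    "Genus_b species_2\t0\n"
    "Genus_c species_3\t87000\n"
)

searchable, placeholders = split_by_count(example_tsv)
print("searchable:", searchable)
print("placeholders:", placeholders)
```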
Why are some Pfam descriptions blank in the HMMs (Browser) page?
A handful of mollusc-revised HMMs target Pfam families whose InterPro entries have no description text yet, or whose API records use a schema we don't fully parse. The HMM itself is fine — the gap is purely in the human-readable annotation.
Download & first run.
The database files (~13 GB compressed) are hosted on Zenodo. The concept DOI always resolves to the latest version; the v0.1 DOI permanently pins this release.
- Concept DOI (always latest): 10.5281/zenodo.19825265
- v0.1 DOI (pinned): 10.5281/zenodo.19825266
- Zenodo record: zenodo.org/records/19825266
Recommended workflow — clone the repo, then fetch & verify with the wrapper:
    git clone https://github.com/invertome/molluscagenes
    cd molluscagenes
    conda env create -f environment.yml
    conda activate molluscagenes
    bash wrappers/mg_fetch.sh /path/to/storage   # downloads, verifies, writes config.sh
    source config.sh
    bash wrappers/mg_blast.sh -q my_query.fa -o blast_out -d aa
mg_fetch.sh reads the Zenodo record id from
metadata/zenodo_record.txt, downloads every artifact in the manifest,
verifies SHA256, extracts the BLAST/HMM tarballs, and writes a populated
config.sh. To re-verify an existing download:
bash wrappers/verify_download.sh /path/to/storage.
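For readers who want to check a single artifact by hand, the SHA256 step can be sketched in a few lines of Python. This is an independent illustration using the standard library, not the logic inside mg_fetch.sh or verify_download.sh; the helper names `sha256_of` and `verify` are hypothetical, and the expected digest would come from the release manifest, whose exact format is not shown here. The demo hashes a temporary file containing the bytes `abc`, a standard SHA-256 test vector.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte archives fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 against an expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-contained demo on a temporary file rather than a real artifact.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as fh:
        fh.write(b"abc")
    known = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    demo_ok = verify(demo_path, known)
    demo_bad = verify(demo_path, "0" * 64)

print(demo_ok, demo_bad)
```

Reading in fixed-size chunks matters for this release because the compressed database is roughly 13 GB, far more than is sensible to load at once.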
If you use MolluscaGenes.
Please cite both the data deposit and the bioRxiv preprint:
Pérez-Moreno JL, Katz PS. MolluscaGenes: a transcriptomic database for the Mollusca (v0.1). Zenodo, 2026. https://doi.org/10.5281/zenodo.19825266
Pérez-Moreno JL, Katz PS. MolluscaGenes: A Transcriptomic Database for the Mollusca. bioRxiv, 2026. DOI: TBD (preprint pending).
The full machine-readable citation block is in
CITATION.cff.