MolluscaGenes: a transcriptomic database for the Mollusca.

A taxonomically comprehensive resource consolidating de novo and previously published transcriptomes for ~300 mollusc species across all eight classes, paired with mollusc-revised Pfam HMMs and command-line wrappers for search, extraction, and iterative-BLAST phylogenetic placement.

License: GPL-3.0 (code) · CC-BY-4.0 (data)
Species: ~300
Coverage: 8 classes
Sequences: 17M
HMMs: 1,057
Detection sensitivity: +11.7%

Why MolluscaGenes

An under-served phylum, finally consolidated.

Mollusca is the second-largest animal phylum (over 70,000 described species across eight classes), yet its genomic resources remain scattered across NCBI BioProjects, MolluscDB, MateDB, and individual lab repositories. Detection of divergent protein homologs is further constrained by the bias of public HMM resources toward vertebrate and ecdysozoan sequences — Pfam profiles can miss legitimate molluscan family members because of lineage-specific substitution patterns.

MolluscaGenes addresses both problems. It consolidates ~300 species' worth of transcriptomic data into a single searchable resource, and it ships mollusc-optimized HMMs: Pfam profiles iteratively re-trained against molluscan sequence diversity, which raises mean detection sensitivity by 11.7% over stock Pfam-A 36.0 in the six-proteome benchmark.

v0.1 is the database used in the accompanying bioRxiv preprint. A full HPC rebuild (v1.0) is in progress and will supersede v0.1 under the same Zenodo concept DOI.

What's inside

The release at a glance.

  • ~300 species across all eight molluscan classes (Gastropoda, Bivalvia, Cephalopoda, Polyplacophora, Scaphopoda, Solenogastres, Caudofoveata, Monoplacophora).
  • ~17 million transcripts (~16.8 Gb of nucleotide data).
  • ~17 million predicted proteins (~3.3 Gb of amino-acid data).
  • BLAST databases (protein + nucleotide) and a DIAMOND protein database built from the same source.
  • 1,057 mollusc-optimized HMMs across 12 themes / 104 subcategories (909 TIAMMAt-revised, 148 original Pfam-A 36.0 fallback after per-HMM specificity QC) — +11.7 % mean detection sensitivity vs. stock Pfam-A 36.0 across six mollusc proteomes (see evaluation).
  • Per-species & per-HMM metadata with NCBI taxonomy and WoRMS cross-links.
  • CLI wrappers: mg_blast, mg_diamond, mg_hmmsearch, mg_characterize, mg_place, mg_extract.

Taxonomic coverage

Class             Species (≈)
Gastropoda        140
Bivalvia          62
Solenogastres     36
Cephalopoda       28
Polyplacophora    26
Caudofoveata      8
Scaphopoda        5
Monoplacophora    1

Release v0.1 includes ~299 species. The species browser has the full table, filterable by class and exportable to TSV.
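
For scripted use, a class filter like the browser's can be sketched against a tab-separated export. This is only an illustration: the file name species_export.tsv and the column header "class" are assumptions, not the documented schema of the exported TSV, so adjust them to match the real header row.

```shell
#!/usr/bin/env bash
# Sketch: keep the header plus the rows of one molluscan class from a TSV.
# ASSUMPTION: the TSV has a header row with a column named "class";
# the real export may use a different column name.

filter_by_class() {
  local tsv="$1" wanted="$2"
  awk -F'\t' -v wanted="$wanted" '
    NR == 1 {
      # Locate the "class" column by name instead of relying on its position.
      for (i = 1; i <= NF; i++) if ($i == "class") col = i
      if (!col) { print "no \"class\" column found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    $col == wanted
  ' "$tsv"
}
```

Example call, with a hypothetical file name: filter_by_class species_export.tsv Cephalopoda > cephalopods.tsv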

Performance

+11.7 % detection sensitivity.
Drop-in for Pfam-A 36.0.

A six-proteome benchmark compares the 1,057 mollusc-optimized HMMs against the original Pfam-A 36.0 profiles and finds a +11.7% mean sensitivity gain. The evaluation report documents the per-HMM specificity QC and the 12-theme curation taxonomy, and it breaks the results down by specificity with accompanying figures.

Method

How v0.1 was built.

Raw reads were sourced from the NCBI Sequence Read Archive (paired-end Illumina, ≥20M read-pairs per sample, prioritizing tissue diversity and under-represented classes), supplemented with pre-assembled transcriptomes from MateDB and MolluscDB. The pipeline is the nf-core/denovotranscript multi-assembler workflow:

  • Reads: QC with fastp v0.23.
  • Assemble: Trinity v2.15 and rnaSPAdes v3.15 (k=25,49,73).
  • Reduce + QC: EvidentialGene (98% clustering; main + alt models), then BUSCO v5.4 on Metazoa odb10.
  • Annotate: DIAMOND vs. RefSeq; InterProScan (Pfam, CDD, SMART).
  • TIAMMAt (mollusc-revised HMMs): iterative Pfam → hmmsearch → MAFFT → hmmbuild, ×3–5.

Multi-assembler workflow with EvidentialGene redundancy reduction and BUSCO gating (≥30% completeness on Metazoa odb10). Annotation is DIAMOND vs. RefSeq + InterProScan (Pfam, SMART, CDD, SUPERFAMILY, Gene3D) plus eggNOG-mapper for GO/KEGG. The TIAMMAt branch iteratively re-trains 1,057 Pfam HMMs against the molluscan sequence pool, validated against six independent RefSeq proteomes (A. californica, C. gigas, L. anatina, L. gigantea, O. bimaculoides, P. canaliculata).
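
The iterative hmmsearch → MAFFT → hmmbuild loop described above can be sketched for a single profile as follows. This is a minimal illustration of the general idea, not TIAMMAt's actual implementation: the file names, the 1e-5 E-value cutoff, the default of three rounds, and the function names are all assumptions, and the per-HMM specificity QC is not shown. It relies only on standard HMMER, Easel, and MAFFT commands.

```shell
#!/usr/bin/env bash
# Sketch: re-train one profile HMM against a protein pool over a few rounds.
# ASSUMPTIONS: single-profile seed HMM, 1e-5 E-value cutoff, 3 rounds by
# default; TIAMMAt may use different thresholds and additional checks.

# Unique target names from an hmmsearch --tblout table (skips "#" comment lines).
tblout_ids() {
  awk '!/^#/ && NF { print $1 }' "$1" | sort -u
}

# Usage: retrain_hmm seed.hmm pool.faa [rounds] [outdir]
# The body runs in a subshell so "set -e" does not leak into the caller.
retrain_hmm() (
  set -euo pipefail
  seed_hmm="$1"; pool_faa="$2"; rounds="${3:-3}"; outdir="${4:-retrain_out}"
  mkdir -p "$outdir"
  # esl-sfetch needs an SSI index next to the FASTA file.
  [ -f "$pool_faa.ssi" ] || esl-sfetch --index "$pool_faa" > /dev/null
  current="$seed_hmm"
  for r in $(seq 1 "$rounds"); do
    # 1. Search the pool with the current model.
    hmmsearch --tblout "$outdir/round$r.tbl" -E 1e-5 "$current" "$pool_faa" > /dev/null
    tblout_ids "$outdir/round$r.tbl" > "$outdir/round$r.ids"
    [ -s "$outdir/round$r.ids" ] || { echo "round $r: no hits" >&2; break; }
    # 2. Pull the hit sequences and align them.
    esl-sfetch -f "$pool_faa" "$outdir/round$r.ids" > "$outdir/round$r.faa"
    mafft --auto --quiet "$outdir/round$r.faa" > "$outdir/round$r.aln"
    # 3. Build the next-round model from the aligned FASTA.
    hmmbuild --informat afa "$outdir/round$r.hmm" "$outdir/round$r.aln" > /dev/null
    current="$outdir/round$r.hmm"
  done
  echo "$current"   # path of the last model built (or the seed if nothing was built)
)
```

Each round searches with the model built in the previous round, so divergent homologs recovered early can help recruit further sequences later. That is also why a specificity check such as the per-HMM QC mentioned above matters before a revised model is accepted.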
Demonstration

Case study — phylum-wide nAChR phylogeny.

To demonstrate the resource on a real question, the accompanying bioRxiv preprint applies MolluscaGenes to a phylum-wide characterization of the nicotinic acetylcholine receptor (nAChR) superfamily, including the recently described chemotactile receptors (CRs) of cephalopods.

  • 3,586 nAChR sequences recovered
  • 190+ species across all 8 classes
  • 15 phylogenetic clades resolved

Among the findings: lineage-specific expansions in bivalves (Alpha-A10a, Alpha-A10b, AChRB) and cephalopods (Alpha-A10a, DopC, CR, CR-like); identification of novel clades with substitutions at canonical ligand-binding residues; and the placement of cephalopod chemotactile receptors and CR-like sequences as distinct lineages within the broader nAChR superfamily — with CR-like sequences detected across non-cephalopod molluscs, suggesting an ancient component of the molluscan repertoire.

The full analysis (alignment, tree, clade membership, motif features) is reproducible from this repository: see the phylogenetic placement tutorial for the equivalent mg_place.sh command, and the manuscript Methods §2.5 for the reference seed set.

Roadmap

Coming in v1.0.

A full HPC-driven rebuild (v1.0) is tracked under the project's issue tracker and will supersede the current release under the same Zenodo concept DOI. Planned changes:

  • Larger taxonomic scope. Re-querying NCBI's SRA against an updated MolluscaBase target list, including taxa added since v0.1 was assembled.
  • Snakemake pipeline. The new download + assembly workflow runs on an HPC cluster with explicit per-stage resume, disk watchdogs, and Kraken2 contamination screening.
  • BUSCO completeness in metadata. Per-species BUSCO scores on Metazoa odb10 will populate species_metadata.tsv as a quality column.
  • Source provenance. Per-species source_accession (BioProject / SRA run) and reference_citation_doi populated from NCBI BioProject linkage — empty for most rows in v0.1.
  • Annotation add-on. A separate v1.1 workflow producing GO terms, KEGG pathways, and InterPro hits as a downloadable annotation TSV.
  • Discovery site upgrade. Eleventy-driven static site with client-side MinHash search over species sets, replacing the present plain-HTML browsers.
  • Expanded HMM coverage. Considering re-applying TIAMMAt across the full Pfam catalog rather than the current 1,057-domain biological-process subset.
FAQ

Frequently asked.

What's the difference between v0.1 and v1.0?

v0.1 is the database used in the bioRxiv preprint: manually curated, pre-existing EvidentialGene assemblies for ~300 species. v1.0 will be the output of a from-scratch HPC rebuild using a new Snakemake pipeline, including BUSCO completeness scores and full provenance per species. Both share the same Zenodo concept DOI; v1.0 supersedes v0.1 when published.

Can I redistribute the data?

Yes. The Zenodo deposit is licensed CC-BY-4.0 — you can redistribute, modify, and build on it commercially or academically as long as you cite the deposit. Code in this repository is GPL-3.0; derivative software must remain open under a compatible license.

Do I need an NCBI API key?

No, not to use MolluscaGenes — searches run entirely against the local database. Only the metadata-build script (build_metadata.py) hits NCBI Entrez, and that runs once at release time. End users never need a key.

How do I cite v0.1?

Cite both the Zenodo deposit (which credits the data) and the bioRxiv preprint (which credits the analysis). Both DOIs will appear on this site once the deposit is published; the machine-readable block is in CITATION.cff.

How do I report a bug or request a species?

Open an issue on the GitHub tracker. Species requests for v1.0 inclusion are especially welcome — list the taxon and any preferred SRA accession(s).

What if a species I care about has 0 sequences in v0.1?

A subset of species in dict2.tsv are placeholders that will be assembled in v1.0. The species browser shows current sequence counts; species with zero counts are listed for transparency but contribute nothing to v0.1 searches.

Why are some Pfam descriptions blank in the HMMs (Browser) page?

A handful of mollusc-revised HMMs target Pfam families whose InterPro entries have no description text yet, or whose API records use a schema we don't fully parse. The HMM itself is fine — the gap is purely in the human-readable annotation.

Get the data

Download & first run.

The database files (~13 GB compressed) are hosted on Zenodo. The concept DOI always resolves to the latest version; the v0.1 DOI permanently pins this release.

Recommended workflow — clone the repo, then fetch & verify with the wrapper:

git clone https://github.com/invertome/molluscagenes
cd molluscagenes
conda env create -f environment.yml
conda activate molluscagenes
bash wrappers/mg_fetch.sh /path/to/storage   # downloads, verifies, writes config.sh
source config.sh
bash wrappers/mg_blast.sh -q my_query.fa -o blast_out -d aa

mg_fetch.sh reads the Zenodo record id from metadata/zenodo_record.txt, downloads every artifact in the manifest, verifies SHA256, extracts the BLAST/HMM tarballs, and writes a populated config.sh. To re-verify an existing download: bash wrappers/verify_download.sh /path/to/storage.
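
The SHA256 verification step can be approximated by hand with coreutils. The sketch below is not the contents of verify_download.sh: it assumes a manifest named SHA256SUMS in the plain "digest, two spaces, file name" format that sha256sum writes, whereas the real manifest name and layout may differ.

```shell
#!/usr/bin/env bash
# Sketch: verify downloaded files against a checksum manifest.
# ASSUMPTION: the manifest is called SHA256SUMS and uses sha256sum's own
# "<hex digest>  <file name>" format, with paths relative to the directory.

verify_sha256() {
  local dir="$1" manifest="${2:-SHA256SUMS}"
  # Run in a subshell so the caller's working directory is left alone.
  # --quiet prints only the files that fail; the exit status is non-zero
  # if any file is missing or does not match its digest.
  ( cd "$dir" && sha256sum --check --quiet "$manifest" )
}
```

Example call: verify_sha256 /path/to/storage && echo "all files verified". On systems without sha256sum (for example stock macOS), shasum -a 256 --check is the usual substitute.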

Cite

If you use MolluscaGenes.

Please cite both the data deposit and the bioRxiv preprint:

Pérez-Moreno JL, Katz PS. MolluscaGenes: a transcriptomic database for the
  Mollusca (v0.1). Zenodo, 2026. https://doi.org/10.5281/zenodo.19825266

Pérez-Moreno JL, Katz PS. MolluscaGenes: A Transcriptomic Database for the
  Mollusca. bioRxiv, 2026. DOI: TBD (preprint pending).

The full machine-readable citation block is in CITATION.cff.