HMM evaluation

1,057 mollusc-optimized Pfam HMMs.
+11.7% detection sensitivity.

We compared 1,057 mollusc-optimized HMMs to the corresponding original Pfam-A 36.0 models on six RefSeq proteomes: five molluscs spanning three classes, plus a brachiopod outgroup. The mollusc-optimized models yield a net gain of 20,496 detections (+11.7%) across the panel, with 25,388 new detections made only by the mollusc-optimized bundle. The largest absolute gain is in cellular machinery (+5,745 detections across 266 domains) and the largest relative gain is in sensory perception (+42.6%).

+11.7% detection gain (175,013 → 195,509)
+25,388 new detections in mollusc-optimized HMMs
+42.6% sensory perception (largest relative gain)
1,057 HMMs · 6 proteomes · 12 themes

Methods

Test design

We assembled six RefSeq reference proteomes that were not used during HMM revision (formally, a held-out evaluation): five molluscs spanning three classes — Aplysia californica, Lottia gigantea, and Pomacea canaliculata (Gastropoda); Crassostrea gigas (Bivalvia); Octopus bimaculoides (Cephalopoda) — plus one non-mollusc lophotrochozoan outgroup, Lingula anatina (Brachiopoda), included to gauge cross-phylum transfer.

Held-out proteomes ranged from 23,822 (L. gigantea) to 63,341 (C. gigas) protein sequences, totaling roughly 225,000 sequences across the panel.

HMM databases compared

Two databases were compared head-to-head on identical inputs. The baseline is original Pfam-A 36.0, restricted to the same 1,057 accessions selected for mollusc curation. The contender is the mollusc-optimized bundle: 909 of those same 1,057 accessions revised by TIAMMAt against 212 mollusc proteomes from MolluscaGenes v1 (iterative re-alignment of each Pfam family to its high-confidence mollusc hits, rebuilding the profile), plus 148 accessions whose revisions did not meet the per-HMM specificity criteria and use the original Pfam-A 36.0 profile (see HMM specificity QC below).

Both databases were concatenated in POSIX C-locale (LC_ALL=C) filename order and indexed with hmmpress. The byte-identity of the canonical concatenation is pinned by a SHA-256 digest.

HMM specificity quality control

Each revised HMM is screened for specificity against the six-proteome held-out panel. For each HMM we compute the total detection count, median E-value of detections, and fraction of detections at strong significance (E ≤ 10⁻³⁰). The criteria flag revisions whose detections concentrate near the threshold without strong-match support: count > 100 with strong-match rate < 5% and median log₁₀(E) > −15 (NOISE); count > 5× original with strong-match rate < 10% (OVERGEN); or original strong-match rate > 30% but revised dropped by more than half (SPEC_LOSS).

HMMs that do not meet the specificity criteria — predominantly coiled-coil-rich structural families (SCP-1, Filament, Tropomyosin, Laminin) and orphan-class GPCR families that admit high-density alignment matches without strong sequence-level homology — retain the original Pfam-A 36.0 profile in the bundle. 909 of 1,057 HMMs (86%) carry the TIAMMAt revision; 148 (14%) use the original Pfam-A 36.0 profile. The full per-HMM specificity report is at evaluation/hmm_specificity_qc.tsv in the bundle.
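As a minimal sketch of how the three screening rules above could be applied per HMM, the Python below computes the summary statistics from a list of detection E-values and returns any flags raised. The names (`HmmStats`, `summarize`, `qc_flags`) are hypothetical. Two readings are assumptions rather than statements from this report: that the NOISE statistics are taken over the revised model's detections, and that "dropped by more than half" means the revised strong-match rate is below half the original rate.

```python
import math
from dataclasses import dataclass
from statistics import median

STRONG_E = 1e-30  # strong-significance cutoff quoted in the text


@dataclass
class HmmStats:
    count: int              # total detections across the panel
    median_log10_e: float   # median log10(E-value) of those detections
    strong_rate: float      # fraction of detections with E <= 1e-30


def summarize(evalues):
    """Collapse a list of detection E-values into the three QC statistics."""
    if not evalues:
        return HmmStats(0, float("nan"), 0.0)
    # Clamp at the smallest positive float so E-values reported as 0 stay finite.
    logs = [math.log10(max(e, 5e-324)) for e in evalues]
    strong = sum(e <= STRONG_E for e in evalues)
    return HmmStats(len(evalues), median(logs), strong / len(evalues))


def qc_flags(orig, rev):
    """Return the specificity flags raised by a revised HMM (empty list = pass)."""
    flags = []
    if rev.count > 100 and rev.strong_rate < 0.05 and rev.median_log10_e > -15:
        flags.append("NOISE")
    if rev.count > 5 * orig.count and rev.strong_rate < 0.10:
        flags.append("OVERGEN")
    # Assumed reading of "dropped by more than half".
    if orig.strong_rate > 0.30 and rev.strong_rate < 0.5 * orig.strong_rate:
        flags.append("SPEC_LOSS")
    return flags


# Synthetic E-values for illustration only.
orig = summarize([1e-40] * 40 + [1e-6] * 60)    # 100 hits, 40% strong
rev = summarize([1e-40] * 10 + [1e-6] * 590)    # 600 hits, under 2% strong
print(qc_flags(orig, rev))
```

Under this sketch, an HMM with a non-empty flag list would fall back to its original Pfam-A 36.0 profile, matching the 909 / 148 split described above.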

Curation taxonomy

Each Pfam domain is annotated with a primary subcategory and theme within the 12-theme / 104-subcategory curation schema. Assignments are reviewed against each domain’s Pfam-A 36.0 NAME and DESC fields, with explicit hand-validated overrides for canonical cases (rhodopsin-family → photoreception, voltage-gated channels → VGIC, etc.). The complete taxonomy is at taxonomy/domain_list.tsv in the bundle.

Detection criteria

For each evaluation proteome we ran hmmscan (HMMER 3.4) against both databases with --domE 1e-5 --incdomE 1e-5, capturing per-sequence and per-domain output. A sequence was counted as detected by domain D if at least one reported domain hit on that sequence had target D at the cutoff; when D matched the same sequence more than once, the best E-value was kept.
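The detection rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the report's pipeline: the helper names and output file names are hypothetical, the per-domain table is assumed to be HMMER's `--domtblout` format, and the independent E-value (i-Evalue) is assumed to be the per-domain E-value being thresholded. Pfam accessions are also assumed to carry a version suffix that is stripped before comparison.

```python
def hmmscan_command(hmm_db, proteome_fasta, out_prefix):
    """Build an hmmscan argument list; the output file names are hypothetical."""
    return [
        "hmmscan",
        "--domE", "1e-5", "--incdomE", "1e-5",   # cutoffs stated in the text
        "--tblout", f"{out_prefix}.tbl",          # per-sequence table
        "--domtblout", f"{out_prefix}.domtbl",    # per-domain table
        hmm_db, proteome_fasta,
    ]


def parse_domtblout(lines, max_evalue=1e-5):
    """Return {(sequence_id, pfam_accession): best per-domain E-value}."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 22 whitespace-delimited columns, then a free-text description.
        f = line.split(None, 22)
        accession = f[1].split(".")[0]   # hmmscan: target = HMM; drop the version
        seq_id = f[3]                    # hmmscan: query = protein sequence
        i_evalue = float(f[12])          # independent per-domain E-value
        if i_evalue <= max_evalue:
            key = (seq_id, accession)
            best[key] = min(i_evalue, best.get(key, i_evalue))
    return best


# Synthetic rows for illustration only (not taken from the evaluation output).
demo = [
    "# target name  accession  tlen  query name ...",
    "IQ PF00612.31 21 seqA - 900 1.2e-12 45.0 0.1 1 2 3.0e-09 1.5e-08 30.1 0.0 1 21 100 120 99 121 0.95 IQ motif",
    "IQ PF00612.31 21 seqA - 900 1.2e-12 45.0 0.1 2 2 4.0e-07 2.0e-06 22.4 0.0 1 21 200 220 199 221 0.93 IQ motif",
    "EGF PF00008.31 32 seqA - 900 0.02 10.0 0.3 1 1 0.1 0.5 6.2 0.1 1 32 300 330 299 331 0.80 EGF-like domain",
]
hits = parse_domtblout(demo)
print(hits)
```

Running the same parser over the original and revised outputs gives two key sets; their set differences would then be the "new" and "original-only" detections.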

For specificity analysis we use a stricter threshold to define what we call a “real loss”: an original-only detection counts as a real loss only if its full-sequence E-value in the original DB is at most 10⁻⁸. Anything weaker is treated as marginal evidence and pooled with rejections.

Specificity decomposition

A naive read of the comparison flags 189 domains as worsened (revised < original on raw counts). To separate genuine sensitivity loss from correct cross-reactivity refusal, we partition every original-only sequence into one of four classes:

(1) same-subcategory reassignment — the sequence still hits another revised domain in the same curation subcategory (typical sister-family signal); (2) cross-subcategory reassignment — the sequence hits a revised domain in a different subcategory or theme (the revised HMM has refused a promiscuous match); (3) marginal in original — no revised hit, original full-E above 10⁻⁸ (weak signal that the revised DB plausibly excludes); and (4) real loss — no revised hit, original full-E at or below 10⁻⁸ (the only class that constitutes a defensible sensitivity regression).
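The four classes above can be expressed as a short decision function. The sketch below is hypothetical in its names (`Fate`, `classify_original_only`) and in the subcategory labels used for the demo. It also assumes that class (1) takes precedence when a sequence hits revised domains both inside and outside the original subcategory, which the text does not state explicitly.

```python
from enum import Enum

REAL_LOSS_E = 1e-8  # strong-evidence threshold on the original full-sequence E-value


class Fate(Enum):
    SAME_SUBCATEGORY = 1
    CROSS_SUBCATEGORY = 2
    MARGINAL = 3
    REAL_LOSS = 4


def classify_original_only(orig_domain, orig_full_e, revised_hits, subcategory_of):
    """Classify one sequence that the original DB assigns to orig_domain
    but the revised DB does not.

    revised_hits   -- set of accessions the revised DB detects on the sequence
    subcategory_of -- dict mapping accession -> curation subcategory
    """
    others = revised_hits - {orig_domain}
    home = subcategory_of[orig_domain]
    if any(subcategory_of.get(d) == home for d in others):
        return Fate.SAME_SUBCATEGORY      # assumed to win over class (2)
    if others:
        return Fate.CROSS_SUBCATEGORY
    if orig_full_e > REAL_LOSS_E:
        return Fate.MARGINAL
    return Fate.REAL_LOSS


# Illustrative mapping; the subcategory labels here are placeholders.
subcategory_of = {
    "PF12661": "EGF-like (illustrative)",
    "PF00008": "EGF-like (illustrative)",
    "PF00612": "Calcium signaling",
}
print(classify_original_only("PF12661", 1e-20, {"PF00008"}, subcategory_of))
print(classify_original_only("PF12661", 1e-20, {"PF00612"}, subcategory_of))
print(classify_original_only("PF12661", 1e-6, set(), subcategory_of))
print(classify_original_only("PF12661", 1e-20, set(), subcategory_of))
```

Only the last call lands in the real-loss class: no revised hit and an original full-sequence E-value at or below 10⁻⁸.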

Results — Detection gains

Overall detection totals

Aggregate detection counts across all six evaluation proteomes, summed over the 1,057 Pfam accessions evaluated.

Metric	Count
Original Pfam-A 36.0 detections	175,013
Mollusc-optimized bundle (909 revised + 148 original) detections	195,509
Net change	+20,496 (+11.7%)
New (mollusc-optimized only)	25,388
Original only	4,892
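The rows of this table are tied together by simple set arithmetic, which the following Python check makes explicit. It uses only the counts in the table; the variable names are illustrative.

```python
original, revised = 175_013, 195_509
new_only, original_only = 25_388, 4_892

# Detections shared by both databases, derived two ways.
shared = original - original_only
assert shared == revised - new_only

# Net change is gains minus losses, and equals the difference in totals.
net = revised - original
assert net == new_only - original_only

pct = 100 * net / original
print(f"{net:+,} ({pct:+.1f}%)")   # +20,496 (+11.7%)
```

The two assertions confirm that the table is internally consistent: 170,121 detections are common to both databases, and the net gain is the 25,388 new detections minus the 4,892 original-only ones.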

Per-theme summary

Twelve curation themes, each spanning multiple subcategories and Pfam domains. Sorted by % change in detections (descending).

Theme	N domains	Original	Revised	New	Net	%
Sensory perception	44	7,034	10,029	3,103	+2,995	+42.6%
Metabolism and physiology	60	7,157	8,616	1,749	+1,459	+20.4%
Innate immunity and host defense	70	20,554	24,372	4,102	+3,818	+18.6%
Cellular machinery	266	42,977	48,722	7,114	+5,745	+13.4%
Neural and synaptic biology	97	10,093	11,429	1,563	+1,336	+13.2%
Endocrine and neuroendocrine	35	4,779	5,247	604	+468	+9.8%
Stem cells regeneration and development	23	12,578	13,732	1,457	+1,154	+9.2%
Mollusc-specific biology	108	4,557	4,878	492	+321	+7.0%
Epigenetics and RNA biology	56	10,073	10,708	933	+635	+6.3%
Intracellular signaling pathways	145	48,227	51,050	3,819	+2,823	+5.9%
Symbiosis longevity and emerging biology	14	924	962	72	+38	+4.1%
Stress environment and xenobiotics	36	6,060	5,764	380	−296	−4.9%

Note. The one theme with a net-negative change (Stress environment and xenobiotics) largely reflects specificity gains rather than sensitivity regressions: the mollusc-optimized models reject cross-reactive matches that the original Pfam profiles accept. The per-domain specificity decomposition below shows that 178 of the 189 net-negative domains have zero strong-E original-only detections.

Top 20 domains by % change

Twenty Pfam accessions with the largest percent change in detections (revised vs original), among domains with at least 50 revised detections.

Acc	Theme	Subcategory	Orig	Revised	New	Net	%
PF01108	Innate immunity and host defense	Complement system	45	257	222	+212	+471.1%
PF02949	Sensory perception	Chemoreception	12	65	56	+53	+441.7%
PF11701	Mollusc-specific biology	Catch muscle and paramyosin	21	96	76	+75	+357.1%
PF25757	Cellular machinery	Cilia and flagella	137	585	450	+448	+327.0%
PF06003	Epigenetics and RNA biology	Histone modifiers readers	49	205	156	+156	+318.4%
PF24573	Cellular machinery	Cilia and flagella	28	116	91	+88	+314.3%
PF25028	Cellular machinery	Extracellular matrix and cell adhesion	93	370	279	+277	+297.8%
PF11618	Cellular machinery	Ubiquitin-proteasome	32	122	90	+90	+281.2%
PF08123	Epigenetics and RNA biology	Histone modifiers writers	61	228	172	+167	+273.8%
PF15906	Intracellular signaling pathways	Nitric oxide signaling	36	132	98	+96	+266.7%
PF09272	Innate immunity and host defense	SRCR superfamily	89	298	210	+209	+234.8%
PF23244	Cellular machinery	Extracellular matrix and cell adhesion	56	161	106	+105	+187.5%
PF16471	Cellular machinery	Apoptosis machinery	105	289	203	+184	+175.2%
PF03915	Cellular machinery	Extracellular matrix and cell adhesion	58	159	111	+101	+174.1%
PF10320	Sensory perception	Chemoreception	1,407	3,832	2,425	+2,425	+172.4%
PF05004	Innate immunity and host defense	Cytokine-like	50	135	99	+85	+170.0%
PF14658	Intracellular signaling pathways	Calcium signaling	396	1,052	657	+656	+165.7%
PF11834	Neural and synaptic biology	Voltage-gated ion channels	21	54	36	+33	+157.1%
PF23376	Cellular machinery	Extracellular matrix and cell adhesion	190	474	285	+284	+149.5%
PF00735	Cellular machinery	Cell cycle	347	855	514	+508	+146.4%
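The ranking rule for this table (percent change, restricted to domains with at least 50 revised detections) can be sketched as below. The function name is hypothetical, and skipping domains with zero original detections is an assumption, since percent change is undefined for them and the report does not say how they were handled. The demo rows are taken from the table, plus one made-up low-count row to show the filter.

```python
def top_by_percent_change(rows, min_revised=50, n=20):
    """rows: iterable of (accession, original_count, revised_count).
    Returns up to n tuples (accession, original, revised, percent_change),
    sorted by percent change, descending."""
    ranked = []
    for acc, orig, rev in rows:
        if rev < min_revised or orig == 0:   # orig == 0 handling is an assumption
            continue
        ranked.append((acc, orig, rev, 100 * (rev - orig) / orig))
    ranked.sort(key=lambda r: r[3], reverse=True)
    return ranked[:n]


rows = [
    ("PF00735", 347, 855),
    ("PF01108", 45, 257),
    ("PF11834", 21, 54),
    ("PF02949", 12, 65),
    ("PF_DEMO", 2, 20),      # made-up row: +900% but below the 50-detection floor
]
for acc, orig, rev, pct in top_by_percent_change(rows):
    print(f"{acc}\t{orig}\t{rev}\t{rev - orig:+d}\t{pct:+.1f}%")
```

The floor on revised detections keeps very small families, where a handful of extra hits produces a large percentage, from dominating the ranking.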

Scatter plot of revised versus original Pfam detection counts per domain. — **Figure 1.** Per-domain detection counts in the mollusc-optimized HMMs vs original Pfam. Each point is one of the 954 Pfam domains with at least one detection across the six evaluation proteomes. The X-axis shows detection count from the original Pfam-A 36.0 model (log scale); the Y-axis shows detection count from the corresponding mollusc-optimized model (log scale). Points above the diagonal gained sensitivity; points below lost. Color encodes the curation theme. The above-diagonal band in cellular machinery, immunity, and sensory perception themes drives most of the +11.7% net gain.

Horizontal bar chart of per-theme net detection change as a percentage. — **Figure 2.** Per-theme net detection change as a percentage of the original Pfam baseline. Each bar represents one of the 12 curation themes; bar length encodes the percent change in detections (mollusc-optimized vs. original Pfam-A 36.0 across the six evaluation proteomes). Annotations on each bar show the absolute new-detection count and the number of domains contributing to that theme. Sensory perception shows the largest relative gain (+42.6%); cellular machinery shows the largest absolute net gain (+5,745 detections across 266 domains).

Horizontal bar chart of per-subcategory net detection change, colored by theme. — **Figure 3.** Per-subcategory net detection change in the mollusc-optimized bundle, colored by parent theme. The 86 populated subcategories are shown as horizontal bars sorted by theme (consistent with Figure 2 grouping); bar length encodes percent change in detections; bar color encodes the parent theme. Subcategories with apparent regressions are addressed in the Specificity section below.

Heatmap of detection ratio across proteomes for the top 30 domains by absolute gain. — **Figure 4.** Per-domain detection ratio (mollusc-optimized / original) across the six evaluation proteomes for the 30 domains with the largest absolute gains. Columns are species; rows are Pfam domains. Cell color encodes the log₂ fold-change in detections (mollusc-optimized vs original); a darker green indicates a larger gain. Domains were ranked by the sum of new detections across all proteomes. The pattern is uniform across mollusc proteomes (Aplysia, Crassostrea, Lottia, Octopus, Pomacea) with attenuated gains on the Lingula brachiopod outgroup, consistent with mollusc-specific tuning.

Box plot of per-domain net detection change in the Lingula brachiopod outgroup, grouped by theme. — **Figure 5.** Detection gain in the brachiopod outgroup (Lingula anatina) compared to the five mollusc proteomes. The X-axis groups domains by curation theme; the Y-axis shows the per-domain net detection change (mollusc-optimized − original). Box plots summarize the distribution within each theme; median lines indicate the typical effect; whiskers extend to 1.5 × IQR. Mollusc-specific biology gains transfer poorly to Lingula (median near zero), while broadly conserved themes (cellular machinery, intracellular signaling) transfer cleanly, confirming the bundle is mollusc-tuned without being mollusc-exclusive.

Specificity decomposition

For the 189 domains where the mollusc-optimized model produces fewer total detections than the original, we partition every original-only sequence event by hmmscan position into four classes against a fixed strong-evidence threshold of 10⁻⁸ in the original full-sequence E-value. This separates true sensitivity reductions from cross-reactivity refusals and near-threshold marginals.

Decomposition of the 2,609 sequences inside the 189 domains that are detected by original Pfam-A 36.0 only. A sequence counts as a “strong-E original-only” reduction only if it has no mollusc-optimized hit and its original full-E is at most 10⁻⁸.

Class	Count	% of “lost”
Same-subcategory reassignment	321	12.3%
Cross-subcategory reassignment	943	36.1%
Marginal in original (E > 10⁻⁸)	1,183	45.3%
Strong-E original-only (E ≤ 10⁻⁸)	162	6.2%
Total	2,609	100%
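As a quick check of the shares in this table, the snippet below recomputes each percentage from the class counts. It uses only the numbers shown above; the dictionary keys are shortened labels.

```python
counts = {
    "same-subcategory reassignment": 321,
    "cross-subcategory reassignment": 943,
    "marginal in original": 1_183,
    "strong-E original-only": 162,
}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")

# Fraction of original-only sequences that are NOT strong-E losses.
not_strong = 100 * (total - counts["strong-E original-only"]) / total
print(f"not strong-E: {not_strong:.1f}%")
```

The rounded shares sum to 99.9 rather than 100 because of rounding in the individual rows; the counts themselves sum exactly to 2,609.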

Stacked bar chart decomposing original-only sequences into four classes across the most-affected worsened domains. — **Figure 6.** Decomposition of original-only sequences across 189 domains where the mollusc-optimized model produces fewer detections than the original. Stacked bars per domain show the four-class breakdown: same-subcategory reassignment (light green), cross-subcategory reassignment (mint), marginal-E rejection (yellow), and strong-E original-only (red). Domains are sorted by total event count; the top 25 are shown for readability. The red segment, the only class that represents a true sensitivity reduction, is visible in only 11 of the 189 domains. The dominant patterns are marginal-E rejection (yellow) and cross-subcategory reassignment (mint), reflecting that the mollusc-optimized model more often declines cross-reactive or near-threshold matches that the original would have accepted.

Case studies

Sensitivity gains

Domains where mollusc-specific seed expansion captures legitimate molluscan paralogs that the original Pfam-A 36.0 profile misses.

PF10320 — chemoreception (7TM GPCR chemoreceptor family)

PF10320 shows the largest absolute new-detection count in the bundle: 1,407 → 3,832 detections (+2,425 new). The gain is uniform across all five mollusc proteomes and attenuated on the brachiopod outgroup, consistent with capture of mollusc-specific sequence diversity rather than promiscuous matching.

PF13912 — metal stress and cell fate

PF13912 (zinc-finger / cell-fate family) rises from 3,237 to 4,463 detections (+1,226, +37.9%). The gain is consistent across the panel, with strong-E support throughout. It illustrates the marginal-but-coherent gains seen in metal-stress response domains across the mollusc proteomes, where mollusc-specific paralog expansions had limited representation in the original Pfam-A 36.0 seed.

Apparent reductions reflecting specificity gains

For some domains the revised model produces fewer total detections than the original, but the “lost” sequences are reassigned to a different Pfam family rather than dropped — the revised model is sharper at distinguishing the domain in its canonical context from cross-reactive matches elsewhere.

PF00612 — IQ calmodulin-binding motif

PF00612 shows 646 → 554 detections (−92 net), with 103 original-only sequences. None of the 103 is a strong-E loss: 87 are reassigned to a different subcategory and 16 are marginal-E rejections. The IQ motif is a short (~25 aa) calmodulin-binding signature shared across many calcium-signaling proteins (myosins, IQGAP, neurogranin…); the mollusc-revised model is sharper at calling IQ in its primary calcium-signaling context and refusing cross-reactive matches in unrelated proteins. No strong-evidence detection is dropped.

PF12661 — hEGF (within-family disambiguation)

PF12661 shows 1,673 → 1,655 detections (−18 net). 126 sequences appear to be lost, but 114 of them are reassigned to a sibling EGF-family Pfam (PF00008 EGF, PF07645 EGF_CA, PF09262 EGF_3…). The revised model better separates hEGF from its closely-related EGF paralogs rather than rejecting EGF-bearing proteins outright — purely within-superfamily disambiguation.

Genuine sensitivity reductions

A small number of domains show strong-E original-only detections the revised model no longer captures. These tend to be families where the original Pfam profile is broadly trained across taxa and mollusc-specific re-training narrows the model in ways that miss canonical-but-divergent homologs.

PF07719 — TPR_2 (tetratricopeptide repeat)

PF07719 has the highest count of strong-E original-only detections in the bundle: 1,218 → 1,056 (−162 net), with 52 strong-E reductions. TPR is a 34-aa α-helical protein-protein interaction motif found in >100 distinct protein families, including cell-cycle regulators, mitochondrial import receptors, Hsp70/Hsp90 cochaperones, and peroxisomal targeting receptors. Sequence conservation within TPR is low while structural conservation is high; mollusc-specific re-training appears to bias the profile toward mollusc-typical TPR variants and miss canonical TPR-bearing proteins from non-mollusc-specific contexts (the chaperone and peroxisomal-import machinery in particular).

PF00578 — AhpC-TSA (peroxiredoxin)

PF00578 shows 222 → 161 detections (−61, with 22 strong-E reductions). Peroxiredoxins are an ancient, ubiquitous antioxidant family with very high structural conservation but high sequence divergence across taxa, and mollusc proteomes contain peroxiredoxin paralogs of likely mitochondrial-symbiont ancestral origin. The revised HMM appears biased toward the cytosolic mollusc-typical isoforms and misses some of these conserved-but-divergent ancestral variants.

Net assessment

The mollusc-optimized HMMs are a drop-in upgrade for mollusc proteome annotation: the same 1,057 Pfam accessions, 11.7% more detections on the evaluation panel, and full compatibility with downstream tooling indexed by Pfam-A 36.0 accession.
Per-HMM specificity QC ensures every model in the bundle meets a specificity bar against the six-proteome held-out panel. 178 of 189 domains where total counts decrease have zero strong-E original-only events; 182 have at most five.
The largest absolute gains are in cellular machinery (+5,745), innate immunity (+3,818), and sensory perception (+2,995), with intracellular signaling close behind (+2,823); sensory perception also shows the largest relative gain (+42.6%). These are families where the original Pfam was least informed by mollusc sequence data.

Limitations

The evaluation panel is six proteomes: five molluscs from three classes (Gastropoda, Bivalvia, Cephalopoda) and one brachiopod outgroup. The remaining molluscan classes, including less-studied lineages such as Solenogastres and Caudofoveata, are not represented.
Mollusc-specific families are necessarily small. Effective sequence counts in some subcategories (radula matrix, some allergens) are below community defaults for Pfam revision; their gains, while large, rest on smaller seed alignments.
Detection-count is a proxy for biological relevance. Downstream validation (phylogeny, structure prediction, functional assays) is the user’s responsibility; a domain hit is a starting point, not an end.

Reproducibility

SHA-256 checksums

The canonical concatenation and its provenance manifests are pinned to these digests.

mollusca_revised_hmms.hmm:     1818e0d56612b68478e39a6dbc71dcb786b6df4e2ced63c5c17d15b133595d42
IDENTIFICATION_STATISTICS.txt: 335208b8b4e2ff263c36262637b67c1d14e3a0e854aae46854f93fde536a7a2f
SCRIPT_LOG_combined.txt:       8cf01a07113891282ba4bfa8054f52b1f63f5778fdbd8b65cecebb9dbd9270ec

Rebuild the concat from individual HMMs

Reconstruct the byte-identical concatenation from the per-domain HMM files and verify the digest.

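# Set the locale in the shell itself: the glob is expanded by the shell before
# cat runs, so a per-command prefix (LC_ALL=C cat ...) does not control its order.
export LC_ALL=C
cat hmm/per_domain/*.hmm > mollusca_revised_hmms.hmm
hmmpress mollusca_revised_hmms.hmm
sha256sum mollusca_revised_hmms.hmm
# expected: 1818e0d56612b68478e39a6dbc71dcb786b6df4e2ced63c5c17d15b133595d42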
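For environments where shell locale handling is hard to control, the same rebuild-and-verify step can be sketched in Python. This is an illustration, not part of the bundle: the function name is hypothetical, and it assumes that sorting file names by their encoded bytes reproduces the C-locale order used for the canonical concatenation.

```python
import hashlib
from pathlib import Path

EXPECTED = "1818e0d56612b68478e39a6dbc71dcb786b6df4e2ced63c5c17d15b133595d42"


def concat_sha256(hmm_dir, out_path):
    """Concatenate *.hmm files in byte order of their names; return the SHA-256.

    out_path should live outside hmm_dir so it is never picked up by the glob.
    """
    files = sorted(Path(hmm_dir).glob("*.hmm"), key=lambda p: p.name.encode())
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for f in files:
            data = f.read_bytes()
            out.write(data)
            digest.update(data)
    return digest.hexdigest()


# Usage, with the paths from the shell commands above:
# got = concat_sha256("hmm/per_domain", "mollusca_revised_hmms.hmm")
# print("OK" if got == EXPECTED else f"MISMATCH: {got}")
```

Sorting on the encoded file name makes the order independent of the caller's locale; hmmpress would still be run on the output file afterwards.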

Citation

BibTeX placeholder — DOI / preprint URL will be filled in at release.

@misc{molluscagenes_2026,
  author       = {P{\'e}rez-Moreno, Jorge and Katz, Paul S.},
  title        = {{MolluscaGenes: 1,057 mollusc-optimized Pfam HMMs and evaluation report}},
  year         = {2026},
  note         = {Database resource. URL: \url{https://invertome.github.io/molluscagenes/}},
  howpublished = {\url{https://github.com/invertome/molluscagenes}}
}
