HMM evaluation

1,057 mollusc-optimized Pfam HMMs.
+11.7% detection sensitivity.

We compared 1,057 mollusc-optimized HMMs to the corresponding original Pfam-A 36.0 models on six RefSeq proteomes: five molluscs spanning three classes, plus a brachiopod outgroup. The mollusc-optimized models yield a net gain of 20,496 detections (+11.7%) across the panel, with 25,388 detections found only by the mollusc-optimized models. The largest absolute gain is in cellular machinery (+5,745 detections across 266 domains) and the largest relative gain is in sensory perception (+42.6%).

+11.7% detection gain (175,013 → 195,509)
+25,388 new detections in mollusc-optimized HMMs
+42.6% Sensory perception (largest relative gain)
1,057 HMMs · 6 proteomes · 12 themes

Methods

Test design

We assembled six RefSeq reference proteomes that were not used during HMM revision (formally, a held-out evaluation): five molluscs spanning three classes — Aplysia californica, Lottia gigantea, and Pomacea canaliculata (Gastropoda); Crassostrea gigas (Bivalvia); Octopus bimaculoides (Cephalopoda) — plus one non-mollusc lophotrochozoan outgroup, Lingula anatina (Brachiopoda), included to gauge cross-phylum transfer.

Held-out proteomes ranged from 23,822 (L. gigantea) to 63,341 (C. gigas) protein sequences, totaling roughly 225,000 sequences across the panel.

HMM databases compared

Two databases were compared head-to-head on identical inputs. The baseline is original Pfam-A 36.0, restricted to the same 1,057 accessions selected for mollusc curation. The contender is the mollusc-optimized bundle: 909 of those same 1,057 accessions revised by TIAMMAt against 212 mollusc proteomes from MolluscaGenes v1 (iterative re-alignment of each Pfam family to its high-confidence mollusc hits, rebuilding the profile), plus 148 accessions whose revisions did not meet the per-HMM specificity criteria and use the original Pfam-A 36.0 profile (see HMM specificity QC below).

Both databases were concatenated under POSIX LC_ALL=C order and indexed with hmmpress. Byte-identity of the canonical concat is pinned by SHA-256.

HMM specificity quality control

Each revised HMM is screened for specificity against the six-proteome held-out panel. For each HMM we compute the total detection count, the median E-value of detections, and the fraction of detections at strong significance (E ≤ 10⁻³⁰). The criteria flag revisions whose detections concentrate near the threshold without strong-match support:

  • NOISE: count > 100 with strong-match rate < 5% and median log10(E) > −15.
  • OVERGEN: count > 5× original with strong-match rate < 10%.
  • SPEC_LOSS: original strong-match rate > 30% but the revised rate dropped by more than half.
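The three screening rules can be sketched as a small function. This is an illustrative reading of the criteria, not the bundle's QC script: the names `summarize` and `qc_flags` are hypothetical, the inputs are assumed to be plain lists of detection E-values per HMM, and "dropped by more than half" is assumed to mean a revised strong-match rate below half the original rate.

```python
import math
import sys
from statistics import median

STRONG_E = 1e-30  # "strong significance" cutoff quoted in the text


def summarize(evalues):
    """Return (count, median log10 E, strong-match rate) for one HMM's detections."""
    if not evalues:
        return 0, None, 0.0
    # Clamp so an E-value reported as 0 does not break log10.
    logs = [math.log10(max(e, sys.float_info.min)) for e in evalues]
    strong_rate = sum(e <= STRONG_E for e in evalues) / len(evalues)
    return len(evalues), median(logs), strong_rate


def qc_flags(orig_evalues, rev_evalues):
    """Apply the NOISE / OVERGEN / SPEC_LOSS rules to one original/revised pair."""
    o_n, _, o_strong = summarize(orig_evalues)
    r_n, r_med, r_strong = summarize(rev_evalues)
    flags = []
    if r_n > 100 and r_strong < 0.05 and r_med > -15:
        flags.append("NOISE")
    if r_n > 5 * o_n and r_strong < 0.10:
        flags.append("OVERGEN")
    # Assumption: "dropped by more than half" = revised rate < 0.5 x original rate.
    if o_strong > 0.30 and r_strong < 0.5 * o_strong:
        flags.append("SPEC_LOSS")
    return flags
```

Under this reading, an HMM with any flag would fall back to its original Pfam-A 36.0 profile; whether the actual pipeline treats a single flag as sufficient is not stated in the text.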

HMMs that do not meet the specificity criteria — predominantly coiled-coil-rich structural families (SCP-1, Filament, Tropomyosin, Laminin) and orphan-class GPCR families that admit high-density alignment matches without strong sequence-level homology — retain the original Pfam-A 36.0 profile in the bundle. 909 of 1,057 HMMs (86%) carry the TIAMMAt revision; 148 (14%) use the original Pfam-A 36.0 profile. The full per-HMM specificity report is at evaluation/hmm_specificity_qc.tsv in the bundle.

Curation taxonomy

Each Pfam domain is annotated with a primary subcategory and theme within the 12-theme / 104-subcategory curation schema. Assignments are reviewed against each domain’s Pfam-A 36.0 NAME and DESC fields, with explicit hand-validated overrides for canonical cases (rhodopsin-family → photoreception, voltage-gated channels → VGIC, etc.). The complete taxonomy is at taxonomy/domain_list.tsv in the bundle.

Detection criteria

For each evaluation proteome we ran hmmscan (HMMER 3.4) against both databases with --domE 1e-5 --incdomE 1e-5, capturing per-sequence and per-domain output. A sequence was counted as detected by domain D if any reported domain on that sequence had target D at the cutoff; when several hits to D occur on one sequence, the best E-value is kept.
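A minimal sketch of turning per-domain hmmscan output into a detection set, assuming the standard HMMER3 --domtblout column layout (target HMM name and accession first, query sequence name in the fourth column, full-sequence E-value in the seventh). The function name `read_detections` is hypothetical, and rows are taken as already filtered by hmmscan's own --domE cutoff.

```python
def read_detections(domtblout_path):
    """Map (sequence id, Pfam accession) -> best full-sequence E-value.

    Assumes whitespace-separated hmmscan --domtblout rows; comment lines
    start with '#'. Only the leading columns are read, so spaces in the
    trailing description field are harmless.
    """
    best = {}
    with open(domtblout_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()
            accession = fields[1].split(".")[0]  # PF00612.32 -> PF00612
            seq_id = fields[3]
            full_evalue = float(fields[6])
            key = (seq_id, accession)
            # Several domains of the same family on one sequence count once.
            if key not in best or full_evalue < best[key]:
                best[key] = full_evalue
    return best
```

Running this once per database gives two dictionaries whose key sets can be compared directly: keys present only in the original are the "original only" events, and keys present only in the revised bundle are the "new" detections.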

For specificity analysis we use a stricter threshold to define what we call a “real loss”: an original-only detection counts as a real loss only if its full-sequence E-value in the original DB is at most 10⁻⁸. Anything weaker is treated as marginal evidence and pooled with rejections.

Specificity decomposition

A naive read of the comparison flags 189 domains as worsened (revised < original on raw counts). To separate genuine sensitivity loss from correct cross-reactivity refusal, we partition every original-only sequence into one of four classes:

  1. Same-subcategory reassignment: the sequence still hits another revised domain in the same curation subcategory (typical sister-family signal).
  2. Cross-subcategory reassignment: the sequence hits a revised domain in a different subcategory or theme (the revised HMM has refused a promiscuous match).
  3. Marginal in original: no revised hit, original full-E above 10⁻⁸ (weak signal that the revised DB plausibly excludes).
  4. Real loss: no revised hit, original full-E at or below 10⁻⁸ (the only class that constitutes a defensible sensitivity regression).
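The four-way partition can be written as a short decision function. This is a sketch under stated assumptions: `classify_original_only` and the `subcategory_of` lookup are hypothetical names, `revised_hits` is assumed to hold the accessions the revised database reports on the same sequence, and a same-subcategory hit is assumed to take priority when a sequence has revised hits in several subcategories.

```python
STRONG_FULL_E = 1e-8  # "real loss" threshold on the original full-sequence E-value


def classify_original_only(domain, full_evalue, revised_hits, subcategory_of):
    """Assign one original-only (sequence, domain) event to one of four classes.

    domain         -- Pfam accession detected by the original DB only
    full_evalue    -- its full-sequence E-value in the original DB
    revised_hits   -- set of accessions the revised DB reports on that sequence
    subcategory_of -- dict: accession -> curation subcategory (hypothetical lookup)
    """
    if revised_hits:
        home = subcategory_of[domain]
        # Assumption: a same-subcategory hit wins over any cross-subcategory hit.
        if any(subcategory_of.get(acc) == home for acc in revised_hits):
            return "same_subcategory_reassignment"
        return "cross_subcategory_reassignment"
    if full_evalue > STRONG_FULL_E:
        return "marginal_in_original"
    return "real_loss"
```

Only events returned as `real_loss` would count toward the strong-E original-only totals reported in the results.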

Results — Detection gains

Overall detection totals

Aggregate detection counts across all six evaluation proteomes, summed over the 1,057 Pfam accessions evaluated.

Metric                                                            Count
Original Pfam-A 36.0 detections                                   175,013
Mollusc-optimized bundle (909 revised + 148 original) detections  195,509
Net change                                                        +20,496 (+11.7%)
New (mollusc-optimized only)                                      25,388
Original only                                                     4,892
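The headline figures are tied together by simple arithmetic: the net change equals new detections minus original-only detections, and both databases share the same set of common detections. A quick consistency check, using only the counts in the table above (variable names are illustrative):

```python
original = 175_013       # original Pfam-A 36.0 detections
revised = 195_509        # mollusc-optimized bundle detections
new_only = 25_388        # detected by the mollusc-optimized bundle only
original_only = 4_892    # detected by original Pfam-A 36.0 only

net = revised - original
# Net change must equal gains minus losses.
assert net == new_only - original_only == 20_496

# Detections common to both databases, computed from either side.
shared = original - original_only
assert shared == revised - new_only

pct = 100 * net / original
print(f"net change: {net:+,} ({pct:+.1f}%)")  # net change: +20,496 (+11.7%)
```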

Per-theme summary

Twelve curation themes, each spanning multiple subcategories and Pfam domains. Sorted by % change in detections (descending).

Theme                                     N domains  Original  Revised  New    Net     %
Sensory perception                        44         7,034     10,029   3,103  +2,995  +42.6%
Metabolism and physiology                 60         7,157     8,616    1,749  +1,459  +20.4%
Innate immunity and host defense          70         20,554    24,372   4,102  +3,818  +18.6%
Cellular machinery                        266        42,977    48,722   7,114  +5,745  +13.4%
Neural and synaptic biology               97         10,093    11,429   1,563  +1,336  +13.2%
Endocrine and neuroendocrine              35         4,779     5,247    604    +468    +9.8%
Stem cells regeneration and development   23         12,578    13,732   1,457  +1,154  +9.2%
Mollusc-specific biology                  108        4,557     4,878    492    +321    +7.0%
Epigenetics and RNA biology               56         10,073    10,708   933    +635    +6.3%
Intracellular signaling pathways          145        48,227    51,050   3,819  +2,823  +5.9%
Symbiosis longevity and emerging biology  14         924       962      72     +38     +4.1%
Stress environment and xenobiotics        36         6,060     5,764    380    −296    −4.9%

Note. A net-negative percent change (here only Stress environment and xenobiotics) mostly reflects specificity gains rather than sensitivity regressions: the mollusc-optimized models reject cross-reactive matches that the original Pfam profiles accept. The per-domain specificity decomposition below shows that 178 of 189 net-negative domains have zero strong-E original-only detections.

Top 20 domains by % change

Twenty Pfam accessions with the largest percent change in detections (revised vs original), among domains with at least 50 revised detections.

Acc      Theme                             Subcategory                             Orig   Revised  New    Net     %
PF01108  Innate immunity and host defense  Complement system                       45     257      222    +212    +471.1%
PF02949  Sensory perception                Chemoreception                          12     65       56     +53     +441.7%
PF11701  Mollusc-specific biology          Catch muscle and paramyosin             21     96       76     +75     +357.1%
PF25757  Cellular machinery                Cilia and flagella                      137    585      450    +448    +327.0%
PF06003  Epigenetics and RNA biology       Histone modifiers readers               49     205      156    +156    +318.4%
PF24573  Cellular machinery                Cilia and flagella                      28     116      91     +88     +314.3%
PF25028  Cellular machinery                Extracellular matrix and cell adhesion  93     370      279    +277    +297.8%
PF11618  Cellular machinery                Ubiquitin-proteasome                    32     122      90     +90     +281.2%
PF08123  Epigenetics and RNA biology       Histone modifiers writers               61     228      172    +167    +273.8%
PF15906  Intracellular signaling pathways  Nitric oxide signaling                  36     132      98     +96     +266.7%
PF09272  Innate immunity and host defense  SRCR superfamily                        89     298      210    +209    +234.8%
PF23244  Cellular machinery                Extracellular matrix and cell adhesion  56     161      106    +105    +187.5%
PF16471  Cellular machinery                Apoptosis machinery                     105    289      203    +184    +175.2%
PF03915  Cellular machinery                Extracellular matrix and cell adhesion  58     159      111    +101    +174.1%
PF10320  Sensory perception                Chemoreception                          1,407  3,832    2,425  +2,425  +172.4%
PF05004  Innate immunity and host defense  Cytokine-like                           50     135      99     +85     +170.0%
PF14658  Intracellular signaling pathways  Calcium signaling                       396    1,052    657    +656    +165.7%
PF11834  Neural and synaptic biology       Voltage-gated ion channels              21     54       36     +33     +157.1%
PF23376  Cellular machinery                Extracellular matrix and cell adhesion  190    474      285    +284    +149.5%
PF00735  Cellular machinery                Cell cycle                              347    855      514    +508    +146.4%
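
The ranking rule behind this table (percent change relative to the original count, restricted to domains with at least 50 revised detections) can be sketched as follows. The function name and the input shape are hypothetical, and skipping domains with a zero original count is an assumption, since percent change is undefined there.

```python
def top_by_percent_change(counts, n=20, min_revised=50):
    """Rank domains by percent change in detections.

    counts -- dict: Pfam accession -> (original, revised) detection totals
    Returns up to n tuples (accession, original, revised, percent change),
    largest percent change first.
    """
    rows = []
    for acc, (orig, rev) in counts.items():
        # Assumption: a zero baseline is skipped (percent change undefined).
        if rev < min_revised or orig == 0:
            continue
        rows.append((acc, orig, rev, 100 * (rev - orig) / orig))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:n]
```

For example, with the counts for PF01108 (45 → 257) and PF10320 (1,407 → 3,832) from the table, PF01108 ranks first despite its far smaller absolute gain, which is why the table mixes small and large families.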

Figure 1. Per-domain detection counts in the mollusc-optimized HMMs vs original Pfam. Each point is one of the 954 Pfam domains with at least one detection across the six evaluation proteomes. The X-axis shows detection count from the original Pfam-A 36.0 model (log scale); the Y-axis shows detection count from the corresponding mollusc-optimized model (log scale). Points above the diagonal gained sensitivity; points below lost. Color encodes the curation theme. The above-diagonal band in cellular machinery, immunity, and sensory perception themes drives most of the +11.7% net gain.

Figure 2. Per-theme net detection change as a percentage of the original Pfam baseline. Each bar represents one of the 12 curation themes; bar length encodes the percent change in detections (mollusc-optimized vs. original Pfam-A 36.0 across the six evaluation proteomes). Annotations on each bar show the absolute new-detection count and the number of domains contributing to that theme. Sensory perception shows the largest relative gain (+42.6%); cellular machinery shows the largest absolute net gain (+5,745 detections across 266 domains).

Figure 3. Per-subcategory net detection change in the mollusc-optimized bundle, colored by parent theme. The 86 populated subcategories are shown as horizontal bars sorted by theme (consistent with Figure 2 grouping); bar length encodes percent change in detections; bar color encodes the parent theme. Subcategories with apparent regressions are addressed in the Specificity section below.

Figure 4. Per-domain detection ratio (mollusc-optimized / original) across the six evaluation proteomes for the 30 domains with the largest absolute gains. Columns are species; rows are Pfam domains. Cell color encodes the log2 fold-change in detections (mollusc-optimized vs original); a darker green indicates a larger gain. Domains were ranked by the sum of new detections across all proteomes. The pattern is uniform across mollusc proteomes (Aplysia, Crassostrea, Lottia, Octopus, Pomacea) with attenuated gains on the Lingula brachiopod outgroup, consistent with mollusc-specific tuning.

Figure 5. Detection gain in the brachiopod outgroup (Lingula anatina) compared to the five mollusc proteomes. The X-axis groups domains by curation theme; the Y-axis shows the per-domain net detection change (mollusc-optimized − original). Box plots summarize the distribution within each theme; median lines indicate the typical effect; whiskers extend to 1.5 × IQR. Mollusc-specific biology gains transfer poorly to Lingula (median near zero), while broadly conserved themes (cellular machinery, intracellular signaling) transfer cleanly, confirming the bundle is mollusc-tuned without being mollusc-exclusive.

Specificity decomposition

For the 189 domains where the mollusc-optimized model produces fewer total detections than the original, we partition every original-only sequence event by hmmscan position into four classes against a fixed strong-evidence threshold of 10⁻⁸ in the original full-sequence E-value. This separates true sensitivity reductions from cross-reactivity refusals and near-threshold marginals.

Decomposition of the 2,609 sequences inside the 189 domains that are detected by original Pfam-A 36.0 only. A sequence counts as a “strong-E original-only” reduction only if it has no mollusc-optimized hit and its original full-E is at most 10⁻⁸.

Class                               Count   % of “lost”
Same-subcategory reassignment       321     12.3%
Cross-subcategory reassignment      943     36.1%
Marginal in original (E > 10⁻⁸)     1,183   45.3%
Strong-E original-only (E ≤ 10⁻⁸)   162     6.2%
Total                               2,609   100%

Figure 6. Decomposition of original-only sequences across 189 domains where the mollusc-optimized model produces fewer detections than the original. Stacked bars per domain show the four-class breakdown: same-subcategory reassignment (light green), cross-subcategory reassignment (mint), marginal-E rejection (yellow), and strong-E original-only (red). Domains are sorted by total event count; the top 25 are shown for readability. The red segment, the only class that represents a true sensitivity reduction, is visible in only 11 of the 189 domains. The dominant patterns are marginal-E rejection (yellow) and cross-subcategory reassignment (mint), reflecting that the mollusc-optimized model more often declines cross-reactive or near-threshold matches that the original would have accepted.

Case studies

Sensitivity gains

Domains where mollusc-specific seed expansion captures legitimate molluscan paralogs that the original Pfam-A 36.0 profile misses.

PF10320 — chemoreception (7TM GPCR serpentine chemoreceptor family, Srsx)

PF10320 shows the largest absolute new-detection count in the bundle: 1,407 → 3,832 detections (+2,425 new). The gain is uniform across all five mollusc proteomes and attenuated on the brachiopod outgroup, consistent with capture of mollusc-specific sequence diversity rather than promiscuous matching.

PF13912 — metal stress and cell fate

PF13912 (zinc-finger / cell-fate family) rises from 3,237 to 4,463 detections (+1,226, +37.9%). The gain is consistent across the panel, with strong-E support throughout. It illustrates moderate but coherent gains in metal-stress response domains across the mollusc proteomes, where mollusc-specific paralog expansions had limited representation in the original Pfam-A 36.0 seed.

Apparent reductions reflecting specificity gains

For some domains the revised model produces fewer total detections than the original, but the “lost” sequences are reassigned to a different Pfam family rather than dropped — the revised model is sharper at distinguishing the domain in its canonical context from cross-reactive matches elsewhere.

PF00612 — IQ calmodulin-binding motif

PF00612 shows 646 → 554 detections (−92 net), with 103 original-only (“lost”) sequences. None of the 103 is a strong-E loss: 87 are reassigned to a different subcategory and 16 are marginal-E rejections. The IQ motif is a short (~25 aa) calmodulin-binding signature shared across many calcium-signaling proteins (myosins, IQGAP, neurogranin…); the mollusc-revised model is sharper at calling IQ in its primary calcium-signaling context and refusing cross-reactive matches in unrelated proteins. No strong-evidence signal is dropped.

PF12661 — hEGF (within-family disambiguation)

PF12661 shows 1,673 → 1,655 detections (−18 net). 126 sequences appear to be lost, but 114 of them are reassigned to a sibling EGF-family Pfam (PF00008 EGF, PF07645 EGF_CA, PF09262 EGF_3…). The revised model better separates hEGF from its closely related EGF paralogs rather than rejecting EGF-bearing proteins outright; this is purely within-superfamily disambiguation.

Genuine sensitivity reductions

A small number of domains show strong-E original-only detections the revised model no longer captures. These tend to be families where the original Pfam profile is broadly trained across taxa and mollusc-specific re-training narrows the model in ways that miss canonical-but-divergent homologs.

PF07719 — TPR_2 (tetratricopeptide repeat)

PF07719 has the highest count of strong-E original-only detections in the bundle: 1,218 → 1,056 (−162 net), with 52 strong-E reductions. TPR is a 34-aa α-helical protein-protein interaction motif found in >100 distinct protein families, including cell-cycle regulators, mitochondrial import receptors, Hsp70/Hsp90 cochaperones, and peroxisomal targeting receptors. Sequence conservation within TPR is low while structural conservation is high; mollusc-specific re-training appears to bias the profile toward mollusc-typical TPR variants and miss canonical TPR-bearing proteins from non-mollusc-specific contexts (the chaperone and peroxisomal-import machinery in particular).

PF00578 — AhpC-TSA (peroxiredoxin)

PF00578 shows 222 → 161 detections (−61, with 22 strong-E reductions). Peroxiredoxins are an ancient, ubiquitous antioxidant family with very high structural conservation but high sequence divergence across taxa, and mollusc proteomes contain peroxiredoxin paralogs of likely mitochondrial-symbiont ancestral origin. The revised HMM appears biased toward the cytosolic mollusc-typical isoforms and misses some of these conserved-but-divergent ancestral variants.

Net assessment

  1. The mollusc-optimized HMMs are a drop-in upgrade for mollusc proteome annotation: same 1,057 Pfam accessions, 11.7% more detections, fully compatible with downstream tooling indexed by Pfam-A 36.0 accession.
  2. Per-HMM specificity QC ensures every model in the bundle meets a specificity bar against the six-proteome held-out panel. 178 of 189 domains where total counts decrease have zero strong-E original-only events; 182 have at most five.
  3. The largest absolute gains are in cellular machinery (+5,745), innate immunity (+3,818), and intracellular signaling (+2,823); the largest relative gain is in sensory perception (+42.6%) — families where the original Pfam was least informed by mollusc sequence data.

Limitations

  • The evaluation panel is six proteomes — five molluscs and one brachiopod outgroup. Less-studied lineages (especially Solenogastres and Caudofoveata) are not represented.
  • Mollusc-specific families are necessarily small. Effective sequence counts in some subcategories (radula matrix, some allergens) are below community defaults for Pfam revision; their gains, while large, rest on smaller seed alignments.
  • Detection-count is a proxy for biological relevance. Downstream validation (phylogeny, structure prediction, functional assays) is the user’s responsibility; a domain hit is a starting point, not an end.

Reproducibility

SHA-256 checksums

The canonical concat and its provenance manifests pin to these digests.

mollusca_revised_hmms.hmm:     1818e0d56612b68478e39a6dbc71dcb786b6df4e2ced63c5c17d15b133595d42
IDENTIFICATION_STATISTICS.txt: 335208b8b4e2ff263c36262637b67c1d14e3a0e854aae46854f93fde536a7a2f
SCRIPT_LOG_combined.txt:       8cf01a07113891282ba4bfa8054f52b1f63f5778fdbd8b65cecebb9dbd9270ec

Rebuild the concat from individual HMMs

Reconstruct the byte-identical concatenation from the per-domain HMM files and verify the digest.

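# Set the locale in the shell itself: a per-command "LC_ALL=C cat ..." prefix
# does not change how the shell sorts the *.hmm glob before cat runs.
export LC_ALL=C
cat hmm/per_domain/*.hmm > mollusca_revised_hmms.hmm
hmmpress mollusca_revised_hmms.hmm
sha256sum mollusca_revised_hmms.hmm
# expected: 1818e0d56612b68478e39a6dbc71dcb786b6df4e2ced63c5c17d15b133595d42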
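
Where a shell with a controllable locale is not available, the same rebuild can be sketched in Python. This is an illustrative equivalent, not part of the bundle: the function name `concat_and_digest` is hypothetical, and it assumes the per-domain files sit in a single directory and that the output path lies outside it.

```python
import hashlib
from pathlib import Path


def concat_and_digest(hmm_dir, out_path):
    """Concatenate *.hmm files in byte (C-locale) order; return the SHA-256 hex digest."""
    # Sorting on the encoded file name reproduces LC_ALL=C collation
    # regardless of the caller's locale (uppercase sorts before lowercase).
    paths = sorted(Path(hmm_dir).glob("*.hmm"), key=lambda p: p.name.encode())
    sha = hashlib.sha256()
    with open(out_path, "wb") as out:
        for path in paths:
            data = path.read_bytes()
            out.write(data)
            sha.update(data)
    return sha.hexdigest()
```

The returned digest can then be compared against the expected value listed above; a mismatch most often points to a different file order or to altered line endings in the per-domain files.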

Citation

BibTeX placeholder — DOI / preprint URL will be filled in at release.

@misc{molluscagenes_2026,
  author       = {P{\'e}rez-Moreno, Jorge and Katz, Paul S.},
  title        = {{MolluscaGenes: 1,057 mollusc-optimized Pfam HMMs and evaluation report}},
  year         = {2026},
  note         = {Database resource. URL: \url{https://invertome.github.io/molluscagenes/}},
  howpublished = {\url{https://github.com/invertome/molluscagenes}}
}