About WntHub

Interactive scientific database for Wnt signaling pathway research. Ungureanu Lab · University of Oulu

Overview

WntHub is a panel-based interactive platform for exploring Wnt signaling pathway genes across genomic, transcriptomic, epigenomic, proteomic, and clinical dimensions. It integrates data from GTEx, Human Protein Atlas, TCGA, CELLxGENE, Tahoe-100M, ENCODE, ChIP-Atlas, LINCS L1000, and more into a unified sidebar + tabbed interface with interactive Plotly.js charts, an IGV.js genome browser, D3 force-directed co-expression networks, and an AI-powered research assistant grounded in 23K+ Wnt publications.

The platform covers 91 Wnt-pathway genes, organised into eight functional categories: ligands (all 19 WNTs), Frizzled receptors (FZD1–10), co-receptors (LRP5/6) and non-canonical receptors (ROR1/2, RYK), the R-spondin axis (RSPO3, LGR4/5/6, RNF43, ZNRF3), secreted antagonists (DKK1–4, SFRP1/2/4/5, FRZB, WIF1, SOST, KREMEN1/2, NOTUM) and secretion machinery (PORCN, WLS), the destruction complex (APC, AXIN1/2, GSK3B, CSNK1A1) with cytoplasmic transducers (CTNNB1, DVL1–3, CSNK1E) and feedback inhibitors (NKD1/2, FRAT1/2), the planar-cell-polarity / Rho axis (VANGL1/2, PRICKLE1/2, CELSR1–3, DAAM1/2, RHOA, RAC1, CDC42), and the transcription layer (TCF7, TCF7L1/2, LEF1, NLK, TLE1, CTNNBIP1) with canonical target genes (MYC, CCND1).

Data Sections

Gene Summary tab: Gene Summary

Gene Identity Card (full name, genomic coordinates, aliases, NCBI summary, cross-reference IDs: Entrez, HGNC, Ensembl, UniProt, RefSeq, Pfam, PDB) plus AI-generated summaries across 5 dimensions: Genomic, Expression, Pathway, Functional, Clinical.

Good for: Quick overview of any gene's role, function, disease associations, expression patterns, and database identifiers.

Genomic Context tab: Genomic

IGV.js genome browser (hg38) with an expandable track catalog:

Good for: Gene structure, exon/intron layout, clinical variants, regulatory landscape, TF binding, epigenomics, conservation.

RNA-Seq tab: RNA-Seq

Kitchen-sink Plotly.js charts across 6 data sources:

Good for: Tissue expression patterns, cancer expression, cell line data, survival analysis, immunotherapy survival, cross-database comparison, multi-gene analysis.

Proteomics tab: Proteomics

Protein-level data across 5 sources with interactive 3D structure viewer:

Good for: Protein expression across tissues and cancers, post-translational modifications, kinase-substrate relationships, 3D structure visualization with PTM overlays.

scRNA-Seq tab: scRNA-Seq

Good for: Cell-type-specific expression, which cell types express a gene, tumor microenvironment expression.

Correlation Analysis tab: Correlation

Good for: Co-expression partners, tissue-specific networks, pathway enrichment, functional associations.

Perturbations tab: Perturbations

Two complementary perturbation datasets accessible via sub-tab toggle, identifying drugs and genetic manipulations that significantly alter expression of Wnt pathway genes.

LINCS L1000 — Compound Perturbations

LINCS L1000 — Genetic Perturbations (CRISPR, shRNA, Overexpression)

72 of 91 genes are detected as L1000 readout targets (View 1) and 72 of 91 are used as L1000 perturbagens (View 2). Not detected as readout targets: AXIN2, KREMEN1, LGR6, NKD1, NKD2, NOTUM, PRICKLE1, PRICKLE2, RSPO3, SFRP2, SOST, VANGL2, WNT10A, WNT3A, WNT7B, WNT8A, WNT9A, WNT9B, ZNRF3.

Tahoe-100M — Single-Cell Perturbations

Good for: Drug discovery, target validation, single-cell perturbation responses, identifying compounds that modulate Wnt gene expression, cross-referencing bulk (LINCS) and single-cell (Tahoe) evidence.

Cross References tab: Cross Refs

Dynamic links to 30+ external databases (GeneCards, UniProt, NCBI, Ensembl, KEGG, Reactome, ClinVar, OMIM, etc.).

Good for: Finding a gene in other databases, jumping to external resources.

AI Research Assistant

RAG-powered chatbot grounded in 23,323 Wnt pathway publications (31,773 text chunks). Accessible via the floating widget from any tab. Supports multi-turn conversation with history, markdown rendering, and adjustable temperature.

Architecture

Data Integration

The AI assistant can optionally include structured experimental data from the site alongside literature context. Selectable data sections:

Changelog

May 11, 2026 — wnthub.org Launch & Engine Polishing
  • Custom Domain Live: WntHub now serves at https://wnthub.org. Apex A record → Netlify load balancer + www CNAME, Let’s Encrypt cert auto-provisioned, Force HTTPS enabled. The auto-generated aesthetic-cascaron-f8669f.netlify.app URL keeps working — Netlify auto-301-redirects it to wnthub.org — so any grant-application links already circulating remain valid. Pre-submission inquiry letter (NAR Database Issue) updated to the live URL.
  • Site-Neutral Engine Identifiers: Two-step engine refactor moved every WntHub-specific identifier (WNT_GENES, WNT_SET, WNT_SET_HASH, _wntSetHash, wnt_gene/wnt_adj/wnt_edges/wnt_df local variables, load_wnt_expression(), etc.) into neutral names (GENES, GENE_SET, GENE_SET_HASH, _geneSetHash, pathway_gene, gene_adj, ...) across 27 engine scripts. WNT_GENES kept as a one-line alias so site-specific code (the AI-summary toolchain) keeps working unchanged. Future cherry-picks between WntHub and CABase now auto-merge instead of needing manual conflict resolution.
  • Site-Config-Driven Identity: scripts/pipelines/config.py now reads site-config.json at import and re-exports SITE_NAME, GENE_SET_LABEL, GENE_SET_FULL, DATA_SUFFIX — engine code never needs the literal “WntHub” / “Wnt” / “Wnts” strings hardcoded.
  • Automated Site Stats: Two new engine scripts (scripts/build_site_stats.py + scripts/render_site_text.py) compute every data-derived number on the site (gene count, GTEx subtissues, TCGA KM curves, PRECOG records, iPTMnet sites, CELLxGENE counts, LINCS records, Tahoe drug-perturbation counts, RAG corpus size, ...) and substitute them into HTML/JS via declarative <span data-stat="key"> markers. Wired into master_rebuild_all.sh as Step 18, after data pipelines and before the _site/ rsync. Manual overrides via site_stats_overrides.json for values that can’t be derived from on-disk data (e.g. RAG corpus counts that live in the Neon Postgres DB). Eliminates the stale-number drift that has historically dogged about/index page edits.
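The data-stat substitution described in the Automated Site Stats item can be pictured with a short sketch. This is not the code in scripts/render_site_text.py; the function name render_stats, the regular expression, and the stat keys are assumptions made for illustration, and the sketch only handles simple, non-nested spans.

```python
import re
from typing import Dict, Optional

# Matches <span data-stat="key">old value</span>.
# Assumption: markers are simple spans with no nested tags or extra attributes.
MARKER = re.compile(r'(<span\s+data-stat="([^"]+)"\s*>)(.*?)(</span>)', re.DOTALL)


def render_stats(html: str,
                 stats: Dict[str, str],
                 overrides: Optional[Dict[str, str]] = None) -> str:
    """Replace the text inside each data-stat span with the current value.

    `overrides` plays the role of site_stats_overrides.json: values that
    cannot be derived from on-disk data win over computed ones.
    """
    merged = {**stats, **(overrides or {})}

    def _substitute(match):
        key = match.group(2)
        if key not in merged:
            # Unknown key: leave the marker untouched rather than guess.
            return match.group(0)
        return f"{match.group(1)}{merged[key]}{match.group(4)}"

    return MARKER.sub(_substitute, html)


if __name__ == "__main__":
    page = 'Covers <span data-stat="gene_count">46</span> genes.'
    print(render_stats(page, {"gene_count": "91"}))
```

Keeping the span in place after substitution means the same page can be re-rendered on every rebuild, which is what removes the stale-number drift.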
May 6, 2026 — Pathway Coverage Doubled (46 → 91 genes)
  • Gene Set Expanded: The curated Wnt-pathway set grew from 46 to 91 genes, organised into eight functional categories. Selection driven by recurrence across 70 MSigDB Wnt-related pathway sets. New additions span the secreted-antagonist arm (DKK1–4, SFRP1/2/4/5, FRZB, WIF1, SOST, KREMEN1/2), the R-spondin / RNF43 / LGR stem-cell axis (RSPO3, LGR4/5/6, RNF43, ZNRF3), Wnt secretion machinery (PORCN, WLS, NOTUM), CK1 destruction-complex isoforms (CSNK1A1, CSNK1E), non-canonical receptors (ROR1/2, RYK), planar-cell-polarity core (VANGL1/2, PRICKLE1/2, CELSR1–3, DAAM1/2), Rho GTPases (RHOA, RAC1, CDC42), and transcription-layer regulators (NLK, TLE1, CTNNBIP1). WNT11, previously dropped due to a hand-transcription error, is now included — bringing all 19 human WNT ligands into the set.
  • Pipelines Re-run End-to-End: All 18 ingestion steps in scripts/pipelines/master_rebuild_all.sh were executed against the expanded set with cache-aware skip logic, so existing per-gene API calls (ChIP-Atlas, iPTMnet, ProteomicsDB) and per-gene LINCS GCTX extractions were re-used; only the 45 new genes hit the external services. Co-expression network JSONs now embed a _wntSetHash fingerprint and self-invalidate when the gene set changes.
  • Downstream Counts Refreshed: ClinVar 84,249 variants across 91 genes; ChIP-Atlas TF universe 1,086 → 1,328 chromatin-associated factors; TCGA Kaplan-Meier curves 1,451 → 2,898; PRECOG records ~5K → ~9.4K (adult + ICI + pediatric combined); LINCS L1000 genetic 45,868 → 103,206 records (View 1) and 59,533 → 97,680 (View 2); Tahoe-100M perturbations 217,915 → 458,122 records; iPTMnet 1,118 → 1,875 PTM sites; co-expression network files 2,144 → 4,269.
  • Engineering Hardening: Three drifted hand-typed gene lists (in build_survival_json.py, build_network_json.py, gtex/06_build_site_correlations.py) consolidated to import from a single source (scripts/pipelines/config.py); the duplicate gtex/04_network_json.py was reduced to a deprecation stub. LINCS pipelines gained per-gene skip-if-exists logic with --force override, with cached per-gene files folded into the combined output so subsequent runs touch only new genes.
  • HPA Source-Schema Fixes (caught en route): HPA renamed the FANTOM5 sTPM column case (“Scaled Tags Per Million” → “Scaled tags per million”) and retired the rna_cancer.tsv.zip endpoint; pipeline definitions in hpa/01_retrieve_and_filter.py updated accordingly. The DepMap portal currently returns HTTP 403 for public file downloads; CCLE expression now resolves through a local snapshot symlinked into data/_raw/DepMap/.
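The self-invalidating _wntSetHash fingerprint mentioned in the May 6 entry could work along these lines. This is a sketch under assumptions: the hashing scheme (SHA-256 over a sorted, comma-joined gene list, truncated to 12 hex characters) and the helper names gene_set_hash and is_stale are hypothetical. Only the _wntSetHash field name comes from the changelog.

```python
import hashlib


def gene_set_hash(genes):
    """Order-insensitive fingerprint of a gene list.

    Assumption: symbols are upper-cased, de-duplicated and sorted before
    hashing, so reordering the list does not change the fingerprint.
    """
    canonical = ",".join(sorted({g.upper() for g in genes}))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def is_stale(network_json, genes):
    """True when a cached network file was built for a different gene set."""
    return network_json.get("_wntSetHash") != gene_set_hash(genes)


if __name__ == "__main__":
    old_set = ["WNT1", "WNT3A", "CTNNB1"]
    cached = {"_wntSetHash": gene_set_hash(old_set), "nodes": []}
    print(is_stale(cached, old_set))              # same set: cache is reused
    print(is_stale(cached, old_set + ["WNT11"]))  # set grew: cache is rebuilt
```

Comparing a stored fingerprint is cheaper than diffing gene lists file by file, and a file written before the fingerprint existed is treated as stale because the key is missing.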
April 14, 2026 — GTEx Direct Build & Documentation Refresh
  • GTEx Source Unified: The RNA-Seq tab's GTEx panels (bar + 4 radar charts) now consume the same in-house GTEx V10 build that powers the Correlation/Network analyses. The HPA-redistributed GTEx nTPM file is no longer used anywhere on the site. Pipeline: scripts/pipelines/gtex/08_build_true_gtex_expression.py (GTEx V10 parquet → per-subtissue median TPM → nTPM, ≥20 samples/subtissue, 55 subtissues)
  • Manuscript Data Table Refresh: Updated docs/manuscript-data-table.2026-04-14.xlsx supersedes the 2026-03-23 version. GTEx row reclassified as direct/in-house; added rows for Tahoe-100M perturbations, ENCODE/UCSC regulatory tracks, and HPA subcellular localization; LINCS L1000 split into three rows (compounds, "what affects gene", "what gene affects"); LINCS compound count corrected to 12,735 filtered uniques
  • New Provenance Document: docs/data-generation-report.md — per-plot/per-table provenance covering every panel in every sidebar tab, from upstream database through pipeline script to rendered value
  • About Page Counts Corrected: ChIP-Atlas factors 1,086 → 1,846; Tahoe scRNA-Seq "116M cells" → "2.3M DMSO cells (77M total)"; CELLxGENE "38 tissues · 187 cell types" → actual 62 tissues · 806 cell types; co-expression network files 2,088 → 2,144
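The per-subtissue median step of the GTEx build (≥20 samples per subtissue) can be sketched as below. This is an illustration, not the code in 08_build_true_gtex_expression.py: the function name and the (subtissue, TPM) input shape are assumptions, and the later TPM → nTPM normalization is left out because the changelog does not describe it.

```python
from collections import defaultdict
from statistics import median

MIN_SAMPLES = 20  # minimum samples per subtissue, as stated in the changelog


def subtissue_medians(samples, min_samples=MIN_SAMPLES):
    """Median TPM per subtissue for one gene.

    `samples` is an iterable of (subtissue, tpm) pairs (hypothetical shape).
    Subtissues with fewer than `min_samples` samples are dropped.
    """
    by_subtissue = defaultdict(list)
    for subtissue, tpm in samples:
        by_subtissue[subtissue].append(tpm)
    return {
        name: median(values)
        for name, values in by_subtissue.items()
        if len(values) >= min_samples
    }


if __name__ == "__main__":
    # Toy data: 20 liver samples pass the filter, 5 samples of another tissue do not.
    toy = [("Liver", float(i)) for i in range(1, 21)]
    toy += [("Rare tissue", 3.0)] * 5
    print(subtissue_medians(toy))
```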
March 24, 2026 — Tahoe-100M Perturbations & iPTMnet Improvements
  • Tahoe-100M Perturbations: New sub-tab on the Perturbations page integrating single-cell RNA-seq drug response data from the Tahoe-100M dataset (~77M cells, 379 drugs, 50 cell lines, 14 plates). Pseudobulk replicate approach: cells split into 25- or 50-cell replicates, Wilcoxon rank-sum test vs matched DMSO replicates, BH FDR correction. 217,915 records across 32 genes (84.3% BH-significant). Waterfall chart, MOA enrichment, and filterable table with p-values, replicate counts, and confidence tiers
  • Info Tooltips: Added contextual info icons across all 10 site-wide data tables explaining statistical metrics, column meanings, and data sources. Floating tooltip design escapes all CSS stacking contexts
  • Tab Persistence Fix: Fixed Plotly charts going blank when navigating away from a tab and returning. Charts now re-render from cached data on tab switch
  • iPTMnet Plot: Y-axis now uses iPTMnet score (instead of known enzymes), circle size represents number of enzymes. Hover tooltip includes enzyme list and publication count
  • iPTMnet Table: Rebuilt with TableViewer for sorting, filtering, and export. Added Score and Position columns, enzyme names link to UniProt, dropdown filters on Type/Score/Known/Evidence
  • Mol* Viewer Fix: Fixed issue where switching from PTM overlays to structure property themes would fail. Viewer now reloads cleanly when transitioning between overlay categories
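Two steps of the Tahoe-100M pseudobulk approach described in the March 24 entry, splitting cells into fixed-size replicates and applying Benjamini-Hochberg correction, can be sketched in plain Python. The function names are hypothetical, and dropping an incomplete trailing replicate is an assumption. The rank-sum test itself is omitted here; a statistics library such as SciPy provides one.

```python
def split_into_replicates(cells, size):
    """Chunk per-cell values into replicates of exactly `size` cells.

    Assumption: a trailing chunk smaller than `size` is discarded.
    """
    return [cells[i:i + size] for i in range(0, len(cells) - size + 1, size)]


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    replicates = split_into_replicates(list(range(60)), 25)
    print(len(replicates))  # number of full 25-cell replicates
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A record would then count as BH-significant when its adjusted p-value falls below the chosen FDR threshold. The changelog does not state which threshold the site uses.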
March 22, 2026 — Perturbations Tab & Proteomics Enhancements
  • Perturbations Tab: New tab integrating LINCS L1000 compound perturbation data (720K experiments, 33K compounds, 230 cell lines). Waterfall chart of top activators/repressors, MOA enrichment analysis, and full filterable/exportable table with CLUE.io compound links. 39 of 46 genes measured
  • Genetic Perturbations: LINCS L1000 CRISPR (142K), shRNA (238K), and overexpression (34K) data in two views: (1) genetic perturbations affecting each Wnt gene (45,868 records, 39 genes), (2) downstream effects split into Knockout (CRISPR + shRNA) and Overexpression sections (59,533 records, 37 genes). Waterfall charts + filterable tables for all sections
  • PRECOG Survival Analysis: PRECOG v2 survival z-scores added to RNA-Seq tab. Three databases: Adult (46/46 genes, 51 cancers, ~28K patients), Pediatric (44/46 genes, 12 cancers, ~3K patients), ICI immunotherapy (46/46 genes, 20 cancers, ~4K patients). Waterfall charts + filterable tables. ICI table enriched with ICI target, tumor stage, treatment status, cohort size, outcome type, and study source. Pipeline: scripts/pipelines/precog/01_extract_survival_zscores.py
  • LINCS Pipelines: Compound pipeline (scripts/pipelines/lincs/01_extract_perturbations.py) extracts per-gene perturbation data at |modz| ≥ 3.0 from Level 5 GCTX files (401K perturbations, 12,735 compounds, 702 MOA classes). Genetic pipeline (scripts/pipelines/lincs/02_extract_genetic_perturbations.py) extracts CRISPR/shRNA/overexpression data with the same threshold
  • PTM Table: PMID counts replaced with clickable PubMed links (collapsible when >3). Table card scrollable with sticky header
  • Structure Viewer: Overlay selection persists across gene changes. New built-in Mol* overlays: secondary structure, hydrophobicity, residue type, sequence position, B-factor/pLDDT. Fullscreen now fills the viewport
  • Dev Workflow: Added [dev] publish = "." to netlify.toml — local dev serves from project root, no rsync needed. Production deploys via netlify deploy --prod (build is automatic)
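The |modz| ≥ 3.0 cutoff used by both LINCS pipelines amounts to a two-sided filter on the moderated z-score. The sketch below assumes records are dictionaries with hypothetical "compound" and "modz" keys. It does not read GCTX files and is not the code in 01_extract_perturbations.py.

```python
MODZ_THRESHOLD = 3.0  # cutoff stated in the changelog


def significant_perturbations(records, threshold=MODZ_THRESHOLD):
    """Keep records whose |modz| meets the threshold, strongest activators first.

    Sorting by signed modz (descending) gives the ordering a waterfall chart
    needs: activators at one end, repressors at the other.
    """
    hits = [r for r in records if abs(r["modz"]) >= threshold]
    return sorted(hits, key=lambda r: r["modz"], reverse=True)


if __name__ == "__main__":
    # Hypothetical compound names and scores, for illustration only.
    toy = [
        {"compound": "cmpd_A", "modz": 4.2},
        {"compound": "cmpd_B", "modz": -3.0},
        {"compound": "cmpd_C", "modz": 1.1},
    ]
    for row in significant_perturbations(toy):
        print(row["compound"], row["modz"])
```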
March 21, 2026 — Proteomics Tab & AI Query Classifier
  • Proteomics Tab: New dedicated tab with protein-level data from 5 sources — HPA IHC (normal tissue + cancer), CPTAC mass-spec proteomics (11 cancers), ProteomicsDB (67 tissues), iPTMnet PTM sites (1,118 sites with kinase-substrate relationships)
  • 3D Structure Viewer: PDBe Mol* integration — interactive PDB and AlphaFold structures with PTM overlays (phosphorylation, ubiquitination). Structure descriptions fetched from RCSB PDB and AlphaFold APIs. 26 genes with PDB structures, all genes with AlphaFold
  • AI Query Classifier: Qwen3-14B pre-classifies queries as site/hybrid/science. Site questions ("which species?", "where is variant data?") skip RAG entirely and answer in ~1-2s. Science questions proceed through full RAG pipeline with zero overhead
  • Data Pipelines: New pipelines for HPA protein IHC, CPTAC, ProteomicsDB (OData API), and iPTMnet (REST API). All follow existing retrieve-filter-compress pattern
  • Gene Identity Card: New card at top of Gene Summary tab with full name, genomic coordinates, aliases, NCBI summary, and cross-reference IDs (Entrez, HGNC, Ensembl, UniProt, RefSeq, Pfam, PDB) with clickable links
  • Tab Rename & Reorder: Expression → RNA-Seq, Single Cell → scRNA-Seq. New order: Gene Summary → Genomic → RNA-Seq → Proteomics → scRNA-Seq → Correlation → Cross Refs
  • Sidebar Fixes: Ensembl ID parsing fixed for LRP6, TCF7L1, WNT3, WNT9B (nested JSON fallback). Ensembl ID text removed from sidebar info card. WntHub logo links to Overview tab. University of Oulu logo enlarged
March 20, 2026 — AI Data Integration & TCGA Restructure
  • AI — GTEx Correlations: New data section sends top co-expressed genes to AI. Automatic pairwise lookup when multiple genes queried. Tissue-aware (uses specific tissue or ALL_SAMPLES)
  • AI — TCGA Expression: New data section sends tumor expression stats (median, IQR, n) across 33 TCGA cancer types. Smart cancer detection from natural language ("liver cancer" → LIHC)
  • AI — Conversation: Multi-turn chat with history. Gene detection from conversation context. Temperature slider. Markdown rendering. Inline PMID citations. New Chat button
  • AI — Model: Upgraded to Qwen3-Next-80B MoE (fast inference with reliable instruction following)
  • Data: TCGA data reorganized under data/TCGA/ (survival + expression). All paths and pipelines updated
  • UI: AI tab removed (floating widget only). Hero page updated. About page restyled. Chatbot colors matched to site palette
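The cancer detection mentioned in the TCGA Expression item ("liver cancer" → LIHC) could be approximated with a keyword lookup. This is only a guess at the general idea, not the assistant's actual method. The TCGA study codes are real, but the alias table is a small illustrative subset and the function name is hypothetical.

```python
import re

# Illustrative subset of the 33 TCGA cancer types; the keyword list is assumed.
CANCER_ALIASES = {
    "liver": "LIHC",
    "hepatocellular": "LIHC",
    "breast": "BRCA",
    "colon": "COAD",
    "pancreatic": "PAAD",
    "prostate": "PRAD",
}


def detect_cancer_codes(question):
    """Return TCGA codes mentioned in a question, in order of first appearance."""
    codes = []
    for word in re.findall(r"[a-z]+", question.lower()):
        code = CANCER_ALIASES.get(word)
        if code and code not in codes:
            codes.append(code)
    return codes


if __name__ == "__main__":
    print(detect_cancer_codes("Is WNT5A elevated in liver cancer?"))
```

A single-word lookup cannot resolve phrases such as "lung cancer", which maps to more than one TCGA study, so a production version would need phrase matching or a disambiguation step.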
March 20, 2026 — TCGA Survival, Pipeline & Network Upgrades
  • TCGA Survival: KM curves from UCSC Xena (33 cancers, 1,451 curves). Box plot with scatter overlay. All on same TCGA row
  • Co-expression Networks: Proper p-values + BH FDR. 55 subtissues. 2,088 network files
  • Data Pipeline: 3-stage architecture (Retrieve/Analyze/Format). Vectorized extraction is 10x faster. Site size reduced from 524 MB to 404 MB
  • Multi-Gene: Database selector (GTEx/HPA/FANTOM5). Auto-populated. Tall charts
  • UI: AI tab removed (floating widget only). Compact export buttons. Model display fixed
March 19, 2026 — Phase 7: Tracks, Networks & Polish
  • IGV.js Tracks: ENCODE cCREs, DNA Methylation Atlas (39 cell types), RNA-seq signal. Info tooltips. Auto-reload
  • Networks: All-vs-all Spearman across 50+ GTEx tissues. D3 force-directed two-hop ego networks
  • Skeleton Loading: Shimmer animations across all tabs
  • UI: Unified sidebar gene selector. Side-by-side network + tables. Fullscreen fix for SVG
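The two-hop ego networks introduced in the Networks item can be extracted from an edge list with a short breadth-first expansion. This is a sketch under assumptions: the function name is hypothetical, edges are treated as undirected pairs that already passed the correlation filter, and keeping edges between two second-hop neighbours is a design choice the changelog does not confirm.

```python
def ego_network(edges, center, hops=2):
    """Return (nodes, edges) within `hops` steps of `center`."""
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {center}
    frontier = {center}
    for _ in range(hops):
        # Expand one hop, skipping nodes that were already reached.
        frontier = {n for node in frontier for n in adjacency.get(node, ())} - seen
        seen |= frontier

    kept = [(a, b) for a, b in edges if a in seen and b in seen]
    return seen, kept


if __name__ == "__main__":
    # Toy edge list, not real co-expression results.
    toy_edges = [("CTNNB1", "APC"), ("APC", "AXIN1"), ("AXIN1", "GSK3B")]
    nodes, sub_edges = ego_network(toy_edges, "CTNNB1")
    print(sorted(nodes))
    print(sub_edges)
```

Limiting the graph to two hops keeps the force-directed layout small enough to read while still showing partners of partners.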
March 18, 2026 — Phases 1-6: Full Redesign
  • Vertical-scroll SPA replaced with sidebar + 8 tabbed views + panel grid
  • Plotly.js kitchen-sink expression charts, IGV.js genome browser, enhanced TableViewer
  • LLM optimization: Nemotron-3 Super default (~10s responses), BAAI/bge-base-en-v1.5 embeddings
September 2025 — Initial Release
  • WntHub platform launch with genomic context, expression, correlation, gene regulation, cross-references, and AI assistant

Credits

WntHub integrates data from:

Developed at the University of Oulu, Precision Oncology group (Ungureanu Lab). Contact: harlan[dot]barker[at]oulu.fi