About - WntHub

Overview

WntHub is a panel-based interactive platform for exploring Wnt signaling pathway genes across genomic, transcriptomic, epigenomic, proteomic, and clinical dimensions. It integrates data from GTEx, Human Protein Atlas, TCGA, CELLxGENE, Tahoe-100M, ENCODE, ChIP-Atlas, LINCS L1000, and more into a unified sidebar + tabbed interface with interactive Plotly.js charts, an IGV.js genome browser, D3 force-directed co-expression networks, and an AI-powered research assistant grounded in 23K+ Wnt publications.

The platform covers 91 Wnt-pathway genes, organised into eight functional categories: ligands (all 19 WNTs), Frizzled receptors (FZD1–10), co-receptors (LRP5/6) and non-canonical receptors (ROR1/2, RYK), the R-spondin axis (RSPO3, LGR4/5/6, RNF43, ZNRF3), secreted antagonists (DKK1–4, SFRP1/2/4/5, FRZB, WIF1, SOST, KREMEN1/2, NOTUM) and secretion machinery (PORCN, WLS), the destruction complex (APC, AXIN1/2, GSK3B, CSNK1A1) with cytoplasmic transducers (CTNNB1, DVL1–3, CSNK1E) and feedback inhibitors (NKD1/2, FRAT1/2), the planar-cell-polarity / Rho axis (VANGL1/2, PRICKLE1/2, CELSR1–3, DAAM1/2, RHOA, RAC1, CDC42), and the transcription layer (TCF7, TCF7L1/2, LEF1, NLK, TLE1, CTNNBIP1) with canonical target genes (MYC, CCND1).

Data Sections

Gene Summary tab: Gene Summary

Gene Identity Card (full name, genomic coordinates, aliases, NCBI summary, cross-reference IDs: Entrez, HGNC, Ensembl, UniProt, RefSeq, Pfam, PDB) plus AI-generated summaries across 7 dimensions: Genomic, Expression, Pathway, Functional, Clinical, Protein Biology, Perturbation Response.

Good for: Quick overview of any gene's role, function, disease associations, expression patterns, and database identifiers.

Genomic Context tab: Genomic

IGV.js genome browser (hg38) with an expandable track catalog:

Reference & Variants: RefSeq gene models, ClinVar VCF variants, PhyloP 100-way conservation
Regulatory Elements: ENCODE cCREs (color-coded promoters, enhancers, CTCF-bound, DNase-only)
Histone Marks: H3K27ac, H3K4me3, H3K4me1 — 7 ENCODE cell types
DNA Methylation: Human Methylation Atlas WGBS — 39 merged cell types
Transcription: ENCODE RNA-seq signal — 7 cell types
ChIP-Atlas TF Binding: 1,846 chromatin-associated factors, individual or merged view

Good for: Gene structure, exon/intron layout, clinical variants, regulatory landscape, TF binding, epigenomics, conservation.

RNA-Seq tab: RNA-Seq

Kitchen-sink Plotly.js charts across 6 data sources:

GTEx V10 · 53 subtissues · in-house build: Bar, Radar (top 20, organ systems, brain, GI). Expression is computed directly from the GTEx V10 gene TPM parquet (per-subtissue median TPM → nTPM, ≥20 samples per subtissue). The same source feeds the Correlation tab; the HPA-redistributed GTEx file is no longer used
HPA · 43 tissues: Bar, Radar
FANTOM5 CAGE: Bar, Radar
Cross-Database: Parallel coordinates, Slope chart
DepMap cancer cell lines: Waterfall, Box, Violin, Median by lineage
TCGA · 33 cancers: Diverging bar, Box plot with scatter, KM survival curves (2,898 curves across 91 genes × 33 cancers, median expression split, UCSC Xena log2 TPM)
Multi-Gene: Searchable multi-select (2–5 genes), database selector (GTEx/HPA/FANTOM5), grouped bar + heatmap
PRECOG Survival Analysis · PRECOG v2: Three sub-sections: Adult (51 cancers, ~28K patients, 4,225 records), ICI immunotherapy (20 cancers, ~4K patients, 4,219 records), Pediatric (12 cancers, ~3K patients, 954 records). Each has a waterfall chart and a filterable table side by side. The ICI table includes cancer, ICI target (anti-PD-1/PD-L1/CTLA-4), stage, treatment, patients, outcome, and study. Small cohorts (<20) are flagged with a warning icon. Positive z-score = unfavorable prognosis; significant if |z| > 3.09
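To make the z-score convention above concrete, here is a minimal sketch that converts a survival z-score into a two-sided p-value under a standard-normal assumption and applies the |z| > 3.09 cutoff. The function names are hypothetical, the "favorable" label for negative scores is inferred from the stated positive = unfavorable convention, and this is not the PRECOG or WntHub implementation.

```python
import math

Z_CUTOFF = 3.09  # significance threshold quoted in the PRECOG description


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score, assuming a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def interpret(z: float) -> str:
    """Label a z-score: positive = unfavorable prognosis (stated in the text),
    negative = favorable (inferred), and small |z| = not significant."""
    if abs(z) <= Z_CUTOFF:
        return "not significant"
    return "unfavorable" if z > 0 else "favorable"
```

Under the normal assumption, |z| = 3.09 corresponds to a two-sided p-value of roughly 0.002.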

Good for: Tissue expression patterns, cancer expression, cell line data, survival analysis, immunotherapy survival, cross-database comparison, multi-gene analysis.

Proteomics tab: Proteomics

Protein-level data across 5 sources with interactive 3D structure viewer:

HPA Protein (IHC): 46 tissues · 110 cell types. Tissue × cell type heatmap, tissue distribution bar chart. Subcellular localization (IF-based) shown inline
CPTAC Mass-Spec: 11 cancers. Tumor vs normal protein fold change (logFC + adjusted p-values)
HPA Cancer IHC: 20 cancers. Stacked bar of patient counts per staining level (High/Medium/Low/Not detected)
ProteomicsDB: 67 tissues. Mass-spec protein abundance (normalized intensity) across human tissues
Post-Translational Modifications (iPTMnet): 1,875 PTM sites. Phosphorylation, ubiquitination, acetylation, methylation sites with kinase-substrate relationships. Lollipop scatter plot + detail table
3D Structure Viewer (PDBe Mol*): Interactive PDB and AlphaFold structures with PTM site overlays (phosphorylation, ubiquitination). Structure descriptions from RCSB PDB / AlphaFold APIs

Good for: Protein expression across tissues and cancers, post-translational modifications, kinase-substrate relationships, 3D structure visualization with PTM overlays.

scRNA-Seq tab: scRNA-Seq

CELLxGENE: 61 tissues · 873 cell types · 476 datasets · Census 2025-11-17. Dot plots, tissue selector, PNG/SVG/CSV export (91 Wnt-pathway genes, 398,125 aggregated records)
Tahoe-100M: 2.3M DMSO cells · 50 cell lines · 28 cancers. Bubble plots + detailed table, gene/cancer filters. Pseudobulked from the DMSO-control subset of the Tahoe-100M atlas (~77M cells total across 14 plates)

Good for: Cell-type-specific expression, which cell types express a gene, tumor microenvironment expression.

Correlation Analysis tab: Correlation

Co-expression Network: D3 force-directed two-hop ego networks from GTEx Spearman correlations (p-values + BH FDR). 53 subtissues + ALL_SAMPLES. 4,269 network files (per gene × per tissue)
Correlated Genes: Positive/negative tables with CSV/TSV/XLSX export
Pathway Enrichment: MSigDB enrichment with hyperlinks and filters
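The two-hop ego network idea can be sketched as follows: starting from a center gene, collect its direct neighbors and their neighbors, then keep every edge whose endpoints both fall in that node set. The edge list is assumed to have already passed the correlation and FDR filter (not shown), and the function name is hypothetical rather than taken from the WntHub pipeline.

```python
from collections import defaultdict


def two_hop_ego(edges, center):
    """Return the sorted edges of the two-hop ego network around `center`.

    `edges` is a list of (gene_a, gene_b) pairs that are assumed to have
    already passed a correlation / FDR filter.
    """
    # Build an undirected adjacency map.
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    first_hop = set(adj[center])
    second_hop = set()
    for neighbor in first_hop:
        second_hop |= adj[neighbor]

    nodes = {center} | first_hop | second_hop
    # Keep the induced subgraph: edges with both endpoints inside `nodes`.
    kept = {tuple(sorted((a, b))) for a, b in edges if a in nodes and b in nodes}
    return sorted(kept)
```

For example, with the chain CTNNB1 - AXIN2 - LEF1 - TCF7, the ego network of CTNNB1 keeps the first two edges and drops the LEF1 - TCF7 edge, because TCF7 is three hops away.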

Good for: Co-expression partners, tissue-specific networks, pathway enrichment, functional associations.

Perturbations tab: Perturbations

Two complementary perturbation datasets accessible via sub-tab toggle, identifying drugs and genetic manipulations that significantly alter expression of Wnt pathway genes.

LINCS L1000 — Compound Perturbations

LINCS L1000: 720K experiments · 33K compounds · 230 cell lines. Level 5 moderated z-scores from compound perturbation experiments (Subramanian et al., Cell 2017)
Waterfall Chart: Top 15 activators and repressors ranked by effect size
MOA Enrichment: Mechanism of action enrichment showing which drug classes most frequently alter the gene
Full Table: Filterable by direction, cell lineage, disease, dose, timepoint. Compound names link to CLUE.io. Export CSV/TSV/XLSX

LINCS L1000 — Genetic Perturbations (CRISPR, shRNA, Overexpression)

View 1 (Genetic Perturbations): 103,206 records · 72 genes. Which CRISPR knockouts, shRNA knockdowns, or overexpression experiments significantly alter a Wnt gene's expression? Identifies upstream genetic regulators
View 2 (Downstream Effects): 97,680 records · 72 genes. What happens to the transcriptome when a Wnt gene is knocked out or overexpressed? Split into Knockout (CRISPR + shRNA) and Overexpression sections. 19 CRISPR, 32 shRNA, 19 overexpression perturbagens
Data Sources: CRISPR (142K experiments), shRNA (238K), overexpression (34K); all Level 5 GCTX, |modz| ≥ 3.0

72 of 91 genes are detected as L1000 readout targets (View 1) and 72 of 91 are used as L1000 perturbagens (View 2). Not detected as readout targets: AXIN2, KREMEN1, LGR6, NKD1, NKD2, NOTUM, PRICKLE1, PRICKLE2, RSPO3, SFRP2, SOST, VANGL2, WNT10A, WNT3A, WNT7B, WNT8A, WNT9A, WNT9B, ZNRF3.

Tahoe-100M — Single-Cell Perturbations

Tahoe-100M: 458,122 records · 379 drugs · 50 cell lines · 71 genes · 85.0% BH-significant. Pseudobulk replicate differential expression from ~77 million single cells across 14 plates (Zhang et al., bioRxiv 2025)
Full Transcriptome: All 91 Wnt genes are detectable in baseline expression; 71 produce statistically supported drug effects (vs. 72 measured as readout targets in LINCS L1000). scRNA-seq captures complete gene expression rather than a limited panel
Methodology: Cells split into pseudobulk replicates (25 cells for groups of 50–199 cells, 50 cells for groups of 200+). Wilcoxon rank-sum test (treatment vs DMSO replicates) with Benjamini-Hochberg FDR correction. Effect score = log2FC z-scored within cell line and dose. Three dose levels (0.05, 0.5, 5.0 μM)
Waterfall Chart + MOA Enrichment + Table: Same visualization pattern as LINCS, with additional columns for p-value, BH p-value, treatment/DMSO replicate counts, cell count, fraction expressing, approval status, and confidence tier
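Two pieces of the methodology above lend themselves to a short sketch: the replicate-size rule and the Benjamini-Hochberg adjustment. The code below is illustrative only. The helper names are hypothetical, the handling of groups under 50 cells and of leftover cells is an assumption (the text does not state it), and the Wilcoxon rank-sum test and the effect-score z-scoring are not shown.

```python
def replicate_size(n_cells):
    """Cells per pseudobulk replicate, following the rule quoted above.

    Returns None for groups below 50 cells; skipping such groups is an
    assumption, since the text does not say how they are handled.
    """
    if n_cells >= 200:
        return 50
    if n_cells >= 50:
        return 25
    return None


def split_replicates(cell_ids, size):
    """Chunk cell IDs into consecutive full replicates of `size` cells.

    Leftover cells that do not fill a replicate are dropped (assumption).
    """
    return [cell_ids[i:i + size] for i in range(0, len(cell_ids) - size + 1, size)]


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A group of 120 cells would yield four 25-cell replicates under these assumptions, with the remaining 20 cells left out.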

Good for: Drug discovery, target validation, single-cell perturbation responses, identifying compounds that modulate Wnt gene expression, cross-referencing bulk (LINCS) and single-cell (Tahoe) evidence.

Cross References tab: Cross Refs

Dynamic links to 30+ external databases (GeneCards, UniProt, NCBI, Ensembl, KEGG, Reactome, ClinVar, OMIM, etc.).

Good for: Finding a gene in other databases, jumping to external resources.

AI Research Assistant

RAG-powered chatbot grounded in 23,323 Wnt pathway publications (31,773 text chunks). Accessible via the floating widget from any tab. Supports multi-turn conversation with history, markdown rendering, and adjustable temperature.

Architecture

Query Classifier: Qwen3-14B pre-classifies every query as site, hybrid, or science. Site questions (e.g. "which species?", "where can I find variant data?") skip the RAG pipeline entirely and are answered in ~1-2s from platform knowledge. Science questions proceed through the full RAG pipeline. Hybrid questions get both
Embeddings: BAAI/bge-base-en-v1.5 (768-dim), Neon PostgreSQL + pgvector HNSW
Search: Hybrid — vector similarity (70%) + BM25 (30%), 100 candidates
Reranking: Qwen3-Reranker (4B/8B) with quality threshold
LLM: Qwen3-Next-80B MoE (default) · Qwen3-235B (deep analysis)
Response Types: Concise (3K), Overview (5K), Interpretive (10K), Deep (10K)
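The 70/30 hybrid search weighting listed above can be illustrated with a small score-fusion sketch. It assumes each retriever returns a dictionary of document scores and that scores are min-max normalized before weighting. The normalization step and the function name are assumptions for illustration, not a description of the production query.

```python
def hybrid_rank(vector_scores, bm25_scores, w_vec=0.7, w_bm25=0.3, top_k=100):
    """Combine two {doc_id: score} dicts into one weighted ranking.

    Min-max normalization per retriever is an assumption; documents missing
    from one retriever contribute 0 for that component.
    """
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {d: ((s - lo) / span if span else 1.0) for d, s in scores.items()}

    vec = normalize(vector_scores)
    bm25 = normalize(bm25_scores)
    combined = {
        doc: w_vec * vec.get(doc, 0.0) + w_bm25 * bm25.get(doc, 0.0)
        for doc in set(vec) | set(bm25)
    }
    # Highest combined score first; ties broken by document ID for stability.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

The `top_k=100` default mirrors the 100-candidate pool mentioned above; the candidates would then be passed to the reranker.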

Data Integration

The AI assistant can optionally include structured experimental data from the site alongside literature context. Selectable data sections:

Genomic: Gene structure, ClinVar variants, ChIP-Atlas TF peaks
HPA / DepMap / CELLxGENE / Tahoe: Tissue and cancer cell line expression
Pathways: MSigDB pathway enrichment (KEGG, Reactome, GO, Hallmark)
GTEx Correlations: Top co-expressed genes per tissue (Spearman r ≥ 0.5). Automatic pairwise lookup when multiple genes are queried
TCGA: Tumor expression (log₂ TPM) across 33 TCGA cancer types with tissue-aware cancer detection (e.g., "liver cancer" → LIHC)
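The tissue-aware cancer detection mentioned for the TCGA section might look something like the sketch below: a phrase table mapped to TCGA study codes and matched on word boundaries. The alias table and function name are hypothetical; only the "liver cancer" → LIHC mapping comes from the text, and the remaining entries are standard TCGA study abbreviations added for illustration.

```python
import re

# Hypothetical phrase-to-study table; only the first entry is from the text.
CANCER_ALIASES = {
    "liver cancer": "LIHC",
    "hepatocellular carcinoma": "LIHC",
    "breast cancer": "BRCA",
    "colon cancer": "COAD",
    "glioblastoma": "GBM",
}


def detect_cancers(query):
    """Return TCGA study codes mentioned in a free-text query, in table order."""
    text = query.lower()
    found = []
    for phrase, code in CANCER_ALIASES.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text) and code not in found:
            found.append(code)
    return found
```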

Changelog

May 13, 2026 — AI Gene Summaries v2: 91-Gene Regeneration with 7-Dimension Pipeline

Gene-Summary Pipeline v2 (Engine Refactor): 5 dimensions → 7. Added Protein Biology (HPA IHC + HPA subcellular + ProteomicsDB + CPTAC + iPTMnet) and Perturbation Response (LINCS upstream-regulator perturbations + LINCS-downstream effects of perturbing the gene itself + Tahoe-100M drug perturbations with MoA distribution and UP/DOWN split). New scripts/lib/gene-data-extractors.js consolidates 21 per-source extractors (ChIP-Atlas with ±1.5 kb promoter-window peak clustering and TSS-distance reporting, ClinVar with cross-assembly dedup, PRECOG with ICI cohort regimen detail, MSigDB enrichment with ALL_SAMPLES-empty per-tissue fallback for tissue-restricted genes, etc.) parameterised entirely on site-config.json so the same code serves WntHub and CABase. scripts/generate-gene-summaries.js fully rewritten + new --preview flag that writes a numbered markdown dossier of every prompt for human review before any LLM submission.
All 91 Genes Regenerated: 47 v1 summaries from October 2025 overwritten + 44 Bundle-expansion genes generated for the first time. End-to-end: 8-way parallel batch, 56 minutes wall for 637 main-LLM calls (100% success rate, zero 429s, zero retries), ~$1.30 DeepInfra spend including reranker / classifier / embedding overhead. v1 snapshot preserved as a tarball outside the repo.
Citation-Discipline Hardening: v1 occasionally fell back to numbered bracket references ([1], [2,3]) instead of the requested PMID format, particularly in the genomic-context and clinical-relevance dimensions. The v2 question templates now append an explicit anti-bracket directive; the extractor for Ensembl now lists every transcript ID (preventing the v1 hallucination of plausible-looking but fictional ENST IDs); ClinVar deduplicates GRCh37 + GRCh38 rows. Result on regen: zero bracket refs across all 91 genes × 7 dimensions.
Local Batch Rate-Limit Bypass: netlify/functions/query-initiate.js now reads BATCH_BYPASS_RATE_LIMIT: when set to 1, the per-IP 5-requests/minute limit (which protects the public site against query spam) is skipped. Production deploys don’t set the variable, so the rate limit stays on for end users while local 8-way parallel batches can complete without self-throttling.
SITE_CONFIG-Driven Frontend Branding: The remaining hardcoded “WntHub” / “Wnt signaling” strings in js/components/tab-manager.js, js/components/ai-assistant.js, and js/utils/ai-api-client.js now read from window.SITE_CONFIG — the bookmark document.title, the AI chat panel header and greetings, the LLM prompt templates sent on every query, the AI Insights Report HTML, citation strings, and the generic offline fallback. Engine JS is now byte-identical between WntHub and CABase.
NAR Pre-Submission Inquiry Refreshed: docs/Pre-submission inquiry NAR.docx updated for the 91-gene set with current stats: ChIP-Atlas 1,328 TFs (up from 1,086), CELLxGENE 61 tissues × 873 cell types × 63.4 M human cells, LINCS L1000 ~1.4 M compound + ~200 K genetic perturbations, Tahoe-100M corrected from “CRISPR” to drug-perturbation DE statistics (379 compounds × 50 cell lines), and the (vii) AI summaries line expanded to mention the seven v2 dimensions.

May 11, 2026 — wnthub.org Launch & Engine Polishing

Custom Domain Live: WntHub now serves at https://wnthub.org. Apex A record → Netlify load balancer + www CNAME, Let’s Encrypt cert auto-provisioned, Force HTTPS enabled. The auto-generated aesthetic-cascaron-f8669f.netlify.app URL keeps working — Netlify auto-301-redirects it to wnthub.org — so any grant-application links already circulating remain valid. Pre-submission inquiry letter (NAR Database Issue) updated to the live URL.
Site-Neutral Engine Identifiers: Two-step engine refactor moved every WntHub-specific identifier (WNT_GENES, WNT_SET, WNT_SET_HASH, _wntSetHash, wnt_gene/wnt_adj/wnt_edges/wnt_df local variables, load_wnt_expression(), etc.) into neutral names (GENES, GENE_SET, GENE_SET_HASH, _geneSetHash, pathway_gene, gene_adj, ...) across 27 engine scripts. WNT_GENES kept as a one-line alias so site-specific code (the AI-summary toolchain) keeps working unchanged. Future cherry-picks between WntHub and CABase now auto-merge instead of needing manual conflict resolution.
Site-Config-Driven Identity: scripts/pipelines/config.py now reads site-config.json at import and re-exports SITE_NAME, GENE_SET_LABEL, GENE_SET_FULL, DATA_SUFFIX — engine code never needs the literal “WntHub” / “Wnt” / “Wnts” strings hardcoded.
Automated Site Stats: Two new engine scripts (scripts/build_site_stats.py + scripts/render_site_text.py) compute every data-derived number on the site (gene count, GTEx subtissues, TCGA KM curves, PRECOG records, iPTMnet sites, CELLxGENE counts, LINCS records, Tahoe drug-perturbation counts, RAG corpus size, ...) and substitute them into HTML/JS via declarative <span data-stat="key"> markers. Wired into master_rebuild_all.sh as Step 18, after data pipelines and before the _site/ rsync. Manual overrides via site_stats_overrides.json for values that can’t be derived from on-disk data (e.g. RAG corpus counts that live in the Neon Postgres DB). Eliminates the stale-number drift that has historically dogged about/index page edits.
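The declarative marker substitution described in the Automated Site Stats entry can be sketched with a regular expression that rewrites the inner text of each marker span. This is a simplified illustration, not the contents of scripts/render_site_text.py: it assumes the span carries only the data-stat attribute, and the function name and stat keys are hypothetical.

```python
import re

# Matches <span data-stat="key">old text</span>; assumes no other attributes.
STAT_SPAN = re.compile(r'(<span data-stat="([^"]+)">)(.*?)(</span>)', re.DOTALL)


def render_stats(html, stats):
    """Replace the inner text of each data-stat span with stats[key].

    Spans whose key is missing from `stats` are left unchanged.
    """
    def substitute(match):
        key = match.group(2)
        if key not in stats:
            return match.group(0)
        return match.group(1) + str(stats[key]) + match.group(4)

    return STAT_SPAN.sub(substitute, html)
```

Leaving unknown keys untouched means a missing stat shows the previous value instead of an empty span, which seemed the safer default for a sketch.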

May 6, 2026 — Pathway Coverage Doubled (46 → 91 genes)

Gene Set Expanded: The curated Wnt-pathway set grew from 46 to 91 genes, organised into eight functional categories. Selection driven by recurrence across 70 MSigDB Wnt-related pathway sets. New additions span the secreted-antagonist arm (DKK1–4, SFRP1/2/4/5, FRZB, WIF1, SOST, KREMEN1/2), the R-spondin / RNF43 / LGR stem-cell axis (RSPO3, LGR4/5/6, RNF43, ZNRF3), Wnt secretion machinery (PORCN, WLS, NOTUM), CK1 destruction-complex isoforms (CSNK1A1, CSNK1E), non-canonical receptors (ROR1/2, RYK), planar-cell-polarity core (VANGL1/2, PRICKLE1/2, CELSR1–3, DAAM1/2), Rho GTPases (RHOA, RAC1, CDC42), and transcription-layer regulators (NLK, TLE1, CTNNBIP1). WNT11, previously dropped due to a hand-transcription error, is now included — bringing all 19 human WNT ligands into the set.
Pipelines Re-run End-to-End: All 18 ingestion steps in scripts/pipelines/master_rebuild_all.sh were executed against the expanded set with cache-aware skip logic, so existing per-gene API calls (ChIP-Atlas, iPTMnet, ProteomicsDB) and per-gene LINCS GCTX extractions were re-used; only the 45 new genes hit the external services. Co-expression network JSONs now embed a _wntSetHash fingerprint and self-invalidate when the gene set changes.
Downstream Counts Refreshed: ClinVar 84,249 variants across 91 genes; ChIP-Atlas TF universe 1,086 → 1,328 chromatin-associated factors; TCGA Kaplan-Meier curves 1,451 → 2,898; PRECOG records ~5K → ~9.4K (adult + ICI + pediatric combined); LINCS L1000 genetic 45,868 → 103,206 records (View 1) and 59,533 → 97,680 (View 2); Tahoe-100M perturbations 217,915 → 458,122 records; iPTMnet 1,118 → 1,875 PTM sites; co-expression network files 2,144 → 4,269.
Engineering Hardening: Three drifted hand-typed gene lists (in build_survival_json.py, build_network_json.py, gtex/06_build_site_correlations.py) consolidated to import from a single source (scripts/pipelines/config.py); the duplicate gtex/04_network_json.py was reduced to a deprecation stub. LINCS pipelines gained per-gene skip-if-exists logic with --force override, with cached per-gene files folded into the combined output so subsequent runs touch only new genes.
HPA Source-Schema Fixes (caught en route): HPA renamed the FANTOM5 sTPM column case (“Scaled Tags Per Million” → “Scaled tags per million”) and retired the rna_cancer.tsv.zip endpoint; pipeline definitions in hpa/01_retrieve_and_filter.py updated accordingly. The DepMap portal is currently returning HTTP 403 for public file downloads; CCLE expression now resolves through a local snapshot symlinked into data/_raw/DepMap/.
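The self-invalidating fingerprint mentioned for the co-expression network JSONs can be sketched as a hash over the sorted gene set that is stored with each file and compared on load. The hash algorithm, the truncation length, and the function names are assumptions; only the `_wntSetHash` key name comes from the entry above.

```python
import hashlib


def gene_set_hash(genes):
    """Order-independent fingerprint of a gene set (SHA-256, truncated).

    The algorithm and the 12-character truncation are assumptions.
    """
    canonical = ",".join(sorted(set(genes)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def is_stale(network_json, genes):
    """True when a cached network was built for a different gene set."""
    return network_json.get("_wntSetHash") != gene_set_hash(genes)
```

Sorting before hashing keeps the fingerprint stable when the gene list is merely reordered, so only a real change in membership triggers a rebuild.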

April 14, 2026 — GTEx Direct Build & Documentation Refresh

GTEx Source Unified: The RNA-Seq tab's GTEx panels (bar + 4 radar charts) now consume the same in-house GTEx V10 build that powers the Correlation/Network analyses. The HPA-redistributed GTEx nTPM file is no longer used anywhere on the site. Pipeline: scripts/pipelines/gtex/08_build_true_gtex_expression.py (GTEx V10 parquet → per-subtissue median TPM → nTPM, ≥20 samples/subtissue, 55 subtissues)
Manuscript Data Table Refresh: Updated docs/manuscript-data-table.2026-04-14.xlsx supersedes the 2026-03-23 version. GTEx row reclassified as direct/in-house; added rows for Tahoe-100M perturbations, ENCODE/UCSC regulatory tracks, and HPA subcellular localization; LINCS L1000 split into three rows (compounds, "what affects gene", "what gene affects"); LINCS compound count corrected to 12,735 filtered uniques
New Provenance Document: docs/data-generation-report.md — per-plot/per-table provenance covering every panel in every sidebar tab, from upstream database through pipeline script to rendered value
About Page Counts Corrected: ChIP-Atlas factors 1,086 → 1,846; Tahoe scRNA-Seq "116M cells" → "2.3M DMSO cells (77M total)"; CELLxGENE "38 tissues · 187 cell types" → actual 62 tissues · 806 cell types; co-expression network files 2,088 → 2,144
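The per-subtissue aggregation in the GTEx direct build from this entry can be sketched as a grouped median with a minimum-sample filter. This is a simplified stand-in that uses plain Python rather than the parquet pipeline, the function name is hypothetical, and the subsequent TPM → nTPM normalization is not shown because the text does not specify it.

```python
from collections import defaultdict
from statistics import median

MIN_SAMPLES = 20  # subtissues with fewer samples are excluded, per the text


def median_tpm_by_subtissue(samples):
    """Median TPM per subtissue for one gene.

    `samples` is an iterable of (subtissue, tpm) pairs. Subtissues with
    fewer than MIN_SAMPLES samples are dropped.
    """
    grouped = defaultdict(list)
    for subtissue, tpm in samples:
        grouped[subtissue].append(tpm)
    return {
        subtissue: median(values)
        for subtissue, values in grouped.items()
        if len(values) >= MIN_SAMPLES
    }
```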

March 24, 2026 — Tahoe-100M Perturbations & iPTMnet Improvements

Tahoe-100M Perturbations: New sub-tab on the Perturbations page integrating single-cell RNA-seq drug response data from the Tahoe-100M dataset (~77M cells, 379 drugs, 50 cell lines, 14 plates). Pseudobulk replicate approach: cells split into 25- or 50-cell replicates, Wilcoxon rank-sum test vs matched DMSO replicates, BH FDR correction. 217,915 records across 32 genes (84.3% BH-significant). Waterfall chart, MOA enrichment, and filterable table with p-values, replicate counts, and confidence tiers
Info Tooltips: Added contextual info icons across all 10 site-wide data tables explaining statistical metrics, column meanings, and data sources. Floating tooltip design escapes all CSS stacking contexts
Tab Persistence Fix: Fixed Plotly charts going blank when navigating away from a tab and returning. Charts now re-render from cached data on tab switch
iPTMnet Plot: Y-axis now uses iPTMnet score (instead of known enzymes), circle size represents number of enzymes. Hover tooltip includes enzyme list and publication count
iPTMnet Table: Rebuilt with TableViewer for sorting, filtering, and export. Added Score and Position columns, enzyme names link to UniProt, dropdown filters on Type/Score/Known/Evidence
Mol* Viewer Fix: Fixed issue where switching from PTM overlays to structure property themes would fail. Viewer now reloads cleanly when transitioning between overlay categories

March 22, 2026 — Perturbations Tab & Proteomics Enhancements

Perturbations Tab: New tab integrating LINCS L1000 compound perturbation data (720K experiments, 33K compounds, 230 cell lines). Waterfall chart of top activators/repressors, MOA enrichment analysis, and full filterable/exportable table with CLUE.io compound links. 39 of 46 genes measured
Genetic Perturbations: LINCS L1000 CRISPR (142K), shRNA (238K), and overexpression (34K) data in two views: (1) genetic perturbations affecting each Wnt gene (45,868 records, 39 genes), (2) downstream effects split into Knockout (CRISPR + shRNA) and Overexpression sections (59,533 records, 37 genes). Waterfall charts + filterable tables for all sections
PRECOG Survival Analysis: PRECOG v2 survival z-scores added to RNA-Seq tab. Three databases: Adult (46/46 genes, 51 cancers, ~28K patients), Pediatric (44/46 genes, 12 cancers, ~3K patients), ICI immunotherapy (46/46 genes, 20 cancers, ~4K patients). Waterfall charts + filterable tables. ICI table enriched with ICI target, tumor stage, treatment status, cohort size, outcome type, and study source. Pipeline: scripts/pipelines/precog/01_extract_survival_zscores.py
LINCS Pipelines: Compound pipeline (scripts/pipelines/lincs/01_extract_perturbations.py) extracts per-gene perturbation data at |modz| ≥ 3.0 from Level 5 GCTX files (401K perturbations, 12,735 compounds, 702 MOA classes). Genetic pipeline (scripts/pipelines/lincs/02_extract_genetic_perturbations.py) extracts CRISPR/shRNA/overexpression data with the same threshold
PTM Table: PMID counts replaced with clickable PubMed links (collapsible when >3). Table card scrollable with sticky header
Structure Viewer: Overlay selection persists across gene changes. New built-in Mol* overlays: secondary structure, hydrophobicity, residue type, sequence position, B-factor/pLDDT. Fullscreen now fills the viewport
Dev Workflow: Added [dev] publish = "." to netlify.toml — local dev serves from project root, no rsync needed. Production deploys via netlify deploy --prod (build is automatic)

March 21, 2026 — Proteomics Tab & AI Query Classifier

Proteomics Tab: New dedicated tab with protein-level data from 5 sources — HPA IHC (normal tissue + cancer), CPTAC mass-spec proteomics (11 cancers), ProteomicsDB (67 tissues), iPTMnet PTM sites (1,118 sites with kinase-substrate relationships)
3D Structure Viewer: PDBe Mol* integration — interactive PDB and AlphaFold structures with PTM overlays (phosphorylation, ubiquitination). Structure descriptions fetched from RCSB PDB and AlphaFold APIs. 26 genes with PDB structures, all genes with AlphaFold
AI Query Classifier: Qwen3-14B pre-classifies queries as site/hybrid/science. Site questions ("which species?", "where is variant data?") skip RAG entirely and answer in ~1-2s. Science questions proceed through full RAG pipeline with zero overhead
Data Pipelines: New pipelines for HPA protein IHC, CPTAC, ProteomicsDB (OData API), and iPTMnet (REST API). All follow existing retrieve-filter-compress pattern
Gene Identity Card: New card at top of Gene Summary tab with full name, genomic coordinates, aliases, NCBI summary, and cross-reference IDs (Entrez, HGNC, Ensembl, UniProt, RefSeq, Pfam, PDB) with clickable links
Tab Rename & Reorder: Expression → RNA-Seq, Single Cell → scRNA-Seq. New order: Gene Summary → Genomic → RNA-Seq → Proteomics → scRNA-Seq → Correlation → Cross Refs
Sidebar Fixes: Ensembl ID parsing fixed for LRP6, TCF7L1, WNT3, WNT9B (nested JSON fallback). Ensembl ID text removed from sidebar info card. WntHub logo links to Overview tab. University of Oulu logo enlarged

March 20, 2026 — AI Data Integration & TCGA Restructure

AI — GTEx Correlations: New data section sends top co-expressed genes to AI. Automatic pairwise lookup when multiple genes queried. Tissue-aware (uses specific tissue or ALL_SAMPLES)
AI — TCGA Expression: New data section sends tumor expression stats (median, IQR, n) across 33 TCGA cancer types. Smart cancer detection from natural language ("liver cancer" → LIHC)
AI — Conversation: Multi-turn chat with history. Gene detection from conversation context. Temperature slider. Markdown rendering. Inline PMID citations. New Chat button
AI — Model: Upgraded to Qwen3-Next-80B MoE (fast inference with reliable instruction following)
Data: TCGA data reorganized under data/TCGA/ (survival + expression). All paths and pipelines updated
UI: AI tab removed (floating widget only). Hero page updated. About page restyled. Chatbot colors matched to site palette

March 20, 2026 — TCGA Survival, Pipeline & Network Upgrades

TCGA Survival: KM curves from UCSC Xena (33 cancers, 1,451 curves). Box plot with scatter overlay. All on same TCGA row
Co-expression Networks: Proper p-values + BH FDR. 55 subtissues. 2,088 network files
Data Pipeline: 3-stage architecture (Retrieve/Analyze/Format). Vectorized extraction 10x faster. Site 524MB to 404MB
Multi-Gene: Database selector (GTEx/HPA/FANTOM5). Auto-populated. Tall charts
UI: AI tab removed (floating widget only). Compact export buttons. Model display fixed

March 19, 2026 — Phase 7: Tracks, Networks & Polish

IGV.js Tracks: ENCODE cCREs, DNA Methylation Atlas (39 cell types), RNA-seq signal. Info tooltips. Auto-reload
Networks: All-vs-all Spearman across 50+ GTEx tissues. D3 force-directed two-hop ego networks
Skeleton Loading: Shimmer animations across all tabs
UI: Unified sidebar gene selector. Side-by-side network + tables. Fullscreen fix for SVG

March 18, 2026 — Phases 1-6: Full Redesign

Vertical-scroll SPA replaced with sidebar + 8 tabbed views + panel grid
Plotly.js kitchen-sink expression charts, IGV.js genome browser, enhanced TableViewer
LLM optimization: Nemotron-3 Super default (~10s responses), BAAI/bge-base-en-v1.5 embeddings

September 2025 — Initial Release

WntHub platform launch with genomic context, expression, correlation, gene regulation, cross-references, and AI assistant

Credits

WntHub integrates data from:

GTEx (V10) · Human Protein Atlas · FANTOM5
TCGA via UCSC Xena (Pan-Cancer Atlas, Liu et al. 2018)
CELLxGENE (CZI) · Tahoe-100M · DepMap
ENCODE · Human Methylation Atlas · ChIP-Atlas
CPTAC (via HPA) · ProteomicsDB · iPTMnet
LINCS L1000 (Subramanian et al., Cell 2017; compound + CRISPR/shRNA/overexpression) · Connectivity Map (Broad Institute)
PRECOG v2 (Benard et al., Nucleic Acids Research 2026; adult, pediatric, and ICI survival z-scores. CC BY-NC 4.0)
ClinVar · MSigDB · Ensembl · NCBI

Developed at the University of Oulu, Precision Oncology group (Ungureanu Lab). Contact: harlan[dot]barker[at]oulu.fi
