Three Dimensions of Neglect

How Biobanks, Clinical Trials, and Scientific Literature Systematically Underserve Global South Diseases

538×
More clinical trials for type 2 diabetes than lymphatic filariasis
Type 2 diabetes: 47,892 trials (2.5M DALYs) vs Lymphatic filariasis: 89 trials (1.3M DALYs)

Corpas M, Freidin MB, Valdivia-Silva J, Baker S, Fatumo S, Guio H (2026)

Three Dimensions of Research Inequity

HEIM measures compound neglect across discovery, translation, and knowledge - revealing how disadvantage accumulates at every stage of the research pipeline.

Discovery
70 biobanks
22 of 175 diseases have critical research gaps
22 of 175 diseases have no biobank coverage

Translation
563,725 trials
2.4x more trials per DALY in high-income countries

Knowledge
13.1M papers
Neglected tropical diseases 40% more semantically isolated

HEIM Framework - Three Dimensions of Research Inequity

Figure 1. The HEIM framework quantifies research inequity across three independent dimensions: Discovery (biobank gaps), Translation (clinical trial equity), and Knowledge (semantic isolation).

Unified = 0.501 × Discovery + 0.293 × Translation + 0.206 × Knowledge
PCA-derived weights (PC1 explains 63.3% of variance, n = 86 diseases)

Top 30 Most Neglected Diseases

PCA-derived unified score combining biobank gaps, clinical trial equity, and semantic isolation.

Nine of the top ten most neglected diseases have primary burden in the Global South. Five are WHO-classified neglected tropical diseases. The remainder, including malaria, encephalitis, iodine deficiency, and invasive non-typhoidal Salmonella, are likewise conditions of poverty and structural neglect, collectively affecting over 1.5 billion people.
28.0
Mean Score
0.5 – 87.9
Score Range
86
3-Dimension Diseases

The Geography of Neglect

Research investment follows wealth, not disease burden.

57.78:1
HIC vs LMIC publication ratio (biobanks)
2.5:1
HIC vs LMIC trial site ratio
61.5%
Clinical trials focused on cancer
Figure 2 - Discovery Dimension: Biobank Research Gaps

Figure 2. Discovery dimension analysis. Biobank research gaps across 175 GBD disease categories, showing disease coverage and publication intensity for 70 IHCC biobanks.

Figure 3 - Translation Dimension: Clinical Trial Equity

Figure 3. Translation dimension analysis. Clinical trial equity across 563,725 studies, showing geographic distribution and disease-specific research intensity relative to burden.

Figure 4 - UMAP Disease Semantic Landscape

Figure 4. Disease research semantic landscape. UMAP projection of 175 GBD Level 3 diseases based on PubMedBERT embeddings of 13.1M abstracts. Point size reflects research volume; colour indicates Semantic Isolation Index (SII). Neglected tropical diseases cluster at the periphery, far from the dense core of well-studied conditions.

The Pipeline Paradox

Geographic concentration is less extreme in clinical trials than in biobank publications (HIC:LMIC ratio of 2.5:1 vs 57.8:1). But disease priorities are more skewed: lymphatic filariasis has 89 trials for 1.3M DALYs, while type 2 diabetes has 47,892 trials for 2.5M DALYs - a 538-fold disparity in trial counts for burdens that differ by less than a factor of two.

Interactive Explorer

Adjust weights, model scenarios, and explore the data.

Weight Adjustment

Modify the PCA-derived dimension weights and watch rankings update in real time.

Using published PCA weights: Discovery 0.501, Translation 0.293, Knowledge 0.206 (sum: 1.000)
Unified = 0.501 × D_norm + 0.293 × T_norm + 0.206 × K_norm

Scenario Builder

Model "what if" scenarios to see how research investment changes would shift disease rankings.

  • Increase Neglected Tropical Disease Research (+50%): 50% more clinical trials for all 17 WHO neglected tropical diseases.
  • Equalise HIC/LMIC (=): Double trial intensity for highly neglected diseases.
  • Double Infectious Research (2x): Double clinical trial intensity for infectious diseases.
  • Reduce Cancer Bias: Redistribute 30% of excess cancer trials to neglected diseases.

Results panel (interactive): baseline rankings, most affected diseases with score changes, Spearman rho, rank changes, largest shift, and a verification trace.

Biobank Comparison

Select 2-5 biobanks for side-by-side comparison. Grouped by WHO region.


Methods & Citation

Data sources, statistical validation, and how to cite this work.

Data Sources

Biobanks (Discovery): Publication data from PubMed for 70 biobanks in the International HundredK+ Cohorts Consortium (IHCC). Disease coverage mapped to GBD 2021 taxonomy.

Clinical Trials (Translation): 563,725 studies from ClinicalTrials.gov via the AACT database (2000-2025). 2,189,930 disease-trial mappings across 770,178 trial sites in 194 countries.

PubMed Semantic (Knowledge): 13.1M papers with PubMedBERT embeddings (768-dimensional). 175 diseases analysed for semantic isolation, knowledge transfer potential, and research clustering.

Disease Burden: Global Burden of Disease Study 2021 (IHME). 175 disease categories mapped to GBD Level 3 causes. 30 Global South Priority diseases identified.

Key Metrics

Unified Neglect Score (0–100)
A single number summarising how neglected a disease is across all three research stages. It combines the Gap Score (biobank research gaps), clinical trial equity, and semantic isolation using weights derived from principal component analysis: Discovery receives the largest weight (0.50) because biobank-stage neglect explains the most variance, followed by Translation (0.29) and Knowledge (0.21). A score of 0 means no measurable neglect; higher scores indicate compounding disadvantage across multiple dimensions.
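As a rough illustration of the weighted sum described above, the following Python sketch combines three dimension scores using the published weights. It assumes each input is already normalised to the 0-100 range (the normalisation step itself is not shown here), and the function name and the renormalisation of user-adjusted weights are illustrative choices rather than the authors' implementation.

```python
# Published PCA-derived weights (Discovery, Translation, Knowledge).
WEIGHTS = {"discovery": 0.501, "translation": 0.293, "knowledge": 0.206}


def unified_score(d_norm: float, t_norm: float, k_norm: float,
                  weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of three dimension scores, each assumed to lie in 0-100."""
    inputs = (("d_norm", d_norm), ("t_norm", t_norm), ("k_norm", k_norm))
    for name, value in inputs:
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Assumption: divide by the weight total so that adjusted weights
    # (as in an interactive explorer) still yield a 0-100 score.
    weighted = (weights["discovery"] * d_norm
                + weights["translation"] * t_norm
                + weights["knowledge"] * k_norm)
    return weighted / total
```

With the published weights, a disease scoring 100 on Discovery and 0 on the other two dimensions would receive a unified score of about 50.1, which shows how strongly the biobank dimension drives the ranking.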

Gap Score (0–100) — Discovery dimension
Measures how far a disease's biobank research output falls short of what its global burden would warrant. Diseases with zero publications across all 70 biobanks receive the maximum penalty (95). Diseases with some publications are scored based on whether their research volume is proportionate to their share of global DALYs, with stricter thresholds for infectious and neglected tropical diseases. An additional penalty (+10) is applied to Global South priority diseases with fewer than 50 publications. Categories: Critical (>70), High (50–70), Moderate (30–50), Low (<30).
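The explicit rules in this definition can be sketched in Python as follows. This is only a partial illustration: the burden-proportionality thresholds are not given in the text, so `base_score` stands in for that step as a hypothetical input. Clamping to 0-100, not stacking the +10 penalty on top of the zero-publication score, and the handling of boundary values (exactly 50 treated as High) are all assumptions.

```python
ZERO_PUBLICATION_SCORE = 95.0  # stated: maximum penalty for zero publications
GLOBAL_SOUTH_PENALTY = 10.0    # stated: extra penalty for priority diseases
LOW_PUBLICATION_CUTOFF = 50    # stated: "fewer than 50 publications"


def apply_gap_rules(base_score: float, publications: int,
                    global_south_priority: bool) -> float:
    """Apply the two explicit rules to a hypothetical burden-based base score."""
    if publications < 0:
        raise ValueError("publications must be non-negative")
    if publications == 0:
        # Assumption: the +10 penalty is not added on top of this value.
        return ZERO_PUBLICATION_SCORE
    score = base_score
    if global_south_priority and publications < LOW_PUBLICATION_CUTOFF:
        score += GLOBAL_SOUTH_PENALTY
    # Assumption: keep the result inside the stated 0-100 range.
    return max(0.0, min(100.0, score))


def gap_category(score: float) -> str:
    """Map a Gap Score to the four categories named in the text."""
    if score > 70:
        return "Critical"
    if score >= 50:  # assumption: exactly 50 counts as High
        return "High"
    if score >= 30:
        return "Moderate"
    return "Low"
```

For example, a Global South priority disease with 12 publications and a hypothetical base score of 40 would move from Moderate to High once the +10 penalty is applied.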

Research Intensity — Translation dimension
The number of registered clinical trials per million DALYs for each disease. This metric reveals whether translational research investment is proportionate to disease burden. For example, type 2 diabetes has approximately 19,000 trials per million DALYs, while malaria has only 25. For the Unified Score, intensity is inverted so that diseases with fewer trials per unit of burden receive higher neglect scores.
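The arithmetic behind this metric can be checked with a short Python sketch using the trial and DALY figures quoted on this page. The intensity calculation follows the definition above; the inversion step is a hypothetical choice (min-max scaling on a log10 scale), since the exact transformation used for the Unified Score is not specified here.

```python
import math


def trials_per_million_dalys(trials: int, dalys: float) -> float:
    """Research intensity: registered trials per one million DALYs."""
    if dalys <= 0:
        raise ValueError("dalys must be positive")
    return trials / (dalys / 1_000_000)


def inverted_scores(intensities: dict[str, float]) -> dict[str, float]:
    """Hypothetical inversion: lowest intensity maps to 100, highest to 0.

    Assumption: log10 min-max scaling; the page does not state the method.
    All intensities must be positive.
    """
    logs = {name: math.log10(value) for name, value in intensities.items()}
    low, high = min(logs.values()), max(logs.values())
    if high == low:
        return {name: 0.0 for name in logs}
    return {name: 100.0 * (high - v) / (high - low) for name, v in logs.items()}


# Figures quoted on this page.
t2d = trials_per_million_dalys(47_892, 2_500_000)  # about 19,157
lf = trials_per_million_dalys(89, 1_300_000)       # about 68
raw_count_ratio = 47_892 / 89                      # about 538
per_daly_ratio = t2d / lf                          # about 280

scores = inverted_scores({
    "type 2 diabetes": t2d,
    "lymphatic filariasis": lf,
    "malaria": 25.0,  # per-million figure quoted in the text
})
```

Note that the 538-fold figure is a ratio of raw trial counts; expressed per unit of burden, the same two diseases differ by roughly 280-fold.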

Semantic Isolation Index (SII) — Knowledge dimension
Measures how disconnected a disease's research literature is from the rest of biomedical science. For each disease, we compute a representative "fingerprint" (centroid) from all its PubMed abstracts using PubMedBERT, a language model trained on biomedical text. We then measure how distant that fingerprint is from the 100 most similar diseases. Diseases with higher SII values have research that "speaks its own language," cut off from the methods, concepts, and findings of mainstream biomedicine. NTDs are 40% more isolated than other diseases (Cohen's d = 1.80).
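A minimal Python sketch of the centroid-and-neighbour idea follows. It assumes cosine distance and a simple mean over the k nearest disease centroids; the text does not state the exact distance metric or aggregation, so both are illustrative choices. In practice the vectors would be 768-dimensional PubMedBERT embeddings; the function names here are hypothetical.

```python
import math
from typing import Mapping, Sequence

Vector = Sequence[float]


def centroid(embeddings: Sequence[Vector]) -> list[float]:
    """Mean vector ("fingerprint") of a disease's abstract embeddings."""
    n = len(embeddings)
    if n == 0:
        raise ValueError("need at least one embedding")
    return [sum(column) / n for column in zip(*embeddings)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity (assumed metric; not stated in the text)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / norm


def semantic_isolation(disease: str, centroids: Mapping[str, Vector],
                       k: int = 100) -> float:
    """Mean distance from one centroid to its k nearest other centroids."""
    target = centroids[disease]
    distances = sorted(
        cosine_distance(target, vec)
        for name, vec in centroids.items()
        if name != disease
    )
    nearest = distances[:k]  # fewer than k are used if fewer exist
    if not nearest:
        raise ValueError("need at least two diseases")
    return sum(nearest) / len(nearest)
```

Under these assumptions, a disease whose centroid sits far from every other centroid gets a larger value, which matches the intuition of a literature that is "cut off" from its neighbours.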

Knowledge Transfer Potential (KTP)
How much a disease could benefit from advances in related fields. Computed as the similarity between a disease's research fingerprint and its closest neighbours. Diseases with low KTP have few related conditions whose progress might "spill over" into new treatments or understanding.

Equity Alignment Score (EAS, 0–100) — per biobank
Rates each biobank on how well its research portfolio matches global disease burden. A score of 100 would mean the biobank covers all high-burden diseases proportionately. The score penalises three things: the severity of research gaps (weighted 40%), the share of global burden in diseases the biobank barely studies (30%), and limited disease breadth (30%). Only 1 of 70 biobanks (UK Biobank, EAS = 84.6) achieves High equity (≥70); 55 score below 40 (Low).
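One way to read the 40/30/30 weighting is as a weighted penalty subtracted from 100. The Python sketch below takes that reading as an assumption: the exact formula is not given here, each penalty component is assumed to be pre-scaled to 0-100, and the label for the band between Low and High is hypothetical.

```python
# Stated weights: gap severity 40%, neglected burden share 30%, breadth 30%.
EAS_WEIGHTS = (0.40, 0.30, 0.30)


def equity_alignment_score(gap_severity: float, burden_missed: float,
                           breadth_penalty: float) -> float:
    """Assumed form: 100 minus a weighted sum of three 0-100 penalties."""
    components = (gap_severity, burden_missed, breadth_penalty)
    for value in components:
        if not 0.0 <= value <= 100.0:
            raise ValueError("each penalty component must be in [0, 100]")
    penalty = sum(w * c for w, c in zip(EAS_WEIGHTS, components))
    return 100.0 - penalty


def eas_category(score: float) -> str:
    """High (>= 70) and Low (< 40) are stated; the middle label is assumed."""
    if score >= 70:
        return "High"
    if score < 40:
        return "Low"
    return "Moderate"  # hypothetical name for the 40-70 band
```

Under this reading, a biobank with no penalties scores 100, and the reported UK Biobank value of 84.6 falls in the High band.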

Statistical Validation

Semantic Isolation: 17 WHO neglected tropical diseases show 40% higher isolation than other diseases (mean SII 0.00204 vs 0.00146, P < 0.0001, Cohen's d = 1.80).
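The reported figures can be sanity-checked with a short Python sketch. The relative difference uses the two group means quoted above; the `cohens_d` function shows the standard pooled-standard-deviation form of the effect size, which is assumed here because the text does not say which variant was used. No per-disease SII values are reproduced, so the function is shown without the study data.

```python
import math
import statistics
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two values")
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Group means quoted in the text.
MEAN_SII_NTD = 0.00204
MEAN_SII_OTHER = 0.00146

relative_increase = MEAN_SII_NTD / MEAN_SII_OTHER - 1.0  # about 0.40

# If d = 1.80 was computed with a pooled SD, that SD would be roughly:
implied_pooled_sd = (MEAN_SII_NTD - MEAN_SII_OTHER) / 1.80  # about 0.00032
```

The ratio of the two means works out to roughly 1.40, consistent with the "40% higher" headline figure.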

Dimension Orthogonality: A correlation of r = 0.07 between dimensions suggests they capture largely independent aspects of neglect.

Ranking Robustness: Systematic perturbation of each dimension's weight by ±20% across 51 schemes yields Spearman rho >0.975 for all pairwise rank comparisons. Additionally, 200 Dirichlet-distributed random weight vectors confirm mean rho >0.95.
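The general shape of such a robustness check can be sketched in Python using only the standard library. This is an illustration on synthetic scores, not a reproduction of the analysis: the 51 perturbation schemes are not enumerated here, so the sketch perturbs each weight by ±20% in turn (six schemes), and the Dirichlet concentration parameters (all 1.0) are an assumption.

```python
import random

BASE_WEIGHTS = (0.501, 0.293, 0.206)


def weighted_scores(data, weights):
    """Unified score per disease from (D, T, K) tuples."""
    return [sum(w * x for w, x in zip(weights, row)) for row in data]


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for m in range(i, j + 1):
            result[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def perturb(weights, index, factor):
    """Scale one weight, then renormalise so the weights sum to 1."""
    scaled = [w * factor if i == index else w for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


def dirichlet(alphas, rng):
    """Dirichlet sample via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def robustness_demo(seed=0, n_diseases=86, n_random=200):
    """Return (min rho over +/-20% schemes, mean rho over random weights)."""
    rng = random.Random(seed)
    # Synthetic stand-in data; NOT the study's dimension scores.
    data = [tuple(rng.uniform(0, 100) for _ in range(3)) for _ in range(n_diseases)]
    baseline = weighted_scores(data, BASE_WEIGHTS)
    perturbed = [
        spearman_rho(baseline, weighted_scores(data, perturb(BASE_WEIGHTS, i, f)))
        for i in range(3) for f in (0.8, 1.2)
    ]
    randomised = [
        spearman_rho(baseline, weighted_scores(data, dirichlet((1.0, 1.0, 1.0), rng)))
        for _ in range(n_random)
    ]
    return min(perturbed), sum(randomised) / len(randomised)
```

Calling `robustness_demo()` returns the two summary statistics for the synthetic data; the values will differ from those reported above because the inputs are random rather than the real dimension scores.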

PCA Justification: PC1 explains 63.3% of variance across 86 diseases with complete data. PC1+PC2 explain 86.9%.

Limitations
  • ClinicalTrials.gov: US/Western registration bias; many non-US trials unregistered.
  • PubMed: English-language bias; some diseases have incomplete MeSH coverage (lung cancer, Alzheimer's).
  • Biobank coverage: 70 IHCC biobanks may not represent all global biobank activity.
  • Temporal scope: Embeddings trained on 2000-2025 literature.
  • Causal claims: The framework measures association, not causation, between dimensions.

Extended Data Figures

Extended Data Figure 1 - Temporal Trends

Extended Data Figure 1. Temporal trends in research activity across the three HEIM dimensions, showing how discovery, translation, and knowledge patterns have evolved over time.

Extended Data Figure 2 - Sensitivity Analysis

Extended Data Figure 2. Sensitivity analysis of unified neglect rankings under systematic ±20% weight perturbation across 51 schemes (all Spearman rho > 0.975) and 200 Dirichlet-distributed random weight vectors (mean rho > 0.95).

Extended Data Figure 3 - Regional Comparison

Extended Data Figure 3. Regional comparison of research equity metrics across WHO regions, highlighting systematic disparities between high-income and low- and middle-income country settings.

Corpas M, Freidin MB, Valdivia-Silva J, Baker S, Fatumo S, Guio H. (2026). Three Dimensions of Neglect: How Biobanks, Clinical Trials, and Scientific Literature Systematically Underserve Global South Diseases. medRxiv. doi: 10.64898/2026.02.10.26346004