How Biobanks, Clinical Trials, and Scientific Literature Systematically Underserve Global South Diseases
HEIM measures compound neglect across discovery, translation, and knowledge, revealing how disadvantage accumulates at every stage of the research pipeline.
Figure 1. The HEIM framework quantifies research inequity across three independent dimensions: Discovery (biobank gaps), Translation (clinical trial equity), and Knowledge (semantic isolation).
Unified = 0.501 × Discovery + 0.293 × Translation + 0.206 × Knowledge
PCA-derived unified score combining biobank gaps, clinical trial equity, and semantic isolation.
Research investment follows wealth, not disease burden.
Figure 2. Discovery dimension analysis. Biobank research gaps across 175 GBD disease categories, showing disease coverage and publication intensity for 70 IHCC biobanks.
Figure 3. Translation dimension analysis. Clinical trial equity across 563,725 studies, showing geographic distribution and disease-specific research intensity relative to burden.
Figure 4. Disease research semantic landscape. UMAP projection of 175 GBD Level 3 diseases based on PubMedBERT embeddings of 13.1M abstracts. Point size reflects research volume; colour indicates Semantic Isolation Index (SII). Neglected tropical diseases cluster at the periphery, far from the dense core of well-studied conditions.
Geographic concentration is less extreme in clinical trials (2.5:1 vs 57.8:1 HIC:LMIC), but disease priorities are more skewed: lymphatic filariasis has 89 trials for 1.3M DALYs, while type 2 diabetes has 47,892 trials for 2.5M DALYs. That is a 538-fold disparity in trial counts for comparable disease burden.
Adjust weights, model scenarios, and explore the data.
Modify the PCA-derived dimension weights and watch rankings update in real time.
Unified = 0.501 × D_norm + 0.293 × T_norm + 0.206 × K_norm
Model "what if" scenarios to see how research investment changes would shift disease rankings.
50% more clinical trials for all 17 WHO neglected tropical diseases
Double trial intensity for highly neglected diseases
Double clinical trial intensity for infectious diseases
Redistribute 30% of excess cancer trials to neglected diseases
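One way to picture a scenario is as a multiplier on trial counts for a subset of diseases, followed by a recomputation of trials per million DALYs. The sketch below applies the first scenario (50% more trials for neglected tropical diseases) to the two diseases quoted on this page. The `Disease` record, the `is_ntd` flag, and the function names are illustrative assumptions, not the dashboard's actual code.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Disease:
    name: str
    trials: int
    dalys_millions: float
    is_ntd: bool  # hypothetical flag: one of the 17 WHO neglected tropical diseases

def intensity(d: Disease) -> float:
    """Registered trials per million DALYs."""
    return d.trials / d.dalys_millions

def boost_ntd_trials(diseases, factor=1.5):
    """Scenario sketch: multiply trial counts by `factor` for NTDs only."""
    return [
        replace(d, trials=round(d.trials * factor)) if d.is_ntd else d
        for d in diseases
    ]

# Figures quoted on this page; DALYs are rounded, so results are approximate.
baseline = [
    Disease("Lymphatic filariasis", 89, 1.3, True),
    Disease("Type 2 diabetes", 47_892, 2.5, False),
]
scenario = boost_ntd_trials(baseline)

for before, after in zip(baseline, scenario):
    print(f"{before.name}: {intensity(before):.1f} -> {intensity(after):.1f} trials per million DALYs")
```

Even with the boost, lymphatic filariasis stays far below type 2 diabetes on this measure, which is why rankings tend to move slowly under modest scenarios.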
| # | Disease | Score |
|---|---|---|

| # | Disease | Score | Δ |
|---|---|---|---|
Select 2-5 biobanks for side-by-side comparison. Grouped by WHO region.
| Metric |
|---|
Data sources, statistical validation, and how to cite this work.
Biobanks (Discovery): Publication data from PubMed for 70 biobanks in the International HundredK+ Cohorts Consortium (IHCC). Disease coverage mapped to GBD 2021 taxonomy.
Clinical Trials (Translation): 563,725 studies from ClinicalTrials.gov via the AACT database (2000-2025). 2,189,930 disease-trial mappings across 770,178 trial sites in 194 countries.
PubMed Semantic (Knowledge): 13.1M papers with PubMedBERT embeddings (768-dimensional). 175 diseases analysed for semantic isolation, knowledge transfer potential, and research clustering.
Disease Burden: Global Burden of Disease Study 2021 (IHME). 175 disease categories mapped to GBD Level 3 causes. 30 Global South Priority diseases identified.
Unified Neglect Score (0–100)
A single number summarising how neglected a disease is across all three research stages. It combines the Gap Score (biobank research gaps), clinical trial equity, and semantic isolation using weights derived from principal component analysis: Discovery receives the largest weight (0.501) because biobank-stage neglect explains the most variance, followed by Translation (0.293) and Knowledge (0.206). A score of 0 means no measurable neglect; higher scores indicate compounding disadvantage across multiple dimensions.
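As a minimal sketch, the weighted sum can be written as below. It assumes each dimension has already been normalised to a 0–100 scale on which higher means more neglected; the normalisation itself is not specified here. The function name and the renormalisation of custom weights are illustrative choices.

```python
# PCA-derived weights as stated on this page.
WEIGHTS = {"discovery": 0.501, "translation": 0.293, "knowledge": 0.206}

def unified_score(d_norm, t_norm, k_norm, weights=WEIGHTS):
    """Weighted sum of the three normalised dimension scores.

    Assumption: inputs are on a 0-100 scale, higher = more neglected.
    Weights are renormalised so that custom weights need not sum to 1.
    """
    total = sum(weights.values())
    return (
        weights["discovery"] * d_norm
        + weights["translation"] * t_norm
        + weights["knowledge"] * k_norm
    ) / total

# Hypothetical disease with dimension scores 80, 60, and 40.
print(round(unified_score(80, 60, 40), 2))  # 65.9

# Equal weights, as one might try when adjusting the sliders.
equal = {"discovery": 1, "translation": 1, "knowledge": 1}
print(round(unified_score(80, 60, 40, equal), 2))  # 60.0
```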
Gap Score (0–100) — Discovery dimension
Measures how far a disease's biobank research output falls short of what its global burden would warrant. Diseases with zero publications across all 70 biobanks receive the maximum penalty (95). Diseases with some publications are scored based on whether their research volume is proportionate to their share of global DALYs, with stricter thresholds for infectious and neglected tropical diseases. An additional penalty (+10) is applied to Global South priority diseases with fewer than 50 publications. Categories: Critical (>70), High (50–70), Moderate (30–50), Low (<30).
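The category bands can be expressed as a small lookup. The stated ranges share their endpoints (50 appears in both High and Moderate), so the sketch below assumes that a boundary value falls into the higher-severity band unless the text rules it out; that tie-breaking rule is an assumption, not something stated by the authors.

```python
def gap_category(score: float) -> str:
    """Map a Gap Score (0-100) to its category.

    Assumed boundary handling: Critical is strictly above 70 and Low is
    strictly below 30 (both as stated); a score of exactly 50 is treated
    as High, which is this sketch's own choice.
    """
    if score > 70:
        return "Critical"
    if score >= 50:
        return "High"
    if score >= 30:
        return "Moderate"
    return "Low"

for s in (95, 70, 50, 30, 12.5):
    print(s, gap_category(s))
```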
Research Intensity — Translation dimension
The number of registered clinical trials per million DALYs for each disease. This metric reveals whether translational research investment is proportionate to disease burden. For example, type 2 diabetes has approximately 19,000 trials per million DALYs, while malaria has only 25. For the Unified Score, intensity is inverted so that diseases with fewer trials per unit of burden receive higher neglect scores.
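The arithmetic behind this metric is simple enough to check by hand. The sketch below uses the trial counts and DALY figures quoted on this page for type 2 diabetes and lymphatic filariasis. Because the DALY values are rounded, the results are approximate. It also shows that the ratio of raw trial counts differs from the ratio of burden-adjusted intensities.

```python
def trials_per_million_dalys(trials: int, dalys_millions: float) -> float:
    """Research intensity: registered trials per million DALYs."""
    return trials / dalys_millions

t2d = trials_per_million_dalys(47_892, 2.5)  # type 2 diabetes
lf = trials_per_million_dalys(89, 1.3)       # lymphatic filariasis

count_ratio = 47_892 / 89    # ratio of raw trial counts
intensity_ratio = t2d / lf   # ratio after adjusting for burden

print(f"T2D: {t2d:.0f} trials per million DALYs")
print(f"LF:  {lf:.1f} trials per million DALYs")
print(f"count ratio ~{count_ratio:.0f}x, intensity ratio ~{intensity_ratio:.0f}x")
```

The count ratio is about 538, while the burden-adjusted ratio is about 280, since type 2 diabetes also carries roughly twice the DALYs in these figures.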
Semantic Isolation Index (SII) — Knowledge dimension
Measures how disconnected a disease's research literature is from the rest of biomedical science. For each disease, we compute a representative "fingerprint" (centroid) from all its PubMed abstracts using PubMedBERT, a language model trained on biomedical text. We then measure how distant that fingerprint is from the 100 most similar diseases. Diseases with higher SII values have research that "speaks its own language," cut off from the methods, concepts, and findings of mainstream biomedicine. NTDs are 40% more isolated than other diseases (Cohen's d = 1.80).
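The centroid-and-neighbours idea can be sketched in a few lines. The text does not state the distance metric or how distances to the nearest diseases are aggregated, so the code below assumes cosine distance and a simple mean. It uses toy 2-dimensional vectors in place of 768-dimensional PubMedBERT embeddings, and all names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a disease's abstract embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def isolation_index(name, centroids, k=100):
    """Mean distance from one disease centroid to its k nearest others.

    Assumptions: cosine distance and mean aggregation (not stated in the text).
    """
    target = centroids[name]
    dists = sorted(
        cosine_distance(target, c)
        for other, c in centroids.items()
        if other != name
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

# Toy data: A and B point in similar directions, C points elsewhere.
centroids = {
    "A": centroid([[1.0, 0.0], [1.0, 0.2]]),
    "B": [1.0, 0.2],
    "C": [0.0, 1.0],
}
for name in centroids:
    print(name, round(isolation_index(name, centroids, k=2), 3))
```

In this toy example C receives the highest value because its centroid is far from both of the others.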
Knowledge Transfer Potential (KTP)
How much a disease could benefit from advances in related fields. Computed as the similarity between a disease's research fingerprint and its closest neighbours. Diseases with low KTP have few related conditions whose progress might "spill over" into new treatments or understanding.
Equity Alignment Score (EAS, 0–100) — per biobank
Rates each biobank on how well its research portfolio matches global disease burden. A score of 100 would mean the biobank covers all high-burden diseases proportionately. The score penalises three things: the severity of research gaps (weighted 40%), the share of global burden in diseases the biobank barely studies (30%), and limited disease breadth (30%). Only 1 of 70 biobanks (UK Biobank, EAS = 84.6) achieves High equity (≥70); 55 score below 40 (Low).
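The text gives the three penalty weights but not the exact formula, so the following is only one plausible reading: each penalty is expressed on a 0–100 scale and the weighted sum is subtracted from 100. The functional form, the parameter names, and the "Moderate" label for the middle band are all assumptions made for illustration.

```python
def equity_alignment_score(gap_severity, burden_missed, breadth_penalty):
    """Assumed form: 100 minus a 40/30/30 weighted sum of penalties.

    Each argument is a hypothetical penalty on a 0-100 scale:
      gap_severity    - severity of research gaps (weight 0.40)
      burden_missed   - share of global burden in barely studied diseases (0.30)
      breadth_penalty - limited disease breadth (0.30)
    """
    penalty = 0.40 * gap_severity + 0.30 * burden_missed + 0.30 * breadth_penalty
    return 100.0 - penalty

def equity_band(eas):
    """High (>= 70) and Low (< 40) are as stated; the middle label is assumed."""
    if eas >= 70:
        return "High"
    if eas < 40:
        return "Low"
    return "Moderate"

print(equity_band(84.6))  # High (the UK Biobank value quoted above)
print(equity_band(equity_alignment_score(80, 70, 60)))
```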
Semantic Isolation: 17 WHO neglected tropical diseases show 40% higher isolation than other diseases (mean SII 0.00204 vs 0.00146, P < 0.0001, Cohen's d = 1.80).
Dimension Orthogonality: A correlation of r = 0.07 between dimensions indicates that they measure largely independent aspects of neglect.
Ranking Robustness: Systematic perturbation of each dimension's weight by ±20% across 51 schemes yields Spearman rho >0.975 for all pairwise rank comparisons. Additionally, 200 Dirichlet-distributed random weight vectors confirm mean rho >0.95.
PCA Justification: PC1 explains 63.3% of variance across 86 diseases with complete data. PC1+PC2 explain 86.9%.
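The ranking-robustness check described above can be illustrated with a short simulation. This is a sketch on synthetic data, not a reproduction of the reported analysis: it perturbs one weight at a time by ±20% (6 schemes rather than the 51 reported, whose construction is not described here), compares each ranking only against the baseline, and draws random weights from a symmetric Dirichlet with concentration 1, which is an assumed parameter.

```python
import random

BASE = (0.501, 0.293, 0.206)  # PCA-derived weights from this page

def ranks(values):
    """Rank positions (1 = smallest). Ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman_rho(x, y):
    """Spearman rank correlation using the no-ties formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def scores(rows, weights):
    """Weighted score per disease, with weights renormalised to sum to 1."""
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, row)) / total for row in rows]

def dirichlet(alphas, rng):
    """Dirichlet draw built from normalised gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

rng = random.Random(0)
# Synthetic stand-in for 86 diseases with three normalised dimension scores.
rows = [[rng.uniform(0, 100) for _ in range(3)] for _ in range(86)]
baseline = scores(rows, BASE)

perturbed = []
for i in range(3):
    for factor in (0.8, 1.2):
        w = list(BASE)
        w[i] *= factor
        perturbed.append(spearman_rho(baseline, scores(rows, w)))

random_rhos = [
    spearman_rho(baseline, scores(rows, dirichlet([1.0, 1.0, 1.0], rng)))
    for _ in range(200)
]
print(f"min rho under +/-20% perturbation: {min(perturbed):.3f}")
print(f"mean rho over 200 Dirichlet draws: {sum(random_rhos) / len(random_rhos):.3f}")
```

The values printed for synthetic data will differ from the published ones; the point is the procedure of re-ranking under altered weights and comparing rank orders.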
Extended Data Figure 1. Temporal trends in research activity across the three HEIM dimensions, showing how discovery, translation, and knowledge patterns have evolved over time.
Extended Data Figure 2. Sensitivity analysis of unified neglect rankings under systematic ±20% weight perturbation across 51 schemes (all Spearman rho > 0.975) and 200 Dirichlet-distributed random weight vectors (mean rho > 0.95).
Extended Data Figure 3. Regional comparison of research equity metrics across WHO regions, highlighting systematic disparities between high-income and low- and middle-income country settings.
Corpas M, Freidin MB, Valdivia-Silva J, Baker S, Fatumo S, Guio H. (2026). Three Dimensions of Neglect: How Biobanks, Clinical Trials, and Scientific Literature Systematically Underserve Global South Diseases. medRxiv. doi: 10.64898/2026.02.10.26346004