PARE – a framework for removal of confounding effects from any distance-based dimension reduction method

Imagine you have a complex dataset with lots of dimensions, like a massive library of books each with hundreds of attributes. To understand the patterns in this data, you might use tools like t-SNE or UMAP, which help reduce the number of dimensions while preserving the important similarities and relationships between the data points. This is like organizing the library so that books on similar topics are grouped together on a few shelves, making it easier to see patterns.
However, a challenge with these tools is that they often don’t distinguish between the patterns you care about and the ones you don’t. For example, if some books are grouped together because they have the same color covers, but you’re more interested in grouping them by topic, the color is a confounder—it confuses the patterns you’re trying to study.
To tackle this problem, a team led by researchers at the Medical University of South Carolina has developed a new framework called PARE, which stands for partial embedding. This framework can be added to any distance-based dimension reduction method, like t-SNE or UMAP, to remove the unwanted effects of confounders. So, using our library analogy, PARE helps to organize the books by topic, removing the influence of their cover color.
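To make the idea of "removing a confounder from a distance-based method" concrete, here is a minimal sketch in Python. It is not the PARE algorithm from the paper or its repository; it is one simple, well-known way to strip a linear confounder effect from a distance matrix, written under the assumption that the distances are Euclidean. The function names `pairwise_dist` and `remove_confounder`, and the toy data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def pairwise_dist(X):
    """Euclidean distance between every pair of rows in X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


def remove_confounder(D, Z):
    """Return distances with the linear effect of confounders Z projected out.

    Illustrative assumption, not the published PARE method: the squared
    distances are double-centered into a Gram matrix (as in classical MDS),
    the part explained by Z is removed, and the result is mapped back
    to distances.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                # Gram matrix of centered data
    Zc = np.column_stack([np.ones(n), Z])      # confounders plus intercept
    R = np.eye(n) - Zc @ np.linalg.pinv(Zc)    # residual-maker matrix
    G_res = R @ G @ R                          # Gram matrix of residuals
    d = np.diag(G_res)
    D2 = d[:, None] + d[None, :] - 2.0 * G_res
    D_adj = np.sqrt(np.clip(D2, 0.0, None))    # clip tiny negative round-off
    np.fill_diagonal(D_adj, 0.0)
    return D_adj


# Toy data: 40 points in 5 dimensions, where batch 1 is shifted by 10 units.
rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5))
X[batch == 1] += 10.0

D = pairwise_dist(X)
D_adj = remove_confounder(D, batch[:, None].astype(float))

between = np.ix_(batch == 0, batch == 1)
print("mean between-batch distance before:", D[between].mean())
print("mean between-batch distance after: ", D_adj[between].mean())
```

An adjusted matrix like `D_adj` could then be handed to a method that accepts precomputed distances, for example scikit-learn's `TSNE(metric="precomputed", init="random")`. Working on the distance matrix rather than on the raw features is what lets this kind of adjustment sit in front of any distance-based method.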
Partial embeddings remove batch and donor effects in single-cell RNA-sequencing measurements aggregated from four studies

Embeddings and partial embeddings are compared in data from 13,369 human pancreatic cells. The original count data are log-normalized and reduced to 2,000 highly variable genes. Local inverse Simpson’s index (LISI) and average silhouette width (ASW) are computed for each cell for batch (bLISI, bASW) and cell type (cLISI, cASW), with the median, 2.5% quantile, and 97.5% quantile shown. Higher bLISI and lower bASW indicate greater integration across batches. Lower cLISI and higher cASW indicate greater separation between cell types. Partial t-SNE (p-t-SNE) and partial UMAP (p-UMAP) adjust for either batch or donor effects. We compare our new methodology to the existing projected t-SNE for batch correction (BC-t-SNE). All t-SNE embeddings use a perplexity of 10, and UMAP embeddings use 15 nearest neighbors.
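For readers unfamiliar with the two scores in the caption, the sketch below shows how per-cell values of this kind can be computed and summarized on a toy 2-D embedding. It is an illustration under stated assumptions, not the authors' evaluation code: the silhouette follows the standard definition, but the inverse Simpson's index here uses a plain, unweighted k-nearest-neighbor count, whereas the published LISI weights neighbors with a perplexity-based kernel. The function names, the toy data, and the choice of `k` are all hypothetical.

```python
import numpy as np


def pairwise_dist(Y):
    """Euclidean distance between every pair of rows in Y."""
    return np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)


def silhouette_widths(D, labels):
    """Per-point silhouette width from a distance matrix and group labels."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    s = np.zeros(D.shape[0])
    for i in range(D.shape[0]):
        same = labels == labels[i]
        if same.sum() < 2:
            continue                                # singleton group scores 0
        a = D[i, same].sum() / (same.sum() - 1)     # own group, excluding self
        b = min(D[i, labels == g].mean() for g in groups if g != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


def knn_inverse_simpson(D, labels, k=15):
    """Simplified LISI: inverse Simpson's index of labels among k neighbors."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    D = D.copy()
    np.fill_diagonal(D, np.inf)                     # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]
    out = np.empty(D.shape[0])
    for i in range(D.shape[0]):
        p = np.array([(labels[nn[i]] == g).mean() for g in groups])
        out[i] = 1.0 / np.sum(p ** 2)
    return out


def summarize(values):
    """2.5% quantile, median, and 97.5% quantile, as reported in the caption."""
    return np.quantile(values, [0.025, 0.5, 0.975])


# Toy embedding: two well-separated cell types, with batches mixed inside each.
rng = np.random.default_rng(0)
cell_type = np.repeat([0, 1], 50)
batch = np.tile([0, 1], 50)
Y = rng.normal(size=(100, 2))
Y[cell_type == 1] += 20.0
D = pairwise_dist(Y)

b_lisi = knn_inverse_simpson(D, batch)
c_lisi = knn_inverse_simpson(D, cell_type)
b_asw = silhouette_widths(D, batch)
c_asw = silhouette_widths(D, cell_type)

for name, v in [("bLISI", b_lisi), ("cLISI", c_lisi), ("bASW", b_asw), ("cASW", c_asw)]:
    print(name, summarize(v))
```

With two groups, the simplified index ranges from 1 (all neighbors share one label) to 2 (neighbors split evenly), which is why a well-integrated embedding shows high values for batch and low values for cell type.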
The researchers applied this PARE framework to both genomic data and neuroimaging data. In genomic data, for example, they used it to handle single-cell sequencing data, which often suffers from batch effects—unwanted variations that arise when samples are processed at different times or in different batches. PARE helped to remove these batch effects, allowing the true biological patterns to stand out more clearly.
In neuroimaging data, PARE was used to separate the clinical information (like disease status) from technical variability (differences due to how the data was collected or processed). This separation helps researchers focus on the meaningful clinical patterns without being misled by technical noise.
In essence, PARE enhances dimension reduction methods, making them more powerful tools for uncovering genuine biological insights by filtering out the noise and confounding effects. This advancement opens up new possibilities for researchers working with complex biological data, enabling them to better understand the underlying patterns and make more accurate scientific discoveries.
Availability – The code for PARE is available on GitHub: https://github.com/andy1764/PARE.

Chen AA, Clark K, Dewey BE, DuVal A, Pellegrini N, Nair G, Jalkh Y, Khalil S, Zurawski J, Calabresi PA, Reich DS, Bakshi R, Shou H, Shinohara RT; Alzheimer’s Disease Neuroimaging Initiative, and North American Imaging in Multiple Sclerosis Cooperative. (2024) PARE: A framework for removal of confounding effects from any distance-based dimension reduction method. PLoS Comput Biol 20(7):e1012241.
