We humans can perceive things in at most four dimensions: length, height, width and time. Scientific data, however, are often much higher dimensional. Take the billions of neurons in the brain, for example. At any given moment, they are in a unique state, characterized by what an animal or human is seeing and thinking at that moment. Such a pattern of activity can be imagined as a point in a high-dimensional, if not billion-dimensional, space. Provided one can still imagine that… How can scientists work with it at all? How can we visualize data that lie in such a space?
The technical term is “low-dimensional embedding”
For this purpose, neuroscientists use machine learning algorithms. These transfer the data from the high-dimensional space, which cannot be visualized directly, into two or three dimensions. This way the data can not only be visualized but also worked with. During the transfer, it is important to make sure that the distances between any two data points in the lower-dimensional space are as similar as possible to the distances between the same points in the higher-dimensional space. The technical term for this transfer process is “low-dimensional embedding.” This technique is widely used in neuroscience to visualize the behavior of groups of neurons and in bioinformatics to map data from different types of cells. It can also reduce high-dimensional data to a few main factors that capture most of the variability between data points, so that they can be analyzed further. Scientists also often rely on it to support the claim that there are reliable clusters in the data simply by showing those clusters in the embedding, true to the motto “seeing is believing.”
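The requirement that distances be preserved can be made concrete with a small sketch. The Python snippet below is only an illustration under stated assumptions: the function names `pairwise_distances` and `stress`, the toy data, and the choice of a summed squared mismatch as the quality measure are all hypothetical and not taken from the paper. It compares two candidate two-dimensional layouts of the same five-dimensional points.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def stress(high, low):
    """Summed squared mismatch between high- and low-dimensional distances.

    Illustrative quality measure only: 0 means every pairwise distance
    is reproduced exactly in the low-dimensional layout.
    """
    d_high = pairwise_distances(high)
    d_low = pairwise_distances(low)
    return sum((a - b) ** 2 for a, b in zip(d_high, d_low))


# Toy data (assumed): four points in 5-D that happen to lie in a 2-D plane.
high = [(0, 0, 0, 0, 0), (3, 0, 0, 0, 0), (0, 4, 0, 0, 0), (3, 4, 0, 0, 0)]

good_embedding = [p[:2] for p in high]        # keeps both informative axes
poor_embedding = [(p[0], 0.0) for p in high]  # collapses the second axis

stress_good = stress(high, good_embedding)
stress_poor = stress(high, poor_embedding)
```

In this toy case the first layout reproduces every distance and therefore has zero mismatch, while the second distorts several distances and scores worse. Real embedding algorithms search for a layout that keeps such a mismatch small when no perfect layout exists.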
Noise points blur the result
ESI PhD student Jinke Liu and his research group leader Martin Vinck have now found a way to improve low-dimensional embedding techniques. In their recent paper in the scientific journal PLoS Computational Biology, they first show that a particular problem arises in high-dimensional spaces. Imagine there are some very reliable clusters in the data, clouds of points that are very densely packed. At the same time, there are scattered, isolated noise points which do not belong to any cluster. One would expect that with low-dimensional embedding, the noise points would remain separate from the clusters and not intrude on them. However, Jinke Liu and Martin Vinck show that this is not the case: the noise points start to invade the clusters. They explain this by the mathematical properties of the low-dimensional embedding techniques, which derive from the principle of attraction and repulsion. To ensure that the distances in the low-dimensional space are maximally similar to the pairwise distances in the high-dimensional space, a data point should move towards another point that is close in high-dimensional space (attraction), but it should move away from a point that is far away in high-dimensional space (repulsion). Since the noise points are all very far away from each other, they start to repel each other – and invade the clusters…
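The attraction and repulsion principle can be sketched in a few lines of Python. This is a simplified, hypothetical illustration and not the update rule of any particular embedding algorithm or of the paper: the function `force_on`, the three-point layout and the target distances are all assumptions made for the example.

```python
import math


def force_on(i, positions, target):
    """Net 2-D force on point i from all other points (illustrative only).

    A pair that sits farther apart in the layout than its target distance
    attracts; a pair that sits closer than its target distance repels.
    """
    fx = fy = 0.0
    xi, yi = positions[i]
    for j, (xj, yj) in enumerate(positions):
        if j == i:
            continue
        dx, dy = xj - xi, yj - yi
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue  # direction is undefined for coincident points
        # Positive pull means attraction towards j, negative means repulsion.
        pull = dist - target[i][j]
        fx += pull * dx / dist
        fy += pull * dy / dist
    return fx, fy


# Assumed current layout: three points evenly spaced on a line.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

# Assumed high-dimensional distances: points 0 and 1 are close neighbours,
# point 2 is far away from both.
target = [
    [0.0, 0.2, 5.0],
    [0.2, 0.0, 5.0],
    [5.0, 5.0, 0.0],
]

force_on_point_1 = force_on(1, positions, target)
```

Here the middle point is pulled towards its close neighbour and pushed away from the distant point, so the net force points to the left. With many scattered noise points that are all far from one another, the repulsive terms dominate in the same way.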
Distance has to be quantified differently
Jinke Liu and Martin Vinck now show that there is an elegant trick to avoid this problem. They call it the “distance-of-distance transformation,” and it changes the way distance is quantified. Instead of taking the raw Euclidean distance between two data points, they calculate how similar the two points are in terms of the distances to their respective neighbors. The idea is that noise points are now more attracted to each other because they tend to have fairly similar distances to their respective neighbors. The two researchers provide a principled mathematical argument for why this works particularly well in high-dimensional spaces and show that it vastly improves low-dimensional embedding techniques.
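A rough sketch of this idea in Python follows. It is one possible reading of the description above, not the authors' implementation; the exact definition is given in the cited paper. The function names, the neighbourhood size `k` and the two-dimensional toy data are assumptions chosen to keep the example small.

```python
import math


def neighbor_profile(i, points, k):
    """Sorted distances from point i to its k nearest neighbours."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def distance_of_distance(points, k):
    """Pairwise Euclidean distance between neighbour-distance profiles.

    Assumed form of the transformation: two points count as close when
    their distances to their own nearest neighbours look alike.
    """
    n = len(points)
    profiles = [neighbor_profile(i, points, k) for i in range(n)]
    return [
        [math.dist(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy data (assumed): a tight cluster of four points plus two isolated
# noise points that lie far from the cluster and far from each other.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
noise = [(10.0, 0.0), (0.0, 10.0)]
points = cluster + noise

raw_noise_distance = math.dist(points[4], points[5])
dod = distance_of_distance(points, k=2)
dod_noise_distance = dod[4][5]    # between the two noise points
dod_cluster_to_noise = dod[0][4]  # between a cluster point and a noise point
```

In this toy example the two noise points are about 14 units apart in raw Euclidean distance, yet their neighbour-distance profiles are nearly identical, so their transformed distance is close to zero. Both remain far from the cluster points, whose nearest neighbours are very close by.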
Important for Big Data
The work is relevant for data science in general, but especially for neuroscience and bioinformatics. Many researchers in Martin Vinck’s research group routinely use low-dimensional embedding techniques to cluster data, for example to isolate different types of cells from recordings or to study how a population of neurons encodes information and distinguishes between various visual stimuli an animal sees. Jinke Liu is completing a second thesis applying low-dimensional embedding to study how neural codes change over time with experience. The technique might also have broad applications in bioinformatics, for example, to understand how many different cell classes are in a brain area and how they differ between species.
Liu J, Vinck M (2022). Improved visualization of high-dimensional data using the distance-of-distance transformation. PLoS Computational Biology 18(12): e1010764. https://doi.org/10.1371/journal.pcbi.1010764