Pixel-based clustering

In this tutorial, we will focus on pixel-based clustering. It assumes that you have already registered and pre-processed your data.

[1]:
import numpy as np
import seaborn as sns
from skimage.transform import downscale_local_mean

import spatiomic as so
[2]:
# read the example data
img_data = so.data.read().read_tiff("./data/example.tif")
# downscale by a factor of 4 in x and y and threshold to a 0/255 example image
img_data = (downscale_local_mean(img_data, (4, 4, 1)) + 1 > 0).astype(float) * 255

Let’s take a look at the data:

[3]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3)

for i in range(3):
    ax[i].imshow(img_data[:, :, i], cmap="gray")
    ax[i].set_title(f"Channel {i}")
    ax[i].axis("off")

plt.tight_layout()
../_images/tutorials_clustering_4_0.png

SOM training

First, we create our self-organizing map. The node count should scale with the size of your dataset and the complexity of your data. For biological data, it is advisable to use the euclidean, cosine, or correlation (Pearson correlation) distance metric.

[4]:
som = so.dimension.som(
    node_count=(30, 30),
    dimension_count=img_data.shape[-1],
    distance_metric="euclidean",
    use_gpu=False,
    seed=42,
)
WARNING: CuPy could not be imported
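
As a rough starting point for node_count, a common SOM heuristic (not a spatiomic requirement) is to use about 5·√n map units for n data points and arrange them on a near-square grid. A minimal sketch, assuming the pixel count of img_data is the number of data points:

import numpy as np

pixel_count = img_data.shape[0] * img_data.shape[1]   # number of data points (pixels)
unit_count = 5 * np.sqrt(pixel_count)                 # rule-of-thumb number of map units
side_length = int(np.ceil(np.sqrt(unit_count)))
suggested_node_count = (side_length, side_length)     # e.g. pass this as node_count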

Next, we train our SOM on the data. Depending on the size of the data and the node count, this may take a while. To speed up training, consider training on a random subsample of your data. This can be achieved via so.data.subsample().fit_transform(data, method="fraction", fraction=0.1) or so.data.subsample().fit_transform(data, method="count", count=10**9).

[5]:
img_data_subsample = so.data.subsample().fit_transform(img_data, method="fraction", fraction=0.5, seed=42)
[6]:
som.fit(img_data_subsample, iteration_count=100)
 53%|█████▎    | 53/100 [00:00<00:00, 134.43it/s]
/Users/au734063/Documents/code/spatiomic/.venv/lib/python3.12/site-packages/xpysom/xpysom.py:413: RuntimeWarning: divide by zero encountered in divide
  self._numerator_gpu / self._denominator_gpu,
/Users/au734063/Documents/code/spatiomic/.venv/lib/python3.12/site-packages/xpysom/xpysom.py:413: RuntimeWarning: invalid value encountered in divide
  self._numerator_gpu / self._denominator_gpu,
100%|██████████| 100/100 [00:00<00:00, 117.93it/s]

Graph clustering

After our self-organizing map finishes training, we can use its nodes, which now summarize the structure of our dataset, to create a kNN graph. We then perform Leiden clustering on this graph.

[7]:
knn_graph = so.neighbor.knn_graph().create(som, neighbor_count=100, distance_metric="euclidean", use_gpu=False)
clusters, modularity = so.cluster.leiden().predict(
    knn_graph, resolution=0.5, seed=42, use_gpu=False, iteration_count=10
)
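
Under the hood, this step is conceptually similar to the sketch below, which uses scikit-learn, python-igraph, and leidenalg directly. spatiomic wraps this for us, and its exact graph construction, edge weighting, and defaults may differ; node_weights here is a random stand-in for the trained SOM node weights.

import igraph as ig
import leidenalg as la
import numpy as np
from sklearn.neighbors import kneighbors_graph

# stand-in for the trained SOM node weights, shape (node_count, channel_count)
node_weights = np.random.rand(900, 3)

# connect every node to its 100 nearest neighbours in channel space
adjacency = kneighbors_graph(node_weights, n_neighbors=100, metric="euclidean")
sources, targets = adjacency.nonzero()
graph = ig.Graph(n=node_weights.shape[0], edges=list(zip(sources.tolist(), targets.tolist())))
graph.simplify()  # drop duplicate edges from the symmetrised kNN relation

# Leiden clustering with a resolution comparable to the call above
partition = la.find_partition(
    graph,
    la.RBConfigurationVertexPartition,
    resolution_parameter=0.5,
    seed=42,
)
node_clusters = np.asarray(partition.membership)  # one cluster label per SOM node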

Plotting & evaluation

Now that we have identified clusters for each SOM node, we can use our SOM class to assign the cluster label of the closest SOM node to each pixel in the original data.

[8]:
labelled_data = som.label(img_data, clusters, save_path="./data/example_labelled.npy")
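
Conceptually, this labelling step assigns to each pixel the cluster of its nearest SOM node in channel space. A minimal numpy sketch of the idea, where node_weights is a hypothetical stand-in for the trained node weights and clusters is the per-node Leiden assignment from above (som.label handles this, including batching for large images):

import numpy as np

# hypothetical stand-in for the trained SOM node weights, shape (node_count, channel_count)
node_weights = np.random.rand(900, img_data.shape[-1])

pixels = img_data.reshape(-1, img_data.shape[-1])        # (pixel_count, channel_count)
distances = np.linalg.norm(pixels[:, None, :] - node_weights[None, :, :], axis=-1)
nearest_node = distances.argmin(axis=1)                  # index of the closest node per pixel
pixel_clusters = np.asarray(clusters)[nearest_node].reshape(img_data.shape[:2])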

We can use the plot submodule to visualize the data:

[9]:
colormap = so.plot.colormap(color_count=np.array(clusters).max() + 1, flavor="glasbey")

fig = so.plot.cluster_image(
    labelled_data,
    colormap=colormap,
)
/var/folders/vg/vt1jjh256dn1g4sd6pknwxbs_qmnws/T/ipykernel_71212/2329922638.py:1: UserWarning: The glasbey color palette is part of colorcet and distributed under the Creative Commons Attribution 4.0 International Public License (CC-BY).
  colormap = so.plot.colormap(color_count=np.array(clusters).max() + 1, flavor="glasbey")
../_images/tutorials_clustering_18_1.png

We can also plot the locations of individual clusters:

[10]:
fig = so.plot.cluster_location(labelled_data)
fig.set_size_inches(5, 5)
fig.tight_layout(pad=2)
../_images/tutorials_clustering_20_0.png

We can further evaluate how spatially coherent the clusters are by calculating Moran's I: values near 1 indicate that a cluster forms contiguous spatial patches, values near the expected value (close to 0) indicate a random spatial distribution, and negative values indicate dispersion.

[11]:
df_morans_i = so.spatial.autocorrelation().predict(labelled_data, method="moran", permutation_count=500)
Calculating spatial autocorrelation for each channel/cluster: 100%|██████████| 4/4 [00:00<00:00, 22.80it/s]
[12]:
df_morans_i
[12]:
   cluster  morans_i  morans_i_expected   p_value     z_score
0        0  0.810741          -0.000278  0.001996   95.179651
1        1  0.841830          -0.000278  0.001996  102.301234
2        2  0.846760          -0.000278  0.001996   96.882256
3        4  0.846887          -0.000278  0.001996  102.385384
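
For intuition, here is a minimal numpy sketch of global Moran's I for a single binary cluster mask, using rook adjacency (shared pixel edges) as the spatial weights. The spatiomic implementation may use different spatial weights and additionally derives p-values from permutations.

import numpy as np

def morans_i_rook(mask: np.ndarray) -> float:
    """Global Moran's I of a 2D array with binary rook-adjacency weights."""
    x = mask.astype(float)
    x_dev = x - x.mean()
    n = x.size
    # sum of products of deviations over all directed neighbour pairs
    numerator = 2.0 * (
        (x_dev[:-1, :] * x_dev[1:, :]).sum()    # vertical neighbours
        + (x_dev[:, :-1] * x_dev[:, 1:]).sum()  # horizontal neighbours
    )
    # total weight: number of directed neighbour pairs in the grid
    weight_sum = 2.0 * ((x.shape[0] - 1) * x.shape[1] + x.shape[0] * (x.shape[1] - 1))
    return (n / weight_sum) * numerator / (x_dev**2).sum()

# e.g. Moran's I of the binary mask of cluster 0
# morans_i_rook(labelled_data == 0)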

To evaluate the correlation of the clusters with the channels in the original image, we can use the bivariate version of Moran’s I.

[13]:
df_bv_morans_i = so.spatial.bivariate_correlation().predict(
    data=labelled_data,
    channel_image=img_data,
    method="moran",
)
/Users/au734063/Documents/code/spatiomic/.venv/lib/python3.12/site-packages/spatiomic/spatial/_bivariate_correlation.py:126: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_clusters = pd.concat([df_clusters, df_correlation], ignore_index=True)

An easy way to inspect this data is with seaborn's heatmap function.

[14]:
fig, ax = plt.subplots(dpi=100, figsize=(2.5, 2.5))
sns.heatmap(
    df_bv_morans_i.pivot(index="cluster", columns="channel", values="morans_i"),
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    annot=True,
    fmt=".2f",
    square=True,
    ax=ax,
)
[14]:
<Axes: xlabel='channel', ylabel='cluster'>
../_images/tutorials_clustering_27_1.png

Statistics

We can also get the pixel counts for each cluster in our image.

[15]:
df_cluster_abundances = so.tool.count_clusters(
    file_paths=["./data/example_labelled.npy"],
    cluster_count=labelled_data.max() + 1,
)
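
For a labelled image that is already in memory, you can sanity-check these counts directly with numpy:

import numpy as np

cluster_counts = np.bincount(
    labelled_data.astype(int).ravel(),
    minlength=int(labelled_data.max()) + 1,
)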
[16]:
df_cluster_abundances.plot.barh()
[16]:
<Axes: ylabel='file_path'>
../_images/tutorials_clustering_31_1.png