Full example

Here, we show a full example analysis of PathoPlex spatial proteomics data with spatiomic. To avoid version conflicts, we recommend installing the package in a fresh virtual environment. Installing in a new environment may take a couple of minutes as common dependencies are installed. Caution: The minimum Python requirement is 3.10.

pip install -q spatiomic

In addition, installation of session_info2 and requests (e.g., via pip) is necessary to run this notebook: pip install -q session_info2 requests

[1]:
from logging import ERROR, getLogger
from warnings import filterwarnings
[2]:
getLogger("matplotlib.font_manager").setLevel(ERROR)
filterwarnings("ignore", category=UserWarning)
filterwarnings("ignore", category=FutureWarning)
[3]:
from time import time as get_time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from colorcet import linear_kryw_5_100_c67
from matplotlib.colors import ListedColormap
from session_info2 import session_info
from skimage.transform import downscale_local_mean
from tqdm import tqdm

import spatiomic as so

We will time how long it takes to run this notebook without GPU support on a standard laptop. If a CUDA-enabled GPU is available, the time will be significantly reduced. If you wish to use GPU acceleration, please install the necessary RAPIDS.ai dependencies as described in the installation instructions and set use_gpu to True in the spatiomic function calls.
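For instance, a minimal sketch for setting such a flag once (the use_gpu variable here is just a local convenience flag based on whether CuPy imports, not part of spatiomic):

# convenience flag: enable GPU use only if CuPy (installed with RAPIDS) imports cleanly
try:
    import cupy  # noqa: F401

    use_gpu = True
except ImportError:
    use_gpu = False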

[4]:
time_start = get_time()

Data loading

We will work with a subsample of fields-of-view acquired through a controlled PathoPlex spatial proteomics experiment. Here, murine kidney biopsies from mice with crescentic glomerulonephritis (an immune-mediated kidney disease) are compared to healthy control biopsies. Note that the images have already been registered. If you have cyclically acquired images with an offset between cycles, you will need to register them first. Please refer to the registration tutorial.

[5]:
images = [
    "245 g1",
    "245 g2",
    "247 g2",
    "247 g4",
    "248 g2",
    "248 g4",
    "263 g2",
    "263 g4",
    "265 g2",
    "265 g3",
    "266 g4",
    "266 g5",
]
samples = [image.split(" ")[0] for image in images]
is_disease = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

If the data is not available locally, example data can be downloaded from data.complextissue.com. This requires the additional installation of the requests package and about 3 GB of free disk space.

[6]:
import os
from warnings import warn

from requests import get

base_url = "https://data.complextissue.com/spatiomic/tutorial/data/"


def download_file(url, output_path):
    """Download file from URL to output path."""
    response = get(url, stream=True, timeout=10)
    total_size = int(response.headers.get("content-length", 0))
    block_size = 1024

    if os.path.exists(output_path):
        warn(f"File already exists: {output_path}")
        response.close()
        return

    with (
        open(output_path, "wb") as file,
        tqdm(
            total=total_size,
            unit="iB",
            unit_scale=True,
            unit_divisor=1024,
            leave=False,
            desc=f"Downloading {output_path}",
        ) as bar,
    ):
        for data in response.iter_content(block_size):
            bar.update(len(data))
            file.write(data)

    # Check that the size of the downloaded file is correct
    if total_size != 0 and bar.n != total_size:
        os.remove(output_path)
        raise ValueError(f"Downloaded file is incomplete and has been removed: {output_path}. Please try again.")


os.makedirs("./data", exist_ok=True)
os.makedirs("./data/mouse_cgn", exist_ok=True)

for image in images:
    download_file(f"{base_url}{image}.tif", f"./data/mouse_cgn/{image}.tif")
download_file(f"{base_url}channel_names.csv", "./data/mouse_cgn/channel_names.csv")

We still have “DAPI” and “DRAQ5” (two nuclear markers) in the data. Since one is sufficient and DRAQ5 has better contrast, let’s remove DAPI.

[7]:
channels_data = list(pd.read_csv("./data/mouse_cgn/channel_names.csv", header=0).T.to_numpy()[0])
channels = sorted([channel for channel in channels_data if "DAPI" not in channel])

Now, let’s load the actual data. First, we read in each image; next, we downscale it for this example notebook (in a real-world scenario, you would not downscale the images). Then we subset the data to the desired channels, and finally we subsample random pixels from each image to train our clustering algorithm on.

[8]:
img_data = []
subsamples = []

for image in tqdm(images):
    # read the image and convert the dimension order to XYC/channel-last
    data = so.data.read().read_tiff(
        f"./data/mouse_cgn/{image}.tif",
        input_dimension_order="CYX",
    )

    # for the example, we average each 4x4 block of pixels (16x fewer pixels), as we are not using GPU acceleration
    data = downscale_local_mean(data, factors=(4, 4, 1))

    # subset the channels
    data = so.data.subset(
        data,
        channel_names_data=channels_data,
        channel_names_subset=channels,
    )

    # create a random subsample of 40000 pixels per image
    subsample = so.data.subsample().fit_transform(
        data,
        method="count",  # you can also use "fraction"
        count=40000,  # or fraction=0.1,
        seed=0,
    )

    img_data.append(data)
    subsamples.append(subsample)

subsample = np.concatenate(subsamples, axis=0)
len(img_data), img_data[0].shape, subsample.shape
100%|██████████| 12/12 [00:09<00:00,  1.31it/s]
[8]:
(12, (438, 438, 34), (480000, 34))

For larger datasets, we recommend getting a weighted subsample from all images. This enables you to stay within the memory limit of your system while making sure that the data is representative of the entire dataset. Once the preprocessing classes have been fitted on the subsample, they can be applied to the entire dataset on an image-by-image basis. When working with a real-world dataset, you might want to control for multiple variables, such as the experimental conditions, the sample count per condition, the number of FOVs per sample or multiple imaging plates. In such a case, we recommend calculating the subsample size for each image beforehand.
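As a minimal sketch of such precomputed subsample sizes (here we only balance the two conditions; the total pixel budget is illustrative):

# distribute a total pixel budget equally across the two conditions,
# then equally across the images within each condition
total_count = 480_000
image_count_disease = sum(is_disease)
image_count_control = len(is_disease) - image_count_disease

subsample_counts = [
    total_count // (2 * image_count_disease) if disease else total_count // (2 * image_count_control)
    for disease in is_disease
]

Each entry could then be passed as count=... to so.data.subsample().fit_transform for the corresponding image.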

Data preprocessing

Available preprocessing classes include clip for histogram clipping, normalize for normalizing images to a specific range and zscore for z-scoring images, among others. Using normalize facilitates interpretability while the zscore class ensures that all channels have a similar influence. We usually use clip in combination with normalize.
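If you prefer z-scoring, a minimal sketch (assuming so.process.zscore mirrors the fit/transform API of clip and normalize shown below):

# z-score the subsample channel-wise instead of clipping and normalizing
zscorer = so.process.zscore(use_gpu=False)
subsample_zscored = zscorer.fit_transform(subsample)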

When all markers show specific signal in a relevant part of the dataset, histogram clipping based on percentiles can be useful and very quick. The lower percentile should be chosen so that it safely lies outside the portion of the data covered by specific signal, and the upper percentile so that it safely falls within high-intensity specific signal.

[9]:
clipper = so.process.clip(
    method="percentile",
    percentile_min=50.0,  # assuming no marker covers more than 50% of the image with any specific signal
    percentile_max=99.95,  # assuming no marker covers less than 0.05% of the image with high-intensity specific signal
    use_gpu=False,
)

However, sometimes there might be markers that have very low abundance of specific signal in the dataset. If you don’t want to exclude such markers, manually evaluating reasonable histogram thresholds based on a number of images is recommended. You can also easily combine the percentile-based and the absolute value-based approaches.

[10]:
low_abundance_markers = [
    # Add if needed
]

min_values = np.array([np.percentile(subsample[..., i], 75.0) for i in range(subsample.shape[-1])])
max_values = np.array(
    [
        np.percentile(subsample[..., i], 99.95) if channel not in low_abundance_markers else 100
        for i, channel in enumerate(channels)
    ]
)

clipper = so.process.clip(
    method="minmax",
    min_value=min_values,
    max_value=max_values,
    use_gpu=False,
)
[11]:
normalizer = so.process.normalize(
    min_value=0.0,
    max_value=1.0,
    use_gpu=False,
)

We first apply fit_transform on the subsample, so that the fitted parameters are representative of the subsample and thus of all images.

[12]:
# clip the data to defined percentiles, channel-wise
subsample_processed = clipper.fit_transform(subsample)

# normalize the data to a range between 0 and 1, channel-wise
subsample_processed = normalizer.fit_transform(subsample_processed)

We can then apply the preprocessing to each image in the dataset individually using the transform method.

[13]:
img_data_processed = []

for image, image_name in zip(img_data, images):
    image = clipper.transform(image)
    image = normalizer.transform(image)
    np.save(f"./data/mouse_cgn/{image_name}.npy", image)

    img_data_processed.append(image)

You can use the plot submodule to view the individual channels of an image.

[14]:
fig = so.plot.marker_expression(
    img_data_processed[0],
    channel_names=channels,
    max_value=1.0,
    figsize=(12, 12),
    colormap=ListedColormap(linear_kryw_5_100_c67),
)
../_images/tutorials_full_example_31_0.png

Self-organizing map training

Even with a subsample, graph clustering would be computationally very slow. If the subsample were clustered directly, rarer signals might also be grouped together with more abundant signals in the neighborhood graph. We therefore first train a self-organizing map to get a dimensionality-reduced representation of the data. This representation can then be used to identify clusters in the data. The number of SOM nodes determines how many different co-expression signatures can be learned well. For real-world datasets, we commonly use values between (100, 100) and (500, 500) with GPU acceleration. For this example, we will use a smaller SOM size, knowing that we might only be able to identify the most abundant and coherent signals.

[15]:
img_data_som = so.dimension.som(
    node_count=(30, 30),  # we generally recommend higher node counts
    dimension_count=subsample_processed.shape[-1],
    distance_metric="euclidean",  # "euclidean" and "cosine" are the fastest, while there are some theoretical reasons to use "correlation"
    use_gpu=False,
    seed=0,
)
WARNING: CuPy could not be imported
WARNING: CuPy could not be imported
WARNING: CuPy could not be imported
[16]:
img_data_som.fit(
    subsample_processed,
    iteration_count=10,  # we recommend starting with about 25 iterations and evaluating the fit in production settings
)
img_data_som.save("./data/mouse_cgn/example_map.p")
100%|██████████| 10/10 [00:22<00:00,  2.22s/it]

You can easily evaluate how well your SOM fits the training data by plotting the distances (here, Euclidean distances) from each data point in the subsample to its corresponding SOM node. What values are considered “good” will depend on the average intensity in your dataset, the number of markers and the distance metric you use. We encourage you to evaluate multiple SOM sizes with few iterations first; once you have identified an approximate size that fits your data well, train that SOM for more iterations (see the sketch after the plot below).

[17]:
quantization_error, distances = img_data_som.get_quantization_error(subsample_processed, return_distances=True)
[18]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(
    distances,
    bins=100,
)
ax.vlines(
    quantization_error,
    ymin=0,
    ymax=ax.get_ylim()[1],
    color="red",
    linestyle="--",
)
[18]:
<matplotlib.collections.LineCollection at 0x3af2f3f20>
../_images/tutorials_full_example_38_1.png
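As a minimal sketch of the size scan recommended above (the node counts and the low iteration count are illustrative; we assume get_quantization_error returns just the error when return_distances is not set):

# compare the quantization error of a few candidate SOM sizes after a short fit
for node_count in ((20, 20), (30, 30), (40, 40)):
    som_candidate = so.dimension.som(
        node_count=node_count,
        dimension_count=subsample_processed.shape[-1],
        distance_metric="euclidean",
        use_gpu=False,
        seed=0,
    )
    som_candidate.fit(subsample_processed, iteration_count=5)
    quantization_error_candidate = som_candidate.get_quantization_error(subsample_processed)
    print(f"{node_count}: quantization error {quantization_error_candidate:.4f}")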

As a quality control of the SOM fit, we can examine which SOM nodes express which markers. Spatially co-expressed markers should lie closer to each other in the SOM space.

[19]:
fig = so.plot.som_marker_expression(
    img_data_som.get_nodes(flatten=False),
    figsize=(8, 8),
    channel_names=channels,
)
../_images/tutorials_full_example_40_0.png

Cluster the SOM nodes

To identify pixel clusters of protein co-expression, we use similarity-based graph clustering with Leiden.

[20]:
# create a k nearest neighbor graph
knn_graph_som = so.neighbor.knn_graph().create(
    img_data_som,
    neighbor_count=12,
    use_gpu=False,
)
[21]:
# perform Leiden clustering
communities, modularity = so.cluster.leiden().predict(
    knn_graph_som,
    resolution=2.5,  # typically higher resolution values than in scRNA seq experiments
    iteration_count=100,
    seed=42,
    use_gpu=False,
)
[22]:
len(np.unique(communities)), modularity
[22]:
(25, 0.7158304531257043)

It can be difficult to evaluate which cluster resolution is best based on the modularity alone, as it is not really comparable between resolutions. It’s best practice to check whether your clustering approach can identify expected co-expression patterns by running the subsequent steps. You may also want to use a tool like pyclustree to evaluate different resolutions.
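As a minimal sketch of such a resolution scan (the resolution values are illustrative; the predict call mirrors the one above):

# record cluster count and modularity for several Leiden resolutions
for resolution in (1.5, 2.0, 2.5, 3.0):
    communities_candidate, modularity_candidate = so.cluster.leiden().predict(
        knn_graph_som,
        resolution=resolution,
        iteration_count=100,
        seed=42,
        use_gpu=False,
    )
    print(f"resolution {resolution}: {len(np.unique(communities_candidate))} clusters, modularity {modularity_candidate:.3f}")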

To visualize the clusters, we first need a colormap. For fewer than 100 clusters, we can use the glasbey_light colormap from colorcet; otherwise, we can randomly generate one (set flavor to random_hls and specify a seed). We can always override individual colors in the colormap, e.g., to color background clusters black.

[23]:
# get the som node with the smallest mean intensity
background_node = np.argmin(np.mean(img_data_som.get_nodes(), axis=1))
background_cluster = communities[background_node]

colormap = so.plot.colormap(
    color_count=np.max(communities) + 1,
    flavor="glasbey",
    color_override={
        background_cluster: "#000000",  # set the empty background to black
    },
)

We can visually inspect the clustered neighborhood graph. Note that such a two-dimensional projection may result in neighbors lying further apart than they actually are.

[24]:
fig = so.plot.connectivity_graph(
    knn_graph_som,
    clusters=communities,
    cluster_legend=True,
    figsize=(5, 5),
    colormap=colormap,
)
fig.set_dpi(100)
../_images/tutorials_full_example_50_0.png

We can do the same for the SOM nodes as they are arranged within the self-organizing map.

[25]:
fig = so.plot.som_clusters(img_data_som, communities, colormap=colormap)
../_images/tutorials_full_example_52_0.png

You can also visualize the clustering in a UMAP of the SOM nodes.

[26]:
umap = so.dimension.umap(dimension_count=2, neighbor_count=20, distance_min=0.1, spread=0.5, seed=42, use_gpu=False)
umap_values = umap.fit_transform(img_data_som.get_nodes())
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[27]:
fig = so.plot.cluster_scatter(
    umap_values,
    communities,
    colormap=colormap,
)
../_images/tutorials_full_example_55_0.png

Cluster composition

The composition of the clusters can be calculated either from the representative SOM nodes or from randomly sampled clustered images. The former allows you to judge the clustering before applying it to the entire dataset, enabling quicker iterations. The latter represents the dataset slightly more accurately, but is computationally more expensive and generally yields results similar to those from the representative SOM nodes.

[28]:
df_stats = so.tool.get_stats(
    img_data_som.get_nodes(),
    group=communities,
    channel_names=channels,
    comparison="all",
    test="t",
    permutation_count=100,  # Setting permutation count will perform a permutation test with the "test" statistic
    equal_variance=False,
    dependent=False,
    correction="fdr",
)

From a previous analysis, we know that there is a cluster that has an increased phospho-c-Jun expression in PECs (an effector cell type in crescentic glomerulonephritis), so let’s try to find it.

[29]:
df_stats_cjun = df_stats[df_stats["marker"] == "cjun"].sort_values("mean_group", ascending=False).head(1)
df_stats_cjun
[29]:
group comparison marker p_value_corrected log10_p_value_corrected mean_group mean_comparison p_value log10_p_value ranksum log2_fold_change
30 0 all cjun 0.039511 1.403282 0.117867 0.013775 0.019802 1.703291 None 3.097082

You can then plot the contributors for each cluster with their mean intensity, log2 fold change and p-value. The p-value is not shown on the axis but color-coded.

[30]:
fig = so.plot.cluster_contributors(
    df_stats,
    cluster_idx=df_stats_cjun["group"].to_numpy()[0],
    increased_log2_fold_change_threshold=2,
    log10_p_value_column="log10_p_value_corrected",
    mean_max=0.2,
    figsize=(8, 4),
)
fig.set_dpi(100)
../_images/tutorials_full_example_62_0.png

We can also get the mean intensity of all markers for each cluster as a table.

[31]:
df_cluster_marker_intensity = so.tool.mean_cluster_intensity(
    img_data_som.get_nodes(),
    clusters=communities,
    channel_names=channels,
)
df_cluster_marker_intensity.head()
[31]:
5MC ANXA AQP2 CD3 CD9 Calreticulin DACH1 DRAQ5 EEA1 Endomucin ... RAP1GAP SNP TH Vimentin ZO1 aSMAfitc cjun p57 prpS6 rab7
0 0.004611 0.026990 0.001665 0.006472 0.010657 0.018187 0.003667 0.017597 0.013944 0.005845 ... 0.005145 0.021471 0.004766 0.124615 0.011050 0.009301 0.117867 0.006892 0.008928 0.023200
1 0.004613 0.003401 0.001340 0.001712 0.001726 0.001186 0.001491 0.002893 0.000763 0.001916 ... 0.000843 0.001944 0.001337 0.001485 0.001264 0.001277 0.002282 0.001686 0.001138 0.001147
2 0.005265 0.008623 0.004845 0.032312 0.004748 0.005965 0.017611 0.006173 0.015904 0.003141 ... 0.010980 0.009089 0.014863 0.004299 0.017340 0.004963 0.010010 0.019894 0.018631 0.020892
3 0.045893 0.040196 0.012315 0.171179 0.010708 0.005907 0.077159 0.004139 0.027647 0.012415 ... 0.025948 0.019607 0.040096 0.008615 0.106511 0.018705 0.014579 0.065555 0.094898 0.033293
4 0.004895 0.005059 0.002724 0.008685 0.006137 0.008434 0.004887 0.002702 0.038363 0.001591 ... 0.005645 0.012481 0.006135 0.002663 0.006471 0.004740 0.006261 0.007168 0.006891 0.010898

5 rows × 34 columns

For a given cluster, we can plot a histogram of all the intensities of SOM nodes or pixels assigned to that cluster by marker. We again see that Vimentin, Podoplanin and phospho-c-Jun are highly expressed in this cluster and seem to be driving it, as no assigned SOM nodes have a low expression of these markers.

[32]:
fig = so.plot.cluster_contributor_histogram(
    img_data_som.get_nodes()[np.array(communities) == df_stats_cjun["group"].to_numpy()[0]],
    channel_names=channels,
    bin_count=20,
    scaling_factor_y=0.2,  # play around with this value to adjust the y-axis scaling, smaller values zoom out
)
../_images/tutorials_full_example_66_0.png

Vimentin is a cytoskeletal protein that is expressed in PECs but also upregulated in kidney injury or fibrosis. Podoplanin is expressed in both Podocytes and PECs. Phospho-c-Jun is involved in the AP-1 signalling cascade and has previously been indicated to play a role in crescentic glomerulonephritis, though its upregulation in PECs has not been profiled. However, to fully interpret this signal, we need to evaluate its expression in space. For this, we need to cluster all images in the dataset and perform differential abundance analysis as well as spatial projection.

Clustering of entire images

[33]:
for image, image_name in tqdm(zip(img_data_processed, images)):
    clustered_image = img_data_som.label(
        image,
        communities,
        save_path=f"./data/mouse_cgn/{image_name}-labelled.npy",
    )

    so.plot.save_cluster_image(
        clustered_image,
        f"./data/mouse_cgn/{image_name}-labelled.png",
        colormap=colormap,
    )
12it [00:09,  1.32it/s]
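With the labelled images saved, the cluster composition mentioned earlier can also be computed from the clustered pixels themselves rather than from the SOM nodes. A minimal sketch for a single image (the get_stats call mirrors the one used on the SOM nodes above; on full-resolution data this is considerably more expensive):

# pair each pixel of one processed image with its cluster label
pixels = img_data_processed[0].reshape(-1, img_data_processed[0].shape[-1])
pixel_clusters = np.load(f"./data/mouse_cgn/{images[0]}-labelled.npy").reshape(-1)

df_stats_pixels = so.tool.get_stats(
    pixels,
    group=pixel_clusters,
    channel_names=channels,
    comparison="all",
    test="t",
    correction="fdr",
)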

You can view the cluster projection in space (shown here on the downsampled example image; PathoPlex can provide higher-resolution images).

[34]:
labelled_image = np.load(f"./data/mouse_cgn/{images[0]}-labelled.npy")
[35]:
fig = so.plot.cluster_image(
    labelled_image,
    colormap=colormap,
)
../_images/tutorials_full_example_72_0.png

It’s always useful to get a legend for the cluster colors separately.

[36]:
fig = so.plot.cluster_legend(
    cluster_count=len(np.unique(communities)),
    cluster_labels=[f"{i}" for i in range(len(np.unique(communities)))],
    colormap=colormap,
)
../_images/tutorials_full_example_74_0.png

We can evaluate the location of each cluster individually.

[37]:
fig = so.plot.cluster_location(labelled_image)
../_images/tutorials_full_example_76_0.png

Differential abundance analysis

One of the main goals of spatiomic analyses is to generate a cluster abundance count matrix to perform differential cluster abundance analysis. First, we have to quantify the cluster abundance in each image (and if there are multiple images per sample, aggregate them per sample).

[38]:
df_cluster_counts = so.tool.count_clusters(
    [f"./data/mouse_cgn/{image}-labelled.npy" for image in images],
    cluster_count=len(np.unique(communities)),
    normalize=True,
)

Since in this case we had two FOVs per sample, we need to aggregate the counts at the sample level.

[39]:
df_cluster_counts["sample"] = [file_path.split("mouse_cgn/")[-1].split(" ")[0] for file_path in df_cluster_counts.index]
df_cluster_counts["disease"] = [is_disease[samples.index(sample)] for sample in df_cluster_counts["sample"]]
df_cluster_counts = df_cluster_counts.groupby("sample").mean()
df_cluster_counts.head()
[39]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 disease
sample
245 0.042511 0.197194 0.096740 0.080365 0.106037 0.024640 0.048558 0.007894 0.024882 0.014092 ... 0.009659 0.013649 0.037171 0.044020 0.039185 0.019534 0.008106 0.032837 0.009583 1.0
247 0.053392 0.276545 0.083860 0.035169 0.078509 0.030311 0.070987 0.012349 0.019247 0.013693 ... 0.010681 0.000943 0.047463 0.049736 0.028797 0.013923 0.011999 0.025925 0.013063 1.0
248 0.058078 0.117252 0.038768 0.032735 0.063408 0.031455 0.050101 0.022112 0.039128 0.009164 ... 0.026579 0.031685 0.017603 0.058772 0.056637 0.046788 0.012604 0.031351 0.035318 1.0
263 0.039960 0.080352 0.058008 0.227049 0.020381 0.062577 0.017673 0.007152 0.027327 0.020272 ... 0.031966 0.035174 0.056387 0.017832 0.049858 0.048813 0.017298 0.018804 0.009271 0.0
265 0.019310 0.159249 0.042464 0.224151 0.029232 0.037437 0.030666 0.011522 0.027254 0.010065 ... 0.021994 0.028085 0.053945 0.031523 0.030494 0.031812 0.007433 0.013039 0.012169 0.0

5 rows × 26 columns

We can plot a PCA of the cluster abundances. Note the comparatively increased heterogeneity of the disease samples compared to the control samples.

[40]:
pca = so.dimension.pca(dimension_count=2, use_gpu=False)
pca_values = pca.fit_transform(df_cluster_counts.to_numpy()[:, :-1])
print(pca_values.shape)
condition = ["Disease" if disease else "Control" for disease in df_cluster_counts["disease"].to_numpy()]

fig = so.plot.cluster_scatter(
    pca_values,
    clusters=condition,
    colormap=ListedColormap(["#33AA33", "#AA3333"]),
    figsize=(5, 5),
)
(6, 2)
../_images/tutorials_full_example_83_1.png

The first PCA dimension seems to differentiate the samples well and explains most of the variance seen in the cluster abundance data.

[41]:
pca.get_explained_variance_ratio()
[41]:
array([0.68732449, 0.22635426])

We can also plot the loadings of the first PC to see which clusters seem to be driving the separation.

[42]:
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(
    x=df_cluster_counts.columns[:-1].to_numpy().astype(np.uint8),
    height=pca.get_loadings()[0],
)
ax.set_xticks(df_cluster_counts.columns[:-1].to_numpy().astype(np.uint8))
ax.set_xlabel("Cluster")
plt.show()
../_images/tutorials_full_example_87_0.png

The PCA plot already indicates differences between the disease and non-disease samples. We can visualize differential cluster abundance with a volcano plot following statistical testing.

[43]:
df_stats_abundance = so.tool.get_stats(
    df_cluster_counts.iloc[:, :-1].to_numpy(),
    group=df_cluster_counts["disease"].to_numpy(),
    channel_names=df_cluster_counts.columns,
    comparison=0,  # Choose your control group
    test="t",
    equal_variance=False,  # Use Welch's t-test
    dependent=False,
    correction=None,  # For larger sample sizes, you can use "fdr", "holm-sidak" or "bonferroni"
    test_kwargs={},
)
[44]:
fig = so.plot.volcano(
    df_stats_abundance,
    channel_column="marker",
    title="Disease vs. Control",
    significant_log2_fold_change_threshold=0.5,
)
../_images/tutorials_full_example_90_0.png

We can see that even a downscaled analysis on two FOVs per sample already identifies the over-expression of phospho-c-Jun and Vimentin within the same pixels as a defining feature of the disease group.

[45]:
df_stats_abundance_significant = df_stats_abundance[
    (df_stats_abundance["p_value"] < 0.05)
    & ((df_stats_abundance["log2_fold_change"] > 0.5) | (df_stats_abundance["log2_fold_change"] < -0.5))
].sort_values("log2_fold_change", ascending=False)
df_stats_abundance_significant
[45]:
group comparison marker p_value_corrected log10_p_value_corrected mean_group mean_comparison p_value log10_p_value ranksum log2_fold_change
4 1.0 0.0 4 None None 0.082651 0.027449 0.038524 1.414265 None 1.590265
6 1.0 0.0 6 None None 0.056549 0.028378 0.040201 1.395763 None 0.994712
0 1.0 0.0 0 None None 0.051327 0.026062 0.046583 1.331775 None 0.977780
19 1.0 0.0 19 None None 0.050843 0.027398 0.022410 1.649567 None 0.891958
3 1.0 0.0 3 None None 0.049423 0.191962 0.034745 1.459106 None -1.957564
[46]:
for marker in df_stats_abundance_significant["marker"]:
    df_stats_cluster = df_stats[df_stats["group"] == marker]
    df_stats_cluster_significant = df_stats_cluster[
        (df_stats_cluster["p_value"] < 0.05)
        & (np.abs(df_stats_cluster["log2_fold_change"]) > 0.5)
        & (np.abs(df_stats_cluster["mean_group"]) > 0.03)
    ].sort_values("log2_fold_change", ascending=False)

    print(f"Differentially abundant cluster {marker} consists of: " + ", ".join(df_stats_cluster_significant["marker"]))
Differentially abundant cluster 4 consists of: EEA1
Differentially abundant cluster 6 consists of: PDGFR
Differentially abundant cluster 0 consists of: Vimentin, cjun, Podoplanin
Differentially abundant cluster 19 consists of: LaminB, DRAQ5, Calreticulin, GR
Differentially abundant cluster 3 consists of: CD3, ZO1, p57, DACH1, TH, OAT1, prpS6, Fibronectin, Nephrin, 5MC, GR, ANXA
[47]:
fig = so.plot.cluster_selection(
    labelled_image,
    clusters=df_stats_abundance_significant["marker"].head(5).to_numpy(),
    colormap=colormap,
)
../_images/tutorials_full_example_94_0.png

Spatial analysis

Now we know which clusters are differentially abundant between groups, where they are located within images and which markers are driving them. But we can go a bit further and also evaluate which clusters are located near one another. First, we create a neighborhood offset with the radius of interest. This way, we can evaluate the spatial relationship at different distances. Here, we look at a queen neighborhood of order 2 (a 5×5 pixel window).

[48]:
neighborhood_offset = so.spatial.neighborhood_offset(
    neighborhood_type="queen",
    order=2,
    exclude_inner_order=0,  # note that 0, 0 will always be excluded
)
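To probe a different distance band, you can combine order and exclude_inner_order, e.g., to keep only neighbors more than two steps away (a sketch, assuming exclude_inner_order drops all neighbors up to that order):

# ring-shaped neighborhood: queen neighbors of orders 3 and 4 only
neighborhood_offset_ring = so.spatial.neighborhood_offset(
    neighborhood_type="queen",
    order=4,
    exclude_inner_order=2,
)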
[49]:
dfs_vicinity_composition = []

for image in tqdm(images):
    clustered_data = np.load(f"./data/mouse_cgn/{image}-labelled.npy")
    df_vicinity_composition = so.spatial.vicinity_composition(
        clustered_data,
        neighborhood_offset=neighborhood_offset,
        ignore_identities=True,
        ignore_repeated=True,
        use_gpu=False,
    )

    dfs_vicinity_composition.append(df_vicinity_composition)

dfs_vicinity_composition[0].head()
100%|██████████| 12/12 [00:06<00:00,  1.81it/s]
[49]:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 0 1015 117 219 401 143 298 100 521 314 ... 328 51 67 78 294 210 145 84 167 78
1 741 0 1351 1130 3991 1008 1040 105 395 345 ... 404 121 75 298 290 274 159 75 883 137
2 109 1336 0 1274 1932 292 302 265 62 30 ... 51 92 125 62 242 57 134 44 476 101
3 286 1072 1227 0 1297 347 469 131 172 75 ... 270 106 111 156 250 78 182 117 347 117
4 434 4039 1981 1271 0 317 216 206 360 48 ... 246 85 29 293 452 172 334 49 1252 527

5 rows × 25 columns

We can combine the vicinity composition of individual images for the entire dataset and for each group.

[50]:
df_vicinity_composition_overall = pd.concat(dfs_vicinity_composition).groupby(level=0).sum()

dfs_vicinity_composition_disease = [
    df for df, disease in zip(dfs_vicinity_composition, is_disease, strict=True) if disease
]
dfs_vicinity_composition_control = [
    df for df, disease in zip(dfs_vicinity_composition, is_disease, strict=True) if not disease
]

df_vicinity_composition_disease = pd.concat(dfs_vicinity_composition_disease).groupby(level=0).sum()
df_vicinity_composition_control = pd.concat(dfs_vicinity_composition_control).groupby(level=0).sum()

From this data, we can generate a vicinity graph and visualize it.

[51]:
vicinity_graph = so.spatial.vicinity_graph(
    df_vicinity_composition_overall,
)
[52]:
fig = so.plot.spatial_graph(
    vicinity_graph,
    edge_threshold=0.1,
)
fig.set_size_inches(5, 5)
fig.set_dpi(100)
../_images/tutorials_full_example_103_0.png

We can already identify a heavily interconnected group of clusters. Let’s look at their composition. Note that, while the code is deterministic on a given computer, which clusters are nuclear might change between machines and operating systems.

[53]:
clusters = [15, 19, 20]
for cluster in clusters:
    df_stats_cluster = df_stats[df_stats["group"] == cluster]
    df_stats_cluster_significant = df_stats_cluster[
        (df_stats_cluster["p_value_corrected"] < 0.05)
        & (np.abs(df_stats_cluster["log2_fold_change"]) > 0.5)
        & (np.abs(df_stats_cluster["mean_group"]) > 0.05)
    ].sort_values("log2_fold_change", ascending=False)

    print(f"Cluster {cluster} consists of: " + ", ".join(df_stats_cluster_significant["marker"]))
Cluster 15 consists of: DRAQ5, GATA3, DACH1
Cluster 19 consists of: LaminB, DRAQ5
Cluster 20 consists of: GR, DRAQ5

All these clusters seem to represent nuclear signals, which is why they are spatially co-located. We can confirm this in space again.

[54]:
fig = so.plot.cluster_selection(
    labelled_image,
    clusters=clusters,
    colormap=colormap,
)
../_images/tutorials_full_example_107_0.png

Now, let’s evaluate the differences in cluster vicinities between groups.

[55]:
vicinity_graph_disease = so.spatial.vicinity_graph(
    df_vicinity_composition_disease,
)

vicinity_graph_control = so.spatial.vicinity_graph(
    df_vicinity_composition_control,
)

The spatial_graph plotting function allows us to specify a reference graph. Increased connections are shown in red, decreased connections in blue.

[56]:
fig = so.plot.spatial_graph(
    vicinity_graph_disease,
    reference_graph=vicinity_graph_control,
    edge_threshold=0.1,
)
fig.set_size_inches(5, 5)
fig.set_dpi(100)
../_images/tutorials_full_example_111_0.png

We can see that the connection between cluster 9 (an alpha-SMA cluster) and cluster 6 (a PDGFR cluster) increases in the disease group. Both signals are implicated in pro-fibrotic responses.

[57]:
clusters = [6, 9]
for cluster in clusters:
    df_stats_cluster = df_stats[df_stats["group"] == cluster]
    df_stats_cluster_significant = df_stats_cluster[
        (df_stats_cluster["p_value_corrected"] < 0.05)
        & (np.abs(df_stats_cluster["log2_fold_change"]) > 0.5)
        & (np.abs(df_stats_cluster["mean_group"]) > 0.05)
    ].sort_values("log2_fold_change", ascending=False)

    print(f"Cluster {cluster} consists of: " + ", ".join(df_stats_cluster_significant["marker"]))
Cluster 6 consists of: PDGFR
Cluster 9 consists of: aSMAfitc

We can evaluate the significance of these changes with a permutation test. Instead of relying on the t-statistic or the t-test, we can define our own test statistic.

[58]:
def compare_means(data, data_comparison, **kwargs):
    """Compare means of two groups."""
    return np.mean(data, axis=0) - np.mean(data_comparison, axis=0)


# fraction of each image's cluster-9 neighborhoods that is occupied by cluster 6
join_counts = np.expand_dims(
    np.array([df.iloc[9, 6] / df.iloc[9].sum() for df in dfs_vicinity_composition]),
    axis=-1,
)

df_stats_vicinity = so.tool.get_stats(
    join_counts,
    group=is_disease,
    channel_names=["join_counts"],
    comparison=0,
    test=compare_means,
    dependent=False,
    correction=None,
    permutation_count=1000,
    permutation_seed=42,
)
[59]:
df_stats_vicinity["p_value"]
[59]:
0    0.02381
Name: p_value, dtype: float64

The increased spatial enrichment of cluster 6 around pixels assigned to cluster 9 appears to be statistically significant at the 0.05 level of significance.

Visualizing all clustered images

Now that we know which clusters are upregulated and what spatial relationships they have, we can visualize all images in the dataset with the cluster assignments.

[60]:
fig, axes = plt.subplots(2, 6, figsize=(15, 5))

for ax, image in zip(axes.flatten(), images, strict=True):
    clustered_image = np.load(f"./data/mouse_cgn/{image}-labelled.npy")
    ax.imshow(clustered_image, cmap=colormap)
    ax.axis("off")
    ax.set_title("Control" if not is_disease[images.index(image)] else "Disease")

plt.tight_layout()
../_images/tutorials_full_example_120_0.png

Extended analyses

spatiomic builds upon fundamental data structures in the scientific Python ecosystem, e.g., NumPy arrays and pandas DataFrames. Many functions also accept an AnnData object as input. Similarly, the spatial submodule of spatiomic largely relies on the pysal weights object. This allows for seamless integration with other community tools, such as esda or squidpy. Please refer to the PathoPlex manuscript that first introduced the spatiomic package for example analyses using spatiomic output with MistyR, PILOT, UnPaSt, Cellpose and other tools.

For example, based on the neighborhood offset generated above, we can create a spatial weights matrix for a given image to use with pysal and esda. For Python examples of using esda for spatial analyses, please refer to the pasta spatial statistics guide, to which we have contributed the Python examples in all sections dealing with lattice-based data.

[61]:
spatial_weights = so.spatial.spatial_weights(
    data_shape=labelled_image.shape,
    neighborhood_offset=neighborhood_offset,
)
Creating spatial weights for each offset: 100%|██████████| 24/24 [00:00<00:00, 1726.23it/s]
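As a quick interoperability sketch (assuming esda is installed and that the weights index pixels in the same row-major order as the flattened image; cluster 9 is just an illustrative choice):

# global Moran's I for the spatial autocorrelation of one cluster's indicator
from esda.moran import Moran

is_cluster = (labelled_image == 9).astype(int).reshape(-1)
moran = Moran(is_cluster, spatial_weights, permutations=999)
print(f"Moran's I: {moran.I:.3f}, pseudo p-value: {moran.p_sim:.4f}")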

We can also easily convert each image to an AnnData object for use with scverse® tools.

[62]:
adata = so.data.anndata_from_array(
    img_data_processed[-4],
    channel_names=channels,
    clusters=labelled_image,
    cluster_key="cluster",
    spatial_weights=spatial_weights,
)
adata
[62]:
AnnData object with n_obs × n_vars = 191844 × 34
    obs: 'cluster'
    obsm: 'spatial'
    obsp: 'spatial_connectivities'
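As a sketch of downstream use (assuming squidpy is installed; its nhood_enrichment reads the spatial_connectivities stored above and expects a categorical cluster key):

# neighborhood enrichment between clusters with squidpy
import squidpy as sq

adata.obs["cluster"] = adata.obs["cluster"].astype("category")
sq.gr.nhood_enrichment(adata, cluster_key="cluster")
sq.pl.nhood_enrichment(adata, cluster_key="cluster")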

You can also refer to the spatial statistics tutorial included in this documentation for more examples of built-in methods and interoperability with esda. Please note that the spatial statistics functionality shown there is still actively developed, not fully covered by unit tests and might change in the future.

Time evaluation

Let’s see how long it took to run this example on a MacBook Pro with M2 Max chip but no CUDA-enabled GPU.

[63]:
time_end = get_time()
time_execution = time_end - time_start

print(f"Execution time: {time_execution:.2f} seconds")
Execution time: 82.92 seconds

Session info

[64]:
info = session_info(
    cpu=True,
    gpu=True,
)  # note that the output cannot be copied in the compiled docs version of this notebook
info
[64]:

For the online documentation, we can also print the session info as plain text.

[65]:
for part in info._repr_mimebundle_()["text/markdown"].split("\n"):
    print(part)
| Package      | Version |
| ------------ | ------- |
| anndata      | 0.11.4  |
| pandas       | 2.2.3   |
| matplotlib   | 3.10.1  |
| numpy        | 2.2.4   |
| scikit-image | 0.25.2  |
| tqdm         | 4.67.1  |
| spatiomic    | 0.4.0   |
| requests     | 2.32.3  |
| igraph       | 0.11.8  |
| networkx     | 3.4.2   |
| libpysal     | 4.13.0  |

| Dependency                    | Version                |
| ----------------------------- | ---------------------- |
| sphinxcontrib-jquery          | 4.1                    |
| esda                          | 2.7.0                  |
| pyarrow                       | 19.0.1                 |
| idna                          | 3.10                   |
| zarr                          | 2.18.7                 |
| tifffile                      | 2025.3.30              |
| asttokens                     | 3.0.0                  |
| psutil                        | 7.0.0                  |
| comm                          | 0.2.2                  |
| pytz                          | 2025.2                 |
| coverage                      | 7.8.0                  |
| h5py                          | 3.13.0                 |
| pure_eval                     | 0.2.3                  |
| kiwisolver                    | 1.4.8                  |
| certifi                       | 2025.1.31 (2025.01.31) |
| sphinxcontrib-serializinghtml | 2.0.0                  |
| pillow                        | 11.2.1                 |
| urllib3                       | 2.4.0                  |
| setuptools                    | 78.1.0                 |
| platformdirs                  | 4.3.7                  |
| Deprecated                    | 1.2.18                 |
| ipython                       | 9.1.0                  |
| seaborn                       | 0.13.2                 |
| zstandard                     | 0.23.0                 |
| six                           | 1.17.0                 |
| cffi                          | 1.17.1                 |
| Pygments                      | 2.19.1                 |
| imagecodecs                   | 2025.3.30              |
| dask                          | 2024.11.2              |
| geopandas                     | 1.0.1                  |
| debugpy                       | 1.8.14                 |
| soupsieve                     | 2.6                    |
| patsy                         | 1.0.1                  |
| charset-normalizer            | 3.4.1                  |
| pynndescent                   | 0.5.13                 |
| defusedxml                    | 0.7.1                  |
| MarkupSafe                    | 3.0.2                  |
| adjustText                    | 1.3.0                  |
| beautifulsoup4                | 4.13.4                 |
| lazy_loader                   | 0.4                    |
| msgpack                       | 1.1.0                  |
| PyYAML                        | 6.0.2                  |
| wrapt                         | 1.17.2                 |
| sphinxcontrib-applehelp       | 2.0.0                  |
| stack-data                    | 0.6.3                  |
| shapely                       | 2.1.0                  |
| pyzmq                         | 26.4.0                 |
| scipy                         | 1.15.2                 |
| sphinxcontrib-devhelp         | 2.0.0                  |
| wcwidth                       | 0.2.13                 |
| llvmlite                      | 0.44.0                 |
| tornado                       | 6.4.2                  |
| scikit-learn                  | 1.5.2                  |
| XPySom                        | 1.0.7                  |
| decorator                     | 5.2.1                  |
| torch                         | 2.6.0                  |
| cloudpickle                   | 3.1.1                  |
| astroid                       | 3.3.9                  |
| parso                         | 0.8.4                  |
| sphinxcontrib-qthelp          | 2.0.0                  |
| pyparsing                     | 3.2.3                  |
| simplejson                    | 3.20.1                 |
| toolz                         | 1.0.0                  |
| numcodecs                     | 0.15.1                 |
| Jinja2                        | 3.1.6                  |
| colorcet                      | 3.1.0                  |
| ipykernel                     | 6.29.5                 |
| tblib                         | 3.1.0                  |
| traitlets                     | 5.14.3                 |
| jupyter_client                | 8.6.3                  |
| sphinxcontrib-jsmath          | 1.0.1                  |
| numba                         | 0.61.2                 |
| typing_extensions             | 4.13.2                 |
| appnope                       | 0.1.4                  |
| packaging                     | 24.2                   |
| natsort                       | 8.4.0                  |
| jedi                          | 0.19.2                 |
| jupyter_core                  | 5.7.2                  |
| prompt_toolkit                | 3.0.51                 |
| pycparser                     | 2.22                   |
| python-dateutil               | 2.9.0.post0            |
| sphinxcontrib-htmlhelp        | 2.1.0                  |
| asciitree                     | 0.3.3                  |
| threadpoolctl                 | 3.6.0                  |
| cycler                        | 0.12.1                 |
| joblib                        | 1.4.2                  |
| texttable                     | 1.7.0                  |
| matplotlib-inline             | 0.1.7                  |
| umap-learn                    | 0.5.7                  |
| session-info2                 | 0.1.2                  |
| pyproj                        | 3.7.1                  |
| statsmodels                   | 0.14.4                 |
| ipywidgets                    | 8.1.6                  |
| executing                     | 2.2.0                  |

| Component | Info                                                                    |
| --------- | ----------------------------------------------------------------------- |
| Python    | 3.12.6 (main, Sep 29 2024, 12:25:04) [Clang 16.0.0 (clang-1600.0.26.3)] |
| OS        | macOS-15.3.1-arm64-arm-64bit                                            |
| CPU       | 12 logical CPU cores, arm                                               |
| GPU       | No GPU found                                                            |
| Updated   | 2025-04-17 13:34                                                        |