spatiomic.cluster

Make all clustering classes available in the cluster submodule.

Classes

agglomerative

Set the configuration for the agglomerative clustering class and initialise it.

kmeans

Set the configuration for the k-means clustering class and initialise it.

leiden

Initialise the Leiden clustering class.

som

Initialise a self-organising map with the provided configuration.

Package Contents

class spatiomic.cluster.agglomerative(cluster_count=10, distance_metric='euclidean', use_gpu=True)

Bases: spatiomic.cluster._base.ClusterInterface

Set the configuration for the agglomerative clustering class and initialise it.

Parameters:
  • cluster_count (int, optional) – The number of clusters to group the data into. Defaults to 10.

  • distance_metric (Literal["euclidean", "manhattan", "cosine"], optional) – The distance metric to use. Defaults to “euclidean”.

  • use_gpu (bool, optional) – Whether to use the cuml or the sklearn AgglomerativeClustering class. Defaults to True.

cluster_count = 10
connectivity: str | None
distance_metric = 'euclidean'
fit_predict(data, **kwargs)

Perform agglomerative hierarchical clustering on the data with the settings of the class.

Parameters:
  • data (Union[pd.DataFrame, NDArray]) – The data to be clustered, last dimension being features.

  • kwargs (dict)

Returns:

An array containing the cluster label for each data point.

Return type:

NDArray

Raises:

ValueError – If the estimator does not have a fit_predict method.
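
Example

A minimal usage sketch, assuming array input with features in the last dimension; the random data, shapes and use_gpu=False are illustrative choices, not requirements:

    import numpy as np

    import spatiomic

    # Illustrative data: 500 observations with 5 features each.
    data = np.random.default_rng(0).random((500, 5)).astype(np.float32)

    clusterer = spatiomic.cluster.agglomerative(
        cluster_count=10,
        distance_metric="euclidean",
        use_gpu=False,  # fall back to the sklearn backend without a GPU
    )
    clusters = clusterer.fit_predict(data)  # one cluster label per observation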

set_estimator()

Set the AgglomerativeClustering estimator.

Return type:

None

use_gpu = True
class spatiomic.cluster.kmeans(cluster_count=20, run_count=1, iteration_count=300, tolerance=0.001, init='k-means++', seed=None, use_gpu=True)

Bases: spatiomic.cluster._base.ClusterInterface

Set the configuration for the k-means clustering class and initialise it.

Parameters:
  • cluster_count (int, optional) – The number of k-means clusters to split the data into. Defaults to 20.

  • run_count (int, optional) – The number of k-means runs to perform. Defaults to 1.

  • iteration_count (int, optional) – Maximum number of k-means iterations per run. Defaults to 300.

  • tolerance (float, optional) – Maximum tolerance in center shift to declare convergence. Defaults to 1e-3.

  • init (Literal["random", "k-means++"], optional) – How to initialise the centers, either “random” or “k-means++”. Defaults to “k-means++”.

  • seed (Optional[int], optional) – Random seed, either to be temporarily set as the numpy random seed or directly for libKmCuda. Defaults to None.

  • use_gpu (bool, optional) – Whether to use the cuml or the sklearn kMeans class. Defaults to True.

cluster_count = 20
fit_predict(data)

Perform k-means clustering on the data with the settings of the class.

Parameters:

data (NDArray) – The data to be clustered, last dimension being features.

Returns:

An array containing the cluster label for each data point.

Return type:

NDArray
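
Example

A minimal sketch of a CPU run on illustrative random data (the shapes, seed and use_gpu=False are assumptions for demonstration):

    import numpy as np

    import spatiomic

    data = np.random.default_rng(0).random((1000, 5)).astype(np.float32)

    kmeans = spatiomic.cluster.kmeans(
        cluster_count=20,
        run_count=1,
        seed=0,
        use_gpu=False,  # use the sklearn KMeans backend without a GPU
    )
    clusters = kmeans.fit_predict(data)  # one cluster label per data point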

init = 'k-means++'
iteration_count = 300
run_count = 1
seed = None
set_estimator()

Set the KMeans estimator.

Return type:

None

tolerance = 0.001
use_gpu = True
class spatiomic.cluster.leiden

Initialise the Leiden clustering class.

graph: igraph.Graph | None = None
predict(graph, resolution=1.0, iteration_count=1000, seed=0, use_gpu=True, *args, **kwargs)

Perform Leiden clustering on the provided neighborhood graph.

Warning

The GPU version of Leiden may provide different results than the CPU version.

Parameters:
  • graph (Graph) – An igraph Graph to be optimised for community detection.

  • resolution (float, optional) – Scales the minimum interconnectedness required for a positive modularity. Higher values result in more, but more deeply interconnected, communities. Defaults to 1.0.

  • iteration_count (int, optional) – Iteration count to run the Leiden algorithm for. Defaults to 1000.

  • seed (int, optional) – Random seed to use for the Leiden algorithm. Defaults to 0.

  • use_gpu (bool, optional) – Whether to use the GPU or CPU for the Leiden algorithm. Defaults to True.

  • args (Any)

  • kwargs (dict)

Returns:

A tuple of the list of assigned communities, the final modularity and the optimised igraph Graph.

Return type:

Tuple[List[int], float, Graph]
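
Example

A minimal sketch using a small built-in igraph graph as a stand-in; in practice the graph would be a neighborhood graph built from your data (e.g. a k-nearest-neighbour graph):

    import igraph

    import spatiomic

    # Stand-in graph; replace with a neighborhood graph of your data.
    graph = igraph.Graph.Famous("Zachary")

    communities, modularity, optimised_graph = spatiomic.cluster.leiden().predict(
        graph,
        resolution=1.0,
        seed=0,
        use_gpu=False,  # CPU Leiden; GPU results may differ (see warning above)
    )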

class spatiomic.cluster.som(node_count=(50, 50), dimension_count=5, distance_metric='euclidean', neighborhood='gaussian', neighborhood_accuracy='fast', learning_rate_initial=0.2, learning_rate_final=0.003, sigma_initial=None, sigma_final=None, parallel_count=8096, n_jobs=-1, seed=None, use_gpu=True)

Bases: spatiomic.dimension._base.LoadableDimensionReducer

Initialise a self-organising map with the provided configuration.

The advantage of self-organizing maps is that they reduce the dimensionality of the data while preserving the feature dimensions. This potentially allows for a better interpretation of the data. The disadvantage is that their output consists of representative nodes rather than the actual data points. There are four things to keep in mind so that the SOM best represents the biology of your data (a configuration sketch follows the parameter list below):

  • The SOM node count should be large enough to capture the topography of the data. If your data is very uniform, you can use a smaller node count. However, when working with tissue, very different imaging markers and multiple disease states, a larger node count is recommended.

  • The SOM should be trained for long enough to capture the topography of the data.

  • The SOM should be trained with a final learning rate that is not too high, so that the SOM can accurately represent small differences in the data.

  • The SOM should be trained with a final neighborhood size that is not too large. SOM nodes are not updated individually during training, but rather in a neighborhood. If your neighborhood is too large, other, perhaps more abundant, biological signals will pull nodes that represent less abundant signals towards them, leading to a worse representation of the latter.

Parameters:
  • node_count (Tuple[int, int], optional) – Tuple determining the SOM node size. Defaults to (50, 50).

  • dimension_count (int, optional) – Dimensionality of the original data. Defaults to 5.

  • distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – Distance metric to use. Defaults to “euclidean”.

  • neighborhood (str, optional) – The type of neighborhood used to determine related nodes. Defaults to “gaussian”.

  • neighborhood_accuracy (Literal["fast", "accurate"], optional) – The accuracy to use for the neighborhood. Defaults to “fast”.

  • learning_rate_initial (float, optional) – The initial learning rate. Defaults to 0.2.

  • learning_rate_final (float, optional) – The learning rate at the end of the SOM training. Defaults to 3e-3.

  • sigma_initial (Optional[int], optional) – The initial size of the neighborhood, typically a higher value than sigma_final. Defaults to None.

  • sigma_final (Optional[int], optional) – The final size of the neighborhood, typically a lower value than sigma_initial. Defaults to None.

  • parallel_count (int, optional) – Data points to process concurrently. Defaults to 8096.

  • n_jobs (int, optional) – Jobs to perform simultaneously when using the sklearn NearestNeighbors class. The value -1 means using all processors. Defaults to -1.

  • seed (Optional[int], optional) – The random seed. Defaults to None.

  • use_gpu (bool, optional) – Whether to use cupy. Defaults to True.
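
Example

A configuration sketch following the guidance above, assuming 5-channel data and no GPU; the values are illustrative, not recommendations for any specific dataset:

    import spatiomic

    som = spatiomic.cluster.som(
        node_count=(50, 50),        # large enough to capture the data topography
        dimension_count=5,          # must match the feature count of the data
        distance_metric="euclidean",
        learning_rate_initial=0.2,
        learning_rate_final=0.003,  # low final rate to preserve small differences
        seed=0,
        use_gpu=False,
    )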

fit(data, iteration_count=50, pca_init=False)

Fit the SOM on the data.

Parameters:
  • data (Union[pd.DataFrame, NDArray]) – The data (channel-last) of which the SOM should capture the topography.

  • iteration_count (int, optional) – The iterations to train the SOM for. Defaults to 50.

  • pca_init (bool, optional) – Whether to initialise the SOM through PCA. Defaults to False.

Return type:

None

fit_predict(data, iteration_count=50, pca_init=False, return_distance=False)

Fit the SOM on the data and return the id of the nearest SOM node for each data point.

Parameters:
  • data (NDArray) – The data to be labelled based on the label of the nearest SOM node.

  • iteration_count (int, optional) – The iterations to train the SOM for. Defaults to 50.

  • pca_init (bool, optional) – Whether to initialise the SOM through PCA. Defaults to False.

  • return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.

Returns:

The id of the nearest SOM node for each data point, or the ids and distances.

Return type:

Union[NDArray, Tuple[NDArray, NDArray]]
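
Example

A minimal sketch on illustrative random data, requesting the optional distances alongside the node ids:

    import numpy as np

    import spatiomic

    data = np.random.default_rng(0).random((10000, 5)).astype(np.float32)

    som = spatiomic.cluster.som(node_count=(50, 50), dimension_count=5, seed=0, use_gpu=False)

    node_ids, distances = som.fit_predict(
        data,
        iteration_count=50,
        return_distance=True,  # omit to receive only the node ids
    )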

get_config()

Get the config of the SOM class.

Returns:

The class configuration as a dictionary.

Return type:

dict

get_nodes(flatten=True)

Get the weights of a previously fitted SOM.

Parameters:

flatten (bool, optional) – Whether to flatten the SOM dimensions (but not the channel dimensionality). Defaults to True.

Raises:

ValueError – If no self-organizing map has been created or loaded.

Returns:

The weights of the SOM nodes.

Return type:

NDArray

get_quantization_error(data, return_distances=False)

Get the quantization error of the SOM on the provided data.

Uses the neighbor finder to find the nearest neighbor of each data point in the SOM and calculates the quantization error based on the distance between the data point and its nearest neighbor.

Parameters:
  • data (NDArray) – The data to get the quantization error for.

  • return_distances (bool)

Returns:

The mean quantization error, and all distances if return_distances is set to True.

Return type:

Union[float, Tuple[float, np.ndarray]]
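
Example

A sketch of checking how well a fitted SOM represents the data; the data and configuration are illustrative:

    import numpy as np

    import spatiomic

    data = np.random.default_rng(0).random((10000, 5)).astype(np.float32)

    som = spatiomic.cluster.som(node_count=(50, 50), dimension_count=5, seed=0, use_gpu=False)
    som.fit(data, iteration_count=50)

    # Mean distance between each data point and its nearest SOM node,
    # optionally together with the per-point distances.
    error, distances = som.get_quantization_error(data, return_distances=True)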

label(data, clusters, save_path=None, return_distance=False, flatten=False)

Get the label for each data point based on the label for its closest SOM node.

This function internally uses the predict method to get the nearest node for each data point and then assigns the label of the nearest node to the data point. Labels have to be provided for each SOM node based on a clustering of the SOM nodes.

Parameters:
  • data (NDArray) – The data to be labelled based on the label of the nearest SOM node.

  • clusters (Union[List[int], NDArray]) – A list of clusters (one for each SOM node).

  • save_path (Optional[str], optional) – The path where to save the SOM and its configuration to. Defaults to None.

  • return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.

  • flatten (bool, optional) – Whether to flatten the input data in every dimension but the channel dimension. Defaults to False.

Returns:

The cluster for each data point or the clusters and distances.

Return type:

Union[NDArray, Tuple[NDArray, NDArray]]
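
Example

A sketch of the typical two-step workflow: cluster the SOM nodes, then propagate the node-level clusters to the data points. The data, cluster count and other values are illustrative:

    import numpy as np

    import spatiomic

    data = np.random.default_rng(0).random((10000, 5)).astype(np.float32)

    som = spatiomic.cluster.som(node_count=(50, 50), dimension_count=5, seed=0, use_gpu=False)
    som.fit(data, iteration_count=50)

    # Cluster the SOM nodes themselves, e.g. with k-means.
    nodes = som.get_nodes(flatten=True)  # node weights, shape (50 * 50, 5)
    node_clusters = spatiomic.cluster.kmeans(cluster_count=8, seed=0, use_gpu=False).fit_predict(nodes)

    # Propagate the node clusters to every data point via its nearest node.
    labels = som.label(data, clusters=node_clusters)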

load(save_path)

Load a previously pickled SOM.

Parameters:

save_path (str) – The path where to load the SOM and its configuration from.

Return type:

None

predict(data, return_distance=False)

Get the id of the nearest SOM node for each data point and optionally the distance to the node.

Parameters:
  • data (NDArray) – The data to be labelled based on the label of the nearest SOM node.

  • return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.

Returns:

The labels for each data point or the labels and distances.

Return type:

Union[NDArray, Tuple[NDArray, NDArray]]

save(save_path)

Pickle and save a previously fit SOM.

Parameters:

save_path (str) – The path where to save the SOM and its configuration to.

Raises:

ValueError – If no self-organizing map has been created or loaded.

Return type:

None
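
Example

A sketch of a save/load round trip; the path and training settings are illustrative:

    import numpy as np

    import spatiomic

    data = np.random.default_rng(0).random((1000, 5)).astype(np.float32)

    som = spatiomic.cluster.som(node_count=(10, 10), dimension_count=5, seed=0, use_gpu=False)
    som.fit(data, iteration_count=10)

    som.save("my_som")  # illustrative path

    # Restore the SOM and its configuration into a fresh instance.
    som_restored = spatiomic.cluster.som()
    som_restored.load("my_som")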

set_config(node_count=(50, 50), dimension_count=5, distance_metric='euclidean', neighborhood='gaussian', neighborhood_accuracy='fast', learning_rate_initial=0.2, learning_rate_final=0.003, sigma_initial=None, sigma_final=None, parallel_count=8096, n_jobs=-1, seed=None)

Set the config of the SOM class.

Parameters:
  • node_count (Tuple[int, int], optional) – Tuple determining the SOM node size. Defaults to (50, 50).

  • dimension_count (int, optional) – Dimensionality of the original data. Defaults to 5.

  • distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – Distance metric to use. Defaults to “euclidean”.

  • neighborhood (str, optional) – The type of neighborhood used to determine related nodes. Defaults to “gaussian”.

  • neighborhood_accuracy (Literal["fast", "accurate"], optional) – The accuracy to use for the neighborhood. Defaults to “fast”.

  • learning_rate_initial (float, optional) – The initial learning rate. Defaults to 0.2.

  • learning_rate_final (float, optional) – The learning rate at the end of the SOM training. Defaults to 3e-3.

  • sigma_initial (Optional[int], optional) – The initial size of the neighborhood, typically a higher value than sigma_final. Defaults to None.

  • sigma_final (Optional[int], optional) – The final size of the neighborhood, typically a lower value than sigma_initial. Defaults to None.

  • parallel_count (int, optional) – Data points to process concurrently. Defaults to 8096.

  • seed (Optional[int], optional) – The random seed. Defaults to None.

  • n_jobs (int)

Return type:

None

set_estimators()

Set the XPySOM and nearest neighbor finder estimators.

Return type:

None

set_nodes(nodes)

Set the weights of a previously fit SOM.

Parameters:

nodes (NDArray) – The weights of the SOM nodes.

Return type:

None

use_gpu = True