spatiomic.dimension

Make all dimensionality reduction classes available in the dimension submodule.

Classes

pca

Initialise the PCA class, set the dimension count and the estimator.

som

Initialise a self-organising map with the provided configuration.

tsne

Initialise a tSNE estimator with the provided configuration.

umap

Initialise a UMAP estimator with the provided configuration.

Package Contents

class spatiomic.dimension.pca(dimension_count=2, batch_size=None, flavor='auto', use_gpu=True, **kwargs)

Bases: spatiomic.dimension._base.DimensionReducer

Initialise the PCA class, set the dimension count and the estimator.

Parameters:
  • dimension_count (int, optional) – The number of principal components that the data is to be reduced to. Defaults to 2.

  • batch_size (Optional[int], optional) – The batch size for the IncrementalPCA algorithm for a smaller memory profile. Defaults to None.

  • flavor (Literal["auto", "full", "incremental", "nmf"], optional) – The flavor of PCA to be used. Defaults to “auto”.

  • use_gpu (bool, optional) – Whether to use the cuml implementation on the GPU. Defaults to True.

  • kwargs (dict)
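
Example (a minimal construction sketch; the dimension count, flavor and batch size are illustrative choices, and use_gpu=False avoids the cuml dependency):

>>> from spatiomic.dimension import pca
>>> reducer = pca(dimension_count=3, use_gpu=False)
>>> reducer_incremental = pca(flavor="incremental", batch_size=4096, use_gpu=False)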

batch_size = None
dimension_count = 2
fit(data)

Fit the PCA estimator of the class to the data.

Parameters:

data (NDArray) – The data (channel-last) to be reduced in dimensionality.

Return type:

None

fit_transform(data, flatten=True)

Perform principal-component analysis on the data with the settings of the class.

Parameters:
  • data (NDArray) – The data (channel-last) to be reduced in dimensionality.

  • flatten (bool, optional) – Whether to flatten the data in every dimension but the channel dimension. Defaults to True.

Returns:

The principal components of the data.

Return type:

NDArray
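
Example (a minimal sketch; the random channel-last image stands in for real data, and the expected output shape follows from the flatten behaviour described above):

>>> import numpy as np
>>> from spatiomic.dimension import pca
>>> data = np.random.rand(64, 64, 10).astype(np.float32)  # channel-last image
>>> reducer = pca(dimension_count=3, use_gpu=False)
>>> components = reducer.fit_transform(data)  # expected shape: (64 * 64, 3)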

get_explained_variance_ratio()

Get the explained variance ratio of the principal components.

Returns:

The explained variance ratio of the principal components.

Return type:

NDArray

get_loadings()

Get the loadings of the principal components.

Returns:

The loadings of the principal components.

Return type:

NDArray
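
Example (a minimal sketch; the random data is illustrative and the shape comments are assumptions based on the descriptions above):

>>> import numpy as np
>>> from spatiomic.dimension import pca
>>> reducer = pca(dimension_count=3, use_gpu=False)
>>> reducer.fit(np.random.rand(1000, 10).astype(np.float32))
>>> variance_ratio = reducer.get_explained_variance_ratio()  # one ratio per component
>>> loadings = reducer.get_loadings()  # channel contributions to each component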

set_estimator(flavor='auto', **kwargs)

Set the PCA estimator for the chosen flavor.

Parameters:
  • flavor (Literal["auto", "full", "incremental", "nmf"], optional) – The flavor of PCA to be used. When set to “auto” or “full”, the full PCA implementation is used. When set to “incremental”, the IncrementalPCA implementation is used. When set to “nmf”, a non-negative matrix factorisation estimator is used instead. In the future, the “auto” flavor may choose the best implementation based on available memory. Defaults to “auto”.

  • kwargs (dict)

Raises:

ValueError – If the flavor is not supported.

Return type:

None

transform(data, flatten=True)

Transform the data into principal components with a previously fitted PCA estimator.

Parameters:
  • data (NDArray) – The data (channel-last) to be reduced in dimensionality.

  • flatten (bool, optional) – Whether to flatten the data in every dimension but the channel dimension. Defaults to True.

Returns:

The principal components of the data.

Return type:

NDArray

use_gpu = True
weigh_by_explained_variance_ratio(data, adapt_mean=False, flatten=True)

Weigh a PCA-transformed NDArray by the explained variance ratio of the principal components.

Parameters:
  • data (NDArray) – The principal components.

  • adapt_mean (bool, optional) – Whether to adapt the mean value from the first component. Defaults to False.

  • flatten (bool, optional) – Whether to flatten the data in every dimension but the channel dimension. Defaults to True.

Returns:

The weighted and optionally mean-adapted principal components.

Return type:

NDArray
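
Example (a minimal sketch; the random data is illustrative):

>>> import numpy as np
>>> from spatiomic.dimension import pca
>>> reducer = pca(dimension_count=3, use_gpu=False)
>>> components = reducer.fit_transform(np.random.rand(1000, 10).astype(np.float32))
>>> weighted = reducer.weigh_by_explained_variance_ratio(components)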

class spatiomic.dimension.som(node_count=(50, 50), dimension_count=5, distance_metric='euclidean', neighborhood='gaussian', neighborhood_accuracy='fast', learning_rate_initial=0.2, learning_rate_final=0.003, sigma_initial=None, sigma_final=None, parallel_count=8096, n_jobs=-1, seed=None, use_gpu=True)

Bases: spatiomic.dimension._base.LoadableDimensionReducer

Initialise a self-organising map with the provided configuration.

The advantage of self-organising maps is that they reduce the dimensionality of the data while preserving the feature dimensions. This potentially allows for a better interpretation of the data. The disadvantage is that their output consists of representative nodes rather than the actual data points. There are four things to keep in mind so that the SOM best represents the biology of your data:

  • The SOM node count should be large enough to capture the topography of the data. If your data is very uniform, you can use a smaller node count. However, when working with tissue, very different imaging markers and multiple disease states, a larger node count is recommended.

  • The SOM should be trained for long enough to capture the topography of the data.

  • The SOM should be trained with a final learning rate that is not too high, so that the SOM can accurately represent small differences in the data.

  • The SOM should be trained with a final neighborhood size that is not too large. SOM nodes are not individually updated during training, but rather in a neighborhood. If your neighborhood is too large, other, perhaps more abundant biological signals will pull nodes that represent less abundant signals towards them, leading to a worse representation of the latter.

Parameters:
  • node_count (Tuple[int, int], optional) – Tuple determining the SOM node size. Defaults to (50, 50).

  • dimension_count (int, optional) – Dimensionality of the original data. Defaults to 5.

  • distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – Distance metric to use. Defaults to “euclidean”.

  • neighborhood (str, optional) – The type of neighborhood used to determine related nodes. Defaults to “gaussian”.

  • neighborhood_accuracy (Literal["fast", "accurate"], optional) – The accuracy to use for the neighborhood. Defaults to “fast”.

  • learning_rate_initial (float, optional) – The initial learning rate. Defaults to 0.2.

  • learning_rate_final (float, optional) – The learning rate at the end of the SOM training. Defaults to 3e-3.

  • sigma_initial (Optional[int], optional) – The initial size of the neighborhood, typically a higher value than sigma_final. Defaults to None.

  • sigma_final (Optional[int], optional) – The final size of the neighborhood, typically a lower value than sigma_initial. Defaults to None.

  • parallel_count (int, optional) – Data points to process concurrently. Defaults to 8096.

  • n_jobs (int, optional) – Jobs to perform simultaneously when using the sklearn NearestNeighbors class. The value -1 means that all available processors are used. Defaults to -1.

  • seed (Optional[int], optional) – The random seed. Defaults to None.

  • use_gpu (bool, optional) – Whether to use cupy. Defaults to True.
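
Example (a minimal construction sketch; the values are illustrative, and dimension_count has to match the channel count of your data):

>>> from spatiomic.dimension import som
>>> som_estimator = som(
...     node_count=(50, 50),
...     dimension_count=10,
...     learning_rate_initial=0.2,
...     learning_rate_final=0.003,
...     seed=42,
...     use_gpu=False,
... )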

fit(data, iteration_count=50, pca_init=False)

Fit the SOM on the data.

Parameters:
  • data (Union[pd.DataFrame, NDArray]) – The data (channel-last) of which the SOM should capture the topography.

  • iteration_count (int, optional) – The iterations to train the SOM for. Defaults to 50.

  • pca_init (bool, optional) – Whether to initialise the SOM through PCA. Defaults to False.

Return type:

None

fit_predict(data, iteration_count=50, pca_init=False, return_distance=False)

Fit the SOM on the data and return the id of the nearest SOM node for each data point.

Parameters:
  • data (NDArray) – The data to be labelled based on the label of the nearest SOM node.

  • iteration_count (int, optional) – The iterations to train the SOM for. Defaults to 50.

  • pca_init (bool, optional) – Whether to initialise the SOM through PCA. Defaults to False.

  • return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.

Returns:

The id of the nearest SOM node for each data point, or the ids and distances.

Return type:

Union[NDArray, Tuple[NDArray, NDArray]]
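
Example (a minimal sketch; the random channel-last data and the configuration values are illustrative):

>>> import numpy as np
>>> from spatiomic.dimension import som
>>> data = np.random.rand(10000, 10).astype(np.float32)
>>> som_estimator = som(node_count=(20, 20), dimension_count=10, seed=42, use_gpu=False)
>>> node_ids, distances = som_estimator.fit_predict(data, iteration_count=50, return_distance=True)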

get_config()

Get the config of the SOM class.

Returns:

The class configuration as a dictionary.

Return type:

dict

get_nodes(flatten=True)

Get the weights of a previously fitted SOM.

Parameters:

flatten (bool, optional) – Whether to flatten the SOM dimensions (but not the channel dimensionality). Defaults to True.

Raises:

ValueError – If no self-organizing map has been created or loaded.

Returns:

The weights of the SOM nodes.

Return type:

NDArray

get_quantization_error(data, return_distances=False)

Get the quantization error of the SOM on the provided data.

Uses the neighbor finder to find the nearest neighbor of each data point in the SOM and calculates the quantization error based on the distance between the data point and its nearest neighbor.

Parameters:
  • data (NDArray) – The data to get the quantization error for.

  • return_distances (bool, optional) – Whether to also return all distances. Defaults to False.

Returns:

The mean quantization error, and all distances if return_distances is set to True.

Return type:

Union[float, Tuple[float, np.ndarray]]
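
Example (a minimal sketch, assuming the fitted som_estimator and data from the fit_predict example above; a lower mean error indicates that the nodes represent the data more closely):

>>> error = som_estimator.get_quantization_error(data)
>>> error, distances = som_estimator.get_quantization_error(data, return_distances=True)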

label(data, clusters, save_path=None, return_distance=False, flatten=False)

Get the label for each data point based on the label for its closest SOM node.

This function internally uses the predict method to get the nearest node for each data point and then assigns the label of the nearest node to the data point. Labels have to be provided for each SOM node based on a clustering of the SOM nodes.

Parameters:
  • data (NDArray) – The data to be labelled based on the label of the nearest SOM node.

  • clusters (Union[List[int], NDArray]) – A list of clusters (one for each SOM node).

  • save_path (Optional[str], optional) – The path where to save the SOM and its configuration to. Defaults to None.

  • return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.

  • flatten (bool, optional) – Whether to flatten the input data in every dimension but the channel dimension. Defaults to False.

Returns:

The cluster for each data point or the clusters and distances.

Return type:

Union[NDArray, Tuple[NDArray, NDArray]]
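
Example (a sketch of the intended workflow; KMeans stands in here for any clustering of the SOM nodes and is not prescribed by spatiomic):

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from spatiomic.dimension import som
>>> data = np.random.rand(10000, 10).astype(np.float32)
>>> som_estimator = som(node_count=(20, 20), dimension_count=10, seed=42, use_gpu=False)
>>> som_estimator.fit(data, iteration_count=50)
>>> clusters = KMeans(n_clusters=8, random_state=0).fit_predict(som_estimator.get_nodes())
>>> labels = som_estimator.label(data, clusters=clusters)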

load(save_path)

Load a previously pickled SOM.

Parameters:

save_path (str) – The path where to load the SOM and its configuration from.

Return type:

None

predict(data, return_distance=False)

Get the id of the nearest SOM node for each data point and optionally the distance to the node.

Parameters:
  • data (NDArray) – The data to be labelled based on the label of the nearest SOM node.

  • return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.

Returns:

The labels for each data point or the labels and distances.

Return type:

Union[NDArray, Tuple[NDArray, NDArray]]

save(save_path)

Pickle and save a previously fit SOM.

Parameters:

save_path (str) – The path where to save the SOM and its configuration to.

Raises:

ValueError – If no self-organizing map has been created or loaded.

Return type:

None
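
Example (a round-trip sketch; the save path is illustrative and som_estimator is assumed to be a previously fitted instance):

>>> som_estimator.save("./som_checkpoint")
>>> som_restored = som(use_gpu=False)
>>> som_restored.load("./som_checkpoint")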

set_config(node_count=(50, 50), dimension_count=5, distance_metric='euclidean', neighborhood='gaussian', neighborhood_accuracy='fast', learning_rate_initial=0.2, learning_rate_final=0.003, sigma_initial=None, sigma_final=None, parallel_count=8096, n_jobs=-1, seed=None)

Set the config of the SOM class.

Parameters:
  • node_count (Tuple[int, int], optional) – Tuple determining the SOM node size. Defaults to (50, 50).

  • dimension_count (int, optional) – Dimensionality of the original data. Defaults to 5.

  • distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – Distance metric to use. Defaults to “euclidean”.

  • neighborhood (str, optional) – The type of neighborhood used to determine related nodes. Defaults to “gaussian”.

  • neighborhood_accuracy (Literal["fast", "accurate"], optional) – The accuracy to use for the neighborhood. Defaults to “fast”.

  • learning_rate_initial (float, optional) – The initial learning rate. Defaults to 0.2.

  • learning_rate_final (float, optional) – The learning rate at the end of the SOM training. Defaults to 3e-3.

  • sigma_initial (Optional[int], optional) – The initial size of the neighborhood, typically a higher value than sigma_final. Defaults to None.

  • sigma_final (Optional[int], optional) – The final size of the neighborhood, typically a lower value than sigma_initial. Defaults to None.

  • parallel_count (int, optional) – Data points to process concurrently. Defaults to 8096.

  • seed (Optional[int], optional) – The random seed. Defaults to None.

  • n_jobs (int, optional) – Jobs to perform simultaneously when using the sklearn NearestNeighbors class. The value -1 means that all available processors are used. Defaults to -1.

Return type:

None

set_estimators()

Set the XPySOM and nearest neighbor finder estimators.

Return type:

None

set_nodes(nodes)

Set the weights of a previously fit SOM.

Parameters:

nodes (NDArray) – The weights of the SOM nodes.

Return type:

None

use_gpu = True
class spatiomic.dimension.tsne(dimension_count=2, distance_metric='euclidean', iteration_count=1000, iteration_count_without_progress=300, learning_rate=200.0, perplexity=50.0, seed=None, use_gpu=True)

Bases: spatiomic.dimension._base.DimensionReducer

Initialise a tSNE estimator with the provided configuration.

Parameters:
  • dimension_count (int, optional) – The dimensions to reduce the data to. Defaults to 2.

  • distance_metric (Literal[…], optional) – tSNE distance metric. Defaults to “euclidean”.

  • iteration_count (int, optional) – tSNE algorithm iteration count. Defaults to 1000.

  • iteration_count_without_progress (int, optional) – The maximum number of iterations to continue without progress being made. Defaults to 300.

  • learning_rate (float, optional) – tSNE learning rate. Defaults to 200.0.

  • perplexity (float, optional) – Determines the spread of the tSNE data points. Defaults to 50.0.

  • seed (Optional[int], optional) – Random seed. Defaults to None.

  • use_gpu (bool, optional) – Whether to use the cuml implementation on the GPU. Defaults to True.

dimension_count = 2
distance_metric = 'euclidean'
fit_transform(data, flatten=True)

Fit a tSNE estimator and transform the tSNE dimensions for the data.

Parameters:
  • data (NDArray) – The data (channel-last) to be reduced in dimensionality.

  • flatten (bool, optional) – Whether to flatten the data in every dimension but the channel dimension. Defaults to True.

Returns:

The tSNE representation of the data.

Return type:

NDArray
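
Example (a minimal sketch; the random data and the perplexity value are illustrative):

>>> import numpy as np
>>> from spatiomic.dimension import tsne
>>> data = np.random.rand(2000, 10).astype(np.float32)
>>> embedding = tsne(dimension_count=2, perplexity=30.0, seed=0, use_gpu=False).fit_transform(data)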

iteration_count = 1000
iteration_count_without_progress = 300
learning_rate = 200.0
perplexity = 50.0
seed = None
set_estimator()

Set the tSNE estimator.

Return type:

None

use_gpu = True
class spatiomic.dimension.umap(dimension_count=2, distance_min=0.2, distance_metric='euclidean', spread=1.0, neighbor_count=100, seed=None, use_gpu=True, **kwargs)

Bases: spatiomic.dimension._base.DimensionReducer

Initialise a UMAP estimator with the provided configuration.

Keyword arguments are passed on to the UMAP estimator, so that it is possible, for example, to use a precomputed_knn.

Parameters:
  • dimension_count (int, optional) – The desired (reduced) dimensionality. Defaults to 2.

  • distance_min (float, optional) – The effective minimum distance between embedded points. Defaults to 0.2.

  • distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – The distance metric to be used for nearest neighbor calculation. Defaults to “euclidean”.

  • spread (float, optional) – The effective scale of embedded points; together with distance_min, it determines how clustered the embedding is. Defaults to 1.0.

  • neighbor_count (int, optional) – The size of the local neighborhood used for manifold approximation; larger values emphasise global structure. Defaults to 100.

  • seed (Optional[int], optional) – Random seed. Defaults to None.

  • use_gpu (bool, optional) – Whether to use the cuml implementation on the GPU. Defaults to True.

  • kwargs (dict)
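
Example (a minimal construction sketch; the parameter values are illustrative):

>>> from spatiomic.dimension import umap
>>> reducer = umap(dimension_count=2, neighbor_count=30, distance_min=0.1, seed=0, use_gpu=False)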

dimension_count = 2
distance_metric = 'euclidean'
distance_min = 0.2
fit(data)

Fit the UMAP estimator on the data.

Parameters:

data (NDArray) – The data (channel-last) to fit the UMAP on.

Return type:

None

fit_transform(data, flatten=True)

Fit a UMAP estimator and transform the UMAP dimensions for the data.

Parameters:
  • data (NDArray) – The data (channel-last) to be reduced in dimensionality.

  • flatten (bool, optional) – Whether to flatten the data in every dimension but the channel dimension. Defaults to True.

Returns:

The UMAP representation of the data.

Return type:

NDArray

neighbor_count = 100
seed = None
set_estimator(**kwargs)

Set the UMAP estimator.

Parameters:

kwargs (dict)

Return type:

None

spread = 1.0
transform(data, flatten=True)

Transform the UMAP dimensions for the data with a previously fit estimator.

Parameters:
  • data (NDArray) – The data (channel-last) to be reduced in dimensionality.

  • flatten (bool, optional) – Whether to flatten the data in every dimension but the channel dimension. Defaults to True.

Returns:

The UMAP representation of the data.

Return type:

NDArray
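
Example (a minimal sketch; fitting on one dataset and transforming another, with random arrays standing in for real data):

>>> import numpy as np
>>> from spatiomic.dimension import umap
>>> reducer = umap(dimension_count=2, neighbor_count=15, seed=0, use_gpu=False)
>>> reducer.fit(np.random.rand(5000, 10).astype(np.float32))
>>> embedding = reducer.transform(np.random.rand(1000, 10).astype(np.float32))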

use_gpu = True