spatiomic.dimension¶
Make all dimensionality reduction classes available in the dimension submodule.
Classes¶
Package Contents¶
- class spatiomic.dimension.pca(dimension_count=2, batch_size=None, flavor='auto', use_gpu=True, **kwargs)¶
Bases:
spatiomic.dimension._base.DimensionReducer
Initialise the PCA class, set the dimension count and the estimator.
- Parameters:
dimension_count (int, optional) – The number of principal components that the data is to be reduced to. Defaults to 2.
batch_size (Optional[int], optional) – The batch size for the IncrementalPCA algorithm for a smaller memory profile. Defaults to None.
flavor (Literal["auto", "full", "incremental", "nmf"], optional) – The flavor of PCA to be used. Defaults to “auto”.
use_gpu (bool, optional) – Whether to use the cuml implementation on the GPU. Defaults to True.
kwargs (dict)
- batch_size = None¶
- dimension_count = 2¶
- fit(data)¶
Fit the PCA estimator of the class to the data.
- Parameters:
data (NDArray) – The data (channel-last) to be reduced in dimensionality.
- Return type:
None
- fit_transform(data, flatten=True)¶
Perform principal-component analysis on the data with the settings of the class.
- Parameters:
data (NDArray) – The data (channel-last) to be reduced in dimensionality.
flatten (bool, optional) – Whether to flatten the data in every dimension except the channel dimension. Defaults to True.
- Returns:
The principal components of the data.
- Return type:
NDArray
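The `flatten` option can be pictured with plain NumPy. The sketch below is an illustration, not spatiomic's internal code: it assumes a channel-last image of shape (height, width, channels) and uses a hypothetical helper, `flatten_channel_last`, that collapses every dimension except the last one so that each pixel becomes one row.

```python
import numpy as np


def flatten_channel_last(data: np.ndarray) -> np.ndarray:
    """Collapse all leading dimensions, keeping the channel dimension last.

    Hypothetical helper for illustration; not part of spatiomic.
    """
    return data.reshape(-1, data.shape[-1])


# A toy 4 x 3 image with 5 channels (channel-last).
image = np.arange(4 * 3 * 5, dtype=float).reshape(4, 3, 5)
pixels = flatten_channel_last(image)

print(pixels.shape)  # (12, 5)
```

Each row of `pixels` is the channel vector of one pixel, which is the layout a PCA estimator typically expects (samples by features).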
- get_explained_variance_ratio()¶
Get the explained variance ratio of the principal components.
- Returns:
The explained variance ratio of the principal components.
- Return type:
NDArray
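For orientation, the explained variance ratio can be computed by hand with NumPy. This is a minimal sketch under the usual PCA definition (singular values of the mean-centered data), not the estimator spatiomic uses; the function name `explained_variance_ratio` and the toy data are made up for this example.

```python
import numpy as np


def explained_variance_ratio(data: np.ndarray, dimension_count: int) -> np.ndarray:
    """Share of the total variance captured by the first principal components."""
    centered = data - data.mean(axis=0)
    # Singular values of the centered data determine the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2 / (data.shape[0] - 1)
    return variances[:dimension_count] / variances.sum()


rng = np.random.default_rng(0)
# 200 pixels with 5 channels; the first channel varies far more than the rest.
pixels = rng.normal(size=(200, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
ratio = explained_variance_ratio(pixels, dimension_count=2)
```

In this toy data the first component dominates, so the first entry of `ratio` is close to 1 and the second is small.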
- get_loadings()¶
Get the loadings of the principal components.
- Returns:
The loadings of the principal components.
- Return type:
NDArray
- set_estimator(flavor='auto', **kwargs)¶
Set the PCA estimator for the selected flavor.
- Parameters:
flavor (Literal["auto", "full", "incremental", "nmf"], optional) – The flavor of PCA to be used. When set to “auto” or “full” the full PCA implementation is used. When set to “incremental”, the IncrementalPCA implementation is used. The “auto” flavor may try to determine the best implementation based on free memory in the future. Defaults to “auto”.
kwargs (dict)
- Raises:
ValueError – If the flavor is not supported.
- Return type:
None
- transform(data, flatten=True)¶
Transform the data with the previously fitted PCA estimator of the class.
- Parameters:
data (NDArray) – The data (channel-last) to be reduced in dimensionality.
flatten (bool, optional) – Whether to flatten the data in every dimension except the channel dimension. Defaults to True.
- Returns:
The principal components of the data.
- Return type:
NDArray
- use_gpu = True¶
- weigh_by_explained_variance_ratio(data, adapt_mean=False, flatten=True)¶
Weigh a PCA-transformed NDArray by the explained variance ratio of the principal components.
- Parameters:
- Returns:
The weighted and optionally mean-adapted principal components.
- Return type:
NDArray
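One plausible reading of this weighting is to scale each principal-component column by its explained variance ratio, so that components carrying more variance contribute more to downstream distances. The sketch below shows only that scaling step with invented numbers; it is an assumption about the operation, and it does not model the `adapt_mean` option.

```python
import numpy as np

# Toy PCA output: 3 data points, 2 principal components (values are invented).
components = np.array(
    [
        [1.0, 2.0],
        [-1.0, 0.5],
        [0.0, -2.5],
    ]
)
# Assumed explained variance ratios for the two components.
ratio = np.array([0.8, 0.2])

# Broadcasting multiplies every column by its ratio.
weighted = components * ratio
```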
- class spatiomic.dimension.som(node_count=(50, 50), dimension_count=5, distance_metric='euclidean', neighborhood='gaussian', neighborhood_accuracy='fast', learning_rate_initial=0.2, learning_rate_final=0.003, sigma_initial=None, sigma_final=None, parallel_count=8096, n_jobs=-1, seed=None, use_gpu=True)¶
Bases:
spatiomic.dimension._base.LoadableDimensionReducer
Initialise a self-organising map with the provided configuration.
The advantage of self-organizing maps is that they reduce the dimensionality of the data while preserving the feature dimensions. This potentially allows for a better interpretation of the data. The disadvantage is that their output consists of representative nodes and not the actual data points. There are four things to keep in mind so that the SOM best represents the biology of your data:
- The SOM node count should be large enough to capture the topography of the data. If your data is very uniform, you can use a smaller node count. However, when working with tissue, very different imaging markers and multiple disease states, a larger node count is recommended.
- The SOM should be trained for long enough to capture the topography of the data.
- The SOM should be trained with a final learning rate that is not too high, so that the SOM can accurately represent small differences in the data.
- The SOM should be trained with a final neighborhood size that is not too large. SOM nodes are not individually updated during training, but rather in a neighborhood. If your neighborhood is too large, other, perhaps more abundant biological signals will pull nodes that represent less abundant signals towards them, leading to a worse representation of the latter.
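The points about learning rate and neighborhood size can be made concrete with a small NumPy sketch of a single SOM update. It is a generic textbook-style update with a Gaussian neighborhood, written for illustration; it is not the XPySOM implementation that spatiomic wraps, and the exponential decay schedule and all function names are assumptions.

```python
import numpy as np


def decay(initial: float, final: float, step: int, step_count: int) -> float:
    """Exponentially interpolate from `initial` to `final` (assumed schedule)."""
    fraction = step / max(step_count - 1, 1)
    return initial * (final / initial) ** fraction


def som_update(nodes, grid, sample, learning_rate, sigma):
    """Move every node towards `sample`, weighted by grid distance to the winner."""
    # Best-matching unit: the node closest to the sample in feature space.
    winner = np.argmin(np.linalg.norm(nodes - sample, axis=1))
    # Gaussian neighborhood measured on the 2D node grid, not in feature space.
    grid_distance_sq = np.sum((grid - grid[winner]) ** 2, axis=1)
    influence = np.exp(-grid_distance_sq / (2.0 * sigma**2))
    return nodes + learning_rate * influence[:, None] * (sample - nodes)


rng = np.random.default_rng(0)
# A 5 x 5 grid of nodes, each with 3 feature dimensions.
grid = np.array([(row, col) for row in range(5) for col in range(5)], dtype=float)
nodes = rng.random((25, 3))
sample = np.array([0.9, 0.1, 0.5])

# Learning rate schedule from the documented defaults (0.2 down to 0.003).
rates = [decay(0.2, 0.003, step, 50) for step in range(50)]

# A large sigma drags the whole map; a small sigma mostly moves the winner.
wide = som_update(nodes, grid, sample, learning_rate=0.2, sigma=5.0)
narrow = som_update(nodes, grid, sample, learning_rate=0.2, sigma=0.5)
shift_wide = np.linalg.norm(wide - nodes, axis=1).sum()
shift_narrow = np.linalg.norm(narrow - nodes, axis=1).sum()
```

With the wide neighborhood, a single sample shifts nodes across the whole grid, which is how abundant signals can pull nodes away from rarer ones.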
- Parameters:
node_count (Tuple[int, int], optional) – Tuple determining the SOM node size. Defaults to (50, 50).
dimension_count (int, optional) – Dimensionality of the original data. Defaults to 5.
distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – Distance metric to use. Defaults to “euclidean”.
neighborhood (str, optional) – The type of neighborhood used to determine related nodes. Defaults to “gaussian”.
neighborhood_accuracy (Literal["fast", "accurate"], optional) – The accuracy to use for the neighborhood. Defaults to “fast”.
learning_rate_initial (float, optional) – The initial learning rate. Defaults to 0.2.
learning_rate_final (float, optional) – The learning rate at the end of the SOM training. Defaults to 3e-3.
sigma_initial (Optional[int], optional) – The initial size of the neighborhoods, higher values. Defaults to None.
sigma_final (Optional[int], optional) – The final size of the neighborhood, lower values. Defaults to None.
parallel_count (int, optional) – Data points to process concurrently. Defaults to 8096.
n_jobs (int, optional) – Jobs to perform simultaneously when using the sklearn NearestNeighbors class. The value -1 uses all available processors. Defaults to -1.
seed (Optional[int], optional) – The random seed. Defaults to None.
use_gpu (bool, optional) – Whether to use cupy. Defaults to True.
- fit(data, iteration_count=50, pca_init=False)¶
Fit the SOM on the data.
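- Parameters:
data (NDArray) – The data to fit the SOM on.
iteration_count (int, optional) – The iterations to train the SOM for. Defaults to 50.
pca_init (bool, optional) – Whether to initialise the SOM through PCA. Defaults to False.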
- Return type:
None
- fit_predict(data, iteration_count=50, pca_init=False, return_distance=False)¶
Fit the SOM on the data and return the id of the nearest SOM node for each data point.
- Parameters:
data (NDArray) – The data to be labelled based on the label of the nearest SOM node.
iteration_count (int, optional) – The iterations to train the SOM for. Defaults to 50.
pca_init (bool, optional) – Whether to initialise the SOM through PCA. Defaults to False.
return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.
- Returns:
The id of the nearest SOM node for each data point or the ids and distances.
- Return type:
Union[NDArray, Tuple[NDArray, NDArray]]
- get_config()¶
Get the config of the SOM class.
- Returns:
The class configuration as a dictionary.
- Return type:
- get_nodes(flatten=True)¶
Get the weights of a previously fitted SOM.
- Parameters:
flatten (bool, optional) – Whether to flatten the SOM dimensions (but not the channel dimensionality). Defaults to True.
- Raises:
ValueError – If no self-organizing map has been created or loaded.
- Returns:
The weights of the SOM nodes.
- Return type:
NDArray
- get_quantization_error(data, return_distances=False)¶
Get the quantization error of the SOM on the provided data.
Uses the neighbor finder to find the nearest neighbor of each data point in the SOM and calculates the quantization error based on the distance between the data point and its nearest neighbor.
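As a rough illustration of the idea, the quantization error can be computed with NumPy as the mean distance between each data point and its closest node. This sketch assumes Euclidean distance, the common mean-distance definition, and a brute-force search rather than the neighbor finder mentioned above; `quantization_error` is a hypothetical function name.

```python
import numpy as np


def quantization_error(data: np.ndarray, nodes: np.ndarray) -> float:
    """Mean Euclidean distance from each data point to its nearest node."""
    # Pairwise distances, shape (point count, node count).
    distances = np.linalg.norm(data[:, None, :] - nodes[None, :, :], axis=2)
    return float(distances.min(axis=1).mean())


nodes = np.array([[0.0, 0.0], [10.0, 0.0]])
data = np.array([[0.0, 3.0], [10.0, 4.0], [1.0, 0.0]])

# Nearest-node distances are 3, 4 and 1, so the error is their mean.
error = quantization_error(data, nodes)
```

A lower value means the nodes sit closer to the data they are meant to represent.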
- label(data, clusters, save_path=None, return_distance=False, flatten=False)¶
Get the label for each data point based on the label for its closest SOM node.
This function internally uses the predict method to get the nearest node for each data point and then assigns the label of the nearest node to the data point. Labels have to be provided for each SOM node based on a clustering of the SOM nodes.
- Parameters:
data (NDArray) – The data to be labelled based on the label of the nearest SOM node.
clusters (Union[List[int], NDArray]) – A list of clusters (one for each SOM node).
save_path (Optional[str], optional) – The path where to save the SOM and its configuration to. Defaults to None.
return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.
flatten (bool, optional) – Whether to flatten the input data in every dimension except the channel dimension. Defaults to False.
- Returns:
The cluster for each data point or the clusters and distances.
- Return type:
Union[NDArray, Tuple[NDArray, NDArray]]
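The lookup described here amounts to indexing the per-node cluster array with each data point's nearest-node id. A minimal NumPy sketch, assuming the node ids have already been obtained (for example from `predict`) and using invented values:

```python
import numpy as np

# One cluster label per SOM node (here: 6 nodes grouped into 3 clusters).
clusters = np.array([0, 0, 1, 1, 2, 2])
# Nearest-node id for each of 5 data points (invented values).
node_ids = np.array([4, 0, 3, 3, 5])

# Fancy indexing assigns every data point the cluster of its nearest node.
labels = clusters[node_ids]
print(labels)  # [2 0 1 1 2]
```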
- load(save_path)¶
Load a previously pickled SOM.
- Parameters:
save_path (str) – The path where to load the SOM and its configuration from.
- Return type:
None
- predict(data, return_distance=False)¶
Get the id of the nearest SOM node for each data point and optionally the distance to the node.
- Parameters:
data (NDArray) – The data to be labelled based on the label of the nearest SOM node.
return_distance (bool, optional) – Whether to return the distance to the nearest node. Defaults to False.
- Returns:
The labels for each data point or the labels and distances.
- Return type:
Union[NDArray, Tuple[NDArray, NDArray]]
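Conceptually, prediction is a nearest-neighbor search over the SOM nodes. The brute-force NumPy sketch below assumes Euclidean distance and flattened nodes; it stands in for whatever neighbor finder the library uses, and `nearest_node` is a hypothetical name.

```python
import numpy as np


def nearest_node(data, nodes, return_distance=False):
    """Return the index of the closest node per data point, optionally with distances."""
    # Pairwise distances, shape (point count, node count).
    distances = np.linalg.norm(data[:, None, :] - nodes[None, :, :], axis=2)
    ids = distances.argmin(axis=1)
    if return_distance:
        return ids, distances[np.arange(len(data)), ids]
    return ids


nodes = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
data = np.array([[1.0, 0.0], [9.0, 1.0], [4.0, 6.0]])

ids, dist = nearest_node(data, nodes, return_distance=True)
print(ids)  # [0 2 1]
```

The optional second return value mirrors the `return_distance` flag documented above.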
- save(save_path)¶
Pickle and save a previously fit SOM.
- Parameters:
save_path (str) – The path where to save the SOM and its configuration to.
- Raises:
ValueError – If no self-organizing map has been created or loaded.
- Return type:
None
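The save and load pair can be pictured as an ordinary pickle round trip. The sketch below is an assumption about the general pattern, not the library's file layout: it stores a dictionary holding node weights and a config, both invented for the example, in a temporary file.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Invented stand-ins for a fitted SOM's weights and configuration.
state = {
    "nodes": np.zeros((4, 3)),
    "config": {"node_count": (2, 2), "dimension_count": 3},
}

with tempfile.TemporaryDirectory() as directory:
    save_path = Path(directory) / "som.pkl"
    # Save: serialise the state to disk.
    with save_path.open("wb") as handle:
        pickle.dump(state, handle)
    # Load: read it back into a fresh object.
    with save_path.open("rb") as handle:
        restored = pickle.load(handle)
```

As with any pickle file, only load files from sources you trust.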
- set_config(node_count=(50, 50), dimension_count=5, distance_metric='euclidean', neighborhood='gaussian', neighborhood_accuracy='fast', learning_rate_initial=0.2, learning_rate_final=0.003, sigma_initial=None, sigma_final=None, parallel_count=8096, n_jobs=-1, seed=None)¶
Set the config of the SOM class.
- Parameters:
node_count (Tuple[int, int], optional) – Tuple determining the SOM node size. Defaults to (50, 50).
dimension_count (int, optional) – Dimensionality of the original data. Defaults to 5.
distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – Distance metric to use. Defaults to “euclidean”.
neighborhood (str, optional) – The type of neighborhood used to determine related nodes. Defaults to “gaussian”.
neighborhood_accuracy (Literal["fast", "accurate"], optional) – The accuracy to use for the neighborhood. Defaults to “fast”.
learning_rate_initial (float, optional) – The initial learning rate. Defaults to 0.2.
learning_rate_final (float, optional) – The learning rate at the end of the SOM training. Defaults to 3e-3.
sigma_initial (Optional[int], optional) – The initial size of the neighborhoods, higher values. Defaults to None.
sigma_final (Optional[int], optional) – The final size of the neighborhood, lower values. Defaults to None.
parallel_count (int, optional) – Data points to process concurrently. Defaults to 8096.
seed (Optional[int], optional) – The random seed. Defaults to None.
n_jobs (int)
- Return type:
None
- set_estimators()¶
Set the XPySOM and nearest neighbor finder estimators.
- Return type:
None
- set_nodes(nodes)¶
Set the weights of a previously fit SOM.
- Parameters:
nodes (NDArray) – The weights of the SOM nodes.
- Return type:
None
- use_gpu = True¶
- class spatiomic.dimension.tsne(dimension_count=2, distance_metric='euclidean', iteration_count=1000, iteration_count_without_progress=300, learning_rate=200.0, perplexity=50.0, seed=None, use_gpu=True)¶
Bases:
spatiomic.dimension._base.DimensionReducer
Initialise a tSNE estimator with the provided configuration.
- Parameters:
dimension_count (int, optional) – The dimensions to reduce the data to. Defaults to 2.
distance_metric (Literal[, optional) – tSNE distance metric. Defaults to “euclidean”.
iteration_count (int, optional) – tSNE algorithm iteration count. Defaults to 1000.
iteration_count_without_progress (int, optional) – Iterations to continue without progress being made. Defaults to 300.
learning_rate (float, optional) – tSNE learning rate. Defaults to 200.0.
perplexity (float, optional) – Determines the spread of the tSNE data points. Defaults to 50.0.
seed (Optional[int], optional) – Random seed. Defaults to None.
use_gpu (bool, optional) – Whether to use the cuml implementation on the GPU. Defaults to True.
- dimension_count = 2¶
- distance_metric = 'euclidean'¶
- fit_transform(data, flatten=True)¶
Fit a tSNE estimator and transform the tSNE dimensions for the data.
- Parameters:
data (NDArray) – The data (channel-last) to be reduced in dimensionality.
flatten (bool, optional) – Whether to flatten the data in every dimension except the channel dimension. Defaults to True.
- Returns:
The tSNE representation of the data.
- Return type:
NDArray
- iteration_count = 1000¶
- iteration_count_without_progress = 300¶
- learning_rate = 200.0¶
- perplexity = 50.0¶
- seed = None¶
- set_estimator()¶
Set the tSNE estimator.
- Return type:
None
- use_gpu = True¶
- class spatiomic.dimension.umap(dimension_count=2, distance_min=0.2, distance_metric='euclidean', spread=1.0, neighbor_count=100, seed=None, use_gpu=True, **kwargs)¶
Bases:
spatiomic.dimension._base.DimensionReducer
Initialise a UMAP estimator with the provided configuration.
Keyword arguments are passed to the UMAP estimator, so that it is possible to use precomputed_knn, for example.
- Parameters:
dimension_count (int, optional) – The desired (reduced) dimensionality. Defaults to 2.
distance_min (float, optional) – A key parameter of the UMAP function. Defaults to 0.2.
distance_metric (Literal["euclidean", "manhattan", "correlation", "cosine"], optional) – The distance metric to be used for nearest neighbor calculation. Defaults to “euclidean”.
spread (float, optional) – A key parameter of the UMAP function. Defaults to 1.0.
neighbor_count (int, optional) – A key parameter of the UMAP function. Defaults to 100.
seed (Optional[int], optional) – Random seed. Defaults to None.
use_gpu (bool, optional) – Whether to use the cuml implementation on the GPU. Defaults to True.
kwargs (dict)
- dimension_count = 2¶
- distance_metric = 'euclidean'¶
- distance_min = 0.2¶
- fit(data)¶
Fit the UMAP estimator on the data.
- Parameters:
data (NDArray) – The data (channel-last) to fit the UMAP by.
- Return type:
None
- fit_transform(data, flatten=True)¶
Fit a UMAP estimator and transform the UMAP dimensions for the data.
- Parameters:
data (NDArray) – The data (channel-last) to be reduced in dimensionality.
flatten (bool, optional) – Whether to flatten the data in every dimension except the channel dimension. Defaults to True.
- Returns:
The UMAP representation of the data.
- Return type:
NDArray
- neighbor_count = 100¶
- seed = None¶
- spread = 1.0¶
- transform(data, flatten=True)¶
Transform the UMAP dimensions for the data with a previously fit estimator.
- Parameters:
data (NDArray) – The data (channel-last) to be reduced in dimensionality.
flatten (bool, optional) – Whether to flatten the data in every dimension except the channel dimension. Defaults to True.
- Returns:
The UMAP representation of the data.
- Return type:
NDArray
- use_gpu = True¶