spatiomic.dimension
===================

.. py:module:: spatiomic.dimension

.. autoapi-nested-parse::

   Make all dimensionality reduction classes available in the dimension submodule.


Classes
-------

.. autoapisummary::

   spatiomic.dimension.pca
   spatiomic.dimension.som
   spatiomic.dimension.tsne
   spatiomic.dimension.umap


Package Contents
----------------

.. py:class:: pca(dimension_count = 2, batch_size = None, flavor = 'auto', use_gpu = True, **kwargs)

   Bases: :py:obj:`spatiomic.dimension._base.DimensionReducer`

   Initialise the PCA class, set the dimension count and the estimator.

   :param dimension_count: The number of principal components that the data is to be reduced to. Defaults to 2.
   :type dimension_count: int, optional
   :param batch_size: The batch size for the IncrementalPCA algorithm for a smaller memory profile. Defaults to None.
   :type batch_size: Optional[int], optional
   :param flavor: The flavor of PCA to be used. Defaults to "auto".
   :type flavor: Literal["auto", "full", "incremental", "nmf"], optional
   :param use_gpu: Whether to use the cuml implementation on the GPU. Defaults to True.
   :type use_gpu: bool, optional


   .. py:attribute:: batch_size
      :value: None


   .. py:attribute:: dimension_count
      :value: 2


   .. py:method:: fit(data)

      Fit the PCA estimator of the class to the data.

      :param data: The data (channel-last) to be reduced in dimensionality.
      :type data: NDArray


   .. py:method:: fit_transform(data, flatten = True)

      Perform principal-component analysis on the data with the settings of the class.

      :param data: The data (channel-last) to be reduced in dimensionality.
      :type data: NDArray
      :param flatten: Whether to flatten the data in all but the channel dimension. Defaults to True.
      :type flatten: bool, optional

      :returns: The principal components of the data.
      :rtype: NDArray


   .. py:method:: get_explained_variance_ratio()

      Get the explained variance ratio of the principal components.

      :returns: The explained variance ratio of the principal components.
      :rtype: NDArray


   .. py:method:: get_loadings()

      Get the loadings of the principal components.

      :returns: The loadings of the principal components.
      :rtype: NDArray


   .. py:method:: set_estimator(flavor = 'auto', **kwargs)

      Set the PCA estimator.

      :param flavor: The flavor of PCA to be used. When set to "auto" or "full", the full PCA implementation is used. When set to "incremental", the IncrementalPCA implementation is used. The "auto" flavor may try to determine the best implementation based on free memory in the future. Defaults to "auto".
      :type flavor: Literal["auto", "full", "incremental", "nmf"], optional

      :raises ValueError: If the flavor is not supported.


   .. py:method:: transform(data, flatten = True)

      Perform principal-component analysis on the data with the settings of the class.

      :param data: The data (channel-last) to be reduced in dimensionality.
      :type data: NDArray
      :param flatten: Whether to flatten the data in all but the channel dimension. Defaults to True.
      :type flatten: bool, optional

      :returns: The principal components of the data.
      :rtype: NDArray


   .. py:attribute:: use_gpu
      :value: True


   .. py:method:: weigh_by_explained_variance_ratio(data, adapt_mean = False, flatten = True)

      Weigh a PCA-transformed NDArray by the explained variance ratio of the principal components.

      :param data: The principal components.
      :type data: NDArray
      :param adapt_mean: Whether to adapt the mean value from the first component. Defaults to False.
      :type adapt_mean: bool, optional
      :param flatten: Whether to flatten the data in all but the channel dimension. Defaults to True.
      :type flatten: bool, optional

      :returns: The weighted and optionally mean-adapted principal components.
      :rtype: NDArray


.. py:class:: som(node_count = (50, 50), dimension_count = 5, distance_metric = 'euclidean', neighborhood = 'gaussian', neighborhood_accuracy = 'fast', learning_rate_initial = 0.2, learning_rate_final = 0.003, sigma_initial = None, sigma_final = None, parallel_count = 8096, n_jobs = -1, seed = None, use_gpu = True)

   Bases: :py:obj:`spatiomic.dimension._base.LoadableDimensionReducer`

   Initialise a self-organising map with the provided configuration.

   The advantage of self-organizing maps is that they reduce the dimensionality of the data while preserving the feature dimensions. This potentially allows for a better interpretation of the data. The disadvantage is that their output consists of `representative` nodes and not the actual data points.

   There are four things to keep in mind so that the SOM best represents the biology of your data:

   - The SOM node count should be large enough to capture the topography of the data. If your data is very uniform, you can use a smaller node count. However, when working with tissue, very different imaging markers and multiple disease states, a larger node count is recommended.
   - The SOM should be trained for long enough to capture the topography of the data.
   - The SOM should be trained with a final learning rate that is not too high, so that the SOM can accurately represent small differences in the data.
   - The SOM should be trained with a final neighborhood size that is not too large. SOM nodes are not updated individually during training, but rather in a neighborhood. If your neighborhood is too large, other, perhaps more abundant biological signals will pull nodes that represent less abundant signals towards them, leading to a worse representation of the latter.

   :param node_count: Tuple determining the SOM node size. Defaults to (50, 50).
   :type node_count: Tuple[int, int], optional
   :param dimension_count: Dimensionality of the original data. Defaults to 5.
   :type dimension_count: int, optional
   :param distance_metric: Distance metric to use. Defaults to "euclidean".
   :type distance_metric: Literal["euclidean", "manhattan", "correlation", "cosine"], optional
   :param neighborhood: The type of neighborhood used to determine related nodes. Defaults to "gaussian".
   :type neighborhood: str, optional
   :param neighborhood_accuracy: The accuracy to use for the neighborhood. Defaults to "fast".
   :type neighborhood_accuracy: Literal["fast", "accurate"], optional
   :param learning_rate_initial: The initial learning rate. Defaults to 0.2.
   :type learning_rate_initial: float, optional
   :param learning_rate_final: The learning rate at the end of the SOM training. Defaults to 3e-3.
   :type learning_rate_final: float, optional
   :param sigma_initial: The initial size of the neighborhoods, higher values. Defaults to None.
   :type sigma_initial: Optional[int], optional
   :param sigma_final: The final size of the neighborhood, lower values. Defaults to None.
   :type sigma_final: Optional[int], optional
   :param parallel_count: Data points to process concurrently. Defaults to 8096.
   :type parallel_count: int, optional
   :param n_jobs: Jobs to perform simultaneously when using the sklearn NearestNeighbors class. The value -1 means unlimited jobs. Defaults to -1.
   :type n_jobs: int, optional
   :param seed: The random seed. Defaults to None.
   :type seed: Optional[int], optional
   :param use_gpu: Whether to use cupy. Defaults to True.
   :type use_gpu: bool, optional


   .. py:method:: fit(data, iteration_count = 50, pca_init = False)

      Fit the SOM on the data.

      :param data: The data (channel-last) of which the SOM should capture the topography.
      :type data: Union[pd.DataFrame, NDArray]
      :param iteration_count: The number of iterations to train the SOM for. Defaults to 50.
      :type iteration_count: int, optional
      :param pca_init: Whether to initialise the SOM through PCA. Defaults to False.
      :type pca_init: bool, optional


   .. py:method:: fit_predict(data, iteration_count = 50, pca_init = False, return_distance = False)

      Fit the SOM on the data and return the id of the nearest SOM node for each data point.

      :param data: The data to be labelled based on the label of the nearest SOM node.
      :type data: NDArray
      :param iteration_count: The number of iterations to train the SOM for. Defaults to 50.
      :type iteration_count: int, optional
      :param pca_init: Whether to initialise the SOM through PCA. Defaults to False.
      :type pca_init: bool, optional
      :param return_distance: Whether to return the distance to the nearest node. Defaults to False.
      :type return_distance: bool, optional

      :returns: The id of the nearest SOM node for each data point or the ids and distances.
      :rtype: Union[NDArray, Tuple[NDArray, NDArray]]


   .. py:method:: get_config()

      Get the config of the SOM class.

      :returns: The class configuration as a dictionary.
      :rtype: dict


   .. py:method:: get_nodes(flatten = True)

      Get the weights of a previously fitted SOM.

      :param flatten: Whether to flatten the SOM dimensions (but not the channel dimensionality). Defaults to True.
      :type flatten: bool, optional

      :raises ValueError: If no self-organizing map has been created or loaded.

      :returns: The weights of the SOM nodes.
      :rtype: NDArray


   .. py:method:: get_quantization_error(data, return_distances = False)

      Get the quantization error of the SOM on the provided data.

      Uses the neighbor finder to find the nearest neighbor of each data point in the SOM and calculates the quantization error based on the distance between the data point and its nearest neighbor.

      :param data: The data to get the quantization error for.
      :type data: NDArray

      :returns: The mean quantization error and all distances if `return_distances` is set to True.
      :rtype: Union[float, Tuple[float, np.ndarray]]


   .. py:method:: label(data, clusters, save_path = None, return_distance = False, flatten = False)

      Get the label for each data point based on the label for its closest SOM node.

      This function internally uses the predict method to get the nearest node for each data point and then assigns the label of the nearest node to the data point. Labels have to be provided for each SOM node based on a clustering of the SOM nodes.

      :param data: The data to be labelled based on the label of the nearest SOM node.
      :type data: NDArray
      :param clusters: A list of clusters (one for each SOM node).
      :type clusters: Union[List[int], NDArray]
      :param save_path: The path where to save the SOM and its configuration to. Defaults to None.
      :type save_path: Optional[str], optional
      :param return_distance: Whether to return the distance to the nearest node. Defaults to False.
      :type return_distance: bool, optional
      :param flatten: Whether to flatten the input data in all but the channel dimension. Defaults to False.
      :type flatten: bool, optional

      :returns: The cluster for each data point or the clusters and distances.
      :rtype: Union[NDArray, Tuple[NDArray, NDArray]]


   .. py:method:: load(save_path)

      Load a previously pickled SOM.

      :param save_path: The path where to load the SOM and its configuration from.
      :type save_path: str


   .. py:method:: predict(data, return_distance = False)

      Get the id of the nearest SOM node for each data point and optionally the distance to the node.

      :param data: The data to be labelled based on the label of the nearest SOM node.
      :type data: NDArray
      :param return_distance: Whether to return the distance to the nearest node. Defaults to False.
      :type return_distance: bool, optional

      :returns: The labels for each data point or the labels and distances.
      :rtype: Union[NDArray, Tuple[NDArray, NDArray]]


   .. py:method:: save(save_path)

      Pickle and save a previously fit SOM.

      :param save_path: The path where to save the SOM and its configuration to.
      :type save_path: str

      :raises ValueError: If no self-organizing map has been created or loaded.


   .. py:method:: set_config(node_count = (50, 50), dimension_count = 5, distance_metric = 'euclidean', neighborhood = 'gaussian', neighborhood_accuracy = 'fast', learning_rate_initial = 0.2, learning_rate_final = 0.003, sigma_initial = None, sigma_final = None, parallel_count = 8096, n_jobs = -1, seed = None)

      Set the config of the SOM class.

      :param node_count: Tuple determining the SOM node size. Defaults to (50, 50).
      :type node_count: Tuple[int, int], optional
      :param dimension_count: Dimensionality of the original data. Defaults to 5.
      :type dimension_count: int, optional
      :param distance_metric: Distance metric to use. Defaults to "euclidean".
      :type distance_metric: Literal["euclidean", "manhattan", "correlation", "cosine"], optional
      :param neighborhood: The type of neighborhood used to determine related nodes. Defaults to "gaussian".
      :type neighborhood: str, optional
      :param neighborhood_accuracy: The accuracy to use for the neighborhood. Defaults to "fast".
      :type neighborhood_accuracy: Literal["fast", "accurate"], optional
      :param learning_rate_initial: The initial learning rate. Defaults to 0.2.
      :type learning_rate_initial: float, optional
      :param learning_rate_final: The learning rate at the end of the SOM training. Defaults to 3e-3.
      :type learning_rate_final: float, optional
      :param sigma_initial: The initial size of the neighborhoods, higher values. Defaults to None.
      :type sigma_initial: Optional[int], optional
      :param sigma_final: The final size of the neighborhood, lower values. Defaults to None.
      :type sigma_final: Optional[int], optional
      :param parallel_count: Data points to process concurrently. Defaults to 8096.
      :type parallel_count: int, optional
      :param seed: The random seed. Defaults to None.
      :type seed: Optional[int], optional


   .. py:method:: set_estimators()

      Set the XPySOM and nearest neighbor finder estimators.


   .. py:method:: set_nodes(nodes)

      Set the weights of a previously fit SOM.

      :param nodes: The weights of the SOM nodes.
      :type nodes: NDArray


   .. py:attribute:: use_gpu
      :value: True


.. py:class:: tsne(dimension_count = 2, distance_metric = 'euclidean', iteration_count = 1000, iteration_count_without_progress = 300, learning_rate = 200.0, perplexity = 50.0, seed = None, use_gpu = True)

   Bases: :py:obj:`spatiomic.dimension._base.DimensionReducer`

   Initialise a tSNE estimator with the provided configuration.

   :param dimension_count: The number of dimensions to reduce the data to. Defaults to 2.
   :type dimension_count: int, optional
   :param distance_metric: tSNE distance metric. Defaults to "euclidean".
   :type distance_metric: Literal[, optional
   :param iteration_count: tSNE algorithm iteration count. Defaults to 1000.
   :type iteration_count: int, optional
   :param iteration_count_without_progress: Iterations to continue without progress being made. Defaults to 300.
   :type iteration_count_without_progress: int, optional
   :param learning_rate: tSNE learning rate. Defaults to 200.0.
   :type learning_rate: float, optional
   :param perplexity: Determines the spread of the tSNE data points. Defaults to 50.0.
   :type perplexity: float, optional
   :param seed: Random seed. Defaults to None.
   :type seed: Optional[int], optional
   :param use_gpu: Whether to use the cuml implementation on the GPU. Defaults to True.
   :type use_gpu: bool, optional


   .. py:attribute:: dimension_count
      :value: 2


   .. py:attribute:: distance_metric
      :value: 'euclidean'


   .. py:method:: fit_transform(data, flatten = True)

      Fit a tSNE estimator and transform the tSNE dimensions for the data.

      :param data: The data (channel-last) to be reduced in dimensionality.
      :type data: NDArray
      :param flatten: Whether to flatten the data in all but the channel dimension. Defaults to True.
      :type flatten: bool, optional

      :returns: The tSNE representation of the data.
      :rtype: NDArray


   .. py:attribute:: iteration_count
      :value: 1000


   .. py:attribute:: iteration_count_without_progress
      :value: 300


   .. py:attribute:: learning_rate
      :value: 200.0


   .. py:attribute:: perplexity
      :value: 50.0


   .. py:attribute:: seed
      :value: None


   .. py:method:: set_estimator()

      Set the tSNE estimator.


   .. py:attribute:: use_gpu
      :value: True


.. py:class:: umap(dimension_count = 2, distance_min = 0.2, distance_metric = 'euclidean', spread = 1.0, neighbor_count = 100, seed = None, use_gpu = True, **kwargs)

   Bases: :py:obj:`spatiomic.dimension._base.DimensionReducer`

   Initialise a UMAP estimator with the provided configuration.

   Keyword arguments are passed to the UMAP estimator, so that it is possible to use `precomputed_knn`, for example.

   :param dimension_count: The desired (reduced) dimensionality. Defaults to 2.
   :type dimension_count: int, optional
   :param distance_min: A key parameter of the UMAP function. Defaults to 0.2.
   :type distance_min: float, optional
   :param distance_metric: The distance metric to be used for nearest neighbor calculation. Defaults to "euclidean".
   :type distance_metric: Literal["euclidean", "manhattan", "correlation", "cosine"], optional
   :param spread: A key parameter of the UMAP function. Defaults to 1.0.
   :type spread: float, optional
   :param neighbor_count: A key parameter of the UMAP function. Defaults to 100.
   :type neighbor_count: int, optional
   :param seed: Random seed. Defaults to None.
   :type seed: Optional[int], optional
   :param use_gpu: Whether to use the cuml implementation on the GPU. Defaults to True.
   :type use_gpu: bool, optional


   .. py:attribute:: dimension_count
      :value: 2


   .. py:attribute:: distance_metric
      :value: 'euclidean'


   .. py:attribute:: distance_min
      :value: 0.2


   .. py:method:: fit(data)

      Fit the UMAP estimator on the data.

      :param data: The data (channel-last) to fit the UMAP on.
      :type data: NDArray


   .. py:method:: fit_transform(data, flatten = True)

      Fit a UMAP estimator and transform the UMAP dimensions for the data.

      :param data: The data (channel-last) to be reduced in dimensionality.
      :type data: NDArray
      :param flatten: Whether to flatten the data in all but the channel dimension. Defaults to True.
      :type flatten: bool, optional

      :returns: The UMAP representation of the data.
      :rtype: NDArray


   .. py:attribute:: neighbor_count
      :value: 100


   .. py:attribute:: seed
      :value: None


   .. py:method:: set_estimator(**kwargs)

      Set the UMAP estimator.


   .. py:attribute:: spread
      :value: 1.0


   .. py:method:: transform(data, flatten = True)

      Transform the UMAP dimensions for the data with a previously fit estimator.

      :param data: The data (channel-last) to be reduced in dimensionality.
      :type data: NDArray
      :param flatten: Whether to flatten the data in all but the channel dimension. Defaults to True.
      :type flatten: bool, optional

      :returns: The UMAP representation of the data.
      :rtype: NDArray


   .. py:attribute:: use_gpu
      :value: True
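
The nearest-node labelling that ``som.predict``, ``som.label`` and ``som.get_quantization_error`` are documented to perform can be illustrated with a small, library-independent sketch. It is not the spatiomic implementation: the function names ``nearest_node_ids`` and ``label_by_nearest_node`` are hypothetical, the distance is assumed to be Euclidean, and plain Python lists stand in for the NDArray inputs and the flattened SOM node weights.

```python
import math
from typing import List, Sequence, Tuple


def nearest_node_ids(
    data: Sequence[Sequence[float]],
    nodes: Sequence[Sequence[float]],
) -> Tuple[List[int], List[float]]:
    """Return, for each data point, the index of and distance to its nearest node.

    Hypothetical helper; assumes Euclidean distance and a flattened node list.
    """
    ids: List[int] = []
    distances: List[float] = []
    for point in data:
        # Distance from this point to every node.
        point_distances = [math.dist(point, node) for node in nodes]
        # Index of the node with the smallest distance.
        best = min(range(len(nodes)), key=point_distances.__getitem__)
        ids.append(best)
        distances.append(point_distances[best])
    return ids, distances


def label_by_nearest_node(
    data: Sequence[Sequence[float]],
    nodes: Sequence[Sequence[float]],
    clusters: Sequence[int],
) -> List[int]:
    """Assign each data point the cluster of its nearest node (one cluster per node)."""
    ids, _ = nearest_node_ids(data, nodes)
    return [clusters[node_id] for node_id in ids]


# Toy example: three nodes in a two-channel feature space, clustered into two groups.
nodes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
clusters = [0, 0, 1]
data = [[0.1, -0.1], [0.9, 1.2], [4.0, 6.0]]

node_ids, node_distances = nearest_node_ids(data, nodes)
labels = label_by_nearest_node(data, nodes, clusters)
# Mean distance to the nearest node, analogous to a mean quantization error.
mean_quantization_error = sum(node_distances) / len(node_distances)
```

In this sketch, ``node_ids`` plays the role of the output of ``predict``, ``labels`` that of ``label``, and ``mean_quantization_error`` that of ``get_quantization_error``; the actual classes additionally support other distance metrics and GPU execution, which are not modelled here.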