spatiomic.cluster
=================

.. py:module:: spatiomic.cluster

.. autoapi-nested-parse::

   Make all clustering classes available in the cluster submodule.


Classes
-------

.. autoapisummary::

   spatiomic.cluster.agglomerative
   spatiomic.cluster.kmeans
   spatiomic.cluster.leiden
   spatiomic.cluster.som


Package Contents
----------------

.. py:class:: agglomerative(cluster_count = 10, distance_metric = 'euclidean', use_gpu = True)

   Bases: :py:obj:`spatiomic.cluster._base.ClusterInterface`


   Set the configuration for the agglomerative clustering class and initialise it.

   :param cluster_count: The number of clusters to group the data into. Defaults to 10.
   :type cluster_count: int, optional
   :param distance_metric: The distance metric to use.
                           Defaults to "euclidean".
   :type distance_metric: Literal["euclidean", "manhattan", "cosine"], optional
   :param use_gpu: Whether to use the cuml or the sklearn AgglomerativeClustering class.
                   Defaults to True.
   :type use_gpu: bool, optional


   .. py:attribute:: cluster_count
      :value: 10


   .. py:attribute:: connectivity
      :type:  Optional[str]


   .. py:attribute:: distance_metric
      :value: 'euclidean'


   .. py:method:: fit_predict(data, **kwargs)

      Perform agglomerative hierarchical clustering on the data with the settings of the class.

      :param data: The data to be clustered, last dimension being features.
      :type data: Union[pd.DataFrame, NDArray]

      :returns: An array containing the cluster for each of the data points.
      :rtype: NDArray

      :raises ValueError: If the estimator does not have a fit_predict method.


   .. py:method:: set_estimator()

      Set the AgglomerativeClustering estimator.


   .. py:attribute:: use_gpu
      :value: True
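
For intuition about what ``cluster_count`` and ``distance_metric`` control, the following is a minimal pure-Python sketch of bottom-up merging. It is not the implementation used by ``agglomerative``, which delegates to the cuml or sklearn estimator. The average-linkage merge rule and the helper names ``distance`` and ``agglomerate`` are assumptions made for illustration only.

```python
import math
from itertools import combinations


def distance(a, b, metric="euclidean"):
    """Distance between two points for the three documented metric names."""
    if metric == "euclidean":
        return math.dist(a, b)
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    if metric == "cosine":
        dot = sum(x * y for x, y in zip(a, b))
        return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))
    raise ValueError(f"Unknown metric: {metric}")


def agglomerate(points, cluster_count, metric="euclidean"):
    """Merge the two closest clusters until only cluster_count remain."""
    clusters = [[index] for index in range(len(points))]
    while len(clusters) > cluster_count:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage (an assumption): mean pairwise distance between members.
            total = sum(
                distance(points[p], points[q], metric)
                for p in clusters[i]
                for q in clusters[j]
            )
            mean = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or mean < best[0]:
                best = (mean, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]  # i < j always holds, so index i stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for point_index in members:
            labels[point_index] = label
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
labels = agglomerate(points, cluster_count=3)
```

Here the two tight pairs are merged first and the isolated point keeps its own cluster, so three labels remain.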


.. py:class:: kmeans(cluster_count = 20, run_count = 1, iteration_count = 300, tolerance = 0.001, init = 'k-means++', seed = None, use_gpu = True)

   Bases: :py:obj:`spatiomic.cluster._base.ClusterInterface`


   Set the configuration for the k-means clustering class and initialise it.

   :param cluster_count: The number of k-means clusters to split the data into. Defaults to 20.
   :type cluster_count: int, optional
   :param run_count: The number of k-means runs to perform. Defaults to 1.
   :type run_count: int, optional
   :param iteration_count: Maximum number of k-means iterations per run. Defaults to 300.
   :type iteration_count: int, optional
   :param tolerance: Maximum tolerance in center shift to declare convergence. Defaults to 1e-3.
   :type tolerance: float, optional
   :param init: How to initialise the centers, either "random"
                or "k-means++". Defaults to "k-means++".
   :type init: Literal["random", "k-means++"], optional
   :param seed: Random seed, either to be temporarily set as the numpy random seed
                or directly for libKmCuda. Defaults to None.
   :type seed: Optional[int], optional
   :param use_gpu: Whether to use the cuml or the sklearn KMeans class. Defaults to True.
   :type use_gpu: bool, optional


   .. py:attribute:: cluster_count
      :value: 20


   .. py:method:: fit_predict(data)

      Perform k-means clustering on the data with the settings of the class.

      :param data: The data to be clustered, last dimension being features.
      :type data: NDArray

      :returns: An array containing the cluster for each of the data points.
      :rtype: NDArray


   .. py:attribute:: init
      :value: 'k-means++'


   .. py:attribute:: iteration_count
      :value: 300


   .. py:attribute:: run_count
      :value: 1


   .. py:attribute:: seed
      :value: None


   .. py:method:: set_estimator()

      Set the KMeans estimator.


   .. py:attribute:: tolerance
      :value: 0.001


   .. py:attribute:: use_gpu
      :value: True
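
To illustrate how ``iteration_count``, ``tolerance`` and ``seed`` interact, here is a small pure-Python sketch of Lloyd's algorithm with random initialisation. It is an illustration under stated assumptions, not the code path of ``kmeans``, which relies on the cuml or sklearn estimator. The names ``assign`` and ``kmeans_sketch`` are hypothetical.

```python
import math
import random


def assign(points, centers):
    """Return the index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]


def kmeans_sketch(points, cluster_count=2, iteration_count=300, tolerance=1e-3, seed=None):
    """Single k-means run with "random" initialisation (assumed for simplicity)."""
    rng = random.Random(seed)
    centers = rng.sample(points, cluster_count)
    for _ in range(iteration_count):
        labels = assign(points, centers)
        new_centers = []
        for k in range(cluster_count):
            members = [p for p, label in zip(points, labels) if label == k]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members
                else centers[k]
            )
        # Convergence check: largest center shift in this iteration.
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return assign(points, centers)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = kmeans_sketch(points, cluster_count=2, seed=0)
```

A smaller ``tolerance`` lets the loop run closer to ``iteration_count`` before it stops; fixing ``seed`` makes the initial centers, and therefore the label numbering, reproducible.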


.. py:class:: leiden

   Initialise the Leiden clustering class.


   .. py:attribute:: graph
      :type:  Union[igraph.Graph, None]
      :value: None


   .. py:method:: predict(graph, resolution = 1.0, iteration_count = 1000, seed = 0, use_gpu = True, *args, **kwargs)

      Perform Leiden clustering on the provided neighborhood graph.

      .. warning:: The GPU version of Leiden may provide different results than the CPU version.

      :param graph: An igraph Graph to be optimised for community detection.
      :type graph: Graph
      :param resolution: Scales the minimum interconnectedness for a positive modularity.
                         Higher values result in more, but more densely interconnected, communities. Defaults to 1.0.
      :type resolution: float, optional
      :param iteration_count: Iteration count to run the Leiden algorithm for. Defaults to 1000.
      :type iteration_count: int, optional
      :param seed: Random seed to use for the Leiden algorithm. Defaults to 0.
      :type seed: int, optional
      :param use_gpu: Whether to use the GPU or CPU for the Leiden algorithm. Defaults to True.
      :type use_gpu: bool, optional

      :returns: A tuple of a list of the assigned communities, the final modularity
                and the optimised igraph Graph.
      :rtype: Tuple[List[int], float, Graph]
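
The effect of ``resolution`` can be seen from the standard resolution-scaled modularity of a fixed partition. The sketch below assumes that this is the quality function being optimised; the function name ``modularity`` and the toy graph are illustrative and not part of the library.

```python
from collections import defaultdict


def modularity(edges, membership, resolution=1.0):
    """Resolution-scaled modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of: e_c / m - resolution * (d_c / (2 * m)) ** 2
    with e_c the edges inside c, d_c the summed degree of c and m the edge count.
    """
    edge_count = len(edges)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for a, b in edges:
        degree_sum[membership[a]] += 1
        degree_sum[membership[b]] += 1
        if membership[a] == membership[b]:
            internal[membership[a]] += 1
    return sum(
        internal[c] / edge_count - resolution * (degree_sum[c] / (2 * edge_count)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = [0, 0, 0, 1, 1, 1]   # one community per triangle
merged = [0, 0, 0, 0, 0, 0]  # everything in one community

split_default = modularity(edges, split, resolution=1.0)
merged_default = modularity(edges, merged, resolution=1.0)
split_low = modularity(edges, split, resolution=0.1)
merged_low = modularity(edges, merged, resolution=0.1)
```

At ``resolution=1.0`` the split partition scores higher than the merged one, while at ``resolution=0.1`` the order flips. This matches the documented behaviour that higher values favour more communities.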


.. py:class:: som(node_count = (50, 50), dimension_count = 5, distance_metric = 'euclidean', neighborhood = 'gaussian', neighborhood_accuracy = 'fast', learning_rate_initial = 0.2, learning_rate_final = 0.003, sigma_initial = None, sigma_final = None, parallel_count = 8096, n_jobs = -1, seed = None, use_gpu = True)

   Bases: :py:obj:`spatiomic.dimension._base.LoadableDimensionReducer`


   Initialise a self-organising map with the provided configuration.

   The advantage of self-organizing maps is that they reduce the dimensionality of the data while preserving the
   feature dimensions. This potentially allows for a better interpretation of the data. The disadvantage is that
   their output consists of `representative` nodes and not the actual data points. There are four things to keep
   in mind so that the SOM best represents the biology of your data:

   - The SOM node count should be large enough to capture the topography of the data. If your data is very uniform,
     you can use a smaller node count. However, when working with tissue, very different imaging markers and
     multiple disease states, a larger node count is recommended.
   - The SOM should be trained for long enough to capture the topography of the data.
   - The SOM should be trained with a final learning rate that is not too high, so that the SOM can accurately
     represent small differences in the data.
   - The SOM should be trained with a final neighborhood size that is not too large. SOM nodes are not individually
     updated during training, but rather in a neighborhood. If your neighborhood is too large, other, perhaps
     more abundant biological signals will pull nodes that represent less abundant signals towards them, leading
     to a worse representation of the latter.

   :param node_count: Tuple determining the SOM node size. Defaults to (50, 50).
   :type node_count: Tuple[int, int], optional
   :param dimension_count: Dimensionality of the original data. Defaults to 5.
   :type dimension_count: int, optional
   :param distance_metric: Distance metric to
                           use. Defaults to "euclidean".
   :type distance_metric: Literal["euclidean", "manhattan", "correlation", "cosine"], optional
   :param neighborhood: The type of neighborhood used to determine related nodes.
                        Defaults to "gaussian".
   :type neighborhood: str, optional
   :param neighborhood_accuracy: The accuracy to use for the neighborhood.
                                 Defaults to "fast".
   :type neighborhood_accuracy: Literal["fast", "accurate"], optional
   :param learning_rate_initial: The initial learning rate. Defaults to 0.2.
   :type learning_rate_initial: float, optional
   :param learning_rate_final: The learning rate at the end of the SOM training. Defaults to 3e-3.
   :type learning_rate_final: float, optional
   :param sigma_initial: The initial size of the neighborhoods, higher values.
                         Defaults to None.
   :type sigma_initial: Optional[int], optional
   :param sigma_final: The final size of the neighborhood, lower values. Defaults to None.
   :type sigma_final: Optional[int], optional
   :param parallel_count: Data points to process concurrently. Defaults to 8096.
   :type parallel_count: int, optional
   :param n_jobs: Jobs to perform simultaneously when using the sklearn NearestNeighbors class.
                  The value -1 means unlimited jobs. Defaults to -1.
   :type n_jobs: int, optional
   :param seed: The random seed. Defaults to None.
   :type seed: Optional[int], optional
   :param use_gpu: Whether to use cupy. Defaults to True.
   :type use_gpu: bool, optional


   .. py:method:: fit(data, iteration_count = 50, pca_init = False)

      Fit the SOM on the data.

      :param data: The data (channel-last) of which the SOM should capture the topography.
      :type data: Union[pd.DataFrame, NDArray]
      :param iteration_count: The iterations to train the SOM for. Defaults to 50.
      :type iteration_count: int, optional
      :param pca_init: Whether to initialise the SOM through PCA. Defaults to False.
      :type pca_init: bool, optional


   .. py:method:: fit_predict(data, iteration_count = 50, pca_init = False, return_distance = False)

      Fit the SOM on the data and return the id of the nearest SOM node for each data point.

      :param data: The data to be labelled based on the label of the nearest SOM node.
      :type data: NDArray
      :param iteration_count: The iterations to train the SOM for. Defaults to 50.
      :type iteration_count: int, optional
      :param pca_init: Whether to initialise the SOM through PCA. Defaults to False.
      :type pca_init: bool, optional
      :param return_distance: Whether to return the distance to the nearest node. Defaults to False.
      :type return_distance: bool, optional

      :returns: The id of the nearest SOM node for each data point or the ids and
                distances.
      :rtype: Union[NDArray, Tuple[NDArray, NDArray]]


   .. py:method:: get_config()

      Get the config of the SOM class.

      :returns: The class configuration as a dictionary.
      :rtype: dict


   .. py:method:: get_nodes(flatten = True)

      Get the weights of a previously fitted SOM.

      :param flatten: Whether to flatten the SOM dimensions (but not the channel dimensionality).
                      Defaults to True.
      :type flatten: bool, optional

      :raises ValueError: If no self-organizing map has been created or loaded.

      :returns: The weights of the SOM nodes.
      :rtype: NDArray


   .. py:method:: get_quantization_error(data, return_distances = False)

      Get the quantization error of the SOM on the provided data.

      Uses the neighbor finder to find the nearest neighbor of each data point in the SOM and calculates the
      quantization error based on the distance between the data point and its nearest neighbor.

      :param data: The data to get the quantization error for.
      :type data: NDArray
      :param return_distances: Whether to also return the distances of all data points. Defaults to False.
      :type return_distances: bool, optional

      :returns: The mean quantization error and, if ``return_distances`` is set to True,
                all distances.
      :rtype: Union[float, Tuple[float, np.ndarray]]


   .. py:method:: label(data, clusters, save_path = None, return_distance = False, flatten = False)

      Get the label for each data point based on the label for its closest SOM node.

      This function internally uses the predict method to get the nearest node for each data point and then assigns
      the label of the nearest node to the data point. Labels have to be provided for each SOM node based on a
      clustering of the SOM nodes.

      :param data: The data to be labelled based on the label of the nearest SOM node.
      :type data: NDArray
      :param clusters: A list of clusters (one for each SOM node).
      :type clusters: Union[List[int], NDArray]
      :param save_path: The path to save the SOM and its configuration to.
                        Defaults to None.
      :type save_path: Optional[str], optional
      :param return_distance: Whether to return the distance to the nearest node. Defaults to False.
      :type return_distance: bool, optional
      :param flatten: Whether to flatten the input data in every dimension but the channel dimension.
                      Defaults to False.
      :type flatten: bool, optional

      :returns: The cluster for each data point or the clusters and distances.
      :rtype: Union[NDArray, Tuple[NDArray, NDArray]]


   .. py:method:: load(save_path)

      Load a previously pickled SOM.

      :param save_path: The path to load the SOM and its configuration from.
      :type save_path: str


   .. py:method:: predict(data, return_distance = False)

      Get the id of the nearest SOM node for each data point and optionally the distance to the node.

      :param data: The data for which to find the nearest SOM node of each data point.
      :type data: NDArray
      :param return_distance: Whether to return the distance to the nearest node. Defaults to False.
      :type return_distance: bool, optional

      :returns: The node ids for each data point or the ids and distances.
      :rtype: Union[NDArray, Tuple[NDArray, NDArray]]


   .. py:method:: save(save_path)

      Pickle and save a previously fitted SOM.

      :param save_path: The path to save the SOM and its configuration to.
      :type save_path: str

      :raises ValueError: If no self-organizing map has been created or loaded.


   .. py:method:: set_config(node_count = (50, 50), dimension_count = 5, distance_metric = 'euclidean', neighborhood = 'gaussian', neighborhood_accuracy = 'fast', learning_rate_initial = 0.2, learning_rate_final = 0.003, sigma_initial = None, sigma_final = None, parallel_count = 8096, n_jobs = -1, seed = None)

      Set the config of the SOM class.

      :param node_count: Tuple determining the SOM node size. Defaults to (50, 50).
      :type node_count: Tuple[int, int], optional
      :param dimension_count: Dimensionality of the original data. Defaults to 5.
      :type dimension_count: int, optional
      :param distance_metric: Distance metric to
                              use. Defaults to "euclidean".
      :type distance_metric: Literal["euclidean", "manhattan", "correlation", "cosine"], optional
      :param neighborhood: The type of neighborhood used to determine related nodes.
                           Defaults to "gaussian".
      :type neighborhood: str, optional
      :param neighborhood_accuracy: The accuracy to use for the neighborhood.
                                    Defaults to "fast".
      :type neighborhood_accuracy: Literal["fast", "accurate"], optional
      :param learning_rate_initial: The initial learning rate. Defaults to 0.2.
      :type learning_rate_initial: float, optional
      :param learning_rate_final: The learning rate at the end of the SOM training. Defaults to 3e-3.
      :type learning_rate_final: float, optional
      :param sigma_initial: The initial size of the neighborhoods, higher values.
                            Defaults to None.
      :type sigma_initial: Optional[int], optional
      :param sigma_final: The final size of the neighborhood, lower values. Defaults to None.
      :type sigma_final: Optional[int], optional
      :param parallel_count: Data points to process concurrently. Defaults to 8096.
      :type parallel_count: int, optional
      :param n_jobs: Jobs to perform simultaneously when using the sklearn NearestNeighbors class.
                     The value -1 means unlimited jobs. Defaults to -1.
      :type n_jobs: int, optional
      :param seed: The random seed. Defaults to None.
      :type seed: Optional[int], optional


   .. py:method:: set_estimators()

      Set the XPySOM and nearest neighbor finder estimators.


   .. py:method:: set_nodes(nodes)

      Set the weights of a previously fitted SOM.

      :param nodes: The weights of the SOM nodes.
      :type nodes: NDArray


   .. py:attribute:: use_gpu
      :value: True
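
The relationship between ``predict``, ``label`` and ``get_quantization_error`` described above can be sketched in plain Python. This is a conceptual illustration only: it assumes Euclidean distance and a flattened list of node weights, and the function names ``nearest_node``, ``label_points`` and ``mean_quantization_error`` are hypothetical rather than part of ``som``.

```python
import math


def nearest_node(data, nodes):
    """Return the id of the nearest node and the distance to it for each data point."""
    ids, distances = [], []
    for point in data:
        node_distances = [math.dist(point, node) for node in nodes]
        best = min(range(len(nodes)), key=node_distances.__getitem__)
        ids.append(best)
        distances.append(node_distances[best])
    return ids, distances


def label_points(data, nodes, clusters):
    """Give each data point the cluster of its nearest node (one cluster per node)."""
    ids, _ = nearest_node(data, nodes)
    return [clusters[node_id] for node_id in ids]


def mean_quantization_error(data, nodes):
    """Mean distance between each data point and its nearest node."""
    _, distances = nearest_node(data, nodes)
    return sum(distances) / len(distances)


# Three nodes with two features each, and a clustering of those nodes.
nodes = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
clusters = [0, 0, 1]
data = [(0.0, 0.2), (0.0, 0.9), (5.0, 4.0)]

node_ids, _ = nearest_node(data, nodes)
point_labels = label_points(data, nodes, clusters)
error = mean_quantization_error(data, nodes)
```

Because labels are looked up per node, clustering only the nodes is enough to label an arbitrarily large data set, and a high quantization error signals that the nodes do not yet represent the data well.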