cluster

Short Description

sm.tl.cluster: Clusters the cells in a dataset to identify distinct cell populations from their expression profiles or other relevant features. It supports three popular clustering algorithms:

  • kmeans: A partitioning method that divides the dataset into k clusters, each represented by the centroid of the data points in the cluster. It is suitable for identifying spherical clusters in the feature space.

  • phenograph: Based on community detection in graphs, Phenograph clusters cells by constructing a k-nearest neighbor graph and then detecting communities within this graph. This method is particularly effective for identifying clusters with varying densities and sizes.

  • leiden: An algorithm that refines the cluster partitioning by optimizing a modularity score, leading to the detection of highly connected communities. It is known for its ability to uncover fine-grained and highly cohesive clusters.

Each algorithm has its own set of parameters and assumptions, making some more suitable than others for specific types of dataset characteristics. Users are encouraged to select the clustering algorithm that best matches their data's nature and their analytical goals.
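
To make the centroid idea behind kmeans concrete, here is a minimal NumPy sketch of Lloyd's algorithm on toy data. It is illustrative only and not scimap's implementation: the function name toy_kmeans and the data are made up, and the centroids are initialised from the first k rows purely for determinism, whereas the source listing further down calls KMeans (presumably scikit-learn's) with random_state and n_init=10.

```python
import numpy as np

def toy_kmeans(X, k, n_iter=10):
    """Plain Lloyd's algorithm; an illustrative sketch, not scimap's code."""
    # Deterministic start: use the first k rows as initial centroids
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated toy groups of three "cells" with two features each
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
labels, centroids = toy_kmeans(X, k=2)
```

Each cell ends up labelled by its nearest centroid, which is why k-means tends to favour compact, roughly spherical groups.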

Function

cluster(adata, method='kmeans', layer='log', subset_genes=None, sub_cluster=False, sub_cluster_column='phenotype', sub_cluster_group=None, k=10, n_pcs=None, resolution=1, phenograph_clustering_metric='euclidean', nearest_neighbors=30, use_raw=True, log=True, random_state=0, collapse_labels=False, label=None, verbose=True, output_dir=None)

Parameters:

Name Type Description Default
adata AnnData

The input AnnData object containing single-cell data for clustering.

required
method str

Specifies the clustering algorithm to be used. Currently supported algorithms include 'kmeans', 'phenograph', and 'leiden'.

'kmeans'
subset_genes list of str

A list of gene names to be used specifically for clustering. If not provided, all genes in the dataset are used.

None
sub_cluster bool

Enables sub-clustering within an existing cluster or phenotype. Useful for further dissecting identified groups.

False
sub_cluster_column str

The column in adata.obs that contains the cluster or phenotype labels for sub-clustering. Required if sub_cluster is True.

'phenotype'
sub_cluster_group list of str

Specifies the clusters or phenotypes to be sub-clustered. If not provided, every group in sub_cluster_column that has enough cells for the chosen method is sub-clustered.

None
k int

The number of clusters to generate when using the K-Means algorithm.

10
n_pcs int

The number of principal components to use for Leiden clustering. Defaults to using all available PCs.

None
resolution float

Adjusts the granularity of clustering, applicable for Leiden clustering. Higher values yield more clusters.

1
phenograph_clustering_metric str

Defines the distance metric for nearest neighbor calculation in Phenograph. Choices include 'cityblock', 'cosine', 'euclidean', 'manhattan', and others. Note: 'correlation' and 'cosine' metrics may slow down the performance.

'euclidean'
nearest_neighbors int

The number of nearest neighbors to consider during the initial graph construction phase in both Leiden and Phenograph clustering.

30
use_raw bool

Determines whether raw data (adata.raw) or processed data (adata.X) is used for clustering. The default (True) uses the raw data.

True
log bool

If True, applies a log1p transformation, log(1 + x), to the raw data before clustering. Only takes effect when use_raw is True.

True
random_state int

Seed for random number generation, ensuring reproducibility of clustering results.

0
collapse_labels bool

When sub-clustering a subset of groups, this merges all other groups into a single category, aiding in visualization.

False
label str

The key under which the clustering results are stored in adata.obs. Defaults to the name of the clustering method used.

None
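verbose bool

If set to True, the function prints detailed messages about its progress and the steps being executed.

True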
output_dir str

Specifies the directory where output files, if any, should be saved.

None
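
How use_raw and log interact can be sketched as follows. The helper select_matrix is hypothetical and only mirrors the branching visible in the source listing: the log1p transform is applied to the raw matrix alone, and log is ignored when use_raw is False.

```python
import numpy as np

def select_matrix(raw, processed, use_raw=True, log=True):
    """Hypothetical helper: pick the matrix that would be clustered."""
    if use_raw:
        # log1p computes log(1 + x), so zero intensities stay at zero
        return np.log1p(raw) if log else raw
    # Processed data is used as-is; `log` has no effect in this branch
    return processed

# Made-up stand-ins for adata.raw.X and adata.X
raw = np.array([[0.0, 9.0], [99.0, 999.0]])
processed = np.array([[-1.0, 0.5], [0.2, 1.3]])

logged = select_matrix(raw, processed, use_raw=True, log=True)
unlogged = select_matrix(raw, processed, use_raw=True, log=False)
scaled = select_matrix(raw, processed, use_raw=False, log=True)
```

If adata.X already holds transformed or scaled values, passing use_raw=False avoids a second transformation.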

Returns:

Name Type Description
AnnData modified AnnData

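The input adata object, updated to include a new column in adata.obs corresponding to the clustering results. The column name matches the label parameter or defaults to the clustering method used. In the source listing, the object is returned only when output_dir is None; otherwise it is written to output_dir.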
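
The way sub-cluster labels end up in that column can be sketched with pandas alone. This is a simplified reading of the merge step in the source listing, not the library's code: a plain DataFrame stands in for adata.obs, the data are made up, and the column name subcluster is hypothetical (the listing uses an unnamed column 0).

```python
import pandas as pd

# Stand-in for adata.obs with an existing phenotype column
obs = pd.DataFrame(
    {"phenotype": ["B cells", "B cells", "T cells", "Tumor"]},
    index=["cell1", "cell2", "cell3", "cell4"],
)

# Sub-cluster labels exist only for the group that was sub-clustered;
# the source prefixes them with the parent group name
sub = pd.DataFrame(
    {"subcluster": ["B cells-0", "B cells-1"]},
    index=["cell1", "cell2"],
)

# Outer-join on the cell index so every cell is kept, then restore row order
merged = obs.merge(sub, how="outer", left_index=True, right_index=True)
merged = merged.reindex(obs.index)

# collapse_labels=False: cells outside the group keep their original label
kept = merged["subcluster"].fillna(merged["phenotype"])

# collapse_labels=True: cells outside the group remain missing values (NaN)
collapsed = merged["subcluster"]
```

In this sketch, kept mixes sub-cluster and original labels, while collapsed leaves every cell outside the selected group as NaN, which appears to be how the listing groups "all other" cells.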

Example
# Example 1: Basic K-Means clustering without sub-clustering
adata = sm.tl.cluster(adata, method='kmeans', k=10, use_raw=True, log=True, random_state=42)

# Example 2: Phenograph clustering with a specific subset of genes and increased nearest neighbors
subset_genes = ['CD3D', 'CD19', 'CD4', 'CD8A']
adata = sm.tl.cluster(adata, method='phenograph', subset_genes=subset_genes, nearest_neighbors=50, phenograph_clustering_metric='euclidean', use_raw=False)

# Example 3: Leiden clustering on 20 principal components at the default resolution (raise resolution above 1 for finer clusters)
adata = sm.tl.cluster(adata, method='leiden', n_pcs=20, resolution=1, use_raw=False, log=False)

# Example 4: Sub-clustering within a specific phenotype group using Leiden, with results labeled distinctly
adata = sm.tl.cluster(adata, method='leiden', sub_cluster=True, sub_cluster_column='phenotype', sub_cluster_group=['B cells'], n_pcs=15, resolution=1, label='B_cell_subclusters', verbose=True)
Source code in scimap/tools/cluster.py
def cluster (adata, 
             method='kmeans', 
             layer='log',
             subset_genes=None,
             sub_cluster=False, 
             sub_cluster_column='phenotype', 
             sub_cluster_group = None,
             k= 10, 
             n_pcs=None, 
             resolution=1, 
             phenograph_clustering_metric='euclidean', 
             nearest_neighbors= 30, 
             use_raw=True, 
             log=True, 
             random_state=0, 
             collapse_labels= False,
             label=None, 
             verbose=True,
             output_dir=None):
    """

Parameters:
    adata (AnnData):  
        The input AnnData object containing single-cell data for clustering.

    method (str):  
        Specifies the clustering algorithm to be used. Currently supported algorithms include 'kmeans', 'phenograph', and 'leiden'.

    subset_genes (list of str, optional):  
        A list of gene names to be used specifically for clustering. If not provided, all genes in the dataset are used.

    sub_cluster (bool, optional):  
        Enables sub-clustering within an existing cluster or phenotype. Useful for further dissecting identified groups. 

    sub_cluster_column (str, optional):  
        The column in `adata.obs` that contains the cluster or phenotype labels for sub-clustering. Required if `sub_cluster` is `True`.

    sub_cluster_group (list of str, optional):  
        Specifies the clusters or phenotypes to be sub-clustered. If not provided, every group in `sub_cluster_column` that has enough cells for the chosen method is sub-clustered.

    k (int, optional):  
        The number of clusters to generate when using the K-Means algorithm.

    n_pcs (int, optional):  
        The number of principal components to use for Leiden clustering. Defaults to using all available PCs.

    resolution (float, optional):  
        Adjusts the granularity of clustering, applicable for Leiden clustering. Higher values yield more clusters.

    phenograph_clustering_metric (str, optional):  
        Defines the distance metric for nearest neighbor calculation in Phenograph. Choices include 'cityblock', 'cosine', 'euclidean', 'manhattan', and others. Note: 'correlation' and 'cosine' metrics may slow down the performance.

    nearest_neighbors (int, optional):  
        The number of nearest neighbors to consider during the initial graph construction phase in both Leiden and Phenograph clustering.

    use_raw (bool, optional):  
        Determines whether raw data (`adata.raw`) or processed data (`adata.X`) is used for clustering. The default (`True`) uses the raw data.

    log (bool, optional):  
        If True, applies logarithmic transformation to raw data before clustering. Requires `use_raw` to be True.

    random_state (int, optional):  
        Seed for random number generation, ensuring reproducibility of clustering results.

    collapse_labels (bool, optional):  
        When sub-clustering a subset of groups, this merges all other groups into a single category, aiding in visualization.

    label (str, optional):  
        The key under which the clustering results are stored in `adata.obs`. Defaults to the name of the clustering method used.

    verbose (bool):  
        If set to `True`, the function will print detailed messages about its progress and the steps being executed.

    output_dir (str, optional):  
        Specifies the directory where output files, if any, should be saved.

Returns:
    AnnData (modified AnnData):  
        The input `adata` object, updated to include a new column in `adata.obs` corresponding to the clustering results. The column name matches the `label` parameter or defaults to the clustering method used.

Example:
    ```python

    # Example 1: Basic K-Means clustering without sub-clustering
    adata = sm.tl.cluster(adata, method='kmeans', k=10, use_raw=True, log=True, random_state=42)

    # Example 2: Phenograph clustering with a specific subset of genes and increased nearest neighbors
    subset_genes = ['CD3D', 'CD19', 'CD4', 'CD8A']
    adata = sm.tl.cluster(adata, method='phenograph', subset_genes=subset_genes, nearest_neighbors=50, phenograph_clustering_metric='euclidean', use_raw=False)

    # Example 3: Leiden clustering on 20 principal components at the default resolution (raise resolution above 1 for finer clusters)
    adata = sm.tl.cluster(adata, method='leiden', n_pcs=20, resolution=1, use_raw=False, log=False)

    # Example 4: Sub-clustering within a specific phenotype group using Leiden, with results labeled distinctly
    adata = sm.tl.cluster(adata, method='leiden', sub_cluster=True, sub_cluster_column='phenotype', sub_cluster_group=['B cells'], n_pcs=15, resolution=1, label='B_cell_subclusters', verbose=True)


    ```
    """

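    # Load the AnnData object (a path to an .h5ad file is also accepted)
    if isinstance(adata, str):
        # Keep the file name so the result can be written to output_dir later
        imid = str(adata.rsplit('/', 1)[-1])
        adata = anndata.read_h5ad(adata)
    # NOTE: imid is only defined for path input, so the output_dir branch at the
    # end of the function assumes adata was passed as a file path.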

    # dynamically adapt the number of neighbours
    if nearest_neighbors > adata.shape[0]:
        nearest_neighbors = adata.shape[0] - 3


    # Leiden clustering
    def leiden_clustering (pheno, adata, nearest_neighbors, n_pcs, resolution):

        # subset the data to be clustered
        if pheno is not None:
            cell_subset =  adata.obs[adata.obs[sub_cluster_column] == pheno].index
        else:
            cell_subset = adata.obs.index

        if use_raw == True:
            data_subset = adata[cell_subset]
            if log is True:
                data_subset.X = np.log1p(data_subset.raw.X)          
            else:
                data_subset.X = data_subset.raw.X
        else:
            data_subset = adata[cell_subset]

        # clustering
        if pheno is not None:
            if verbose: 
                print('Leiden clustering ' + str(pheno))
        else:
            if verbose:
                print('Leiden clustering')

        sc.tl.pca(data_subset)
        if n_pcs is None:
            n_pcs = len(adata.var)
        sc.pp.neighbors(data_subset, n_neighbors=nearest_neighbors, n_pcs=n_pcs)
        sc.tl.leiden(data_subset,resolution=resolution, random_state=random_state)

        # Rename the labels
        cluster_labels = list(map(str,list(data_subset.obs['leiden'])))
        if pheno is not None:
            cluster_labels = list(map(lambda orig_string: pheno + '-' + orig_string, cluster_labels))

        # Make it into a dataframe
        cluster_labels = pd.DataFrame(cluster_labels, index = data_subset.obs.index)

        # return labels
        return cluster_labels

    # Kmeans clustering
    def k_clustering (pheno, adata, k, sub_cluster_column, use_raw, random_state):

        # subset the data to be clustered
        if pheno is not None:
            cell_subset =  adata.obs[adata.obs[sub_cluster_column] == pheno].index
        else:
            cell_subset = adata.obs.index

        # Usage of scaled or raw data
        if use_raw == True:
            if log is True:
                data_subset = pd.DataFrame(np.log1p(adata.raw[cell_subset].X), columns =adata[cell_subset].var.index, index = adata[cell_subset].obs.index)
            else:
                data_subset = pd.DataFrame(adata.raw[cell_subset].X, columns =adata[cell_subset].var.index, index = adata[cell_subset].obs.index)
        else:
            data_subset = pd.DataFrame(adata[cell_subset].X, columns =adata[cell_subset].var.index, index = adata[cell_subset].obs.index)

        # K-means clustering
        if pheno is not None:
            if verbose:
                print('Kmeans clustering ' + str(pheno))
        else:
            if verbose:
                print('Kmeans clustering')

        kmeans = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit(data_subset)

        # Rename the labels
        cluster_labels = list(map(str,kmeans.labels_))
        if pheno is not None:
            cluster_labels = list(map(lambda orig_string: pheno + '-' + orig_string, cluster_labels))

        # Make it into a dataframe
        cluster_labels = pd.DataFrame(cluster_labels, index = data_subset.index)

        # return labels
        return cluster_labels

    # Phenograph clustering
    def phenograph_clustering (pheno, adata, primary_metric, nearest_neighbors):

        # subset the data to be clustered
        if pheno is not None:
            cell_subset =  adata.obs[adata.obs[sub_cluster_column] == pheno].index
        else:
            cell_subset = adata.obs.index

        # Usage of scaled or raw data
        if use_raw == True:
            data_subset = adata[cell_subset]
            if log is True:
                data_subset.X = np.log1p(data_subset.raw.X)          
            else:
                data_subset.X = data_subset.raw.X
        else:
            data_subset = adata[cell_subset]

        # Phenograph clustering
        if pheno is not None:
            if verbose:
                print('Phenograph clustering ' + str(pheno))
        else:
            if verbose:
                print('Phenograph clustering')

        sc.tl.pca(data_subset)
        result = sce.tl.phenograph(data_subset.obsm['X_pca'], k = nearest_neighbors, primary_metric=phenograph_clustering_metric)

        # Rename the labels
        cluster_labels = list(map(str,result[0]))
        if pheno is not None:
            cluster_labels = list(map(lambda orig_string: pheno + '-' + orig_string, cluster_labels))

        # Make it into a dataframe
        cluster_labels = pd.DataFrame(cluster_labels, index = data_subset.obs.index)

        # return labels
        return cluster_labels


    # Use user defined genes for clustering
    if subset_genes is not None:
        bdata = adata[:,subset_genes]
        bdata.raw = bdata[:,subset_genes]
    else:
        bdata = adata.copy()

    # IF sub-cluster is True
    # What cells to run the clustering on?
    if sub_cluster is True:
        if sub_cluster_group is not None:
            if isinstance(sub_cluster_group, list):
                pheno = sub_cluster_group
            else:
                pheno = [sub_cluster_group]         
        else:
            # Make sure number of clusters is not greater than number of cells available
            if method == 'kmeans':
                pheno = (bdata.obs[sub_cluster_column].value_counts() > k+1).index[bdata.obs[sub_cluster_column].value_counts() > k+1]
            if method == 'phenograph':
                pheno = (bdata.obs[sub_cluster_column].value_counts() > nearest_neighbors+1).index[bdata.obs[sub_cluster_column].value_counts() > nearest_neighbors+1]
            if method == 'leiden':
                pheno = (bdata.obs[sub_cluster_column].value_counts() > 1).index[bdata.obs[sub_cluster_column].value_counts() > 1]

    # Run the specified method
    if method == 'kmeans':
        if sub_cluster == True:  
            # Apply the Kmeans function
            r_k_clustering = lambda x: k_clustering(pheno=x, adata=bdata, k=k, sub_cluster_column=sub_cluster_column, use_raw=use_raw, random_state=random_state) # Create lambda function
            all_cluster_labels = list(map(r_k_clustering, pheno)) # Apply function 
        else:
            all_cluster_labels = k_clustering(pheno=None, adata=bdata, k=k, sub_cluster_column=sub_cluster_column, use_raw=use_raw, random_state=random_state)

    if method == 'phenograph':
        if sub_cluster == True:
            r_phenograph_clustering = lambda x: phenograph_clustering(pheno=x, adata=bdata, primary_metric=phenograph_clustering_metric, nearest_neighbors=nearest_neighbors) # Create lambda function
            all_cluster_labels = list(map(r_phenograph_clustering, pheno)) # Apply function      
        else:
            all_cluster_labels = phenograph_clustering(pheno=None, adata=bdata, primary_metric=phenograph_clustering_metric, nearest_neighbors=nearest_neighbors)


    if method == 'leiden':
        if sub_cluster == True:
            r_leiden_clustering = lambda x: leiden_clustering(pheno=x, adata=bdata, nearest_neighbors=nearest_neighbors, n_pcs=n_pcs, resolution=resolution) # Create lambda function
            all_cluster_labels = list(map(r_leiden_clustering, pheno)) # Apply function 
        else:
            all_cluster_labels = leiden_clustering(pheno=None, adata=bdata, nearest_neighbors=nearest_neighbors, n_pcs=n_pcs, resolution=resolution)



    # Merge all the labels into one and add to adata
    if sub_cluster == True:
        sub_clusters = pd.concat(all_cluster_labels, axis=0, sort=False)
    else:
        sub_clusters = all_cluster_labels

    # Merge with all cells
    #sub_clusters = pd.DataFrame(bdata.obs[sub_cluster_column]).merge(sub_clusters, how='outer', left_index=True, right_index=True)
    sub_clusters = pd.DataFrame(bdata.obs).merge(sub_clusters, how='outer', left_index=True, right_index=True)


    # Transfer labels
    if collapse_labels is False and sub_cluster is True:
        sub_clusters = pd.DataFrame(sub_clusters[0].fillna(sub_clusters[sub_cluster_column]))


    # Get only the required column
    sub_clusters = sub_clusters[0]

    # re index the rows
    sub_clusters = sub_clusters.reindex(adata.obs.index)

    # Append to adata
    if label is None:
        adata.obs[method] = sub_clusters
    else:
        adata.obs[label] = sub_clusters

    # Save data if requested
    if output_dir is not None:
        output_dir = pathlib.Path(output_dir)
        output_dir.mkdir(exist_ok=True, parents=True)
        adata.write(output_dir / imid)
    else:    
        # Return data
        return adata
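
When sub_cluster=True and no sub_cluster_group is given, the listing above keeps only the groups with enough cells for the chosen method. A minimal sketch of that filter for kmeans, with a pandas Series as a stand-in for adata.obs[sub_cluster_column] (the data are made up):

```python
import pandas as pd

# Stand-in for adata.obs[sub_cluster_column]: 12 B cells, 5 T cells, 1 Tumor
phenotype = pd.Series(["B cells"] * 12 + ["T cells"] * 5 + ["Tumor"] * 1)
k = 10

# Count cells per group and keep groups with more than k + 1 cells
counts = phenotype.value_counts()
eligible = counts.index[counts > k + 1]
```

Here only the B cells group passes the filter. The same pattern is used with nearest_neighbors + 1 as the threshold for phenograph and 1 for leiden; groups named explicitly in sub_cluster_group bypass this check.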