  • Brief Communication
SC3: consensus clustering of single-cell RNA-seq data

Abstract

Single-cell RNA-seq enables the quantitative characterization of cell types based on global transcriptome profiles. We present single-cell consensus clustering (SC3), a user-friendly tool for unsupervised clustering, which achieves high accuracy and robustness by combining multiple clustering solutions through a consensus approach (http://bioconductor.org/packages/SC3). We demonstrate that SC3 is capable of identifying subclones from the transcriptomes of neoplastic cells collected from patients.
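
The abstract describes the consensus step only at a high level. As a minimal sketch of one common way to combine several clustering solutions, the snippet below builds a co-occurrence (consensus) matrix: entry (i, j) is the fraction of solutions in which cells i and j share a cluster. The function name and the toy labels are illustrative assumptions, not code from the SC3 package.

```python
def consensus_matrix(partitions):
    """Fraction of partitions in which each pair of cells shares a cluster."""
    n_cells = len(partitions[0])
    if any(len(labels) != n_cells for labels in partitions):
        raise ValueError("all partitions must label the same cells")
    counts = [[0.0] * n_cells for _ in range(n_cells)]
    for labels in partitions:
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    n_partitions = len(partitions)
    return [[value / n_partitions for value in row] for row in counts]


# Three hypothetical clustering solutions for the same five cells.
# Cluster label values are arbitrary and need not match across solutions.
partitions = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
consensus = consensus_matrix(partitions)
```

Cell pairs that are grouped together in every solution get a value of 1, so a final clustering of this matrix favours assignments that are stable across the individual solutions.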

Figure 1: The SC3 framework for consensus clustering of scRNA-seq data.
Figure 2: Benchmarking of SC3 against existing methods.
Figure 3: SC3 defines subclones from two patients with myeloproliferative neoplasm.

Accession codes

Primary accessions

Gene Expression Omnibus

Referenced accessions

ArrayExpress

References

  1. Grün, D. et al. Nature 525, 251–255 (2015).

  2. Jaitin, D.A. et al. Science 343, 776–779 (2014).

  3. Mahata, B. et al. Cell Rep. 7, 1130–1142 (2014).

  4. Gentleman, R.C. et al. Genome Biol. 5, R80 (2004).

  5. McCarthy, D.J., Campbell, K.R., Lun, A.T.L. & Wills, Q.F. Bioinformatics https://doi.org/10.1093/bioinformatics/btw777 (2017).

  6. Biase, F.H., Cao, X. & Zhong, S. Genome Res. 24, 1787–1796 (2014).

  7. Yan, L. et al. Nat. Struct. Mol. Biol. 20, 1131–1139 (2013).

  8. Goolam, M. et al. Cell 165, 61–74 (2016).

  9. Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Science 343, 193–196 (2014).

  10. Pollen, A.A. et al. Nat. Biotechnol. 32, 1053–1058 (2014).

  11. Kolodziejczyk, A.A. et al. Cell Stem Cell 17, 471–485 (2015).

  12. Treutlein, B. et al. Nature 509, 371–375 (2014).

  13. Ting, D.T. et al. Cell Rep. 8, 1905–1918 (2014).

  14. Patel, A.P. et al. Science 344, 1396–1401 (2014).

  15. Usoskin, D. et al. Nat. Neurosci. 18, 145–153 (2015).

  16. Klein, A.M. et al. Cell 161, 1187–1201 (2015).

  17. Zeisel, A. et al. Science 347, 1138–1142 (2015).

  18. van der Maaten, L. & Hinton, G. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  19. Zurauskiene, J. & Yau, C. BMC Bioinformatics http://doi.org/10.1186/s12859-016-0984-y (2016).

  20. Xu, C. & Su, Z. Bioinformatics https://doi.org/10.1093/bioinformatics/btv088 (2015).

  21. Guo, M., Wang, H., Potter, S.S., Whitsett, J.A. & Xu, Y. PLoS Comput. Biol. 11, e1004575 (2015).

  22. Macosko, E.Z. et al. Cell 161, 1202–1214 (2015).

  23. Jiang, L., Chen, H., Pinello, L. & Yuan, G.-C. Genome Biol. 17, 144 (2016).

  24. Patterson, N., Price, A.L. & Reich, D. PLoS Genet. 2, e190 (2006).

  25. Tracy, C.A. & Widom, H. Commun. Math. Phys. 159, 151–174 (1994).

  26. Rousseeuw, P.J. J. Comput. Appl. Math. 20, 53–65 (1987).

  27. Guo, G. et al. Dev. Cell 18, 675–685 (2010).

  28. Boroviak, T. et al. Dev. Cell 35, 366–382 (2015).

  29. Chen, E., Staudt, L.M. & Green, A.R. Immunity 36, 529–541 (2012).

  30. Ortmann, C.A. et al. N. Engl. J. Med. 372, 601–612 (2015).

  31. Nangalia, J. et al. N. Engl. J. Med. 369, 2391–2405 (2013).

  32. Hartigan, J.A. & Wong, M.A. J. R. Stat. Soc. Ser. C Appl. Stat. 28, 100–108 (1979).

  33. Strehl, A. & Ghosh, J. J. Mach. Learn. Res. 3, 583–617 (2003).

  34. Hubert, L. & Arabie, P. J. Classif. 2, 193–218 (1985).

  35. Ben-Hur, A., Horn, D., Siegelmann, H.T. & Vapnik, V. J. Mach. Learn. Res. 2, 125–137 (2001).

  36. Hubert, M. & Debruyne, M. WIREs Comp Stat 2, 36–43 (2010).

  37. Hubert, M., Rousseeuw, P.J. & Branden, K.V. Technometrics 47, 64–79 (2005).

  38. Reimand, J. et al. Nucleic Acids Res. 44, W83–W89 (2016).

  39. Goder, A. & Filkov, V. Consensus clustering algorithms: comparison and refinement. in Proceedings of the Meeting on Algorithm Engineering & Experiments 109–117 (Society for Industrial and Applied Mathematics, 2008).

  40. Petzer, A.L., Zandstra, P.W., Piret, J.M. & Eaves, C.J. J. Exp. Med. 183, 2551–2558 (1996).

  41. Picelli, S. et al. Nat. Protoc. 9, 171–181 (2014).

  42. Andrews, S. FastQC: A quality control tool for high throughput sequence data. Reference Source (2010).

  43. Bolger, A.M., Lohse, M. & Usadel, B. Bioinformatics 30, 2114–2120 (2014).

  44. Trapnell, C., Pachter, L. & Salzberg, S.L. Bioinformatics 25, 1105–1111 (2009).

  45. Love, M.I., Huber, W. & Anders, S. Genome Biol. 15, 550 (2014).

  46. Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Nat. Biotechnol. 32, 896–902 (2014).

  47. Ritchie, M.E. et al. Nucleic Acids Res. 43, e47 (2015).

Acknowledgements

We thank B. Vangelov, J.-C. Delvenne and R. Lambiotte for fruitful discussions and for their help with computational methods. We also thank D. Flores Santa Cruz, D. Dimitropolou and J. Grinfeld for technical assistance with experiments. We thank I. Vasquez-Garcia, D. Harmin, M. Kosicki, D. Ramsköld and M. Huch for comments on the manuscript. V.Y.K., T.A., A.Y. and M.H. are supported by Wellcome Trust Grants. K.N.N. is supported by the Wellcome Trust Strategic Award 'Single cell genomics of mouse gastrulation'. M.T.S. acknowledges support from FRS-FNRS; the Belgian Network DYSCO (Dynamical Systems, Control and Optimisation), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian State Science Policy Office; and the ARC (Action de Recherche Concerte) on Mining and Optimization of Big Data Models, funded by the Wallonia-Brussels Federation. M.B. acknowledges support from EPSRC (grant EP/N014529/1). T.C. was funded through a core funded fellowship by the Sanger Institute and a Chancellor's fellowship from the University of Edinburgh. K.K. and A.R.G. are supported by Bloodwise (grant ref. 13003), the Wellcome Trust (grant ref. 104710/Z/14/Z), the Medical Research Council, the Kay Kendall Leukaemia Fund, the Cambridge NIHR Biomedical Research Center, the Cambridge Experimental Cancer Medicine Centre, the Leukemia and Lymphoma Society of America (grant ref. 07037) and a core support grant from the Wellcome Trust and MRC to the Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute. W.R. was supported by BBSRC (grant ref. BB/K010867/1), the Wellcome Trust (grant ref. 095645/Z/11/Z), EU BLUEPRINT and EpiGeneSys.

Author information

Contributions

M.H. conceived the study; V.Y.K., M.H., M.T.S., M.B., T.A. and A.Y. contributed to the computational framework; K.K. and T.C. performed the experiments for the patient data; K.N.N. helped with the analysis of embryonic mouse data; M.B., W.R., A.R.G. and M.H. supervised the research; and V.Y.K. and M.H. led the writing of the manuscript with input from the other authors.

Corresponding author

Correspondence to Martin Hemberg.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Boxplots of 100 realizations of the SC3 clustering on the Biase, Yan and Goolam datasets.

For clarity, lines are drawn through the medians of the boxplots. The x-axis shows the number of eigenvectors d of the transformed distance matrix as a percentage of the total number of cells N in each dataset. The black vertical lines correspond to d = 4% of N and d = 7% of N. Dots represent outliers that are higher than the highest value (or lower than the lowest value) within 1.5 * IQR, where IQR is the inter-quartile range, or distance between the first and third quartiles.
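
The two quantities in this caption can be made concrete with a short sketch. It assumes that the 4% bound is rounded down and the 7% bound is rounded up (the caption does not state the rounding), and it uses the conventional boxplot rule that points beyond 1.5 × IQR from the quartiles are drawn as outliers. Function names, the quartile interpolation method and the toy values are illustrative assumptions.

```python
import statistics


def eigenvector_range(n_cells, low_pct=4, high_pct=7):
    """Candidate d values spanning low_pct..high_pct percent of n_cells."""
    d_min = max(1, (n_cells * low_pct) // 100)           # assumed: round down
    d_max = max(d_min, -((-n_cells * high_pct) // 100))  # assumed: round up
    return range(d_min, d_max + 1)


def boxplot_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical dataset of 200 cells and hypothetical accuracy scores.
d_values = eigenvector_range(200)
outliers = boxplot_outliers([0.90, 0.91, 0.92, 0.93, 0.94, 0.50])
```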

Supplementary Figure 2 Boxplots of 100 realizations of the SC3 clustering on the Deng, Pollen and Kolodziejczyk datasets.

For clarity, lines are drawn through the medians of the boxplots. The x-axis shows the number of eigenvectors d of the transformed distance matrix as a percentage of the total number of cells N in each dataset. The black vertical lines correspond to d = 4% of N and d = 7% of N. Dots represent outliers that are higher than the highest value (or lower than the lowest value) within 1.5 * IQR, where IQR is the inter-quartile range, or distance between the first and third quartiles.

Supplementary Figure 3 Exploration of SC3 pipeline parameters

(a) Histogram of the d values where ARI > 0.95 is achieved for the downsampled (by a factor of 10; see Methods) gold standard datasets from Fig. 1b. The black vertical lines indicate the interval d = 4–7% of the total number of cells N, showing high accuracy in the classification; (b) Histogram of the d values where ARI > 0.95 is achieved for the silver standard datasets from Fig. 1b. The black vertical lines indicate the interval d = 4–7% of the total number of cells N, showing high accuracy in the classification; (c) Exploration of the gene filter parameters (see Methods for more details). Dots represent individual clustering runs. Bars correspond to the median of the dots; (d) The effect of dropouts in the distance calculation step on the accuracy of SC3 clustering (see Methods for more details). Dots represent individual clustering runs. Bars correspond to the median of the dots. Red and grey colours correspond to clustering with and without dropouts. The black line corresponds to ARI = 0.8.
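
Accuracy throughout these captions is reported as the adjusted Rand index (ARI; ref. 34). The sketch below computes it from the pair-counting formula of Hubert and Arabie, using only the standard library. The function name and the toy labels are illustrative, and this is not the implementation used to produce the figures.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: trivially identical
    # Pairs of cells that are together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells that are together within each labeling.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI of 1 means the two labelings agree up to a renaming of clusters, and values near 0 are what random assignments would give, which is why thresholds such as 0.8 or 0.95 indicate close agreement with the reference labels.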

Supplementary Figure 4 Scalability, accuracy and rare cell-type detection rate of SC3 and benchmarking of the hybrid SC3

(a) Run times for different clustering methods as a function of the number of cells (N). All methods were run on a MacBook Pro (Mid 2014), OS X Yosemite 10.10.5 with a 2.8 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 RAM. The two results shown for SC3 correspond to nstart = 1,000 and nstart = 50, where nstart is the number of starting points for k-means clustering; (b) Reducing the number of k-means runs (nstart) from 1,000 to 50 results in only slightly worse performance for SC3, yet with significant computational savings, as shown in (a). The black line indicates ARI = 0.8; (c) Using the hybrid SC3 based on reference labels provided by the authors. Same as Fig. 2c in the main text, but using the reference labels provided by the authors as inputs to the SVM. Dots represent outliers higher (lower) than the highest (lowest) value within 1.5 × IQR, where IQR is the interquartile range. The black line indicates ARI = 0.8; (d) Robustness of SC3 for the detection of rare cell types. For two of the datasets, we remove different percentages of the cells in the rare cell types. The figure shows the mean fraction of SC3 runs in which all the rare cells were clustered together as a function of the total number of cells in the rare cell type; (e) Sensitivity of SC3 for identifying rare cell types when the hybrid SC3 approach is used with 30% of cells to train the SVM. This figure was derived from (d) by correcting the mean fraction of times that the rare cells were located in the same cluster using the probability of drawing rare cells within the 30% of all cells (Methods).
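
Panel (e) relies on the probability of drawing rare cells when 30% of all cells are sampled. The exact correction is given in the Methods, which are not reproduced here, so the following is only a sketch of the underlying hypergeometric probability: the chance that a random subset of cells contains at least a given number of rare cells. The function name, the parameters and the toy numbers are assumptions.

```python
from math import comb


def prob_at_least(n_total, n_rare, n_sampled, min_rare=1):
    """P(a random sample of n_sampled cells contains >= min_rare rare cells)."""
    if not 0 <= n_sampled <= n_total or not 0 <= n_rare <= n_total:
        raise ValueError("sample and rare counts must lie within the population")
    all_samples = comb(n_total, n_sampled)
    favourable = sum(
        comb(n_rare, k) * comb(n_total - n_rare, n_sampled - k)
        for k in range(min_rare, min(n_rare, n_sampled) + 1)
    )
    return favourable / all_samples


# Toy example: 10 cells, 2 of them rare, 3 cells (30%) drawn at random.
p_any_rare = prob_at_least(n_total=10, n_rare=2, n_sampled=3)
```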

Supplementary Figure 5 Analysis of SC3 clustering of the Macosko dataset

(a) The cells from the Macosko dataset were clustered 100 times using SC3. “Pairwise” indicates the ARIs between the different solutions obtained (a sample of 100 ARIs was taken) and “Reference” indicates the ARI as compared to the labels obtained by Macosko et al.; (b) Sankey diagram comparing the 39 clusters reported by Macosko et al. (left) and the 39 clusters obtained with SC3 (right). The widths of the lines linking both sets of clusters correspond to the number of cells they have in common. Colors and cell types as in Macosko et al.

Supplementary Figure 6 Explanation of biological insights provided by SC3

(a) Illustration of the difference between marker genes and differentially expressed genes. In this small example, 20 cells containing 14 genes with binary expression values (blue for ‘off’, red for ‘on’) are clustered. Only genes 1–4 can be considered marker genes, whereas all 14 genes are differentially expressed; (b) Density of distributions of AUROC (sample of 1,000 values for each dataset) obtained from merging 100 calculations of marker genes using randomly shuffled assignments of reference labels (provided by the authors, see Methods); (c) Outlier scores for all N = 268 cells of the Deng dataset as generated by SC3 (colors correspond to the 10 reference clusters provided by the authors, the same as Stage in Fig. 2d). The nine cells with a high outlier score in the red cluster (black arrow) were prepared using a different protocol (see text for details), and their scores are thus attributed to a technical artifact.
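
The distinction in panel (a) and the AUROC values in panel (b) can be illustrated with a toy calculation. The sketch assumes that a gene is scored by how well its expression separates one cluster from all other cells, measured as the area under the ROC curve. The binary expression values and cluster sizes below are invented for illustration and are not the 20-cell example from the figure.

```python
def auroc(in_cluster, out_cluster):
    """Probability that a cell inside the cluster has higher expression
    than a cell outside it (ties count as one half)."""
    wins = 0.0
    for x in in_cluster:
        for y in out_cluster:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(in_cluster) * len(out_cluster))


# Three hypothetical clusters of four cells each, binary expression.
clusters = {"A": range(0, 4), "B": range(4, 8), "C": range(8, 12)}
marker_like = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 'on' in cluster A only
de_only = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]      # 'on' in clusters A and B


def best_auroc(expression):
    """Highest one-versus-rest AUROC over all clusters."""
    scores = []
    for cells in clusters.values():
        inside = [expression[i] for i in cells]
        outside = [expression[i] for i in range(len(expression)) if i not in cells]
        scores.append(auroc(inside, outside))
    return max(scores)


marker_score = best_auroc(marker_like)
de_score = best_auroc(de_only)
```

Both genes differ between clusters, but only the first separates a single cluster perfectly from the rest, which is the sense in which a gene can be differentially expressed without being a marker.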

Supplementary Figure 7 Cell sorting and genotyping procedures for patients

(a) Contour plots describing the sorting strategy for isolating HSCs in patient 2 (the same was done for patient 1). CD34, CD38, CD90 and CD45RA expression is displayed using a log scale; (b) Lineage-negative, CD34+/CD38-/CD90+/CD45RA- single cells were sorted into individual wells for scRNA-seq or colony growth in a cytokine cocktail allowing progenitor cell expansion. For genotyping, the JAK2V617F and TET2 loci were characterised using Sanger sequencing; (c) Clonal composition of patients 1 and 2 obtained by Sanger sequencing of the JAK2V617F and TET2 loci, as described in (b) (Methods). Colors are the same as the Cluster colors in Fig. 3.

Supplementary Figure 8 Quality control of cells in the patient data

(a) Number of cells with a given number of expressed genes in each patient. Cells on the left side of the red line were removed from further analysis as lowly expressed; (b) Number of cells with a given (# of ERCC reads)/(# of endogenous reads) ratio in each patient. Cells on the right side of the red line were removed from further analysis as outliers.
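
The two filters in this caption amount to a pair of per-cell thresholds. The sketch below shows the logic only. The cut-off values (the positions of the red lines) are not given in the caption, so `min_genes` and `max_ercc_ratio` and the example cells are hypothetical placeholders.

```python
def passes_qc(n_expressed_genes, ercc_reads, endogenous_reads,
              min_genes, max_ercc_ratio):
    """Keep a cell if it expresses enough genes and its spike-in ratio is low."""
    if endogenous_reads == 0:
        return False  # ratio undefined, treat the cell as failed
    ercc_ratio = ercc_reads / endogenous_reads
    return n_expressed_genes >= min_genes and ercc_ratio <= max_ercc_ratio


# Hypothetical cells: (cell id, expressed genes, ERCC reads, endogenous reads)
cells = [
    ("cell_1", 4200, 10_000, 900_000),
    ("cell_2", 800, 12_000, 700_000),    # too few expressed genes
    ("cell_3", 3900, 400_000, 500_000),  # spike-in ratio too high
]
kept = [
    cell_id
    for cell_id, genes, ercc, endo in cells
    if passes_qc(genes, ercc, endo, min_genes=2000, max_ercc_ratio=0.5)
]
```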

Supplementary Figure 9 Clustering of scRNA-seq data from patient 1

Consensus matrices corresponding to different values of k. For average silhouette width and stability see Methods.

Supplementary Figure 10 Clustering of scRNA-seq data from patient 2

Consensus matrices corresponding to different values of k. For average silhouette width and stability see Methods.

Supplementary Figure 11 Clustering of scRNA-seq data using combined patient 1 and patient 2 datasets

Consensus matrices corresponding to different values of k. For average silhouette width and stability see Methods.

Supplementary Figure 12 Additional lines of evidence that SC3 can help to define subclonal composition

(a) Comparison of the coefficient of variation of gene expression in Tet2 and WT subclones of patient 1; (b) Sorting of haematopoietic stem and progenitor cells from patients 1 and 2 using antibodies that target surface markers identified using SC3. Our analysis suggests that CD83 should be specific for WT clones, CD127 and CD244 for the Tet2-only mutant clones, while CD82 is specific to double-mutant clones. Percentages account for CD38+CD34+ cells positive for the indicated surface marker.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–12, Supplementary Tables 2, 4 and Supplementary Results 1–5 (PDF 4591 kb)

Supplementary Table 1

SC3 analysis of Macosko dataset (XLSX 1211 kb)

Supplementary Table 3

Deng marker genes obtained by SC3 (XLSX 168 kb)

Supplementary Table 5

Marker genes analysis of patients 1 & 2 (XLSX 81 kb)

Supplementary Software 1

SC3 v.1.1.2 source files used to generate the analyses in this paper. (ZIP 1620 kb)

Supplementary Software 2

Source Rmd, python and text files used to generate Supplementary Results 1-4 (ZIP 285 kb)

Cite this article

Kiselev, V., Kirschner, K., Schaub, M. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14, 483–486 (2017). https://doi.org/10.1038/nmeth.4236

  • DOI: https://doi.org/10.1038/nmeth.4236
