Multiplex immunohistochemistry (mIHC) and multiplexed ion beam imaging (MIBI) images are usually phenotyped using a manual thresholding process. The thresholding is prone to biases, especially when examining multiple images with high cellularity.
Unsupervised cell-phenotyping methods including PhenoGraph, flowMeans, and SamSPECTRAL, primarily used in flow cytometry data, often perform poorly or need elaborate tuning to perform well in the context of mIHC and MIBI data. We show that semi-supervised cell clustering using Random Forest, linear discriminant analysis, and quadratic discriminant analysis performs better. We test the performance of the methods on two mIHC datasets from the University of Colorado School of Medicine and a publicly available MIBI dataset. Each dataset contains many highly complex images.
Several multiplex tissue imaging technologies have recently been developed for probing single-cell spatial biology, including multiparameter immunofluorescence, multiplex immunohistochemistry (mIHC), and multiplexed ion beam imaging (MIBI).
The spatial capabilities of these new technologies offer researchers the potential to develop a novel understanding of the biological mechanisms underlying cellular and protein interactions in a wide array of scientific contexts. These platforms are rapidly developing and all produce data of a similar structure: two-dimensional images of tissue at the resolution of cells and nuclei, where proteins in the sample have been labeled with antibodies called “markers” that attach to cell membranes.
mIHC data collected from platforms such as Vectra 3 or Vectra Polaris typically have 6–8 markers, while some platforms like MILAN can have around 40 markers. MIBI images have 40–50 markers.
mIHC and MIBI technologies have many data pre-processing and analysis steps that have not yet been uniformly implemented. Cell-phenotyping, defined as identification of cell populations based on marker expression, is a challenging process in this context. In most current cell-phenotyping approaches, researchers must manually set a threshold intensity value for every marker, and cells are then phenotyped based on the binarized expression of all the markers. For example, CD4 T cells are positive for markers CD3 and CD4 and negative for CD8. This manual phenotyping (gating) approach is cumbersome for high-parameter panels and depends on the reliability and expert knowledge of the user selecting positive cells or choosing thresholds, which may differ between users. Thus, manual gating is not only prone to human error but also time-consuming and costly. Algorithms have already been developed to tackle these same phenotyping issues for multiplex technologies that analyze single cells in a liquid suspension without spatial resolution, namely flow and mass cytometry. In particular, automated gating methods using machine learning algorithms have become increasingly popular as the number of analyzed parameters has increased.
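The manual gating logic described above can be illustrated with a minimal Python sketch. The threshold values, the rule table, and the function names `binarize` and `gate` are hypothetical and chosen only for illustration; they are not taken from inForm or any other software used in this paper.

```python
# Hypothetical per-marker intensity thresholds, as an analyst might set them.
THRESHOLDS = {"CD3": 1.0, "CD4": 0.8, "CD8": 0.9}

# Each phenotype is a pattern of required marker states (True = positive).
PHENOTYPE_RULES = {
    "CD4 T cell": {"CD3": True, "CD4": True, "CD8": False},
    "CD8 T cell": {"CD3": True, "CD4": False, "CD8": True},
}


def binarize(cell, thresholds):
    """Return {marker: True/False} by comparing each intensity to its threshold."""
    return {marker: cell[marker] > cut for marker, cut in thresholds.items()}


def gate(cell, thresholds=THRESHOLDS, rules=PHENOTYPE_RULES):
    """Assign the first phenotype whose marker pattern matches, else 'Other'."""
    states = binarize(cell, thresholds)
    for name, pattern in rules.items():
        if all(states[marker] == wanted for marker, wanted in pattern.items()):
            return name
    return "Other"


# Three toy cells with made-up marker intensities.
cells = [
    {"CD3": 2.1, "CD4": 1.5, "CD8": 0.2},
    {"CD3": 1.8, "CD4": 0.1, "CD8": 1.4},
    {"CD3": 0.3, "CD4": 0.2, "CD8": 0.1},
]
labels = [gate(cell) for cell in cells]
```

Every label here depends directly on the chosen thresholds, which is why two users with different cut-offs can phenotype the same cell differently.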
Our aim in this paper is to compare automated cell-phenotyping algorithms in the context of mIHC and MIBI datasets. We adapt approaches originally developed for two non-spatial technologies, flow and mass cytometry, and test our algorithms on two mIHC datasets [4, 8] obtained from the University of Colorado School of Medicine and one publicly available MIBI dataset.
Existing phenotyping algorithms
Unsupervised learning algorithms
Unsupervised cell-phenotyping algorithms partition cells into different classes based on their multiplex marker expression without using any prior knowledge. These methods are initially unbiased and usually time and memory efficient as well. In addition, novel cell types and populations can be discovered by not biasing clustering algorithms with prior information about marker expression. However, these methods suffer from several major limitations. For example, once the cells have been classified by an unsupervised algorithm, researchers must still manually gate the obtained classes to identify meaningful cell types (e.g., CD4 T cells, CD68+ macrophages). This step can be cumbersome and is again prone to human error. PhenoGraph, flowMeans, and SamSPECTRAL are some of the most popular unsupervised cell-phenotyping algorithms [6, 7].
Semi-supervised learning algorithms
Semi-supervised cell-phenotyping approaches typically involve building a predictive model using multiplex marker expression from a subset of cells in a dataset, called the training set, that have been manually phenotyped. The built models are then used to phenotype the remaining cells, or the test set. Unlike unsupervised methods, the cells in this case are directly assigned to existing phenotypes, which obviates the problem of matching arbitrary clusters to meaningful cell types. One can argue that the first step of manually phenotyping cells in the training set is subject to human error. However, the size of the training set is usually just a fraction of the full dataset. Therefore, ensuring the purity of manual phenotyping of the training dataset should be easy relative to manually phenotyping all of the data, though this remains a practical limitation for all current approaches.
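The train-then-predict workflow can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the data are simulated, the split is made by image (as in the evaluation reported later in this paper), and a simple nearest-centroid rule stands in for the classifier. The nearest-centroid rule is not one of the methods compared here; it only keeps the sketch short. All function and field names are hypothetical.

```python
import random
from collections import defaultdict


def split_by_image(cells, n_train_images, seed=0):
    """Randomly pick training images; all other images form the test set."""
    images = sorted({cell["image"] for cell in cells})
    train_ids = set(random.Random(seed).sample(images, n_train_images))
    train = [cell for cell in cells if cell["image"] in train_ids]
    test = [cell for cell in cells if cell["image"] not in train_ids]
    return train, test


def fit_centroids(train):
    """Mean marker vector per manually assigned phenotype."""
    grouped = defaultdict(list)
    for cell in train:
        grouped[cell["label"]].append(cell["markers"])
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, markers):
    """Assign the phenotype whose centroid is closest in squared distance."""
    def squared_distance(center):
        return sum((x - m) ** 2 for x, m in zip(markers, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Simulated data: 4 images, two markers (CD3, CD8), two phenotypes.
rng = random.Random(1)
cells = []
for image in ["img1", "img2", "img3", "img4"]:
    for label, center in [("CD3+/CD8+", (3.0, 3.0)), ("CD3+/CD8-", (3.0, 0.0))]:
        for _ in range(20):
            markers = [rng.gauss(mu, 0.3) for mu in center]
            cells.append({"image": image, "label": label, "markers": markers})

train, test = split_by_image(cells, n_train_images=1)
centroids = fit_centroids(train)
predicted = [predict(centroids, cell["markers"]) for cell in test]
accuracy = sum(p == cell["label"] for p, cell in zip(predicted, test)) / len(test)
```

Because predictions are drawn from the phenotype names present in the training set, no post hoc mapping of clusters to cell types is needed.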
DeepCyTOF, CyTOF linear classifier, and ACDC are popular semi-supervised methods in flow and mass cytometry. CyTOF linear classifier, which is based on linear discriminant analysis (LDA), has been shown to outperform more complex algorithms like DeepCyTOF and ACDC on several CyTOF datasets [7, 16]. All the above methods are briefly described further in Additional file 1: Table S1.
LDA assumes that the data have equal variance across groups and are normally distributed. Though these assumptions may hold for CyTOF data, in mIHC datasets both assumptions are violated. To address these problems, we consider more general machine-learning algorithms such as quadratic discriminant analysis (QDA) and Random Forest. QDA is similar to LDA but does not require equal variance across groups. The decision tree-based Random Forest method is robust to non-normal data and has several additional documented advantages, including minimal tuning parameters, excellent off-the-shelf prediction, honest estimates of classification through out-of-bag samples, and stable prediction behavior. Therefore, in the context of mIHC and MIBI data, we propose to use Random Forest and compare its performance with LDA and QDA.
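For reference, the standard textbook forms of the two discriminant rules make the difference between LDA and QDA concrete. The notation is ours and does not come from the cited works: \(x\) is the marker vector of a cell, and \(\mu_k\), \(\Sigma_k\) and \(\pi_k\) are the mean vector, covariance matrix and prior proportion of phenotype \(k\). Under the Gaussian assumption, QDA assigns a cell to the phenotype that maximizes

\[
\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k .
\]

LDA additionally assumes a common covariance \(\Sigma_k = \Sigma\) for all phenotypes. Dropping the terms that no longer depend on \(k\) leaves a rule that is linear in \(x\):

\[
\delta_k^{\mathrm{LDA}}(x) = x^{\top}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \log\pi_k .
\]

QDA therefore has to estimate and invert one covariance matrix per phenotype, whereas LDA needs only a single pooled estimate.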
Our analysis incorporated three multiplex tissue imaging datasets: an ovarian cancer dataset acquired on the mIHC Vectra Polaris platform (Akoya Biosciences), a lung cancer dataset acquired on the mIHC Vectra 3.0 system (Akoya Biosciences), and a breast cancer dataset collected on the MIBI platform (IonPath, Inc). The two mIHC datasets were segmented and phenotyped using inForm (v2.4.8, Akoya Biosciences), commercially available software for Vectra data, and the MIBI dataset was phenotyped in MATLAB using deep learning-based methods. For each cell, the expression data is available for multiple markers. The datasets are described in detail below and Table 1 lists the overall distribution of the cell types in different datasets.
mIHC ovarian cancer dataset
There are 302,147 cells from 132 subjects. There are five different cell types: CD19+, CD3+/CD8-, CD3+/CD8+, CD68+, and CK+/Ki67+. Six markers (CD19, CD3, CK, CD8, Ki67, and CD68) are observed in each of the cells. More details on this data can be found at .
mIHC lung cancer dataset
There are 1,590,327 cells from 153 subjects, each with 3–5 images (761 images in total). There are six different cell types: CD14+, CD19+, CD4+, CD8+, CK+, and Other+ (cells that do not belong to any of the indicated phenotypes). There are five markers: CD19, CD3, CK, CD8, and CD14. More details on this data can be found at .
MIBI breast cancer dataset
The triple-negative breast cancer (TNBC) MIBI dataset has 201,656 cells from 43 subjects, with one image per subject. It has six different cell groups: Immune, Endothelial, Mesenchymal-like, Tumor, Keratin-positive tumor, and Unidentified. There are 44 markers available, such as CD3, CD8, CD63, Ki67, and Vimentin.
We primarily focused on the semi-supervised methods in this paper. First, we briefly highlighted some of the major problems of the unsupervised methods using the mIHC lung cancer dataset. Then, we compared the usability and performance of Random Forest with LDA and QDA in all three datasets.
In the mIHC lung cancer dataset, we clustered the cells of one subject at a time using the unsupervised methods PhenoGraph, SamSPECTRAL and flowMeans. T-distributed stochastic neighbor embedding (t-SNE) has been used by researchers to visualize high-dimensional data in various contexts including flow and mass cytometry [23, 24]. In Fig. 1, for a particular subject, we compared the true cell labels with the labels estimated using the unsupervised methods, overlaid on the first two t-SNE dimensions of the marker data. PhenoGraph and SamSPECTRAL depend on the choice of several pre-specified hyper-parameters. PhenoGraph depends on the number of nearest neighbors (NNs), whereas SamSPECTRAL depends on two quantities known as sigma and separation factor. For PhenoGraph, we considered four different NN sizes, namely \(0.5 \%, 1\%, 5\%\) and \(10 \%\) of the total number of cells. For most of the subjects, including the one depicted in Fig. 1, PhenoGraph classified the cells into a large number of clusters when the NN size was small. For larger NN sizes, PhenoGraph generated around 6 clusters, but additional evaluation of the clusters would be required to map them to true and meaningful cell labels. Similarly, the performance of SamSPECTRAL was highly variable depending on the input values of the tuning parameters, and none of the combinations yielded clusters that remotely resembled the true cell labels. On the other hand, the result from flowMeans looked fairly close to the true cell labels and would require the least post-clustering evaluation of the three methods.
We should reiterate that we did not provide a systematic comparison of the unsupervised methods here. Our goal was to briefly highlight the major difficulties with the unsupervised methods, namely that the results may vary significantly with the choice of the tuning parameters and that the obtained clusters require additional evaluation before they can be meaningfully mapped to the true cell phenotypes.
For each dataset, we randomly selected m training images (out of the total size, M) to train the models on and evaluated their performance on the remainder of the images. We varied m and, for every choice of m, we considered 5 repetitions. Results were aggregated across repetitions and summarized by prediction accuracy, adjusted Rand index (ARI), and normalized mutual information (NMI).
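The three summary measures can be computed from a pair of label vectors as in the following standard-library Python sketch. The function names are ours, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the paper does not state which variant was used.

```python
from collections import Counter
from math import comb, log


def accuracy(true, pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def adjusted_rand_index(true, pred):
    """Chance-corrected agreement between two partitions of the same cells."""
    n = len(true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(true, pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true)
    pairs, rows, cols = Counter(zip(true, pred)), Counter(true), Counter(pred)
    mi = sum(c / n * log(c * n / (rows[t] * cols[p])) for (t, p), c in pairs.items())
    h_true = -sum(c / n * log(c / n) for c in rows.values())
    h_pred = -sum(c / n * log(c / n) for c in cols.values())
    mean_entropy = (h_true + h_pred) / 2
    if mean_entropy == 0:  # both partitions consist of a single group
        return 1.0
    return mi / mean_entropy


# Toy example: six cells, one of which is mislabeled.
true = ["T", "T", "T", "B", "B", "B"]
pred = ["T", "T", "B", "B", "B", "B"]
acc = accuracy(true, pred)
ari = adjusted_rand_index(true, pred)
nmi = normalized_mutual_information(true, pred)
```

Accuracy requires predicted and true labels to share names, whereas ARI and NMI depend only on how the cells are grouped and are unchanged if the predicted labels are renamed.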
mIHC ovarian cancer dataset
We considered four training set-sizes (m) which were fractions of the total size M, \(m = 7\) (\(5\%\)), 13 (\(10\%\)), 20 (\(15\%\)), and 26 (\(20\%\)). Table 2 lists the mean (and standard deviation) of prediction accuracy, ARI, and NMI. Even for the smallest m, all three methods performed well, with Random Forest having the highest mean prediction accuracy, ARI, and NMI. Random Forest also had a markedly lower standard deviation, indicating greater robustness. As m increased, prediction accuracy, ARI, and NMI improved marginally for all three methods.
mIHC lung cancer dataset
We considered m to be 4 (\(0.5\%\)), 8 (\(1\%\)), 15 (\(2\%\)), 23 (\(3\%\)) and 76 (\(10\%\)). Random Forest again outperformed LDA and QDA (Table 2). However, the prediction accuracy was significantly lower for the smaller training set-sizes. Random Forest’s performance steadily improved as the training set-size (m) increased, whereas for LDA and QDA, the performance stayed nearly the same. We noticed a dip in the overall performance of all the methods in this dataset compared to the ovarian cancer dataset. Further details are provided in Additional file 1. Additional file 1: Figs. S1–3 respectively show the accuracy of Random Forest for predicting every cell type, the proportion of predicted cell types vs every known cell type, and the overall intensity of the CD19 marker in different images.
MIBI breast cancer dataset
We considered three values of m: 2 (\(5\%\)), 4 (\(10\%\)) and 8 (\(20\%\)). Even with the smallest m, Random Forest achieved high prediction accuracy (Table 2). LDA was consistently poorer than Random Forest, but its accuracy increased steadily as m increased. We did not report the performance of QDA for this dataset since it often encountered an error due to “rank deficiency”, especially for small training sizes (refer to Additional file 1: Table S2).
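One plausible source of such a rank-deficiency error, offered here as our reading rather than a diagnosis reported in the paper, is that QDA estimates a separate covariance matrix for each phenotype. A sample covariance computed from n cells has rank at most n − 1, so with 44 markers any phenotype represented by fewer than 45 training cells cannot yield an invertible matrix. The sketch below flags such phenotypes; the function name and the cell counts are hypothetical.

```python
from collections import Counter


def qda_rank_deficient_classes(labels, n_markers):
    """Return phenotypes with too few cells for a full-rank covariance.

    A sample covariance estimated from n cells has rank at most n - 1, so a
    class needs at least n_markers + 1 cells before its matrix can be inverted.
    """
    counts = Counter(labels)
    return sorted(name for name, n in counts.items() if n - 1 < n_markers)


# Hypothetical training labels pooled from a small number of images.
labels = ["Tumor"] * 500 + ["Immune"] * 120 + ["Endothelial"] * 30
too_small = qda_rank_deficient_classes(labels, n_markers=44)
```

The check is a necessary condition only: a phenotype with enough cells can still produce a singular covariance if some markers are constant or collinear within that phenotype.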
We have noticed that cells of certain types can be incorrectly phenotyped if the corresponding markers are not informative enough. For example, in some subjects from the lung cancer dataset, CD19 marker intensity is not distinctive across different cell types, which makes identifying CD19+ cells hard. It should also be kept in mind that the mIHC datasets we analyzed were originally phenotyped using the inForm software. It is possible that the original phenotyping was inaccurate and thus that our “ground truth” itself was biased.
The run-time comparison of the methods is provided in Additional file 1: Table S2. We noted that LDA and QDA both took a fraction of the time taken by the Random Forest model. In the MIBI dataset, QDA encountered a convergence error for some particular choices of the training set, especially with a smaller training set-size. Therefore, when there are large numbers of markers and cells, we recommend using LDA over Random Forest, which would potentially sacrifice some degree of accuracy but be much more scalable. It should also be kept in mind that semi-supervised methods in general can be unreliable for detecting rare cell populations, which would ideally require a specialist’s manual evaluation of the marker expression profiles. All the datasets we considered in this study had 5–6 cell types. In future work, we will check the applicability of the methods to multiplex imaging datasets that have a larger number of cell types.
Availability of data and materials
The MIBI breast cancer dataset used in the paper can be found at https://www.angelolab.com/mibi-data. The mIHC datasets are available from the corresponding author on reasonable request. Our methods are available as an R package named VectraMIBI at this link. The package builds a Random Forest model on a given training dataset and uses predictions from that model to annotate (phenotype) the cells of a test dataset. For basic exploration of the training dataset, the package also provides visualization tools, including heat-maps of the mean marker intensity over different cell types and image-specific ridge-plots of the marker intensity for different cell types.
mIHC: Multiplex immunohistochemistry
MIBI: Multiplexed ion beam imaging
LDA: Linear discriminant analysis
QDA: Quadratic discriminant analysis
Bataille F, Troppmann S, et al. Multiparameter immunofluorescence on paraffin-embedded tissue sections. Appl Immunohistochem Mol Morphol. 2006;14(2):225–8.
Johnson AM, Bullock BL, et al. Cancer cell-intrinsic expression of MHC class II regulates the immune microenvironment and response to anti-PD-1 therapy in lung adenocarcinoma. J Immunol. 2020;204(8):2295–307.
Bosisio FM, Antoranz A, van Herck Y, Bolognesi MM, Marcelis L, Chinello C, Wouters J, Magni F, Alexopoulos L, Stas M, et al. Functional heterogeneity of lymphocytic patterns in primary melanoma dissected through single-cell multiplexing. Elife. 2020;9:e53008.
Jordan KR, Sikora MJ, Slansky J, et al. The capacity of the ovarian cancer tumor microenvironment to integrate inflammation signaling conveys a shorter disease-free interval. Clin Cancer Res. 2020;26(23):6362–73.
We thank the Human Immune Monitoring Shared Resource and support of the University of Colorado Human Immunology and Immunotherapy Initiative for their expert assistance in multiplex IHC and generation of the ovarian and lung datasets. We acknowledge the support of the University of Colorado Cancer Center Support Grant (P30CA046934).
B.G. is supported by the Department of Defense Award (OC170228) and an American Cancer Society Research Scholar Award (134106-RSG-19-129-01-DDC). E.L.S. is supported by NIH grant K12 CA086913 and ACS IRG #16-184-56 from the American Cancer Society to the University of Colorado Cancer Center, and a grant from the Cancer League of Colorado. S.S. is funded by the Grohne-Stepp Endowment from the University of Colorado Cancer Center. J.W. is supported by the NIH/NCATS Colorado CTSA (UL1TR002535).
Authors and Affiliations
Department of Biostatistics and Informatics, University of Colorado CU Anschutz Medical Campus, Aurora, Colorado, USA
Souvik Seal, Julia Wrobel & Debashis Ghosh
Department of Medicine, School of Medicine, University of Colorado CU Anschutz Medical Campus, Aurora, Colorado, USA
Amber M. Johnson & Raphael A. Nemenoff
Division of Medical Oncology, School of Medicine, University of Colorado CU Anschutz Medical Campus, Aurora, Colorado, USA
Erin L. Schenk
Department of Obstetrics and Gynecology, School of Medicine, University of Colorado CU Anschutz Medical Campus, Aurora, Colorado, USA
Benjamin G. Bitler
Department of Immunology and Microbiology, School of Medicine, University of Colorado CU Anschutz Medical Campus, Aurora, Colorado, USA
SS, JW and DG were involved with the conceptualization of the project, methodological development, analysis and writing of the first draft of the manuscript. All authors (SS, JW, AMJ, RAN, ELS, BGB, KRJ, DG) participated in the writing process. All authors read and approved the final manuscript.
Here, we provide a section explaining the overall dip in the performance of the methods in the mIHC lung cancer dataset. Figures S1–3 focus on the mIHC lung cancer dataset and respectively show the scatter-plot of the accuracy of Random Forest for predicting every cell type, the bar-plot of the proportion of predicted cell types vs every known cell type, and the ridge-plot of overall CD19 marker intensity in the cells of different images. Tables S1 and S2 respectively list a summary of a few existing methods and the run-times of the methods in different datasets.
Seal, S., Wrobel, J., Johnson, A.M. et al. On clustering for cell-phenotyping in multiplex immunohistochemistry (mIHC) and multiplexed ion beam imaging (MIBI) data. BMC Res Notes 15, 215 (2022). https://doi.org/10.1186/s13104-022-06097-x