# Theory of Operation

QTube is based on Cytometric Fingerprinting, a high-throughput computational representation of list-mode data developed by Cira Discovery Sciences. The following section is excerpted from Rogers WT, Moser AR, Holyst HA, Mohler E, Bantly A, Moore J, "Cytometric Fingerprinting: Quantitative Characterization of Multivariate Distributions," submitted to Cytometry A (2007).

### Cytometric Fingerprinting Overview

The objective of Cytometric Fingerprinting (CF) is to represent the information in cytometric list-mode data in a form that enables quantitative comparison among samples. These fingerprints capture and encode the full multivariate correlations of complex, high-dimensional cytometric data. This representation is particularly useful when cell populations are not clearly delineated by optimized assays and the distribution of events in the multi-parameter space is not bimodal.

Each event in list-mode data is described by a vector of coordinates in a multidimensional space. Thus, a complete mathematical description of a sample is the multivariate probability distribution function defining the density of events in this space. This distribution may be approximated by dividing the space into small volume elements, counting the number of events in each volume element, and normalizing the count by the total number of events in the sample. In the limit of an infinite number of events, the regions may be made infinitesimally small, yielding the true probability distribution function. Of course, it is impossible to collect an infinite number of events, so the question of interest is: how does one accurately estimate the true probability distribution from a finite sample of events? Equally important, how does one represent this approximation of the multivariate probability density function in a form amenable to comparing disparate samples?
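The counting-and-normalizing approximation described above can be sketched in a few lines. This is a minimal illustration, not code from QTube: the function name `binned_density`, the use of hypercubic volume elements of a single `bin_width`, and the sample events are all assumptions made for the example.

```python
from collections import Counter
import math


def binned_density(events, bin_width):
    """Approximate the event distribution on a regular grid.

    events: list of equal-length coordinate tuples, one per event.
    Returns {grid_cell_index: fraction_of_events_in_that_cell}.
    """
    # Map each event to the integer index of the volume element containing it.
    counts = Counter(
        tuple(math.floor(x / bin_width) for x in event) for event in events
    )
    total = len(events)
    # Normalize each count by the total number of events in the sample.
    return {cell: n / total for cell, n in counts.items()}


# Hypothetical two-parameter events, for illustration only.
events = [(0.1, 0.2), (0.3, 0.4), (1.2, 0.1), (1.7, 1.9)]
density = binned_density(events, bin_width=1.0)
```

The values returned are fractions of events per cell; dividing each by the cell volume (`bin_width` raised to the number of dimensions) would give a density in the strict sense.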

The first question, that of estimating the probability density function (PDF) from a finite sample of events, has been a subject of research since the early 20th century (1). The most common non-parametric means of estimating a PDF is a histogram where space is divided into equal-width bins. For a complex (rapidly changing) PDF, one would like to choose small bins in order to accurately track the variation with respect to independent variables (low bias). On the other hand, one would like to choose bins of sufficient size to contain a large number of events in order to estimate the value of the density within a bin with high accuracy (low variance). This trade-off between number of bins and bin size is the classic bias-variance dilemma. For one independent variable and reasonably sized datasets, it is not difficult to balance the bias-variance requirements. However, for multidimensional data the curse of dimensionality is a severe limitation. Choosing bins of fixed width gives control over bias, but the problem of empty (or highly populated) bins, depending on the nature of the distribution, means that there is no control over the variance.

An alternative approach is to control the variance by choosing bins that contain equal numbers of events. (This strategy is particularly useful if the ultimate goal is to utilize bin event densities as features in classification since the measurement accuracy for each feature should be the same.) In the case of univariate data, there is a set of bin boundaries that accomplishes this goal (2). For multivariate data, however, there is not a unique solution. While this indeterminacy might seem like a disadvantage, in fact, it creates an opportunity to find a specific set of bin boundaries that does a superior job of reducing bias.
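To make the contrast between fixed-width and equal-count bins concrete, the following sketch bins one skewed univariate sample both ways. The helper names and the sample data are assumptions chosen for illustration; the equal-count boundaries are taken from `statistics.quantiles` in the Python standard library.

```python
import statistics
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior bin boundaries that divide the data range into equal widths."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]


def equal_count_edges(values, n_bins):
    """Interior bin boundaries placed at quantiles, so bins hold equal counts."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")


def bin_counts(values, edges):
    """Count how many values fall into each bin defined by the interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts


# A strongly skewed hypothetical sample: most values are small, a few are large.
values = [float(i ** 3) for i in range(1, 41)]

width_counts = bin_counts(values, equal_width_edges(values, 4))
count_counts = bin_counts(values, equal_count_edges(values, 4))
```

With this sample the fixed-width bins are very unevenly populated, while the quantile-based bins each hold the same number of events, which is the variance-control property described above.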

Other methods of representing and analyzing multidimensional flow cytometry data have been developed (3-6). The one most closely related to the present work is Probability Binning (PB) (7). PB represents a multidimensional probability distribution as a set of bins defining regions of the multidimensional space. The boundaries of these bins are chosen so that approximately equal numbers of events lie in each bin. Bins are found by selecting a coordinate dimension, determining the median in that coordinate, and dividing the data at the median value. In PB, the axis selection is made by calculating the variance of the data in the parent bin for each of the original coordinate dimensions and choosing the one dimension having the largest variance. Although the decision is made on the basis of the variance in each dimension, the split is not necessarily along the optimal direction, since the direction of maximum variance may not coincide with one of the coordinate axes.
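The PB splitting rule described above (pick the coordinate axis with the largest variance, divide at the median, recurse) can be sketched as follows. This is an illustrative reading of the description, not the published implementation: the function name `probability_bins` is hypothetical, and the sketch splits by rank so that tied values still produce equal halves.

```python
import statistics


def probability_bins(events, depth):
    """Recursively halve the events along the highest-variance coordinate axis.

    Returns a list of up to 2**depth bins, each a list of events.
    """
    if depth == 0 or len(events) < 2:
        return [events]
    n_dims = len(events[0])
    # Choose the original coordinate axis with the largest variance in this bin.
    axis = max(
        range(n_dims),
        key=lambda d: statistics.pvariance([e[d] for e in events]),
    )
    # Sorting and cutting at the midpoint is a split at the median of that axis.
    ordered = sorted(events, key=lambda e: e[axis])
    half = len(ordered) // 2
    return (probability_bins(ordered[:half], depth - 1)
            + probability_bins(ordered[half:], depth - 1))


# Hypothetical two-parameter events: wide spread in x, narrow spread in y.
events = [(float((i * 37) % 64), float((i * 11) % 8)) for i in range(64)]
bins = probability_bins(events, depth=3)
```

Because every split is made parallel to a coordinate axis, a population elongated along a diagonal would be cut at an angle to its long axis, which is the limitation noted in the text.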

CF differs from PB in three important ways:
(i) CF forms bins by splitting the data in the direction of maximum variance rather than along the original coordinate axes. This involves first determining the direction of maximum variance and then rotating the data space such that the principal coordinate axis lies in the direction of maximum variance.
(ii) CF creates a hierarchical, multi-resolution representation of the data. This is done by retaining and utilizing information for bins at each level of recursion.
(iii) CF utilizes the binned data to develop a fingerprint that is a one-dimensional representation embodying the information contained in the multi-resolution, multidimensional representation.
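The three points above can be illustrated with a small sketch. It is a simplified reading of the description, not the CF implementation: all function names are hypothetical, the direction of maximum variance is found by power iteration on the covariance matrix (a stand-in for a full eigendecomposition, which can fail if the start vector happens to be orthogonal to that direction), and projecting events onto the direction replaces an explicit rotation of the data space.

```python
import math


def principal_direction(events, iterations=200):
    """Return (mean, unit vector along the direction of maximum variance)."""
    n, dims = len(events), len(events[0])
    mean = [sum(e[d] for e in events) / n for d in range(dims)]
    centered = [[e[d] - mean[d] for d in range(dims)] for e in events]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dims)]
           for i in range(dims)]
    v = [1.0 / math.sqrt(dims)] * dims  # assumed start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate bin: all events identical
            break
        v = [x / norm for x in w]
    return mean, v


def split_along(events, mean, v):
    """Halve the events at the median of their projection onto v."""
    def projection(e):
        return sum((x - m) * d for x, m, d in zip(e, mean, v))
    ordered = sorted(events, key=projection)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def cf_levels(events, depth):
    """Keep the bins from every level of recursion (level 0 is the whole sample)."""
    levels = [[events]]
    for _ in range(depth):
        children = []
        for parent in levels[-1]:
            if len(parent) < 2:
                children.append(parent)
                continue
            mean, v = principal_direction(parent)
            children.extend(split_along(parent, mean, v))
        levels.append(children)
    return levels


def flatten(levels, total):
    """One-dimensional vector of bin fractions, coarsest level first."""
    return [len(b) / total for level in levels[1:] for b in level]


# Hypothetical events lying close to the diagonal y = x.
events = [(float(i), float(i) + 0.1 * (-1) ** i) for i in range(32)]
mean, direction = principal_direction(events)
levels = cf_levels(events, depth=3)
fingerprint = flatten(levels, len(events))
```

For the sample that defines the bins, every fraction at a given level is the same by construction; the vector becomes informative when the same bins are applied to a different sample.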

Additionally, CF includes novel algorithms for finding and representing bins from one data set and utilizing this bin representation to process a second data set. It also includes a novel method of forming a differential fingerprint that represents the degree of dissimilarity of a given instance to two or more classes of instances.
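One way to picture applying bins found in one data set to a second data set is to store each split as a direction and a threshold, then route new events through the stored splits. The sketch below assumes such a stored tree; the dictionary layout, the function names, and the hand-written `model` are hypothetical and are not taken from the CF paper. The differential fingerprint is not sketched because the text above does not define it.

```python
def leaf_index(event, node):
    """Walk a stored split tree and return the leaf-bin index for one event."""
    index = 0
    while node is not None:
        # Project the event onto the stored split direction.
        projection = sum(x * d for x, d in zip(event, node["direction"]))
        go_high = projection > node["threshold"]
        index = 2 * index + (1 if go_high else 0)
        node = node["high"] if go_high else node["low"]
    return index


def apply_bins(events, model, n_leaves):
    """Fraction of events from a new sample landing in each stored leaf bin."""
    counts = [0] * n_leaves
    for event in events:
        counts[leaf_index(event, model)] += 1
    return [c / len(events) for c in counts]


# Hypothetical two-level model, as if learned from a reference sample:
# first split on x at 0, then each half split on y at 0.
model = {
    "direction": (1.0, 0.0), "threshold": 0.0,
    "low": {"direction": (0.0, 1.0), "threshold": 0.0,
            "low": None, "high": None},
    "high": {"direction": (0.0, 1.0), "threshold": 0.0,
             "low": None, "high": None},
}

second_sample = [(-1.0, -1.0), (-1.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
fingerprint = apply_bins(second_sample, model, n_leaves=4)
```

If the second sample were distributed like the reference sample, each leaf would receive roughly the same fraction of events; departures from that show where the two samples differ.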

### Cytometric Fingerprints in QTube

QTube computes relatively low-resolution fingerprints, based only on selected parameters common to the tubes in a panel. It does not use the hierarchical structure of CF; it uses only the highest-resolution portion of the fingerprint. Each fingerprint is then represented as deviations from the expected value, and metrics are computed from this representation. These metrics are the numbers that appear in the QTube report.
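As a rough illustration of the deviation step, the sketch below assumes that the expected value of each bin is an equal share of the events (as it would be for equal-count bins) and summarizes the deviations with a root-mean-square value. Both choices are assumptions made for this example: the text does not say how QTube defines the expected value or which metrics it reports, and the function names are hypothetical.

```python
import math


def deviations(fingerprint):
    """Difference between each bin fraction and an assumed equal-share expectation."""
    expected = 1.0 / len(fingerprint)
    return [f - expected for f in fingerprint]


def rms(values):
    """Root-mean-square of a list of numbers (an illustrative summary metric)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical highest-resolution fingerprint with four bins.
fingerprint = [0.25, 0.25, 0.0, 0.5]
dev = deviations(fingerprint)
score = rms(dev)
```

A fingerprint that matches the expectation exactly gives deviations of zero in every bin, so larger summary values indicate a sample that departs further from the reference.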

#### References

1. Sturges, H.A. (1926). Journal of the American Statistical Association 21, 65-66.

2. Roederer, M., Treister, A., Moore, W. & Herzenberg, L.A. (2001). Cytometry 45, 37-46.

3. Murphy, R.F. (1985). Cytometry 6, 302-309.

4. Robinson, J.P., Durack, G. & Kelley, S. (1991). Cytometry 12, 82-90.

5. Robinson, J.P., Ragheb, K., Lawler, G., Kelley, S. & Durack, G. (1992). Cytometry 13, 75-82.

6. Lugli, E., Pinti, M., Nasi, M., Troiano, L., Ferraresi, R., Mussi, C., Salvioli, G., Patsekin, V., Robinson, J.P., Durante, C. et al. (2007). Cytometry A 71, 334-344.

7. Roederer, M., Moore, W., Treister, A., Hardy, R.R. & Herzenberg, L.A. (2001). Cytometry 45, 47-55.