Each event in list-mode data is described by a vector of coordinates in a multidimensional space. Thus, a complete mathematical description of a sample is the multivariate probability distribution function defining the density of events in this space. This distribution may be approximated by dividing the space into small volume elements, counting the number of events in each volume element, and normalizing the count by the total number of events in the sample. In the limit of an infinite number of events, the regions may be made infinitesimally small yielding the true probability distribution function. Of course, it is impossible to collect an infinite number of events, so the question of interest is, how does one accurately estimate the true probability distribution from a finite sample of events? Equally importantly, how does one represent this approximation of the multivariate probability density function in a form amenable to comparing disparate samples?

The first question, that of estimating the probability density function (PDF) from a finite sample of events, has been a subject of research since the early 20th century (1). The most common non-parametric means of estimating a PDF is a histogram where space is divided into equal width bins. For a complex (rapidly changing) PDF, one would like to choose small bins in order to accurately track the variation with respect to independent variables (low bias). On the other hand, one would like to choose bins of sufficient size to contain a large number of events in order to estimate the value of the density within a bin with high accuracy (low variance). This trade-off between number of bins and bin size is the classic bias-variance dilemma. For one independent variable and reasonably sized datasets, it is not difficult to balance the bias-variance requirements. However, for multidimensional data the curse of dimensionality is a severe limitation. Choosing bins of fixed width gives control over bias, but the problem of empty (or highly populated) bins, depending on the nature of the distribution, means that there is no control over the variance. An alternative approach is to control the variance by choosing bins that contain equal numbers of events. (This strategy is particularly useful if the ultimate goal is to utilize bin event densities as features in classification since the measurement accuracy for each feature should be the same.) In the case of univariate data, there is a set of bin boundaries that accomplishes this goal (2). For multivariate data, however, there is not a unique solution. While this indeterminacy might seem like a disadvantage, in fact, it creates an opportunity to find a specific set of bin boundaries that does a superior job of reducing bias.

Other methods of representing and analyzing multidimensional flow cytometry data have been developed (3-6). One that is most closely related to the present work is Probability Binning (PB) (7). PB represents a multidimensional probability distribution as a set of bins defining regions of the multidimensional space. The boundaries of these bins are chosen so that approximately equal numbers of events lie in each bin. Bins are found by selecting a coordinate dimension, determining the median in that coordinate, and dividing the data at the median value. In PB, the axis selection is made by calculating the variance of the data in the parent bin for each of the original coordinate dimensions and choosing the one dimension having the largest variance. Although the decision is made on the basis of the variance in each dimension, the split is not necessarily along the optimal direction since the direction of maximum variance may not coincide with one of the coordinate axes.

CF differs from PB in three important ways:

(i) CF forms bins by
splitting the data in the direction of maximum variance rather than
along the original coordinate axes. This involves first
determining the direction of maximum variance and then rotating the
data space such that the principle coordinate axis lies in the
direction of maximum variance.

(ii) CF creates a hierarchical, multi-resolution representation of the data. This is done by retaining and utilizing information for bins at each level of recursion.

(iii) CF utilizes the binned data to develop a fingerprint that is a one-dimensional representation embodying the information contained in the multi-resolution, multidimensional representation.

(ii) CF creates a hierarchical, multi-resolution representation of the data. This is done by retaining and utilizing information for bins at each level of recursion.

(iii) CF utilizes the binned data to develop a fingerprint that is a one-dimensional representation embodying the information contained in the multi-resolution, multidimensional representation.

Additionally, CF includes novel algorithms for finding and representing bins from one data set and utilizing this bin representation to process a second data set. It also includes a novel method of forming a differential fingerprint that represents the degree of dissimilarity of a given instance to two or more classes of instances.

1. Sturges, H.A. (1926). *Journal of the American Statistical
Association* **21**, 65-66.

2. Roederer,
M., Treister, A., Moore, W. & Herzenberg, *Cytometry* **45**, 37-46.

3. Murphy, R.F. (1985). *Cytometry* **6**, 302-309.

4, Robinson, J.P., Durack, G. &
Kelley, S. (1991). *Cytometry* **12**,
82-90.

5. Robinson,
J.P., Ragheb, K., Lawler, G., Kelley, S. & Durack, G. (1992). *Cytometry* **13**, 75-82.

6. Lugli, E., Pinti, M., Nasi, M.,
Troiano, L., Ferraresi, R., Mussi, C., Salvioli, G., Patsekin, V., Robinson,
J.P., Durante, C. et al. (2007).
*Cytometry A* **71**, 334-344.

7. Roederer,
M., *Cytometry* **45**, 47-55.