An important distinction between pattern discovery-based analysis of primary sequence data (either protein or DNA) and traditional methods of biosequence analysis is that pattern discovery does not require multiple sequence alignment as a means of pattern detection. Rather, it discovers patterns in an alignment-independent way.

For this reason, pattern discovery works well in the "twilight zone" (<20% protein sequence similarity) and is thus an important complement to, not a replacement for, alignment-based methods.
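
To make "alignment-independent" concrete, the following is a minimal sketch, not the method described in this text: it counts fixed-length substrings (k-mers) that recur across unaligned sequences, wherever they occur. The function name `discover_kmers`, the parameters `k` and `min_support`, and the toy sequences are all assumptions introduced for illustration; real pattern discovery algorithms handle wildcards and variable-length patterns.

```python
from collections import defaultdict


def discover_kmers(sequences, k=3, min_support=2):
    """Return {kmer: set of sequence indices} for every k-mer that occurs
    in at least `min_support` sequences. No alignment is performed:
    a k-mer counts wherever it appears in a sequence."""
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        # Slide a window of width k over the sequence.
        for start in range(len(seq) - k + 1):
            support[seq[start:start + k]].add(idx)
    return {kmer: ids for kmer, ids in support.items() if len(ids) >= min_support}


# Hypothetical toy sequences; the shared motif sits at a different offset in each.
toy_sequences = ["MKTAYIAKQR", "GGAYIAPLLV", "QRSTAYIAWW"]
patterns = discover_kmers(toy_sequences, k=4, min_support=3)
print(patterns)
```

In this toy input, "AYIA" is the only 4-mer present in all three sequences, and it is found even though it starts at a different position in each one and the flanking residues share little similarity.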

Pattern discovery is useful for:

  • sequence analysis problems in which some sequences have low or undetectable similarity;
  • differential analysis problems, in which one seeks small, subtle similarities, whether localized or distributed, among proteins that may be related, for example, to pharmacological liabilities; and
  • model-independent, data-centric data mining approaches to sequence analysis.

Data pertinent to drug discovery typifies complex, high-dimensionality data. Regardless of whether one considers the chemical or the biological domain, the entities themselves are extraordinarily complex, and their interactions are even more so. "Chemistry space" is itself huge, and descriptors capable of usefully discriminating molecular functionality frequently have dimensionalities from 10^3 to 10^7 or greater.

There are several approaches to dealing with high-dimensionality data. Some methods explicitly ignore interactions among variables: they rank individual inputs by their information content and build models using only the most information-rich inputs. Others explicitly reduce the dimensionality of the data and look at interactions among a limited number of variables, often on a statistical basis.
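
The first of these approaches can be sketched as follows. This is an illustrative example under stated assumptions, not a specific published method: "information content" is taken here to mean the mutual information between one discrete input and the outcome label, and the names `mutual_information`, `top_features`, and the toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def top_features(columns, labels, n_keep):
    """Score each input on its own (interactions are ignored by design)
    and keep the names of the n_keep highest-scoring inputs."""
    scores = {name: mutual_information(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]


# Hypothetical toy data: four samples, three binary inputs, one binary label.
labels = [1, 1, 0, 0]
columns = {
    "a": [1, 1, 0, 0],  # identical to the label: 1 bit of information
    "b": [1, 0, 1, 0],  # independent of the label: 0 bits
    "c": [1, 1, 1, 0],  # partially informative
}
selected = top_features(columns, labels, n_keep=2)
print(selected)
```

Because every input is scored in isolation, an input that is informative only in combination with another one receives a score of zero and is discarded.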

Still other methods are "greedy": they test interactions among input variables and either accept or reject each one. A rejected input is never revisited for possible recombination with other, as yet unconsidered, inputs, so the resulting solutions may be sub-optimal.
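
The cost of never revisiting a rejected input can be shown with a small sketch. It assumes a single-pass greedy procedure scored by the joint mutual information between the accepted inputs and the label; that scoring choice, the names `joint_score` and `greedy_select`, and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def joint_score(columns, selected, labels):
    """Information that the selected inputs, taken jointly, carry about the label."""
    if not selected:
        return 0.0
    rows = list(zip(*(columns[name] for name in selected)))
    return mutual_information(rows, labels)


def greedy_select(columns, labels, min_gain=1e-9):
    """Visit each input once; accept it only if it improves the current score.
    Rejected inputs are never reconsidered."""
    accepted, rejected, best = [], [], 0.0
    for name in columns:
        score = joint_score(columns, accepted + [name], labels)
        if score - best > min_gain:
            accepted.append(name)
            best = score
        else:
            rejected.append(name)
    return accepted, rejected, best


# Hypothetical toy data: the label is the XOR of inputs "a" and "b".
columns = {
    "a": [0, 0, 1, 1],
    "b": [0, 1, 0, 1],
    "c": [0, 1, 1, 1],  # weakly informative on its own
}
labels = [0, 1, 1, 0]

accepted, rejected, best = greedy_select(columns, labels)
print(accepted, rejected, round(best, 3))
print(joint_score(columns, ["a", "b"], labels))
```

Here "a" and "b" each carry no information alone, so both are rejected and only "c" is accepted, although "a" and "b" together determine the label completely (1 bit).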

Beyond the problems inherent in the analysis of high-dimensionality data, there is another equally important consideration. Often, data exists from multiple domains, yet it is typically analyzed in a domain-specific way: chemistry data is analyzed in isolation from biological data, and vice versa. We believe that this approach ignores the extent to which analysis in one domain can usefully inform the analysis in another domain. This is another example of interactions among inputs, but one that is frequently overlooked.