myscience.org › science wire › Making big data a little smaller

Making big data a little smaller

19 October 2017

When we think about digital information, we often think about size. A daily email newsletter, for example, maybe 75 to 100 kilobytes in size. But data also has dimensions, based on the numbers of variables in a piece of data. An email, for example, can be viewed as a high-dimensional vector where there's one coordinate for each word in the dictionary and the value in that coordinate is the number of times that word is used in the email. So a 75 Kb email that is 1,000 words long would result in a vector in the millions. This geometric view on data is useful in some applications, such as learning spam classifiers, but, the more dimensions, the longer it can take for an algorithm to run and the more memory the algorithm uses. As data processing got more and more complex in the mid-to-late 1990s, computer scientists turned to pure mathematics to help speed up the algorithmic processing of data. In particular, researchers found a solution in a theorem proved in the 1980s by mathematics William B. Johnson and Joram Lindenstrauss working the area of functional analysis. Known as the Johnson-Lindenstrauss lemma (JL lemma), computer scientists have used the theorem to reduce the dimensionality of data and help speed up all types of algorithms across many different fields, from streaming and search algorithms, to fast approximation algorithms for statistical and linear algebra and even algorithms for computational biology. But as data has grown even larger and more complex, many computer scientists have asked: Is the JL lemma really the best approach to pre-process large data into a manageably low dimension for algorithmic processing?