Data diversity

16 December 2016

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems have designed a new algorithm that makes it much more practical to select diverse subsets from a much larger dataset.

When data sets get too big, sometimes the only way to do anything useful with them is to extract much smaller subsets and analyze those instead. Those subsets have to preserve certain properties of the full sets, however, and one property that's useful in a wide range of applications is diversity. If, for instance, you're using your data to train a machine-learning system, you want to make sure that the subset you select represents the full range of cases that the system will have to confront.

Last week at the Conference on Neural Information Processing Systems, researchers from MIT's Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems presented a new algorithm that makes the selection of diverse subsets much more practical.

Whereas the running times of earlier subset-selection algorithms depended on the number of data points in the complete data set, the running time of the new algorithm depends on the number of data points in the subset. That means that if the goal is to winnow a data set with 1 million points down to one with 1,000, the new algorithm is 1 billion times faster than its predecessors.

'We want to pick sets that are diverse,' says Stefanie Jegelka, the X-Window Consortium Career Development Assistant Professor in MIT's Department of Electrical Engineering and Computer Science and senior author on the new paper.
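
The billion-fold figure quoted above can be checked with a little arithmetic. The article does not say how running time grows with the number of data points, so the sketch below assumes cubic growth, which is what makes the numbers work out: shrinking the relevant count from 1 million to 1,000 is a factor of 1,000, and 1,000 cubed is 1 billion. The function name `speedup` and its `exponent` parameter are illustrative only and do not come from the paper.

```python
def speedup(full_size: int, subset_size: int, exponent: int = 3) -> float:
    """Ratio of running times if cost grows as (number of points) ** exponent.

    The default exponent of 3 is an assumption chosen to match the article's
    figure; the article itself does not state the growth rate.
    """
    return (full_size / subset_size) ** exponent


# Winnowing 1 million points down to 1,000, as in the article's example.
ratio = speedup(1_000_000, 1_000)
print(f"{ratio:,.0f}")  # prints 1,000,000,000
```

Under linear growth (`exponent=1`) the same comparison would give only a 1,000-fold difference, so the quoted figure implies a steeper dependence on the number of points.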