Somewhere, buried deep inside mountains of information, awaits the human dimension of data. It’s the small subset of material that, when properly selected, sheds light on something important, such as public policy or DNA sequencing.
This is the scientific territory where Yihong Wu, a Yale assistant professor of statistics and data science, has set up shop. He’s made it his mission to find communities and networks within high-dimensional data.
"All scientific fields deal with data, and more of it pours in all the time. But it’s not easy to make sense of it unless you do it in a principled, grounded way," Wu says.
Wu is part of a wave of new faculty joining the Department of Statistics and Data Science, as Yale continues to weave data science into the fabric of campus research in all disciplines. The University Science Strategy Committee recently named data science as a top priority and recommended that Yale invest in a university-wide initiative to integrate data science and mathematical modeling research across campus.
Wu joined the Yale faculty in 2016. Earlier this year, he earned a prestigious Sloan Research Fellowship, an award aimed at helping promising, early-career scientists.
"Yihong’s work lies at the intersection of high-dimensional statistics, information theory, and computer science," said Harrison Zhou, professor and chair of the Department of Statistics and Data Science. "He has made fundamental contributions in the important problem of estimating the number of unseen symbols in a population."
Researchers have been working on this for generations, Zhou noted. In 1943, Ronald Fisher, Alexander Steven Corbet, and Carrington Williams wrote a seminal study about estimating species diversity, based on moth and butterfly collections; in 1976, Bradley Efron and Ronald Thisted came up with an estimate of William Shakespeare’s vocabulary based on a dataset of his recorded works.
In today’s research, in virtually every discipline, there is the variable of volume: a flood of data that streams in continuously. This material offers a wealth of possibilities, but it also poses problems. How do you store the information? How do you sift through it in ways that are financially feasible and can be done in a reasonable amount of time?
Wu tries to answer these questions with two guiding ideas in mind: design algorithms whose results come with provable guarantees, and get those results quickly.
"The starting point is always a good statistical model on which we can determine the optimal procedure," Wu said. "However, due to the combinatorial nature of the problem, it might be computationally expensive to solve."
The problems run the gamut of academic, business, and social inquiry. For instance, Wu might be looking at data about protein interactions in the human body in order to select which proteins can help create new medicines. Or he might be analyzing ways to improve online social networks by incorporating certain user data into the decision-making process.
One approach to such challenges, he said, is to "relax" the problem. Wu and his collaborators sometimes use optimization techniques called convex relaxations to solve a relaxed version of the original problem. In other situations, Wu uses "belief propagation," an iterative algorithm normally used in statistical physics that essentially passes messages back and forth to fill in gaps in information.
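To make the message-passing idea concrete, here is a minimal Python sketch of belief propagation on the simplest possible network, a chain of nodes. It is an illustration only, not code from Wu's research: the function name `chain_marginals`, the two-state setup, and the 0.9/0.1 agreement weights are all assumptions chosen for the example. Each node sends a message to its neighbor summarizing what it has learned so far, and a node with no observation of its own uses the incoming messages to fill in the gap.

```python
def chain_marginals(unary, pairwise):
    """Sum-product belief propagation on a chain (hypothetical helper).

    unary[i][s]    : local evidence that node i is in state s
    pairwise[a][b] : compatibility of neighboring states a and b
    Returns one normalized belief (list of probabilities) per node.
    """
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # message reaching node i from its left
    bwd = [[1.0] * k for _ in range(n)]  # message reaching node i from its right

    # Left-to-right pass: each node forwards its evidence to the next one.
    for i in range(1, n):
        msg = [sum(fwd[i - 1][a] * unary[i - 1][a] * pairwise[a][b]
                   for a in range(k)) for b in range(k)]
        total = sum(msg)
        fwd[i] = [m / total for m in msg]

    # Right-to-left pass: the same thing in the other direction.
    for i in range(n - 2, -1, -1):
        msg = [sum(bwd[i + 1][b] * unary[i + 1][b] * pairwise[a][b]
                   for b in range(k)) for a in range(k)]
        total = sum(msg)
        bwd[i] = [m / total for m in msg]

    # Combine local evidence with both incoming messages.
    beliefs = []
    for i in range(n):
        raw = [fwd[i][s] * unary[i][s] * bwd[i][s] for s in range(k)]
        total = sum(raw)
        beliefs.append([r / total for r in raw])
    return beliefs


# Toy data: the two end nodes are observed in state 0, the middle one is missing.
observed_0 = [1.0, 0.0]
missing = [1.0, 1.0]
agree = [[0.9, 0.1], [0.1, 0.9]]  # neighbors tend to share a state
beliefs = chain_marginals([observed_0, missing, observed_0], agree)
print(beliefs[1])  # belief for the unobserved middle node
```

On a chain, two passes are enough for the beliefs to be exact. On networks with loops, the same message updates are repeated iteratively and generally give approximate answers.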
"In my research, I focus on both the theoretical and the algorithmic aspects of statistical problems," Wu said. "For me, good research is achieved by finding methodologies that are theoretically grounded and computationally efficient. Equally important to me is proving the statistical optimality of the methods I propose."
A good example of this is Wu’s work that revisited the classic problem of predicting the number of unseen species based on a collection of samples. Published in the Proceedings of the National Academy of Sciences, the research broke new ground in making such estimates.
Previous studies had shown how to estimate the number of species for a population more than twice the size of the sample, but without a provable guarantee of accuracy. Wu and collaborators Alon Orlitsky and Ananda Theertha Suresh not only provided that provable guarantee, but they also expanded the size of the population that could be examined.
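For a sense of what such an estimate looks like in practice, the Python sketch below implements the classical Good-Toulmin estimator, a long-standing baseline for this problem. It is not the refined method from the PNAS paper, and the function name `good_toulmin` and the toy sample are illustrative assumptions. Given a sample, it predicts how many new species would turn up if a further t times as many observations were collected, using only how many species were seen once, twice, three times, and so on.

```python
from collections import Counter


def good_toulmin(sample, t):
    """Classical Good-Toulmin estimate of the number of new species expected
    in t * len(sample) further observations (illustrative helper).

    The estimate is -sum over i of (-t)**i * phi_i, where phi_i is the
    number of species seen exactly i times in the sample.
    """
    times_seen = Counter(sample)                 # species -> occurrences
    prevalence = Counter(times_seen.values())    # i -> number of species seen i times
    return -sum((-t) ** i * phi for i, phi in prevalence.items())


# Toy sample: "b" and "d" seen once, "a" twice, "c" three times.
sample = ["a", "a", "b", "c", "c", "c", "d"]
print(good_toulmin(sample, 0.5))  # new species expected in a half-size follow-up
print(good_toulmin(sample, 1))    # new species expected in an equal-size follow-up
```

The alternating powers of t are the reason this baseline is usually trusted only for t up to 1, that is, for a population up to twice the sample size. Beyond that point the terms grow rapidly and the estimate becomes unstable, which is the regime the provable guarantees described above are meant to address.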
The possibilities for Wu’s research are equally expansive. Finding communities within data will have implications for looking at voting records, DNA chains, the blogosphere, traffic patterns, and an array of other data. Wu said the task requires expertise in both theory and practice, which is why he came to Yale.
"The main attraction to me was the people we have here," he said. "There is a very common interest in foundational studies, but with a firm focus on practical applications. You want your research to be useful, to have impact."