Research Assistant Professor
I am a Research Assistant Professor in the Department of Computer Science and Engineering at the University of Notre Dame. My research interests focus on the interdisciplinary applications of data mining and machine learning to large and imbalanced datasets.
- 384L Nieuwland Science Hall
Notre Dame, IN 46556
A widely used measure of scientific impact is citations. However, due to their heavy-tailed distribution, citations are fundamentally difficult to predict. Instead, to characterize scientific impact, we address two analogous questions asked by many scientific researchers: "How will my h-index evolve over time, and which of my previously or newly published papers will contribute to it?" To answer these questions, we perform two related tasks. First, we develop a model to predict authors' future h-indices based on their current scientific impact. Second, we examine the factors that drive papers, whether previously or newly published, to increase their authors' predicted future h-indices. By leveraging relevant factors, we can predict an author's h-index in five years with an R² value of 0.92 and whether a previously (newly) published paper will contribute to this future h-index with an F1 score of 0.99 (0.77). We find that topical authority and publication venue are crucial to these effective predictions, while topic popularity is surprisingly inconsequential. Further, we develop an online tool that allows users to generate informed h-index predictions. Our work demonstrates the predictability of scientific impact, and can help researchers to effectively leverage their scholarly position of "standing on the shoulders of giants."
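For readers unfamiliar with the measure, the h-index itself is simple to compute: it is the largest h such that h of an author's papers have at least h citations each. The sketch below illustrates only that definition, not the predictive model described above. The function names are hypothetical, and treating a paper as "contributing" when it has at least h citations is an assumption made for illustration.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def contributing_papers(citations):
    """Return indices of papers with at least h citations.

    Assumption for illustration: a paper "contributes" to the h-index
    when its citation count is at least the author's current h.
    """
    h = h_index(citations)
    if h == 0:
        return []
    return [i for i, cites in enumerate(citations) if cites >= h]


if __name__ == "__main__":
    counts = [10, 8, 5, 4, 3]  # made-up citation counts
    print(h_index(counts))
    print(contributing_papers(counts))
```

With the made-up counts above, four papers have at least four citations each, so the fifth paper would need more citations before it could raise the index.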
Undersampling is a popular technique for unbalanced datasets to reduce the skew in class distributions. However, it is well known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between conditional probability in the presence and absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts the classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.
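The relationship between the two posteriors can be sketched with Bayes' rule. The example assumes that only the negative (majority) class is undersampled, uniformly at random, with each negative kept with probability beta; the symbol beta and the function names are introduced here for illustration and may not match the paper's notation. Under that assumption the biased posterior is p_s = p / (p + beta * (1 - p)), which can be inverted to recover p.

```python
def biased_posterior(p, beta):
    """Posterior after undersampling, given the true posterior p.

    Assumes every positive is kept and each negative is kept with
    probability beta, independently of its features.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return p / (p + beta * (1.0 - p))


def corrected_posterior(p_s, beta):
    """Invert the bias: recover the true posterior from the biased one."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return beta * p_s / (beta * p_s - p_s + 1.0)


if __name__ == "__main__":
    beta = 0.2  # hypothetical: one in five negatives is kept
    for p in (0.01, 0.10, 0.50):
        p_s = biased_posterior(p, beta)
        print(p, round(p_s, 4), round(corrected_posterior(p_s, beta), 4))
```

Both mappings are strictly increasing, which is consistent with the observation that the ranking is unaffected while calibration and the appropriate decision threshold are.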
Some students, for a variety of reasons, struggle to complete high school on time. To address this problem, school districts across the U.S. use intervention programs to help struggling students get back on track academically. Yet in order to best apply those programs, schools need to identify off-track students as early as possible and enroll them in the most appropriate intervention. Unfortunately, identifying and prioritizing students in need of intervention remains a challenging task. This paper describes work that builds on current systems by using advanced data science methods to produce an extensible and scalable predictive framework for providing partner U.S. public school districts with individual early warning indicator systems. Our framework employs machine learning techniques to identify struggling students and describe features that are useful for this task, evaluating these techniques using metrics important to school administrators. By doing so, our framework, developed with the common need of several school districts in mind, provides a common set of tools for identifying struggling students and the factors associated with their struggles. Further, by integrating data from disparate districts into a common system, our framework enables cross-district analyses to investigate common early warning indicators not just within a single school or district, but across the U.S. and beyond.
Collaboration is an integral element of the scientific process that often leads to findings with significant impact. While extensive efforts have been devoted to quantifying and predicting research impact, the question of how collaborative behavior influences scientific impact remains unaddressed. In this work, we study the interplay between scientists' collaboration signatures and their scientific impact. As the basis of our study, we employ an ArnetMiner dataset with more than 1.7 million authors and 2 million papers spanning over 60 years. We formally define a scientist’s collaboration signature as the distribution of collaboration strengths with each collaborator in his or her academic ego network, which is quantified by four measures: sociability, dependence, diversity, and self-collaboration. We then demonstrate that the collaboration signature allows us to effectively distinguish between researchers with dissimilar levels of scientific impact. We also discover that, even from the early stages of one’s research career, a scientist’s collaboration signature can help to reveal his or her future scientific impact. Finally, we find that as a representative group of outstanding computer scientists, Turing Award winners collectively produce distinctive collaboration signatures throughout the entirety of their careers. Our conclusions on the relationship between collaboration signatures and scientific impact give rise to important implications for researchers who wish to expand their scientific impact and more effectively stand on the shoulders of "collaborators."
The deployment of classification models is an integral component of many modern data mining and machine learning applications. A typical classification model is built with the tacit assumption that the deployment scenario by which it is evaluated is fixed and fully characterized. Yet, in the practical deployment of classification methods, important aspects of the application environment, such as the misclassification costs, may be uncertain during model building. Moreover, a single classification model may be applied in several different deployment scenarios. In this work, we propose a method to optimize a model for uncertain deployment scenarios. We begin by deriving a relationship between two evaluation measures, H measure and cost curves, that may be used to address uncertainty in classifier performance. We show that when uncertainty in classifier performance is modeled as a probabilistic belief that is a function of this underlying relationship, a natural definition of risk emerges for both classifiers and instances. We then leverage this notion of risk to develop a boosting-based algorithm—which we call RiskBoost—that directly mitigates classifier risk, and we demonstrate that it outperforms AdaBoost on a diverse selection of datasets.
Scientific impact plays a central role in the evaluation of the output of scholars, departments, and institutions. A widely used measure of scientific impact is citations, with a growing body of literature focused on predicting the number of citations obtained by any given publication. The effectiveness of such predictions, however, is fundamentally limited by the power-law distribution of citations, whereby publications with few citations are extremely common and publications with many citations are relatively rare. Given this limitation, in this work we instead address a related question asked by many academic researchers in the course of writing a paper, namely: "Will this paper increase my h-index?" Using a real academic dataset with over 1.7 million authors, 2 million papers, and 8 million citation relationships from the premier online academic service ArnetMiner, we formalize a novel scientific impact prediction problem to examine several factors that can drive a paper to increase the primary author's h-index. We find that the researcher’s authority on the publication topic and the venue in which the paper is published are crucial factors to the increase of the primary author's h-index, while the topic popularity and the co-authors' h-indices are of surprisingly little relevance. By leveraging relevant factors, we find a greater than 87.5% potential predictability for whether a paper will contribute to an author's h-index within five years. As a further experiment, we generate a self-prediction for this paper, estimating that there is a 76% probability that it will contribute to the h-index of the co-author with the highest current h-index in five years. We conclude that our findings on the quantification of scientific impact can help researchers to expand their influence and more effectively leverage their position of "standing on the shoulders of giants."
Hellinger Distance Decision Trees (HDDTs) have previously been used for static datasets with skewed distributions. In unbalanced data streams, state-of-the-art techniques use instance propagation and standard decision trees (e.g., C4.5) to cope with the class imbalance problem. However, it is not always possible to revisit or store old instances of a stream. In this paper, we show how HDDTs can be successfully applied to unbalanced and evolving stream data. Using HDDTs allows us to remove instance propagation between batches, with several benefits: i) improved predictive accuracy, ii) speed, and iii) a single pass through the data. We use a Hellinger-weighted ensemble of HDDTs to combat concept drift and increase the accuracy of single classifiers. We test our framework on several streaming datasets with unbalanced classes and concept drift.
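As background, the splitting criterion commonly associated with HDDT scores a candidate split by the Hellinger distance between the two classes' distributions over the resulting branches. The sketch below shows that criterion for a two-class problem only; the function name and the branch-count input format are assumptions made for illustration, and the streaming ensemble described above is not implemented here.

```python
import math


def hellinger_split_value(pos_counts, neg_counts):
    """Hellinger distance between class distributions over split branches.

    pos_counts[j] and neg_counts[j] are the numbers of positive and
    negative instances that fall into branch j of a candidate split.
    """
    if len(pos_counts) != len(neg_counts):
        raise ValueError("both classes need one count per branch")
    n_pos = sum(pos_counts)
    n_neg = sum(neg_counts)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    total = 0.0
    for p, n in zip(pos_counts, neg_counts):
        total += (math.sqrt(p / n_pos) - math.sqrt(n / n_neg)) ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    # Made-up branch counts: the same split shape at two levels of skew.
    print(hellinger_split_value([8, 2], [100, 400]))
    print(hellinger_split_value([8, 2], [1000, 4000]))
```

Because each class is normalized by its own size, the score depends on how differently the classes spread across branches and not on the class ratio, which is the property that makes the criterion attractive for skewed data.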
The concept of a negative class does not apply to many problems for which classification is increasingly utilized. In this study we investigate the reliability of evaluation metrics when the negative class contains an unknown proportion of mislabeled positive class instances. We examine how evaluation metrics can inform us about potential systematic biases in the data. We provide a motivating case study and a general framework for approaching evaluation when the negative class contains mislabeled positive class instances. We show that the behavior of evaluation metrics is unstable in the presence of uncertainty in class labels and that the stability of evaluation metrics depends on the kind of bias in the data. Finally, we show that the type and amount of bias present in data can have a significant effect on the ranking of evaluation metrics and the degree to which they over- or underestimate the true performance of classifiers.
Predicting the distributions of species is central to a variety of applications in ecology and conservation biology. With increasing interest in using electronic occurrence records, many modeling techniques have been developed to utilize this data and compute the potential distribution of species as a proxy for actual observations. As the actual observations are typically overwhelmed by non-occurrences, we approach the modeling of species’ distributions with a focus on the problem of class imbalance. Our analysis includes the evaluation of several machine learning methods that have been shown to address the problems of class imbalance, but which have rarely or never been applied to the domain of species distribution modeling. Evaluation of these methods includes the use of the area under the precision-recall curve (AUPR), which can supplement other metrics to provide a more informative assessment of model utility under conditions of class imbalance. Our analysis concludes that emphasizing techniques that specifically address the problem of class imbalance can provide AUROC and AUPR results competitive with traditional species distribution models.
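To make the AUPR metric concrete, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. It is an illustration of the metric only, not the evaluation code used in the study. The function name is hypothetical, labels are assumed to be 0 or 1, and scores are assumed to be distinct so that tie handling can be ignored.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Averages the precision measured at the rank of each positive instance
    when instances are sorted by descending score.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank
    return total / n_pos


if __name__ == "__main__":
    # Made-up example: two occurrences among five candidate sites.
    labels = [1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    print(average_precision(labels, scores))
```

Unlike AUROC, this quantity does not reward a model for the many easy non-occurrences, which is one reason it can be more informative when occurrences are rare.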
An underlying assumption of biomedical informatics is that decisions can be more informed when professionals are assisted by analytical systems. For this purpose, we propose ALIVE, a multi-relational link prediction and visualization environment for the healthcare domain. ALIVE combines novel link prediction methods with a simple user interface and intuitive visualization of data to enhance the decision-making process for healthcare professionals. It also includes a novel link prediction algorithm, MRPF, which outperforms many comparable algorithms on multiple networks in the biomedical domain. ALIVE is one of the first attempts to provide an analytical and visual framework for healthcare analytics, promoting collaboration and sharing of data through ease of use and potential extensibility. We encourage the development of similar tools, which can assist in facilitating successful sharing, collaboration, and a vibrant online community.
Dept. of Computer Science and Engineering | Notre Dame
My research focused on building machine learning models to solve complex longitudinal problems with applications to areas as diverse as ecological informatics, healthcare analytics, scientific impact prediction, and predictive learning analytics. Specific contributions include:
- Created enhancements to algorithms for imbalanced data, with focus on two-class problems.
- Employed a novel algorithm for learning species distributions, drawing from techniques for class imbalance to meet and exceed the performance of the state of the art.
- Analyzed academic social networks to identify key publication and collaboration signatures predictive of scientific impact, and used the insights to develop an impact prediction tool.
- Collaborated in the creation and evaluation of a novel multi-relational link prediction algorithm to predict drug-disease interactions, and developed an interactive web-based framework to visualize the predictions generated by the method.
Data Science for Social Good Fellowship | University of Chicago
Data Science Fellow
- Worked as part of a small, interdisciplinary team to predict on-time high school graduation from student data provided by several partnering U.S. public school districts.
- Used PostgreSQL and standard Python tools and libraries to develop a data-driven predictive model for identifying academically at-risk students; if implemented, the model has the potential to increase identification of at-risk students by up to 80%.
Machine Learning (CSE 40625)
Data Science Online (CSE 44648)
- Instructed the inaugural course on data science, offered online.
- Helped develop the course content and held regular live teaching sessions.
Data Mining (CSE 40647)
- Co-instructed the annual offering on data mining to senior-level undergraduates and graduate students.
- Revised the syllabus to incorporate Python programming into the course.
- Integrated programmatic examples into lectures via IPython Notebook.
- Developed and delivered presentation material.
Ethics and Professional Issues (CSE 40175)
Discrete Mathematics (CSE 20110)
Ph.D. in Computer Science and Engineering
University of Notre Dame
- Dissertation Title: Data Science for Imbalanced Data: Methods and Applications
- Advisor: Prof. Nitesh V. Chawla
B.S. in Computer Science, Summa Cum Laude
University of Illinois at Springfield
St. Olaf College
Feel free to contact me
- 384L Nieuwland Science Hall
University of Notre Dame
Notre Dame, IN 46556