Reid Johnson

Research Assistant Professor

I am a Research Assistant Professor in the Department of Computer Science and Engineering at the University of Notre Dame. My research interests focus on the interdisciplinary applications of data mining and machine learning to large and imbalanced datasets.

384L Nieuwland Science Hall
Notre Dame, IN 46556


A widely used measure of scientific impact is citations. However, due to their heavy-tailed distribution, citations are fundamentally difficult to predict. Instead, to characterize scientific impact, we address two analogous questions asked by many scientific researchers: "How will my h-index evolve over time, and which of my previously or newly published papers will contribute to it?" To answer these questions, we perform two related tasks. First, we develop a model to predict authors' future h-indices based on their current scientific impact. Second, we examine the factors that drive papers, either previously or newly published, to increase their authors' predicted future h-indices. By leveraging relevant factors, we can predict an author's h-index in five years with an R² value of 0.92 and whether a previously (newly) published paper will contribute to this future h-index with an F1 score of 0.99 (0.77). We find that topical authority and publication venue are crucial to these effective predictions, while topic popularity is surprisingly inconsequential. Further, we develop an online tool that allows users to generate informed h-index predictions. Our work demonstrates the predictability of scientific impact, and can help researchers to effectively leverage their scholarly position of "standing on the shoulders of giants."
Yuxiao Dong, Reid A. Johnson, Nitesh V. Chawla. Can Scientific Impact Be Predicted? IEEE Transactions on Big Data (TBD), 2016. Paper
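For readers unfamiliar with the h-index referred to in the abstract above, the following is a minimal sketch of how it is computed from a list of per-paper citation counts. The function names are hypothetical, and the `contributes_to_h_index` check (a paper "contributes" when its citation count is at least the author's h-index) is an assumed reading of the abstract, not code from the paper.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def contributes_to_h_index(citations, paper_idx):
    """Assumed definition: a paper contributes to the h-index when its
    citation count is at least the author's h-index."""
    return citations[paper_idx] >= h_index(citations)


if __name__ == "__main__":
    counts = [25, 8, 5, 3, 3]  # hypothetical citation counts
    print(h_index(counts))
    print([contributes_to_h_index(counts, i) for i in range(len(counts))])
```

Predicting whether a paper will contribute then becomes a binary classification problem over the future values of these counts.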
Undersampling is a popular technique for unbalanced datasets to reduce the skew in class distributions. However, it is well-known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between conditional probability in the presence and absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts the classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.
Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. Proceedings of the 6th IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 2015. Paper
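The bias described in the abstract above can be illustrated with a short sketch. It assumes that every positive instance is kept and each negative instance is kept independently with probability `beta`; under that assumption, Bayes' rule gives a closed-form map between the posterior learned on the undersampled data (`p_s`) and the posterior on the original data (`p`). The function names are hypothetical and this is an illustration of the idea, not the paper's code.

```python
def bias_posterior(p, beta):
    """Posterior on undersampled data given the unbiased posterior p,
    assuming negatives are retained with probability beta (0 < beta <= 1)."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return p / (p + beta * (1.0 - p))


def correct_posterior(p_s, beta):
    """Invert the bias: recover the unbiased posterior from p_s."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return beta * p_s / (beta * p_s - p_s + 1.0)


def biased_threshold(tau, beta):
    """Threshold on p_s that is equivalent to threshold tau on p.
    The map is monotone, so the ranking of instances is unchanged."""
    return bias_posterior(tau, beta)


if __name__ == "__main__":
    beta = 0.1  # hypothetical: keep 10% of the negatives
    p_s = bias_posterior(0.1, beta)
    print(p_s, correct_posterior(p_s, beta), biased_threshold(0.5, beta))
```

Because the map is monotone, ranking-based metrics are unaffected, while any fixed threshold or calibration measure has to be adjusted.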
Some students, for a variety of reasons, struggle to complete high school on time. To address this problem, school districts across the U.S. use intervention programs to help struggling students get back on track academically. Yet in order to best apply those programs, schools need to identify off-track students as early as possible and enroll them in the most appropriate intervention. Unfortunately, identifying and prioritizing students in need of intervention remains a challenging task. This paper describes work that builds on current systems by using advanced data science methods to produce an extensible and scalable predictive framework for providing partner U.S. public school districts with individual early warning indicator systems. Our framework employs machine learning techniques to identify struggling students and describe features that are useful for this task, evaluating these techniques using metrics important to school administrators. By doing so, our framework, developed with the common need of several school districts in mind, provides a common set of tools for identifying struggling students and the factors associated with their struggles. Further, by integrating data from disparate districts into a common system, our framework enables cross-district analyses to investigate common early warning indicators not just within a single school or district, but across the U.S. and beyond.
Reid A. Johnson, Ruobin Gong, Siobhan Greatorex-Voith, Anushka Anand, Alan Fritzler. A Data-Driven Framework for Identifying High School Students at Risk of Not Graduating On Time. Bloomberg Data for Good Exchange, 2015. Paper Presentation
Collaboration is an integral element of the scientific process that often leads to findings with significant impact. While extensive efforts have been devoted to quantifying and predicting research impact, the question of how collaborative behavior influences scientific impact remains unaddressed. In this work, we study the interplay between scientists' collaboration signatures and their scientific impact. As the basis of our study, we employ an ArnetMiner dataset with more than 1.7 million authors and 2 million papers spanning over 60 years. We formally define a scientist’s collaboration signature as the distribution of collaboration strengths with each collaborator in his or her academic ego network, which is quantified by four measures: sociability, dependence, diversity, and self-collaboration. We then demonstrate that the collaboration signature allows us to effectively distinguish between researchers with dissimilar levels of scientific impact. We also discover that, even from the early stages of one’s research career, a scientist’s collaboration signature can help to reveal his or her future scientific impact. Finally, we find that as a representative group of outstanding computer scientists, Turing Award winners collectively produce distinctive collaboration signatures throughout the entirety of their careers. Our conclusions on the relationship between collaboration signatures and scientific impact give rise to important implications for researchers who wish to expand their scientific impact and more effectively stand on the shoulders of "collaborators."
Yuxiao Dong, Reid A. Johnson, Yang Yang, Nitesh V. Chawla. Collaboration Signatures Reveal Scientific Impact. Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2015. Paper Presentation
The deployment of classification models is an integral component of many modern data mining and machine learning applications. A typical classification model is built with the tacit assumption that the deployment scenario by which it is evaluated is fixed and fully characterized. Yet, in the practical deployment of classification methods, important aspects of the application environment, such as the misclassification costs, may be uncertain during model building. Moreover, a single classification model may be applied in several different deployment scenarios. In this work, we propose a method to optimize a model for uncertain deployment scenarios. We begin by deriving a relationship between two evaluation measures, H measure and cost curves, that may be used to address uncertainty in classifier performance. We show that when uncertainty in classifier performance is modeled as a probabilistic belief that is a function of this underlying relationship, a natural definition of risk emerges for both classifiers and instances. We then leverage this notion of risk to develop a boosting-based algorithm—which we call RiskBoost—that directly mitigates classifier risk, and we demonstrate that it outperforms AdaBoost on a diverse selection of datasets.
Reid A. Johnson, Troy Raeder, Nitesh V. Chawla. Optimizing Classifiers for Hypothetical Scenarios. Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2015. Paper Presentation
Scientific impact plays a central role in the evaluation of the output of scholars, departments, and institutions. A widely used measure of scientific impact is citations, with a growing body of literature focused on predicting the number of citations obtained by any given publication. The effectiveness of such predictions, however, is fundamentally limited by the power-law distribution of citations, whereby publications with few citations are extremely common and publications with many citations are relatively rare. Given this limitation, in this work we instead address a related question asked by many academic researchers in the course of writing a paper, namely: "Will this paper increase my h-index?" Using a real academic dataset with over 1.7 million authors, 2 million papers, and 8 million citation relationships from the premier online academic service ArnetMiner, we formalize a novel scientific impact prediction problem to examine several factors that can drive a paper to increase the primary author's h-index. We find that the researcher’s authority on the publication topic and the venue in which the paper is published are crucial factors to the increase of the primary author's h-index, while the topic popularity and the co-authors' h-indices are of surprisingly little relevance. By leveraging relevant factors, we find a greater than 87.5% potential predictability for whether a paper will contribute to an author's h-index within five years. As a further experiment, we generate a self-prediction for this paper, estimating that there is a 76% probability that it will contribute to the h-index of the co-author with the highest current h-index in five years. We conclude that our findings on the quantification of scientific impact can help researchers to expand their influence and more effectively leverage their position of "standing on the shoulders of giants."
Yuxiao Dong, Reid A. Johnson, and Nitesh V. Chawla. Will This Paper Increase Your h-Index? Scientific Impact Prediction. Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM), 2015. Best Paper Award Nomination. Paper Poster Presentation Data
Hellinger Distance Decision Trees (HDDT) have previously been used for static datasets with skewed distributions. In unbalanced data streams, state-of-the-art techniques use instance propagation and standard decision trees (e.g., C4.5) to cope with the class imbalance problem. However, it is not always possible to store or revisit old instances of a stream. In this paper, we show how HDDT can be successfully applied to unbalanced and evolving data streams. Using HDDT allows us to remove instance propagation between batches, with several benefits: (i) improved predictive accuracy, (ii) speed, and (iii) a single pass through the data. We use a Hellinger-weighted ensemble of HDDTs to combat concept drift and increase the accuracy of single classifiers. We test our framework on several streaming datasets with unbalanced classes and concept drift.
Andrea Dal Pozzolo, Reid A. Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V. Chawla, and Gianluca Bontempi. Using HDDT to Avoid Instances Propagation in Unbalanced and Evolving Data Streams. Proceedings of the 24th International Joint Conference on Neural Networks (IJCNN), 2014. Paper Presentation
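As background for the abstract above, the split criterion that gives HDDT its name can be sketched as follows. The sketch computes the Hellinger distance between the positive-class and negative-class distributions over the branches of a candidate split. The function name and the count-based interface are hypothetical, and the sketch covers only the split value, not tree construction or the streaming ensemble.

```python
import math


def hellinger_split_value(pos_counts, neg_counts):
    """Hellinger distance between the class-conditional branch distributions.

    pos_counts[j] and neg_counts[j] are the numbers of positive and negative
    instances that fall into branch j of a candidate split.
    """
    if len(pos_counts) != len(neg_counts):
        raise ValueError("both classes must be counted over the same branches")
    total_pos = sum(pos_counts)
    total_neg = sum(neg_counts)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("each class needs at least one instance")
    acc = 0.0
    for p, n in zip(pos_counts, neg_counts):
        acc += (math.sqrt(p / total_pos) - math.sqrt(n / total_neg)) ** 2
    return math.sqrt(acc)


if __name__ == "__main__":
    # Hypothetical binary split on a skewed dataset (10 positives, 90 negatives).
    print(hellinger_split_value([8, 2], [10, 80]))
```

Each class is normalized by its own size, so the value depends on how the split separates the classes rather than on how rare the minority class is.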
The concept of a negative class does not apply to many problems for which classification is increasingly utilized. In this study we investigate the reliability of evaluation metrics when the negative class contains an unknown proportion of mislabeled positive class instances. We examine how evaluation metrics can inform us about potential systematic biases in the data. We provide a motivating case study and a general framework for approaching evaluation when the negative class contains mislabeled positive class instances. We show that the behavior of evaluation metrics is unstable in the presence of uncertainty in class labels and that the stability of evaluation metrics depends on the kind of bias in the data. Finally, we show that the type and amount of bias present in data can have a significant effect on the ranking of evaluation metrics and the degree to which they over- or underestimate the true performance of classifiers.
Andrew K. Rider, Reid A. Johnson, Darcy A. Davis, T. Ryan Hoens, and Nitesh V. Chawla. Classifier Evaluation with Missing Negative Class Labels. Proceedings of the 12th International Conference on Intelligent Data Analysis (IDA), 2013. Paper Presentation
Predicting the distributions of species is central to a variety of applications in ecology and conservation biology. With increasing interest in using electronic occurrence records, many modeling techniques have been developed to utilize this data and compute the potential distribution of species as a proxy for actual observations. As the actual observations are typically overwhelmed by non-occurrences, we approach the modeling of species’ distributions with a focus on the problem of class imbalance. Our analysis includes the evaluation of several machine learning methods that have been shown to address the problems of class imbalance, but which have rarely or never been applied to the domain of species distribution modeling. Evaluation of these methods includes the use of the area under the precision-recall curve (AUPR), which can supplement other metrics to provide a more informative assessment of model utility under conditions of class imbalance. Our analysis concludes that emphasizing techniques that specifically address the problem of class imbalance can provide AUROC and AUPR results competitive with traditional species distribution models.
Reid A. Johnson, Nitesh V. Chawla, and Jessica J. Hellmann. Species Distribution Modeling and Prediction: A Class Imbalance Problem. Proceedings of the Conference on Intelligent Data Understanding (CIDU), 2012. Paper Presentation
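The AUPR metric mentioned in the abstract above can be approximated in several ways. The sketch below uses average precision, one common estimator of the area under the precision-recall curve. It assumes binary labels (1 for an occurrence, 0 otherwise) and distinct scores. The function name is hypothetical and this is not the evaluation code used in the paper.

```python
def average_precision(labels, scores):
    """Average precision: the mean of the precision values measured at the
    rank of each positive instance, with instances ranked by descending score."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


if __name__ == "__main__":
    # Hypothetical scores for four sites, two of which are true occurrences.
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))
```

Unlike AUROC, the baseline for this measure depends on the share of positive instances, which is one reason it is informative when occurrences are rare.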
An underlying assumption of biomedical informatics is that decisions can be more informed when professionals are assisted by analytical systems. For this purpose, we propose ALIVE, a multi-relational link prediction and visualization environment for the healthcare domain. ALIVE combines novel link prediction methods with a simple user interface and intuitive visualization of data to enhance the decision-making process for healthcare professionals. It also includes a novel link prediction algorithm, MRPF, which outperforms many comparable algorithms on multiple networks in the biomedical domain. ALIVE is one of the first attempts to provide an analytical and visual framework for healthcare analytics, promoting collaboration and sharing of data through ease of use and potential extensibility. We encourage the development of similar tools, which can assist in facilitating successful sharing, collaboration, and a vibrant online community.
Reid A. Johnson, Yang Yang, Everaldo Aguiar, Andrew K. Rider, and Nitesh V. Chawla. ALIVE: A Multi-Relational Link Prediction Environment for the Healthcare Domain. Proceedings of the 3rd Workshop on Data Mining for Healthcare Management at the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining (DMHM-PAKDD), 2012. Paper Presentation



Dept. of Computer Science and Engineering | Notre Dame

Research Assistant

My research focused on building machine learning models to solve complex longitudinal problems with applications to areas as diverse as ecological informatics, healthcare analytics, scientific impact prediction, and predictive learning analytics. Specific contributions include:

  • Created enhancements to algorithms for imbalanced data, with focus on two-class problems.
  • Employed a novel algorithm for learning species distributions, drawing from techniques for class imbalance to meet and exceed the performance of the state of the art.
  • Analyzed academic social networks to identify key publication and collaboration signatures predictive of scientific impact, and used the insights to develop an impact prediction tool.
  • Collaborated in the creation and evaluation of a novel multi-relational link prediction algorithm to predict drug-disease interactions, and developed an interactive web-based framework to visualize the predictions generated by the method.

Summer 2015

Data Science for Social Good Fellowship | University of Chicago

Data Science Fellow

  • Worked as part of a small, interdisciplinary team to predict on-time high school graduation from student data provided by several partnering U.S. public school districts.
  • Used PostgreSQL and standard Python tools and libraries to develop a data-driven predictive model for identifying academically at-risk students; if implemented, the model has the potential to increase identification of such students by up to 80%.


Course Instructor

Spring 2017

Machine Learning (CSE 40625)

Summer 2016

Data Science Online (CSE 44648)

  • Instructed the inaugural course on data science, offered online.
  • Helped develop the course content and held regular live teaching sessions.

Course Co-Instructor

Spring 2014

Data Mining (CSE 40647)

  • Co-instructed the annual offering on data mining to senior-level undergraduates and graduate students.
  • Revised the syllabus to incorporate Python programming into the course.
  • Integrated programmatic examples into lectures via IPython Notebook.
  • Developed and delivered presentation material.
[Link to course website]

Teaching Assistant

Spring 2011

Ethics and Professional Issues (CSE 40175)

Fall 2010

Discrete Mathematics (CSE 20110)



Ph.D. in Computer Science and Engineering

University of Notre Dame

  • Dissertation Title: Data Science for Imbalanced Data: Methods and Applications
  • Advisor: Prof. Nitesh V. Chawla

B.S. in Computer Science, Summa Cum Laude

University of Illinois at Springfield


St. Olaf College

Professional Skills

Proficient with Python, Java, SQL, R, TensorFlow, Weka, Tableau, PHP, JavaScript, C/C++, LaTeX.

Contact Me

Feel free to contact me

384L Nieuwland Science Hall
University of Notre Dame
Notre Dame, IN 46556
United States