Abstract
A major branch of anomaly detection methods relies on dynamic networks: raw sequential data is first converted to a series of networks, and critical change points are then identified in the evolving network structure. However, existing approaches use the first-order network (FON) to represent the underlying raw data, which may lose important higher-order sequence patterns, making higher-order anomalies undetectable in subsequent analysis. By replacing FONs with higher-order networks (HONs), we show that existing anomaly detection algorithms can better capture higher-order anomalies that may otherwise be ignored. We show that the existing HON construction algorithm cannot be used for the anomaly detection task because of its extra parameters and poor scalability; we introduce a parameter-free algorithm that constructs HONs on big data sets. Using a large-scale synthetic data set with 11 billion web clickstreams, we demonstrate how the proposed method can capture variable orders of anomalies. Using real-world taxi trajectory data, we show how the proposed method amplifies higher-order anomaly signals. Finally, we provide complexity analysis and benchmarking to show how one can incorporate higher-order dependencies with a small overhead.
Jian Xu, Mandana Saebi, Bruno Ribeiro, Lance M. Kaplan, Nitesh V. Chawla. "Detecting Anomalies in Sequential Data with Higher-order Networks."
Abstract
That any two persons are separated by a relatively small number of intermediary contacts (the "small-world" phenomenon) is a surprising but well-established regularity in human social networks. To date, network science has ignored the question of whether the small-world phenomenon manifests itself in similar ways across dyadic classes defined by individual traits, such as age or sex. To address this gap in the literature, we explore the phenomenon of "age-specific small worlds." Rather than asking whether two random individuals are separated by a small number of links, we ascertain the extent to which individuals in specific age groups live in a small world in relation to individuals in other generational classes. We use data from a large-scale mobile communication network built from billions of voice calls and short messaging events, approximating interaction patterns at a societal scale. We observe that the average global distance between any random pair of users is 9.52, corresponding to nine-and-a-half degrees of separation. More importantly, we show that there is a systematic relationship between a person's age and the average path distance connecting that person to others, with some age groups falling below this average quantity and others falling above it. Young people live in the "smallest world," being separated from other young people and their parents' generation via a comparatively small number of intermediaries. Older people live in the "least small world," being separated from their same-age peers and their younger counterparts by a relatively large number of intermediaries. Middle-aged people fall in between, being sociometrically close to both younger and older generations. However, there is no significant difference between men and women in this age effect on small-world size. In all, these results demonstrate that age-group heterogeneity of the small world can be traced to well-known social mechanisms affecting the way that age interacts with the overall volume of connectivity and the relative prevalence of kin and non-kin ties, and may have important implications for our understanding of information cascades, diffusion phenomena, and the localized spread of fads and fashions.
Yuxiao Dong, Omar Lizardo, and Nitesh V. Chawla. "Do the Young Live in a "Smaller World" Than the Old? Age-Specific Degrees of Separation in a Large-Scale Mobile Communication Network."
Abstract
It is well-known that many networks follow a power-law degree distribution; however, the factors that influence the formation of their distributions are still unclear. How can one model the connection between individual actions and network distributions? How can one explain the formation of group phenomena and their evolutionary patterns?

In this paper, we propose a unified framework, M3D, to model human dynamics in social networks from three perspectives: macro, meso, and micro. At the micro-level, we seek to capture the way in which an individual user decides whether to perform an action. At the meso-level, we study how group behavior develops and evolves over time, based on individual actions. At the macro-level, we try to understand how network distributions such as power-law (or heavy-tailed phenomena) can be explained by group behavior. We provide theoretical analysis for the proposed framework, and discuss the connection of our framework with existing work.

The framework offers a new, flexible way to explain the interplay between individual user actions and network distributions, and can benefit many applications. To model heavy-tailed distributions from partially observed individual actions and to predict the formation of group behaviors, we apply M3D to three different genres of networks: Tencent Weibo, Citation, and Flickr. We also use information-burst prediction as a particular application to quantitatively evaluate the predictive power of the proposed framework. Our results on the Weibo data indicate that M3D's prediction performance exceeds that of several alternative methods by up to 30%.
Yang Yang, Jie Tang, Yuxiao Dong, Qiaozhu Mei, Reid A. Johnson, and Nitesh V. Chawla. "Modeling the Interplay Between Individual Behavior and Network Distributions."
Abstract
Representation learning in heterogeneous graphs aims to pursue a meaningful vector representation for each node so as to facilitate downstream applications such as link prediction, personalized recommendation, and node classification. This task, however, is challenging not only because of the need to incorporate heterogeneous structural (graph) information consisting of multiple types of nodes and edges, but also because of the need to consider heterogeneous attributes or contents (e.g., text or images) associated with each node. Although a substantial amount of effort has been devoted to homogeneous (or heterogeneous) graph embedding, attributed graph embedding, and graph neural networks, few of these methods can jointly and effectively consider the heterogeneous structural (graph) information as well as the heterogeneous content information of each node. In this paper, we propose HetGNN, a heterogeneous graph neural network model, to resolve this issue. Specifically, we first introduce a random walk with restart strategy to sample a fixed-size set of strongly correlated heterogeneous neighbors for each node and group them based upon node types. Next, we design a neural network architecture with two modules to aggregate the feature information of those sampled neighboring nodes. The first module encodes "deep" feature interactions of heterogeneous contents and generates a content embedding for each node. The second module aggregates the content (attribute) embeddings of different neighboring groups (types) and further combines them by considering the impacts of different groups to obtain the ultimate node embedding. Finally, we leverage a graph context loss and a mini-batch gradient descent procedure to train the model in an end-to-end manner. Extensive experiments on several datasets demonstrate that HetGNN can outperform state-of-the-art baselines in various graph mining tasks, including link prediction, recommendation, node classification and clustering, and inductive node classification and clustering.
Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, Nitesh V. Chawla. "Heterogeneous Graph Neural Network."
25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
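The distinctive first step of HetGNN is the type-grouped neighbor sampling. The sketch below illustrates that step only, on a hypothetical toy graph: a random walk with restart from each node collects its most frequently visited neighbors, which are then grouped by node type and truncated to a fixed size per type. The aggregation modules and graph context loss are summarized in the closing comment rather than implemented.

```python
# Illustrative sketch of HetGNN-style neighbor sampling: random walk with
# restart (RWR), then group the most visited neighbors by node type.
import random
from collections import Counter, defaultdict

def rwr_sample(adj, node_type, start, walk_len=100, restart_p=0.5, top_k=3):
    """Return {type: top-k most visited neighbors of `start`} under RWR."""
    counts, current = Counter(), start
    for _ in range(walk_len):
        if random.random() < restart_p or not adj[current]:
            current = start
        else:
            current = random.choice(adj[current])
            counts[current] += 1
    grouped = defaultdict(list)
    for node, _ in counts.most_common():
        if node != start and len(grouped[node_type[node]]) < top_k:
            grouped[node_type[node]].append(node)
    return dict(grouped)

# Toy academic graph: authors (a*), papers (p*), venues (v*).
adj = {"a1": ["p1", "p2"], "a2": ["p2"], "p1": ["a1", "v1"],
       "p2": ["a1", "a2", "v1"], "v1": ["p1", "p2"]}
node_type = {"a1": "author", "a2": "author", "p1": "paper",
             "p2": "paper", "v1": "venue"}

random.seed(0)
print(rwr_sample(adj, node_type, "a1"))
# Content embeddings of each sampled group would then be aggregated per type
# and combined with learned weights, as the two modules described above do.
```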
Abstract
Understanding the affect expressed by learners is essential for enriching the learning experience in Massive Open Online Courses (MOOCs). However, online learning environments, especially MOOCs, pose several challenges in understanding the different types of affect experienced by a learner. In this paper, we define two categories of emotions: explicit emotions, collected directly from the student through self-reported surveys, and implicit emotions, inferred unobtrusively during the learning process. We also introduce positivity as a measure to study the valence reported by students chronologically, and use it to derive insights into their emotion patterns and their association with learning outcomes. We show that implicit and explicit emotions expressed by students within the context of a MOOC are independent of each other; however, they correlate better with students' behavior than their valence does.
Munira Syed, Malolan Chetlur, Shazia Afzal, G. Alex Ambros, Nitesh V. Chawla. "Implicit and Explicit Emotions in MOOCs."
Educational Data Mining (EDM)
Abstract
Identifying non-thriving students and intervening to boost them are two processes that recent literature suggests should be more tightly integrated. We perform this integration over six semesters in a First Year Experience (FYE) course with the aim of boosting student success, by using an integrated closed-loop learning analytics scheme that consists of multiple steps broken into three main phases, as follows: Architecting for Collection (steps: design, build, capture), Analyzing for Action (steps: identify, notify, boost), and Assessing for Improvement (steps: evaluate, report). We close the loop by allowing later steps to inform earlier ones in real-time during a semester and iteratively year to year, thereby improving the course from data-driven insights. This process depends on the purposeful design of an integrated learning environment that facilitates data collection, storage, and analysis. Methods for evaluating the effectiveness of our analytics-based student interventions show that our criterion for identifying non-thriving students was satisfactory and that non-thriving students demonstrated more substantial changes from mid-term to final course grades than already-thriving students. Lastly, we make a case for using early performance in the FYE as an indicator of overall performance and retention of first-year students.
Munira Syed, Trunojoyo Anggara, Alison Lanski, Xiaojing Duan, G. Alex Ambrose, Nitesh V. Chawla. "Integrated Closed-loop Learning Analytics Scheme in a First Year Experience Course."
Learning Analytics and Knowledge (LAK)
Abstract
Social networks influence health-related behavior, such as obesity and smoking. While researchers have studied social networks as a driver for diffusion of influences and behavior, it is less understood how the structure or topology of the network, in itself, impacts an individual’s health behavior and wellness state. In this paper, we investigate whether the structure or topology of a social network offers additional insight and predictability on an individual’s health and wellness. We develop a method called the Network-Driven health predictor (NetCARE) that leverages features representative of social network structure. Using a large longitudinal data set of students enrolled in the NetHealth study at the University of Notre Dame, we show that the NetCARE method improves the overall prediction performance over the baseline models—that use demographics and physical attributes—by 38%, 65%, 55%, and 54% for the wellness states—stress, happiness, positive attitude, and self-assessed health—considered in this paper.
Suwen Lin, Louis Faust, Pablo Robles-Granda, Tomasz Kajdanowicz, Nitesh V. Chawla. "Social Network Structure is Predictive of Health and Wellness."
PLOS ONE
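The general recipe described above can be sketched as follows: compute structural features from each person's position in the social network and append them to demographic features before fitting a wellness classifier. This is a minimal illustration, not the actual NetCARE feature set; the graph, demographics, and labels are hypothetical toy data.

```python
# Sketch: network-structure features concatenated with demographics.
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()                      # stand-in for the social network
nodes = list(G.nodes())

degree = nx.degree_centrality(G)
clustering = nx.clustering(G)
betweenness = nx.betweenness_centrality(G)

rng = np.random.default_rng(0)
demographics = rng.normal(size=(len(nodes), 2))  # e.g., age, BMI (toy values)
labels = rng.integers(0, 2, size=len(nodes))     # e.g., high vs. low stress (toy)

X_baseline = demographics
X_network = np.column_stack([
    demographics,
    [degree[n] for n in nodes],
    [clustering[n] for n in nodes],
    [betweenness[n] for n in nodes],
])

for name, X in [("baseline", X_baseline), ("with network features", X_network)]:
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    print(name, "training accuracy:", model.score(X, labels))
```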
Abstract
Known colloquially as the “weekend effect,” the association between weekend admissions and increased mortality within hospital settings has become a highly contested topic over the last two decades. Drawing interest from practitioners and researchers alike, sundry works have emerged arguing for and against the presence of the effect across various patient cohorts. However, it has become evident that simply studying population characteristics is insufficient for understanding how the effect manifests. Rather, to truly understand the effect, investigations into its underlying factors must be considered. As such, the work presented in this manuscript addresses this consideration by moving beyond the identification of patient cohorts to examining the role of ICU performance. Employing a comprehensive, publicly available database of electronic medical records (EMR), we began by utilizing multiple logistic regression to identify and isolate a specific cohort in which the weekend effect was present. Next, we leveraged the highly detailed nature of the EMR to evaluate ICU performance using well-established ICU quality scorecards to assess differences in clinical factors among patients admitted to an ICU on the weekend versus a weekday. Our results demonstrate the weekend effect to be most prevalent among emergency surgery patients (OR 1.53; 95% CI 1.19, 1.96), specifically those diagnosed with circulatory diseases. Differences between weekday and weekend admissions for this cohort included a variety of clinical factors such as ventilatory support and night-time discharges. This work reinforces the importance of accounting for differences in clinical factors as well as patient cohorts in studies investigating the weekend effect.
Louis Faust, Keith Feldman, Nitesh V. Chawla. "Examining the Weekend Effect Across ICU Performance Metrics."
BMC Critical Care
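As a schematic of the cohort-identification step, the sketch below fits a multiple logistic regression of in-hospital mortality on a weekend-admission indicator plus adjustment covariates and reports the adjusted odds ratio with a 95% CI. The data are simulated; the study itself used a large EMR database and a richer covariate set.

```python
# Sketch: logistic regression with an adjusted odds ratio for weekend admission.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "weekend": rng.integers(0, 2, n),
    "age": rng.normal(65, 12, n),
    "emergency_surgery": rng.integers(0, 2, n),
})
# Simulate mortality with a built-in weekend effect (+0.4 on the log-odds).
logit = -3 + 0.4 * df["weekend"] + 0.02 * (df["age"] - 65) + 0.5 * df["emergency_surgery"]
df["died"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = sm.add_constant(df[["weekend", "age", "emergency_surgery"]])
fit = sm.Logit(df["died"], X).fit(disp=False)

odds_ratio = np.exp(fit.params["weekend"])
ci_low, ci_high = np.exp(fit.conf_int().loc["weekend"])
print(f"Weekend OR {odds_ratio:.2f} (95% CI {ci_low:.2f}, {ci_high:.2f})")
```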
Abstract
The ubiquity of wearable fitness trackers offers extensive opportunities for research on personal health. However, barriers specific to these trackers, such as device abandonment and non-adherence, often lead to substantial losses in data. As such, further research into adherence behaviors may derive the insights necessary to address these challenges and lead to more effective long-term studies. This paper explores this approach, investigating the adherence behaviors of 617 college students belonging to a three-year observational study in which participants were monitored via Fitbit Charge HRs. Using these data, our objective was to assess the association between early adherence behaviors and device abandonment/long-term adherence. Adherence behavior from as early as participants' first 10 days in the study correlated with device abandonment and adherence over the next three years. Participants with unsatisfactory adherence in their first 30 days were twice as likely to abandon their devices and were, on average, 11% less adherent each month. The findings in this paper identify the stability of adherence behaviors and the feasibility of their early detection, and motivate the need to address non-adherent study participants early. Throughout these results, we discuss how the insights gathered from this work may shape the design of future long-term studies to minimize user attrition and promote prolonged engagement.
Louis Faust, Priscilla Jiménez-Pazmino, James K. Holland, Omar Lizardo, David Hachen, Nitesh V. Chawla. "What 30 Days Tells Us About 3 Years: Identifying Early Signs of User Abandonment and Non-Adherence."
Proceedings of the 13th EAI International Conference on Pervasive Computing
Abstract
Moderate-vigorous physical activity (MVPA) offers extensive health benefits but is neglected by many. As a result, a wide body of research investigating physical activity behavior change has been conducted. As many of these studies transition from paper-based methods of MVPA data collection to fitness trackers, a series of challenges arise in extracting insights from these new data. The objective of this research was to develop a framework for preprocessing and extracting MVPA trends from wearable fitness tracker data to support MVPA behavior change studies. Using heart rate data collected from fitness trackers, we propose Physical Activity Trend eXtraction (PATX), a framework that imputes missing data, recalculates personalized target heart zones, and extracts MVPA trends. We tested our framework on a dataset of 123 college study participants observed across 2 academic years (18 months) using Fitbit Charge HRs. To demonstrate the value of our framework's output in supporting MVPA behavior change studies, we applied it to 2 case studies. Among the 123 participants analyzed, PATX labeled 41 participants as experiencing a significant increase in MVPA and 44 as experiencing a significant decrease in MVPA. Our first case study was consistent with previous works investigating the associations between MVPA and mental health, whereas the second, exploring how individuals perceive their own levels of MVPA relative to their friends, led to a novel observation: individuals were less likely to notice changes in their own MVPA when close ties in their social network mimicked their changes. By providing meaningful and flexible outputs, PATX alleviates data concerns common with fitness trackers and supports MVPA behavior change studies as they shift to more objective assessments of MVPA.
Louis Faust, Cheng Wang, David Hachen, Omar Lizardo, Nitesh V. Chawla. "PATX: A Framework for Extracting Moderate-Vigorous Physical Activity Trends From Wearable Fitness Tracker Data."
JMIR mHealth and uHealth
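A simplified PATX-style pipeline could look like the following: impute short gaps in minute-level heart rate, derive a personalized MVPA threshold, count weekly MVPA minutes, and test for a linear trend. The threshold formula, gap limit, and cutoffs here are illustrative assumptions, not the paper's exact parameters, and the heart rate data are simulated.

```python
# Sketch: impute -> personalize threshold -> weekly MVPA minutes -> trend test.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
minutes = pd.date_range("2023-01-01", periods=1440 * 7 * 12, freq="min")  # 12 weeks
hr = pd.Series(75 + 15 * rng.random(len(minutes)), index=minutes)
hr[rng.random(len(hr)) < 0.05] = np.nan           # simulate device non-wear gaps

hr = hr.interpolate(limit=10)                      # impute only short gaps (<=10 min)

age, resting_hr = 20, 65
hr_reserve = (220 - age) - resting_hr
mvpa_threshold = resting_hr + 0.40 * hr_reserve    # illustrative Karvonen-style cutoff

weekly_mvpa = (hr > mvpa_threshold).resample("7D").sum()  # MVPA minutes per week

slope, intercept, r, p, stderr = stats.linregress(np.arange(len(weekly_mvpa)),
                                                  weekly_mvpa.values)
trend = "increase" if slope > 0 else "decrease"
print(f"weekly MVPA slope = {slope:.1f} min/week ({trend}), p = {p:.3f}")
```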
Abstract
Technology-based solutions have been developed to enhance the competencies of maternity health workers (MHWs) in developing countries, but there has been limited exploration of such tools in the developed world. We developed a web-based application to support the care coordination tasks of prenatal care coordinators (PNCCs) in the U.S. using a three-phase methodology. During the first phase, the following key functionalities were identified in collaboration with an expert panel: assessment, information prescription, collaborative calendar, and secure messaging. The second phase involved iterative and incremental development of the app with the expert panel, using modular design and easy-to-use interfaces. This led to the development of a password-protected PHP web application backed by a MySQL database. An outside panel consisting of sixteen PNCCs evaluated the tool and provided feedback on perceived benefits, risks, and concerns. The evaluators thought that the app could help them save resources, empower clients, identify risks, maintain continuity of care, and simplify documentation. However, they expressed concerns about the feasibility of the tool, implementation cost, compliance with guidelines, security issues related to the storage of client data, and the complexity of managing a paperless system. Concerns of MHWs in the developed and developing countries overlap significantly; hence, there is an opportunity for the developed world to learn from research targeted at the developing world. To address some specific concerns of MHWs in the developed world, we recommend further ethnographic work assuming an activity-theoretic approach with the target users.
Beenish Chaudhry, Louis Faust, Nitesh V. Chawla. "Development and Evaluation of a Web Application for Prenatal Care Coordinators in the United States."
International Conference on Design Science Research in Information Systems
Abstract
Workload modeling in public cloud environments is challenging due to reasons such as infrastructure abstraction, workload heterogeneity and a lack of defined metrics for performance modeling. This paper presents an approach that applies statistical methods for distribution analysis, parameter estimation and Goodness-of-Fit (GoF) tests to develop theoretical (estimated) models of heterogeneous workloads on Amazon’s public cloud infrastructure using compute, memory and IO resource utilization data.
Frederick Nwanganga, Nitesh V. Chawla, Gregory Madey. "Statistical Analysis and Modeling of Heterogeneous Workloads on Amazon’s Public Cloud Infrastructure."
Proceedings of the 52nd Hawaii International Conference on System Sciences
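The statistical workflow described above can be sketched in a few lines: fit candidate theoretical distributions to a resource-utilization sample via maximum likelihood, then compare them with a Kolmogorov-Smirnov goodness-of-fit test. The utilization values here are simulated, not actual EC2 measurements, and the candidate set is an assumption for illustration.

```python
# Sketch: distribution fitting and KS goodness-of-fit tests on utilization data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cpu_util = rng.gamma(shape=2.0, scale=12.0, size=2000).clip(0, 100)  # % CPU (toy)

candidates = {
    "gamma": stats.gamma,
    "lognorm": stats.lognorm,
    "expon": stats.expon,
    "norm": stats.norm,
}

for name, dist in candidates.items():
    params = dist.fit(cpu_util)                   # MLE parameter estimation
    ks_stat, p_value = stats.kstest(cpu_util, name, args=params)
    print(f"{name:8s} KS = {ks_stat:.3f}  p = {p_value:.3f}")
```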
Abstract
To seek answers to health queries, we often find ourselves on a quest to assimilate information from varied online sources. This information search and fusion from different sources elicits user preferences, which can be driven by demographics, context, and socio-economic factors. To that end, we study these factors as part of health-information seeking behavior of users on a large health and wellness-based knowledge sharing online platform. We begin by identifying the topical interests of users from different content consumption sources. Using these topical preferences, we explore information consumption and health-seeking behavior across three contextual dimensions: user-based demographic attributes, time-related features, and community-based socioeconomic factors. We then study how these context signals can be used to explain specific user health topic preferences. Our findings suggest that linking demographic features to user profiles is more effective in explaining health preferences than other features. Our work demonstrates the value of using contextual factors to characterize and understand the content consumption of users seeking health and wellness information online.
Aastha Nigam, Reid Johnson, Dong Wang, Nitesh V. Chawla. "Characterizing online health and wellness information consumption: A study."
Information Fusion
Abstract
We present a novel visual representation and interface named the matrix of isosurface similarity maps (MISM) for effective exploration of large time-varying multivariate volumetric data sets. MISM synthesizes three types of similarity maps (i.e., self, temporal, and variable similarity maps) to capture the essential relationships among isosurfaces of different variables and time steps. Additionally, it serves as the main visual mapping and navigation tool for examining the vast number of isosurfaces and exploring the underlying time-varying multivariate data set. We present temporal clustering, variable grouping, and interactive filtering to reduce the huge exploration space of MISM. In conjunction with the isovalue and isosurface views, MISM allows users to identify important isosurfaces or isosurface pairs and compare them over space, time, and value range. More importantly, we introduce path recommendation that suggests, animates, and compares traversal paths for effectively exploring MISM under varied criteria and at different levels-of-detail. A silhouette-based method is applied to render multiple surfaces of interest in a visually succinct manner. We demonstrate the effectiveness of our approach with case studies of several time-varying multivariate data sets and an ensemble data set, and evaluate our work with two domain experts.
Jun Tao, Martin Imre, Chaoli Wang, Nitesh V. Chawla, Hanqi Guo, Gökhan Sever, Seung Hyun Kim. "Exploring Time-Varying Multivariate Volume Data Using Matrix of Isosurface Similarity Maps."
IEEE Transactions on Visualization and Computer Graphics
Abstract
Heterogeneous networks present a challenge of heterogeneity not only in the types of nodes and relations, but also in the attributes and content associated with the nodes. While recent works have looked at representation learning on homogeneous and heterogeneous networks, no work has collectively addressed the following challenges: (a) the heterogeneous structural information of the network consisting of multiple types of nodes and relations; (b) the unstructured semantic content (e.g., text) associated with nodes; and (c) online updates due to incoming new nodes in a growing network. We address these challenges by developing a Content-Aware Representation Learning model (CARL). CARL performs joint optimization of heterogeneous SkipGram and deep semantic encoding to capture, as a function of node content, both the heterogeneous structural closeness and the unstructured semantic relations among all nodes in the network. Furthermore, an additional online update module is proposed for efficiently learning representations of incoming nodes. Extensive experiments demonstrate that CARL outperforms state-of-the-art baselines in various heterogeneous network mining tasks, such as link prediction, document retrieval, node recommendation, and relevance search. We also demonstrate the effectiveness of CARL's online update module through a category visualization study.
Chuxu Zhang, Ananthram Swami, Nitesh V. Chawla. "SHNE: Representation Learning for Semantic-Associated Heterogeneous Networks."
The 12th ACM International Conference on Web Search and Data Mining (WSDM 2019)
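A condensed sketch of the joint objective described above: node pairs observed on heterogeneous walks are scored with a SkipGram-style negative-sampling loss, and text-bearing nodes are represented by a GRU over their token embeddings rather than a free embedding vector. The class name, dimensions, and vocabulary are toy assumptions, not the released model.

```python
# Sketch: SkipGram loss with a GRU-based semantic encoder for content nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAwareSkipGram(nn.Module):
    def __init__(self, n_nodes, n_words, dim=64):
        super().__init__()
        self.node_emb = nn.Embedding(n_nodes, dim)      # structure-only nodes
        self.word_emb = nn.Embedding(n_words, dim)      # tokens of text nodes
        self.gru = nn.GRU(dim, dim, batch_first=True)   # deep semantic encoder

    def encode_text_node(self, token_ids):              # (batch, seq_len)
        _, h = self.gru(self.word_emb(token_ids))
        return h.squeeze(0)                              # (batch, dim)

    def loss(self, center_tokens, context_ids, negative_ids):
        center = self.encode_text_node(center_tokens)
        pos = (center * self.node_emb(context_ids)).sum(-1)
        neg = (center.unsqueeze(1) * self.node_emb(negative_ids)).sum(-1)
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

model = ContentAwareSkipGram(n_nodes=1000, n_words=5000)
center_tokens = torch.randint(0, 5000, (8, 12))   # 8 text nodes, 12 tokens each
context_ids = torch.randint(0, 1000, (8,))        # co-occurring walk neighbors
negative_ids = torch.randint(0, 1000, (8, 5))     # 5 negative samples each
print(model.loss(center_tokens, context_ids, negative_ids).item())
```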
Abstract
Nowadays, multivariate time series data are increasingly collected in various real-world systems, e.g., power plants and wearable devices. Anomaly detection and diagnosis in multivariate time series refer to identifying abnormal status at certain time steps and pinpointing the root causes. Building such a system, however, is challenging, since it requires not only capturing the temporal dependency within each time series, but also encoding the inter-correlations between different pairs of time series. In addition, the system should be robust to noise and provide operators with different levels of anomaly scores based upon the severity of different incidents. Despite the fact that a number of unsupervised anomaly detection algorithms have been developed, few of them can jointly address these challenges. In this paper, we propose a Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) to perform anomaly detection and diagnosis in multivariate time series data. Specifically, MSCRED first constructs multi-scale (resolution) signature matrices to characterize multiple levels of the system status at different time steps. Subsequently, given the signature matrices, a convolutional encoder is employed to encode the inter-sensor (time series) correlations, and an attention-based Convolutional Long Short-Term Memory (ConvLSTM) network is developed to capture the temporal patterns. Finally, based upon the feature maps that encode the inter-sensor correlations and temporal information, a convolutional decoder is used to reconstruct the input signature matrices, and the residual signature matrices are further utilized to detect and diagnose anomalies. Extensive empirical studies on a synthetic dataset and a real power plant dataset demonstrate that MSCRED can outperform state-of-the-art baseline methods.
Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, Nitesh V. Chawla. "A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data."
The 33rd Conference on Artificial Intelligence (AAAI 2019)
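The input representation is the most distinctive part of MSCRED, so the sketch below constructs only the multi-scale signature matrices: for each time step and each window length (scale), the pairwise inner products between sensor segments form one channel of an n_sensors x n_sensors matrix. The encoder, ConvLSTM, and decoder stages are omitted, and the data are synthetic.

```python
# Sketch: multi-scale signature matrices from a multivariate time series.
import numpy as np

def signature_matrices(X, t, windows=(10, 30, 60)):
    """X: (n_sensors, T) series -> (len(windows), n, n) matrices at step t."""
    n = X.shape[0]
    mats = np.zeros((len(windows), n, n))
    for c, w in enumerate(windows):
        seg = X[:, t - w:t]                      # (n_sensors, w)
        mats[c] = seg @ seg.T / w                # scaled pairwise inner products
    return mats

rng = np.random.default_rng(4)
T, n_sensors = 500, 6
X = rng.normal(size=(n_sensors, T))
X[2] += 0.8 * X[0]                               # correlated pair of sensors
X[:, 400:] = rng.normal(size=(n_sensors, 100))   # correlation breaks: anomaly

normal_sig = signature_matrices(X, t=300)
anomalous_sig = signature_matrices(X, t=480)
# MSCRED scores anomalies from the residual between the input signature matrices
# and their reconstruction; here we just compare the two inputs directly.
print("matrix shift:", np.abs(anomalous_sig - normal_sig).mean().round(3))
```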
Abstract
Neural collaborative filtering (NCF) and recurrent recommender systems (RRN) have been successful in modeling user-item relational data. However, they are limited by their assumption of static or purely sequential modeling of relational data, as they do not account for users' preferences evolving over time or for changes in the underlying factors that drive the user-item relationship over time. We address these limitations by proposing a Neural Tensor Factorization (NTF) model for predictive tasks on dynamic relational data. The NTF model generalizes conventional tensor factorization from two perspectives: first, it leverages the long short-term memory architecture to characterize the multi-dimensional temporal interactions in relational data; second, it incorporates a multi-layer perceptron structure for learning the non-linearities between different latent factors. Our extensive experiments demonstrate the significant improvement in rating prediction and link prediction on dynamic relational data achieved by our NTF model over both neural-network-based factorization models and other traditional methods.
Xian Wu, Baoxu Shi, Yuxiao Dong, Chao Huang, Nitesh V. Chawla. "Neural Tensor Factorization for Temporal Interaction Learning."
The 12th ACM International Conference on Web Search and Data Mining (WSDM 2019)
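An abbreviated sketch of the two ideas highlighted above: an LSTM summarizes the temporal dimension of the (user, item, time) tensor, and an MLP replaces the inner product to capture non-linear interactions among latent factors. Dimensions and the absence of a training loop are illustrative simplifications.

```python
# Sketch: LSTM over time-slice embeddings plus an MLP interaction layer.
import torch
import torch.nn as nn

class NeuralTensorFactorization(nn.Module):
    def __init__(self, n_users, n_items, n_times, dim=32):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)
        self.time = nn.Embedding(n_times, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)    # temporal dynamics
        self.mlp = nn.Sequential(                           # non-linear interactions
            nn.Linear(3 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, users, items, time_windows):
        # time_windows: (batch, k) indices of the k most recent time slices
        _, (h, _) = self.lstm(self.time(time_windows))
        temporal = h.squeeze(0)                              # (batch, dim)
        x = torch.cat([self.user(users), self.item(items), temporal], dim=-1)
        return self.mlp(x).squeeze(-1)                       # predicted rating / link score

model = NeuralTensorFactorization(n_users=100, n_items=200, n_times=52)
users = torch.randint(0, 100, (16,))
items = torch.randint(0, 200, (16,))
time_windows = torch.randint(0, 52, (16, 5))
print(model(users, items, time_windows).shape)               # torch.Size([16])
```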
Abstract
How do individuals' opinions on controversial issues such as marijuana and gay marriage, and their underlying social network connections, evolve over time? Do people alter their networks to have more like-minded friends, or do they change their own opinions? Does society eventually develop echo chambers? In this paper, we study dynamically evolving networks and changing user opinions to answer these questions. Our contributions are as follows: (a) Discovering Evolution of Polarization in Networks: we present evidence of a growing divide among users based on their opinions, who eventually form homophilic groups; (b) Studying Opinion and Network Co-Evolution: we present observations of how individuals change opinions and position themselves in dynamically changing networks; and (c) Forecasting Persistence and Change in Opinions and Network: we propose ONE-M to forecast individual beliefs and the persistence or dissolution of social ties. Using a unique real-world network dataset including periodic user surveys, we show that ONE-M performs with high accuracy while outperforming the baseline approaches.
Aastha Nigam, Kijung Shin, Ashwin Bahulkar, Bryan Hooi, David Hachen, Boleslaw K. Szymanski, Christos Faloutsos, Nitesh V. Chawla. "ONE-M: Modeling the Co-evolution of Opinions and Network Connections."
Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2018
Abstract
We consider multi-label crowdsourcing learning in two scenarios. In the first scenario, we aim at inferring instances' ground truth given the crowds' annotations. We propose two approaches, NAM and RAM (Neighborhood/Relevance Aware Multi-label crowdsourcing), which model the crowds' expertise and label correlations from different perspectives. Extended from single-label crowdsourcing methods, NAM models the crowds' expertise on individual labels; based on the idea that rational workers' annotations for instances that are similar in the feature space should also be similar, NAM utilizes information from the feature space and incorporates the local influence of neighborhoods' annotations. Noting that the crowds tend to act in an effort-saving manner while labeling multiple labels, i.e., rather than carefully annotating every proper label, they prefer scanning and tagging a few of the most relevant labels, RAM models the crowds' expertise as their ability to distinguish the relevance between label pairs. In the second scenario, we care about cost-efficient crowdsourcing, where the labeling and learning processes are conducted in tandem. We extend NAM/RAM to the active paradigm and propose instance, label, and worker selection criteria such that the labeling cost is significantly reduced compared to passive learning without labeling control. The effectiveness of the proposed approaches is validated on both simulated and real datasets.
Shao-Yuan Li, Yuan Jiang, Nitesh V. Chawla, and Zhi-Hua Zhou. "Multi-Label Learning from Crowds."
IEEE Transactions on Knowledge and Data Engineering, 2018
Abstract
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the de facto standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data and is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for big data problems.
Alberto Fernández, Salvador Garcia, Francisco Herrera, Nitesh V. Chawla. "SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary."
Journal of Artificial Intelligence Research, 2018
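For readers unfamiliar with the technique itself, here is a compact sketch of the core SMOTE interpolation step: each synthetic minority sample is placed on the line segment between a minority instance and one of its k nearest minority neighbors. In practice, libraries such as imbalanced-learn provide production implementations of SMOTE and its many variants.

```python
# Sketch: core SMOTE interpolation between minority samples and their neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, random_state=0):
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)      # +1: each point is its own neighbor
    _, neighbor_idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(len(X_min))
        neighbor = X_min[rng.choice(neighbor_idx[base][1:])]  # skip self at position 0
        gap = rng.random()                                     # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (neighbor - X_min[base])
    return synthetic

rng = np.random.default_rng(1)
X_minority = rng.normal(loc=2.0, size=(20, 2))                # toy minority class
X_new = smote(X_minority, n_synthetic=40)
print(X_new.shape)                                             # (40, 2)
```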
Abstract
Climate change threatens to affect all countries across the world. Country-specific challenges from climate change depend on a vast array of factors and thus demand customized decisions at scale from public-sector policymakers. However, judging the impact of such measures will be incomplete without considering the role of private-sector entities, which has gone largely under-investigated. In this chapter we focus on the impact of climate change on micro, small and medium-size enterprises (MSMEs), which constitute about 80% of the private sector in developing countries. We present and demonstrate the Cambio Score, an open-source, data-driven framework to help policymakers and private-sector entities identify country-specific threats from climate change. Longitudinal data from multiple country-specific macro-scale indicators provide quantitative evidence of vulnerability to these disruptions, as well as of countries' respective readiness to leverage funding and investment to introduce adaptive measures. Our framework, the Cambio Score, is intended to be used as a data-driven decision-making tool for policymakers, MSMEs, and researchers alike. We demonstrate its usefulness through a case study of a medical supply MSME in Bangladesh, where our framework was able to identify and prioritize areas of vulnerability and means of readiness in the face of climate change.
Saurabh Nagrecha, Nitesh V. Chawla. "Cambio Score: quantifying climate-change impacts for MSMEs in developing countries."
Private-sector action in adaptation: Perspectives on the role of micro, small and medium size enterprises, 2018
Abstract
Diabetes is a significant health concern, with more than 30 million Americans living with diabetes. Onset of diabetes increases the risk for various complications, including kidney disease, myocardial infarction, heart failure, stroke, retinopathy, and liver disease. In this paper, we study and predict the onset of these complications using a network-based approach by identifying fast and slow progressors. That is, given a patient's diagnosis of diabetes, we predict the likelihood of developing one or more of the possible complications, and which patients will develop complications quickly. This combination of "if a complication will be developed" with "how fast it will be developed" can aid the physician in developing a better diabetes management program for a given patient.
Pamela Bilo Thomas, Daniel H. Robertson, Nitesh V. Chawla. "Predicting onset of complications from diabetes: a graph based approach."
Applied Network Science
Abstract
Contextual behavior modeling uses data from multiple contexts to discover patterns for predictive analysis. However, existing behavior prediction models often face difficulties when scaling to massive datasets. In this work, we formulate a behavior as a set of context items of different types (such as decision makers, operators, goals, and resources), consider an observable itemset as a behavior success, and propose a novel scalable method, "multi-type itemset embedding," to learn representations of the context items that preserve the success structures. Unlike most existing embedding methods, which learn pair-wise proximity from the connection between a behavior and one of its items, our method learns item embeddings collectively from the interaction among all the multi-type items of a behavior. Based on this, we develop a novel framework, LearnSuc, for (1) predicting the success rate of any set of items and (2) finding complementary items that maximize the probability of success when incorporated into an itemset. Extensive experiments demonstrate both the effectiveness and efficiency of the proposed framework.
Daheng Wang, Meng Jiang, Qingkai Zeng, Zachary Eberhart, Nitesh V. Chawla. "Multi-Type Itemset Embedding for Learning Behavior Success."
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery Data Mining
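A toy sketch of the itemset-level idea described above: every context item has an embedding, a behavior is scored from the aggregate of its items' embeddings (with per-type weights), and the score would be trained against observed success. This is a simplification of LearnSuc's objective, with hypothetical class names and dimensions.

```python
# Sketch: scoring a multi-type itemset's success from aggregated item embeddings.
import torch
import torch.nn as nn

class ItemsetSuccessModel(nn.Module):
    def __init__(self, n_items, n_types, dim=32):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim, padding_idx=0)
        self.type_weight = nn.Parameter(torch.ones(n_types))    # importance per item type
        self.scorer = nn.Linear(dim, 1)

    def forward(self, item_ids, type_ids):
        # item_ids, type_ids: (batch, max_set_size), 0-padded
        emb = self.item_emb(item_ids)                            # (batch, set, dim)
        weights = self.type_weight[type_ids].unsqueeze(-1)       # (batch, set, 1)
        mask = (item_ids != 0).unsqueeze(-1)
        pooled = (emb * weights * mask).sum(dim=1)               # weighted itemset embedding
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)    # predicted success rate

model = ItemsetSuccessModel(n_items=500, n_types=4)
items = torch.randint(1, 500, (8, 6))      # 8 behaviors, up to 6 items each
types = torch.randint(0, 4, (8, 6))        # decision makers, operators, goals, resources
print(model(items, types))
```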
Abstract
The goal of the research project is to develop a data-driven approach including a predictive model and a recommender system based on a novel representation learning method to facilitate research planning using public academic datasets.
Daheng Wang, Meng Jiang, Xueying Wang, Tong Zhao, Qingkai Zeng, Nitesh V. Chawla. "A Project Showcase for Planning Research Work towards Publishable Success."
24th ACM SIGKDD International Conference on Knowledge Discovery Data Mining Project Showcase Track
Abstract
We study the problem of author-paper correlation inference in big scholarly data, which is to effectively infer potentially correlated works for researchers using historical records. Unlike supervised learning algorithms that predict the relevance score of an author-paper pair via time- and memory-consuming feature engineering, network embedding methods automatically learn node representations that can be further used to infer author-paper correlation. However, most current models suffer from two limitations: (1) they produce general-purpose embeddings that are independent of the specific task; (2) they are usually based on network structure alone, without content semantic awareness. To address these drawbacks, we propose a task-guided and semantic-aware ranking model. First, the historical interactions among all correlated author-paper pairs are formulated as a pairwise ranking loss. Next, the paper's semantic embedding, encoded by a gated recurrent neural network, together with the author's latent feature, is used to score each author-paper pair in the ranking loss. Finally, a heterogeneous relations integrative learning module is designed to further augment the model. The results of extensive experiments on the well-known AMiner dataset demonstrate that the proposed model achieves significantly better performance compared to a number of baselines.
Chuxu Zhang, Lu Yu, Xiangliang Zhang, Nitesh V. Chawla. "TSR: Task-Guided and Semantic-Aware Ranking for Academic Author-Paper Correlation Inference."
International Joint Conference on Artificial Intelligence (IJCAI2018)
Abstract
As urban crimes (e.g., burglary and robbery) negatively impact our everyday life and must be addressed in a timely manner, predicting crime occurrences is of great importance for public safety and urban sustainability. However, existing methods do not fully explore dynamic crime patterns, as the factors underlying crimes may change over time. In this paper, we develop a new crime prediction framework, DeepCrime, a deep neural network architecture that uncovers dynamic crime patterns and carefully explores the evolving inter-dependencies between crimes and other ubiquitous data in urban space. Furthermore, our DeepCrime framework is capable of automatically capturing the relevance of crime occurrences across different time periods. In particular, DeepCrime enables predicting crime occurrences of different categories in each region of a city by (i) jointly embedding all spatial, temporal, and categorical signals into hidden representation vectors, and (ii) capturing crime dynamics with an attentive hierarchical recurrent network. Extensive experiments on real-world datasets demonstrate the superiority of our framework over many competitive baselines across various settings.
Chao Huang, Junbo Zhang, Yu Zheng, Nitesh V. Chawla. "DeepCrime: Attentive Hierarchical Recurrent Networks for Crime Prediction."
International Conference on Information and Knowledge Management (CIKM'2018)
Abstract
Leveraging historical behavioral data (e.g., sales volume and email communication) for future prediction is of fundamental importance for practical domains ranging from sales to temporal link prediction. Current forecasting approaches often use only a single time resolution (e.g., daily or weekly), which truncates the range of observable temporal patterns. However, real-world behavioral time series typically exhibit patterns across multiple temporal resolutions, yielding dependencies at each level. To fully exploit these underlying dynamics, this paper studies the forecasting problem for behavioral time series data with the consideration of multiple time resolutions and proposes a multi-resolution time series forecasting framework, RESolution-aware Time series Forecasting (RESTFul). In particular, we first develop a recurrent framework to encode the temporal patterns at each resolution. In the fusion process, a convolutional fusion framework is proposed, which is capable of learning conclusive temporal patterns for modeling behavioral time series data to predict future time steps. Our extensive experiments demonstrate that the RESTFul model significantly outperforms state-of-the-art time series prediction techniques on both numerical and categorical behavioral time series data.
Xian Wu, Baoxu Shi, Yuxiao Dong, Chao Huang, Louis Faust, Nitesh V. Chawla. "RESTFul: Resolution-Aware Forecasting of Behavioral Time Series Data."
International Conference on Information and Knowledge Management (CIKM'2018)
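The multi-resolution idea can be sketched as one recurrent encoder per time resolution (e.g., daily and weekly aggregates of the same series), followed by a small convolutional layer that fuses the per-resolution summaries before the final prediction. The architecture and dimensions below are illustrative assumptions, not the published RESTFul configuration.

```python
# Sketch: per-resolution GRU encoders with a convolutional fusion layer.
import torch
import torch.nn as nn

class MultiResolutionForecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.daily_rnn = nn.GRU(1, hidden, batch_first=True)
        self.weekly_rnn = nn.GRU(1, hidden, batch_first=True)
        self.fuse = nn.Conv1d(in_channels=2, out_channels=4, kernel_size=3, padding=1)
        self.head = nn.Linear(4 * hidden, 1)

    def forward(self, daily, weekly):
        # daily: (batch, 28, 1), weekly: (batch, 8, 1)
        _, h_day = self.daily_rnn(daily)
        _, h_week = self.weekly_rnn(weekly)
        stacked = torch.stack([h_day.squeeze(0), h_week.squeeze(0)], dim=1)  # (batch, 2, hidden)
        fused = torch.relu(self.fuse(stacked))                               # (batch, 4, hidden)
        return self.head(fused.flatten(1)).squeeze(-1)                       # next-step forecast

model = MultiResolutionForecaster()
daily = torch.randn(16, 28, 1)     # last 28 days of the series
weekly = torch.randn(16, 8, 1)     # last 8 weekly aggregates
print(model(daily, weekly).shape)   # torch.Size([16])
```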
Abstract
The rise in popularity of physical activity trackers provides extensive opportunities for measuring personal health at scale. Coupled with renewed interest in the field of positive psychology, we aim to explore how the quantified self can inform us about well-being. In this paper, we examine Fitbit Charge HR data from a college cohort of 125 students who were measured for two academic years (56 weeks). We assess variations in step counts across a typical week through k-means clustering. From this, we observe a group of students whose step counts were highest during the weekend and a group with significantly fewer steps on weekends. We found these trends to be stable, persisting across the two academic years measured. Students' cluster locations correlated with aspects indicating sociability and subjective well-being. We discuss the correlations between these traits and propose that weekend activity levels may serve as a measure for informing one's subjective well-being.
Louis Faust, David Hachen, Omar Lizardo, Nitesh V. Chawla. "Quantifying Subjective Well-Being Using Trends in Weekend Activity."
2018 IEEE International Conference on Healthcare Informatics (ICHI)
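The clustering step can be sketched as follows: each student is represented by their average steps per day of week, profiles are normalized so that clustering reflects weekly shape rather than overall volume, and k-means separates weekday-dominant from weekend-dominant activity patterns. The step counts are simulated stand-ins for Fitbit data.

```python
# Sketch: k-means over normalized day-of-week step profiles.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_students = 125
profiles = rng.normal(8000, 1500, size=(n_students, 7))      # Mon..Sun average steps
weekend_lovers = rng.random(n_students) < 0.4
profiles[weekend_lovers, 5:] += 4000                          # boost Sat/Sun for some
profiles[~weekend_lovers, 5:] -= 3000                         # drop Sat/Sun for others

# Normalize each profile so clustering reflects weekly shape, not overall volume.
shape = (profiles - profiles.mean(axis=1, keepdims=True)) / profiles.std(axis=1, keepdims=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(shape)
for c in range(2):
    members = profiles[kmeans.labels_ == c]
    print(f"cluster {c}: weekday mean={members[:, :5].mean():.0f} "
          f"weekend mean={members[:, 5:].mean():.0f} (n={len(members)})")
```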
Abstract
A variety of actors with diverse perspectives, such as maternity health workers (MHWs), pregnant women, providers, and other community organizations, constitute the perinatal services system designed for low-income communities. Yet little is known about the nature of the service delivery challenges that seem inevitable in this setting. We sought to understand the prenatal services system from the perspective of MHWs and identify areas that can be improved by mHealth solutions. We collected qualitative data from twenty-eight target MHWs via an online survey and used the open coding approach to identify themes. The findings showed that MHWs face a number of challenges while providing services such as education and referrals. Notable challenges were a lack of coordination with other stakeholders, unmet professional development needs, maintaining privacy, and difficulties with client education, communication, and record exchange. MHWs expressed the need for innovative technologies to facilitate client education and open communication with other maternity health actors. Our findings suggest that there is great demand for communication and service delivery tools among community-based MHWs. However, new solutions need to be integrated within existing tools and environments while empowering MHWs with professional and educational support.
Beenish Chaudhry, Louis Faust, Nitesh V. Chawla. "Towards an Integrated mHealth Platform for Community-based Maternity Health Workers in Low-Income Communities."
12th EAI International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth 2018).
Abstract
Aims: This pilot study explored how maternal stress experienced in the neonatal intensive care unit (NICU) is affected by the individual nursing structure and the network that provides care to extremely preterm infants. Background: Mothers experience high stress when their extremely preterm infants are hospitalized in the NICU. This often translates into maladaptive parenting behaviours that negatively affect the long‐term cognitive, social, and emotional development of the infant. Efforts to identify modifiable sources of maternal stress in the NICU could lead to improvement in maternal engagement and, ultimately, long‐term neurodevelopmental outcomes. Method: Time‐ and date‐stamped nursing shift data were extracted from the medical record and transformed into five structural nursing metrics with resultant nurse data networks. These were then analysed for associations with maternal stress outcomes on the Parental Stressor Scale (PSS: NICU). Results: Infants experienced highly variable nursing care and networks of nurses throughout their hospitalization. This variability is associated with the PSS: NICU (a) Sights and Sounds and (b) Altered Parental Role subscales. Conclusion: Nursing structure and the resultant caregiving network have an impact on maternal stress. Implications for Nursing Management: Changing the pattern of nurse staffing may be a modifiable intervention target for reducing maternal stress in the NICU.
Jenn Gonya, Tondi Harrison, Keith Feldman, Melanie Stein, Nitesh Chawla. "Nursing networks in the NICU and their association with maternal stress: A pilot study."
Journal of Nursing Management, 2018
Abstract
Background: Extremely preterm infants represent one of the highest risk categories for impairments in social competence. Few studies have explored the impact of the neonatal intensive care unit (NICU) environment on social development. However, none have specifically analyzed the effects of the care structure the infant receives during hospitalization on later social competence indicators. Objective: To identify associations between the care structures received by extremely preterm infants in the NICU and scores on the Brief Infant-Toddler Social and Emotional Assessment (BITSEA) post-discharge. Participants: 50 extremely preterm infants (mean gestational age: 25 weeks during hospitalization; mean chronological age during follow-up assessment: 2 years, 4 months). Methods: A secondary analysis of BITSEA data was performed exploring its relation to care structure data we extracted from electronic medical records (i.e., how much time infants were engaged in human interaction during their first thirty days of hospitalization and what types of interaction they were exposed to). Results: Extremely preterm infants spend a considerable amount of time alone during hospitalization (80%) with nursing care comprising the majority of human interaction. Infants who experienced greater human interaction scored significantly higher on the Social Competence (p = 0.01) and lower on the Dysregulation (p = 0.03) BITSEA subscales. Conclusion: Human interaction and isolation in the NICU is associated with social competence and dysregulation outcomes in extremely preterm infants. Further research is needed to understand how various NICU care structures including centralized nursing teams, parental skin-to-skin care, and early therapy may synergistically play a positive role in developing social competence.
Jenn Gonya, Keith Feldman, Kristen Brown, Melanie Stein, Sarah Keim, Kelly Boone, Robert Rumpf, William Ray, Nitesh V. Chawla, Eric Butter. "Human interaction in the NICU and its association with outcomes on the Brief Infant-Toddler Social and Emotional Assessment (BITSEA)."
Early Human Development, 2018
Abstract
Coupled with the rise of data science and machine learning, the increasing availability of digitized health and wellness data has provided an exciting opportunity for complex analyses of problems throughout the healthcare domain. Whereas many early works focused on a particular aspect of patient care, often drawing on data from a specific clinical or administrative source, it has become clear such a single-source approach is insufficient to capture the complexity of the human condition. Instead, adequately modeling health and wellness problems requires the ability to draw upon data spanning multiple facets of an individual’s biology, their care, and the social aspects of their life. Although such an awareness has greatly expanded the breadth of health and wellness data collected, the diverse array of data sources and intended uses often leave researchers and practitioners with a scattered and fragmented view of any particular patient. As a result, there exists a clear need to catalogue and organize the range of healthcare data available for analysis. This work represents an effort at developing such an organization, presenting a patient-centric framework deemed the Healthcare Data Spectrum (HDS). Comprised of six layers, the HDS begins with the innermost micro-level omics and macro-level demographic data that directly characterize a patient, and extends at its outermost to aggregate population-level data derived from attributes of care for each individual patient. For each level of the HDS, this manuscript will examine the specific types of constituent data, provide examples of how the data aid in a broad set of research problems, and identify the primary terminology and standards used to describe the data.
Keith Feldman, Reid A Johnson, Nitesh V Chawla. "The State of Data In Healthcare: Path towards standardization."
Journal of Healthcare Informatics Research (JHIR)
Abstract
Nonstandard insurers suffer from a peculiar variant of fraud wherein an overwhelming majority of claims have the semblance of fraud. We show that state-of-the-art fraud detection performs poorly when deployed at underwriting. Our proposed framework "FraudBuster" represents a new paradigm in predicting segments of fraud at underwriting in an interpretable and regulation-compliant manner. We show that the most actionable and generalizable profile of fraud is represented by market segments with high confidence of fraud and high loss ratio. We show how these segments can be reported in terms of their constituent policy traits, expected loss ratios, support, and confidence of fraud. Overall, our predictive models successfully identify fraud with an area under the precision-recall curve of 0.63 and an F1 score of 0.769.
Saurabh Nagrecha, Reid A Johnson, Nitesh V Chawla. "FraudBuster: Reducing Fraud in an Auto Insurance Market."
Big Data
Abstract
While healthcare has traditionally existed within the confines of formal clinical environments, the emergence of population health initiatives has given rise to a new and diverse set of community interventions. As the number of interventions continues to grow, the ability to quickly and accurately identify those most relevant to an individual's specific need has become essential in the care process. However, due to the diverse nature of the interventions, the determination of need often requires non-clinical social and behavioral information that must be collected from the individuals themselves. Although survey tools have demonstrated success in the collection of these data, time restrictions and diminishing respondent interest have presented barriers to obtaining up-to-date information on a regular basis. In response, researchers have turned to analytical approaches to optimize surveys and quantify the importance of each question. To date, the majority of these works have approached the task from a univariate standpoint, identifying the next most important question to ask. However, such an approach fails to address the interconnected nature of the health conditions inherently captured by the broader set of survey questions. Utilizing data mining and machine learning methodology, this work demonstrates the value of capturing these relations. We present a novel framework that identifies a variable-length subset of survey questions most relevant to determining the need for a particular health intervention for a given individual. We evaluate the framework using a large national longitudinal dataset centered on aging, demonstrating the ability to identify the questions with the highest impact across a variety of interventions.
Keith Feldman, Spyros Kotoulas, Nitesh V Chawla. "TIQS: Targeted Iterative Question Selection for Health Interventions."
Journal of Healthcare Informatics Research (JHIR)
Abstract
The rates of childhood obesity are at an all-time high, both domestically and around the globe. In an effort to stem current trends, researchers have looked to understand the components of effective and sustainable interventions. Although obesity interventions have often focused on a single lifestyle factor, such as activity or nutrition, recent focus has shifted to a broader set of interventions that address multiple aspects of a child's life concurrently. Yet, despite promising outcomes, what drives the success or failure of an intervention for a specific child is not yet well understood. To this end we have designed FitSpace, a web-based health and wellness platform designed to help understand how goal setting, activity, and nutrition patterns are tied to a child's social-demographic profile and their overall success in achieving sustainable healthy behavior. The work presented here represents the first step in our research, detailing a usability pilot study of the platform run in partnership with a local high school.
Keith Feldman, Mayra Duarte, Waldo Mikels-Carrasco, Nitesh V Chawla. "Leveraging health and wellness platforms to understand childhood obesity: A usability pilot of FitSpace."
2018 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI)
Abstract
Data sparsity and cold-start problems are prevalent in recommender systems. To address such problems, both observable explicit social information (e.g., user-user trust connections) and inferable implicit correlations (e.g., implicit neighbors computed by similarity measurement) have been introduced to complement user-item rating data for improving the performance of traditional model-based recommendation algorithms such as matrix factorization. Although effective, (1) the utilization of explicit user-user social relationships suffers from unavailability in real systems such as Netflix, or from sparse observable content, such as the 0.03% trust density in Epinions; thus, there is little or no explicit social information that can be employed to improve the baseline model in real applications; and (2) current similarity measurement approaches focus on inferring implicit correlations between a user (item) and its direct neighbors or top-k similar neighbors based on the user-item rating bipartite network, so they fail to comprehensively unfold the indirect potential relationships among users and items. To solve these issues for both explicit and implicit social recommendation algorithms, we design a joint model of matrix factorization and implicit walk integrative learning, ImWalkMF, which uses only explicit rating information yet models both direct rating feedback and multiple direct/indirect implicit correlations among users and items from a random walk perspective. We further propose a combined strategy for training the two independent components in the proposed model based on sampling. Experimental results on two real-world sparse datasets demonstrate that ImWalkMF outperforms traditional regularized/probabilistic matrix factorization models as well as other competitive baselines that utilize explicit/implicit social information.
Chuxu Zhang, Lu Yu, Xiangliang Zhang, Nitesh Chawla. "ImWalkMF: Joint Matrix Factorization and Implicit Walk Integrative Learning for Recommendation."
IEEE International Conference on Big Data (Big Data), January 2018
Abstract
Fake reviews have become a pervasive problem in online review systems, wherein fraudulent users manipulate the perception of an object (e.g., a restaurant) by fabricating fake reviews. Extensive work has been devoted to identifying fake reviews by modeling different factors separately, such as user features, object characteristics, and user-object bipartite relations. However, this problem remains challenging because more advanced camouflage strategies are utilized by malicious users. In real-world scenarios, spammers may pretend to be normal users by giving fake reviews with a score distribution similar to that of normal users. To address these issues, we propose to explore the temporal patterns of users' review behavior, because spammers prefer to promote or demote the target businesses within a short period of time. In this work, we present a unified framework, Reliable Fake Review Detection (RFRD), that explicitly models the temporal patterns of users' review behavior in a probabilistic generative model. Moreover, the RFRD framework models users' underlying review credibility and objects' highly skewed review distributions. We conduct experiments on two Yelp datasets, demonstrating the effectiveness of the proposed RFRD framework.
Xian Wu
,
Yuxiao Dong
, Jun Tao,
Chao Huang
,
Nitesh Chawla
. "
Fake Review Detection via Modeling Temporal and Behavioral Patterns
."
IEEE International Conference on Big Data (Big Data) 2018
,
January
2018
Abstract
Event-based social network (EBSN) services have emerged as a new platform on which users can choose events of interest to attend in the physical world. Over the years, there has been growing research interest in predicting whether certain actors will participate in an event together. In this work, we refer to this task as the event attendance prediction problem and aim to address the predictability of individuals’ event attendance. In real-world settings, the factors that influence an individual’s attendance may change over time, leading to the dynamic nature of individuals’ behavior. However, existing event attendance prediction methods cannot deal with such dynamic scenarios. To address this issue, we propose an end-to-end Deep Event Attendance Prediction (DEAP) framework—a three-level hierarchical LSTM architecture—to explicitly model users’ multi-dimensional and evolving preferences. Extensive experiments on three real-world datasets demonstrate that DEAP significantly outperforms the state-of-the-art techniques across various settings.
Xian Wu
,
Yuxiao Dong
, Baoxu Shi,
Nitesh Chawla
. "
Who will Attend This Event Together? Event Attendance Prediction via Deep LSTM Network
."
Proceedings of the 18th SIAM International Conference on Data Mining (SDM)
,
January
2018
Abstract
In this paper, we study the problem of author identification in big scholarly data, which is to effectively rank potential authors for each anonymous paper by using historical data. Most existing de-anonymization approaches predict the relevance score of a paper-author pair via feature engineering, which is not only time- and storage-consuming, but also introduces irrelevant and redundant features or misses important attributes. Representation learning can automate the feature generation process by learning node embeddings in an academic network to infer the correlation of a paper-author pair. However, the learned embeddings are often for general purposes (independent of the specific task), or based on network structure only (without considering the node content). To address these issues and make further progress in solving the author identification problem, we propose Camel, a content-aware and meta-path augmented metric learning model. Specifically, first, the directly correlated paper-author pairs are modeled based on distance metric learning by introducing a push loss function. Next, the paper content embedding encoded by a gated recurrent neural network is integrated into the distance loss. Moreover, the historical bibliographic data of papers is utilized to construct an academic heterogeneous network, wherein a meta-path guided walk integrative learning module based on the task-dependent and content-aware Skip-gram model is designed to formulate the correlations between each paper and its indirect author neighbors, further augmenting the model. Extensive experiments demonstrate that Camel outperforms the state-of-the-art baselines, achieving an average improvement of 6.3% over the best baseline method.
Chuxu Zhang
,
Chao Huang
, Lu Yu, Xiangliang Zhang,
Nitesh Chawla
. "
Camel: Content-Aware and Meta-path Augmented Metric Learning for Author Identification
."
Proceedings of the Web Conference (WWW)
,
January
2018
Abstract
Social sensing has emerged as a new application paradigm in networked sensing communities where a colossal amount of observations about the physical world are contributed by people or devices they use. Our work solves a critical challenge in social sensing applications where the goal is to estimate the reliability of social sensors and the truthfulness of observed variables (typically known as claims) with little prior knowledge on either of them. This challenge is referred to as truth discovery. An important limitation in the previous truth discovery solutions is that they assume the relationship between source reliability and claim truthfulness can be represented by simplified functions (e.g., linear, quadratic and binomial). This assumption leads to suboptimal truth discovery results because the exact relational dependency between sources and claims is often unknown a priori. In this paper, we show that a neural network approach can learn the complex relational dependency better than the previous truth discovery methods. In particular, we develop a multi-layer neural network model that solves the truth discovery problem in social sensing without any assumption on the prior knowledge of the source-claim relational dependency distribution. The performance of our model is evaluated through two real-world events using data crawled from Twitter. The evaluation results show that our neural network approach significantly outperforms previous truth discovery methods.
Jermaine Marshall
, Arturo Argueta, Dong Wang
. "
A Neural Network Approach for Truth Discovery in Social Sensing
."
IEEE 14th International Conference on Mobile Ad Hoc and Sensor Systems (MASS) 2017
,
November
2017
Abstract
The social triad—a group of three people—is one of the simplest and most fundamental social groups. Extensive network and social theories have been developed to understand its structure, such as triadic closure and social balance. However, the strength dynamics of a triad's social ties over the course of a triadic closure (the transition from two ties to three among three users) are much less well understood. Using two dynamic networks from social media and mobile communication, we examine how the formation of the third tie in a triad affects the strength of the existing two ties. Surprisingly, we find that in about 80% of social triads, the strength of the first two ties is weakened, although on average the tie strength in the two networks maintains an increasing or stable trend. We discover that 1) the decrease in tie strength among three males is sharper than that among females, and 2) ties between celebrities are more likely to be weakened by the closure of a triad than ties between ordinary people. Further, we formalize a triadic tie strength dynamics prediction problem to infer whether the social ties of a triad will become weakened after its closure. We propose a TRIST method—a kernel density estimation (KDE) based graphical model—to solve the problem by incorporating user demographics, temporal effects, and structural information. Extensive experiments demonstrate that TRIST offers greater than 82% potential predictability for inferring triadic tie strength dynamics in both networks.
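Since TRIST is described as a KDE-based model, the snippet below illustrates only the kernel-density-estimation building block with SciPy; the tie-strength changes are synthetic and the comparison rule is an editorial simplification, not the paper's graphical model.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic tie-strength changes for two hypothetical outcomes after closure.
deltas_weakened = np.random.default_rng(1).normal(-0.4, 0.2, 500)
deltas_stable = np.random.default_rng(2).normal(0.1, 0.3, 500)

kde_weak = gaussian_kde(deltas_weakened)   # density of "weakened" changes
kde_stab = gaussian_kde(deltas_stable)     # density of "stable/strengthened" changes

def more_likely_weakened(delta):
    # Compare the estimated densities at the observed change.
    return kde_weak(delta)[0] > kde_stab(delta)[0]

print(more_likely_weakened(-0.3))  # True for strongly negative changes
```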
Huang, Hong,
Yuxiao Dong
, Jie Tang, Hongxia Yang,
Nitesh V. Chawla
, and Xiaoming Fu
. "
Will Triadic Closure Strengthen Ties in Social Networks?
."
ACM Transactions on Knowledge Discovery from Data (TKDD), 12(3): 30.
,
2017
.
Abstract
People discuss events such as shootings, protests, flight crashes, and new public policies across news media and social media. Different media sources reflect different facets of these events. In this position paper, we point out the importance of integrating data from multiple media sources for understanding and analyzing events. Big data's "Variety" issue can be resolved by extending the information network representation to a "cross-media information network," but how can comprehensive event analysis be generated from the network's components? We propose a multifaceted analysis framework that models four critical facets to fully understand events across a social media network and a news media network. The facets include media type, for differentiating event information sources; content, for discovering events from unstructured text; sentiment, for quantifying user reactions to events; and time, for turning events into dynamic, evolving objects. We applied our framework to a real data set from Twitter and Google News, producing interesting and useful event descriptions and visualizations for human inspection.
Daheng Wang
, Meng Jiang, Xueying Wang,
Nitesh Chawla
, Paul Brunts
. "
Multifaceted Event Analysis on Cross-Media Network Data
."
1st International Workshop On Heterogenous Networks Analysis and Mining (HeteroNAM)
,
2017
.
Abstract
Visual exploration of flow fields is important for studying dynamic systems. We introduce semantic flow graph (SFG), a novel graph representation and interaction framework that enables users to explore the relationships among key objects (i.e., field lines, features, and spatiotemporal regions) of both steady and unsteady flow fields. The objects and their relationships are organized as a heterogeneous graph. We assign each object a set of attributes, based on which a semantic abstraction of the heterogeneous graph is generated. This semantic abstraction is SFG. We design a suite of operations to explore the underlying flow fields based on this graph representation and abstraction mechanism. Users can flexibly reconfigure SFG to examine the relationships among groups of objects at different abstraction levels. Three linked views are developed to display SFG, its node split criteria and history, and the objects in the spatial volume. For simplicity, we introduce SFG construction and exploration for steady flow fields with critical points being the only features. Then we demonstrate that SFG can be naturally extended to deal with unsteady flow fields and multiple types of features. We experiment with multiple data sets and conduct an empirical expert evaluation to demonstrate the effectiveness of our approach.
Jun Tao, Chaoli Wang,
Nitesh V. Chawla
, Lei Shi, and Seung Hyun Kim
. "
Semantic Flow Graph: A Framework for Discovering Object Relationships in Flow Fields
."
IEEE Transactions on Visualization and Computer Graphics
,
2017
.
Abstract
Users with demographic profiles in social networks offer the potential to understand the social principles that underpin our highly connected world, from individuals, to groups, to societies. In this article, we harness the power of network and data sciences to model the interplay between user demographics and social behavior and further study to what extent users’ demographic profiles can be inferred from their mobile communication patterns. By modeling over 7 million users and 1 billion mobile communication records, we find that during the active dating period (i.e., 18--35 years old), users are active in broadening social connections with males and females alike, while after reaching 35 years of age people tend to keep small, closed, and same-gender social circles. Further, we formalize the demographic prediction problem of inferring users’ gender and age simultaneously. We propose a factor graph-based WhoAmI method to address the problem by leveraging not only the correlations between network features and users’ gender/age, but also the interrelations between gender and age. In addition, we identify a new problem—coupled network demographic prediction across multiple mobile operators—and present a coupled variant of the WhoAmI method to address its unique challenges. Our extensive experiments demonstrate the effectiveness, scalability, and applicability of the WhoAmI methods. Finally, our study finds a greater than 80% potential predictability for inferring users’ gender from phone call behavior and 73% for users’ age from text messaging interactions.
Yuxiao Dong
,
Nitesh V. Chawla
, Jie Tang, Yang Yang, Yang Yang
. "
User Modeling on Demographic Attributes in Big Mobile Social Networks
."
ACM Transactions on Information Systems (TOIS), 35(4): 35
,
2017
.
Abstract
From medical charts to national census, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation bringing healthcare into the digital age. Ranging from electronic health records, to digitized imaging and laboratory reports, to public health datasets, today, healthcare now generates an incredible amount of digital information. Such a wealth of data presents an exciting opportunity for integrated machine learning solutions to address problems across multiple facets of healthcare practice and administration. Unfortunately, the ability to derive accurate and informative insights requires more than the ability to execute machine learning models. Rather, a deeper understanding of the data on which the models are run is imperative for their success. While a significant effort has been undertaken to develop models able to process the volume of data obtained during the analysis of millions of digitalized patient records, it is important to remember that volume represents only one aspect of the data. In fact, drawing on data from an increasingly diverse set of sources, healthcare data presents an incredibly complex set of attributes that must be accounted for throughout the machine learning pipeline. This chapter focuses on highlighting such challenges, and is broken down into three distinct components, each representing a phase of the pipeline. We begin with attributes of the data accounted for during preprocessing, then move to considerations during model building, and end with challenges to the interpretation of model output. For each component, we present a discussion around data as it relates to the healthcare domain and offer insight into the challenges each may impose on the efficiency of machine learning techniques.
Keith Feldman
,
Louis Faust
,
Xian Wu
,
Chao Huang
,
Nitesh V. Chawla
. "
Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline
."
Towards Integrative Machine Learning and Knowledge Extraction. Springer, Cham: 150-169.
,
2017
.
Abstract
Predicting the onset of heart disease is of obvious importance as doctors try to improve the general health of their patients. If it were possible to identify high-risk patients before their heart failure diagnosis, doctors could use that information to implement preventative measures to keep a heart failure diagnosis from becoming a reality. Integration of Electronic Medical Records (EMRs) into clinical practice has enabled the use of computational techniques for personalized healthcare at scale. The larger goal of such modeling is to pivot from reactive medicine to preventative care and early detection of adverse conditions. In this paper, we present a trajectory-based disease progression model to detect chronic heart failure. We validate our work on a database of Medicare records of 1.1 million elderly US patients. Our supervised approach allows us to assign a likelihood of chronic heart failure to an unseen patient’s disease history and identify key disease progression trajectories that intensify or diminish said likelihood. This information will be a tremendous help as patients and doctors try to understand which diagnoses are most dangerous for those who are susceptible to heart failure. Using our model, we demonstrate some of the most common disease trajectories that eventually result in the development of heart failure.
Saurabh Nagrecha
,
Pamela Bilo Thomas
,
Keith Feldman
,
Nitesh V. Chawla
. "
Predicting Chronic Heart Failure Using Diagnoses Graphs
."
International Cross-Domain Conference for Machine Learning and Knowledge Extraction (CD-MAKE). Springer, Cham
,
2017
.
Abstract
We report on a study of affective states of learners in a Massive Open Online Course (MOOC) and the interplay of Affect, Behavior and Cognition at various stages of the course. Affect is measured through a series of self-reports from learners at strategic time posts during the period of study. Behavior is characterized in terms of a learners’ engagement, interactivity, impatience and reflectivity, which constitute a set of novel high-level features derived from the clickstream of learner interactions. Cognition is evaluated from the performance of learners on assessments that are part of the course. We discover that learners in the MOOC experience multiple as well as mixed emotions as they go through the course, which we handle using the psychological dimensions of arousal and valence. This results in a set of emotional quadrants, whose co-occurrence analysis reveals a strong association with cognition and specific behavioral characteristics demonstrated by the learner. These results advance our understanding of the experience of MOOC learners to a more holistic level across the key dimensions of affect, behavior and cognition. They also have important implications for the design of the next generation MOOCs that can potentially leverage affect and behavior-aware interventions to drive greater personalization and eventually, improved learning outcomes.
Shazia Afzal, Bikram Sengupta,
Munira Syed
,
Nitesh Chawla
, G. Alex Ambrose and Malolan Chetlur
. "
The ABC of MOOCs: Affect and its Inter-play with Behaviour and Cognition
."
Proceedings of the 7th Affective Computing and Intelligent Interaction (ACII)
,
2017
.
Abstract
In recent decades, materials science literature and patents have grown exponentially. This growth has contributed to an ever-growing challenge of keeping the literature current, as there can be a gap between when a patent is filed and when it is approved. Moreover, it is difficult to ensure that a patent cites the appropriate prior art due to the variety and volume of materials science data, especially when the data resides in two separate sources, publications and patents, that have different curation mechanisms and purposes. The existing relational database schema, generally used to store publications, also presents challenges given its strict tabular structure, which may not be appropriate for organizing and querying highly interconnected information about materials in these publications and patents. For example, elements are chemically combined to form a compound, which can then be converted to other compounds via chemical reactions. Furthermore, relational databases are not designed to combine data from multiple sources and in various formats, which makes discovering relevance between publications and patents difficult. To explore an alternative approach to representing materials data and combining data from multiple sources into the same repository, in this work we propose a solution that integrates data from the Open Quantum Materials Database (OQMD) and patent data from the USPTO database into a network, which we name the heterogeneous materials information network (HMIN). We generalize prior work based on meta-path-based topological features to explore the network, and we propose features to identify network noise and investigate relatedness between different-typed objects to meet our application needs. We built several machine learning models using these features to explore relevance between materials science publications and patents. Experimental results show that HMIN can help researchers effectively discover related publications and patents originally kept in different sources. Our work demonstrates to the materials community a new way of appropriately representing materials data and discovering connections between data from multiple sources.
Pingjie Tang
, Jed Pitera, Dmitry Zubarev,
Nitesh V. Chawla
. "
Heterogeneous Materials Information Network Construction and Relevance Search
."
Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics (DSAA)
Best Paper Award
,
2017
.
Abstract
Recent technology advancements in the areas of compute, storage and networking, along with the increased demand for organizations to cut costs while remaining responsive to increasing service demands, have led to the growth in the adoption of cloud computing services. Cloud services provide the promise of improved agility, resiliency, scalability and a lowered Total Cost of Ownership (TCO). This research introduces a framework for minimizing cost and maximizing resource utilization by using an Integer Linear Programming (ILP) approach to optimize the assignment of workloads to servers on Amazon Web Services (AWS) cloud infrastructure. The model is based on the classical minimum-cost flow model, known as the assignment model.
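The sketch below solves a toy instance of the underlying assignment model, assigning each workload to a server at minimum total cost. It uses SciPy's Hungarian-style solver rather than a general ILP formulation, and the cost matrix is a made-up example, not data from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = workloads, columns = servers; entries are hypothetical placement costs.
cost = np.array([
    [4.0, 2.5, 7.0],
    [3.0, 6.0, 1.5],
    [5.5, 4.0, 3.0],
])

workloads, servers = linear_sum_assignment(cost)   # minimum-cost one-to-one assignment
for w, s in zip(workloads, servers):
    print(f"workload {w} -> server {s} (cost {cost[w, s]})")
print("total cost:", cost[workloads, servers].sum())
```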
Frederick Nwanganga
,
Mandana Saebi
, Gregory Madey,
Nitesh Chawla
. "
A Minimum-Cost Flow Model for Workload Optimization on Cloud Infrastructure
."
Proceedings of the 10th International Conference on Cloud Computing (Cloud)
,
2017
.
Abstract
Background: Hypoxia Inducible Factor 3 Alpha Subunit (HIF3A) DNA has been demonstrated to be associated with obesity at the methylation level, and it also has a Body Mass Index (BMI)-independent association with plasma alanine aminotransferase (ALT). However, the relation among obesity, plasma ALT, HIF3A polymorphism and methylation remains unclear. This study aims to identify the association between HIF3A polymorphism and plasma ALT, and further to determine whether the effect of HIF3A polymorphism on ALT could be modified by obesity or mediated by DNA methylation.
Methods: The HIF3A rs3826795 polymorphism was genotyped in a case-control study including 2030 Chinese children aged 7–18 years (705 obese cases and 1325 non-obese controls). Furthermore, the HIF3A DNA methylation of the peripheral blood was measured in 110 severely obese children and 110 age- and gender-matched normal-weight controls.
Results: There was no overall association between the HIF3A rs3826795 polymorphism and ALT. A significant interaction between obesity and rs3826795 in relation with ALT was found (P for interaction = 0.042), with the rs3826795 G-allele number elevating ALT significantly only in obese children (β' = 0.075, P = 0.037), but not in non-obese children (β' = −0.009, P = 0.741). Additionally, a mediation effect of HIF3A methylation was found in the association between the HIF3A rs3826795 polymorphism and ALT among obese children (β' = 0.242, P = 0.014).
Conclusion: This is the first study to report the interaction between obesity and the HIF3A gene in relation with ALT, and also to reveal a mediation effect among the HIF3A polymorphism, methylation, and ALT. This study provides new evidence for the function of the HIF3A gene, which would be helpful for future risk assessment and personalized treatment of liver diseases.
Shuo Wang, Jieyun Song, Yide Yang, Yining Zhang,
Nitesh V. Chawla
, Jun Ma and Haijun Wang
. "
Interaction between obesity and the Hypoxia Inducible Factor 3 Alpha Subunit rs3826795 polymorphism in relation with plasma alanine aminotransferase
."
BMC Medical Genetics, 18(1): 80
,
2017
.
Abstract
Urban city environments face the challenge of disturbances, which can create inconveniences for their citizens. These require timely detection and resolution, and more importantly, timely preparedness on the part of city officials. We term these disturbances anomalies, and pose the question of whether it is possible not only to detect these anomalous events (reactive), but also to predict them (proactive). While significant effort has been made in detecting anomalies in existing urban data, the prediction of future urban anomalies is much less well studied and understood. In this work, we formalize the future anomaly prediction problem in urban environments, so that such anomalies can be addressed in a more efficient and effective manner. We develop the Urban Anomaly PreDiction (UAPD) framework, which addresses a number of challenges, including the dynamic and spatial variations of different categories of anomalies. Given the urban anomaly data to date, UAPD first detects the change point of each type of anomaly in the temporal dimension and then uses a tensor decomposition model to decouple the interrelations between the spatial and categorical dimensions. Finally, UAPD applies an autoregression method to predict which categories of anomalies will happen at each region in the future. We conduct extensive experiments in two urban environments, namely New York City and Pittsburgh. Experimental results demonstrate that UAPD outperforms alternative baselines across various settings, including different region and time-frame scales, as well as diverse categories of anomalies.
Xian Wu
,
Yuxiao Dong
,
Chao Huang
,
Jian Xu
,
Nitesh Chawla
. "
UAPD: Predicting Urban Anomalies from Spatial-Temporal Data
."
Proceedings of the European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)
,
December
2017
.
Abstract
Peace processes are complex, protracted, and contentious, involving significant bargaining and compromising among various societal and political stakeholders. In civil war terminations, it is pertinent to measure the pulse of the nation to ensure that the peace process is responsive to citizens' concerns. Social media yields tremendous power as a tool for dialogue, debate, organization, and mobilization, thereby adding more complexity to the peace process. Using Colombia's final peace agreement and national referendum as a case study, we investigate the influence of two important indicators: intergroup polarization and public sentiment toward the peace process. We present a detailed linguistic analysis to detect intergroup polarization and a predictive model that leverages tweet structure, content, and user-based features to predict public sentiment toward the Colombian peace process. We demonstrate that had pro-accord stakeholders leveraged public opinion from social media, the outcome of the Colombian referendum could have been different.
Aastha Nigam
,
Henry K. Dambanemuya
, Madhav Joshi,
Nitesh Chawla
. "
Harvesting Social Signals to Inform Peace Processes Implementation and Monitoring
."
Big Data Journal, 5(4): 337–355
,
August
2017
.
Abstract
Progress in science has advanced the development of human society across history, with dramatic revolutions shaped by information theory, genetic cloning, and artificial intelligence, among the many scientific achievements produced in the 20th century. However, the way that science advances itself is much less well-understood. In this work, we study the evolution of scientific development over the past century by presenting an anatomy of 89 million digitalized papers published between 1900 and 2015. We find that science has benefited from the shift from individual work to collaborative effort, with over 90% of the world-leading innovations generated by collaborations in this century, nearly four times higher than they were in the 1900s. We discover that rather than the frequent myopic and self-referencing that was common in the early 20th century, modern scientists instead tend to look for literature further back and farther around. Finally, we also observe the globalization of scientific development from 1900 to 2015, including 25-fold and 7-fold increases in international collaborations and citations, respectively, as well as a dramatic decline in the dominant accumulation of citations by the US, the UK, and Germany, from ~95% to ~50% over the same period. Our discoveries are meant to serve as a starter for exploring the visionary ways in which science has developed throughout the past century, generating insight into and an impact upon the current scientific innovations and funding policies.
Yuxiao Dong
, Hao Ma, Zhihong Shen, and Kuansan Wang
. "
A Century of Science: Globalization of Scientific Collaborations, Citations, and Innovations
."
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
,
August
2017
.
Abstract
We study the problem of representation learning in heterogeneous networks. Its unique challenges come from the existence of multiple types of nodes and links, which limit the feasibility of the conventional network embedding techniques. We develop two scalable representation learning models, namely metapath2vec and metapath2vec++. The metapath2vec model formalizes meta-path-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous skip-gram model to perform node embeddings. The metapath2vec++ model further enables the simultaneous modeling of structural and semantic correlations in heterogeneous networks. Extensive experiments show that metapath2vec and metapath2vec++ are able to not only outperform state-of-the-art embedding models in various heterogeneous network mining tasks, such as node classification, clustering, and similarity search, but also discern the structural and semantic correlations between diverse network objects.
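As a rough illustration of the meta-path-guided random walk described above (not the authors' implementation), the sketch below generates walks over a toy author-paper graph following an assumed "A-P-A" meta-path; the resulting node sequences would then be fed to a skip-gram model to learn embeddings.

```python
import random

# Toy heterogeneous graph: authors (A) connected to papers (P).
neighbors = {
    "a1": ["p1", "p2"], "a2": ["p1"], "a3": ["p2"],
    "p1": ["a1", "a2"], "p2": ["a1", "a3"],
}
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P"}

random.seed(0)

def metapath_walk(start, metapath=("A", "P", "A"), length=7):
    walk = [start]
    while len(walk) < length:
        # The next node type is dictated by the meta-path, cycling past the shared end type.
        next_type = metapath[len(walk) % (len(metapath) - 1)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == next_type]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

walks = [metapath_walk(a) for a in ("a1", "a2", "a3")]
print(walks)  # these sequences would then be fed to a skip-gram model
```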
Yuxiao Dong
,
Nitesh V. Chawla
, and Ananthram Swami
. "
metapath2vec: Scalable Representation Learning for Heterogeneous Networks
."
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
,
August
2017
.
Abstract
A widely recognized organizing principle of networks is structural homophily, which suggests that people with more common neighbors are more likely to connect with each other. However, what influence the diverse structures embedded in common neighbors have on link formation is much less well-understood. To explore this problem, we begin by characterizing the structural diversity of common neighborhoods. Using a collection of 120 large-scale networks, we demonstrate that the impact of the common neighborhood diversity on link existence can vary substantially across networks. We find that its positive effect on Facebook and negative effect on LinkedIn suggest different underlying networking needs in these networks. We also discover striking cases where diversity violates the principle of homophily—that is, where fewer mutual connections may lead to a higher tendency to link with each other. We then leverage structural diversity to develop a common neighborhood signature (CNS), which we apply to a large set of networks to uncover unique network superfamilies not discoverable by conventional methods. Our findings shed light on the pursuit to understand the ways in which network structures are organized and formed, pointing to potential advancement in designing graph generation models and recommender systems.
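To make the notion of common-neighborhood structural diversity concrete, here is a small networkx sketch on a toy graph (an editorial example, not the paper's CNS computation) that counts the connected components of the subgraph induced on two nodes' common neighbors.

```python
import networkx as nx

# Toy graph: u and v share four common neighbors; only a and b know each other.
G = nx.Graph([
    ("u", "a"), ("u", "b"), ("u", "c"), ("u", "d"),
    ("v", "a"), ("v", "b"), ("v", "c"), ("v", "d"),
    ("a", "b"),
])

def common_neighborhood_diversity(G, u, v):
    common = list(nx.common_neighbors(G, u, v))
    if not common:
        return 0
    # Diversity = number of connected components among the common neighbors.
    return nx.number_connected_components(G.subgraph(common))

print(common_neighborhood_diversity(G, "u", "v"))  # 3 components: {a, b}, {c}, {d}
```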
Yuxiao Dong
,
Reid A. Johnson
,
Jian Xu
, and
Nitesh V. Chawla
. "
Structural Diversity and Homophily: A Study Across More Than One Hundred Big Networks
."
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
,
August
2017
.
Abstract
Big Data applications have been emerging over the last few years, and researchers from many disciplines are aware of the significant advantages of knowledge extraction from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a "de facto" solution. Basically, it carries out a "divide-and-conquer" distributed procedure in a fault-tolerant way to adapt to commodity hardware. Being still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, inner problems of imbalanced data, namely lack of data and small disjuncts, are accentuated during the data partitioning required to fit the MapReduce programming style. This paper is designed around three main pillars: first, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area; second, to analyze the behavior of standard pre-processing techniques in this particular framework; and finally, taking into account the experimental results obtained throughout this work, to carry out a discussion on the challenges and future directions for the topic.
Alberto Fernández, Sara del Río,
Nitesh V. Chawla
, and Francisco Herrera
. "
An Insight into Imbalanced Big Data Classification: Outcomes and Challenges
."
Complex & Intelligent Systems
,
3
(2)
:105–120,
March
2017
.
Abstract
Analysis of the electroencephalogram (EEG) signal is crucial due to its non-stationary characteristics, which could lead the way to a proper detection method for the treatment of patients with neurological abnormalities, especially epilepsy. The performance of EEG-based epileptic seizure detection relies largely on the quality of selected features from EEG data that characterize seizure activity. This paper presents a novel analysis method for detecting epileptic seizures from the EEG signal using an Improved Correlation-based Feature Selection method (ICFS) with a Random Forest classifier (RF). The analysis involves first applying ICFS to select the most prominent features from the time domain, frequency domain, and entropy-based features. An ensemble of Random Forest (RF) classifiers is then learned on the selected set of features. The experimental results demonstrate that the proposed method shows better performance compared to the conventional correlation-based method and also outperforms some other state-of-the-art methods of epileptic seizure detection on the same benchmark EEG dataset.
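A simplified stand-in for the overall feature-selection-plus-Random-Forest pipeline is sketched below with scikit-learn; it uses a generic mutual-information filter instead of the paper's ICFS method, and random placeholder data instead of extracted EEG features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder data: 200 EEG segments, each described by 40 extracted features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 40))
y = rng.integers(0, 2, 200)   # 1 = seizure segment, 0 = non-seizure

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),                 # filter-style feature selection
    RandomForestClassifier(n_estimators=200, random_state=0),
)
print(cross_val_score(model, X, y, cv=5).mean())            # cross-validated accuracy
```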
Md Mursalin, Yuan Zhang, Yuehui Chen, and
Nitesh V. Chawla
. "
Automated Epileptic Seizure Detection Using Improved Correlation-based Feature Selection with Random Forest Classifier
."
Neurocomputing
,
241
:204–214,
June
2017
.
Abstract
Social sensing is a new big data application paradigm for Cyber-Physical Systems (CPS), where a group of individuals volunteer (or are recruited) to report measurements or observations about the physical world at scale. A fundamental challenge in social sensing applications lies in discovering the correctness of reported observations and reliability of data sources without prior knowledge on either of them. We refer to this problem as truth discovery. While prior studies have made progress on addressing this challenge, two important limitations exist: (i) current solutions do not fully explore the uncertainty aspect of human reported data, which leads to sub-optimal truth discovery results; (ii) current truth discovery solutions are mostly designed as sequential algorithms that do not scale well to large-scale social sensing events. In this paper, we develop a Scalable Uncertainty-Aware Truth Discovery (SUTD) scheme to address the above limitations. The SUTD scheme solves a constraint estimation problem to jointly estimate the correctness of reported data and the reliability of data sources while explicitly considering the uncertainty on the reported data. To address the scalability challenge, SUTD is designed to run on a Graphics Processing Unit (GPU) with thousands of cores, and is shown to run two to three orders of magnitude faster than the sequential truth discovery solutions. In evaluation, we compare our SUTD scheme to the state-of-the-art solutions using three real world datasets collected from Twitter: Paris Attack, Oregon Shooting, and Baltimore Riots, all in 2015. The evaluation results show that our new scheme significantly outperforms the baselines in terms of both truth discovery accuracy and execution time.
Chao Huang, Dong Wang, and
Nitesh V. Chawla
. "
Scalable Uncertainty-Aware Truth Discovery in Big Data Social Sensing Applications for Cyber-Physical Systems
."
IEEE Transactions on Big Data (TBD)
,
May
2017
.
Abstract
Universities often draw from their student body when conducting human subject studies. Unfortunately, as with any longitudinal human studies project, data quality problems arise from students' waning compliance with the study. While incentive mechanisms may be employed to boost student compliance, such systems may not encourage all participants in the same manner. This paper coupled students' compliance rates with other personal data collected via Fitbits, smartphones, and surveys. Machine learning algorithms were then employed to explore factors that influence compliance. With such insight, universities may target groups in their studies who are more likely to become non-compliant and implement preventative strategies such as tailoring their incentive mechanisms to accommodate a diverse population. In doing so, data quality problems stemming from failing compliance can be minimized.
Louis Faust
, Rachael Purta, David Hachen, Aaron Striegel, Christian Poellabauer, Omar Lizardo, and
Nitesh V. Chawla
. "
Exploring Compliance: Observations from a Large Scale Fitbit Study
."
Proceedings of the 2nd International Workshop on Social Sensing (SocialSens)
,
April
2017
.
Abstract
Unlike the conventional first-order network (FoN), the higher-order network (HoN) provides a more accurate description of transitions by creating additional nodes to encode higher-order dependencies. However, there exists no visualization and exploration tool for the HoN. For applications such as the development of strategies to control species invasion through global shipping which is known to exhibit higher-order dependencies, the existing FoN visualization tools are limited. In this paper, we present HoNVis, a novel visual analytics framework for exploring higher-order dependencies of the global ocean shipping network. Our framework leverages coordinated multiple views to reveal the network structure at three levels of detail (i.e., the global, local, and individual port levels). Users can quickly identify ports of interest at the global level and specify a port to investigate its higher-order nodes at the individual port level. Investigating a larger-scale impact is enabled through the exploration of HoN at the local level. Using the global ocean shipping network data, we demonstrate the effectiveness of our approach with a real-world use case conducted by domain experts specializing in species invasion. Finally, we discuss the generalizability of this framework to other real-world applications such as information diffusion in social networks and epidemic spreading through air transportation.
Jun Tao
, Jian Xu, Chaoli Wang, and
Nitesh V. Chawla
. "
HoNVis: Visualizing and Exploring Higher-Order Networks
."
Proceedings of the 10th IEEE Pacific Visualization Symposium (PacificVis)
,
April
2017
.
Abstract
Dropout prediction in MOOCs is a well-researched problem in which we classify which students are likely to persist or drop out of a course. Most research into creating models which can predict outcomes is based on student engagement data. Why these students might be dropping out has only been studied through retroactive exit surveys. This helps identify an important extension area to dropout prediction—how can we interpret dropout predictions at the student and model level? We demonstrate how existing MOOC dropout prediction pipelines can be made interpretable, all while having predictive performance close to existing techniques. We explore each stage of the pipeline as design components in the context of interpretability. Our end result is a layer which longitudinally interprets both predictions and entire classification models of MOOC dropout to provide researchers with in-depth insights into why a student is likely to drop out.
Saurabh Nagrecha
, John Z. Dillon, and
Nitesh V. Chawla
. "
MOOC Dropout Prediction: Lessons Learned from Making Pipelines Interpretable
."
Proceedings of the 26th International Conference on World Wide Web Companion
,
April
2017
.
Abstract
There are real applications that do not demand classifying or making predictions about individual objects, but rather estimating some magnitude about a group of them. For instance, one such case arises in sentiment analysis and opinion mining. Some applications require classifying opinions as positive or negative, but there are others, sometimes even more useful, that just need an estimate of the proportion of each class during a concrete period of time. "How many tweets about our new product were positive yesterday?" Practitioners should apply quantification algorithms to tackle this kind of problem, instead of just using off-the-shelf classification methods, because classifiers are suboptimal in the context of quantification tasks. Unfortunately, quantification learning is still a relatively underexplored area of machine learning. The goal of this paper is to show that quantification learning is an interesting open problem. To support its benefits, we show an application to analyzing Twitter comments in which even the simplest quantification methods outperform classification approaches.
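To make the classifier-versus-quantifier point concrete, the sketch below contrasts naive "classify and count" with the classic adjusted-count correction on synthetic data; it is an editorial illustration of the problem, not a method from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Training data with a 50/50 class balance.
X_train = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(1.5, 1, (500, 2))])
y_train = np.array([0] * 500 + [1] * 500)
clf = LogisticRegression().fit(X_train, y_train)

# Estimate tpr/fpr (reusing the training data here for brevity).
tn, fp, fn, tp = confusion_matrix(y_train, clf.predict(X_train)).ravel()
tpr, fpr = tp / (tp + fn), fp / (fp + tn)

# Test set whose true positive prevalence has shifted to 20%.
X_test = np.vstack([rng.normal(0, 1, (800, 2)), rng.normal(1.5, 1, (200, 2))])
cc = clf.predict(X_test).mean()                      # classify and count
acc = np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0)    # adjusted count correction
print(f"classify-and-count: {cc:.2f}, adjusted-count: {acc:.2f} (true: 0.20)")
```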
Pablo González, Jorge Díez,
Nitesh V. Chawla
, and Juan José del Coz
. "
Why is Quantification an Interesting Learning Problem?
."
Progress in Artificial Intelligence (PRAI)
,
6
(1)
:53–58,
March
2017
.
Abstract
The patatin like phospholipase containing domain 3 gene (PNPLA3) rs738409 C > G polymorphism, one of the most important gene polymorphisms involved in hepatic steatosis, has been reported to interact with different nutrients and dietary patterns on Non-Alcoholic Fatty Liver Disease (NAFLD), but no studies have focused on its interaction with physical activity or sedentary behavior. Therefore, this study aims at determining whether physical activity or sedentary behavior could modulate the effect of the PNPLA3 variant on childhood NAFLD.
Shuo Wang, Jieyun Song, Xiaorui Shang,
Nitesh V. Chawla
, Yide Yang, Xiangrui Meng, Haijun Wang, and Jun Ma
. "
Physical Activity and Sedentary Behavior Can Modulate the Effect of the PNPLA3 Variant on Childhood NAFLD: A Case-Control Study in a Chinese Population
."
BMC Medical Genetics
.
December
2016
.
Abstract
Aging well consists of the following components: management of chronic conditions, maintenance of physical health and cognitive health, and active social engagement. However, the increasing life expectancy, rising healthcare costs, increasing number of older adults and decreasing number of care providers pose challenges to the maintenance of different components for aging well. To that end, technological innovation can help to augment human need and capability along the different components, and help overcome the potential barriers in aging well. In this paper, we summarize the published studies for different tablet-based applications designed specifically for older adults targeting different components of aging well. We discuss the strengths and weaknesses of the applications for each component, and identify the opportunities to develop a cohesive application for addressing the different components of aging well.
Dipanwita Dasgupta
,
Beenish Chaudhry
, Kimberly Green Reeves, and
Nitesh V. Chawla
. "
A Survey of Tablet Applications for Promoting Successful Aging in Older Adults
."
IEEE Access
.
December
2016
.
Abstract
Medication non-adherence is a pressing concern among seniors, leading to a lower quality of life and higher healthcare costs. While mobile applications provide a viable medium for medication management, their utility can be limited without tackling the specific needs of seniors and facilitating the active involvement of care providers. To address these limitations, we are developing a tablet-based application designed specifically for seniors to track their medications and a web portal for their care providers to track medication adherence. In collaboration with a local Aging in Place program, we conducted a three-month study with sixteen participants from an independent living facility. Our study found that the application helped participants to effectively track their medications and improved their sense of wellbeing. Our findings highlight the importance of catering to the needs of seniors and of involving care providers in this process, with specific recommendations for the development of future medication management applications.
Dipanwita Dasgupta
,
Reid A. Johnson
,
Beenish Chaudhry
,
Emily Koh
, and
Nitesh V. Chawla
. "
Design and Evaluation of Medication Adherence Application with Communication for Seniors in Independent Living Communities
."
Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium
.
November
2016
.
Abstract
Twitter provides a platform for information sharing and diffusion, and has quickly emerged as a mechanism for organizations to engage with their consumers. A driving factor for engagement is providing relevant and timely content to users. We posit that the engagement via tweets offers a good potential to discover user interests and leverage that information to target specific content of interest. To that end, we have developed a framework that analyzes tweets to identify the interests of current followers and leverages topic models to deliver a personalized topic profile for each user. We validated our framework by partnering up with a local media company and analyzing the content gap between them and their followers. We also developed a mobile application that incorporates the proposed framework.
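A bare-bones sketch of deriving per-user topic profiles from tweet text with an off-the-shelf topic model (LDA via scikit-learn) is shown below; the tweets, topic count, and preprocessing are illustrative assumptions, not the framework's actual pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: all of a user's tweets concatenated into one document per user.
user_tweets = {
    "user_a": ["great game last night", "the team played well", "love this coach"],
    "user_b": ["new phone released today", "battery life is amazing", "camera review"],
}
docs = [" ".join(tweets) for tweets in user_tweets.values()]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

profiles = lda.transform(X)            # each row sums to 1: the user's topic mixture
for user, profile in zip(user_tweets, profiles):
    print(user, profile.round(2))
```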
Aastha Nigam
, Salvador Aguinaga, and
Nitesh V. Chawla
. "
Connecting the Dots to Infer Followers' Topical Interest on Twitter
."
IEEE International Conference on Behavioral, Economic, and Socio-Cultural Computing (BESC)
.
November
2016
.
Abstract
The analysis of social, communication and information networks for identifying patterns, evolutionary characteristics and anomalies is a key problem for the military, for instance in the Intelligence community. Current techniques do not have the ability to discern unusual features or patterns that are not known a priori. We investigate the use of Deep Learning for network analysis. Over the last few years, Deep Learning has had unprecedented success in areas such as image classification, speech recognition, etc. However, research on applying Deep Learning to network or graph analysis is limited. We discuss the benefits of and challenges in the use of DL for graph analysis, and present three preliminary techniques that we have developed as part of the ARL Network Science CTA program: (a) unsupervised classification using a very highly trained image recognizer, namely Caffe; (b) supervised classification using a variant of convolutional neural networks on node features such as degree and assortativity; and (c) a framework called node2vec for learning representations of nodes in a network using a mapping to natural language processing.
Siddharth Pal,
Yuxiao Dong
,
Nitesh V. Chawla
, Ananthram Swami, and Ram Ramanathan
. "
Deep Learning for Network Analysis: Problems, Approaches and Challenges
."
Proceedings of the 35th Annual International Conference for Military Communications (MILCOM)
.
November
2016
.
Abstract
The emergence of electronic health records (EHRs) has made medical history, including past and current diseases and prescribed medications, easily available. The abundance and richness of the data in EHRs has facilitated the development of many disease prediction systems. Recent disease prediction systems use ICD-9-CM disease codes for calculating the similarity between patients, which is used as a basis for generating the list of probable diseases. However, these codes have certain limitations, such as usually being generated by coders rather than clinicians. Further, accurate codes often require a number of repeated visits to the clinician. Medications, on the other hand, have a stronger correlation with the underlying disease. To exploit this fact, we propose two disease prediction systems: one using medication-based similarity (medCARE) and the other using both disease- and medication-based similarity (combinedCARE). combinedCARE provided greater coverage but a higher average rank when compared to medCARE. Hence, medications can be used to improve the performance of disease prediction systems.
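An illustration of medication-based patient similarity of the kind such a system could build on is sketched below; the binary medication matrix, disease labels, and top-k rule are toy assumptions, not the medCARE algorithm itself.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = patients, columns = medications (1 = prescribed).
meds = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
])
diseases = [{"diabetes", "hypertension"}, {"diabetes"}, {"asthma"}]

query = np.array([[1, 1, 0, 0, 1]])          # new patient's medication profile
scores = cosine_similarity(query, meds)[0]
top = np.argsort(scores)[::-1][:2]           # two most similar patients
candidate_diseases = set().union(*(diseases[i] for i in top))
print(candidate_diseases)                    # candidate diseases to rank for the new patient
```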
Dipanwita Dasgupta
and
Nitesh V. Chawla
. "
MedCare: Leveraging Medication Similarity for Disease Prediction
."
Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA)
.
October
2016
.
Abstract
The aging population (seniors) maintains a strong desire to remain independent as long as possible while maintaining their physical health and emotional well-being. We have witnessed a growth in technological solutions delivered as Apps on mobile devices for health and wellness. However, several challenges remain in their application to assist seniors in the dimensions of health and wellness. They fail to provide sufficient connectivity and a level of usability that is appropriate for seniors. To that end, we are developing a tablet-based application, eSeniorCare, to integrate various components of health and wellbeing targeting successful aging. Our design and development followed an iterative process involving feedback from the participants. We conducted a 31-week study of the application with 16 participants at an independent facility in collaboration with Memorial Hospital of South Bend. Our study found that through the use of the application, participants demonstrated improved technological health management skills and decreased risk for depression, suggesting that eSeniorCare represents an important step toward improving senior care and quality of life.
Dipanwita Dasgupta
, Kimberly Green Reeves,
Beenish Chaudhry
,
Mayra Duarte
, and
Nitesh V. Chawla
. "
eSeniorCare: Technology for Promoting Well-Being of Older Adults in Independent Living Facilities
."
Proceedings of the IEEE International Conference on Health Informatics (ICHI)
.
October
2016
.
Abstract
Over the past decade, the application of data science techniques to clinical data has allowed practitioners and researchers to develop a sundry of analytical models. These models have traditionally relied on structured data drawn from Electronic Medical Records (EMR). Yet, a large portion of EMR data remains unstructured, primarily held within clinical notes. While recent work has produced techniques for extracting structured features from unstructured text, this work generally operates under the untested assumption that all clinical text can be processed in a similar manner. This paper provides what we believe to be the first comprehensive evaluation of the differences between four major sources of clinical text, providing an evaluation of the structural, linguistic, and topical differences among notes of each category. Our conclusions support the premise that tools designed to extract structured data from clinical text must account for the categories of text they process.
Keith Feldman
, Nicholas Hazekamp, and
Nitesh V. Chawla
. "
Mining the Clinical Narrative: All Text Are Not Equal
."
Proceedings of the IEEE International Conference on Health Informatics (ICHI)
.
October
2016
.
Abstract
We study a unique behavioral network data set (based on periodic surveys and on electronic logs of dyadic contact via smartphones) collected at the University of Notre Dame. The participants are a sample of members of the entering class of freshmen in the fall of 2011 whose opinions on a wide variety of political and social issues and activities on campus were regularly recorded—at the beginning and end of each semester—for the first three years of their residence on campus. We create a communication activity network implied by call and text data, and a friendship network based on surveys. Both networks are limited to students participating in the NetSense surveys. We aim at finding student traits and activities on which agreement correlates well with the formation and persistence of links, while disagreement is highly correlated with the non-existence or dissolution of links in the two social networks that we created. Using statistical analysis and machine learning, we observe several traits and activities displaying such correlations, thus being of potential use to predict social network evolution.
Ashwin Bahulkar, Boleslaw K. Szymanski, Omar Lizardo,
Yuxiao Dong
,
Yang Yang
, and
Nitesh V. Chawla
. "
Analysis of Link Formation, Persistence and Dissolution in NetSense Data
."
Proceedings of the 6th Workshop on Social Network Analysis in Applications (SNAA)
.
August
2016
.
Abstract
Organizations, irrespective of their size and type, are increasingly becoming data-driven or aspire to become data-driven. There is a rush to quantify value of their own internal data or the value of integrating their internal data with external data, and performing modeling on such data. A question that analytics teams often grapple with is whether to acquire more data or expend additional effort on more complex modeling, or both. If these decisions can be quantified a priori, it can be used to guide budget and investment decisions. To that end, we quantify the Net Present Value (NPV) of the tasks of additional data acquisition or more complex modeling, which are critical to the data science process. We develop a framework, NPVModel, for a comparative analysis of various external data acquisition and in-house model development scenarios using NPVs of costs and returns as a measure of feasibility. We then demonstrate the effectiveness of NPVModel in prescribing strategies for various scenarios. Our framework not only acts as a suggestion engine, but it also provides valuable insights into budgeting and roadmap planning for Big Data ventures.
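A minimal sketch of the Net Present Value calculation that such scenario comparisons rest on is given below; the cash flows, discount rate, and scenario labels are made-up numbers, not figures from the paper.

```python
# NPV of a stream of net cash flows under a fixed discount rate.
def npv(rate, cash_flows):
    """cash_flows[t] is the net cash flow in period t (index 0 = today)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))

# Hypothetical scenarios:
#   A: buy external data now (-50), modest returns later.
#   B: invest in more complex modeling (-15 in each of two periods), returns later.
scenario_a = [-50.0, 20.0, 25.0, 30.0]
scenario_b = [-15.0, -15.0, 22.0, 28.0]
for name, flows in (("acquire data", scenario_a), ("complex modeling", scenario_b)):
    print(name, round(npv(0.10, flows), 2))
```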
Saurabh Nagrecha
and
Nitesh V. Chawla
. "
Quantifying Decision Making for Data Science: From Data Acquisition to Modeling
."
EPJ Data Science
,
5
:27,
August
2016
.
Abstract
The increasing availability of electronic health care records has provided remarkable progress in the field of population health. In particular, the identification of disease risk factors has flourished under the surge of available data. Researchers can now access patient data across a broad range of demographics and geographic locations. Utilizing this big healthcare data, researchers have been able to empirically identify specific high-risk conditions found within differing populations. However, to date, the majority of studies have approached the issue from the top down, focusing on the prevalence of specific diseases within a population. Through our work we demonstrate the power of addressing this issue bottom-up by identifying specifically which diseases are higher-risk for a specific population. In this work we demonstrate that network-based analysis can present a foundation to identify pairs of diagnoses that differentiate across population segments. We provide a case study highlighting differences between high and low income individuals in the United States. This work is particularly valuable when addressing population health management within resource-constrained environments such as community health programs, where it can be used to provide insight and resource planning for targeted care for the population served.
Keith Feldman
, Gregor Stiglic,
Dipanwita Dasgupta
, Mark Kricheff, Zoran Obradovic, and
Nitesh V. Chawla
. "
Insights into Population Health Management Through Disease Diagnoses Networks
."
Scientific Reports
,
6
:30465,
July
2016
.
Abstract
Kenyans use mobile money services to transfer money to friends and relatives via mobile phone text messaging. Kenya's M-Pesa is one of the most successful examples of digital money for financial inclusion. This article uses social network analysis and ethnographic information to examine ties to and through women in 12 mobile money transfer networks of kin, drawn from field data collected in 2012, 2013, and 2014. The social networks are based on reciprocal and dense ties among siblings and parents, especially mothers. Men participate equally in social networks, but as brothers and mother's brothers more often than as fathers. The matrilineal ties of mobile money circulate value within the hearthhold (Ekejiuba 2005) of women, their children, and others connected to them. Using remittances, families negotiate investments in household farming or work, education, and migration. Money sending supports the diverse economic strategies, flexible kinship ties, and mobility of hearthholds. Gifts of e-money are said to express a natural love and caring among mothers and siblings and are often private and personal. Consequently, the money circulations of the hearthhold avoid disrupting widely shared ideals of patrilineal solidarity and household autonomy.
Sibel Kusimba,
Yang Yang
, and
Nitesh V. Chawla
. "
Hearthholds of Mobile Money in Western Kenya
."
Economic Anthropology
,
3
(2)
:266–279,
June
2016
.
Abstract
To ensure the correctness of network analysis methods, the network (as the input) has to be a sufficiently accurate representation of the underlying data. However, when representing sequential data from complex systems such as global shipping traffic or web clickstream traffic as networks, conventional network representations that implicitly assume the Markov property (first-order dependency) can quickly become limiting. This assumption holds that when movements are simulated on the network, the next movement depends only on the current node, discounting the fact that the movement may depend on several previous steps. However, we show that data derived from many complex systems can show up to fifth-order dependencies. In these cases, the oversimplifying assumption of the first-order network representation can lead to inaccurate network analysis results. To address this problem, we propose the Higher-Order Network (HON) representation that can discover and embed variable orders of dependencies in a network representation. Through a comprehensive empirical evaluation and analysis, we establish several desirable characteristics of HON, including accuracy, scalability, and direct compatibility with the existing suite of network analysis methods. We illustrate how HON can be applied to a broad variety of tasks, such as random walking, clustering, and ranking, and we demonstrate that by using it as input, HON yields more accurate results without any modification to these tasks.
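A much-simplified illustration of the idea behind variable-order dependencies is sketched below: check whether the distribution of the next step changes when one more step of history is kept. The toy trajectories and the informal comparison are assumptions, not the paper's dependency-detection rule.

```python
from collections import Counter, defaultdict

# Toy movement sequences: what follows B depends on where the walk came from.
trajectories = [
    ["A", "B", "C"], ["A", "B", "C"], ["A", "B", "C"],
    ["D", "B", "E"], ["D", "B", "E"], ["D", "B", "E"],
]

first_order = defaultdict(Counter)    # counts for P(next | current)
second_order = defaultdict(Counter)   # counts for P(next | previous, current)
for t in trajectories:
    for i in range(len(t) - 1):
        first_order[t[i]][t[i + 1]] += 1
        if i > 0:
            second_order[(t[i - 1], t[i])][t[i + 1]] += 1

def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

# From B alone, C and E look equally likely; conditioning on the previous step
# makes the next step deterministic, so contexts like (A, B) and (D, B) would
# warrant their own higher-order nodes (e.g., B|A and B|D).
print(normalize(first_order["B"]))
print({ctx: normalize(c) for ctx, c in second_order.items()})
```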
Jian Xu, Thanuka L. Wickramarathne, and Nitesh V. Chawla. "Representing Higher-Order Dependencies in Networks." Science Advances, 2(5):e1600028, May 2016.
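To make the idea of variable-order dependencies concrete, here is a minimal Python sketch, not the published BuildHON algorithm; the function names, toy clickstreams, and the KL-divergence threshold are illustrative assumptions. It keeps a higher-order context only when conditioning on it changes the next-step distribution relative to the first-order baseline.

# Illustrative sketch (assumed names/threshold): detect variable-order
# dependencies by comparing higher-order next-step distributions against
# the first-order baseline and keeping only the contexts that matter.
from collections import defaultdict
from math import log

def next_step_counts(sequences, order):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - order):
            context = tuple(seq[i:i + order])
            counts[context][seq[i + order]] += 1
    return counts

def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def kl(p, q, eps=1e-9):
    # KL divergence of p from q over p's support
    return sum(pi * log(pi / q.get(k, eps)) for k, pi in p.items())

def higher_order_rules(sequences, max_order=2, threshold=0.1):
    first = {c: normalize(d) for c, d in next_step_counts(sequences, 1).items()}
    rules = dict(first)  # always keep the first-order rules
    for order in range(2, max_order + 1):
        for context, dist in next_step_counts(sequences, order).items():
            p = normalize(dist)
            q = first[(context[-1],)]      # first-order baseline of the last node
            if kl(p, q) > threshold:       # the longer history genuinely matters
                rules[context] = p
    return rules

# Toy clickstreams: where a visit goes after C depends on where it came from.
seqs = [list("ACD")] * 8 + [list("BCE")] * 8
for ctx, dist in higher_order_rules(seqs).items():
    print(ctx, dist)

On these toy clickstreams, the destination after C depends on whether the visit arrived from A or B, so the sketch retains the second-order contexts (A, C) and (B, C); this is exactly the kind of pattern a first-order representation would average away.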
Abstract
Low-income older adults are at risk of poor health outcomes due to rising healthcare costs. When designed to reflect their needs, technologies can provide a cost-effective solution to this problem. We investigated the issues and strategies low-income people encounter in managing their health by conducting interviews and focus groups with fifty older adults. The qualitative analysis revealed eight main themes participants considered important for successful aging. Based on these findings, we propose two design principles to promote successful aging in the target population: (a) support shift-and-persist strategies; and (b) combat impoverished mentalities by providing knowledge, resources and motivation.
Beenish Chaudhry, Kimberly Green Reeves, and Nitesh V. Chawla. "Successful Aging for Low-Income Older Adults: Towards Design Principles." Proceedings of the 10th EAI International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth), May 2016.
Abstract
There is an increasing trend amongst users to consume information from websites and social media. With the huge influx of content, it becomes challenging for consumers to navigate to topics or articles that interest them. Particularly in health care, the content consumed by a user is controlled by various factors such as demographics and lifestyle. In this paper, we use a semi-bipartite network model to capture the interactions between users and the health topics that interest them. We use a supervised link prediction approach to recommend topics to users based on their past reading behavior and contextual data associated with each user, such as demographics.
Aastha Nigam and Nitesh V. Chawla. "Link Prediction in a Semi-bipartite Network for Recommendation." Proceedings of the Asian Conference on Intelligent Information and Database Systems (ACIIDS), pp. 127–135, March 2016.
Abstract
A widely used measure of scientific impact is citations. However, due to their heavy-tailed distribution, citations are fundamentally difficult to predict. Instead, to characterize scientific impact, we address two analogous questions asked by many scientific researchers: "How will my h-index evolve over time, and which of my previously or newly published papers will contribute to it?" To answer these questions, we perform two related tasks. First, we develop a model to predict authors' future h-indices based on their current scientific impact. Second, we examine the factors that drive papers—either previously or newly published—to increase their authors' predicted future h-indices. By leveraging relevant factors, we can predict an author's h-index in five years with an R2 value of 0.92 and whether a previously (newly) published paper will contribute to this future h-index with an F1 score of 0.99 (0.77). We find that topical authority and publication venue are crucial to these effective predictions, while topic popularity is surprisingly inconsequential. Further, we develop an online tool that allows users to generate informed h-index predictions. Our work demonstrates the predictability of scientific impact, and can help researchers to effectively leverage their scholarly position of "standing on the shoulders of giants."
Yuxiao Dong, Reid A. Johnson, and Nitesh V. Chawla. "Can Scientific Impact Be Predicted?" IEEE Transactions on Big Data (TBD), 2(1):18–30, March 2016.
Abstract
Undersampling is a popular technique for unbalanced datasets to reduce the skew in class distributions. However, it is well-known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between conditional probability in the presence and absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts the classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.
Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi. "Calibrating Probability with Undersampling for Unbalanced Classification." Proceedings of the 6th IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 159–166, IEEE, December 2015.
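As a concrete illustration of the bias discussed above, the short Python sketch below applies the standard prior-correction formula for negative-class undersampling, p = beta * p_s / (beta * p_s - p_s + 1), where beta is the fraction of negatives retained; the function name and example values are ours, not the paper's.

# Sketch (assumed names/values): map posteriors estimated on a negatively
# undersampled training set back toward the original class prior.
import numpy as np

def calibrate_undersampled_probs(p_s, beta):
    """p_s: biased posteriors from the undersampled model.
    beta: probability with which each negative was kept during undersampling."""
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Example: after keeping 10% of negatives, a biased score of 0.6 corresponds
# to a much smaller calibrated probability under the original prior.
print(calibrate_undersampled_probs([0.2, 0.6, 0.9], beta=0.1))

Equivalently, in the cost-neutral case, classifying at 0.5 on the calibrated scale corresponds to classifying at 1/(1 + beta) on the biased scale, which is the kind of threshold adjustment the abstract describes.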
Abstract
Link prediction is a popular research area with important applications in a variety of disciplines, including biology, social science, security, and medicine. The fundamental requirement of link prediction is the accurate and effective prediction of new links in networks. While there are many different methods proposed for link prediction, we argue that the practical performance potential of these methods is often unknown because of challenges in the evaluation of link prediction, which impact the reliability and reproducibility of results. We describe these challenges, provide theoretical proofs and empirical examples demonstrating how current methods lead to questionable conclusions, show how the fallacy of these conclusions is illuminated by methods we propose, and develop recommendations for consistent, standard, and applicable evaluation metrics. We also recommend the use of precision-recall threshold curves and associated areas in lieu of receiver operating characteristic curves due to complications that arise from extreme imbalance in the link prediction classification problem.
Yang Yang, Ryan N. Lichtenwalter, and Nitesh V. Chawla. "Evaluating Link Prediction Methods." Knowledge and Information Systems (KAIS), 45(3):751–782, December 2015.
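The recommendation to prefer precision-recall analysis under extreme class imbalance is easy to reproduce; the small Python sketch below uses synthetic scores and a made-up class ratio (assumptions, not the paper's data) to show ROC AUC remaining flattering while the precision-recall summary exposes how hard the problem actually is.

# Illustrative comparison of ROC AUC vs. precision-recall AUC under the
# extreme imbalance typical of link prediction (synthetic data).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 100_000            # very few true links among candidate pairs
y = np.r_[np.ones(n_pos), np.zeros(n_neg)]
# Scores: positives slightly higher on average, with heavy overlap.
scores = np.r_[rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)]

print("ROC AUC:", roc_auc_score(y, scores))            # looks comfortable
print("PR  AUC:", average_precision_score(y, scores))  # reveals the difficulty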
Abstract
This paper develops a new principled framework for exploiting time-sensitive information to improve truth discovery accuracy in social sensing applications. This work is motivated by the emergence of social sensing as a new paradigm for collecting observations about the physical environment from humans or devices acting on their behalf. These observations may be true or false, and hence are viewed as binary claims. A fundamental problem in social sensing applications lies in ascertaining the correctness of claims and the reliability of data sources; we refer to this problem as truth discovery. Time is a critical dimension that needs to be carefully exploited in truth discovery solutions. In this paper, we develop a new time-sensitive truth discovery scheme that explicitly incorporates source responsiveness and claim lifespan into a rigorous analytical framework. The new truth discovery scheme solves a maximum likelihood estimation problem to determine both claim correctness and source reliability. We compare our time-sensitive scheme with state-of-the-art baselines through an extensive simulation study and a real-world case study. The evaluation results show that our new scheme outperforms all compared baselines and significantly improves truth discovery accuracy in social sensing applications.
Chao Huang, Dong Wang, and Nitesh V. Chawla. "Towards Time-Sensitive Truth Discovery in Social Sensing Applications." Proceedings of the 12th IEEE International Conference on Mobile Ad Hoc and Sensor Systems (MASS), pp. 154–162, October 2015.
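For orientation, the Python sketch below implements a generic, time-agnostic truth-discovery baseline of the kind such schemes are compared against: it alternates between estimating claim correctness from reliability-weighted source votes and re-estimating source reliability from agreement. It is not the time-sensitive maximum-likelihood scheme proposed in the paper; all names and the toy reports are illustrative assumptions.

# Generic truth-discovery baseline (illustrative only, not the paper's method).
from collections import defaultdict

def truth_discovery(reports, n_iter=20):
    """reports: list of (source, claim, value) tuples with value in {0, 1}."""
    sources = {s for s, _, _ in reports}
    claims = {c for _, c, _ in reports}
    reliability = {s: 0.8 for s in sources}   # initial guess for every source
    belief = {c: 0.5 for c in claims}
    by_claim = defaultdict(list)
    for s, c, v in reports:
        by_claim[c].append((s, v))
    for _ in range(n_iter):
        # Claim belief = reliability-weighted average of the votes it received.
        for c, votes in by_claim.items():
            num = sum(reliability[s] * v for s, v in votes)
            den = sum(reliability[s] for s, v in votes)
            belief[c] = num / den if den else 0.5
        # Source reliability = average agreement with the current beliefs.
        agree = defaultdict(list)
        for s, c, v in reports:
            agree[s].append(1 - abs(belief[c] - v))
        reliability = {s: sum(a) / len(a) for s, a in agree.items()}
    return belief, reliability

reports = [("s1", "c1", 1), ("s2", "c1", 1), ("s3", "c1", 0),
           ("s1", "c2", 0), ("s3", "c2", 1)]
print(truth_discovery(reports))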
Abstract
As access to broadband continues to grow along with the now almost ubiquitous availability of mobile phones, the landscape of the e-content delivery space has never been so dynamic. To establish their position in the market, businesses are beginning to realize that understanding each of their customers' likes and dislikes is perhaps as important as the offered content itself. Further, a number of companies are also delivering content, product previews, advertisements, etc. via video on their sites. The question remains: how effective are video engagement channels on sites? Can that user engagement be quantified? Clickstream data can furnish important insight into those questions using videos as a communication or messaging medium. To that end, focusing on a large set of web portals owned and managed by a private media company, we propose methods using these sites' clickstream data that can be used to provide a deeper understanding of their visitors, as well as their interests and preferences. We further expand the use of this data to show that it can be effectively used to predict user engagement to video streams, quantifying that metric by means of a survival analysis assessment.
Everaldo Aguiar, Saurabh Nagrecha, and Nitesh V. Chawla. "Predicting Online Video Engagement Using Clickstreams." Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1–10, October 2015.
Abstract
Today, advances in medical informatics brought on by the increasing availability of electronic medical records (EMR) have allowed for the proliferation of data-centric tools, especially in the context of personalized healthcare. While these tools have the potential to greatly improve the quality of patient care, the effective utilization of their techniques within clinical practice may encounter two significant challenges. First, the increasing amount of electronic data generated by clinical processes can impose scalability challenges for current computational tools, requiring parallel or distributed implementations of such tools to scale. Second, as technology becomes increasingly intertwined in clinical workflows, these tools must not only operate efficiently but also in an interpretable manner. Failure to identify areas of uncertainty or provide appropriate context creates a potentially complex situation for both physicians and patients. This paper presents a case study investigating the issues associated with first scaling a disease prediction algorithm to accommodate dataset sizes expected in large medical practices. It then provides an analysis of the diagnosis predictions, attempting to provide contextual information that conveys the certainty of the results to a physician. Finally, it investigates latent demographic features of the patients themselves, which may have an impact on the accuracy of the diagnosis predictions.
Keith Feldman, Darcy A. Davis, and Nitesh V. Chawla. "Scaling and Contextualizing Personalized Healthcare: A Case Study of Disease Prediction Algorithm Integration." Journal of Biomedical Informatics (JBI), vol. 57, pp. 377–385, October 2015.
Abstract
Some students, for a variety of factors, struggle to complete high school on time. To address this problem, school districts across the U.S. use intervention programs to help struggling students get back on track academically. Yet in order to best apply those programs, schools need to identify off-track students as early as possible and enroll them in the most appropriate intervention. Unfortunately, identifying and prioritizing students in need of intervention remains a challenging task. This paper describes work that builds on current systems by using advanced data science methods to produce an extensible and scalable predictive framework for providing partner U.S. public school districts with individual early warning indicator systems. Our framework employs machine learning techniques to identify struggling students and describe features that are useful for this task, evaluating these techniques using metrics important to school administrators. By doing so, our framework, developed with the common need of several school districts in mind, provides a common set of tools for identifying struggling students and the factors associated with their struggles. Further, by integrating data from disparate districts into a common system, our framework enables cross-district analyses to investigate common early warning indicators not just within a single school or district, but across the U.S. and beyond.
Reid A. Johnson, Ruobin Gong, Siobhan Greatorex-Voith, Anushka Anand, and Alan Fritzler. "A Data-Driven Framework for Identifying High School Students at Risk of Not Graduating on Time." Bloomberg Data for Good Exchange, Bloomberg, September 2015.
Abstract
In this work, we unveil the evolution of social relationships across the lifespan. This evolution reflects the dynamic social strategies that people use to fulfill their social needs. For this work we utilize a large mobile network complete with user demographic information. We find that while younger individuals are active in broadening their social relationships, seniors tend to keep small but closed social circles. We further demonstrate that opposite-gender interactions between two young individuals are much more frequent than those between young same-gender people, while the situation is reversed after around 35 years old. We also discover that while same-gender triadic social relationships are persistently maintained over a lifetime, the opposite-gender triadic circles are unstable upon entering into middle-age. Finally we demonstrate a greater than 80% potential predictability for inferring users' gender and a 73% predictability for age from mobile communication behaviors.
Yuxiao Dong, Nitesh V. Chawla, Jie Tang, and Yang Yang. "The Evolution of Social Relationships and Strategies Across the Lifespan." Proceedings of the 26th European Conference on Machine Learning and the 19th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pp. 245–249, September 2015.
Abstract
The pervasiveness and availability of mobile phone data offer the opportunity of discovering usable knowledge about crowd behaviors in urban environments. Cities can leverage such knowledge to provide better services (e.g., public transport planning, optimized resource allocation) and to become safer. Call Detail Record (CDR) data represents a practical data source for detecting and monitoring unusual events, given the high level of mobile phone penetration compared with GPS-equipped and open devices. In this paper, we provide a methodology that is able to detect unusual events from CDR data, which typically has low accuracy in terms of space and time resolution. Moreover, we introduce a concept of unusual event that involves a large number of people who exhibit unusual mobility behavior. Our careful consideration of the issues that come from coarse-grained CDR data ultimately leads to a completely general framework that can detect unusual crowd events from CDR data effectively and efficiently. Through extensive experiments on real-world CDR data for a large city in Africa, we demonstrate that our method can detect unusual events with 16% higher recall and over 10 times higher precision, compared to state-of-the-art methods. We implement a visual analytics prototype system to help end users analyze detected unusual crowd events to best suit different application scenarios. To the best of our knowledge, this is the first work on the detection of unusual events from CDR data that accounts for its temporal and spatial sparseness and distinguishes between users' unusual activities and daily routines.
Yuxiao Dong, Fabio Pinelli, Yiannis Gkoufas, Zubair Nabi, Francesco Calabrese, and Nitesh V. Chawla. "Inferring Unusual Crowd Events from Mobile Phone Call Detail Records." Proceedings of the 26th European Conference on Machine Learning and the 19th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pp. 474–492, September 2015.
Abstract
This research examines the interplay between social networks and mobile money remittances in Western Kenya. Research was conducted in Kenya's Bungoma and Trans-Nzoia counties in 2012, 2013, and 2014, involving 12 family networks of between 8 and 70 people. Using small and frequent digital money transfers, relatives provide for household and emergency needs, contribute to ceremonies, and help pay school fees and medical bills. We find that digital money transfers follow and reinforce preexisting forms of emotional support and social relationships. In these families, the transfers strengthen maternal kinship ties as well as relationships among siblings and cousins. Money networks are reciprocal, such that senders are also receivers, and individuals have many connections through which to access resources. Some individuals are "central" in networks, having more connections; others broker flows of e-value from one group of relatives to another. Mobile money strengthens social bonds but can also disrupt social relationships, as when digital value and remittances are hidden from in-laws or spouses.
Sibel B. Kusimba, Yang Yang, and Nitesh V. Chawla. "Family Networks of Mobile Money in Kenya." Information Technologies & International Development (ITID), 11(3):1–21, September 2015.
Abstract
Collaboration is an integral element of the scientific process that often leads to findings with significant impact. While extensive efforts have been devoted to quantifying and predicting research impact, the question of how collaborative behavior influences scientific impact remains unaddressed. In this work, we study the interplay between scientists' collaboration signatures and their scientific impact. As the basis of our study, we employ an ArnetMiner dataset with more than 1.7 million authors and 2 million papers spanning over 60 years. We formally define a scientist's collaboration signature as the distribution of collaboration strengths with each collaborator in his or her academic ego network, which is quantified by four measures: sociability, dependence, diversity, and self-collaboration. We then demonstrate that the collaboration signature allows us to effectively distinguish between researchers with dissimilar levels of scientific impact. We also discover that, even from the early stages of one's research career, a scientist's collaboration signature can help to reveal his or her future scientific impact. Finally, we find that, as a representative group of outstanding computer scientists, Turing Award winners collectively produce distinctive collaboration signatures throughout the entirety of their careers. Our conclusions on the relationship between collaboration signatures and scientific impact give rise to important implications for researchers who wish to expand their scientific impact and more effectively stand on the shoulders of "collaborators."
Yuxiao Dong, Reid A. Johnson, Yang Yang, and Nitesh V. Chawla. "Collaboration Signatures Reveal Scientific Impact." Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 480–487, August 2015.
Abstract
Interactions in dynamic networks often transcend the dyadic barrier and emerge as subgraphs. The evolution of these subgraphs cannot be completely predicted using a pairwise link prediction analysis. We propose a novel solution to this problem, "Prediction of Recurrent Subgraphs (PReSub)," which treats subgraphs as individual entities in their own right. PReSub predicts recurring subgraphs using the network's vector space embedding and a set of "early warning subgraphs" which act as global and local descriptors of the subgraph's behavior. PReSub can be used as an out-of-the-box pipeline method with user-provided subgraphs or even to discover interesting subgraphs in an unsupervised manner. It can handle missing network information and is parallelizable. We show that PReSub outperforms traditional pairwise link prediction for a variety of evolving network datasets. The goal of this framework is to improve our understanding of subgraphs and provide an alternative representation in order to characterize their behavior.
Saurabh Nagrecha, Nitesh V. Chawla, and Horst Bunke. "Recurrent Subgraph Prediction." Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 416–423, August 2015.
Abstract
We study the problem of link prediction in coupled networks, where we have the structure information of one (source) network and the interactions between this network and another (target) network. The goal is to predict the missing links in the target network. The problem is extremely challenging, as we do not have any structural information about the target network. Moreover, the source and target networks are usually heterogeneous and have different types of nodes and links. How can we utilize the structure information in the source network for predicting links in the target network? How can we leverage the heterogeneous interactions between the two networks for the prediction task? We propose a unified framework, CoupledLP, to solve the problem. Given two coupled networks, we first leverage atomic propagation rules to automatically construct implicit links in the target network, addressing the challenge of target network incompleteness, and then propose a Coupled Factor Graph Model to incorporate the meta-paths extracted from the coupled part of the two networks for transferring heterogeneous knowledge. We evaluate the proposed framework on two different genres of datasets: disease-gene (DG) and mobile social networks. In the DG networks, we aim to use the disease network to predict the associations between genes. In the mobile networks, we aim to use the mobile communication network of one mobile operator to infer the network structure of its competitors. On both datasets, the proposed CoupledLP framework outperforms several alternative methods. The proposed problem of coupled link prediction and the corresponding framework demonstrate both scientific and business applications in biology and social networks.
Yuxiao Dong, Jing Zhang, Jie Tang, Nitesh V. Chawla, and Bai Wang. "CoupledLP: Link Prediction in Coupled Networks." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 199–208, August 2015.
Abstract
Many school districts have developed successful intervention programs to help students graduate high school on time. However, identifying and prioritizing students who need those interventions the most remains challenging. This paper describes a machine learning framework to identify such students, discusses features that are useful for this task, applies several classification algorithms, and evaluates them using metrics important to school administrators. To help test this framework and make it practically useful, we partnered with two U.S. school districts with a combined enrollment of approximately 200,000 students. We together designed several evaluation metrics to assess the goodness of machine learning algorithms from an educator's perspective. This paper focuses on students at risk of not finishing high school on time, but our framework lays a strong foundation for future work on other adverse academic outcomes.
Himabindu Lakkaraju, Everaldo Aguiar, Carl Shan, David I. Miller, Rayid Ghani, and Kecia L. Addison. "A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1908–1918, August 2015.
Abstract
On April 2nd, 2014, the Department of Health and Human Services (HHS) announced a historic policy in its effort to increase transparency in the American healthcare system. The Centers for Medicare and Medicaid Services (CMS) would publicly release a dataset containing information about the types of Medicare services, requested charges, and payments issued by providers across the country. In its release, HHS stated that the data would shed light on Medicare fraud, waste, and abuse. While this is most certainly true, we believe that it can provide so much more. Beyond the purely financial aspects of procedure charges and payments, the procedures themselves may provide us with additional information, not only about the Medicare population, but also about the physicians themselves. The procedures a physician performs are for the most part not novel, but rather recommended, observed, and studied. However, whether a physician decides on advocating a procedure is somewhat discretionary. Some patients require a clear course of action, while others may benefit from a variety of options. This article poses the following question: How does a physician's past experience in medical school shape his or her practicing decisions? This article aims to open the analysis into how data, such as the CMS Medicare release, can help further our understanding of knowledge transfer and how experiences during education can shape a physician's decisions over the course of his or her career. This work begins with an evaluation of similarities between medical school charges, procedures, and payments. It then details how schools' procedure choices may link them in other, more interesting ways. Finally, the article includes a geographic analysis of how medical school procedure payments and charges are distributed nationally, highlighting potential deviations.
Keith Feldman and Nitesh V. Chawla. "Does Medical School Training Relate to Practice? Evidence from Big Data." Big Data Journal, 3(2):103–113, June 2015.
Abstract
The deployment of classification models is an integral component of many modern data mining and machine learning applications. A typical classification model is built with the tacit assumption that the deployment scenario by which it is evaluated is fixed and fully characterized. Yet, in the practical deployment of classification methods, important aspects of the application environment, such as the misclassification costs, may be uncertain during model building. Moreover, a single classification model may be applied in several different deployment scenarios. In this work, we propose a method to optimize a model for uncertain deployment scenarios. We begin by deriving a relationship between two evaluation measures, H measure and cost curves, that may be used to address uncertainty in classifier performance. We show that when uncertainty in classifier performance is modeled as a probabilistic belief that is a function of this underlying relationship, a natural definition of risk emerges for both classifiers and instances. We then leverage this notion of risk to develop a boosting-based algorithm—which we call RiskBoost—that directly mitigates classifier risk, and we demonstrate that it outperforms AdaBoost on a diverse selection of datasets.
Reid A. Johnson, Troy Raeder, and Nitesh V. Chawla. "Optimizing Classifiers for Hypothetical Scenarios." Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 264–276, April 2015.
Abstract
Several hundred thousand students drop out of high school every year in the United States. Interventions can help those who are falling behind in their educational goals, but given limited resources, such programs must focus on the right students, at the right time, and with the right message. In this paper, we describe an incremental approach that can be used to select and prioritize students who may be at risk of not graduating high school on time, and to suggest what may be the predictors of particular students going off-track. These predictions can then be used to inform targeted interventions for these students, hopefully leading to better outcomes.
Everaldo Aguiar, Himabindu Lakkaraju, Nasir Bhanpuri, David I. Miller, Ben Yuhas, and Kecia L. Addison. "Who, When, and Why: A Machine Learning Approach to Prioritizing Students at Risk of Not Graduating High School on Time." Proceedings of the 5th International Conference on Learning Analytics and Knowledge (LAK), pp. 93–102, March 2015.
Abstract
The collection and analysis of student-level data is quickly becoming the norm across school campuses. More and more institutions are starting to use this resource as a window into better understanding the needs of their student population. In previous work, we described the use of electronic portfolio data as a proxy to measuring student engagement, and showed how it can be predictive of student retention. This paper highlights our ongoing efforts to explore and measure the valence of positive and negative emotions in student reflections and how they can serve as an early warning indicator of student disengagement.
Frederick Nwanganga, Everaldo Aguiar, G. Alex Ambrose, Victoria E. Goodrich, and Nitesh V. Chawla. "Qualitatively Exploring Electronic Portfolios: A Text Mining Approach to Measuring Student Emotion as an Early Warning Indicator." Proceedings of the 5th International Conference on Learning Analytics and Knowledge (LAK), pp. 422–423, March 2015.
Abstract
Social status, defined as the relative rank or position that an individual holds in a social hierarchy, is known to be among the most important motivating forces in social behaviors. In this paper, we consider the notion of status from the perspective of a position or title held by a person in an enterprise. We study the intersection of social status and social networks in an enterprise, asking whether enterprise communication logs can help reveal how social interactions and individual status manifest themselves in social networks. To that end, we use two enterprise datasets with three communication channels (voice call, short message, and email) to demonstrate the social-behavioral differences among individuals with different status. We report several interesting findings, and based on these findings we also develop a model to predict social status. On the individual level, high-status individuals are more likely to span structural holes by linking to people in parts of the enterprise networks that are otherwise not well connected to one another. On the community level, the principles of homophily, social balance, and clique theory generally indicate a "rich club" maintained by high-status individuals, in the sense that this community is much more connected, balanced, and dense. Our model can predict the social status of individuals with 93% accuracy.
Yuxiao Dong, Jie Tang, and Nitesh V. Chawla. "Inferring Social Status and Rich Club Effects in Enterprise Communication Networks." PLoS ONE, 10(3):e0119446, February 2015.
Abstract
Scientific impact plays a central role in the evaluation of the output of scholars, departments, and institutions. A widely used measure of scientific impact is citations, with a growing body of literature focused on predicting the number of citations obtained by any given publication. The effectiveness of such predictions, however, is fundamentally limited by the power-law distribution of citations, whereby publications with few citations are extremely common and publications with many citations are relatively rare. Given this limitation, in this work we instead address a related question asked by many academic researchers in the course of writing a paper, namely: "Will this paper increase my h-index?" Using a real academic dataset with over 1.7 million authors, 2 million papers, and 8 million citation relationships from the premier online academic service ArnetMiner, we formalize a novel scientific impact prediction problem to examine several factors that can drive a paper to increase the primary author's h-index. We find that the researcher’s authority on the publication topic and the venue in which the paper is published are crucial factors to the increase of the primary author's h-index, while the topic popularity and the co-authors' h-indices are of surprisingly little relevance. By leveraging relevant factors, we find a greater than 87.5% potential predictability for whether a paper will contribute to an author's h-index within five years. As a further experiment, we generate a self-prediction for this paper, estimating that there is a 76% probability that it will contribute to the h-index of the co-author with the highest current h-index in five years. We conclude that our findings on the quantification of scientific impact can help researchers to expand their influence and more effectively leverage their position of "standing on the shoulders of giants."
Yuxiao Dong, Reid A. Johnson, and Nitesh V. Chawla. "Will This Paper Increase Your h-Index? Scientific Impact Prediction." Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM), pp. 149–158, January 2015.
Abstract
As providers of higher education begin to harness the power of big data analytics, one very fitting application for these new techniques is that of predicting student attrition. The ability to pinpoint students who might soon decide to drop out, or who may be following a suboptimal path to success, allows those in charge to not only understand the causes for this undesired outcome, but it also provides room for the development of early intervention systems. While making such inferences based on academic performance data alone is certainly possible, we claim that in many cases there is no substantial correlation between how well a student performs and his or her decision to withdraw. This is especially true when the overall set of students has a relatively similar academic performance. To address this issue, we derive measurements of engagement from students' electronic portfolios and show how these features can be effectively used to augment the quality of predictions.
Everaldo Aguiar, G. Alex Ambrose, Nitesh V. Chawla, Victoria E. Goodrich, and Jay B. Brockman. "Engagement vs Performance: Using Electronic Portfolios to Predict First Semester Engineering Student Persistence." Journal of Learning Analytics, 1(3):7–33, November 2014.
Abstract
Centrality of a node measures its relative importance within a network. There are a number of applications of centrality, including inferring the influence or success of an individual in a social network, and the resulting social network dynamics. While we can compute the centrality of any node in a given network snapshot, a number of applications are also interested in knowing the potential importance of an individual in the future. However, current centrality is not necessarily an effective predictor of future centrality. While there are different measures of centrality, we focus on degree centrality in this paper. We develop a method that reconciles preferential attachment and triadic closure to capture a node's prominence profile. We show that the proposed node prominence profile method is an effective predictor of degree centrality. Notably, our analysis reveals that individuals in the early stage of evolution display a distinctive and robust signature in degree centrality trend, adequately predicted by their prominence profile. We evaluate our work across four real-world social networks. Our findings have important implications for the applications that require prediction of a node's future degree centrality, as well as the study of social network dynamics.
Yang Yang, Yuxiao Dong, and Nitesh V. Chawla. "Predicting Node Degree Centrality with Node Prominence Profile." Scientific Reports, 4:7236, November 2014.
Abstract
In today's healthcare environment, nurses play an integral role in determining patient outcomes. This role becomes especially clear in intensive care units such as the Neonatal Intensive Care Unit (NICU). In the NICU, critically ill infants rely almost completely on the care of these nurses for survival. Given the importance of their role, and the volatile conditions of the infants, it is imperative that nurses be able to focus on the infants in their charge. In order to provide this level of care, there must be an appropriate infant-to-nurse ratio each day. However, traditional staffing models often rely on a number of factors, including historical census counts, which, when incorrect, leave a NICU at risk of barely reaching, or even falling below, the recommended staffing level. This work presents the novel ADMIT (Admission Duration Model for Infant Treatment) model, which yields personalized length-of-stay estimates for an infant, utilizing data available from the time of admission to the NICU.
Keith Feldman and Nitesh V. Chawla. "Admission Duration Model for Infant Treatment (ADMIT)." Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 583–587, November 2014.
Abstract
Demographics are widely used in marketing to characterize different types of customers. However, in practice, demographic information such as age, gender, and location is usually unavailable due to privacy and other reasons. In this paper, we aim to harness the power of big data to automatically infer users' demographics based on their daily mobile communication patterns. Our study is based on a real-world large mobile network of more than 7,000,000 users and over 1,000,000,000 communication events (CALL and SMS). We discover several interesting social strategies that mobile users frequently use to maintain their social connections. First, young people are very active in broadening their social circles, while seniors tend to keep close but more stable connections. Second, female users put more attention on cross-generation interactions than male users, though interactions between male and female users are frequent. Third, a persistent same-gender triadic pattern over one's lifetime is discovered for the first time, while more complex opposite-gender triadic patterns are only exhibited among young people. We further study to what extent users' demographics can be inferred from their mobile communications. As a special case, we formalize a problem of double dependent-variable prediction—inferring user gender and age simultaneously. We propose the WhoAmI method to address this problem by considering not only the effects of features on gender/age, but also the interrelation between gender and age. Our experiments show that the proposed WhoAmI method significantly improves the prediction accuracy by up to 10% compared with several alternative methods.
Yuxiao Dong, Yang Yang, Jie Tang, Yang Yang, and Nitesh V. Chawla. "Inferring User Demographics and Social Strategies in Mobile Social Networks." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 15–24, August 2014.
Abstract
Invasive species are among the top three drivers of global environmental change, and also known to cause very high losses to social welfare. For instance, annual losses that occur from ship-borne invasions in the Laurentian Great Lakes could potentially be as high as USD 800 million. We focus on ship-borne invasions in this paper. Despite widespread negative impacts on social welfare, invasions are often poorly managed because the problem is perceived as too complex. Lack of data and characterizations of related phenomena have hampered the usefulness of quantitative invasion risk assessments in the task of supporting the government authorities who are battling to manage invasions with limited resources. Presented in this paper is a solid foundation for more effective and efficient risk assessment and management of aquatic species spread through the Global Shipping Network (GSN). Via an application on risk assessment of invasive species spread via GSN, we show how data mining can be used for solving crucial, yet very complex problems towards the social good. By experimenting on a multi-year shipping dataset, we have discovered crucial knowledge on how patterns in GSN affect aquatic invasions. Furthermore, we have also illustrated how such knowledge can be useful in a deployed setting to both efficiently and economically manage these harmful invasions.
Jian Xu, Thanuka L. Wickramarathne, Nitesh V. Chawla, Erin K. Grey, Karsten Steinhaeuser, Reuben P. Keller, John M. Drake, and David M. Lodge. "Improving Management of Aquatic Invasions by Integrating Shipping Network, Ecological and Environmental Data: Data Mining for Social Good." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1699–1708, August 2014.
Abstract
We study convergence of iterated belief revision in complex fusion environments, which may consist of a network of soft (i.e., human or human-based) and hard (i.e., conventional physics-based) sensors and where agent communications may be asynchronous and the link structure may be dynamic. In particular, we study the problem in which network agents exchange and revise belief functions (which generalize probability mass functions) and are more geared towards handling the uncertainty pervasive in soft/hard fusion environments. We focus on belief revision in which agents utilize a generalized fusion rule that is capable of generating a rational consensus. It includes the widely used weighted average consensus as a special case. By establishing this fusion scheme as a pool of paracontracting operators, we derive general convergence criteria that are relevant for a wide range of applications. Furthermore, we analyze the conditions for consensus for various social networks by simulating several network topologies and communication patterns that are characteristic of such networks.
Thanuka L. Wickramarathne, Kamal Premaratne, Manohar Murthi, and Nitesh V. Chawla. "Convergence Analysis of Iterated Belief Revision in Complex Fusion Environments." IEEE Journal of Selected Topics in Signal Processing (J-STSP), 8(4):598–612, August 2014.
Abstract
A fundamental goal of systems biology is to create models that describe relationships between biological components. Networks are an increasingly popular approach to this problem. However, a scientist interested in modeling biological (e.g., gene expression) data as a network is quickly confounded by the fundamental problem: how to construct the network? It is fairly easy to construct a network, but is it the network for the problem being considered? This is an important problem with three fundamental issues: How to weight edges in the network in order to capture actual biological interactions? What is the effect of the type of biological experiment used to collect the data from which the network is constructed? How to prune the weighted edges (or what cut-off to apply)? Differences in the construction of networks could lead to different biological interpretations. Indeed, we find that there are statistically significant dissimilarities in the functional content and topology between gene co-expression networks constructed using different edge weighting methods, data types, and edge cut-offs. We show that different types of known interactions, such as those found through Affinity Capture-Luminescence or Synthetic Lethality experiments, appear in significantly varying amounts in networks constructed in different ways. Hence, we demonstrate that different biological questions may be answered by the different networks. Consequently, we posit that the approach taken to build a network can be matched to biological questions to get targeted answers. More study is required to understand the implications of different network inference approaches and to draw reliable conclusions from networks used in the field of systems biology.
Andrew K. Rider, Tijana Milenković, Geoffrey H. Siwo, Richard S. Pinapati, Scott J. Emrich, Michael T. Ferdig, and Nitesh V. Chawla. "Networks' Characteristics are Important for Systems Biology." Network Science, 2(2):139–161, August 2014.
Abstract
The emergence of electronic health records (EHRs) has made medical histories easily accessible. EHRs also present a unique opportunity to look at disease progressions and interactions at a deep population level and over time. One can construct disease diagnosis networks, which provide a richer expression of the data than the records alone; EHRs can likewise be used to construct diagnosis-diagnosis and drug-drug networks for studying their interactions. In addition, one can construct rich disease-drug networks, which capture not only the diseases but also the prescriptions assigned to treat them. This analysis can illuminate not only side effects, but also potential candidates for disease targets. We propose a network-based approach for studying disease-drug interactions. In this approach, we construct a bipartite network whose nodes are drugs and diagnoses. This network is an integrated version of the diagnosis-diagnosis and drug-drug networks. We find that the integrated network provides more information on interactions than its component or parent networks.
Dipanwita Dasgupta and Nitesh V. Chawla. "Disease and Medication Networks: An Insight into Disease-Drug Interactions." Proceedings of the 2nd International Conference on Big Data Analytics in Healthcare (BDAH), July 2014.
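As a toy illustration of the bipartite construction described above (using networkx; the drug-diagnosis records are fabricated examples, not data from the study), one can link drugs to the diagnoses they co-occur with and then project onto diagnoses to recover a diagnosis-diagnosis network.

# Minimal sketch: bipartite drug-diagnosis network and its diagnosis projection.
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical co-occurrence records: (drug, diagnosis) pairs from patient visits.
records = [
    ("metformin", "type 2 diabetes"),
    ("lisinopril", "hypertension"),
    ("metformin", "hypertension"),
    ("atorvastatin", "hyperlipidemia"),
    ("atorvastatin", "type 2 diabetes"),
]

B = nx.Graph()
drugs = {d for d, _ in records}
diagnoses = {dx for _, dx in records}
B.add_nodes_from(drugs, bipartite="drug")
B.add_nodes_from(diagnoses, bipartite="diagnosis")
B.add_edges_from(records)

# Two diagnoses are linked if they share a prescribed drug; the edge weight is
# the number of shared drugs.
diag_net = bipartite.weighted_projected_graph(B, diagnoses)
print(list(diag_net.edges(data=True)))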
Abstract
Today the healthcare industry is undergoing one of its most important and challenging transitions to date: the move from paper to electronic healthcare records. While the healthcare industry has generally been an incrementally advancing field, this change has the potential to be revolutionary. Using the data collected from these electronic records, exciting tools such as disease recommendation systems have been created to deliver personalized models of an individual's health profile. However, despite their early success, tools such as these will soon encounter a significant problem. The amount of healthcare encounter data collected is increasing drastically, and the computational time for these applications will soon reach a point at which these systems can no longer function in a practical timeframe for clinical use. This paper begins by analyzing the performance limitations of the personalized disease prediction engine CARE (Collaborative Assessment and Recommendation Engine). Next, it details the creation and performance of a new single-patient implementation of the algorithm. Finally, this work demonstrates a novel parallel implementation of the CARE algorithm and shows its performance benefits on big patient data.
Keith Feldman and Nitesh V. Chawla. "Scaling Personalized Healthcare with Big Data." Proceedings of the 2nd International Conference on Big Data Analytics in Healthcare (BDAH), July 2014.
Abstract
Hellinger Distance Decision Trees (HDDT) have previously been used for static datasets with skewed distributions. In unbalanced data streams, state-of-the-art techniques use instance propagation and standard decision trees (e.g., C4.5) to cope with the imbalance problem. However, it is not always possible to revisit or store old instances of a stream. In this paper we show how HDDT can be successfully applied to unbalanced and evolving stream data. Using HDDT allows us to remove instance propagation between batches, with several benefits: (i) improved predictive accuracy, (ii) speed, and (iii) a single pass through the data. We use a Hellinger-weighted ensemble of HDDTs to combat concept drift and increase the accuracy of single classifiers. We test our framework on several streaming datasets with unbalanced classes and concept drift.
Andrea Dal Pozzolo, Reid A. Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V. Chawla, and Gianluca Bontempi. "Using HDDT to Avoid Instances Propagation in Unbalanced and Evolving Data Streams." Proceedings of the 24th International Joint Conference on Neural Networks (IJCNN), pp. 588–594, July 2014.
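The appeal of the Hellinger distance as a split criterion is that it is insensitive to class skew. The short Python sketch below is our own illustration of the criterion, not the streaming ensemble from the paper: it computes the split value for a candidate partition and shows that scaling the negative counts by 100 (simulating heavy imbalance) leaves the value unchanged.

# Illustrative Hellinger-distance split criterion for a binary-class split.
import numpy as np

def hellinger_split_value(counts_pos, counts_neg):
    """counts_pos[j], counts_neg[j]: number of positive / negative examples
    falling into partition j of a candidate split."""
    counts_pos = np.asarray(counts_pos, dtype=float)
    counts_neg = np.asarray(counts_neg, dtype=float)
    p = counts_pos / counts_pos.sum()   # distribution of positives over partitions
    q = counts_neg / counts_neg.sum()   # distribution of negatives over partitions
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Skew-insensitivity: multiplying the negative counts by 100 gives the same value.
print(hellinger_split_value([30, 10], [20, 80]))
print(hellinger_split_value([30, 10], [2000, 8000]))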
Abstract
For the past 3 years, the First-Year Engineering Program at the University of Notre Dame has used electronic portfolios (ePortfolios) as an assessment tool for its Introduction to Engineering course sequence. While the ePortfolio assignments have expanded each year, they have focused largely on three types of reflections: (1) student experiences within the college but outside of the course, (2) the skills gained specifically through course projects, and (3) their four-year plan to be a successful engineering student as defined by the ABET a-k criteria. ePortfolio assignments were initially included to allow students to reflect on their education, develop evidence of their blossoming skills, and take control of their graduation plan. After the first year of practice, there was a clear secondary benefit to the faculty and student advisors: anecdotally, student reflections provide faculty with a measure of student engagement with the course and may even indicate retention into the sophomore year. For this reason, a more formal assessment was needed to determine whether student engagement could be measured through ePortfolio reflections.

While students are engaged with their ePortfolios, data that describes their interaction with the ePortfolio (number of times they log in, number of artifacts they submit, hits, comments, etc.) is continually collected. This paper will focus on: (1) how ePortfolios have been folded into the Introduction to Engineering course sequence, (2) if student reflections can be used as a measure of engagement and engineering interest, and (3) what markers within an ePortfolio can predict student retention. Specifically, study measures will consist of: (1) instructor categorization of student reflections and evidence types used and (2) assessment of the ePortfolio data features described above as they relate to student retention through the first year. Specifically, ePortfolio data features are continually collected through the use of a computational mining tool. In an initial study of the fall 2012 semester data, students who left the engineering track after one semester had an average of 12.7 logins to the ePortfolio system. Students who left the engineering track after two semesters and students who persisted into the sophomore year had a significantly larger average number of logins, with 18.7 and 19.1, respectively. Future plans include deeper exploration of ePortfolio features as markers of student interest in engineering, specifically identifying students by mid-semester that are at risk for prematurely leaving the engineering track and deploying intervention strategies for those students.
Victoria E. Goodrich, Everaldo Aguiar, G. Alex Ambrose, Leo H. McWilliams, Jay B. Brockman, and Nitesh V. Chawla. "Integration of ePortfolios in a First-Year Engineering Course for Measuring Student Engagement." Proceedings of the American Society for Engineering Education Annual Conference (ASEE), pp. 24.785.1–24.785.16, June 2014.
Abstract
Successful aging, or aging well, is a very complex process. This process is usually defined as having the following components: managing chronic conditions, maintaining physical and mental health, and social engagement. Many solutions exist in isolation for these individual components that assist in aging well. However, while these components can work in isolation, we believe that their real strength lies in their synergism, in their combined power to help seniors age well. We are developing a unified framework, "eSeniorCare", which addresses all of these components together. It is a mobile platform that has the following features: medication scheduling and a daily activity tracker. As a part of this framework, sessions are conducted to provide motivational lectures on aging. In this paper, we present the first phase of the framework.
Dipanwita Dasgupta, Keith Feldman, Disha Waghray, W. A. Mikels-Carrasco, Patty Willaert, Debra A. Raybold, and Nitesh V. Chawla. "An Integrated and Digitized Care Framework for Successful Aging." Proceedings of the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pp. 440–443, June 2014.
Abstract
As providers of higher education begin to harness the power of big data analytics, one very fitting application for these new techniques is that of predicting student attrition. The ability to pinpoint students who might soon decide to drop out of a given academic program allows those in charge not only to understand the causes of this undesired outcome, but also provides room for the development of early intervention systems. While making such inferences based on academic performance data alone is certainly possible, we claim that in many cases there is no substantial correlation between how well a student performs and his or her decision to withdraw. This is especially true when the overall set of students has a relatively similar academic performance. To address this issue, we derive measurements of engagement from students' electronic portfolios and show how these features can be effectively used to augment the quality of predictions.
Everaldo Aguiar, Nitesh V. Chawla, Jay B. Brockman, G. Alex Ambrose, and Victoria E. Goodrich. "Engagement vs Performance: Using Electronic Portfolios to Predict First Semester Engineering Student Retention." Proceedings of the 4th International Conference on Learning Analytics and Knowledge (LAK), pp. 103–112, March 2014.
Abstract
We describe the vertex collocation profile (VCP) concept. VCPs provide rich information about the surrounding local structure of embedded vertex pairs. VCP analysis offers a new tool for researchers and domain experts to understand the underlying growth mechanisms in their networks and to analyze link formation mechanisms in the appropriate sociological, biological, physical, or other context. The same resolution that gives the VCP method its analytical power also enables it to perform well when used to accomplish link prediction. We first develop the theory, mathematics, and algorithms underlying VCPs. We provide timing results to demonstrate that the algorithms scale well even for large networks. Then we demonstrate VCP methods performing link prediction competitively with unsupervised and supervised methods across different network families. Unlike many analytical tools, VCPs inherently generalize to multirelational data, which provides them with unique power in complex modeling tasks. To demonstrate this, we apply the VCP method to longitudinal networks by encoding temporally resolved information into different relations. In this way, the transitions between VCP elements represent temporal evolutionary patterns in the longitudinal network data. Results show that VCPs can use this additional data, typically challenging to employ, to improve predictive model accuracies. We conclude with our perspectives on the VCP method and its future in network science, particularly link prediction.
Ryan N. Lichtenwalter and Nitesh V. Chawla. "Vertex Collocation Profiles: Theory, Computation, and Results." SpringerPlus, 3(1):116, February 2014.
Abstract
High-throughput techniques have become a primary approach to gathering biological data. These data can be used to explore relationships between genes and guide development of drugs and other research. However, the deluge of data contains an overwhelming amount of unknown information about the organism under study. Therefore, clustering is a common first step in the exploratory analysis of high-throughput biological data. We present a supervised learning approach to clustering that utilizes known gene-gene interaction data to improve results for already commonly used clustering techniques. The approach creates an ensemble similarity measure that can be used as input to any clustering technique and provides results with increased biological significance while not altering the clustering method.
Andrew K. Rider, Geoffrey H. Siwo, Scott J. Emrich, Michael T. Ferdig, and Nitesh V. Chawla. "A Supervised Learning Approach to the Ensemble Clustering of Genes." International Journal of Data Mining and Bioinformatics (IJDMB), 9(2):199–219, January 2014.
Abstract
In this paper we introduce and study the properties of a certain kind of interdependent networks that we collectively call a Red Black Network: two intertwined social networks that work together towards a series of events, missions, or performances. More specifically, members of one of the two networks are responsible for planning and organizing the events; they will generally be referred to as artists. Members of the other network, henceforth called actors, are responsible for the execution of the events. Using temporal data from the performing arts industry, we study the coevolution of two such codependent social networks. We find that the statistical properties of the two networks are highly correlated, and we use that finding to devise a prediction mechanism for such properties in a scenario where one of the two networks is invisible or only partially visible. This also sets up a framework for our ultimate goal of temporal, semiblind, multirelational link prediction.
Saurav Pandit
,
Jonathan Koch
,
Yang Yang
, Brian Uzzi, and
Nitesh V. Chawla
. "
Red Black Network: Temporal and Topological Analysis of Two Intertwined Social Networks
."
Proceedings of the 32nd Annual International Conference for Military Communications (MILCOM)
,
pp. 719–724,
November
2013
.
Abstract
The Gene Promoter Expression Prediction challenge consisted of predicting gene expression from promoter sequences in a previously unknown experimentally generated data set. The challenge was presented to the community in the framework of the sixth Dialogue for Reverse Engineering Assessments and Methods (DREAM6), a community effort to evaluate the status of systems biology modeling methodologies. Nucleotide-specific promoter activity was obtained by measuring fluorescence from promoter sequences fused upstream of a gene for yellow fluorescence protein and inserted in the same genomic site of yeast Saccharomyces cerevisiae. Twenty-one teams submitted results predicting the expression levels of 53 different promoters from yeast ribosomal protein genes. Analysis of participant predictions shows that accurate values for low-expressed and mutated promoters were difficult to obtain, although in the latter case, only when the mutation induced a large change in promoter activity compared to the wild-type sequence. As in previous DREAM challenges, we found that aggregation of participant predictions provided robust results, but did not fare better than the three best algorithms. Finally, this study not only provides a benchmark for the assessment of methods predicting activity of a specific set of promoters from their sequence, but it also shows that the top performing algorithm, which used machine-learning approaches, can be improved by the addition of biological features such as transcription factor binding sites.
Pablo Meyer, Geoffrey H. Siwo, Danny Zeevi, Eilon Sharon, Raquel Norel,
DREAM6 Promoter Prediction Consortium
, Eran Segal, and Gustavo Stolovitzky
. "
Inferring Gene Expression from Ribosomal Promoter Sequences, A Crowdsourcing Approach
."
Genome Research
,
23
(11)
:1928–1937,
November
2013
.
Abstract
The concept of a negative class does not apply to many problems for which classification is increasingly utilized. In this study we investigate the reliability of evaluation metrics when the negative class contains an unknown proportion of mislabeled positive class instances. We examine how evaluation metrics can inform us about potential systematic biases in the data. We provide a motivating case study and a general framework for approaching evaluation when the negative class contains mislabeled positive class instances. We show that the behavior of evaluation metrics is unstable in the presence of uncertainty in class labels and that the stability of evaluation metrics depends on the kind of bias in the data. Finally, we show that the type and amount of bias present in data can have a significant effect on the ranking of evaluation metrics and the degree to which they over- or underestimate the true performance of classifiers.
Andrew K. Rider
,
Reid A. Johnson
,
Darcy A. Davis
,
T. Ryan Hoens
, and
Nitesh V. Chawla
. "
Classifier Evaluation with Missing Negative Class Labels
."
Advances in Intelligent Data Analysis XII (IDA)
,
pp. 380–391,
October
2013
.
Abstract
The selection of hardware to support big data systems is complex. Even defining the term "big data" is difficult. "Big data" can mean a large volume of data in a database, a MapReduce cluster that processes data, analytics and reporting applications that must access large datasets to operate, algorithms that can effectively operate on large datasets, or even basic scripts that produce a needed result by leveraging data. Big data systems can be composed of many component systems. For these reasons, it appears difficult to create a universal, representative benchmark that approximates a "big data" workload. Along with the trend to utilize large datasets and sophisticated tools to analyze data, the trend of cloud computing has emerged as an effective method of leasing compute time. This chapter explores some of the issues at the intersection of virtualized computing (since cloud computing often uses virtual machines), metadata stores, and big data. Metadata is important because it enables many applications and users to access datasets and effectively use them without relying on extensive knowledge from humans about the data.
Nathan Regola
,
David A. Cieslak
, and
Nitesh V. Chawla
. "
The Need to Consider Hardware Selection When Designing Big Data Applications Supported by Metadata
."
Big Data Management, Technologies, and Applications
,
pp. 381–396,
October
2013
.
Abstract
Call duration analysis is a key issue for understanding underlying patterns of (mobile) phone users. In this paper, we study to what extent the duration of a call between users can be predicted in a dynamic mobile network. We collected mobile phone call data from a mobile operator, resulting in a network of 272,345 users and 3.9 million call records over two months. We first examine the dynamic distribution properties of the mobile network, including periodicity and demographics. Then we study several important social theories in the call network, including strong/weak ties, link homophily, opinion leadership, and social balance. The study reveals several interesting phenomena, such as that people with strong ties tend to make shorter calls and that young females tend to make long calls, in particular in the evening. Finally, we present a time-dependent factor graph model to model and infer the call duration between users, incorporating our observations from the distribution analysis and the social theory analysis. Experiments show that the presented model achieves much better predictive performance than several baseline methods. Our study offers evidence for social theories and also unveils several patterns in the call network that differ from those in online social networks.
Yuxiao Dong
, Jie Tang, Tiancheng Lou, Bin Wu, and
Nitesh V. Chawla
. "
How Long Will She Call Me? Distribution, Social Theory and Duration Prediction
."
Proceedings of the 24th European Conference on Machine Learning and the 17th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD)
,
pp. 16–31,
September
2013
.
Abstract
With the recent signing of the Patient Protection and Affordable Care Act into law, the use of electronic medical data is set to become ubiquitous in the United States. This presents an unprecedented opportunity to use data mining for the benefit of patient health. However, there are two major hurdles to utilizing this wealth of data. First, medical data is not centrally located but is often divided into regions by companies that warehouse the data for local hospitals or by research organizations. Second, federal and state laws prevent sharing specific or identifiable information. It would be of great benefit to share information between these distinct populations, as a data set concerning cancer patients, for example, could greatly increase the accuracy of computationally derived disease risks for everyone else. Still, organizations may have a vested interest in keeping their data sets private, as they may have been gathered and curated at great cost. We propose an approach to allow the sharing of beneficial information while staying within the bounds of the law and maintaining the privacy of the data. We show that the use of a probabilistic graphical model can facilitate effective transfer learning between distinct healthcare data sets by parameter sharing, while simultaneously allowing us to construct a network for interpretation by domain experts and for the discovery of disease relationships. Our method utilizes aggregate information from distinct populations in order to improve the estimation of patient disease risk.
Andrew K. Rider
and
Nitesh V. Chawla
. "
An Ensemble Topic Model for Sharing Healthcare Data and Predicting Disease Risk
."
Proceedings of the ACM Conference on Bioinformatics, Computational Biology, and Biomedical Informatics (ACM-BCB)
,
pp. 333–340,
September
2013
.
Abstract
Faced with unsustainable costs and enormous amounts of under-utilized data, health care needs more efficient practices, research, and tools to harness the full benefits of personal health and healthcare-related data. Imagine visiting your physician’s office with a list of concerns and questions. What if you could walk out of the office with a personalized assessment of your health? What if you could have a personalized disease management and wellness plan? These are the goals and vision of the work discussed in this paper. The timing is right for such a research direction—given the changes in health care, reimbursement, reform, meaningful use of electronic health care data, and the patient-centered outcome mandate. We present the foundations of work that takes a Big Data driven approach towards personalized healthcare, and demonstrate its applicability to patient-centered outcomes, meaningful use, and reducing re-admission rates.
Nitesh V. Chawla
and
Darcy A. Davis
. "
Bringing Big Data to Personalized Healthcare: A Patient-Centered Framework
."
Journal of General Internal Medicine (JGIM)
,
28
(3)
:660–665,
September
2013
.
Abstract
One of the concerns patients have when confronted with a medical condition is which physician to trust. Any recommendation system that seeks to answer this question must ensure that any sensitive medical information collected by the system is properly secured. In this article, we codify these privacy concerns in a privacy-friendly framework and present two architectures that realize it: the Secure Processing Architecture (SPA) and the Anonymous Contributions Architecture (ACA). In SPA, patients submit their ratings in a protected form without revealing any information about their data and the computation of recommendations proceeds over the protected data using secure multiparty computation techniques. In ACA, patients submit their ratings in the clear, but no link between a submission and patient data can be made. We discuss various aspects of both architectures, including techniques for ensuring reliability of computed recommendations and system performance, and provide their comparison.
T. Ryan Hoens
, Marina Blanton, Aaron Steele, and
Nitesh V. Chawla
. "
Reliable Medical Recommendation Systems with Patient Privacy
."
ACM Transactions on Intelligent Systems and Technology (TIST)
,
4
(4)
:67,
September
2013
.
Abstract
The goal of systems biology is to gain a more complete understanding of biological systems by viewing all of their components and the interactions between them simultaneously. Until recently, the most complete global view of a biological system was through the use of gene expression or protein-protein interaction data. With the increasing number of high-throughput technologies for measuring genomic, proteomic, and metabolomic data, scientists now have the opportunity to create complex network-based models for drug discovery, protein function annotation, and many other problems. Each technology used to measure a biological system inherently presents a limited view of the system. However, the combination of multiple technologies can provide a more complete picture. Much recent work has studied integrating these heterogeneous data types into single networks. Here we provide a survey of integrative network-based approaches to problems in systems biology. We focus on describing the variety of algorithms used in integrative network inference. Ultimately, the survey of current approaches leads us to the conclusion that there is an urgent need for a standard set of evaluation metrics and data sets in this field.
Andrew K. Rider
,
Nitesh V. Chawla
, and Scott J. Emrich
. "
A Survey of Current Integrative Network Algorithms for Systems Biology
."
Systems Biology
,
pp. 479–495,
August
2013
.
Abstract
The understanding of how humans move is a long-standing challenge in the natural sciences. An important question is: to what degree is human behavior predictable? The ability to foresee the mobility of humans is crucial for applications ranging from predicting the spread of disease to urban planning. Previous research has focused on predicting individual mobility behavior, such as the next-location prediction problem. In this paper we study human mobility behaviors from the perspective of network science. In the human mobility network, there is a link between two humans if they are physically proximal to each other. We perform both microscopic and macroscopic explorations of human mobility patterns. From the microscopic perspective, our objective is to answer whether two humans will be in proximity of each other or not; from the macroscopic perspective, we are interested in whether we can infer the future topology of the human mobility network. We explore both problems using link prediction techniques, and our methodology is demonstrated to achieve a greater degree of precision in predicting the future mobility topology.
Yang Yang
,
Nitesh V. Chawla
, Prithwish Basu, Bhaskar Prabhala, and Thomas La Porta
. "
Link Prediction in Human Mobility Networks
."
Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
,
pp. 380–387,
August
2013
.
Abstract
Many important real-world systems, modeled naturally as complex networks, have heterogeneous interactions and complicated dependency structures. Link prediction in such networks must model the influences between heterogenous relationships and distinguish the formation mechanisms of each link type, a task which is beyond the simple topological features commonly used to score potential links. In this paper, we introduce a novel probabilistically weighted extension of the Adamic/Adar measure for heterogenous information networks, which we use to demonstrate the potential benefits of diverse evidence, particularly in cases where homogeneous relationships are very sparse. However, we also expose some fundamental flaws of traditional unsupervised link prediction. We develop supervised learning approaches for relationship (link) prediction in multi-relational networks, and demonstrate that a supervised approach to link prediction can enhance performance. We present results on three diverse, real-world heterogeneous information networks and discuss the trends and tradeoffs of supervised and unsupervised link prediction in a multi-relational setting.
Darcy A. Davis
,
Ryan N. Lichtenwalter
, and
Nitesh V. Chawla
. "
Supervised Methods for Multi-Relational Link Prediction
."
Social Networks Analysis and Mining (SNAM)
,
3
(2)
:127–141,
June
2013
.
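As a point of reference for the entry above, the sketch below shows the classical Adamic/Adar score that the paper extends; the probabilistic weighting over heterogeneous relation types is specific to the paper and is not reproduced here. The example assumes networkx and uses its bundled karate club graph purely for illustration.

import math
import networkx as nx

def adamic_adar(g, u, v):
    # Sum 1/log(degree) over the common neighbors of u and v;
    # networkx also ships this measure as nx.adamic_adar_index.
    return sum(1.0 / math.log(g.degree(w))
               for w in nx.common_neighbors(g, u, v)
               if g.degree(w) > 1)

g = nx.karate_club_graph()
print(adamic_adar(g, 0, 33))   # score for a potential (non-existing) link between the two hubs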
Abstract
Classification is one of the most fundamental tasks in the machine learning and data mining communities. One of the most common challenges faced when trying to perform classification is the class imbalance problem. A data set is considered imbalanced if the class of interest (positive or minority class) is relatively rare as compared to the other classes (negative or majority classes). As a result, the classifier can be heavily biased towards the majority class. A number of sampling approaches, ranging from under-sampling to over-sampling, have been developed to solve the problem of class imbalance. One challenge with sampling strategies is deciding how much to sample, which is obviously conditioned on the sampling strategy that is deployed. While a wrapper approach may be used to discover the sampling strategy and amounts, it can quickly become computationally prohibitive. To that end, recent research has also focused on developing novel classification algorithms that are class imbalance (skew) insensitive. In this chapter, we provide an overview of the sampling strategies as well as classification algorithms developed for countering class imbalance. In addition, we consider the issues of correctly evaluating the performance of a classifier on imbalanced datasets and present a discussion on various metrics.
T. Ryan Hoens
and
Nitesh V. Chawla
. "
Imbalanced Datasets: From Sampling to Classifiers
."
Imbalanced Learning: Foundations, Algorithms, and Applications
,
pp. 43–59,
May
2013
.
Abstract
So far, electronic prescription management has failed to sufficiently improve physicians' knowledge of a patient's current and previous medications. Additionally, prescription management methods have failed at encouraging patients to adhere to medications. Between medication errors resulting from incomplete medication histories and poor patient adherence to drug regimens, significant challenges remain in prescription management. In this paper, we propose a unique use of quick response (QR) codes that should improve the completeness of medication histories, allowing physicians to more accurately treat a patient's diagnosis. This QR code system outlines a method for patients, physicians, and pharmacists to all view the medication history and for pharmacists to accurately update the history when a patient retrieves new medication. We also outline a method for incentivizing patient adherence based on redemption codes, a method proven successful in other industries. Between this and the QR code system, physicians should also be able to access important information about a patient's adherence consistency, allowing him or her to better understand the patient and more effectively treat a diagnosis.
Robert Thompson
and
Nitesh V. Chawla
. "
Addressing Challenges in Prescription Management
."
Proceedings of the 24th Annual Conference of the Production and Operations Management Society (POMS)
,
no. 1610,
May
2013
.
Abstract
We investigate how people and objects that they create (artifacts) gain prominence in collaborative networks. As an example, consider academic research communities, where people and their artifacts (research papers) both have prominence. But these prominence values are linked to each other and evolve together. In particular, for an author to make an impact on the scientific community, she has to interact with diverse sets of individuals. The result is that her research will be known by a large number of communities. However, the research communities and topics have to be aligned with her research interests as well. Peer reviewers have to know and understand the research field, and the results must be disseminated to the larger community through sustained interest in the given research field. Hence, the prominence of individuals and their artifacts evolves simultaneously. In this paper, we develop novel methods to study both types of evolution processes and show their effectiveness using the DBLP dataset.
Yang Yang
,
Nitesh V. Chawla
, Xiaohui Lu, and Sibel Adali
. "
Prominence in Networks: A Co-evolving Process
."
Proceedings of the 2nd IEEE International Network Science Workshop (NSW)
,
pp. 58–65,
May
2013
.
Abstract
Unmanned Aerial Vehicles (UAVs) are of increasing interest to researchers because of their diverse applications, such as military operations and search and rescue. The problem we have chosen to focus on is using a swarm of small, inexpensive UAVs to discover static targets in a search space. Though many different swarm models have been used for similar problems, our proposed model, the Icosystem Swarm Game, to our knowledge has not been evaluated for this particular problem of target search.

Further, we propose to simulate the performance of this model in a semi-realistic communications environment. The challenge here is to find the optimal multi-hop configuration for the UAVs, so that they can find the most targets, avoid collisions with each other as much as possible, and still communicate efficiently. We implement this as a weighted shortest-path problem using Dijkstra's algorithm, with the weights being the transmission cost over distance. Testing has shown that our multi-hop communications perform as well as an idealized communications environment in terms of target finding and collision avoidance with other UAVs.
Rachael Purta,
Saurabh Nagrecha
, and Gregory Madey
. "
Multi-hop Communications in a Swarm of UAVs
."
Proceedings of the Agent-Directed Simulation Symposium (ADSS)
,
no. 5,
April
2013
.
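To make the routing step described above concrete, here is a minimal sketch of distance-weighted multi-hop path selection with Dijkstra's algorithm. It assumes networkx and a hypothetical quadratic path-loss cost; the cost model and the communication range are illustrative assumptions, not values taken from the paper.

import math
import networkx as nx

def transmission_cost(p1, p2):
    # Hypothetical cost model: cost grows with squared Euclidean distance.
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

def build_comm_graph(positions, comm_range):
    # Connect UAVs that are within communication range of each other.
    g = nx.Graph()
    g.add_nodes_from(positions)
    for u, pu in positions.items():
        for v, pv in positions.items():
            if u < v and math.dist(pu, pv) <= comm_range:
                g.add_edge(u, v, weight=transmission_cost(pu, pv))
    return g

# Toy example: four UAVs on a line; the direct link 0 -> 3 is out of range,
# so the cheapest route relays through the intermediate UAVs.
positions = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (8.0, 0.0), 3: (12.0, 0.0)}
g = build_comm_graph(positions, comm_range=5.0)
print(nx.dijkstra_path(g, 0, 3, weight="weight"))   # [0, 1, 2, 3]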
Abstract
A wide variety of networked systems in human societies are composed of repeated communications between actors. A dyadic relationship made up of repeated interactions may be reciprocal (both actors have the same probability of directing a communication attempt to the other) or non-reciprocal (one actor has a higher probability of initiating a communication attempt than the other). In this paper we propose a theoretically motivated index of reciprocity appropriate for networks formed from repeated interactions based on these probabilities. We go on to examine the distribution of reciprocity in a large-scale social network built from trace-logs of over a billion cell-phone communication events across millions of actors in a large industrialized country. We find that while most relationships tend toward reciprocity, a substantial minority of relationships exhibit large levels of non-reciprocity. This is puzzling because behavioral theories in social science predict that persons will selectively terminate non-reciprocal relationships, keeping only those that approach reciprocity. We point to two structural features of human communication behavior and relationship formation—the division of contacts into strong and weak ties and degree-based assortativity—that either help or hinder the ability of persons to obtain communicative balance in their relationships. We examine the extent to which deviations from reciprocity in the observed network are partially traceable to the operation of these countervailing tendencies.
Cheng Wang, Omar Lizardo, David Hachen, Anthony Strathman, Zoltán Toroczkai, and
Nitesh V. Chawla
. "
A Dyadic Reciprocity Index for Repeated Interaction Networks
."
Network Science
,
1
(01)
:31–48,
April
2013
.
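The sketch below illustrates the kind of probability-based reciprocity score the abstract above describes for dyads built from repeated interactions. It is one simple formulation for illustration, not necessarily the exact index defined in the paper.

def reciprocity(calls_i_to_j, calls_j_to_i):
    # Return 1.0 for a perfectly balanced dyad and 0.0 for a fully one-sided one.
    total = calls_i_to_j + calls_j_to_i
    if total == 0:
        raise ValueError("dyad has no interactions")
    p = calls_i_to_j / total          # empirical probability that i initiates
    return 1.0 - 2.0 * abs(p - 0.5)   # distance from the balanced case p = 0.5

print(reciprocity(10, 10))  # 1.0  (fully reciprocal)
print(reciprocity(18, 2))   # 0.2  (highly non-reciprocal)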
Abstract
Electronic health records are being adopted at a rapid rate due to increased funding from the US federal government. Health data provide the opportunity to identify possible improvements in health care delivery by applying data mining and statistical methods to the data and will also enable a wide variety of new applications that will be meaningful to patients and medical professionals. Researchers are often granted access to health care data to assist in the data mining process, but HIPAA regulations mandate comprehensive safeguards to protect the data. Often universities (and presumably other research organizations) have an enterprise information technology infrastructure and a research infrastructure. Unfortunately, both of these infrastructures are generally not appropriate for sensitive research data such as HIPAA-protected health data, as such data requires special accommodations on the part of the enterprise information technology infrastructure (or increased security on the part of the research computing environment). Cloud computing, which is a concept that allows organizations to build complex infrastructures on leased resources, is rapidly evolving to the point that it is possible to build sophisticated network architectures with advanced security capabilities. We present a prototype infrastructure in Amazon's Virtual Private Cloud to allow researchers and practitioners to utilize the data in a HIPAA-compliant environment.
Nathan Regola
and
Nitesh V. Chawla
. "
Storing and Using Health Data in a Virtual Private Cloud
."
Journal of Medical Internet Research (JMIR)
,
15
(3)
:e63,
March
2013
.
Abstract
Community detection or cluster detection in networks is often at the core of mining network data. Whereas the problem is well-studied, given the scale and complexity of modern day social networks, detecting "reasonable" communities is often a hard problem. Since the first use of the k-means algorithm in the 1960s, many community detection algorithms have been presented—most of which were developed with specific goals in mind, and the idea of what constitutes a meaningful community varies widely from one algorithm to another.

As the number of clustering algorithms grows, so does the number of metrics on how to measure them. Algorithms are often reduced to optimizing the value of an objective function such as modularity or internal density. Some of these metrics rely on ground truth, some do not. In this chapter we study these algorithms and aim to find whether these optimization-based measurements are consistent with the real performance of community detection algorithms. Seven representative algorithms are compared under various performance metrics, and on various real-world social networks.

The difficulties of measuring community detection algorithms are mostly due to the unavailability of ground-truth information, in which case objective functions, such as modularity, are used as substitutes. Benchmark networks that simulate real-world networks with planted community structure have been introduced to tackle the unavailability of ground-truth information; however, whether the simulation is precise and useful has not been verified. In this chapter we present the performance of community detection algorithms on real-world networks and their corresponding benchmark networks, an analysis designed to demonstrate the differences between real-world networks and benchmark networks.
Yang Yang
, Yizhou Sun,
Saurav Pandit
,
Nitesh V. Chawla
, and Jiawei Han
. "
Perspective on Measurement Metrics for Community Detection Algorithms
."
Mining Social Networks and Security Informatics
,
pp. 227–242,
March
2013
.
Abstract
Inferring genetic networks is of great importance in unlocking gene behaviour, which in turn provides solutions for drug testing, disease resistance, and many other applications. Dynamic network models provide room for handling noisy or missing prelearned data. This paper discusses how Dynamic Bayesian Networks compare against co-expression networks as discussed by Zhang and Horvath. These are tested on the genes of the yeast Saccharomyces cerevisiae. A method is then proposed to get the best out of the strengths of both models, namely the causality inference from Bayesian networks and the scoring method from a modified version of Zhang and Horvath's method.
Saurabh Nagrecha
, Pawan J. Lingras, and
Nitesh V. Chawla
. "
Comparison of Gene Co-expression Networks and Bayesian Networks
."
Proceedings of the 5th Asian Conference on Intelligent Information and Database Systems (ACIIDS)
,
pp. 507–516,
March
2013
.
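For the comparison above, a Zhang and Horvath style weighted co-expression network is built from pairwise gene correlations raised to a soft-thresholding power. The sketch below shows that construction; the exponent beta = 6 is a commonly used default assumed here, not a value taken from the paper.

import numpy as np

def coexpression_adjacency(expression, beta=6):
    # expression: genes x samples matrix; returns a genes x genes adjacency.
    corr = np.corrcoef(expression)              # Pearson correlation between genes
    adjacency = np.abs(corr) ** beta            # soft thresholding
    np.fill_diagonal(adjacency, 0.0)            # no self-edges
    return adjacency

rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 40))                 # 5 toy genes, 40 samples
expr[1] = expr[0] + 0.1 * rng.normal(size=40)   # gene 1 co-expressed with gene 0
print(coexpression_adjacency(expr).round(2))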
Abstract
Link prediction and recommendation is a fundamental problem in social network analysis. The key challenge of link prediction comes from the sparsity of networks due to the strong disproportion between links that have the potential to form and links that do form. Most previous work tries to solve the problem in a single network; little research focuses on capturing the general principles of link formation across heterogeneous networks. In this work, we give a formal definition of link recommendation across heterogeneous networks. Then we propose a ranking factor graph model (RFG) for predicting links in social networks, which effectively improves the predictive performance. Motivated by the intuition that people make friends in different networks with similar principles, we find several social patterns that are general across heterogeneous networks. With the general social patterns, we develop a transfer-based RFG model that combines them with network structure information. This model provides us insight into the fundamental principles that drive link formation and network evolution. Finally, we verify the predictive performance of the presented transfer model on 12 pairs of transfer cases. Our experimental results demonstrate that the transfer of general social patterns indeed helps the prediction of links.
Yuxiao Dong
, Jie Tang, Sen Wu, Jilei Tian,
Nitesh V. Chawla
, Jinghai Rao, and Huanhuan Cao
. "
Link Prediction and Recommendation Across Heterogeneous Social Networks
."
Proceedings of the 12th IEEE International Conference on Data Mining (ICDM)
,
pp. 181–190,
December
2012
.
Abstract
Link prediction is an important task in network analysis, benefiting researchers and organizations in a variety of fields. Many networks in the real world, for example social networks, are heterogeneous, having multiple types of links and complex dependence structures. Link prediction in such networks must model the influence propagating between heterogeneous relationships to achieve better link prediction performance than in homogeneous networks. In this paper, we introduce Multi-Relational Influence Propagation (MRIP), a novel probabilistic method for heterogeneous networks. We demonstrate that MRIP is useful for predicting links in sparse networks, which present a significant challenge due to the severe disproportion of the number of potential links to the number of links that actually form. We also explore the impact of time in the process of link formation. We make use of temporal features by carefully investigating the issues of feasibility and generality. In accordance with our work in unsupervised learning, we further design an appropriate supervised approach in heterogeneous networks. Our experiments on co-authorship prediction demonstrate the effectiveness of our approach.
Yang Yang
,
Nitesh V. Chawla
, Yizhou Sun, and Jiawei Han
. "
Predicting Links in Multi-relational and Heterogeneous Networks
."
Proceedings of the 12th IEEE International Conference on Data Mining (ICDM)
,
pp. 755–764,
December
2012
.
Abstract
Finding the most influential nodes in a network has been a much-discussed research topic in network science in recent years, especially in social network analysis. The topic of this paper is a related, but harder, problem. Given a social network where neighbors can influence each other, the problem is to identify k nodes such that if a piece of information is placed on each of those k nodes, the overall spread of that information (via word-of-mouth or other methods of influence flow) is maximized. The amount of information spread can be measured using existing information propagation models. Recent studies, which focus on how quickly k highly influential nodes can be found, tend to ignore the overall effect of the information spread. On the other hand, some legacy methods, which look at all possible propagation paths to compute a globally optimal target set, present severe scalability challenges in large-scale networks. We present a simple, yet scalable (polynomial time) algorithm that outperforms the existing state of the art, and its success does not depend significantly on any kind of tuning parameter. To be more precise, when compared to the existing algorithms, the output set of k nodes produced by our algorithm facilitates higher information spread—in almost all the instances, consistently across the commonly used information propagation models. The original algorithm in this paper, although scalable, can have a higher running time than some standard approaches, e.g. simply picking the top k nodes with the highest degree or highest PageRank value. To that end, we provide an optional speedup mechanism that considerably reduces the time complexity while not significantly affecting the quality of results vis-a-vis the full version of our algorithm.
Saurav Pandit
,
Yang Yang
, and
Nitesh V. Chawla
. "
Maximizing Information Spread Through Influence Structures in Social Networks
."
Proceedings of the DaMNet Workshop of the 12th IEEE International Conference on Data Mining (DaMNet-ICDMW)
,
pp. 258–265,
December
2012
.
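The paper's own algorithm is not reproduced here; as background for the entry above, the sketch below shows the standard greedy baseline such methods are typically compared against, which selects k seed nodes by Monte Carlo simulation of the independent cascade (IC) propagation model. The propagation probability, run count, and example graph are illustrative assumptions.

import random
import networkx as nx

def ic_spread(g, seeds, p=0.1, runs=200):
    # Estimate the expected cascade size from `seeds` under the IC model.
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in g.neighbors(u):
                    if v not in active and random.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def greedy_seeds(g, k, p=0.1):
    # Greedily add the node with the largest marginal spread gain.
    seeds = []
    for _ in range(k):
        best = max((n for n in g if n not in seeds),
                   key=lambda n: ic_spread(g, seeds + [n], p))
        seeds.append(best)
    return seeds

g = nx.karate_club_graph()
print(greedy_seeds(g, k=2))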
Nitesh V. Chawla
and
Yang Yang
. "
Link Prediction: A Primer
."
Encyclopedia of Social Network Analysis and Mining
,
pp. 813–820,
December
2012
.
Abstract
Predicting the distributions of species is central to a variety of applications in ecology and conservation biology. With increasing interest in using electronic occurrence records, many modeling techniques have been developed to utilize this data and compute the potential distribution of species as a proxy for actual observations. As the actual observations are typically overwhelmed by non-occurrences, we approach the modeling of species’ distributions with a focus on the problem of class imbalance. Our analysis includes the evaluation of several machine learning methods that have been shown to address the problems of class imbalance, but which have rarely or never been applied to the domain of species distribution modeling. Evaluation of these methods includes the use of the area under the precision-recall curve (AUPR), which can supplement other metrics to provide a more informative assessment of model utility under conditions of class imbalance. Our analysis concludes that emphasizing techniques that specifically address the problem of class imbalance can provide AUROC and AUPR results competitive with traditional species distribution models.
Reid A. Johnson
,
Nitesh V. Chawla
, and Jessica J. Hellmann
. "
Species Distribution Modeling and Prediction: A Class Imbalance Problem
."
Proceedings of the Conference on Intelligent Data Understanding (CIDU)
,
pp. 9–16,
October
2012
.
Abstract
Link prediction is a popular area for publication: new methods appear in virtually every conference on data mining or network science. We argue that the practical performance potential of these methods is generally unknown because of challenges endemic to evaluation in many link prediction contexts. We demonstrate that current methods of evaluation are inadequate and can lead to woefully errant conclusions about practical performance potential. We argue for the use of precision-recall threshold curves and associated areas in lieu of receiver operating characteristic curves due to the extreme imbalance of the link prediction classification problem. We provide empirical examples of how current methods lead to questionable conclusions, how the fallacy of these conclusions is illuminated by the methods we propose, and suggest a fair and consistent framework for link prediction evaluation for longitudinal and non-longitudinal network data sets.
Ryan N. Lichtenwalter
and
Nitesh V. Chawla
. "
Link Prediction: Fair and Effective Evaluation
."
Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
,
pp. 376–383,
August
2012
.
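The evaluation point argued in the abstract above can be reproduced in a few lines: on a heavily imbalanced label set, ROC AUC can look healthy while the area under the precision-recall curve exposes how weak the scorer really is. The sketch below assumes scikit-learn and uses synthetic labels and scores for illustration.

import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)   # ~0.1% positives, as in link prediction
scores = rng.random(100_000) + 0.5 * y_true          # weakly informative scores

precision, recall, _ = precision_recall_curve(y_true, scores)
print("ROC AUC:", roc_auc_score(y_true, scores))   # looks deceptively healthy (~0.88)
print("PR  AUC:", auc(recall, precision))          # reveals the real difficulty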
Abstract
Learning in non-stationary environments is an increasingly important problem in a wide variety of real-world applications. In non-stationary environments data arrives incrementally; however, the underlying generating function may change over time. In addition to being non-stationary, such environments also often exhibit class imbalance: that is, one class (the majority class) vastly outnumbers the other class (the minority class). The combination of non-stationary environments with class imbalance poses significant and interesting practical problems for classification. To overcome these issues, we introduce a novel instance selection mechanism, as well as an update to the Heuristic Updatable Weighted Random Subspaces (HUWRS) method for the class imbalance problem. We then compare our modifications of HUWRS (called HUWRS.IP) to other state-of-the-art algorithms, concluding that HUWRS.IP often achieves vastly superior performance.
T. Ryan Hoens
and
Nitesh V. Chawla
. "
Learning in Non-stationary Environments with Class Imbalance
."
Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
,
pp. 168–176,
August
2012
.
Abstract
A systematic characterization of multivariate dependence at multiple spatio-temporal scales is critical to understanding climate system dynamics and improving predictive ability from models and data. However, dependence structures in climate are complex due to nonlinear dynamical generating processes, long-range spatial and long-memory temporal relationships, as well as low-frequency variability. Here we utilize complex networks to explore dependence in climate data. Specifically, networks constructed from reanalysis-based atmospheric variables over oceans and partitioned with community detection methods demonstrate the potential to capture regional and global dependence structures within and among climate variables. Proximity-based dependence as well as long-range spatial relationships are examined along with their evolution over time, yielding new insights on ocean meteorology. The tools are implicitly validated by confirming conceptual understanding about aggregate correlations and teleconnections. Our results also suggest a close similarity of observed dependence patterns in relative humidity and horizontal wind speed over oceans. In addition, updraft velocity, which relates to convective activity over the oceans, exhibits short spatiotemporal decorrelation scales but long-range dependence over time. The multivariate and multi-scale dependence patterns broadly persist over multiple time windows. Our findings motivate further investigations of dependence structures among observations, reanalysis and model-simulated data to enhance process understanding, assess model reliability and improve regional climate predictions.
Karsten Steinhaeuser
, Auroop R. Ganguly, and
Nitesh V. Chawla
. "
Multivariate and Multiscale Dependence in the Global Climate System Revealed Through Complex Networks
."
Climate Dynamics
,
39
(3-4)
:889–895,
August
2012
.
Abstract
Solid state disks (or flash disks) are decreasing in cost per gigabyte and are being incorporated into many appliances, such as the Oracle Database Appliance. Databases—and more specifically data warehouses—are often utilized to support large scale data analysis and decision support systems. Decision makers prefer information in real time. Traditional storage systems that are based on magnetic disks achieve high performance by utilizing many disks for parallel operations in RAID arrays. However, this performance is only possible if requests represent a reasonable fraction of the RAID stripe size, or I/O transactions will suffer from high overhead. Solid state disks have the potential to increase the speed of data retrieval for mission critical workloads that require real time applications, such as analytic dashboards. However, solid state disks behave differently than magnetic hard disks due to the limitations of rewriting NAND flash based blocks. Therefore, this work presents benchmark results for a modern relational database that stores data on solid state disks, and contrasts this performance to a ten disk RAID 10 array, a traditional storage design for high performance database data blocks. The preliminary results show that a single solid state disk is able to outperform the array for queries summarizing a data set for a variety of OLAP cube dimensions. Future work will explore the low level database performance in more detail.
Nathan Regola
,
David A. Cieslak
, and
Nitesh V. Chawla
. "
The Constraints of Magnetic Versus Flash Disk Capabilities in Big Data Analysis
."
Proceedings of 2nd ACM Workshop on Architectures and Systems for Big Data (ASBD)
,
pp. 4–9,
June
2012
.
Abstract
Learning in imbalanced datasets is a pervasive problem prevalent in a wide variety of real-world applications. In imbalanced datasets, the class of interest is generally a small fraction of the total instances, but misclassification of such instances is often expensive. While there is a significant body of research on the class imbalance problem for binary class datasets, multi-class datasets have received considerably less attention. This is partially due to the fact that the multi-class imbalance problem is often much harder than its related binary class problem, as the relative frequency and cost of each of the classes can vary widely from dataset to dataset. In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT), and develop a new multi-class splitting criterion. From our experiments we show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.
T. Ryan Hoens
, Qi Qian,
Nitesh V. Chawla
, and Zhi-Hua Zhou
. "
Building Decision Trees for the Multi-class Imbalance Problem
."
Advances in Knowledge Discovery and Data Mining (PAKDD)
,
pp. 122–134,
June
2012
.
Abstract
Here we present a range-limited approach to centrality measures in both nonweighted and weighted directed complex networks. We introduce an efficient method that generates for every node and every edge its betweenness centrality based on shortest paths of lengths not longer than ℓ=1,...,L in the case of nonweighted networks, and for weighted networks the corresponding quantities based on minimum weight paths with path weights not larger than wℓ=ℓΔ, ℓ=1,2,...,L=R/Δ. These measures provide a systematic description of the positioning importance of a node (edge) with respect to its network neighborhoods one step out, two steps out, etc., up to and including the whole network. They are more informative than traditional centrality measures, as network transport typically happens on all length scales, from transport to nearest neighbors to the farthest reaches of the network. We show that range-limited centralities obey universal scaling laws for large nonweighted networks. As the computation of traditional centrality measures is costly, this scaling behavior can be exploited to efficiently estimate centralities of nodes and edges for all ranges, including the traditional ones. The scaling behavior can also be exploited to show that the ranking top list of nodes (edges) based on their range-limited centralities quickly freezes as a function of the range, and hence the diameter-range top list can be efficiently predicted. We also show how to estimate the typical largest node-to-node distance for a network of N nodes, exploiting the aforementioned scaling behavior. These observations were made on model networks and on a large social network inferred from cell-phone trace logs (∼5.5 × 10^6 nodes and ∼2.7 × 10^7 edges). Finally, we apply these concepts to efficiently detect the vulnerability backbone of a network (defined as the smallest percolating cluster of the highest betweenness nodes and edges) and illustrate the importance of weight-based centrality measures in weighted networks in detecting such backbones.
Mária Ercsey-Ravasz,
Ryan N. Lichtenwalter
,
Nitesh V. Chawla
, and Zoltán Toroczkai
. "
Range-Limited Centrality Measures in Complex Networks
."
Physical Review E
,
85
(6)
:660–665,
June
2012
.
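The efficient algorithm from the paper is not reproduced here, but the definition of range-limited node betweenness (only shortest paths of at most L hops contribute) can be checked on small graphs with the brute-force sketch below. It assumes networkx and leaves the counts unnormalized.

from itertools import combinations
import networkx as nx

def range_limited_betweenness(g, L):
    bc = dict.fromkeys(g, 0.0)
    for s, t in combinations(g, 2):
        try:
            paths = list(nx.all_shortest_paths(g, s, t))
        except nx.NetworkXNoPath:
            continue
        if len(paths[0]) - 1 > L:          # shortest-path length in hops
            continue
        for path in paths:
            for v in path[1:-1]:           # interior vertices only
                bc[v] += 1.0 / len(paths)
    return bc

g = nx.path_graph(5)                        # 0 - 1 - 2 - 3 - 4
print(range_limited_betweenness(g, L=3))    # the middle node 2 scores highest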
Abstract
An underlying assumption of biomedical informatics is that decisions can be more informed when professionals are assisted by analytical systems. For this purpose, we propose ALIVE, a multi-relational link prediction and visualization environment for the healthcare domain. ALIVE combines novel link prediction methods with a simple user interface and intuitive visualization of data to enhance the decision-making process for healthcare professionals. It also includes a novel link prediction algorithm, MRPF, which outperforms many comparable algorithms on multiple networks in the biomedical domain. ALIVE is one of the first attempts to provide an analytical and visual framework for healthcare analytics, promoting collaboration and sharing of data through ease of use and potential extensibility. We encourage the development of similar tools, which can assist in facilitating successful sharing, collaboration, and a vibrant online community.
Reid A. Johnson
,
Yang Yang
,
Everaldo Aguiar
,
Andrew K. Rider
, and
Nitesh V. Chawla
. "
ALIVE: A Multi-Relational Link Prediction Environment for the Healthcare Domain
."
Proceedings of the 3rd Workshop on Data Mining for Healthcare Management at the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining (DMHM-PAKDD)
,
pp. 36–46,
May
2012
.
Abstract
We introduce the concept of a vertex collocation profile (VCP) for the purpose of topological link analysis and prediction. VCPs provide nearly complete information about the surrounding local structure of embedded vertex pairs. The VCP approach offers a new tool for domain experts to understand the underlying growth mechanisms in their networks and to analyze link formation mechanisms in the appropriate sociological, biological, physical, or other contexts. The same resolution that gives VCP its analytical power also enables it to perform well when used in supervised models to discriminate potential new links. We first develop the theory, mathematics, and algorithms underlying VCP. Then we demonstrate VCP methods performing link prediction competitively with unsupervised and supervised methods across several different network families. We conclude with timing results that introduce the comparative performance of several existing algorithms and the practicability of VCP computations on large networks.
Ryan N. Lichtenwalter
and
Nitesh V. Chawla
. "
Vertex Collocation Profiles: Subgraph Counting for Link Analysis and Prediction
."
Proceedings of 21st International Conference on World Wide Web (WWW)
,
pp. 1019–1028,
April
2012
.
Abstract
The primary focus of machine learning has traditionally been on learning from data assumed to be sufficient and representative of the underlying fixed, yet unknown, distribution. Such restrictions on the problem domain paved the way for development of elegant algorithms with theoretically provable performance guarantees. As is often the case, however, real world problems rarely fit neatly into such restricted models. For instance, class distributions are often skewed, resulting in the "class imbalance" problem. Data drawn from non-stationary distributions is also common in real world applications, resulting in the "concept drift" or "non-stationary learning" problem, which is often associated with streaming data scenarios. Recently, these problems have independently experienced increased research attention; however, the combined problem of addressing all of the above mentioned issues has enjoyed relatively little research. If the ultimate goal of intelligent machine learning algorithms is to be able to address a wide spectrum of real world scenarios, then the need for a general framework for learning from, and adapting to, a non-stationary environment that may introduce imbalanced data can hardly be overstated. In this paper, we first present an overview of each of these challenging areas, followed by a comprehensive review of recent research for developing such a general framework.
T. Ryan Hoens
, Robi Polikar, and
Nitesh V. Chawla
. "
Learning from Streaming Data with Concept Drift and Imbalance: An Overview
."
Progress in Artificial Intelligence (PRAI)
,
1
(1)
:89–101,
April
2012
.
Abstract
Link prediction, i.e., predicting links or interactions between objects in a network, is an important task in network analysis. Although the problem has attracted much attention recently, there are several challenges that have not been addressed so far. First, most existing studies focus only on link prediction in homogeneous networks, where all objects and links belong to the same type. However, in the real world, heterogeneous networks that consist of multi-typed objects and relationships are ubiquitous. Second, most current studies only concern the problem of whether a link will appear in the future but seldom pay attention to the problem of when it will happen. In this paper, we address both issues and study the problem of predicting when a certain relationship will happen in the scenario of heterogeneous networks. First, we extend the link prediction problem to the relationship prediction problem, by systematically defining both the target relation and the topological features, using a meta path-based approach. Then, we directly model the distribution of relationship building time with the use of the extracted topological features. The experiments on citation relationship prediction between authors on the DBLP network demonstrate the effectiveness of our methodology.
Yizhou Sun, Jiawei Han, Charu C. Aggarwal, and
Nitesh V. Chawla
. "
When Will It Happen? Relationship Prediction in Heterogeneous Information Networks
."
Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM)
,
pp. 663–672,
February
2012
.
Abstract
Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).
David A. Cieslak
,
T. Ryan Hoens
,
Nitesh V. Chawla
, and W. Philip Kegelmeyer
. "
Hellinger Distance Decision Trees are Robust and Skew-Insensitive
."
Data Mining and Knowledge Discovery (DMKD)
,
24
(1)
:136–158,
January
2012
.
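The splitting criterion named in the abstract above has a compact form for a binary split: the Hellinger distance between the class-conditional distributions induced by the two branches. The sketch below illustrates it for a single threshold split; the released package at the URL above is the authoritative implementation.

import numpy as np

def hellinger_split_value(x, y, threshold):
    # Hellinger distance between how positives and negatives distribute
    # over the two branches of the split (y: 1 = minority/positive class).
    x, y = np.asarray(x), np.asarray(y)
    pos, neg = (y == 1), (y == 0)
    value = 0.0
    for branch in (x <= threshold, x > threshold):
        p_pos = (branch & pos).sum() / max(pos.sum(), 1)   # fraction of positives in branch
        p_neg = (branch & neg).sum() / max(neg.sum(), 1)   # fraction of negatives in branch
        value += (np.sqrt(p_pos) - np.sqrt(p_neg)) ** 2
    return np.sqrt(value)

# A split that separates the classes well scores near sqrt(2); a useless split scores near 0.
x = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(hellinger_split_value(x, y, threshold=0.5))   # ~1.414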
Abstract
The field of Dataset Shift has received a growing amount of interest in the last few years. The fact that most real-world applications have to cope with some form of shift makes its study highly relevant. The literature on the topic is mostly scattered, and different authors use different names to refer to the same concepts, or use the same name for different concepts. With this work, we attempt to present a unifying framework through the review and comparison of some of the most important works in the literature.
Jose G. Moreno-Torres,
Troy Raeder
, Rocío Alaiz-Rodríguez,
Nitesh V. Chawla
, and Francisco Herrera
. "
A Unifying View on Dataset Shift in Classification
."
Pattern Recognition
,
45
(1)
:521–530,
January
2012
.
Abstract
Learning in non-stationary environments is an increasingly important problem in a wide variety of real-world applications. In non-stationary environments data arrives incrementally; however, the underlying generating function may change over time. While there is a variety of research into such environments, the research mainly consists of detecting concept drift (and then relearning the model), or developing classifiers which adapt to drift incrementally. We introduce Heuristic Updatable Weighted Random Subspaces (HUWRS), a new technique based on the Random Subspace Method that detects drift in individual features via the use of Hellinger distance, a distributional divergence metric. Through the use of subspaces, HUWRS allows for a more fine-grained approach to dealing with concept drift which is robust to feature drift even without class labels. We then compare our approach to two state-of-the-art algorithms, concluding that for a wide range of datasets and window sizes HUWRS outperforms the other methods.
T. Ryan Hoens
,
Nitesh V. Chawla
, and Robi Polikar
. "
Heuristic Updatable Weighted Random Subspaces for Non-Stationary Environments
."
Proceedings of the IEEE International Conference on Data Mining (ICDM)
,
pp. 241–250,
December
2011
.
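The per-feature drift signal that HUWRS relies on is the Hellinger distance between a feature's distribution in a reference window and in the current window. The sketch below computes that distance from histograms; the bin count and the normalization by sqrt(2) are illustrative choices, not taken from the paper.

import numpy as np

def hellinger_distance(a, b, bins=10):
    # Hellinger distance between the empirical distributions of two samples, in [0, 1].
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 5_000)
drifted = rng.normal(1.5, 1.0, 5_000)                     # the feature's distribution shifted
print(hellinger_distance(reference, reference[:2_500]))   # small: no drift
print(hellinger_distance(reference, drifted))             # large: drift detected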
Abstract
Enterprises of all sizes are dealing with an increasing amount of electronic data that is essential to their business operations. Since there are many options when moving beyond a single server, organizations must consider a range of investment options such as purchasing a small cluster of machines or leasing capacity from cloud providers. Given the complexity (and cost) of configuring larger scale systems (such as distributed databases), we believe that revisiting more generic batch computing systems provides several advantages that are worth considering and benchmarking. For example, leveraging low cost servers, avoiding the data transformation process, and the time necessary to load and move data to a database are important considerations when selecting a workflow or technology. This paper presents benchmark results that compare MySQL to basic Unix tools on Condor and Oracle Grid Engine (OGE) supported by distributed filesystems for high throughput access to data using a real world data set. The benchmark results should be largely generalizable to other business analytic tasks. These results should aid information technology (IT) managers facing difficult decisions on which software stack to utilize for large, dynamic, datasets where database setup and loading may not be feasible within the time constraints imposed by business needs.
Nathan Regola
and
Nitesh V. Chawla
. "
Small Compute Clusters for Large-Scale Data Analysis
."
Proceedings of the International Conference on Information Technology, Systems and Management (ITSM)
,
December
2011
.
Abstract
Online social networks (OSNs) have created new and exciting ways to connect and share information. Perhaps no site has had a more profound effect on information exchange than Twitter.com. In this paper, we study large-scale graph properties and lesser-studied local graph structures of the explicit social network and the implicit retweet network in order to better understand the relationship between socialization and tweeting behaviors. In particular, we first explore the interplay between the social network and user tweet topics and offer evidence that suggests that users who are close in the social graph tend to tweet about similar topics. We then analyze the implicit retweet network and find highly unreciprocal links and unbalanced triads. We also explain the effects of these structural patterns on information diffusion by analyzing and visualizing how URLs tend to be tweeted and retweeted. Finally, given our analyses of the social network and the retweet network, we provide some insights into the relationships between these two networks.
Jake T. Lussier
and
Nitesh V. Chawla
. "
Network Effects on Tweeting
."
Discovery Science, Lecture Notes in Computer Science
,
pp. 209–220,
October
2011
.
Abstract
Under what conditions is an edge present in a social network at time t likely to decay or persist by some future time t + Δ(t)? Previous research addressing this issue suggests that the network range of the people involved in the edge, the extent to which the edge is embedded in a surrounding structure, and the age of the edge all play a role in edge decay. This paper uses weighted data from a large-scale social network built from cell-phone calls in an 8-week period to determine the importance of edge weight for the decay/persistence process. In particular, we study the relative predictive power of directed weight, embeddedness, newness, and range (measured as outdegree) with respect to edge decay and assess the effectiveness with which a simple decision tree and logistic regression classifier can accurately predict whether an edge that was active in one time period continues to be so in a future time period. We find that directed edge weight, weighted reciprocity and time-dependent measures of edge longevity are highly predictive of whether we classify an edge as persistent or decayed, relative to the other types of factors at the dyad and neighborhood level.
Troy Raeder
, Omar Lizardo, David Hachen, and
Nitesh V. Chawla
. "
Predictors of Short-Term Decay of Cell Phone Contacts in a Large-Scale Communication Network
."
Social Networks
,
33
(4)
:245–257,
October
2011
.
Abstract
LPmade is a complete cross-platform software solution for multi-core link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined procedure for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make architecture that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the procedure for creating multivariate supervised link prediction models with a version of WEKA modified to operate effectively on extremely large data sets. With mere minutes of manual work, one may start with a raw stream of records representing a network and progress through hundreds of steps to generate plots, gigabytes or terabytes of output, and actionable or publishable results.
Ryan N. Lichtenwalter and Nitesh V. Chawla. "LPmade: Link Prediction Made Easy." Journal of Machine Learning Research (JMLR), 12(Aug):2489–2492, August 2011.
Abstract
Various clustering methods have been applied to climate, ecological, and other environmental datasets, for example to define climate zones, automate land-use classification, and similar tasks. Measuring the "goodness" of such clusters is generally application-dependent and highly subjective, often requiring domain expertise and/or validation with field data (which can be costly or even impossible to acquire). Here we focus on one particular task: the extraction of ocean climate indices from observed climatological data. In this case, it is possible to quantify the relative performance of different methods. Specifically, we propose to extract indices with complex networks constructed from climate data, which have been shown to effectively capture the dynamical behavior of the global climate system, and compare their predictive power to candidate indices obtained using other popular clustering methods. Our results demonstrate that network-based clusters are statistically significantly better predictors of land climate than any other clustering method, which could lead to a deeper understanding of climate processes and complement physics-based climate models.
Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly. "Comparing Predictive Power in Climate Data: Clustering Matters." Advances in Spatial and Temporal Databases (SSTD), pp. 39–55, August 2011.
Abstract
Many important real-world systems, modeled naturally as complex networks, have heterogeneous interactions and complicated dependency structures. Link prediction in such networks must model the influences between heterogeneous relationships and distinguish the formation mechanisms of each link type, a task which is beyond the simple topological features commonly used to score potential links. In this paper, we introduce a novel probabilistically weighted extension of the Adamic/Adar measure for heterogeneous information networks, which we use to demonstrate the potential benefits of diverse evidence, particularly in cases where homogeneous relationships are very sparse. However, we also expose some fundamental flaws of traditional a priori link prediction. In accordance with previous research on homogeneous networks, we further demonstrate that a supervised approach to link prediction can enhance performance and is easily extended to the heterogeneous case. Finally, we present results on three diverse, real-world heterogeneous information networks and discuss the trends and tradeoffs of supervised and unsupervised link prediction in a multi-relational setting.
Darcy A. Davis, Ryan N. Lichtenwalter, and Nitesh V. Chawla. "Multi-Relational Link Prediction in Heterogeneous Information Networks." Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 281–288, July 2011.
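For orientation, a minimal sketch of the plain, homogeneous Adamic/Adar score that the paper above extends; the function name and toy graph are illustrative, and the weighted heterogeneous variant itself is not reproduced here.

# Minimal sketch of the plain homogeneous Adamic/Adar score; illustrative only.
import math
import networkx as nx

def adamic_adar(G, u, v):
    # Sum 1/log(deg(w)) over common neighbors w of u and v; a common neighbor
    # with many other contacts contributes less evidence for the link (u, v).
    return sum(1.0 / math.log(G.degree(w))
               for w in nx.common_neighbors(G, u, v)
               if G.degree(w) > 1)

G = nx.karate_club_graph()
print(adamic_adar(G, 0, 33))   # score for one candidate node pair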
Abstract
With the rise of network science as an exciting interdisciplinary research topic, efficient graph algorithms are in high demand. Problematically, many such algorithms measuring important properties of networks have asymptotic lower bounds that are quadratic, cubic, or higher in the number of vertices. For analysis of social networks, transportation networks, communication networks, and a host of others, computation is intractable. In these networks computation in serial fashion requires years or even decades. Fortunately, these same computational problems are often naturally parallel. We present here the design and implementation of a master-worker framework for easily computing such results in these circumstances. The user needs only to supply two small fragments of code describing the fundamental kernel of the computation. The framework automatically divides and distributes the workload and manages completion using an arbitrary number of heterogeneous computational resources. In practice, we have used thousands of machines and observed commensurate speedups. Writing only 31 lines of standard C++ code, we computed betweenness centrality on a network of 4.7M nodes in 25 hours.
Ryan N. Lichtenwalter and Nitesh V. Chawla. "DisNet: A Framework for Distributed Graph Computation." Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 264–270, July 2011.
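The divide-and-merge idea behind this kind of framework can be sketched in a few lines: betweenness centrality decomposes over source vertices, so chunks of sources can be handed to workers and the partial scores summed. The sketch below is an assumption-laden stand-in that uses Python's multiprocessing pool rather than DisNet's distributed C++ master-worker machinery.

# Sketch of source-wise decomposition of betweenness centrality; multiprocessing
# stands in for a distributed master-worker pool (illustrative only).
from multiprocessing import Pool
import networkx as nx

def partial_betweenness(G, sources):
    # Contribution of shortest paths starting at `sources` to every node's score.
    return nx.betweenness_centrality_subset(G, sources, list(G), normalized=False)

def parallel_betweenness(G, n_workers=4):
    nodes = list(G)
    chunks = [nodes[i::n_workers] for i in range(n_workers)]   # round-robin split
    with Pool(n_workers) as pool:
        partials = pool.starmap(partial_betweenness, [(G, c) for c in chunks])
    total = {n: 0.0 for n in G}
    for part in partials:                                      # merge step
        for n, score in part.items():
            total[n] += score
    return total

if __name__ == "__main__":
    G = nx.erdos_renyi_graph(300, 0.05, seed=1)
    print(max(parallel_betweenness(G).values()))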
Abstract
Community detection or cluster detection in networks is a well-studied, albeit hard, problem. Given the scale and complexity of modern-day social networks, detecting "reasonable" communities is an even harder problem. Since the first use of the k-means algorithm in the 1960s, many community detection algorithms have been invented, most of which are developed with specific goals in mind, and the idea of detecting "meaningful" communities varies widely from one algorithm to another. With the increasing number of community detection algorithms has come a number of evaluation measures and objective functions, such as modularity and internal density. In this paper we divide methods of measurement into two categories, according to whether or not they rely on ground truth. Our work aims to answer whether these commonly used objective functions are consistent with the real performance of community detection algorithms across a number of homogeneous and heterogeneous networks. Seven representative algorithms are compared under various performance metrics and on various real-world social networks.
Yang Yang, Yizhou Sun, Saurav Pandit, Nitesh V. Chawla, and Jiawei Han. "Is Objective Function the Silver Bullet? A Case Study of Community Detection Algorithms on Social Networks." Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 394–397, July 2011.
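As a concrete illustration of scoring a partition under the objective functions named above, a small sketch that computes modularity and internal edge density for one detected community structure; the graph and the detection algorithm are placeholders, not the seven algorithms studied in the paper.

# Sketch: score one detected partition under two common objective functions,
# modularity and internal edge density (placeholders for the paper's full study).
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()
parts = community.greedy_modularity_communities(G)          # one detection algorithm

print("modularity:", community.modularity(G, parts))

for i, c in enumerate(parts):
    sub = G.subgraph(c)
    n = sub.number_of_nodes()
    density = sub.number_of_edges() / (n * (n - 1) / 2) if n > 1 else 0.0
    print(f"community {i}: size={n}, internal density={density:.3f}")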
Abstract
Availability of electronic health care records is unlocking the potential for novel studies on understanding and modeling the disease co-morbidities based on both phenotypic and genetic data. Moreover, the insurgence of increasingly reliable phenotypic data can aid further studies on investigating the potential genetic links among diseases. The goal is to create a feedback loop where computational tools guide and facilitate research, leading to improved biological knowledge and clinical standards, which in turn should generate better data. We build and analyze disease interaction networks based on data collected from previous genetic association studies and patient medical histories, spanning over 12 years, acquired from a regional hospital. By exploring both individual and combined interactions among these two levels of disease data, we provide novel insight into the interplay between genetics and clinical realities. Our results show a marked difference between the well-defined structure of genetic relationships and the chaotic co-morbidity network, but also highlight clear interdependencies. We demonstrate the power of these dependencies by proposing a novel multi-relational link prediction method, showing that disease co-morbidity can enhance our currently limited knowledge of genetic association. Furthermore, our methods for integrated networks of diverse data are widely applicable and can provide novel advances for many problems in systems biology and personalized medicine.
Darcy A. Davis and Nitesh V. Chawla. "Exploring and Exploiting Disease Interactions from Multi-Relational Gene and Phenotype Networks." PLoS ONE, 6(7):e22670, July 2011.
Abstract
The pattern of interactions between individuals in a population contains implicitly within them a remarkable amount of information. This information, if extracted, could be of significant importance in several realms such as containing the spread of disease, understanding information flow in social systems and predicting likely future interactions. A popular method of discovering structure in networks is through community detection which attempts to capture the extent to which that network is different from a random network. However, communities are not very well defined for time-varying networks. In this paper, we introduce the notion of spatio-temporal communities that attempts to capture the structure in spatial connections as well as temporal changes in a network. We illustrate the notion via several examples and list the challenges in effectively discovering spatio-temporal communities. For example, such communities are lost if the temporal interactions are aggregated in a single weighted network since the concurrency information is lost. We present an approach that first extracts concurrency information via node-clustering on each snapshot. Each node is then assigned a vector of community memberships over time, which is then used to group nodes into overlapping communities via recently introduced link clustering techniques. However we measure similarity (of nodes and edges) based on concurrence, i.e. when they existed, if they existed together. We call our approach the co-community algorithm. We validate our approach using several real-world data sets spanning multiple contexts.
Saurav Pandit, Yang Yang, Vikas Kawadia, Sameet Sreenivasan, and Nitesh V. Chawla. "Detecting Communities in Time-Evolving Proximity Networks." Proceedings of the IEEE Network Science Workshop (NSW), pp. 173–179, June 2011.
Abstract
Climate change is an issue of growing economic, social, and political concern. Continued rise in the average temperatures of the Earth could lead to drastic climate change or an increased frequency of extreme events, which would negatively affect agriculture, population, and global health. One way of studying the dynamics of the Earth's changing climate is by attempting to identify regions that exhibit similar climatic behavior in terms of long-term variability. Climate networks have emerged as a strong analytics framework for both descriptive analysis and predictive modeling of the emergent phenomena. Previously, the networks were constructed using only one measure of similarity, namely the (linear) Pearson cross correlation, and were then clustered using a community detection algorithm. However, nonlinear dependencies are known to exist in climate, which raises the question of whether more complex correlation measures are able to capture any such relationships. In this paper, we present a systematic study of different univariate measures of similarity and compare how each affects both the network structure as well as the predictive power of the clusters.
Alex Pelan, Karsten Steinhaeuser, Nitesh V. Chawla, Dilkushi A. de Alwis Pitts, and Auroop R. Ganguly. "Empirical Comparison of Correlation Measures and Pruning Levels in Complex Networks Representing the Global Climate System." Proceedings of the IEEE Symposium Series on Computational Intelligence and Data Mining (CIDM), pp. 239–245, April 2011.
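A bare-bones sketch of the construction being compared: correlate pairs of grid-point time series, prune edges below a threshold, and keep the resulting graph. The synthetic series, Pearson measure, and 0.5 threshold are illustrative stand-ins for the measures and pruning levels the paper evaluates.

# Sketch: build a climate-style network by correlating time series at grid points
# and pruning weak edges; data and threshold are illustrative.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
series = rng.standard_normal((50, 240))      # 50 grid points x 240 monthly values

corr = np.corrcoef(series)                   # 50 x 50 Pearson correlation matrix
threshold = 0.5

G = nx.Graph()
G.add_nodes_from(range(corr.shape[0]))
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if abs(corr[i, j]) >= threshold:     # prune: keep only strong links
            G.add_edge(i, j, weight=corr[i, j])

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges after pruning")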
Abstract
The field of market basket analysis, the search for meaningful associations in customer purchase data, is one of the oldest areas of data mining. The typical solution involves the mining and analysis of association rules, which take the form of statements such as "people who buy diapers are likely to buy beer." It is well-known, however, that typical transaction datasets can support hundreds or thousands of obvious association rules for each interesting rule, and filtering through the rules is a non-trivial task. One may use an interestingness measure to quantify the usefulness of various rules, but there is no single agreed-upon measure and different measures can result in very different rankings of association rules. In this work, we take a different approach to mining transaction data. By modeling the data as a product network, we discover expressive communities (clusters) in the data, which can then be targeted for further analysis. We demonstrate that our network-based approach can concisely isolate influence among products, mitigating the need to search through massive lists of association rules. We develop an interestingness measure for communities of products and show that it isolates useful, actionable communities. Finally, we build upon our experience with product networks to propose a comprehensive analysis strategy by combining both traditional and network-based techniques. This framework is capable of generating insights that are difficult to achieve with traditional analysis methods.
Troy Raeder and Nitesh V. Chawla. "Market Basket Analysis with Networks." Social Network Analysis and Mining (SNAM), 1(2):97–113, April 2011.
Abstract
Clustering is a common step in the analysis of microarray data. Microarrays enable simultaneous high-throughput measurement of the expression level of genes. These data can be used to explore relationships between genes and can guide development of drugs and further research. A typical first step in the analysis of these data is to use an agglomerative hierarchical clustering algorithm on the correlation between all gene pairs. While this simple approach has been successful, it fails to identify many genetic interactions that may be important for drug design and other important applications. We present an approach to the clustering of expression data that utilizes known gene-gene interaction data to improve results for already commonly used clustering techniques. The approach creates an ensemble similarity measure that can be used as input to common clustering techniques and provides results with increased biological significance while not altering the clustering approach at all.
Andrew K. Rider, Geoffrey H. Siwo, Nitesh V. Chawla, Michael T. Ferdig, and Scott J. Emrich. "A Supervised Learning Approach to the Unsupervised Clustering of Genes." Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 322–328, December 2010.
Abstract
The prevailing approach to evaluating classifiers in the machine learning community involves comparing the performance of several algorithms over a series of usually unrelated data sets. However, beyond this there are many dimensions along which methodologies vary wildly. We show that, depending on the stability and similarity of the algorithms being compared, these sometimes-arbitrary methodological choices can have a significant impact on the conclusions of any study, including the results of statistical tests. In particular, we show that performance metrics and data sets used, the type of cross-validation employed, and the number of iterations of cross-validation run have a significant, and often predictable, effect. Based on these results, we offer a series of recommendations for achieving consistent, reproducible results in classifier performance comparisons.
Troy Raeder, T. Ryan Hoens, and Nitesh V. Chawla. "Consequences of Variability in Classifier Performance Estimates." Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), pp. 421–430, December 2010.
Abstract
The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of "climate networks" have useful interpretation within the domain. Further, we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.
Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly. "Complex Networks as a Unified Framework for Descriptive Analysis and Predictive Modeling in Climate Science." Statistical Analysis and Data Mining (SADM), 4(5):497–511, December 2010.
Abstract
One of the concerns patients have when confronted with a medical condition is which physician to trust. Any recommendation system that seeks to fulfill this need must ensure any sensitive medical information collected by the system is properly secured. In this paper we codify these privacy concerns in a privacy-friendly framework and present two architectures that realize it: the Secure Processing Architecture (SPA) and the Anonymous Contributions Architecture (ACA). In SPA, patients submit their ratings in a protected form without revealing any information about their data, and the computation of recommendations proceeds over the protected data using secure multi-party computation techniques. In ACA, patients submit their ratings in the clear, but no link between a submission and patient data can be made. We discuss various aspects of both architectures including techniques for ensuring reliability of computed recommendations and system performance, and provide their comparison.
T. Ryan Hoens, Marina Blanton, and Nitesh V. Chawla. "Reliable Medical Recommendation Systems with Patient Privacy." Proceedings of the 1st ACM International Health Informatics Symposium (IHI), pp. 173–182, November 2010.
Abstract
Networks have been used to describe and model a wide range of complex systems, both natural as well as man-made. One particularly interesting application in the earth sciences is the use of complex networks to represent and study the global climate system. In this paper, we motivate this general approach, explain the basic methodology, report on the state of the art (including our contributions), and outline open questions and opportunities for future research.
Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly. "Complex Networks in Climate Science: Progress, Opportunities and Challenges." Proceedings of the IEEE Conference on Intelligent Data Understanding (CIDU), pp. 16–26, October 2010.
Abstract
Background: Complexity and noise in expression quantitative trait loci (eQTL) studies make it difficult to distinguish potential regulatory relationships among the many interactions. The predominant method of identifying eQTLs finds associations that are significant at a genome-wide level. The vast number of statistical tests carried out on these data make false negatives very likely. Corrections for multiple testing error render genome-wide eQTL techniques unable to detect modest regulatory effects. We propose an alternative method to identify eQTLs that builds on traditional approaches. In contrast to genome-wide techniques, our method determines the significance of an association between an expression trait and a locus with respect to the set of all associations to the expression trait. The use of this specific information facilitates identification of expression traits that have an expression profile that is characterized by a single exceptional association to a locus. Our approach identifies expression traits that have exceptional associations regardless of the genome-wide significance of those associations. This property facilitates the identification of possible false negatives for genome-wide significance. Further, our approach has the property of prioritizing expression traits that are affected by few strong associations. Expression traits identified by this method may warrant additional study because their expression level may be affected by targeting genes near a single locus.

Results: We demonstrate our method by identifying eQTL hotspots in Plasmodium falciparum (malaria) and Saccharomyces cerevisiae (yeast). We demonstrate the prioritization of traits with few strong genetic effects through Gene Ontology (GO) analysis of Yeast. Our results are strongly consistent with results gathered using genome-wide methods and identify additional hotspots and eQTLs.

Conclusions: New eQTLs and hotspots found with this method may represent regions of the genome or biological processes that are controlled through few relatively strong genetic interactions. These points of interest warrant experimental investigation.
Andrew K. Rider, Geoffrey H. Siwo, Nitesh V. Chawla, Michael T. Ferdig, and Scott J. Emrich. "A Statistical Approach to Finding Overlooked Genetic Associations." BMC Bioinformatics, 11:526, October 2010.
Abstract
Managing complex enterprise networks requires understanding at a finer granularity than traditional network monitoring provides. The ability to correlate and visualize the dynamics and inter-relationships among various network components such as hosts, users, and applications is non-trivial. In this paper, we propose a visualization approach based on the hierarchical structure of similarity/difference visualization in the context of heterogeneous graphs. The concept of hierarchical visualization starts with the evolution of inter-graph states, adapts to the visualization of intra-graph clustering, and concludes with the visualization of similarity between individual nodes. The visualization tool we developed quantifies and presents these important changes and dynamics, essential to network operators, in a visually appealing and highly interactive manner. Through novel graph construction and transformation, such as network connectivity graphs, MDS graphs, bipartite graphs, and similarity graphs, we demonstrate how similarity and dynamics can be effectively visualized to provide insight into network understanding.
Qi Liao, Aaron D. Striegel, and Nitesh V. Chawla. "Visualizing Graph Dynamics and Similarity for Enterprise Network Security and Management." Proceedings of the 7th ACM International Symposium on Visualization for Cyber Security (VizSec), pp. 34–45, September 2010.
Abstract
With the proliferation of internet-based social networks into our lives, new mechanisms to control the release and use of personal data are required. As a step toward this goal, we develop a recommendation system which protects the privacy of user answers while allowing them to learn an aggregate weighted average of ratings. Due to the use of social network connections, the querier obtains a more relevant and trustworthy result than what generic anonymous recommendation systems can provide, while at the same time preserving user privacy. We also give experimental performance results for our solution and several recently developed secure computation techniques, which is of independent interest.
T. Ryan Hoens, Marina Blanton, and Nitesh V. Chawla. "A Private and Reliable Recommendation System for Social Networks." Proceedings of the 2nd IEEE International Conference on Social Computing (SocialCom), pp. 816–825, August 2010.
Abstract
Learning in a non-stationary environment and in the presence of class imbalance has been receiving more recognition from the computational intelligence community, but little work has been done to create an algorithm or a framework that can handle both issues simultaneously. We have recently introduced a new member to the Learn++ family of algorithms, Learn++.NSE, which is designed to track non-stationary environments. However, this algorithm does not work well when there is class imbalance as it has not been designed to handle this problem. On the other hand, SMOTE—a popular algorithm that can handle class imbalance—is not designed to learn in nonstationary environments because it is a method of oversampling the data. In this work we describe and present preliminary results for integrating SMOTE and Learn++.NSE to create an algorithm that is robust to learning in a nonstationary environment and under class imbalance.
Gregory Ditzler, Nitesh V. Chawla, and Robi Polikar. "An Incremental Learning Algorithm for Nonstationary Environments and Class Imbalance." Proceedings of the 20th International Conference on Pattern Recognition (ICPR), pp. 2997–3000, August 2010.
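A compact sketch of the SMOTE-style interpolation step referenced above, in which synthetic minority points are placed on line segments toward minority nearest neighbors; it is not the integrated Learn++.NSE variant, and the data and parameters are invented for illustration.

# Sketch of SMOTE-style oversampling: synthesize minority points by interpolating
# toward minority nearest neighbors. Illustrative only; not the Learn++.NSE hybrid.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: a point is its own NN
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # pick a minority sample
        j = rng.choice(idx[i][1:])                        # one of its k minority NNs
        gap = rng.random()                                # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

rng = np.random.default_rng(1)
X_min = rng.normal(loc=2.0, size=(20, 3))                 # 20 minority examples
X_syn = smote_like(X_min, n_new=40)                       # 200% oversampling
print(X_syn.shape)                                        # (40, 3)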
Abstract
This paper examines important factors for link prediction in networks and provides a general, high-performance framework for the prediction task. Link prediction in sparse networks presents a significant challenge due to the inherent disproportion of links that can form to links that do form. Previous research has typically approached this as an unsupervised problem. While this is not the first work to explore supervised learning, many factors significant in influencing and guiding classification remain unexplored. In this paper, we consider these factors by first motivating the use of a supervised framework through a careful investigation of issues such as network observational period, generality of existing methods, variance reduction, topological causes and degrees of imbalance, and sampling approaches. We also present an effective flow-based predicting algorithm, offer formal bounds on imbalance in sparse network link prediction, and employ an evaluation method appropriate for the observed imbalance. Our careful consideration of the above issues ultimately leads to a completely general framework that outperforms unsupervised link prediction methods by more than 30% AUC.
Ryan N. Lichtenwalter, Jake T. Lussier, and Nitesh V. Chawla. "New Perspectives and Methods in Link Prediction." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 243–252, July 2010.
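A rough sketch of the supervised formulation discussed above: topological features for node pairs, a standard classifier, and AUC evaluation. It is not the paper's flow-based predictor or full framework, and the graph, features, and negative sampling are illustrative simplifications.

# Sketch of supervised link prediction: pair features, a classifier, AUC scoring.
import random
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def pair_features(G, u, v):
    cn = len(list(nx.common_neighbors(G, u, v)))
    nu, nv = set(G[u]), set(G[v])
    jaccard = len(nu & nv) / len(nu | nv) if nu | nv else 0.0
    return [cn, jaccard, G.degree(u) + G.degree(v)]

random.seed(0)
G = nx.barabasi_albert_graph(300, 3, seed=0)
pos = list(G.edges())                                     # existing links = positives
nodes = list(G)
neg = set()
while len(neg) < len(pos):                                # sample missing links = negatives
    u, v = random.sample(nodes, 2)
    if not G.has_edge(u, v):
        neg.add((u, v))

# NOTE: a careful evaluation would compute features on an earlier snapshot so that
# the positive edges themselves do not leak into the features; skipped here.
X = np.array([pair_features(G, u, v) for u, v in pos + list(neg)])
y = np.array([1] * len(pos) + [0] * len(neg))
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
print("AUC:", roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))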
Abstract
One of the more challenging problems faced by the data mining community is that of imbalanced datasets. In imbalanced datasets one class (sometimes severely) outnumbers the other class, causing correct, and useful predictions to be difficult to achieve. In order to combat this, many techniques have been proposed, especially centered around sampling methods. In this paper we propose an ensemble framework that combines random subspaces with sampling to overcome the class imbalance problem. We then experimentally verify this technique on a wide variety of datasets. We conclude by analyzing the performance of the ensembles, and showing that, overall, our technique provides a significant improvement.
T. Ryan Hoens and Nitesh V. Chawla. "Generating Diverse Ensembles to Counter the Problem of Class Imbalance." Advances in Knowledge Discovery and Data Mining (PAKDD), pp. 488–499, June 2010.
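A minimal sketch of the core idea: each ensemble member is trained on a random feature subspace of a rebalanced (here, randomly undersampled) sample, and the members' probabilities are averaged. The data, base learner, and parameter choices are illustrative, not the paper's exact configuration.

# Sketch: random subspaces combined with sampling to counter class imbalance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

members = []
for _ in range(25):
    feats = rng.choice(Xtr.shape[1], size=10, replace=False)        # random subspace
    minority = np.where(ytr == 1)[0]
    majority = rng.choice(np.where(ytr == 0)[0], size=len(minority), replace=False)
    idx = np.concatenate([minority, majority])                      # balanced sample
    clf = DecisionTreeClassifier(random_state=0).fit(Xtr[np.ix_(idx, feats)], ytr[idx])
    members.append((feats, clf))

proba = np.mean([clf.predict_proba(Xte[:, feats])[:, 1] for feats, clf in members], axis=0)
print("ensemble AUC:", roc_auc_score(yte, proba))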
Abstract
Consider the scenario where information about a large network is distributed across several different parties or commercial entities. Intuitively, we would expect that the aggregate network formed by combining the individual private networks would be a more faithful representation of the network phenomenon as a whole. However, privacy preservation of the individual networks becomes a mandate. Thus, it would be useful, given several portions of an underlying network p1 ... pn, to securely compute the aggregate of all the networks pi in a manner such that no party learns information about any other party's network. In this work, we propose a novel privacy preservation protocol for the non-trivial case of weighted networks. The protocol is secure against malicious adversaries.
Troy Raeder, Marina Blanton, Nitesh V. Chawla, and Keith Frikken. "Privacy-Preserving Network Aggregation." Advances in Knowledge Discovery and Data Mining (PAKDD), pp. 198–207, June 2010.
Abstract
OBJECTIVE: The goal was to examine nursing team structure and its relationship with family satisfaction.

METHODS: We used electronic health records to create patient-based, 1-mode networks of nursing handoffs. In these networks, nurses were represented as nodes and handoffs as edges. For each patient, we calculated network statistics including team size and diameter, network centrality index, proportion of newcomers to care teams according to day of hospitalization, and a novel measure of the average number of shifts between repeat caregivers, which was meant to quantify nursing continuity. We assessed parental satisfaction by using a standardized survey.

RESULTS: Team size increased with increasing length of stay. At 2 weeks of age, 50% of shifts were staffed by a newcomer nurse who had not previously cared for the index patient. The patterns of newcomers to teams did not differ according to birth weight. When the population was dichotomized according to median mean repeat caregiver interval value, increased reports of problems with nursing care were seen with less-consistent staffing by familiar nurses. This relationship persisted after controlling for factors including birth weight, length of stay, and team size.

CONCLUSIONS: Family perceptions of nursing care quality are more strongly associated with team structure and the sequence of nursing participation than with team size. Objective measures of health care team structure and function can be examined by applying network analytic techniques to information contained in electronic health records.
James E. Gray, Darcy A. Davis, DeWayne M. Pursley, Jane E. Smallcomb, Alon Geva, and Nitesh V. Chawla. "Network Analysis of Team Structure in the Neonatal Intensive Care Unit." Journal of the American Academy of Pediatrics, 125(6):e1460–e1467, June 2010.
Abstract
The monumental cost of health care, especially for chronic disease treatment, is quickly becoming unmanageable. This crisis has motivated the drive towards preventative medicine, where the primary concern is recognizing disease risk and taking action at the earliest signs. However, universal testing is neither time nor cost efficient. We propose CARE, a Collaborative Assessment and Recommendation Engine, which relies only on a patient's medical history using ICD-9-CM codes in order to predict future disease risks. CARE uses collaborative filtering methods to predict each patient's greatest disease risks based on their own medical history and that of similar patients. We also describe an Iterative version, ICARE, which incorporates ensemble concepts for improved performance. Also, we apply time-sensitive modifications which make the CARE framework practical for realistic long-term use. These novel systems require no specialized information and provide predictions for medical conditions of all kinds in a single run. We present experimental results on a large Medicare dataset, demonstrating that CARE and ICARE perform well at capturing future disease risks.
Darcy A. Davis, Nitesh V. Chawla, Nicholas A. Christakis, and Albert-László Barabási. "Time to CARE: A Collaborative Filtering Engine for Practical Disease Prediction." Data Mining and Knowledge Discovery (DMKD), 20(3):388–415, May 2010.
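To make the collaborative-filtering intuition concrete, a toy sketch that scores a patient's unseen diagnoses by similarity-weighted votes from patients with overlapping histories; the binary matrix is invented, and CARE's clustering and iterative (ICARE) refinements are omitted.

# Sketch of the collaborative-filtering idea: similarity-weighted votes over a
# toy binary patient x diagnosis matrix (not ICD-9 data; illustrative only).
import numpy as np

history = np.array([            # rows = patients, cols = diagnosis codes (1 = present)
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0],
], dtype=float)

def cosine_sim(M):
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return (M @ M.T) / (norms @ norms.T)

def risk_scores(history, patient):
    sim = cosine_sim(history)[patient].copy()
    sim[patient] = 0.0                                  # ignore self-similarity
    scores = sim @ history / (sim.sum() + 1e-9)         # weighted vote per diagnosis
    scores[history[patient] == 1] = 0.0                 # only rank unseen diagnoses
    return scores

print(np.argsort(risk_scores(history, patient=1))[::-1])  # diagnoses ranked by risk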
Abstract
We compare and evaluate different metrics for community structure in networks. In this context we also discuss a simple approach to community detection, and show that it performs as well as other methods, but at lower computational complexity.
Karsten Steinhaeuser and Nitesh V. Chawla. "Identifying and Evaluating Community Structure in Complex Networks." Pattern Recognition Letters, 31(5):413–421, April 2010.
Abstract
OBJECTIVES: Systemic sclerosis (SSc) is a multiorgan disease with high mortality rates. Several clinical features have been associated with poor survival in different populations of SSc patients, but no clear and reproducible prognostic model to assess individual survival prediction in scleroderma patients has ever been developed.

METHODS: We used Cox regression and three data mining-based classifiers (Naive Bayes Classifier [NBC], Random Forests [RND-F] and logistic regression [Log-Reg]) to develop a robust and reproducible 5-year prognostic model. All the models were built and internally validated by means of 5-fold cross-validation on a population of 558 Italian SSc patients. Their predictive ability and capability of generalisation was then tested on an independent population of 356 patients recruited from 5 external centres and finally compared to the predictions made by two SSc domain experts on the same population.

RESULTS: The NBC outperformed the Cox-based classifier and the other data mining algorithms after internal cross-validation (area under the receiver operating characteristic curve, AUROC: NBC=0.759; RND-F=0.736; Log-Reg=0.754 and Cox=0.724). The NBC also had a remarkably better trade-off between sensitivity and specificity (e.g. balanced accuracy, BA) than the Cox-based classifier when tested on an independent population of SSc patients (BA: NBC=0.769, Cox=0.622). The NBC was also superior to domain experts in predicting 5-year survival in this population (AUROC=0.829 vs. AUROC=0.788 and BA=0.769 vs. BA=0.67).

CONCLUSIONS: We provide a model to make consistent 5-year prognostic predictions in SSc patients. Its internal validity, as well as capability of generalisation and reduced uncertainty compared to human experts support its use at bedside. Available at: http://www.nd.edu/~nchawla/survival.xls.
Lorenzo Beretta, Alessandro Santaniello, Francesca Cappiello, Nitesh V. Chawla, Madelon C. Vonk, Patricia E. Carreira, Yannick Allanore, Delia A. Popa-Diaconu, Marta Cossu, Francesca Bertolotti, Gianfranco Ferraccioli, Antonino Mazzone, and Rafaella Scorza. "Development of a Five-Year Mortality Model in Systemic Sclerosis Patients by Different Analytical Approaches." Clinical & Experimental Rheumatology (CER), 28(2):S18–27, March 2010.
Abstract
Knowledge-sharing online social networks are becoming increasingly pervasive and popular. While the user-to-user interactions in these networks have received substantial attention, the consumption of user generated content has not been studied extensively. In this work, we use data gathered from digg.com to present novel findings and draw important sociological conclusions regarding the intimate relationship between consumption and social networking. We first demonstrate that individuals' consumption habits influence their friend networks, consistent with the concept of homophily. We then show that one's social network can also influence the consumption of a submission through the activation of an extended friend network. Finally, we investigate the level of reciprocity, or balance, in the network and uncover relationships that are significantly less balanced than expected.
Jake T. Lussier, Troy Raeder, and Nitesh V. Chawla. "User Generated Content Consumption and Social Networking in Knowledge-Sharing OSNs." Advances in Social Computing, pp. 228–237, March 2010.
Abstract
We propose a new decision tree algorithm, Class Confidence Proportion Decision Tree (CCPDT), which is robust and insensitive to class distribution and generates rules which are statistically significant. In order to make decision trees robust, we begin by expressing Information Gain, the metric used in C4.5, in terms of confidence of a rule. This allows us to immediately explain why Information Gain, like confidence, results in rules which are biased towards the majority class. To overcome this bias, we introduce a new measure, Class Confidence Proportion (CCP), which forms the basis of CCPDT. To generate rules which are statistically significant we design a novel and efficient top-down and bottom-up approach which uses Fisher's exact test to prune branches of the tree which are not statistically significant. Together these two changes yield a classifier that performs statistically better than not only traditional decision trees but also trees learned from data that has been balanced by well known sampling techniques. Our claims are confirmed through extensive experiments and comparisons against C4.5, CART, HDDT and SPARCCC.
Wei Liu, Sanjay Chawla, David A. Cieslak, and Nitesh V. Chawla. "A Robust Decision Tree Algorithm for Imbalanced Data Sets." Proceedings of the SIAM Conference on Data Mining (SDM), pp. 766–777, January 2010.
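A small sketch of the significance check mentioned above: Fisher's exact test on the 2x2 table of branch membership versus class for a candidate split. The counts are invented, and the CCP splitting measure itself is not reproduced.

# Sketch of the statistical pruning check: Fisher's exact test on a candidate
# split's 2x2 contingency table (branch x class). Counts are illustrative.
from scipy.stats import fisher_exact

#                class +   class -
left_branch  = [    40,       15  ]
right_branch = [    10,       60  ]

oddsratio, p_value = fisher_exact([left_branch, right_branch])
decision = "keep split" if p_value < 0.05 else "prune split (not significant)"
print(f"odds ratio = {oddsratio:.2f}, p = {p_value:.4g} -> {decision}")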
Abstract
There exist several music composition systems that generate blues chord progressions, jazz improvisation, or classical pieces. Such systems often work by applying a set of rules explicitly provided to the system to determine what sequence of output values is appropriate. Others use pattern recognition and generation techniques such as Markov models. These systems often suffer from mediocre performance and limited generality. We propose a system that goes from raw musical data to feature vector representation to classification models. We employ sliding window sequential machine learning techniques to generate classifiers that correspond to a training set of musical data. Our approach has the advantage of greater generality than explicitly specifying rules to a system and the potential to apply a wide variety of powerful existing non-sequential learning algorithms. We present the design and implementation of the composition system. We demonstrate the efficacy of the method, show and analyze successful samples of its output, and discuss ways in which it might be improved.
Ryan N. Lichtenwalter, Katerina Lichtenwalter, and Nitesh V. Chawla. "Applying Learning Algorithms to Music Generation." Proceedings of the 4th Indian International Conference on Artificial Intelligence (IICAI), pp. 483–502, December 2009.
Abstract
This work presents a novel method for the identification of specificity residues in two component systems based on the discovery of graphlet signatures. We use network representations of 3-D structures and sequence of proteins, experimental data and graph-based learning to detect graphlet signatures that potentially are responsible for phosphotransfer specificity between Histidine Kinase (HK) and Response Regulator (RR) domains. This approach is applied to the system of HK and RR in E. coli. Structural regions were found for Histidine Kinase RstB and Response Regulator RstA and confirmed using experimental data. In addition, some hypothetical regions of specificity were proposed to explain cross talk between the HKs that phosphorylate YhfA and the RRs that interact with UhpB. Such an approach offers the ability to identify domain specificity residues in two component systems in silico.
Faruck Morcos, Charles Lamanna, Nitesh V. Chawla, and Jesús Izaguirre. "Determination of Specificity Residues in Two Component Systems Using Graphlets." Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP), pp. 860–866, July 2009.
Abstract
This paper presents Model Monitor (M2), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired.
Troy Raeder and Nitesh V. Chawla. "Model Monitor (M2): Evaluating, Comparing, and Monitoring Models." Journal of Machine Learning Research (JMLR), 10(Jul):1387–1390, July 2009.
Abstract
A market basket is a set of products that form a single retail transaction. This purchase data of products can shed important light on how product(s) might influence sales of other product(s). Departing from the standard approach of frequent itemset mining, we posit that purchase data can be modeled as a social network. One can then discover communities of products that are bought together, which can lead to expressive exploration and discovery of a larger influence zone of product(s). We develop a novel utility measure for communities of products and show, both financially and intuitively, that community detection provides a useful complement to association rules for market basket analysis. All our conclusions are validated on real store data.
Troy Raeder and Nitesh V. Chawla. "Modeling a Store's Product Space as a Social Network." Proceedings of the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), pp. 164–169, July 2009.
Abstract
To discover patterns in historical data, climate scientists have applied various clustering methods with the goal of identifying regions that share some common climatological behavior. However, past approaches are limited by the fact that they either consider only a single time period (snapshot) of multivariate data, or they consider only a single variable by using the time series data as a multi-dimensional feature vector. In both cases, potentially useful information may be lost. Moreover, clusters in high-dimensional data space can be difficult to interpret, prompting the need for a more effective data representation. We address both of these issues by employing a complex network (graph) to represent climate data, a more intuitive model that can be used for analysis while also having a direct mapping to the physical world for interpretation. A cross correlation function is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly. "An Exploration of Climate Data Using Complex Networks." Proceedings of the 3rd ACM SIGKDD International Workshop on Knowledge Discovery from Sensor Data (SensorKDD), pp. 23–31, July 2009.
Abstract
Distributed PRocessing in Mobile Environments (DPRiME) is a framework for processing large data sets across an ad-hoc network. Developed to address the shortcomings of Google's MapReduce outside of a fully-connected network, DPRiME separates nodes on the network into a master and workers; the master distributes sections of the data to available one-hop workers to process in parallel. Upon returning results to its master, a worker is assigned an unfinished task. Five data mining classifiers were implemented to process the data: decision trees, k-means, k-nearest neighbor, Naïve Bayes, and artificial neural networks. Ensembles were used so the classification tasks could be performed in parallel. This framework is well-suited for many tasks because it handles communications, node movement, node failure, packet loss, data partitioning, and result collection automatically. Therefore, DPRiME allows users with little knowledge of networking or distributed systems to harness the processing power of an entire network of single- and multi-hop nodes.
Sean McRoskey, James Notwell, Nitesh V. Chawla, and Christian Poellabauer. "Mining in a Mobile Environment." Proceedings of the 3rd ACM SIGKDD International Workshop on Knowledge Discovery from Sensor Data (SensorKDD), pp. 56–60, July 2009.
Abstract
Streaming data is pervasive in a multitude of data mining applications. One fundamental problem in the task of mining streaming data is distributional drift over time. Streams may also exhibit high and varying degrees of class imbalance, which can further complicate the task. In scenarios like these, class imbalance is particularly difficult to overcome and has not been as thoroughly studied. In this paper, we consider the issue of high class imbalance in conjunction with data streams. We propose a method called Boundary Definition, which relies on building the classifiers by stressing the boundary cases as the streams arrive. We employ a sequential validation framework, which we believe is the most meaningful option in the context of streaming imbalanced data.
Ryan N. Lichtenwalter and Nitesh V. Chawla. "Learning to Classify Data Streams with Imbalanced Class Distributions." Proceedings of the Pacific Asia Knowledge Discovery and Data Mining Conference (PAKDD), April 2009.
Abstract
Streaming data is pervasive in a multitude of data mining applications. One fundamental problem in the task of mining streaming data is distributional drift over time. Streams may also exhibit high and varying degrees of class imbalance, which can further complicate the task. In scenarios like these, class imbalance is particularly difficult to overcome and has not been as thoroughly studied. In this paper, we comprehensively consider the issues of changing distributions in conjunction with high degrees of class imbalance in streaming data. We propose new approaches based on distributional divergence and meta-classification that improve several performance metrics often applied in the study of imbalanced classification. We also propose a new distance measure for detecting distributional drift and examine its utility in weighting ensemble base classifiers. We employ a sequential validation framework, which we believe is the most meaningful option in the context of streaming imbalanced data.
Ryan N. Lichtenwalter and Nitesh V. Chawla. "Adaptive Methods for Classification in Arbitrarily Imbalanced and Drifting Data Streams." Proceedings of the PAKDD Workshop on Data Mining When Classes are Imbalanced and Errors Have Costs (PAKDD-ICEC), pp. 53–75, April 2009.
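One generic way to operationalize divergence-based drift detection between stream windows is sketched below: bin a feature in a reference window and a current window and compare the histograms with Hellinger distance. The window sizes, bin count, and alert threshold are illustrative, and this is not the paper's specific distance measure.

# Sketch of a generic divergence-based drift check between two stream windows.
import numpy as np

def hellinger(p, q):
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=1000)      # earlier window of the stream
current = rng.normal(0.6, 1.0, size=1000)        # later window (mean has drifted)

edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=20)
p, _ = np.histogram(reference, bins=edges)
q, _ = np.histogram(current, bins=edges)

d = hellinger(p.astype(float), q.astype(float))
print(f"Hellinger distance = {d:.3f}")
if d > 0.1:                                      # illustrative alert threshold
    print("distributional drift suspected; reweight or retrain ensemble members")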
Abstract
Background: The IOM report, Preventing Medication Errors, emphasizes the overall lack of knowledge of the incidence of adverse drug events (ADE). Operating rooms, emergency departments and intensive care units are known to have a higher incidence of ADE. Labor and delivery (L&D) is an emergency care unit that could have an increased risk of ADE, where reported rates remain low and under-reporting is suspected. Risk factor identification with electronic pattern recognition techniques could improve ADE detection rates.

Objective: The objective of the present study is to apply Synthetic Minority Over Sampling Technique (SMOTE) as an enhanced sampling method in a sparse dataset to generate prediction models to identify ADE in women admitted for labor and delivery based on patient risk factors and comorbidities.

Results: By creating synthetic cases with the SMOTE algorithm and using a 10-fold cross-validation technique, we demonstrated improved performance of the Naïve Bayes and the decision tree algorithms. The true positive rate (TPR) of 0.32 in the raw dataset increased to 0.67 in the 800% over-sampled dataset.

Conclusion: Enhanced performance from classification algorithms can be attained with the use of synthetic minority class oversampling techniques in sparse clinical datasets. Predictive models created in this manner can be used to develop evidence based ADE monitoring systems.
Laritza M. Taft, R. Scott Evans, Chi-Ren Shyu, Marlene J. Egger, Nitesh V. Chawla, Joyce A. Mitchell, Sidney N. Thornton, Bruce Bray, and Michael W. Varner. "Countering Imbalanced Datasets to Improve Adverse Drug Event Predictive Models in Labor and Delivery." Journal of Biomedical Informatics (JBI), 42(2):356–364, April 2009.
Abstract
Pursuit of preventive healthcare relies on fundamental knowledge of the complex relationships between diseases and individuals. We take a step towards understanding these connections by employing a network-based approach to explore a large medical database. Here we report on two distinct tasks. First, we characterize networks of diseases in terms of their physical properties and emergent behavior over time. Our analysis reveals important insights with implications for modeling and prediction. Second, we immediately apply this knowledge to construct patient networks and build a predictive model to assess disease risk for individuals based on medical history. We evaluate the ability of our model to identify conditions a person is likely to develop in the future and study the benefits of demographic data partitioning. We discuss strengths and limitations of our method as well as the data itself to provide direction for future work.
Karsten Steinhaeuser and Nitesh V. Chawla. "A Network-Based Approach to Understanding and Predicting Diseases." Social Computing and Behavioral Modeling, pp. 1–8, February 2009.
Abstract
Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of sundry sampling strategies. In this paper, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different "rebalance" heuristics in SVM modeling, including cost-sensitive learning, and over- and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets by using various metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this paper, the novel granular SVMs-repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective, as it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient by extracting far fewer support vectors and, hence, greatly speeding up SVM prediction.
Yuchun Tang, Yan-Qing Zhang, Nitesh V. Chawla, and Sven Krasser. "SVMs Modeling for Highly Imbalanced Classification." IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 39(1):281–288, February 2009.
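A minimal sketch of one member of the "rebalance" family discussed above: cost-sensitive SVM training via class weights, scored with G-mean. It is not GSVM-RU, and the synthetic data and parameters are illustrative.

# Sketch: plain vs. cost-sensitive (class-weighted) SVM on imbalanced data,
# compared by G-mean. Illustrative only; not the paper's GSVM-RU algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95, 0.05], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

def g_mean(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens = tp / (tp + fn)                      # true positive rate
    spec = tn / (tn + fp)                      # true negative rate
    return np.sqrt(sens * spec)

for weights in (None, "balanced"):             # plain vs. cost-sensitive SVM
    clf = SVC(kernel="rbf", class_weight=weights, random_state=0).fit(Xtr, ytr)
    print(weights, "G-mean:", round(g_mean(yte, clf.predict(Xte)), 3))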
Abstract
Classifier error is the product of model bias and data variance. While it is important to understand the bias involved in selecting a given learning algorithm, it is similarly important to understand the variability in data over time, since even the One True Model might perform poorly when training and evaluation samples diverge. Thus, the ability to identify distributional divergence is critical to pinpointing when fracture points in classifier performance will occur, particularly since contemporary methods such as ten-fold cross-validation and hold-out are poor predictors in divergent circumstances. This article implements a comprehensive evaluation framework to proactively detect breakpoints in classifiers' predictions and shifts in data distributions through a series of statistical tests. We outline and utilize three scenarios under which data changes: sample selection bias, covariate shift, and shifting class priors. We evaluate the framework with a variety of classifiers and datasets.
David A. Cieslak and Nitesh V. Chawla. "A Framework for Monitoring Classifiers' Performance: When and Why Failure Occurs?" Knowledge and Information Systems (KAIS), 18(1):83–108, January 2009.
Abstract
Class imbalance is a ubiquitous problem in supervised learning and has gained wide-scale attention in the literature. Perhaps the most prevalent solution is to apply sampling to training data in order to improve classifier performance. The typical approach will apply uniform levels of sampling globally. However, we believe that data is typically multi-modal, which suggests sampling should be treated locally rather than globally. It is the purpose of this paper to propose a framework which first identifies meaningful regions of data and then proceeds to find optimal sampling levels within each. This paper demonstrates that a global classifier trained on data locally sampled produces superior rank-orderings on a wide range of real-world and artificial datasets as compared to contemporary global sampling methods.
David A. Cieslak and Nitesh V. Chawla. "Start Globally, Optimize Locally, Predict Globally: Improving Performance on Unbalanced Data." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pp. 143–152, December 2008.
Abstract
As the size of available datasets has grown from Megabytes to Gigabytes and now into Terabytes, machine learning algorithms and computing infrastructures have continuously evolved in an effort to keep pace. But at large scales, mining for useful patterns still presents challenges in terms of data management as well as computation. These issues can be addressed by dividing both data and computation to build ensembles of classifiers in a distributed fashion, but trade-offs in cost, performance, and accuracy must be considered when designing or selecting an appropriate architecture. In this paper, we present an abstraction for scalable data mining that allows us to explore these tradeoffs. Data and computation are distributed to a computing cloud with minimal effort from the user, and multiple models for data management are available depending on the workload and system configuration. We demonstrate the performance and scalability characteristics of our ensembles using a wide variety of datasets and algorithms on a Condor-based pool with Chirp to handle the storage.
Christopher Moretti, Karsten Steinhaeuser, Douglas L. Thain, and Nitesh V. Chawla. "Scaling Up Classifiers to Cloud Computers." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pp. 472–481, December 2008.
Abstract
The monumental cost of health care, especially for chronic disease treatment, is quickly becoming unmanageable. This crisis has motivated the drive towards preventative medicine, where the primary concern is recognizing disease risk and taking action at the earliest signs. However, universal testing is neither time nor cost efficient. We propose CARE, a Collaborative Assessment and Recommendation Engine, which relies only on a patient's medical history using ICD-9-CM codes in order to predict future disease risks. CARE combines collaborative filtering methods with clustering to predict each patient's greatest disease risks based on their own medical history and that of similar patients. We also describe an Iterative version, ICARE, which incorporates ensemble concepts for improved performance. These novel systems require no specialized information and provide predictions for medical conditions of all kinds in a single run. We present experimental results on a large Medicare dataset, demonstrating that CARE and ICARE perform well at capturing future disease risks.
Darcy A. Davis, Nitesh V. Chawla, Nicholas Blumm, Nicholas A. Christakis, and Albert-László Barabási. "Predicting Individual Disease Risk Based on Medical History." Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), pp. 769–778, October 2008.
Abstract
Large scale production computing grids introduce new challenges in debugging and troubleshooting. A user that submits a workload consisting of tens of thousands of jobs to a grid of thousands of processors has a good chance of receiving thousands of error messages as a result. How can one begin to reason about such problems? We propose that data mining techniques can be employed to classify failures according to the properties of the jobs and machines involved. We demonstrate this technique through several case studies on real workloads consisting of tens of thousands of jobs. We apply the same techniques to a year's worth of data on a 3,000 CPU production grid and use it to gain a high level understanding of the system behavior.
David A. Cieslak, Nitesh V. Chawla, and Douglas L. Thain. "Troubleshooting Thousands of Jobs on Production Grids Using Data Mining Techniques." Proceedings of the 9th IEEE/ACM International Conference on Grid Computing (GRID), pp. 217–224, October 2008.
Abstract
Learning from imbalanced data sets presents a convoluted problem both from the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common, but important, criticism: how to automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions like the f-measure, Area Under the ROC Curve (AUROC), cost, cost-curves, and the cost dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the performance of the wrapper approach versus cost-sensitive learning methods - MetaCost and the Cost-Sensitive Classifiers—and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
Nitesh V. Chawla
,
David A. Cieslak
, Lawrence O. Hall, and Ajay Joshi
. "
Automatically Countering Imbalance and Its Empirical Relationship to Cost
."
Data Mining and Knowledge Discovery (DMKD)
,
17
(2)
:225–252,
October
2008
.
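As a rough, hypothetical illustration of the wrapper idea in the entry above, the sketch below searches only over random undersampling amounts and picks the one that maximizes cross-validated F-measure with scikit-learn; the published framework also searches SMOTE amounts and supports other evaluation functions, and the classifier, candidate rates, and label coding here are assumptions.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def undersample(X, y, rate, rng):
        """Randomly drop a fraction `rate` of majority-class (label 0) examples."""
        majority = np.flatnonzero(y == 0)
        keep = rng.choice(majority, size=int(round(len(majority) * (1 - rate))),
                          replace=False)
        idx = np.concatenate([keep, np.flatnonzero(y == 1)])
        return X[idx], y[idx]

    def wrapper_search(X, y, rates=(0.0, 0.25, 0.5, 0.75), seed=0):
        """Return the undersampling amount with the best cross-validated F1.
        Assumes numpy arrays with y coded as {0, 1}, 1 being the minority class."""
        rng = np.random.default_rng(seed)
        best_rate, best_f1 = None, -np.inf
        for rate in rates:
            Xs, ys = undersample(X, y, rate, rng)
            f1 = cross_val_score(DecisionTreeClassifier(random_state=seed),
                                 Xs, ys, cv=5, scoring="f1").mean()
            if f1 > best_f1:
                best_rate, best_f1 = rate, f1
        return best_rate, best_f1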
Abstract
Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria—information gain, Gini measure, and DKM—and identifies a new skew-insensitive measure in Hellinger distance. We outline the strengths of Hellinger distance in class imbalance, propose its application in forming decision trees, and perform a comprehensive comparative analysis between each decision tree construction method. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction of the splitting metric and sampling. We evaluate these methods over a wide range of datasets and determine which operates best under class imbalance.
David A. Cieslak
and
Nitesh V. Chawla
. "
Learning Decision Trees for Unbalanced Data
."
Proceedings of the 19th European Conference on Machine Learning and the 12th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD)
,
pp. 241–256,
September
2008
.
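The Hellinger-distance splitting criterion discussed in the entry above has a compact form: for a candidate split, it is the Hellinger distance between the distributions of the positive and the negative class across the resulting branches, which does not involve the class priors. The sketch below (binary labels and a numeric feature, with assumed names) scores and selects a threshold with that criterion.

    import numpy as np

    def hellinger_split_score(y, left_mask):
        """Hellinger distance between the class-conditional distributions induced
        by a candidate binary split; larger is better, and skew-insensitivity
        comes from using within-class proportions rather than class priors."""
        y = np.asarray(y)
        pos, neg = (y == 1), (y == 0)
        n_pos, n_neg = pos.sum(), neg.sum()
        if n_pos == 0 or n_neg == 0:
            return 0.0
        total = 0.0
        for branch in (left_mask, ~left_mask):
            p = (pos & branch).sum() / n_pos   # share of positives in this branch
            q = (neg & branch).sum() / n_neg   # share of negatives in this branch
            total += (np.sqrt(p) - np.sqrt(q)) ** 2
        return np.sqrt(total)

    def best_threshold(x, y):
        """Pick the threshold on a numeric feature that maximizes the criterion."""
        x = np.asarray(x)
        thresholds = np.unique(x)
        scores = [hellinger_split_score(y, x <= t) for t in thresholds]
        return thresholds[int(np.argmax(scores))], float(max(scores))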
Abstract
Distributed denial of service (DDoS) attacks originating from botnets can quickly bring normally effective web services to a screeching halt. This paper presents SESRAA (selective short-term randomized acceptance algorithms), an adaptive scheme for maintaining web service despite the presence of multifaceted attacks in a noisy environment. In contrast to existing solutions that rely upon "clean" training data, we presume that a live web service environment makes finding such training data difficult if not impossible. SESRAA functions much like a battlefield surgeon's triage: focusing on quickly and efficiently salvaging good connections with the realization that the chaotic nature of the live environment implicitly limits the accuracy of such detections. SESRAA employs an adaptive k-means clustering approach using short-term extraction and limited centroid evolution to defend the legitimate connections in a mixed attack environment. We present the SESRAA approach and evaluate its performance through experimental studies in a diverse attack environment. The results show significant improvements against a wide variety of DDoS configurations and input traffic patterns.
Qi Liao,
David A. Cieslak
, Aaron D. Striegel, and
Nitesh V. Chawla
. "
Using Selective, Short-Term Memory to Improve Resilience Against DDoS Exhaustion Attacks
."
Security and Communication Networks
,
1
(4)
:287–299,
July
2008
.
Abstract
We examine modularity both as an evaluation measure of community structure and as an optimization criterion used by some algorithms to identify communities in networks. Specifically, we assess the pros and cons of modularity, and identify its shortcomings via comparison to alternate evaluation metrics on networks for which the true communities are known. We also develop a simple method for community detection using random walks and show that it can identify the actual communities at least as well as more complex algorithms. We further expand upon the idea of community detection with random walks by extending the method to weighted networks and explain how to incorporate node attributes—information that is frequently available but usually ignored by other algorithms—to compute edge weights that can aid in detecting more meaningful communities.
Karsten Steinhaeuser
and
Nitesh V. Chawla
. "
Is Modularity the Answer to Evaluating Community Structure in Networks?
."
Proceedings of the International Workshop and Conference on Network Science (NetSci)
,
June
2008
.
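To make the evaluation question in the entry above concrete, the snippet below compares the modularity of an algorithmically detected partition against the known communities on Zachary's karate club graph; this is an assumed setup using networkx utilities, not the paper's data or its random-walk method.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities, modularity

    # A small graph whose "true" two communities are recorded as a node attribute.
    G = nx.karate_club_graph()
    truth = [{n for n, d in G.nodes(data=True) if d["club"] == "Mr. Hi"},
             {n for n, d in G.nodes(data=True) if d["club"] == "Officer"}]

    detected = greedy_modularity_communities(G)

    # A detected partition can score at least as high a modularity as the actual
    # communities, which is the kind of mismatch the paper examines.
    print("modularity of detected partition:", modularity(G, detected))
    print("modularity of true communities:  ", modularity(G, truth))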
Abstract
Many machine learning applications like finance, medicine, and risk management suffer from class imbalance: cases of interest occur rarely. Further complicating these applications is that the training and testing samples might differ significantly in their respective class distributions. Sampling has been shown to be a strong solution to imbalance and additionally offers a rich parameter space from which to select classifiers. This paper is concerned with the interaction between Probability Estimation Trees (PETs), sampling, and performance metrics as testing distributions fluctuate substantially. A set of comprehensive analyses is presented, which anticipate classifier performance through a set of widely varying testing distributions.
David A. Cieslak
and
Nitesh V. Chawla
. "
Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ
."
Advances in Knowledge Discovery and Data Mining (PAKDD)
,
pp. 519–526,
May
2008
.
Abstract
A significant increase in the ability to collect and store diverse information over the past decade has led to an outright data explosion, providing larger and richer datasets than ever before. This proliferation in dataset size is accompanied by the dilemma of successfully analyzing this data to discover patterns of interest. Extreme dataset sizes place unprecedented demands on high-performance computing infrastructures, and a gap has developed between the available real-world datasets and our ability to process them; data volumes are quickly approaching Tera and Petabytes. This rate of increase also defies the subsampling paradigm, as even a subsample of data runs well into Gigabytes. To counter this challenge, we exploit advances in multi-threaded processor technology. We explore massive thread-level parallelism—provided by the Cray MTA-2—as a platform for scalable data mining. We conjecture that such an architecture is well suited for the application of machine learning to large datasets. To this end, we present a thorough complexity analysis and experimental evaluation of a popular decision tree algorithm implemented using fine-grain parallelism, including a comparison to two more conventional architectures. We use diverse datasets with sizes varying in both dimensions (number of records and attributes). Our results lead us to the conclusion that a massively parallel architecture is an appropriate platform for the implementation of highly scalable learning algorithms.
Karsten Steinhaeuser
and
Nitesh V. Chawla
. "
Scalable Learning with Thread-Level Parallelism
."
Proceedings of the Midwest Artificial Intelligence and Cognitive Science Conference (MAICS)
,
April
2008
.
Abstract
Identifying meaningful community structure in social networks is a hard problem, and extreme network size or sparseness of the network compounds the difficulty of the task. With a proliferation of real-world network datasets, there has been an increasing demand for algorithms that work effectively and efficiently. Existing methods are limited by their computational requirements and rely heavily on the network topology, which fails in scale-free networks. Yet, in addition to the network connectivity, many datasets also include attributes of individual nodes, but current methods are unable to incorporate this data. Cognizant of these requirements, we propose a simple approach that steers away from complex algorithms, focusing instead on the edge weights; more specifically, we leverage the node attributes to compute better weights. Our experimental results on a real-world social network show that a simple thresholding method with edge weights based on node attributes is sufficient to identify a very strong community structure.
Karsten Steinhaeuser
and
Nitesh V. Chawla
. "
Community Detection in a Large Real-World Social Network
."
Social Computing, Behavioral Modeling, and Prediction
,
pp. 168–175,
March
2008
.
Abstract
Mobile communication networks produce massive amounts of data which may be useful in identifying the location of an emergency situation and the area it affects. We propose a one-pass clustering algorithm for quickly identifying anomalous data points. We evaluate this algorithm's ability to detect outliers in a data set and describe how such an algorithm may be used as a component of an emergency response management system.
Alec Pawling,
Nitesh V. Chawla
, and Gregory Madey
. "
Anomaly Detection in a Mobile Communication Network
."
Computational and Mathematical Organizational Theory
,
13
(4)
:407–422,
December
2007
.
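A minimal sketch of a one-pass, leader-style clustering pass for flagging anomalous points, in the spirit of the entry above; the fixed radius threshold and the flagging rule are illustrative assumptions rather than the published algorithm.

    import numpy as np

    def one_pass_outlier_flags(stream, radius):
        """Each point joins (and incrementally updates) the nearest cluster if it
        lies within `radius`; otherwise it is flagged and seeds a new cluster.
        Returns one boolean flag per point, in arrival order."""
        centroids, counts, flags = [], [], []
        for point in stream:
            x = np.asarray(point, dtype=float)
            if centroids:
                dists = [np.linalg.norm(x - c) for c in centroids]
                j = int(np.argmin(dists))
                if dists[j] <= radius:
                    counts[j] += 1
                    centroids[j] += (x - centroids[j]) / counts[j]   # running mean
                    flags.append(False)
                    continue
            centroids.append(x.copy())
            counts.append(1)
            flags.append(True)        # far from every cluster seen so far
        return flags

    # The first point seeds a cluster; the distant last point is flagged.
    print(one_pass_outlier_flags([[0, 0], [0.1, 0.2], [0.2, 0.1], [5, 5]], radius=1.0))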
Abstract
A fundamental tenet assumed by many classification algorithms is the presumption that both training and testing samples are drawn from the same distribution of data—this is the stationary distribution assumption. This entails that the past is strongly indicative of the future. However, in real world applications, many factors may alter the One True Model responsible for generating the data distribution both significantly and subtly. In circumstances violating the stationary distribution assumption, traditional validation schemes such as ten-fold cross-validation and hold-out become poor performance predictors and classifier rankers. Thus, it becomes critical to discover the fracture points in classifier performance by discovering the divergence between populations. In this paper, we implement a comprehensive evaluation framework to identify bias, enabling selection of a "correct" classifier given the sample bias. To thoroughly evaluate the performance of classifiers within biased distributions, we consider the following three scenarios: missing completely at random (akin to stationary); missing at random; and missing not at random. The latter reflects the canonical sample selection bias problem.
David A. Cieslak
and
Nitesh V. Chawla
. "
Detecting Fractures in Classifier Performance
."
Proceedings of the 7th IEEE International Conference on Data Mining (ICDM)
,
pp. 123–132,
October
2007
.
Abstract
This paper employs machine learning techniques to develop models that predict gender based on the iris texture features. While there is a large body of research that explores biometrics as a means of verifying identity, there has been very little work done to determine if biometric measures can be used to determine specific human attributes. If it is possible to discover such attributes, they would be useful in situations where a biometric system fails to identify an individual that has not been enrolled, yet still needs to be identified. The iris was selected as the biometric to analyze for two major reasons: (1) quality methods have already been developed to segment and encode an iris image, (2) current iris encoding methods are conducive to selecting and extracting attributes from an iris texture and creating a meaningful feature vector.
Vince Thomas
,
Nitesh V. Chawla
, Kevin W. Bowyer, and Patrick J. Flynn
. "
Learning to Predict Gender from Iris Images
."
Proceedings of the 1st IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS)
,
pp. 1–5,
September
2007
.
Abstract
We propose a learning framework that actively explores creation of face space(s) by selecting images that are complementary to the images already represented in the face space. We also construct ensembles of classifiers learned from such actively sampled image sets, which further provides improvement in the recognition rates. We not only significantly reduce the number of images required in the training set but also improve the accuracy over learning from all the images. We also show that the single face space or ensemble of face spaces, thus constructed, has a higher generalization performance across different illumination and expression conditions.
Nitesh V. Chawla
and Kevin W. Bowyer
. "
Actively Exploring Creation of Face Space(s) for Improved Face Recognition
."
Proceedings of the National Conference on Artificial Intelligence (AAAI)
,
22
(1)
:809–814,
July
2007
.
Abstract
The authentication logs on a network can provide a trove of information for discovering potential anomalies in login attempts. Using such logs collected by a production Virtual Private Network device over a period of 15 months, we generate a diurnal model of network accesses. These models are used to detect anomalous authentications, which merit further investigation by a security analyst. We intend that this work will dramatically reduce the amount of time spent by analysts identifying anomalous events and allow them to focus on in-depth analysis of these anomalies. Our work makes two contributions: a novel approach of mining authentication data, and the use of geographic distance as a metric to evaluate Virtual Private Network connections. We demonstrate the success of our model using real-world case analysis.
Michael J. Chapple,
Nitesh V. Chawla
, and Aaron D. Striegel
. "
Authentication Anomaly Detection: A Case Study on a Virtual Private Network
."
Proceedings of the 3rd Annual ACM Workshop on Mining Network Data (MineNet)
,
pp. 17–22,
June
2007
.
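The entry above uses geographic distance as a metric for evaluating VPN connections. Below is a small sketch of the standard haversine great-circle distance plus a purely hypothetical fixed-threshold check; the paper's actual model is diurnal and learned from the authentication logs, not a hard-coded cutoff.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * asin(sqrt(a))

    def geographically_suspicious(usual_latlon, current_latlon, threshold_km=1000.0):
        """Hypothetical rule: flag a login far from the user's usual location."""
        return haversine_km(*usual_latlon, *current_latlon) > threshold_km

    print(geographically_suspicious((41.70, -86.24), (48.86, 2.35)))  # True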
Abstract
We describe a prototype emergency and disaster information system designed and implemented using DDDAS concepts. The system is designed to use real-time cell phone calling data from a geographical region, including calling activity—who calls whom, call duration, services in use, and cell phone location information—to provide enhanced situational awareness for managers in emergency operations centers (EOCs) during disaster events. Powered-on cell phones maintain contact with one or more within-range cell towers so as to receive incoming calls. Thus, location data about all phones in an area are available, either directly from GPS equipped phones, or by cell tower, cell sector, distance from tower and triangulation methods. This permits the cell phones of a geographical region to serve as an ad hoc mobile sensor net, measuring the movement and calling patterns of the population. A prototype system, WIPER, serves as a test bed to research open DDDAS design issues, including dynamic validation of simulations, algorithms to interpret high volume data streams, ensembles of simulations, runtime execution, middleware services, and experimentation frameworks.
Gregory R. Madey, Albert-László Barabási,
Nitesh V. Chawla
, Marta Gonzales, David Hachen, Brett Lantz, Alec Pawling, Timothy W. Schoenharl, Gábor Szabó, Pu Wang, and Ping Yan
. "
Enhanced Situational Awareness: Application of DDDAS Concepts to Emergency and Disaster Management
."
Proceedings of the 7th International Conference on Computational Science (ICCS)
,
pp. 1090–1097,
May
2007
.
Abstract
Ensembles are often capable of greater predictive performance than any of their individual classifiers. Despite the need for classifiers to make different kinds of errors, the typically used majority voting scheme treats each classifier as though it contributed equally to the group's performance. This can be particularly limiting on unbalanced datasets, as one is more interested in complementing classifiers that can assist in improving the true positive rate without significantly increasing the false positive rate. Therefore, we implement a genetic algorithm-based framework to weight the contribution of each classifier by an appropriate fitness function, such that the classifiers that complement each other on the unbalanced dataset are preferred, resulting in significantly improved performance. The proposed framework can be built on top of any collection of classifiers with different fitness functions.
Nitesh V. Chawla
and
Jared Sylvester
. "
Exploiting Diversity in Ensembles: Improving Performance on Unbalanced Datasets
."
Multiple Classifier Systems (MCS)
,
pp. 397–406,
May
2007
.
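As a sketch of the genetic-algorithm weighting idea in the entry above: a simplified evolutionary loop (truncation selection plus Gaussian mutation, with no crossover) that searches vote weights maximizing a user-supplied fitness function on validation-set scores. All names and hyperparameters are assumptions, not the paper's settings.

    import numpy as np

    def evolve_vote_weights(val_scores, y_val, fitness, n_gen=50, pop=30, seed=0):
        """`val_scores` has shape (n_classifiers, n_samples): each row holds one
        classifier's positive-class scores on a validation set. `fitness` maps
        (y_true, combined_score) to the quantity being maximized."""
        rng = np.random.default_rng(seed)
        n_clf = val_scores.shape[0]

        def combined(w):
            return (w @ val_scores) / (w.sum() + 1e-12)   # weighted average vote

        population = rng.random((pop, n_clf))
        for _ in range(n_gen):
            scores = np.array([fitness(y_val, combined(w)) for w in population])
            parents = population[np.argsort(scores)[-(pop // 2):]]   # fittest half
            children = parents[rng.integers(len(parents), size=pop - len(parents))].copy()
            children += rng.normal(scale=0.1, size=children.shape)   # Gaussian mutation
            population = np.vstack([parents, np.clip(children, 0.0, None)])
        best = max(population, key=lambda w: fitness(y_val, combined(w)))
        return best / (best.sum() + 1e-12)

On an imbalanced validation set one might pass, for instance, scikit-learn's roc_auc_score as the fitness function.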
Abstract
We present a "black-box" approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of that group directly from points in a high-dimensional input space constructed from the query's attributes, operators, function arguments, aggregates, and constants. We envision an increasing need for such an approach in applications in which query cardinality is required for resource optimization and decision-making at locations that are remote from the data sources. Our primary case study is the Open SkyQuery federation of Astronomy archives, which uses a scheduling and caching mechanism at the mediator for execution of federated queries at remote sources. Experiments using real workloads show that the black-box approach produces accurate estimates and is frugal in its use of space and in computation resources. Also, the black-box approach provides dramatic improvements in the performance of caching in Open SkyQuery.
Tanu Malik, Randal Burns, and
Nitesh V. Chawla
. "
A Black-Box Approach to Query Cardinality Estimation
."
ACM Conference on Innovative Data Systems Research (CIDR)
,
pp. 56–67,
January
2007
.
Abstract
In a proxy cache for federations of scientific databases it is important to estimate the size of a query before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. On the other extreme, inaccurate estimates can render the cache totally ineffective.

We present classification and regression over templates (CAROT), a general method for estimating query result sizes, which is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but from observing workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and provides near-optimal cache performance.
Tanu Malik, Randal Burns,
Nitesh V. Chawla
, and Alexander S. Szalay
. "
Estimating Query Result Sizes for Proxy Caching in Scientific Database Federations
."
Proceedings of the ACM/IEEE Conference on Supercomputing (SC)
,
pp. 102–115,
November
2006
.
Abstract
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
Yang Liu,
Nitesh V. Chawla
, Mary P. Harper, Elizabeth Shriberg, and Andreas Stolcke
. "
A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech
."
Journal of Computer Speech and Language
,
20
(4)
:468–494,
October
2006
.
Abstract
Classification is an important data mining task, and decision trees have emerged as a popular classifier due to their simplicity and relatively low computational complexity. However, as datasets get extremely large, the time required to build a decision tree still becomes intractable. Hence, there is an increasing need for more efficient tree-building algorithms. One approach to this problem involves using a parallel mode of computation. Prior work has successfully used processor-level parallelism to partition the data and computation. We propose to use Cray's Multi-Threaded Architecture (MTA) and extend the idea by employing thread-level parallelism to reduce the execution time of the tree building process. Decision tree building is well-suited for such low-level parallelism as it requires a large number of independent computations. In this paper, we present the analysis and parallel implementation of the ID3 algorithm, along with experimental results.
Karsten Steinhaeuser
,
Nitesh V. Chawla
, and Peter M. Kogge
. "
Exploiting Thread-Level Parallelism to Build Decision Trees
."
Proceedings of the ECML/PKDD International Workshop on Parallel and Distributed Data Mining
,
pp. 13–24,
September
2006
.
Abstract
Energy management mechanisms in resource constrained environments such as mobile systems are typically based on the current status of various system parameters such as utilization and quality of service requirements. Any knowledge gained about future system resource usage and other significant dynamics in the system can further lead to increased optimizations in such mechanisms. Future knowledge about system resource utilization and dynamics can be derived from two separate sources: a) the information maintained in the system about its current utilization and resource accesses and, b) the application specific behavioral patterns of currently active applications. This work aims to employ known, efficient data mining techniques to exploit the latter class of information and build an application behavior model that can provide accurate predictions of future application-specific resource accesses and usage. Such formulated application behavioral patterns containing information about their resource usage, termed Resource Access Patterns, are then employed in a novel dynamic energy management scheme for a mobile system network resource. This work also shows that a data mining approach is beneficial in energy management decisions as it can capture all the underlying intricate interactions and dependencies that exist among the active applications and between the resources available in the system. The key components of this work can thus be summarized as: a) employing a data mining approach to characterize application behavior and, b) designing a novel energy management policy for a network interface device based on information about its future usage derived from the formulated application behavioral model.
Dinesh Rajan, Christian Poellabauer, and
Nitesh V. Chawla
. "
Resource Access Pattern Mining for Dynamic Energy Management
."
Proceedings of the ECML/PKDD Workshop on Automatic Computing: A New Challenge for Machine Learning
,
September
2006
.
Abstract
Traditional discretization techniques for machine learning, from examples with continuous feature spaces, are not efficient when the data is in the form of a stream from an unknown, possibly changing, distribution. We present a time-and-memory-efficient discretization technique based on computing ε-approximate exponential frequency quantiles, and prove bounds on the worst-case error introduced in computing information entropy in data streams compared to an offline algorithm that has no efficiency constraints. We compare the empirical performance of the technique, using it for feature selection, with (streaming adaptations of) two popular methods of discretization, equal width binning and equal frequency binning, under a variety of streaming scenarios for real and artificial datasets. Our experiments show that ε-approximate exponential frequency quantiles are remarkably consistent in their performance, in contrast to the simple and efficient equal width binning, which performs quite well when the streams are from stationary distributions and quite poorly otherwise.
Alec Pawling,
Nitesh V. Chawla
, and Amitabh Chaudhary
. "
Evaluation of Summarization Schemes for Learning in Streams
."
Proceedings of the 17th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD)
,
pp. 347–358,
September
2006
.
Abstract
Data mining models often carry the final objective of maximizing profit or minimizing cost. This problem becomes even more profound in financial applications that can have multiple constraints, such as interest rate, score cut-off, and the loan amount to allocate. In this paper, we present a pricing framework for discovering the total profit from a probabilistic model, given a benefit function.
Nitesh V. Chawla
and Xiangning Li
. "
Pricing Based Framework for Benefit Scoring
."
Proceedings of the ACM SIGKDD 2nd International Workshop on Utility-Based Data Mining (UBDM)
,
pp. 65–75,
August
2006
.
Abstract
We report on our approach, CBAmethod3E, which was submitted to the NIPS 2003 Feature Selection Challenge on Dec. 8, 2003. Our approach consists of combining filtering techniques for variable selection, information gain and feature correlation, with Support Vector Machines for induction. We ranked 13th overall and ranked 6th as a group. It is worth pointing out that our feature selection method was very successful in selecting the second smallest set of features among the top-20 submissions, and in identifying almost all probes in the datasets, resulting in the challenge's best performance on the latter benchmark.
Danny Roobaert, Grigoris Karakoulas, and
Nitesh V. Chawla
. "
Information Gain, Correlation and Support Vector Machines
."
Feature Extraction: Foundations and Applications
,
pp. 463–470,
August
2006
.
Abstract
Ensembles are often capable of greater predictive accuracy than any of their individual members. One key attribute of ensembles' success is the notion of diversity. However, the majority voting scheme used in most ensembles treats each classifier as if it contributed equally to the group performance, without capitalizing on the relative improvement offered by each member of the ensemble. Our solution to this problem is to use genetic algorithms to weight the contribution of each classifier. This improves the performance of the ensemble by providing a weighted vote which maximizes collaboration among classifiers. Our approach provides a general-purpose framework for evolutionary ensembles, allowing them to build on top of any collection of classifiers, whether they be heterogeneous or homogeneous. In addition, the ability of our framework to thin ensembles, and its effect on ensemble diversity is presented.
Jared Sylvester
and
Nitesh V. Chawla
. "
Evolutionary Ensemble Creation and Thinning
."
Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN)
,
pp. 5148–5155,
July
2006
.
Abstract
Decision trees, a popular choice for classification, have their limitation in providing good quality probability estimates. Typically, smoothing methods such as Laplace or m-estimate are applied at the decision tree leaves to overcome the systematic bias introduced by the frequency-based estimates. An ensemble of decision trees has also been shown to help in reducing the bias and variance in the leaf estimates, resulting in better calibrated probabilistic predictions. In this work, we evaluate the calibration or quality of these estimates using various loss measures. We also examine the relationship between the quality of such estimates and resulting rank-ordering of test instances. Our results quantify the impact of smoothing in terms of the loss measures, and the coupled relationship with the AUC measure.
Nitesh V. Chawla
and
David A. Cieslak
. "
Evaluating Probability Estimates from Decision Trees
."
Proceedings of the AAAI Workshop on Evaluation Methods for Machine Learning
,
July
2006
.
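For reference, the two smoothing rules named in the abstract above have simple closed forms; the sketch below shows them for an assumed two-class leaf.

    def laplace_estimate(k, n, num_classes=2):
        """Laplace-smoothed leaf probability (k + 1) / (n + C): k of the n
        training examples reaching the leaf are positive, C is the class count."""
        return (k + 1) / (n + num_classes)

    def m_estimate(k, n, prior, m=10):
        """m-estimate smoothing: shrink the leaf frequency toward the class
        prior, with m controlling the strength of the prior."""
        return (k + m * prior) / (n + m)

    # A pure but tiny leaf (2 positives out of 2) no longer claims probability 1.0:
    print(laplace_estimate(2, 2))          # 0.75
    print(m_estimate(2, 2, prior=0.1))     # 0.25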
Abstract
Decision trees, a popular choice for classification, have their limitation in providing probability estimates, requiring smoothing at the leaves. Typically, smoothing methods such as Laplace or m-estimate are applied at the decision tree leaves to overcome the systematic bias introduced by the frequency-based estimates. In this work, we show that an ensemble of decision trees significantly improves the quality of the probability estimates produced at the decision tree leaves. The ensemble overcomes the myopia of the leaf frequency based estimates. We show the effectiveness of the probabilistic decision trees as a part of the Predictive Uncertainty Challenge. We also include three additional highly imbalanced datasets in our study. We show that the ensemble methods significantly improve not only the quality of the probability estimates but also the AUC for the imbalanced datasets.
Nitesh V. Chawla
. "
Many Are Better Than One: Improving Probabilistic Estimates from Decision Trees
."
Machine Learning Challenges
,
pp. 41–55,
July
2006
.
Abstract
The management of wireless sensor networks in the presence of multiple constraints is an open problem in systems research. Existing methods perform well when optimized for a single parameter (such as energy, delay, network bandwidth). However, we might want to establish trade-offs on the fly, and optimize the information flow/exchange. This position paper shall serve as a preliminary proof-of-concept that techniques and algorithms from the machine learning and data mining domains can be applied to network data to learn relevant information about the routing behavior of individual nodes and the overall state of the network. We describe two simple examples which demonstrate the application of existing algorithms and analyze the results to illustrate their usefulness.
Karsten Steinhaeuser
,
Nitesh V. Chawla
, and Christian Poellabauer
. "
Towards Learning-Based Sensor Management
."
Proceedings of the 1st Workshop on Tackling Computer Systems Problems with Machine Learning (SysML)
,
June
2006
.
Abstract
Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the result of an increased number of people trying to call friends or family to tell them what is happening, or a decrease in network usage caused by people being unable to use the network. Such events are anomalies, and managing emergencies effectively requires identifying anomalies quickly. This problem is difficult due to the rate at which very large volumes of data are produced. In this paper, we discuss the use of data stream clustering algorithms for anomaly detection.
Alec Pawling,
Nitesh V. Chawla
, and Gregory Madey
. "
Anomaly Detection in a Mobile Communication Network
."
Annual Conference of the North American Association for Computational Social and Organizational Science (NAACSOS)
,
pp. 407–422,
June
2006
.
Abstract
Through massive parallelism, distributed systems enable the multiplication of productivity. Unfortunately, increasing the scale of the machines available to users also multiplies the debugging effort when failures occur. Data mining allows the extraction of patterns within large amounts of data and therefore forms the foundation for a useful method of debugging, particularly within such distributed systems. This paper outlines a successful application of data mining in troubleshooting distributed systems, proposes a framework for further study, and speculates on other future work.
David A. Cieslak
, Douglas L. Thain, and
Nitesh V. Chawla
. "
Troubleshooting Distributed Systems via Data Mining
."
Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing (HPDC)
,
pp. 309–312,
June
2006
.
Abstract
An approach to combating network intrusion is the development of systems applying machine learning and data mining techniques. Many IDS (Intrusion Detection Systems) suffer from a high rate of false alarms and missed intrusions. We want to be able to improve the intrusion detection rate at a reduced false positive rate. The focus of this paper is rule-learning, using RIPPER, on highly imbalanced intrusion datasets with an objective to improve the true positive rate (intrusions) without significantly increasing the false positives. We use RIPPER as the underlying rule classifier. To counter imbalance in data, we implement a combination of oversampling (both by replication and synthetic generation) and undersampling techniques. We also propose a clustering based methodology for oversampling by generating synthetic instances. We evaluate our approaches on two intrusion datasets—destination and actual packets based—constructed from actual Notre Dame traffic, giving a flavor of real-world data with its idiosyncrasies. Using ROC analysis, we show that oversampling by synthetic generation of minority (intrusion) class outperforms oversampling by replication and RIPPER's loss ratio method. Additionally, we establish that our clustering based approach is more suitable for detecting intrusions and is able to provide additional improvement over just synthetic generation of instances.
David A. Cieslak
,
Nitesh V. Chawla
, and Aaron D. Striegel
. "
Combating Imbalance in Network Intrusion Datasets
."
Proceedings of the IEEE International Conference on Granular Computing (GrC)
,
pp. 732–737,
May
2006
.
Abstract
Computing information gain in general data streams, in which we do not make any assumptions on the underlying distributions or domains, is a hard problem, severely constrained by the limitations on memory space. We present a simple randomized solution to this problem that is time and space efficient as well as tolerates a relative error that has a theoretical upper bound. It is based on a novel method of discretization of continuous domains using quantiles. Our empirical evaluation of the technique, using standard and simulated datasets, convincingly demonstrates its practicality and robustness. Our results include accuracy versus memory usage plots and comparisons with a popular discretization technique.
Alec Pawling,
Nitesh V. Chawla
, and Amitabh Chaudhary
. "
Computing Information Gain in Data Streams
."
Proceedings of the IEEE ICDM Workshop on Temporal Data Mining: Algorithms, Theory and Applications
,
pp. 72–81,
November
2005
.
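A small sketch of the quantity being approximated in the entry above: the information gain of a continuous feature after quantile-based discretization. The streaming method maintains ε-approximate quantiles under memory constraints; the offline version below uses exact empirical quantiles via NumPy and is only meant to make the computation concrete.

    import numpy as np

    def entropy(y):
        """Shannon entropy (in bits) of a label vector."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def info_gain_quantile(x, y, n_bins=8):
        """Information gain of feature x after discretizing it into
        (approximately) equal-frequency bins defined by empirical quantiles."""
        x, y = np.asarray(x), np.asarray(y)
        edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(x, edges)
        conditional = sum((bins == b).mean() * entropy(y[bins == b])
                          for b in np.unique(bins))
        return entropy(y) - conditional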
Abstract
We report on our experience with the first departmental offering of the data mining course in Spring 2005. The course was cross-listed such that both the upper level undergraduates and graduate students could attend. However, the majority of the registered students were undergraduates. Data mining, being a confluence of multiple fields, offers an interesting addition to the computer science curriculum. The main objective of the course was to provide grounding on both the theoretical and practical aspects of data mining and machine learning. In addition, the course used concepts learned in various courses throughout the undergraduate degree. The course utilized a machine learning toolkit, Weka, by the University of Waikato, New Zealand. In this paper, we present the various components of the course, structure, innovative assignments and discussions, and the project life cycle.
Nitesh V. Chawla
. "
Teaching Data Mining by Coalescing Theory and Applications
."
Proceedings of the 35th Annual ASEE/IEEE Conference on Frontiers in Education (FIE)
,
pp. S1J 17–23,
October
2005
.
Abstract
Random subspaces are a popular ensemble construction technique that improves the accuracy of weak classifiers. It has been shown, in different domains, that random subspaces combined with weak classifiers such as decision trees and nearest neighbor classifiers can provide an improvement in accuracy. In this paper, we apply the random subspace methodology to the 2-D face recognition task. The main goal of the paper is to see if the random subspace methodology can improve the performance of the face recognition system given the high-dimensional, temporal, and distribution-variant data. We used two different datasets to evaluate the methodology. One dataset comprises completely unique subjects for testing, and the other comprises the same subjects (in both training and testing), but the images in the test set are captured at different times under different conditions.
Nitesh V. Chawla
and Kevin W. Bowyer
. "
Ensembles in Face Recognition: Tackling the Extremes of High Dimensionality, Temporality, and Variance in Data
."
Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC)
,
vol. 3,
pp. 2246–2351,
October
2005
.
Abstract
A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.
Nitesh V. Chawla
. "
Data Mining for Imbalanced Datasets: An Overview
."
Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers
,
pp. 853–867,
September
2005
.
Abstract
Learning from imbalanced datasets presents an interesting problem both from modeling and economy standpoints. When the imbalance is large, classification accuracy on the smaller class(es) tends to be lower. In particular, when a class is of great interest but occurs relatively rarely such as cases of fraud, instances of disease, and regions of interest in large-scale simulations, it is important to accurately identify it. It then becomes more costly to misclassify the interesting class. In this paper, we implement a wrapper approach that computes the amount of under-sampling and synthetic generation of the minority class examples (SMOTE) to improve minority class accuracy. The f-value serves as the evaluation function. Experimental results show the wrapper approach is effective in optimization of the composite f-value, and reduces the average cost per test example for the datasets considered. We report both average cost per test example and the cost curves in the paper. The true positive rate of the minority class increases significantly without causing a significant change in the f-value. We also obtain the lowest cost per test example, compared to any result we are aware of for the KDD Cup-99 intrusion detection data set.
Nitesh V. Chawla
, Lawrence O. Hall, and Ajay Joshi
. "
Wrapper-Based Computation and Evaluation of Sampling Methods for Imbalanced Datasets
."
Proceedings of the ACM SIGKDD International Workshop on Utility-Based Data Mining (UBDM)
,
pp. 24–33,
August
2005
.
Abstract
Ensembles of classifiers are often used to achieve accuracy greater than any single classifier. The predictions of these classifiers are typically combined by uniform or weighted voting. In this paper, we approach ensemble construction under a multi-agent framework. Each individual agent is capable of learning from data, and the agents can either be homogeneous (same learning algorithm) or heterogeneous (different learning algorithm). These learning agents are combined by a meta-agent that utilizes an evolutionary algorithm, using accuracy as the fitness score, to discover the weights for each individual agent. The weights are indicative of the best searched combination (or collaboration) of the set of agents. Experimental results show that this approach is a valid model for ensemble building when compared to the best individual agent and a simple plurality vote of the agents.
Jared Sylvester
and
Nitesh V. Chawla
. "
Evolutionary Ensembles: Combining Learning Agents Using Genetic Algorithms
."
AAAI Workshop on Multi-Agent Systems
,
pp. 46–51,
July
2005
.
Abstract
An open source software repository is a collection of various projects with varying levels of activity, participation, and downloads. In this paper, we attempt to categorize and mine activity, thus discovering success, by focusing on the Games group. However, we observe that there are a significant number of projects within the Games category that are inactive or have no ranking assigned by Sourceforge. So, we first segmented our dataset based on just the ranking or activity assigned. We then constructed a set of characteristics, such as the number of developers, posters, and downloads, to identify associations, activity, and relationships in the smaller active set. We used regression rules and association rules to discover the associations and relationships.
Daniel Mack
,
Nitesh V. Chawla
, and Gregory Madey
. "
Activity Mining in Open Source Software
."
Proceedings of the Annual Conference of the North American Association for Computational Social and Organizational Science (NAACSOS)
,
June
2005
.
Abstract
Random subspaces are a popular ensemble construction technique that improves the accuracy of weak classifiers. It has been shown, in different domains, that random subspaces combined with weak classifiers such as decision trees and nearest neighbor classifiers can provide an improvement in accuracy. In this paper, we apply the random subspace methodology to the 2-D face recognition task. The main goal of the paper is to see if the random subspace methodology can do as well, if not better, than the single classifier constructed on the tuned face space. We also propose the use of a validation set for tuning the face space, to avoid bias in the accuracy estimation. In addition, we also compare the random subspace methodology to an ensemble of subsamples of image data. This work shows that a random subspaces ensemble can outperform a well-tuned single classifier for a typical 2-D face recognition problem. The random subspaces approach has the added advantage of requiring less careful tweaking.
Nitesh V. Chawla
and Kevin W. Bowyer
. "
Random Subspaces and Subsampling for 2-D Face Recognition
."
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
,
vol. 2,
pp. 582–589,
June
2005
.
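In scikit-learn terms, the random subspace construction in the entry above corresponds to a bagging ensemble that resamples feature dimensions rather than training examples; the configuration below is an illustrative assumption (the paper additionally projects images into a face space before classification and uses its own base learners and subspace sizes).

    from sklearn.ensemble import BaggingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Each ensemble member sees every training sample but only a random half
    # of the feature dimensions.
    subspace_ensemble = BaggingClassifier(
        KNeighborsClassifier(n_neighbors=1),
        n_estimators=25,
        max_samples=1.0,            # keep all samples
        bootstrap=False,
        max_features=0.5,           # subsample the features instead
        bootstrap_features=False,
        random_state=0,
    )
    # subspace_ensemble.fit(X_train, y_train); subspace_ensemble.predict(X_test)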
Abstract
Face recognition systems often use different images of a subject for training and enrollment. Typically, one may apply LDA to all the image samples or train a nearest neighbor classifier for each (separate) set of images. The latter can require that information about lighting or expression be available for each testing point. In this paper, we propose the usage of different images in a multiple classifier systems setting. Our main goals are to see (1) what is the preferred use of different images? And (2) can the multiple classifiers generalize well enough across different kinds of images in the testing set, thus mitigating the need for the meta-information? We show that an ensemble of classifiers outperforms the single classifier versions without any tuning, and is as good as a single classifier trained on all the images and tuned on the test set.
Nitesh V. Chawla
and Kevin W. Bowyer
. "
Designing Multiple Classifier Systems for Face Recognition
."
Proceedings of the 6th International Workshop on Multiple Classifier Systems (MCS)
,
pp. 407–416,
June
2005
.
Abstract
There has been increased interest in devising learning techniques that combine unlabeled data with labeled data—i.e. semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most of the published work on semi-supervised learning techniques assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer various questions such as the effect of independence or relevance amongst features, the effect of the size of the labeled and unlabeled sets and the effect of noise. We also investigate the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique particularly designed to correct for such bias.
Nitesh V. Chawla
and Grigoris J. Karakoulas
. "
Learning from Labeled and Unlabeled Data: An Empirical Study Across Techniques and Domains
."
Journal of Artificial Intelligence Research (JAIR)
,
vol. 23,
pp. 331–366,
March
2005
.
Abstract
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive data sets, because the size of the data set can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive data sets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.
Nitesh V. Chawla
, Lawrence O. Hall, Kevin W. Bowyer, and W. Philip Kegelmeyer
. "
Learning Ensembles from Bites: A Scalable and Accurate Approach
."
Journal of Machine Learning Research (JMLR)
,
vol. 5,
pp. 421–451,
December
2004
.
Abstract
Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. There is certainly some noise in the data and there may be significant noise. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples.
Steven Eschrich,
Nitesh V. Chawla
, and Lawrence O. Hall
. "
Learning to Predict in Complex Biological Domains
."
Journal of System Simulation
,
14
(11)
:1464–1471,
November
2004
.
Abstract
We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.
Predrag Radivojac,
Nitesh V. Chawla
, A. Keith Dunker, and Zoran Obradovic
. "
Classification and Knowledge Discovery in Protein Databases
."
Journal of Biomedical Informatics (JBI)
,
37
(4)
:224–239,
August
2004
.
Abstract
The purpose of this special issue is to communicate and present some of the latest research carried out in this area while reviewing other important recent developments in the field. In this Editorial, we begin by reviewing the class imbalance as well as an array of general solutions that were previously proposed to deal with it. We then discuss the progression of ideas starting at the 2000 workshop to today. In order to give a comprehensive picture of the state of the art in the field, we give a short overview of the papers that were presented at the 2003 workshop as well as a short description of the papers contained in this volume.
Nitesh V. Chawla
, Nathalie Japkowicz, and Aleksander Kołcz
. "
Editorial: Special Issue on Learning From Imbalanced Datasets
."
ACM SIGKDD Explorations
,
6
(1)
:1–6,
June
2004
.
Abstract
The purpose of this paper is to provide insight on the performance of different feature selection techniques and learning algorithms that we used on the five datasets of the competition. As part of our participation (CBAgroup) we considered filtering and wrapper feature selection techniques, combined with different learning algorithms. In terms of feature selection techniques, we used information gain, ReliefF, linear SVM together with forward selection and a genetic algorithm. In terms of inductive learning techniques we used a proprietary Bayesian learning algorithm as well as different types of hyperparameter-tuning algorithms for (standard and Bayesian) SVMs with linear and RBF kernel. We provide an evaluation of these techniques on the datasets.
Nitesh V. Chawla
, Grigoris Karakoulas, and Danny Roobaert
. "
Lessons Learned from Feature Selection Competition
."
Proceedings of the NIPS Workshop on Feature Selection
,
December
2003
.
Abstract
Many real world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers that have a higher predictive accuracy over the majority class(es), but poorer predictive accuracy over the minority class. SMOTE (Synthetic Minority Over-sampling TEchnique) is specifically designed for learning from imbalanced data sets. This paper presents a novel approach for learning from imbalanced data sets, based on a combination of the SMOTE algorithm and the boosting procedure. Unlike standard boosting where all misclassified examples are given equal weights, SMOTEBoost creates synthetic examples from the rare or minority class, thus indirectly changing the updating weights and compensating for skewed distributions. SMOTEBoost applied to several highly and moderately imbalanced data sets shows improvement in prediction performance on the minority class and overall improved F-values.
Nitesh V. Chawla
, Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W. Bowyer
. "
SMOTEBoost: Improving the Prediction of the Minority Class in Boosting
."
Proceedings of the 14th European Conference on Machine Learning and the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD)
,
pp. 107–119,
September
2003
.
Abstract
Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the "interesting" or "abnormal" class. Traditional machine learning algorithms can be biased towards the majority class due to its over-prevalence. It is desired that the interesting (minority) class prediction be improved, even if at the cost of additional majority class errors. In this paper, we study three issues, usually considered separately, concerning decision trees and imbalanced data sets—the quality of probabilistic estimates, pruning, and the effect of preprocessing the imbalanced data set by over- or under-sampling methods such that a fairly balanced training set is provided to the decision trees. We consider each issue independently and in conjunction with each other, highlighting the scenarios where one method might be preferred over another for learning decision trees from imbalanced data sets.
Nitesh V. Chawla
. "
C4.5 and Imbalanced Data Sets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure
."
Proceedings of the ICML Workshop on Learning from Imbalanced Data Sets II
,
vol. 3,
August
2003
.
Abstract
Bagging forms a committee of classifiers by bootstrap aggregation of training sets from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments with decision tree and neural network classifiers on various datasets show that, given the same size partitions and bags, disjoint partitions result in performance equivalent to, or better than, bootstrap aggregates (bags). Many applications (e.g., protein structure prediction) involve use of datasets that are too large to handle in the memory of the typical computer. Hence, bagging with samples the size of the data is impractical. Our results indicate that, in such applications, the simple approach of creating a committee of n classifiers from disjoint partitions each of size 1/n (which will be memory resident during learning) in a distributed way results in a classifier which has a bagging-like performance gain. The use of distributed disjoint partitions in learning is significantly less complex and faster than bagging.
Nitesh V. Chawla
, Thomas E. Moore Jr., Lawrence O. Hall, Kevin W. Bowyer, W. Philip Kegelmeyer, and Clayton Springer
. "
Distributed Learning with Bagging-Like Performance
."
Pattern Recognition Letters
,
24
(1)
:455–471,
January
2003
.
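A compact sketch of the committee-of-disjoint-partitions idea from the entry above, assuming NumPy arrays and binary 0/1 labels; the published work trains the partitions in a distributed fashion, which this single-process version does not attempt.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def disjoint_partition_committee(X, y, n_members=8, seed=0):
        """Train one tree per disjoint 1/n partition of the shuffled data."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(X))
        return [DecisionTreeClassifier(random_state=seed).fit(X[part], y[part])
                for part in np.array_split(order, n_members)]

    def committee_vote(members, X):
        """Unweighted majority vote over the committee (binary 0/1 labels)."""
        preds = np.stack([m.predict(X) for m in members])
        return (preds.mean(axis=0) > 0.5).astype(int)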
Abstract
Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples.
Steven Eschrich,
Nitesh V. Chawla
, and Lawrence O. Hall
. "
Generalization Methods in Bioinformatics
."
Proceedings of the ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD)
,
vol. 2,
pp. 25–32,
July
2002
.
Abstract
Bagging and boosting are two popular ensemble methods that achieve better accuracy than a single classifier. These techniques have limitations on massive datasets, as the size of the dataset can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive datasets. Pasting small votes can utilize the power of boosting and bagging, and potentially scale up to massive datasets. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable to massive datasets.
Nitesh V. Chawla
, Lawrence O. Hall, Kevin W. Bowyer, Thomas E. Moore Jr., and W. Philip Kegelmeyer
. "
Distributed Pasting of Small Votes
."
Proceedings of the 3rd International Workshop on Multiple Classifier Systems (MCS)
,
pp. 52–61,
June
2002
.
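A minimal sketch of the pasting-small-votes idea, in the Rvotes spirit: train many decision trees on small random "bites" of the training pool in parallel and combine them by majority vote. It assumes scikit-learn and joblib for parallelism; the bite size and member count are arbitrary illustrative choices, and the snippet is not the distributed framework described in the paper.

```python
# Sketch of pasting small votes: many cheap trees on tiny random subsets,
# trained in parallel across cores, combined by majority vote.
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def fit_on_bite(seed, X_pool, y_pool, bite_size=800):
    """Fit one committee member on a small random subset ("bite") of the pool."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_pool), size=bite_size, replace=False)
    return DecisionTreeClassifier(random_state=seed).fit(X_pool[idx], y_pool[idx])

# Hundreds of cheap members, trained in parallel across available cores.
members = Parallel(n_jobs=-1)(
    delayed(fit_on_bite)(seed, X_train, y_train) for seed in range(200))

votes = np.stack([m.predict(X_test) for m in members])
committee = (votes.mean(axis=0) >= 0.5).astype(int)
print("committee accuracy:", (committee == y_test).mean())
```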
Abstract
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Nitesh V. Chawla
, Kevin W. Bowyer, Lawrence O. Hall, and Philip Kegelmeyer
. "
SMOTE: Synthetic Minority Over-Sampling Technique
."
Journal of Artificial Intelligence Research (JAIR)
,
16
(1)
:321–357,
June
2002
.
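The core SMOTE step, creating synthetic minority examples by interpolating between a minority point and one of its k nearest minority neighbours, can be sketched in a few lines. The function below is a simplified illustration rather than the reference implementation (the imbalanced-learn package provides a maintained one); the parameter names are ours.

```python
# Sketch of SMOTE-style oversampling: interpolate between each chosen minority
# example and a randomly selected one of its k nearest minority neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_synthetic, k=5, random_state=0):
    """Generate n_synthetic new minority examples by k-NN interpolation."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbours = nn.kneighbors(X_minority)

    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_minority))                 # a minority example
        nb = X_minority[rng.choice(neighbours[j][1:])]    # one of its k neighbours
        gap = rng.random()                                # interpolation factor in [0, 1)
        synthetic[i] = X_minority[j] + gap * (nb - X_minority[j])
    return synthetic

# Toy usage: 30 minority points in 2-D, oversampled by 200%.
rng = np.random.default_rng(1)
X_min = rng.normal(loc=2.0, size=(30, 2))
X_new = smote(X_min, n_synthetic=60)
print(X_new.shape)   # (60, 2)
```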
Abstract
Bagging forms a committee of classifiers by bootstrap aggregation of training sets from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments on various datasets show that, given the same size partitions and bags, the use of disjoint partitions results in better performance than the use of bags. Many applications (e.g., protein structure prediction) involve the use of datasets that are too large to handle in the memory of the typical computer. Our results indicate that, in such applications, the simple approach of creating a committee of classifiers from disjoint partitions is to be preferred over the more complex approach of bagging.
Nitesh V. Chawla
, Thomas E. Moore Jr., Kevin W. Bowyer, Lawrence O. Hall, Clayton Springer, and W. Philip Kegelmeyer
. "
Bagging is a Small-Data-Set Phenomenon
."
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
,
vol. 2,
pp. 685–689,
December
2001
.
Abstract
Ensembles of classifiers offer promise in increasing overall classification accuracy. The availability of extremely large datasets has opened avenues for application of distributed and/or parallel learning to efficiently learn models of them. In this paper, distributed learning is done by training classifiers on disjoint subsets of the data. We examine a random partitioning method to create disjoint subsets and propose a more intelligent way of partitioning into disjoint subsets using clustering. It was observed that the intelligent method of partitioning generally performs better than random partitioning for our datasets. In both methods a significant gain in accuracy may be obtained by applying bagging to each of the disjoint subsets, creating multiple diverse classifiers. The significance of our finding is that a partition strategy for even small/moderate sized datasets when combined with bagging can yield better performance than applying a single learner using the entire dataset.
Nitesh V. Chawla
, Steven Eschrich, and Lawrence O. Hall
. "
Creating Ensembles of Classifiers
."
Proceedings of the IEEE International Conference on Data Mining (ICDM)
,
pp. 580–581,
November
2001
.
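To make the clustering-based partitioning concrete, the sketch below assigns training examples to disjoint subsets with k-means rather than at random, applies bagging within each subset, and votes across all resulting trees. It assumes scikit-learn's KMeans and DecisionTreeClassifier on synthetic data; the cluster and bag counts are illustrative, not the settings used in the paper.

```python
# Sketch: "intelligent" disjoint partitioning via k-means, with bagging applied
# inside each partition and a single majority vote over all members.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_partitions, n_bags_per_partition = 4, 5
labels = KMeans(n_clusters=n_partitions, n_init=10, random_state=0).fit_predict(X_train)

members = []
for c in range(n_partitions):
    Xc, yc = X_train[labels == c], y_train[labels == c]
    for _ in range(n_bags_per_partition):        # bagging within each partition
        idx = rng.choice(len(Xc), size=len(Xc), replace=True)
        members.append(DecisionTreeClassifier().fit(Xc[idx], yc[idx]))

votes = np.stack([m.predict(X_test) for m in members])
pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("clustered-partition committee accuracy:", (pred == y_test).mean())
```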
Abstract
In the Third Critical Assessment of Techniques for Protein Structure Prediction ("CASP-3") contest, the best performance was obtained with a classifier that uses neural networks, a window size of fifteen around a given amino acid, and a training set of about 299,186 amino acids. We set out to investigate the possibility of obtaining better performance by using a bagging-like committee of binary decision trees, created using an order of magnitude more training data. There are two main reasons to believe that it should be possible to obtain better performance in this way. One is that Jones did not use a committee of classifiers in CASP-3 (and used only a four-classifier committee in CASP-4), whereas bagging studies indicate that improvement plateaus in the range of thirty to fifty classifiers in a committee. A second is that, by using supercomputers available at the Sandia National Labs, it is feasible to use an order of magnitude more training data than was used by Jones. This paper reports on our experiences pursuing this line of research. We show that with "large" data sets and with bag size equal to partition size, simple disjoint partitioning performs at least as well as standard bagging. Given large datasets, either outperforms a single classifier built on all the data. We also show that there are subtle differences in the operation of binary decision trees and neural networks for this problem. One difference is that the neural network seems less prone to "over-learning" the "easy" subset of the training data.
Nitesh V. Chawla
, Thomas E. Moore Jr., Kevin W. Bowyer, Lawrence O. Hall, Clayton Springer, and W. Philip Kegelmeyer
. "
Bagging-Like Effects for Decision Trees and Neural Nets in Protein Secondary Structure Prediction
."
Proceedings of the ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD)
,
pp. 50–59,
August
2001
.
Abstract
Simulation problems in the DOE ASCI program generate visualization datasets more than a terabyte in size. The practical difficulties in visualizing such datasets motivate the desire for automatic recognition of salient events. We have developed a parallel decision tree classifier for use in this context. Comparisons to ScalParC, a previous attempt to build a fast parallelization of a decision tree classifier, are provided. Our parallel classifier executes on the "ASCI Red" supercomputer. Experiments demonstrate that datasets too large to be processed on a single processor can be efficiently handled in parallel, and suggest that there need not be any decrease in accuracy relative to a monolithic classifier constructed on a single processor.
Kevin W. Bowyer, Lawrence O. Hall, Thomas E. Moore Jr.,
Nitesh V. Chawla
, and W. Phillip Kegelmeyer
. "
A Parallel Decision Tree Builder for Mining Very Large Visualization Datasets
."
Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC)
,
vol. 3,
pp. 1888–1893,
October
2000
.
Abstract
In this paper a concern about the accuracy (as a function of parallelism) of a certain class of distributed learning algorithms is raised, and one proposed improvement is illustrated. We focus on learning a single model from a set of disjoint data sets, which are distributed across a set of computers. The model is a set of rules. The distributed data sets may be disjoint for any of several reasons. In our approach, the first step is to construct a rule set (model) for each of the original disjoint data sets. Then rule sets are merged until an eventual final rule set is obtained which models the aggregate data. We show that this approach is comparable in accuracy to directly creating a rule set from the aggregate data and promises faster learning. Accuracy can drop off as the degree of parallelism increases. However, an approach has been developed to extend the degree of parallelism achieved before this problem takes over.
Lawrence O. Hall,
Nitesh V. Chawla
, Kevin W. Bowyer, and W. Philip Kegelmeyer
. "
Learning Rules from Distributed Data
."
Large-Scale Parallel Data Mining
,
pp. 211–220,
January
2000
.
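The merge step, combining per-partition rule sets and resolving conflicting rules, lends itself to a short sketch. The snippet below uses a deliberately simplified rule representation (a frozenset of threshold conditions plus a predicted label and a coverage count) and resolves conflicts in favour of the better-covered rule; this is one reasonable resolution strategy chosen for illustration, not the paper's exact procedure.

```python
# Sketch of merging rule sets learned on disjoint partitions: identical rules
# pool their coverage; rules with the same conditions but conflicting labels
# are resolved in favour of the better-covered rule.
from collections import namedtuple

Rule = namedtuple("Rule", ["conditions", "label", "coverage"])
# conditions: frozenset of (feature, operator, threshold) triples

def merge_rule_sets(rule_sets):
    """Merge per-partition rule sets, resolving conflicts by coverage."""
    merged = {}   # conditions -> Rule
    for rules in rule_sets:
        for rule in rules:
            existing = merged.get(rule.conditions)
            if existing is None:
                merged[rule.conditions] = rule
            elif existing.label == rule.label:
                # Same rule learned on several partitions: pool the evidence.
                merged[rule.conditions] = existing._replace(
                    coverage=existing.coverage + rule.coverage)
            elif rule.coverage > existing.coverage:
                # Conflicting conclusions: keep the better-supported rule.
                merged[rule.conditions] = rule
    return list(merged.values())

# Toy rule sets from two partitions.
part_a = [Rule(frozenset({("age", "<=", 30)}), "minority", 40),
          Rule(frozenset({("income", ">", 50)}), "majority", 120)]
part_b = [Rule(frozenset({("age", "<=", 30)}), "majority", 15),   # conflicts with part_a
          Rule(frozenset({("income", ">", 50)}), "majority", 90)]

for r in merge_rule_sets([part_a, part_b]):
    print(r)
```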
Abstract
Consider a labeled data set of 1 terabyte in size. A salient subset might depend upon the user's interests. Clearly, browsing such a large data set to find interesting areas would be very time consuming. An intelligent agent which, for a given class of user, could provide hints on areas of the data that might interest the user would be very useful. Given large data sets having categories of salience for different user classes attached to the data in them, these labeled sets of data can be used to train a decision tree to label unseen data examples with a category of salience. The training set will be much larger than usual. This paper describes an approach to generating the rules for an agent from a large training set. A set of decision trees is built in parallel on tractably sized training data sets which are subsets of the original data. Each learned decision tree is reduced to a set of rules, conflicting rules are resolved, and the resultant rules are merged into one set. Results from cross-validation experiments on a data set suggest this approach may be effectively applied to large sets of data.
Lawrence O. Hall,
Nitesh V. Chawla
, and Kevin W. Bowyer
. "
Decision Tree Learning on Very Large Data Sets
."
Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC)
,
vol. 3,
pp. 2579–2584,
October
1998
.
Abstract
Very large data sets may be utilized for visualization. To focus attention on the salient regions of a data set being visualized, it is useful to have information on the interesting regions of data. It is possible to learn the salience of regions of data, but doing so serially on currently available terabyte-plus datasets is very slow, if feasible at all. This paper describes an approach in which decision trees can be learned in parallel from disjoint subsets of a complete data set. The learned decision trees are converted to rules and the rules are combined into a single rule set. The combination process is based on an approach, suggested in Williams' 1990 dissertation, in which rules that match one or more examples but assign them to different classes are resolved. Similar rules are also combined into more general rules. An alternate approach to combining the rule sets, based on the work of Provost and Hennessy (1996), is also discussed. Results on two small data sets indicate the decision-trees-to-rules with rule conflict resolution approach has promise.
Lawrence O. Hall,
Nitesh V. Chawla
, and Kevin W. Bowyer
. "
Combining Decision Trees Learned in Parallel
."
Proceedings of the ACM SIGKDD Workshop on Distributed Data Mining (KDDW-DDM)
,
pp. 10–15,
August
1998
.