DIAL

Data, Inference, Analysis and Learning Lab

Software

  • Model Monitor: A Toolkit for evaluating, comparing, and monitoring the effectiveness of classification models under distribution shift
    • Model Monitor is a Java toolkit for the systematic evaluation of classifiers under changes in distribution. It provides methods for detecting distribution shifts in data, comparing the performance of multiple classifiers under shifts in distribution, and evaluating the robustness of individual classifiers to distribution change. As such, it allows users to determine the best model (or models) for their data under a number of potential scenarios. Additionally, Model Monitor is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired.
      Techniques implemented in this package come primarily from the following sources:
      D.A. Cieslak and N.V. Chawla
      Detecting Fracture Points in Classifier Performance.
      7th IEEE Conference on Data Mining, pp. 123-132, 2007.
      D.A. Cieslak and N.V. Chawla
      A Framework for Monitoring Classifiers' Performance: When and Why Failure Occurs?
      Knowledge and Information Systems 2008.
      Download Manual Paper
  • Condor Grid Analysis Software Package (GASP)
    • Whether you are a first-time Condor user or an advanced system administrator, job failure on the grid is inevitable. In a submission batch of 1000 jobs, one might observe 500 job failures, leaving the user with several questions: Why are some jobs evicted multiple times? Why do some jobs create Shadow Exceptions? Is a group of machines incapable of running a particular submission? All of these are difficult to answer due to the scale of the machine pool and the number of jobs submitted. Failure may appear to occur at random, but often there is a pattern, and the Condor Grid Analysis Software Package (GASP) is the tool to help you find it.
      This software implements work from the following publications:
      David Cieslak, Nitesh Chawla, and Douglas Thain
      Troubleshooting Thousands of Jobs on Production Grids Using Data Mining Techniques.
      IEEE Grid Computing, September 2008.
      David Cieslak, Douglas Thain, Nitesh Chawla
      Short Paper: Troubleshooting Distributed Systems via Data Mining.
      IEEE Symposium on High Performance Distributed Computing (HPDC), Paris, France, June 2006.
      Download Instructions Paper
  • Perl/C SMOTE+Undersample Wrapper Implementation
    • Learning from imbalanced data sets presents a convoluted problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for misclassification of rare events. Under such circumstances, it is necessary to generate models with high minority class accuracy and lower total misclassification cost, so it becomes important to apply resampling and/or cost-based reweighting to improve the prediction of the minority class. However, the question remains of how to apply the sampling strategy effectively. To that end, we provide a wrapper paradigm that discovers the amount of re-sampling for a dataset. This method has produced favorable results compared to other imbalance methods and to some cost-sensitive learning methods, namely MetaCost and Cost-Sensitive Classifier. In addition, we obtain the lowest cost per test example compared to any result we are aware of for the KDD Cup-99 intrusion detection dataset.
      Download
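The two resampling operations that such a wrapper tunes can be illustrated with a minimal Python sketch. It is not the Perl/C implementation distributed here: the function names, the parameters, and the brute-force neighbor search are assumptions made for illustration. `smote` creates synthetic minority points by interpolating between a minority example and one of its k nearest minority neighbors, and `undersample` keeps a random fraction of the majority class. The wrapper's search over sampling amounts is not shown.

```python
import math
import random

def smote(minority, n_synthetic, k=5, rng=None):
    """Return n_synthetic points interpolated between minority examples
    and their k nearest minority neighbors (illustrative sketch only)."""
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Brute-force nearest neighbors among the other minority examples.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

def undersample(majority, keep_fraction, rng=None):
    """Keep a random subset of the majority class, sampled without replacement."""
    rng = rng or random.Random()
    n_keep = max(1, round(keep_fraction * len(majority)))
    return rng.sample(majority, n_keep)
```

A wrapper in this spirit would try several values of `n_synthetic` and `keep_fraction`, evaluate a classifier on held-out data for each, and keep the setting with the best minority-class performance or lowest cost.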
  • Hellinger Distance Decision Tree Implementation in WEKA
    • This is a WEKA-3-7-1 jar file which includes Hellinger Distance Decision Trees (HDDT) with binary nominal splits. To run, use the following:
      java -cp <path to weka-hddt.jar> weka.classifiers.trees.HTree -U -A -B -t <training file> -T <testing file>
      Download
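For readers unfamiliar with the splitting criterion, the following Python sketch computes the Hellinger distance of a candidate split for a two-class problem as it is usually stated for HDDT: the square root of the sum, over the branches of the split, of the squared difference between the square root of the fraction of positives and the square root of the fraction of negatives sent to each branch. The function name and the input layout are hypothetical, and this is not the code shipped in the jar.

```python
import math

def hellinger_split_value(branch_counts):
    """branch_counts: list of (positives, negatives) pairs, one per branch.

    Returns the Hellinger distance between the distribution of positives
    and the distribution of negatives across the branches of the split.
    """
    total_pos = sum(p for p, _ in branch_counts)
    total_neg = sum(n for _, n in branch_counts)
    if total_pos == 0 or total_neg == 0:
        return 0.0  # only one class present: no split can separate anything
    return math.sqrt(sum(
        (math.sqrt(p / total_pos) - math.sqrt(n / total_neg)) ** 2
        for p, n in branch_counts
    ))
```

The value ranges from 0 for a split that distributes both classes identically to the square root of 2 for a split that separates them completely. Because each class is normalized by its own total, the value does not change when the class ratio changes, which is the property that makes the criterion attractive for imbalanced data.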
  • C++ Reduced Error Pruning Tree Implementation
    • This is a high-performance, memory-efficient C++ re-implementation of the REPTree provided by WEKA. This code often requires less than 5% of the time and space of the WEKA code on equivalent large data sets. This stand-alone implementation is useful as a base classifier in the context of a shell script wrapper implementing bootstrap aggregation and/or random subspace.
      Download
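The ensemble use case mentioned above can be sketched in Python rather than as a shell script. The sketch assumes a hypothetical `train(rows, labels, feature_indices)` callable that stands in for an invocation of the stand-alone tree learner and returns a prediction function; none of these names come from the distributed code. Each ensemble member sees a bootstrap sample of the rows (bootstrap aggregation) and a random subset of the features (random subspace), and predictions are combined by majority vote.

```python
import random
from collections import Counter

def build_ensemble(rows, labels, train, n_members=10,
                   subspace_fraction=0.5, rng=None):
    """Train n_members base classifiers on bootstrap samples of the rows,
    each restricted to a random subset of the feature indices."""
    rng = rng or random.Random()
    n_rows, n_features = len(rows), len(rows[0])
    n_sub = max(1, round(subspace_fraction * n_features))
    members = []
    for _ in range(n_members):
        # Bootstrap: draw n_rows row indices with replacement.
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Random subspace: draw feature indices without replacement.
        feats = sorted(rng.sample(range(n_features), n_sub))
        members.append(train([rows[i] for i in idx],
                             [labels[i] for i in idx],
                             feats))
    return members

def predict(members, row):
    """Combine the members' predictions for one row by majority vote."""
    votes = Counter(member(row) for member in members)
    return votes.most_common(1)[0][0]
```

Setting `subspace_fraction=1.0` gives plain bagging, since every member then sees all features.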
  • LPmade: Link Prediction Made Easy
    • LPmade is a complete cross-platform software solution for multicore link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined process for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make script that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the process of creating multivariate supervised link prediction models, as proposed by Lichtenwalter and Chawla in 2010, using version 3.5.8 of WEKA modified to operate effectively on extremely large data sets. With mere minutes of manual work, one may start with a raw stream of records representing a network and progress through hundreds of steps to complete plots, gigabytes or terabytes of output, and actionable or publishable results.
      Download Manual Publication
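As an illustration of what unsupervised link predictors compute, the Python sketch below scores a candidate node pair with three widely used neighborhood measures: common neighbors, the Jaccard coefficient, and Adamic/Adar. The text above does not list which methods LPmade implements, so the choice of measures, the adjacency-dictionary layout, and the function names are assumptions; this is not LPmade code.

```python
import math

def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    """Sum over shared neighbors of 1 / log(degree), so that rarely
    connected shared neighbors count for more than hubs."""
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[u] & adj[v]
               if len(adj[z]) > 1)  # skip degree-1 nodes, where log(1) == 0

# Undirected toy network stored as node -> set of neighbors.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
scores = {
    "common_neighbors": common_neighbors(adj, "a", "b"),
    "jaccard": jaccard(adj, "a", "b"),
    "adamic_adar": adamic_adar(adj, "a", "b"),
}
```

Ranking all unconnected pairs by one of these scores and comparing the top of the ranking against links that appear later in the data is the basic unsupervised evaluation loop.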