WEKA for Imbalanced Data (SMOTE and HDDT)

We've modified WEKA (v3.7.14), the popular Java-based data mining software package, to include two methods developed specifically for learning from imbalanced data: the Synthetic Minority Oversampling Technique (SMOTE), a popular sampling method for data preprocessing, and the Hellinger Distance Decision Tree (HDDT), a skew-insensitive decision-tree algorithm for classification. In the provided WEKA implementation, SMOTE appears as a supervised instance filter and HDDT as a tree-based classifier. The SMOTE filter implementation for WEKA may also be downloaded separately here.
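
For orientation, here is a minimal sketch of how the two methods can be combined from Java code. It assumes the filter lives at weka.filters.supervised.instance.SMOTE, as in the standalone SMOTE package; the HDDT class name below is an assumption, so check the distribution for the actual package path.

    import java.util.Random;

    import weka.classifiers.AbstractClassifier;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.supervised.instance.SMOTE;

    public class ImbalancedDemo {
        public static void main(String[] args) throws Exception {
            // "imbalanced.arff" is a placeholder for any class-imbalanced dataset.
            Instances data = DataSource.read("imbalanced.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // SMOTE as a supervised instance filter: synthesize 100% more
            // minority-class instances from 5 nearest neighbors.
            SMOTE smote = new SMOTE();
            smote.setPercentage(100.0);
            smote.setNearestNeighbors(5);

            // Assumed class name for HDDT in the modified WEKA distribution.
            Classifier hddt = AbstractClassifier.forName(
                "weka.classifiers.trees.HDDT", new String[0]);

            // Wrapping the filter keeps oversampling inside the training folds
            // during cross-validation, so test folds stay untouched.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(smote);
            fc.setClassifier(hddt);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(fc, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }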

For more details on these methods, please consult the following publications:

[Download]

LPmade: Link Prediction Made Easy

LPmade is a complete cross-platform software solution for multicore link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined process for correct results and fair evaluation, so the second principal contribution of LPmade is a GNU make script that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the process of creating multivariate supervised link prediction models, as proposed by Lichtenwalter, Lussier, and Chawla (KDD 2010), using WEKA (v3.5.8) modified to operate effectively on extremely large data sets. With mere minutes of manual work, one may start with a raw stream of records representing a network and progress through hundreds of automated steps to finished plots, gigabytes or terabytes of output, and actionable or publishable results.
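
To give a feel for the unsupervised methods in the network library (illustration only; LPmade itself is C++, and none of these names come from its API), here is a self-contained sketch of two standard link predictors, common neighbors and Adamic-Adar, over a small adjacency-list graph.

    import java.util.*;

    public class LinkScores {
        // Adjacency lists for a small undirected graph; hypothetical names,
        // independent of LPmade's actual interface.
        private final Map<Integer, Set<Integer>> adj = new HashMap<>();

        void addEdge(int u, int v) {
            adj.computeIfAbsent(u, k -> new HashSet<>()).add(v);
            adj.computeIfAbsent(v, k -> new HashSet<>()).add(u);
        }

        Set<Integer> neighbors(int u) {
            return adj.getOrDefault(u, Collections.emptySet());
        }

        // Common neighbors: |N(u) ∩ N(v)|.
        int commonNeighbors(int u, int v) {
            Set<Integer> shared = new HashSet<>(neighbors(u));
            shared.retainAll(neighbors(v));
            return shared.size();
        }

        // Adamic-Adar: sum of 1 / log |N(z)| over shared neighbors z,
        // weighting rare shared neighbors more heavily.
        double adamicAdar(int u, int v) {
            double score = 0.0;
            for (int z : neighbors(u)) {
                if (neighbors(v).contains(z) && neighbors(z).size() > 1) {
                    score += 1.0 / Math.log(neighbors(z).size());
                }
            }
            return score;
        }

        public static void main(String[] args) {
            LinkScores g = new LinkScores();
            g.addEdge(1, 2); g.addEdge(1, 3); g.addEdge(2, 3); g.addEdge(3, 4);
            System.out.println("CN(1,4) = " + g.commonNeighbors(1, 4));
            System.out.println("AA(1,4) = " + g.adamicAdar(1, 4));
        }
    }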

For more details on the supervised link prediction methods implemented, please consult the following publication:

  • Ryan N. Lichtenwalter, Jake T. Lussier, and Nitesh V. Chawla. "New Perspectives and Methods in Link Prediction." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 243–252, July 2010.

[Download] [Manual]

DisNet: A Framework for Distributed Graph Computation

DisNet is a framework for distributed computation in large networks, implemented in C++ for high efficiency. To use DisNet, the user needs only to supply two small fragments of code describing the fundamental kernel of the computation. The framework automatically divides and distributes the workload and manages completion using an arbitrary number of heterogeneous computational resources. In practice, we have used thousands of machines and observed commensurate speedups.
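
The two-fragment model is easy to picture: one fragment computes a per-vertex result, and the other folds results into an aggregate, while the framework shards the vertex loop across workers. The sketch below is written in Java for consistency with the other examples on this page; DisNet itself is C++, and every name here is hypothetical rather than its real interface.

    import java.util.*;

    public class KernelSketch {
        // Fragment 1: the computation run for each vertex (here, its degree).
        static int vertexKernel(int v, Map<Integer, List<Integer>> graph) {
            return graph.getOrDefault(v, Collections.<Integer>emptyList()).size();
        }

        // Fragment 2: how per-vertex results are combined (here, a histogram
        // mapping degree -> count).
        static void combine(Map<Integer, Integer> histogram, int result) {
            histogram.merge(result, 1, Integer::sum);
        }

        public static void main(String[] args) {
            Map<Integer, List<Integer>> graph = new HashMap<>();
            graph.put(1, Arrays.asList(2, 3));
            graph.put(2, Arrays.asList(1));
            graph.put(3, Arrays.asList(1));

            // A real framework would distribute this loop over many machines;
            // it runs sequentially here for illustration.
            Map<Integer, Integer> degreeHistogram = new HashMap<>();
            for (int v : graph.keySet()) {
                combine(degreeHistogram, vertexKernel(v, graph));
            }
            System.out.println(degreeHistogram); // {1=2, 2=1}
        }
    }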

[Download] [Manual]

Model Monitor (M2)

Model Monitor is a Java toolkit for the systematic evaluation of classifiers under changes in distribution. It provides methods for detecting distribution shifts in data, comparing the performance of multiple classifiers under shifts in distribution, and evaluating the robustness of individual classifiers to distribution change. As such, it allows users to determine the best model (or models) for their data under a number of potential scenarios. Additionally, Model Monitor is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired.
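
As a minimal illustration of what shift detection involves (a sketch of a standard test, not Model Monitor's actual API), the code below computes the two-sample Kolmogorov-Smirnov statistic for one numeric feature: the largest gap between the empirical CDFs of a reference sample and a newly observed sample.

    import java.util.Arrays;

    public class ShiftCheck {
        // D = max over x of |F_ref(x) - F_new(x)|, comparing empirical CDFs.
        static double ksStatistic(double[] ref, double[] cur) {
            double[] a = ref.clone(), b = cur.clone();
            Arrays.sort(a);
            Arrays.sort(b);
            int i = 0, j = 0;
            double d = 0.0;
            while (i < a.length && j < b.length) {
                if (a[i] <= b[j]) i++; else j++;
                d = Math.max(d, Math.abs((double) i / a.length - (double) j / b.length));
            }
            return d;
        }

        public static void main(String[] args) {
            double[] reference = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6};
            double[] incoming  = {1.1, 1.2, 1.3, 1.4, 1.5, 1.6}; // clearly shifted
            double d = ksStatistic(reference, incoming);
            // Flag a shift when D exceeds a critical value, roughly
            // c(alpha) * sqrt((n + m) / (n * m)) with c(0.05) ≈ 1.36.
            System.out.printf("KS statistic D = %.3f%n", d);
        }
    }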

Techniques implemented in this package come primarily from the following sources:

[Download] [Manual]

Perl/C SMOTE+Undersampling Wrapper Implementation

Learning from imbalanced data sets presents a challenging problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for misclassifying the rare events. Under such circumstances, generating models with high minority-class accuracy and low total misclassification cost is necessary, and it becomes important to apply resampling and/or cost-based reweighting to improve prediction of the minority class. The question, however, is how to apply the sampling strategy effectively. To that end, we provide a wrapper paradigm that discovers the appropriate amount of resampling for a dataset. This method has produced favorable results compared to other imbalance methods and several cost-sensitive learning methods, such as MetaCost. With it, we have obtained the lowest cost per test example of any result we are aware of on the KDD Cup 1999 intrusion detection dataset.
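
The wrapper idea itself is straightforward to sketch: treat the undersampling and SMOTE amounts as parameters and search over them, scoring each combination by cross-validated misclassification cost on the training data. The sketch below uses the WEKA classes mentioned above rather than the Perl/C implementation, and the cost values and candidate grids are illustrative assumptions.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.MultiFilter;
    import weka.filters.supervised.instance.SMOTE;
    import weka.filters.supervised.instance.SpreadSubsample;

    public class SamplingWrapper {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("imbalanced.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // Illustrative cost model: a missed minority instance (false
            // negative) costs 10 times a false positive.
            double fnCost = 10.0, fpCost = 1.0;
            double bestCost = Double.MAX_VALUE, bestSmote = 0, bestSpread = 0;

            // The wrapper searches the sampling amounts instead of fixing
            // them a priori; these grids are made up for illustration.
            for (double smotePct : new double[] {0, 100, 200, 300}) {
                for (double spread : new double[] {1, 2, 5}) {
                    SMOTE smote = new SMOTE();
                    smote.setPercentage(smotePct);
                    SpreadSubsample under = new SpreadSubsample();
                    under.setDistributionSpread(spread); // cap majority:minority ratio

                    MultiFilter resample = new MultiFilter();
                    resample.setFilters(new Filter[] {smote, under});

                    FilteredClassifier fc = new FilteredClassifier();
                    fc.setFilter(resample);
                    fc.setClassifier(new J48());

                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(fc, data, 5, new Random(1));
                    // Assumes the minority class has index 1.
                    double cost = fnCost * eval.numFalseNegatives(1)
                                + fpCost * eval.numFalsePositives(1);
                    if (cost < bestCost) {
                        bestCost = cost; bestSmote = smotePct; bestSpread = spread;
                    }
                }
            }
            System.out.printf("best: SMOTE %.0f%%, spread %.0f, cost %.1f%n",
                              bestSmote, bestSpread, bestCost);
        }
    }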

For more details on the wrapper method, please consult the following publication:

[Download]

Condor Grid Analysis Software Package (GASP)

Whether you are a first-time Condor user or an advanced system administrator, job failure on the grid is inevitable. In a submission batch of 1,000 jobs, one might observe 500 job failures, leaving the user with several questions: Why are some jobs evicted multiple times? Why do some jobs produce Shadow Exceptions? Is a group of machines incapable of running a particular submission? All of these questions are difficult to answer at the scale of the machine pool and the number of jobs submitted. Failures may appear random, but often there is a pattern, and the Condor Grid Analysis Software Package (GASP) is the tool to help you find it.
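
As a small taste of the kind of log mining GASP automates (illustrative only, not GASP itself), the sketch below tallies evictions and Shadow Exceptions per job from a Condor user log, assuming the standard user-log event codes 004 (job was evicted) and 007 (shadow exception) and a placeholder log path.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CondorLogScan {
        public static void main(String[] args) throws IOException {
            // Condor user-log events start with a numeric code and the job id,
            // e.g. "004 (1234.000.000) ... Job was evicted."
            Pattern event = Pattern.compile("^(004|007) \\((\\d+)\\.(\\d+)");
            Map<String, Integer> evictions = new HashMap<>();
            Map<String, Integer> shadowExceptions = new HashMap<>();

            for (String line : Files.readAllLines(Paths.get("jobs.log"))) {
                Matcher m = event.matcher(line);
                if (!m.find()) continue;
                String job = m.group(2) + "." + m.group(3); // cluster.proc
                Map<String, Integer> counts =
                    m.group(1).equals("004") ? evictions : shadowExceptions;
                counts.merge(job, 1, Integer::sum);
            }
            evictions.forEach((job, n) ->
                System.out.println("job " + job + ": evicted " + n + " time(s)"));
            shadowExceptions.forEach((job, n) ->
                System.out.println("job " + job + ": " + n + " shadow exception(s)"));
        }
    }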

[Download] [Instructions]