Terminology and Notation (for Predictive Analytics)

Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer

to the same thing. For example, in the machine learning (artificial intelligence) field, the variable

being predicted is the output variable or the target variable. To a statistician, it is the dependent

variable or the response. Here is a summary of terms used:

Algorithm refers to a specific procedure used to implement a particular data mining technique:

classification tree, discriminant analysis, etc.

Attribute - see Predictor.

Case - see Observation.

Confidence has a specific meaning in association rules of the type "If A and B are purchased, C

is also purchased." Confidence is the conditional probability that C will be purchased, if A

and B are purchased.

Confidence also has a broader meaning in statistics ("confidence interval"), concerning the degree

of error in an estimate that results from selecting one sample as opposed to another.
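The association-rules sense of confidence can be illustrated with a minimal sketch that estimates it from a handful of transactions. The basket contents and the helper name `rule_confidence` are made up for illustration and do not come from any particular library.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from a list of item sets."""
    antecedent = set(antecedent)
    consequent = set(consequent)
    # Transactions that contain every antecedent item (A and B).
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        raise ValueError("no transaction contains the antecedent")
    # Of those, the ones that also contain the consequent (C).
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)


# Hypothetical purchase data: each set is one transaction.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C", "D"},
    {"A", "C"},
    {"B", "D"},
]

confidence = rule_confidence(transactions, {"A", "B"}, {"C"})
print(f"confidence(A, B -> C) = {confidence:.2f}")
```

Three of the five transactions contain both A and B, and two of those three also contain C, so the estimated confidence of the rule is 2/3.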

Dependent variable - see Response.

Estimation - see Prediction.

Feature - see Predictor.

Holdout sample is a sample of data not used in fitting a model, used to assess the performance

of that model; this book uses the terms validation set or, if one is used in the problem, test set

instead of holdout sample.

Input variable - see Predictor.

Lift Chart -- Important!

Model refers to an algorithm as applied to a dataset, complete with its settings (many of the

algorithms have parameters which the user can adjust).

Observation is the unit of analysis on which the measurements are taken (a customer, a

transaction, etc.); also called case, record, pattern, or row. Each row typically represents a

record and each column a variable.

Outcome variable - see Response.

Output variable - see Response.

P(A|B) is the conditional probability of event A occurring given that event B has occurred. Read

as "the probability that A will occur, given that B has occurred."
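As a small worked example of the P(A|B) notation, the sketch below enumerates the outcomes of rolling two dice and computes a conditional probability as a ratio of counts. The two events are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))


def event_a(roll):
    return roll[0] + roll[1] == 8   # A: the two dice sum to 8


def event_b(roll):
    return roll[0] % 2 == 0         # B: the first die is even


n_b = sum(1 for r in outcomes if event_b(r))
n_a_and_b = sum(1 for r in outcomes if event_a(r) and event_b(r))

# P(A|B) = P(A and B) / P(B); with equally likely outcomes
# this reduces to a ratio of counts.
p_a_given_b = Fraction(n_a_and_b, n_b)
p_a = Fraction(sum(1 for r in outcomes if event_a(r)), len(outcomes))

print("P(A)   =", p_a)
print("P(A|B) =", p_a_given_b)
```

Here P(A) is 5/36, while P(A|B) is 3/18 = 1/6: learning that B has occurred changes the probability assigned to A.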

Pattern is a set of measurements on an observation (e.g., the height, weight, and age of a person).

Prediction means the prediction of the value of a continuous output variable; also called estimation.

Predictor, usually denoted by X, is also called a feature, input variable, independent variable, or,

from a database perspective, a field.

Record - see Observation.

Response, usually denoted by Y, is the variable being predicted in supervised learning; also called

dependent variable, output variable, target variable or outcome variable.

Score refers to a predicted value or class. "Scoring new data" means to use a model developed with

training data to predict output values in new data.
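The idea of fitting a model to training data and then scoring new data can be sketched as follows. The example swaps in a one-predictor least-squares line as the model, purely because it is short; the function names `fit_line` and `score` and all of the numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def score(model, xs):
    """Apply a fitted model to records whose response is unknown."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


# Training records: predictor X with known response Y (made-up values
# that lie exactly on the line y = 1 + 2x).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)

# New records: only the predictor is known, so the model supplies the scores.
new_x = [5.0, 6.0]
predictions = score(model, new_x)
print(predictions)
```

The fitting step uses only the training records; the scoring step never sees a response value, which is what distinguishes it from fitting.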

Success class is the class of interest in a binary outcome (e.g., "purchasers" in the outcome

"purchase/no-purchase").

Supervised learning refers to the process of providing an algorithm (logistic regression, regression

tree, etc.) with records in which an output variable of interest is known and the algorithm

"learns" how to predict this value with new records where the output is unknown.

Test data (or test set) refers to that portion of the data used only at the end of the model

building and selection process to assess how well the final model might perform on additional

data.

Training data (or training set) refers to that portion of data used to fit a model.

Unsupervised learning refers to analysis in which one attempts to learn something about the data

other than predicting an output value of interest (whether it falls into clusters, for example).

Validation data (or validation set) refers to that portion of the data used to assess how well

the model fits, to adjust some models, and to select the best model from among those that

have been tried.
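One common way to obtain training, validation, and test sets is to partition the records at random. The sketch below assumes a 60/20/20 split and a fixed random seed; both are illustrative choices, and `partition` is a hypothetical helper name.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=1):
    """Randomly split records into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                      # used to fit models
    valid = shuffled[n_train:n_train + n_valid]     # used to compare and tune models
    test = shuffled[n_train + n_valid:]             # held back until the very end
    return train, valid, test


records = list(range(100))   # stand-ins for 100 observations
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Each record lands in exactly one of the three sets, so performance measured on the validation or test set is not inflated by records the model was fitted to.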

Variable is any measurement on the records, including both the input (X) variables and the output

(Y) variable.
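To tie the terms together, here is a minimal sketch of a dataset in which each row is an observation and each column is a variable, separated into predictors (X) and the response (Y). The column names and values are hypothetical.

```python
# Each row is one observation (record); each column is one variable.
columns = ["age", "income", "purchased"]
rows = [
    [34, 52000, "yes"],
    [45, 61000, "no"],
    [23, 38000, "no"],
]

response_name = "purchased"            # the output (Y) variable
y_index = columns.index(response_name)

# Predictors (X): every column except the response.
predictor_names = [c for c in columns if c != response_name]
X = [[v for i, v in enumerate(row) if i != y_index] for row in rows]
y = [row[y_index] for row in rows]

print(predictor_names)
print(X)
print(y)
```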