Iris Dataset: This dataset is often referred to as Fisher’s Iris dataset because it was used in R. A. Fisher’s 1936 paper “The Use of Multiple Measurements in Taxonomic Problems”. The data were originally collected by Edgar Anderson in 1935. The dataset consists of 50 samples from each of three species of Iris flowers (I. setosa, I. virginica and I. versicolor). Four features were measured from each sample: sepal length, sepal width, petal length and petal width. The setosa samples are linearly separable from the other two species. The virginica and versicolor samples are not linearly separable from each other and provide a good example of data noise.
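As a rough illustration of what linear separability means for setosa, the sketch below checks whether a single cut on petal length splits setosa from the other two species. The nine rows are the first three of each species as they appear in commonly distributed copies of the dataset (worth verifying against your own copy), and the 2.5 cm threshold and function names are illustrative assumptions, not part of the original applications.

```python
# Columns: sepal length, sepal width, petal length, petal width (cm), species.
# A small sample only; the full dataset has 50 rows per species.
SAMPLE_ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]

PETAL_LENGTH = 2      # column index of petal length
THRESHOLD_CM = 2.5    # assumed cut point, chosen for illustration


def is_setosa_by_threshold(row, threshold=THRESHOLD_CM):
    """Predict setosa when petal length falls below the threshold."""
    return row[PETAL_LENGTH] < threshold


def threshold_separates_setosa(rows, threshold=THRESHOLD_CM):
    """True if the one-feature cut labels every row correctly as setosa or not."""
    return all(
        is_setosa_by_threshold(row, threshold) == (row[-1] == "setosa")
        for row in rows
    )


print(threshold_separates_setosa(SAMPLE_ROWS))
```

A single threshold on one feature is the simplest kind of linear boundary. No comparable single cut is expected to split virginica from versicolor cleanly on the full dataset, which is the noise the description refers to.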
Wine Dataset: This dataset consists of 153 instances and is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. These constituents are: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines and Proline. Class 1 and Class 2 wines are not separable, and neither are Class 2 and Class 3 wines, so this dataset poses a greater data noise challenge than the Iris dataset.
Diabetes Dataset: This dataset is based on a population of women of Pima Indian heritage who were at least 21 years old and living near Phoenix in 1990. Each instance represents an individual patient and her medical attributes, along with her diabetes classification. The dataset consists of 768 instances. Each instance contains 8 attributes: Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function and Age. This is a significantly noisy dataset with many missing values (only 532 instances are complete).
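For readers who want to separate complete instances from incomplete ones, the following is a minimal sketch. It assumes that a missing measurement is stored as zero, which is a common convention for this dataset but should be checked against your copy. The rows are made up for illustration, and the choice of which columns to check is also an assumption; the number of complete instances you get depends on that choice.

```python
# Columns in which a zero is treated as "missing" (an assumption; a zero in
# Pregnancies is a legitimate value, so it is deliberately not listed).
MISSING_IF_ZERO = ["PG Concentration", "Diastolic BP", "Tri Fold Thick", "BMI"]

# Made-up rows using the attribute names from the description above.
TOY_ROWS = [
    {"Pregnancies": 2, "PG Concentration": 120, "Diastolic BP": 70,
     "Tri Fold Thick": 30, "Serum Ins": 100, "BMI": 28.5,
     "DP Function": 0.35, "Age": 33},
    {"Pregnancies": 1, "PG Concentration": 95, "Diastolic BP": 0,
     "Tri Fold Thick": 25, "Serum Ins": 80, "BMI": 31.0,
     "DP Function": 0.20, "Age": 27},
    {"Pregnancies": 0, "PG Concentration": 140, "Diastolic BP": 82,
     "Tri Fold Thick": 35, "Serum Ins": 150, "BMI": 36.2,
     "DP Function": 0.60, "Age": 45},
    {"Pregnancies": 4, "PG Concentration": 110, "Diastolic BP": 76,
     "Tri Fold Thick": 0, "Serum Ins": 0, "BMI": 0.0,
     "DP Function": 0.15, "Age": 52},
]


def is_complete(row, columns=MISSING_IF_ZERO):
    """A row is complete when none of the checked columns holds a zero."""
    return all(row[name] != 0 for name in columns)


def complete_rows(rows, columns=MISSING_IF_ZERO):
    """Keep only the rows that pass the completeness check."""
    return [row for row in rows if is_complete(row, columns)]


kept = complete_rows(TOY_ROWS)
print(f"{len(kept)} of {len(TOY_ROWS)} toy rows are complete")
```

Passing a different `columns` list (for example, one that also includes Serum Ins) changes which rows count as complete, so it is worth stating that choice whenever a complete-instance count is reported.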
Here are links to Excel versions of these datasets, if you want to explore them independently of the applications: Iris, Wine, Diabetes.