Canary Home


Canary - A Software Suite for Data Mining Education

Early coal mining practices utilized canaries to help miners monitor the condition of their mine. The Canary Software Suite helps data mining students experiment with data mining algorithms.

Canary is written in standard C++ and uses a graphical user interface developed with Qt. Each algorithm is packaged separately and can be used as a stand-alone application. Each algorithm package contains two folders: one contains an MS Windows executable and sample data files; the other contains all of the source files needed to modify the application or recompile it for other operating systems. It is advised that you create a separate folder to download and unzip the Canary packages into.

To run the Windows executable, you will need to download and unzip the Qt DLLs and copy them into your C:\Windows\System32 folder. No programming experience is needed for this. You can also experiment with modifying and recompiling the application source code. To do this, you will need the Qt libraries and a compliant C++ compiler. Qt can be obtained from Trolltech (the Open Source version of the Qt library is free under GPLv2).

This package is released under the GNU General Public License (GPL), version 2 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.html). You are free to use and modify this application subject to the conditions of the GPLv2.

Please send any suggestions you may have for the improvement of this application to jalesh@webster.edu.

Thank you.

© 2007 John Aleshunas

Canary File Directory

The Qt DLL Library

The K-Nearest Neighbors Algorithm

The K-Means Clustering Algorithm

The C4.5 Decision Tree Algorithm

Descriptions of the Sample Datasets Packaged with the Canary Applications

Other References

The Qt DLL Library

Qt is an open source C++ class library and set of tools for cross-platform development and internationalization, created by Trolltech. The GUI for Canary was developed using Qt and requires the Qt DLLs to run correctly.

You can download a zip file containing the Qt DLLs HERE (7.013 MB).

Download and unzip the desired Canary application first [see the Canary applications below]. The unzipped application folder contains two subfolders named Executable and Source. Download and unzip the Qt DLLs into the Executable folder. That's it! You can now experiment with your Canary application.

If you acquire Qt (commercial or open source version) to modify or recompile the applications, you will already have the DLLs on your machine and will not need this additional download. The applications will run correctly on your PC without this additional step.


K-Nearest Neighbors Algorithm

The K-Nearest Neighbors algorithm is a naive classifier. It classifies a test instance by calculating its distance to each member of a classification data set and associating the test instance with the K classification instances nearest to it. This implementation uses the Euclidean distance (square root of the sum of the squared attribute differences) to measure the distance between the test instance and a classification instance. With noisy data, the K returned classification instances can come from several classes. The test instance is usually assigned the class that occurs most frequently among them.
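The steps above can be sketched in a few lines of standard C++. This is a minimal illustration and not the Canary source code: the Instance struct and the function names euclideanDistance and classifyKnn are hypothetical, and the tie-breaking rule is an arbitrary choice made for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One labeled member of the classification data set (hypothetical type).
struct Instance {
    std::vector<double> attributes;
    std::string label;
};

// Euclidean distance: square root of the sum of the squared attribute differences.
double euclideanDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Assign the test instance the most frequent class among its k nearest
// classification instances.
std::string classifyKnn(const std::vector<Instance>& data,
                        const std::vector<double>& test,
                        std::size_t k) {
    assert(!data.empty() && k > 0);

    // Distance from the test instance to every classification instance.
    std::vector<std::pair<double, std::string>> scored;
    scored.reserve(data.size());
    for (const Instance& inst : data) {
        scored.emplace_back(euclideanDistance(inst.attributes, test), inst.label);
    }

    // Move the k smallest distances to the front of the vector.
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());

    // Count the class labels among the k nearest instances.
    std::map<std::string, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i) {
        ++votes[scored[i].second];
    }

    // Pick the most frequent label. On a tie, the alphabetically first label
    // wins (an arbitrary choice for this sketch).
    std::string best;
    std::size_t bestCount = 0;
    for (const auto& vote : votes) {
        if (vote.second > bestCount) {
            best = vote.first;
            bestCount = vote.second;
        }
    }
    return best;
}
```

Calling classifyKnn(data, test, 3) returns the majority label among the three nearest instances. Every call measures the distance to every classification instance, which is why the method becomes expensive for very large classification sets.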

The algorithm assumes that the classification data set is representative of the range of attribute values for the known classes. Because this implementation uses the Euclidean distance metric, it can only evaluate numeric data. It also will not work properly with missing values. The method becomes computationally intensive when the classification set is very large.

HERE is a zip file containing a folder with the Windows executable and sample data files, and another folder containing all of the source code files. You can learn how to use the application by referring to the Help_file.html file externally or by accessing Help within the application itself. It is advised that you create a separate folder to download and unzip this package into.


K-Means Clustering Algorithm

The K-Means algorithm is a simple algorithm that clusters a set of data instances. It calculates each data instance's distance from the mean of each cluster in a set of K data clusters and assigns the data instance to the cluster whose mean is closest. This implementation uses the Euclidean distance (square root of the sum of the squared attribute differences) to measure the distance between a given data instance and a cluster mean. The cluster means are initially assigned using the attribute values of the first K data instances. At the end of each clustering iteration, the algorithm recalculates the mean of each cluster based on the attribute values of the data instances assigned to that cluster. The algorithm stops when the difference between the new and the old cluster means is less than a predetermined stopping threshold value.
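The loop described above can be sketched in standard C++ as follows. This is an illustration under stated assumptions, not the Canary source code: the names Point, KMeansResult, and kMeans are hypothetical, the stopping test uses the largest Euclidean shift of any cluster mean, and the maxIterations cap and the handling of empty clusters are additions made for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Euclidean distance: square root of the sum of the squared attribute differences.
double euclideanDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

struct KMeansResult {
    std::vector<Point> means;             // one mean per cluster
    std::vector<std::size_t> assignment;  // cluster index for each data instance
};

KMeansResult kMeans(const std::vector<Point>& data, std::size_t k,
                    double threshold, std::size_t maxIterations = 100) {
    assert(k > 0 && k <= data.size());
    const std::size_t dims = data[0].size();

    KMeansResult result;
    // Initial means: the attribute values of the first k data instances.
    result.means.assign(data.begin(), data.begin() + static_cast<std::ptrdiff_t>(k));
    result.assignment.assign(data.size(), 0);

    for (std::size_t iter = 0; iter < maxIterations; ++iter) {
        // Assign each instance to the cluster whose mean is closest.
        for (std::size_t i = 0; i < data.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = euclideanDistance(data[i], result.means[c]);
                if (d < best) {
                    best = d;
                    result.assignment[i] = c;
                }
            }
        }

        // Recalculate each mean from the instances assigned to that cluster.
        std::vector<Point> sums(k, Point(dims, 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            const std::size_t c = result.assignment[i];
            ++counts[c];
            for (std::size_t j = 0; j < dims; ++j) sums[c][j] += data[i][j];
        }

        double maxShift = 0.0;
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // assumption: an empty cluster keeps its old mean
            for (std::size_t j = 0; j < dims; ++j) {
                sums[c][j] /= static_cast<double>(counts[c]);
            }
            maxShift = std::max(maxShift, euclideanDistance(sums[c], result.means[c]));
            result.means[c] = sums[c];
        }

        // Stop once no mean moved by at least the threshold.
        if (maxShift < threshold) break;
    }
    return result;
}
```

Because the initial means come from the first K instances, the order of the input data affects the result in this sketch. Shuffling the data before calling kMeans is one way to try a different starting point.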

The K-Means algorithm assumes that the data instances form a vector space with natural clusters having Gaussian distributions. The algorithm can tolerate some noise in the data but its results degrade when this assumption is violated. This algorithm converges extremely quickly in practice. The quality of the clusters produced is sensitive to the choice of the value K. Normally, the algorithm is re-run using several different values for K and the results are compared to determine which output data set created the best set of clusters.

HERE is a zip file containing a folder with the Windows executable and sample data files, and another folder containing all of the source code files. You can learn how to use the application by referring to the Help_file.html file externally or by accessing Help within the application itself. It is advised that you create a separate folder to download and unzip this package into.


C4.5 Decision Tree Algorithm

The C4.5 algorithm, developed by Ross Quinlan, generates a decision tree from a set of training data. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of Information Entropy. C4.5 uses the fact that each attribute of the data can be used to make a decision that splits the data into smaller subsets. C4.5 examines the normalized Information Gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is the one used to make the decision. The algorithm then recurses on the smaller subsets.
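The splitting criterion can be illustrated with a short C++ sketch. It is not taken from Quinlan's C4.5 code or from the Canary source: the function names entropy, countsOf, and gainRatio are hypothetical, and the sketch only handles a single categorical attribute with no missing values. It computes the information gain of a split and normalizes it by the entropy of the subset sizes (the split information), which is the gain ratio used by C4.5.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Shannon entropy, in bits, of a distribution given as a list of counts.
double entropy(const std::vector<std::size_t>& counts) {
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    if (total == 0) return 0.0;

    double h = 0.0;
    for (std::size_t c : counts) {
        if (c == 0) continue;  // a zero count contributes nothing
        const double p = static_cast<double>(c) / static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}

// Extract the counts from a label-to-count map.
std::vector<std::size_t> countsOf(const std::map<std::string, std::size_t>& m) {
    std::vector<std::size_t> counts;
    for (const auto& entry : m) counts.push_back(entry.second);
    return counts;
}

// Normalized information gain (gain ratio) of splitting on one categorical
// attribute. values[i] is the attribute value of instance i and labels[i]
// is its class.
double gainRatio(const std::vector<std::string>& values,
                 const std::vector<std::string>& labels) {
    assert(values.size() == labels.size());
    if (values.empty()) return 0.0;
    const double n = static_cast<double>(values.size());

    std::map<std::string, std::size_t> classCounts;
    std::map<std::string, std::map<std::string, std::size_t>> perValue;
    for (std::size_t i = 0; i < values.size(); ++i) {
        ++classCounts[labels[i]];
        ++perValue[values[i]][labels[i]];
    }

    // Entropy before the split minus the weighted entropy of the subsets.
    const double before = entropy(countsOf(classCounts));
    double after = 0.0;
    std::vector<std::size_t> subsetSizes;
    for (const auto& subset : perValue) {
        const std::vector<std::size_t> counts = countsOf(subset.second);
        std::size_t size = 0;
        for (std::size_t c : counts) size += c;
        subsetSizes.push_back(size);
        after += (static_cast<double>(size) / n) * entropy(counts);
    }
    const double gain = before - after;

    // Split information: the entropy of the subset sizes themselves.
    const double splitInfo = entropy(subsetSizes);
    return splitInfo > 0.0 ? gain / splitInfo : 0.0;
}
```

To choose a split, a tree builder would call gainRatio once per candidate attribute and keep the attribute with the highest value, then repeat the process on each resulting subset.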

The C45_DTI application is a Qt graphical interface communicating with the C4.5 executable from Quinlan (which is written in C). I may eventually convert the C4.5 algorithm to standard C++, but I've produced this version as an interim option.

HERE is a zip file containing a folder with the Windows executable and sample data files, and another folder containing all of the source code files. You can learn how to use the application by referring to the Help_file.html file externally or by accessing Help within the application itself. It is advised that you create a separate folder to download and unzip this package into.


Descriptions of the Sample Datasets Packaged with the Canary Applications

Iris Dataset: This dataset is often referred to as Fisher’s Iris dataset because it was used in R. A. Fisher’s 1936 paper “The Use of Multiple Measurements in Taxonomic Problems”. It was originally collected by Edgar Anderson in 1935. The dataset consists of 50 samples from each of three species of Iris flowers (I. setosa, I. virginica and I. versicolor). Four features were measured from each sample: the sepal length, sepal width, petal length and petal width. The setosa samples are linearly separable from the other two species. The virginica and versicolor samples are not linearly separable from each other and provide a good example of data noise.

Wine Dataset: This dataset consists of 153 instances and is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. These constituents are: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines and Proline. Class 1 and Class 2 wines are not separable, and neither are Class 2 and Class 3 wines, so this dataset poses a greater data noise challenge than the Iris dataset.

Diabetes Dataset: This dataset is based on a population of women of Pima Indian heritage who were at least 21 years old and living near Phoenix in 1990. Each instance represents an individual patient and her medical attributes along with her diabetes classification. The dataset consists of 768 instances. Each instance contains 8 attributes: Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function and Age. This is a significantly noisy dataset with many missing values (only 532 instances are complete).

Here are links to Excel versions of these datasets, if you want to explore them independently of the applications: Iris, Wine, Diabetes.


Coming Soon

Self-Organizing Map (SOM)

Other References

Algorithm References

K-Nearest Neighbors: Wikipedia

K-Means: Wikipedia

C4.5 Decision Tree Induction: Wikipedia, Ross Quinlan

Self-Organizing Map (SOM): Wikipedia, Teuvo Kohonen

Development References

Qt: Trolltech

Dev C++ editor: Bloodshed Software

MinGW Compiler: MinGW
