UCI Machine Learning Repository Content Summary
Abalone Database
- Donated by Sam Waugh
- Predicting the age of abalone from physical measurements
- Documentation: On everything
- 4177 instances, 8 attributes (one nominal)
- No missing attribute values
- Ftp Access
Adult Database
- Donated by Ron Kohavi
- Predicting whether income exceeds $50K/yr based on census data
- Documentation: On everything
- 48842 instances, 14 attributes (6 continuous and 8 nominal)
- Missing attribute values
- Originally listed as the "Census Income" Database. It was renamed because it is cited as the "Adult" database
- Ftp Access
Annealing Database
- Documentation: On everything except database statistics
- Background information on this database: unknown
- Many missing attribute values
- Ftp Access
Anonymous Microsoft Web Data Database
- Title: Log of anonymous users of the site www.microsoft.com
- Donated by: Jack S. Breese, David Heckerman, Carl M. Kadie
- Number of Instances: Training: 32711 Testing: 5000
- Each instance represents an anonymous, randomly selected user of the web site.
- Number of Attributes: 294
- Ftp Access
Arrhythmia Database
- Documentation: On everything
- The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups.
- 16 classes
- 452 examples
- 279 attributes, 206 numeric
- Some missing attribute values
- Ftp Access
Artificial Characters Database
- Artificially generated using a first order theory (which
describes the structure of ten capital letters) and a random
choice theorem prover
- Domain Theory included
- Ftp Access
Audiology Databases
- Original Version
- From Baylor College
- Documentation: On everything except database statistics
- Non-standardized attributes (differs between instances)
- All attributes are nominally-valued
- Standard Attribute Version of the original
- A standard set of attributes has been defined in terms of the
original properties according to a well defined set of rules
described in the documentation files.
- 70 nominally-valued attributes
- Some missing attributes
- Ftp Access
Auto-Mpg Database
- Revised from CMU StatLib library
- data concerns city-cycle fuel consumption
- Continuously valued class attribute (mpg)
- 398 instances, 5 numeric attributes
- Ftp Access
Automobile Database
- From 1985 Ward's Automotive Yearbook
- Documentation: On everything except statistics and class distribution
- Good mix of numeric and nominal-valued attributes
- More than 1 attribute can be used as a class attribute in this database
- Ftp Access
Badges Database
- Donated by Haym Hirsh
- 294 instances, 2 classes
- Instances are described using a sequence of characters (a name)
- Badge problem generated for attendees to figure out at MLC94
- Ftp Access
Balance Scale Database
- Donated by Tim Hume
- 625 instances, 4 numeric attributes
- 3 classes (tip right, tip left, balanced)
- No missing values
- Ftp Access
Balloons Database
- Donated by Michael Pazzani
- Previously used in cognitive psychology experiment
- 16 instances, 2 classes, 4 attributes
- No missing values
- Ftp Access
Breast Cancer Database
- From Ljubljana Oncology Institute
- Documentation: On everything except database statistics
- Well-used database
- 286 instances, 2 classes, 9 attributes + the class attribute
Wisconsin Breast Cancer Databases
- Original database
- Donated by Olvi Mangasarian
- Located in breast-cancer-wisconsin sub-directory, filenames root: breast-cancer-wisconsin
- Currently contains 699 instances
- 2 classes (malignant and benign)
- 9 integer-valued attributes
- Ftp Access
- New prognostic database
- Donated 1/96 by Nick Street
- Located in breast-cancer-wisconsin sub-directory, filenames' root: wpbc
- Two possible learning problems: predicting class (recurrent, non-recurrent) or time to recur
- 33 numeric attributes
- Ftp Access
- New diagnostic database
- Donated 1/96 by Nick Street
- Located in breast-cancer-wisconsin sub-directory, filenames' root: wdbc
- Classification learning problem: predicting class (malignant, benign)
- 30 numeric attributes
- Ftp Access
Pittsburgh Bridges Database
- Donated by Yoram Reich
- Topic: design knowledge
- 108 instances, 13 attributes (7 specifications, 5 design description,
and 1 identifier)
- 2 versions of the data: original and numeric-discretized
- Ftp Access
Car Evaluation Database
- Donated by Marko Bohanec and Blaz Zupan (see also: Nursery Database)
- Car Evaluation Database was derived from a simple hierarchical
decision model originally developed for the demonstration of DEX
(M. Bohanec, V. Rajkovic: Expert system for decision
making. Sistemica 1(1), pp. 145-157, 1990.)
- Because of known underlying concept structure, this database may be
particularly useful for testing constructive induction and
structure discovery methods.
- Classification (4 classes)
- Documentation: On everything
- 1728 instances, 6 nominal ordered attributes
- No missing attribute values
- Ftp Access
Census Income Database
Chess Databases
- king-rook-vs-king-knight
- Documentation: limited (nothing on class distribution, statistics)
- This concerns king-knight versus king-rook end games
- The database creator is coded in Common Lisp
- king-rook-vs-king-pawn
- Documentation: sufficient
- This concerns king-rook versus king-pawn end games
- Originally described by Alen Shapiro
- king-rook-vs-king
- Donated by Michael Bain and Arthur van Hoff
- 28056 instances, 6 nominal features
- 17 classes to determine optimal depth-of-win
- Six Domain Theories
- Donated by Nick Flann
- In the "domain-theories" sub-directory
- Coded in a dialect of Prolog
- They all generate legal moves of chess
- I haven't yet touched Nick's documentation on them (See README)
- Ftp Access
Bach Chorales (time-series) Database
- Donated by Darrell Conklin
- Single-line melodies of 100 Bach chorales (originally 4 voices)
- Number of Instances: 100 Chorales, each with ~45 events
- Number of Attributes: 6 (nominal) per event
- Ftp Access
Connect-4 Opening Database
- Donated/Created by John Tromp
- Contains all legal 8-ply positions in the game of connect-4 in
which neither player has won yet, and in which the next move
is not forced
- 67557 instances, 42 nominal attributes
- Ftp Access
Credit Screening Databases
- Japanese Credit Screening Database
- Includes domain theory
- Positive instances are people who were granted credit
- The theory was generated by talking to Japanese domain experts
- Credit Card Application Approval Database
- Good mix of attributes -- continuous, nominal with small numbers
of values, and nominal with larger numbers of values
- 690 instances, 15 attributes, some with missing values
- Ftp Access
Computer Hardware Database
- From CACM 4/87
- Described in terms of its cycle time, memory size, etc.
- Classified in terms of their relative performance capabilities
- Documentation: complete
- Contains integer-valued concept labels
- All attributes are integer-valued
- Ftp Access
Contraceptive Method Choice
- Origin: A subset of the 1987 National Indonesia
Contraceptive Prevalence Survey
- Donated by Tjen-Sien Lim (limt@stat.wisc.edu)
- 1473 instances, 3 classes, 10 attributes
- This dataset is a subset of the 1987 National Indonesia
Contraceptive Prevalence Survey. The samples are married women who
were either not pregnant or did not know if they were at the time of
interview. The problem is to predict the current contraceptive
method choice (no use, long-term methods, or short-term methods) of
a woman based on her demographic and socio-economic characteristics.
- Ftp Access
Covertype data
- Donated by Jock A. Blackard 8/28/98
- 581012 instances, 8 classes, 54 attributes
- Ftp Access
Cylinder Bands Database
- Donated by Bob Evans 8/95
- Used in decision tree induction for mitigating process delays known as "cylinder bands" in rotogravure printing
- 512 instances, 2 classes, 19 attributes
- Missing values
- Ftp Access
Dermatology Database
- Documentation: On everything
- The aim is to determine the type of Erythemato-Squamous Disease.
- 6 classes
- 366 examples
- 34 attributes, 1 nominal
- Some missing attribute values
- Ftp Access
Diabetes Data
- From AIM '94
- Non-Uniform Data format
- Time dependencies
- Ftp Access
The Second Data Generation Program - DGP/2
- Generates instances around peaks and allows for specification of the
mean and standard deviations in the normally distributed data
- Generates application domains based on specific parameters: number of
features, and proportion of positive to negative examples
- Allows for variations in the number of instances, the range of feature
values, the number of peaks, the percent of positive instances desired
and a radius around the peaks that these instances fall within
- Ftp Access
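The parameters listed for DGP/2 can be illustrated with a minimal sketch. This is not the DGP/2 program itself: the function name `generate_instances`, the rejection-sampling scheme, and the uniform sampling of negative instances are all assumptions made for illustration. Only the parameters (peaks, standard deviation, radius, proportion of positives, feature range) come from the description above.

```python
import math
import random

def generate_instances(n, peaks, std_dev, radius, pct_positive,
                       value_range=(0.0, 100.0), seed=0):
    """Return n (features, label) pairs; label 1 = positive, 0 = negative.

    Hypothetical scheme: positives are drawn from a normal distribution
    centred on a randomly chosen peak and kept only if they fall within
    `radius` of it; negatives are drawn uniformly from `value_range` and
    kept only if they fall outside `radius` of every peak.
    """
    rng = random.Random(seed)
    low, high = value_range
    n_features = len(peaks[0])
    data = []
    for _ in range(n):
        if rng.random() < pct_positive:
            peak = rng.choice(peaks)
            while True:  # rejection sampling around the chosen peak
                point = [rng.gauss(c, std_dev) for c in peak]
                if math.dist(point, peak) <= radius:
                    break
            data.append((point, 1))
        else:
            while True:  # rejection sampling away from all peaks
                point = [rng.uniform(low, high) for _ in range(n_features)]
                if all(math.dist(point, p) > radius for p in peaks):
                    break
            data.append((point, 0))
    return data

if __name__ == "__main__":
    sample = generate_instances(200, [(20.0, 20.0), (70.0, 60.0)],
                                std_dev=1.0, radius=3.0, pct_positive=0.4)
    print(sum(label for _, label in sample), "positive instances of", len(sample))
```

Rejection sampling keeps the sketch short; it assumes the radius is not much smaller than the standard deviation, otherwise the positive loop becomes slow.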
Document Understanding Database
- Donated by Donato Malerba
- Five concepts, expressed as predicates, to be learned
- Multiple predicate learning problem
- see .info file for more information
- Ftp Access
EBL Domain Theories and Examples
- cup
- deductive.assumable (contains three domain theories)
- emotion
- ice
- pople
- safe-to-stack
- suicide
- Ftp Access
Echocardiogram Database
- From Reed Institute, Miami
- Documentation: sufficient
- 13 numeric-valued attributes
- Binary classification: patient either alive or dead after survival period
- Ftp Access
Ecoli Database
- Donated by Paul Horton (see also: yeast database)
- Predicting the Cellular Localization Sites of Proteins
- Documentation: On everything
- 336 instances, 8 attributes (one nominal)
- No missing attribute values
- Ftp Access
Event Detection Database
- 2 datasets: Calit2 building people counts and Dodger traffic data
- Goal is to predict the occurrence of events based on counts
- Ground truth given in .events file
- 4 attributes (including counts and date/time stamp)
- Ftp Access
Flags Database
- From Collins Gem Guide to Flags, 1986
- 194 instances, mixed numeric- and nominal-valued attributes
- donated by Richard S. Forsyth, creator of PC/BEAGLE
- Ftp Access
Function Finding Databases
- Donated by Cullen Schafer
- 352 Studies in Function-Finding
- Collected mostly from investigations in physical science
- Intention: Evaluation of function-finding algorithms
- Ftp Access
Glass Identification Database
- From USA Forensic Science Service
- Documentation: completed
- 6 types of glass
- Defined in terms of their oxide content (i.e. Na, Fe, K, etc)
- All attributes are numeric-valued
- Ftp Access
Haberman's Survival Data
- Donor: Tjen-Sien Lim (limt@stat.wisc.edu)
- The dataset contains cases from a study that was conducted between
1958 and 1970 at the University of Chicago's Billings Hospital on
the survival of patients who had undergone surgery for breast
cancer.
- Ftp Access
Hayes-Roth Database
- Described in their 1977 paper
- Topic: human subjects study
- Ftp Access
Heart Disease Databases
- Documentation: extensive
- 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach
- 13 of the 75 attributes were used for prediction in 2 separate
tests, each of which achieved approximately 75%-80% classification
accuracy
- The chosen 13 attributes are all continuously valued
- Includes cost data (donated by Peter Turney)
- Ftp Access
Hepatitis Database
- From G.Gong: CMU
- Documentation: incomplete
- 155 instances with 20 attributes each; 2 classes
- Mostly Boolean or numeric-valued attribute types
- Includes cost data (donated by Peter Turney)
- Ftp Access
Horse Colic Database
- From Mary McLeish & Matt Cecile
- Well documented attributes
- 368 instances with 28 attributes (continuous, discrete, and nominal)
- 30% missing values
- Ftp Access
Housing Database (Boston)
- From CMU StatLib Library
- concerns housing prices in suburbs of Boston
- Continuously valued class attribute (MEDV)
- 506 instances, 12 continuous, 1 binary attributes
- Ftp Access
ICU Data
- From Serdar Uckun (AIM '94)
- Deals with ICU treatment of patients with Adult respiratory
distress syndrome (ARDS)
- Complex dataset (see documentation)
- Ftp Access
Image segmentation Database
- Donated by Carla Brodley
- Documentation status: Skimpy
- Not previously used in the ML literature as of 8/1991
- Image data described by high-level numeric-valued attributes, 7 classes
- Ftp Access
Internet Advertisements
- From Nicholas Kushmerick (nick@ucd.ie)
- This dataset represents a set of possible advertisements on
Internet pages. The features encode the geometry of the image (if
available) as well as phrases occurring in the URL, the image's URL and
alt text, the anchor text, and words occurring near the anchor text.
The task is to predict whether an image is an advertisement ("ad") or
not ("nonad").
- Number of Instances: 3279 (2821 nonads, 458 ads)
- Number of Attributes: 1558 (3 continuous; others binary)
- Ftp Access
Ionosphere Database
- From V. Sigillito
- Documentation Complete
- 2 classes, 351 instances, 34 numeric attributes, no missing values
- Classification of radar returns from the ionosphere
- Ftp Access
Iris Plant Database
- From Fisher, 1936
- Documentation: complete
- 3 classes, 4 numeric attributes, 150 instances
- 1 class is linearly separable from the other 2, but the other 2 are
not linearly separable from each other (simple database)
- Ftp Access
Isolet Spoken Letter Recognition Database
- From Ron Cole and Mark Fanty
- 6238 + 1559 instances, 26 classes (one for each letter)
- All attributes are real-valued scaled from -1.0 to 1.0.
- No missing values
- Ftp Access
Kinship Database
- From Hinton 1986 & Quinlan 1989
- Relational
- 24 individuals, 12 relations
- 104 instances derivable
- Case studies have been reported by both authors
- Ftp Access
Labor relations Database
- From Collective Bargaining Review
- Documentation: no statistics
- Please see the labor directory for more information
- Ftp Access
LED Display Domains
- From Classification and Regression Trees book
- Documentation: sufficient, but missing statistical information
- All attributes are Boolean-valued
- Two versions: 7 and 24 attributes
- Optimal Bayes rate known for the 10% probability of noise problem
- Several ML researchers have used this domain for testing noise tolerance
- We provide here 2 C programs for generating sample databases
- Ftp Access
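A minimal sketch of how a noisy 7-attribute LED sample could be generated is given below. It is not one of the two C programs mentioned above. The segment ordering (top, upper-right, lower-right, bottom, lower-left, upper-left, middle) and the names `SEGMENTS` and `led_sample` are assumptions, as is the reading of "10% noise" as each Boolean attribute being flipped independently with probability 0.1.

```python
import random

# Seven-segment patterns per digit. Assumed segment order:
# top, upper-right, lower-right, bottom, lower-left, upper-left, middle.
SEGMENTS = {
    0: (1, 1, 1, 1, 1, 1, 0),
    1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1),
    3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1),
    5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1),
    7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}

def led_sample(rng, noise=0.10):
    """Return (attributes, digit) with each attribute flipped with prob. `noise`."""
    digit = rng.randrange(10)
    attributes = [bit ^ int(rng.random() < noise) for bit in SEGMENTS[digit]]
    return attributes, digit

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(led_sample(rng))
```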
Lenses Database
- Donated by Benoit Julien
- Small database with few attributes
- attributes are either binary- or ternary-valued
- 3 classes: hard contact lenses, soft contact lenses, or neither
- Ftp Access
Letter Recognition Database
- From David Slate
- Based on various fonts
- 20,000 instances (712565 bytes) (.Z available)
- 17 attributes: 1 class (letter category) and 16 numeric (integer)
- No missing attribute values
- Ftp Access
Liver-disorders Database
- BUPA Medical Research Ltd. database donated by Richard S. Forsyth
- 7 numeric-valued attributes
- 345 instances (male patients)
- Includes cost data (donated by Peter Turney)
- Ftp Access
Logic-theorist
- Donated by Paul O'Rorke (described in Machine Learning)
- All code for LT
- Ftp Access
Lung Cancer Database
- Donated by Stefan Aeberhard
- 32 instances, 57 Attributes (2 classes)
- No Attribute Definitions
- Ftp Access
Lymphography Database
- From Ljubljana Oncology Institute
- Documentation: incomplete
- CITATION REQUIREMENT: Please use (see the documentation file)
- 148 instances; 19 attributes; 4 classes; no missing data values
MAGIC Gamma Telescope Database
- Data set comes from Major Atmospheric Gamma Imaging Cherenkov (MAGIC) Telescope project
- Data are MC generated to simulate registration of high energy gamma particles in an atmospheric
Cherenkov telescope
- 11 attributes, 19020 instances
- Ftp Access
Mammographic Mass Data
- Donated by M. Elter.
- Data set can be used to predict the severity (benign or malignant) of a mammographic mass
lesion from BI-RADS attributes and the patient's age.
- 6 attributes, 961 instances
- Ftp Access
Mechanical Analysis Data
- Donated by members of the Universita di Torino
- Fault diagnosis problem of electromechanical devices
- ENIGMA system application described in proceedings of MLC-1990
- Each of the 209 instances is described by a different set of
components
- PUMPS DATA SET
- Newer version of above dataset with domain theory and results
- Ftp Access
Meta-data Database
- Donated by J.Gama
- Meta-Data was used in order to give advice about which
classification method is appropriate for a particular dataset
(taken from the results of the Statlog project).
- 528 instances; 22 attributes; numeric prediction; missing values
- Ftp Access
Mobile Robots Database
- Donated by Volker Klingspor, Katharina J. Morik and Anke D. Rieger
- Learning Concepts from Sensor Data of a Mobile Robot
- Multiple levels of learning (from raw sensor data to high level concepts)
- Ftp Access
Molecular Biology Databases
- Promoter Gene Sequences Database
- Donated by Jude Shavlik; See AAAI-90 Towell, Shavlik, & Noordewier
- E. Coli promoter gene sequences (DNA) with partial domain theory
- 106 instances, each predictor attribute takes on one of four values
- 50% positive instances
- Splice-junction Gene Sequences Database
- Donated by Geoffrey Towell, Noordewier, & Shavlik
- categories "ei" and "ie" include every "split-gene"
for primates in Genbank 64.1
- non-splice examples taken from sequences known not to include
a splicing site
- 3190 instances with classes "ei" (25%), "ie" (25%) and
Neither (50%)
- Domain theory included
- Protein Secondary Structure Database
- Originally created and used by Qian and Sejnowski
- From CMU connectionist bench repository
- Classifies secondary structure of certain globular proteins
- 3 classes: alpha-helix, beta-sheet and random-coil
- Protein Secondary Structure Domain Theory
- Donated and created by Jude Shavlik & Rich Maclin
- Imperfect domain theory for Qian and Sejnowski Protein
Secondary Structure database (above)
- Closely implements the algorithm of Chou and Fasman
- Ftp Access
MONK's Problems
- Donated by Sebastian Thrun
- A set of three artificial domains over the same attribute space
- 6 nominally valued attributes, no missing values
- 1 problem has class noise added
- Used to test a wide range of induction algorithms
- Ftp Access
Moral Reasoner Database
- Donated by James Wogulis
- Horn-clause model that qualitatively simulates moral reasoning
- 202 instances and theory
- Theory includes negated literals
- Ftp Access
Multiple Features Database
- From Robert P.W. Duin
- This dataset consists of features of handwritten numerals (`0'--`9')
extracted from a collection of Dutch utility maps.
- 200 patterns per class (for a total of 2,000 patterns) have been digitized in binary images.
- Digits are represented in terms of Fourier coefficients, profile correlations, Karhunen-Loève coefficients, pixel averages, Zernike moments and morphological features.
- Number of Instances: 2000 (200 per class)
- Number of Attributes: 649
- Number of Classes: 10
- Ftp Access
Mushrooms Database
- From Audubon Society Field Guide
- Documentation: complete, but missing statistical information
- Described in terms of physical characteristics
- Classification: poisonous or edible
- All attributes are nominal-valued
- Large database: 8124 instances (2480 missing values for attribute #12)
- Ftp Access
MUSK Databases
- Donated by Tom Dietterich
- Task: classify whether a molecule is a musk
- Two datasets: 476 and 6,598 instances, 168 attributes
- Was used to explore "multiple instance problem"
- Ftp Access
Nursery Database
- Donated by Marko Bohanec and Blaz Zupan (see also: Car Evaluation Database)
- Nursery Database was derived from a hierarchical decision model
originally developed to rank applications for nursery schools.
- Classification (5 classes)
- Because of known underlying concept structure, this database may be
particularly useful for testing constructive induction and
structure discovery methods.
- Documentation: On everything
- 12960 instances, 8 nominal attributes
- No missing attribute values
- Ftp Access
Othello Domain Theory
- Written and donated by Tom Fawcett
- Coded in Prolog
- Used in research to generate features for an inductive learning system
- Ftp Access
Page Blocks Classification Database
- Written and donated by Donato Malerba
- The problem consists of classifying all the blocks of the page
layout of a document that has been detected by a segmentation
process. This is an essential step in document analysis.
- 5473 examples come from 54 distinct documents
- All attributes are numeric
- Ftp Access
Pima Indians Diabetes Database
- From National Institute of Diabetes and Digestive and Kidney Diseases
- Binary classes (tested positive or negative for diabetes)
- All 8 attributes are numeric-valued
- 768 instances
- Includes cost data (donated by Peter Turney)
- Ftp Access
Optical Recognition of Handwritten Digits
- From E. Alpaydin, C. Kaynak
- 10 classes
- 3823 training, 1797 test cases
- 64 attributes (All input attributes are integers 0..16)
- Ftp Access
Pen-Based Recognition of Handwritten Digits
- From E. Alpaydin, Fevzi Alimoglu
- 10 classes
- 7494 training cases, 3498 test cases
- 16 attributes (All input attributes are integers 0..100)
- Ftp Access
UJI Pen Characters
- From group at Universitat Jaume I
- 1364 instances, taken from 11 different writers
- 35 classes (26 letters + 9 non-zero digits)
- Ftp Access
Postoperative Patient Database
- From Jerzy W. Grzymala-Busse
- 3 classes
- 90 instances
- 8 attributes, one numeric with missing values
- Ftp Access
Poker Hand Database
- From Robert Cattral, Franz Oppacher
- 11 attributes, 25010 training instances, 1,000,000 testing instances
- Each record is an example of a hand consisting of five playing cards drawn from a standard deck of 52.
- Each card is described using two attributes (suit and rank).
- Predictive attribute is the poker hand.
- Ftp Access
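The record layout described above (five cards, each with a suit and a rank, followed by the hand class, for 11 attributes in total) can be parsed with a short sketch like the one below. The comma-separated integer format, the suit-before-rank ordering within each card, the function name `parse_hand`, and the sample record are assumptions for illustration; check the dataset documentation for the actual encoding.

```python
def parse_hand(line):
    """Split one record into five (suit, rank) pairs and a class label.

    Assumes 11 comma-separated integers: suit and rank for each of the
    five cards, followed by the poker-hand class.
    """
    values = [int(v) for v in line.strip().split(",")]
    if len(values) != 11:
        raise ValueError(f"expected 11 attributes, got {len(values)}")
    cards = [(values[i], values[i + 1]) for i in range(0, 10, 2)]
    return cards, values[10]

if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    cards, hand_class = parse_hand("1,10,1,11,1,13,1,12,1,1,9")
    print(cards, hand_class)
```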
Primary Tumor Database
- From Ljubljana Oncology Institute
- Documentation: incomplete
- CITATION REQUIREMENT: Please use (see the documentation file)
- 339 instances; 18 attributes; 22 classes; lots of missing data values
Qualitative Structure Activity Relationships (QSARs)
- Donated by Ross King
- Two datasets are given: pyrimidines and triazines
- 3 representations: ILP, Propositional Machine Learning Discrimination,
and Propositional Machine Learning Regression
- Ftp Access
Quadruped Animals Data Generator
- Donated by John H. Gennari
- Structured data; each instance has 9 components, with 9 numeric-valued
attributes per component
- 4 classes
- Previously used to evaluate unsupervised learning algorithms
- Ftp Access
Servo Database
- Donated by Ross Quinlan
- numerically valued class attribute
- 4 nominal attributes; 167 instances
- covers an extremely non-linear phenomenon
- Ftp Access
Shuttle Landing Control Database
- Tiny, 15-instance database with 7 attributes per instance; 2 classes
- Instances have don't care values for some features (database may be
expanded to 277 instances)
- Ftp Access
Solar Flare Databases
- From Gary Bradshaw
- 1389 instances, 13 attributes (includes 3 class attributes)
- Each class attribute counts the number of solar flares of a
certain class that occur in a 24 hour period
- Prediction attributes are nominal; no missing values
- Ftp Access
Soybean Databases
- Donated by Michalski
- Documentation: Only the statistics are missing
- (2 sizes)
- Michalski's famous soybean disease databases
- Ftp Access
Challenger USA Space Shuttle O-Ring Databases
- Donated by David Draper
- 2 small 23-instance databases containing only positive integers
- Fascinating topic: Analysis of launch temperature vs. O-ring stress
- Task: predict the number of O-rings that experience thermal distress
on a flight at 31 degrees F given data on the previous 23 shuttle
flights
- Ftp Access
Low Resolution Spectrometer Database
- From IRAS data -- NASA Ames Research Center
- Documentation: no statistics nor class distribution given
- LARGE database...and this is only 531 of the instances
- 98 attributes per instance (all numeric)
- Contact NASA-Ames Research Center for more information
- Ftp Access
Spambase Database
- Donated by George Forman (gforman at nospam hpl.hp.com, 650-857-7835), Mark Hopkins, Erik Reeber and Jaap Suermondt.
- Number of Instances: 4601 (1813 Spam = 39.4%)
- Number of Attributes: 58 (57 continuous, 1 nominal class label)
- The "spam" concept is diverse: advertisements for products/web
sites, make money fast schemes, chain letters, pornography...
Our collection of spam e-mails came from our postmaster and
individuals who had filed spam. Our collection of non-spam
e-mails came from filed work and personal e-mails, and hence
the word 'george' and the area code '650' are indicators of
non-spam. These are useful when constructing a personalized
spam filter. One would either have to blind such non-spam
indicators or get a very wide collection of non-spam to
generate a general purpose spam filter.
- Ftp Access
SPECT and SPECTF heart databases
- Donated by Krzysztof J. Cios & Lukasz A. Kurgan (Krys.Cios@cudenver.edu)
- Documentation: Describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal or abnormal.
- 267 image sets (patients) in each dataset
- 23 attributes per instance (22 binary, 1 binary class) in SPECT
- 44 attributes per instance (43 binary, 1 binary class) in SPECTF
- Ftp Access
Sponge Database
- Donated by Javier Bejar and Ulises Cortes
- Classification of Atlantic-Mediterranean marine sponges
- 76 instances
- 45 nominal and numeric attributes (some missing values)
- Ftp Access
Statlog Project Databases
- Donated by Ross King
- Vehicle Silhouettes: 3D objects within a 2D image by
application of an ensemble of shape feature extractors
to the 2D silhouettes of the objects.
- Landsat Satellite: multi-spectral values of pixels in
3x3 neighbourhoods in a satellite image, and the
classification associated with the central pixel in each
neighbourhood
- Shuttle: The shuttle dataset contains 9 attributes all of
which are numerical. Approximately 80% of the data belongs
to class 1
- Australian Credit Approval: This file concerns credit card
applications. This database exists elsewhere in the repository
(Credit Screening Database) in a slightly different form
- Heart Disease: This dataset is a heart disease database similar
to a database already present in the repository (Heart Disease
databases) but in a slightly different form
- Image Segmentation: This dataset is an image segmentation
database similar to a database already present in the repository
(Image segmentation database) but in a slightly different form.
- German Credit Database: This dataset classifies people described
by a set of attributes as good or bad credit risks. Comes in
two formats (one all numeric). Also comes with a cost matrix
- Ftp Access
Student Loan Relational Database
- Donated by Michael Pazzani
- Target concept: no_payment_due by person for student loan
- 1000 instances of target concept
- Includes domain theory
- 10+ extensionally and intensionally defined relations
- Ftp Access
Teaching Assistant Evaluation
- Collected by Wei-Yin Loh (Department of Statistics, UW-Madison)
- Donated by Tjen-Sien Lim (limt@stat.wisc.edu)
- 151 instances, 6 attributes, 3 classes
- The data consist of evaluations of teaching performance over three
regular semesters and two summer semesters of 151 teaching assistant
(TA) assignments at the Statistics Department of the University of
Wisconsin-Madison. The scores were divided into 3 roughly equal-sized
categories ("low", "medium", and "high") to form the class variable.
- Ftp Access
Tic-Tac-Toe Endgame Database
- Donated by David W. Aha, Turing Institute
- Documentation complete as of Summer 1991
- 958 instances, all attributes can take on 1 of 3 possible values
- Binary classification task (i.e., "win for x")
- A paradigmatic domain for constructive induction studies
- Ftp Access
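The figure of 958 instances matches the number of distinct final board configurations when x moves first, which can be checked with a small enumeration. This is a sketch under assumptions: the board encoding ("x", "o", "b" for blank, read row by row) and the function names are illustrative, not taken from the database files.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'x' or 'o' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None

def terminal_boards():
    """Enumerate every distinct end-of-game board, with x moving first."""
    seen, terminals = set(), set()
    stack = [("b" * 9, "x")]
    while stack:
        board, player = stack.pop()
        if board in seen:
            continue
        seen.add(board)
        if winner(board) or "b" not in board:
            terminals.add(board)  # game over: a win or a full board
            continue
        nxt = "o" if player == "x" else "x"
        for i in range(9):
            if board[i] == "b":
                stack.append((board[:i] + player + board[i + 1:], nxt))
    return terminals

if __name__ == "__main__":
    boards = terminal_boards()
    wins_for_x = sum(1 for b in boards if winner(b) == "x")
    print(len(boards), "final boards,", wins_for_x, "of them wins for x")
```

Each final board would then be labelled positive when `winner(board) == "x"`, which corresponds to the binary "win for x" task.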
Thyroid Disease Database
- From Garavan Institute
- Documentation: as given by Ross Quinlan
- 6 databases from the Garavan Institute in Sydney, Australia
- Approximately the following for each database:
- 2800 training (data) instances and 972 test instances
- Plenty of missing data
- 29 or so attributes, either Boolean or continuously-valued
- 2 additional databases, also from Ross Quinlan, are also here
- Hypothyroid.data and sick-euthyroid.data
- Quinlan believes that these databases have been corrupted
- Their format is highly similar to the other databases
- 1 more database of 9172 instances that cover 20 classes, and
a related domain theory
- Another thyroid database from Stefan Aeberhard
- 3 classes, 215 instances, 5 attributes
- No missing values
- A Thyroid database suited for training ANNs
- 3 classes
- 3772 training instances, 3428 testing instances
- Includes cost data (donated by Peter Turney)
- Ftp Access
Trains Database
- Donated by David Aha & Eric Bloedorn
- Original owners: R. Michalski & R. Stepp
- 10 instances
- 10 attributes + class (direction: east or west)
- 2 data formats (structured, one-instance-per-line)
- Includes "East-West" competition data and results (donated by Peter Turney)
- Ftp Access
University Database
- Donated by Steve Souders
- Documentation: scant; we've left it in its original (LISP-readable) form
- 285 instances, including some duplicates
- At least one attribute, academic-emphasis, can have multiple values
per instance
- The user is encouraged to pursue the Lebowitz reference for more
information on the database
- Ftp Access
Congressional Voting Records Database
- 1984 United States Congressional Voting Records
- Classification: Republican or Democrat
- Documentation: completed
- All attributes are Boolean valued; plenty of missing values; 2 classes
- Ftp Access
Water Treatment Plant Database
- Donated by Javier Bejar and Ulises Cortes
- 38 numeric attributes; 527 instances; missing values
- Multiple classes predict plant state
- Ill-Structured Domain
- Ftp Access
Waveform Data Generator
- From Classification and Regression Trees book
- Documentation: no statistics
- CART book's waveform domains
- 21 and 40 continuous attributes respectively
- difficult concepts to learn, but known Bayes optimal classification
rate of 86% accuracy
- Ftp Access
Wine Recognition Database
- Donated by Stefan Aeberhard
- Uses chemical analysis to determine the origin of wines
- 13 attributes (all continuous), 3 classes, no missing values
- 178 instances
- Ftp Access
Yeast Database
- Donated by Paul Horton (see also: Ecoli database)
- Predicting the Cellular Localization Sites of Proteins
- Documentation: On everything
- 1484 instances, 8 attributes (one nominal)
- No missing attribute values
- Ftp Access
Zoo Database
- From Richard Forsyth
- Artificial
- 7 classes of animals
- 17 attributes (besides name), 15 Boolean and 2 numeric-valued
- No missing attribute values
- Ftp Access
Undocumented Databases
- Mike Pazzani's economic sanctions database
- Philippe Collard's database on cloud cover images
- Vince Sigillito's database on DNA secondary structure
- Nettalk data (see connectionist-bench)
- Sonar data (see connectionist-bench)
- Vowel data (see connectionist-bench)
- Ftp Access