README.txt for directory
ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location/

This directory contains the Information Extraction Protein Localization
dataset used by Gleaner in the paper

    Learning Ensembles of First-Order Clauses for Recall-Precision Curves:
    A Case Study in Biomedical Information Extraction
    Mark Goadrich, Louis Oliphant and Jude Shavlik
    Fourteenth International Conference on Inductive Logic Programming (ILP 2004)
    Porto, Portugal, September 6-8 2004

This data was originally described in

    Representing Sentence Structure in Hidden Markov Models for
    Information Extraction
    Soumya Ray and Mark Craven
    Proceedings of the 17th International Joint Conference on
    Artificial Intelligence (IJCAI-2001)

and was relabeled in Fall 2003 by Soumya Ray, Mark Goadrich and Louis Oliphant.

In the archive IE_Protein-Location.tar.gz, you will find the files

    README.txt       (this file)
    predicates.html  (detailed descriptions of all background knowledge
                      predicates)
    aleph.yap        (version 5)
    loader.yap       (some helper predicates for loading the dataset)
    uw_utils.yap     (small Prolog utility predicates: set, concat,
                      reverse, etc.)

and the directories

    background_knowledge  (ground predicates for all our background
                           knowledge)
    datasets              (.b, .f and .n files for 5 folds of train,
                           tune, test)
    dotbfiles             (modes and determinations)
    ontologies            (more background knowledge on medical
                           ontologies, not directly used in learning)

To load these files and perform some experiments, we recommend using
Aleph Version 5 (included) and Yap version 4.5.3 (found online at
http://www.ncc.up.pt/~vsc/Yap/).

First, load the files aleph.yap and uw_utils.yap with the command

    ? [aleph, uw_utils].

When these have finished processing, you can use the Aleph predicate
read_all to load your choice of the training data for the five folds.
For trainfold 1, use

    ? read_all('datasets/NPonly_trainfold1_protein_location').

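
If you switch between folds often, the read_all call can be wrapped in a
small helper. The following is only a sketch: load_trainfold/1 is a
hypothetical predicate that is not part of loader.yap, and it ASSUMES that
the training files for folds 2-5 follow the same naming pattern as the
fold 1 file shown above. Check the datasets directory before relying on it.

```prolog
% Hypothetical helper, NOT part of the distributed loader.yap.
% Assumes aleph.yap has already been loaded, so that read_all/1 is defined.
% ASSUMPTION: folds 2-5 are named like fold 1, i.e.
%   datasets/NPonly_trainfold<N>_protein_location
load_trainfold(Fold) :-
    integer(Fold),
    Fold >= 1,
    Fold =< 5,
    % Turn the fold number into an atom using ISO built-ins only.
    number_codes(Fold, Codes),
    atom_codes(FoldAtom, Codes),
    % Build the file stem passed to Aleph's read_all/1.
    atom_concat('datasets/NPonly_trainfold', FoldAtom, Prefix),
    atom_concat(Prefix, '_protein_location', Stem),
    read_all(Stem).
```

With this definition consulted, calling load_trainfold(1) should be
equivalent to the read_all command given above for trainfold 1.
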
From here you can saturate an example and reduce it to obtain a single
clause, using

    ? sat(1), reduce.

or induce a whole theory, using

    ? induce.

The node limit is currently set to 100 to prevent Yap from crashing.
To obtain the results reported in the ILP paper cited above, we used an
in-house modified version of Aleph.

Please let us know if you download this dataset, so we can update you if
there are any substantial changes. Direct any questions you have to:

Mark Goadrich - August 2004
(richm@cs.wisc.edu)