README.txt for directory
ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location/

This directory contains the Information Extraction Protein Localization
dataset used by Gleaner in the paper

  Learning Ensembles of First-Order Clauses for Recall-Precision Curves:
  A Case Study in Biomedical Information Extraction
  Mark Goadrich, Louis Oliphant and Jude Shavlik
  Fourteenth International Conference on Inductive Logic Programming
  (ILP 2004), Porto, Portugal, September 6-8, 2004

This data was originally described in

  Representing Sentence Structure in Hidden Markov Models for 
  Information Extraction
  Soumya Ray and Mark Craven
  Proceedings of the 17th International Joint Conference on 
  Artificial Intelligence (IJCAI-2001)

and was relabeled in Fall 2003 by Soumya Ray, Mark Goadrich and Louis 
Oliphant.

In the archive IE_Protein-Location.tar.gz, you will find the files

README.txt      (this file)
predicates.html (detailed descriptions of all background knowledge predicates)
aleph.yap       (version 5)
loader.yap      (some helper methods for loading the dataset)
uw_utils.yap    (small Prolog utility predicates: set, concat, reverse, etc.)

and the directories

background_knowledge    (ground predicates for all our background knowledge)
datasets                (.b, .f and .n files for 5 folds of train, tune, test)
dotbfiles               (modes and determinations)
ontologies              (more background knowledge on medical ontologies, 
			 not directly used in learning)

To load these files and perform some experiments, we recommend using Aleph 
Version 5 (included) and Yap version 4.5.3 (found online at 
http://www.ncc.up.pt/~vsc/Yap/).  

First, load the files aleph.yap and uw_utils.yap with the command

? [aleph, uw_utils].

When these have finished loading, you can use the Aleph predicate 
read_all to load the training data for any of the five folds. It takes 
a file stem and reads the matching .b, .f and .n files.
For trainfold 1, use

? read_all('datasets/NPonly_trainfold1_protein_location').

From here you can either saturate an example and reduce it to obtain 
one clause, or induce a whole theory, by using

? sat(1), reduce.

or 

? induce.
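
To inspect the results, the following is a sketch that assumes the 
included Aleph Version 5 provides the show and write_rules predicates 
found in standard Aleph distributions. After sat(1), the bottom clause 
can be displayed with

? show(bottom).

and after induce, the learned theory can be displayed and saved with

? show(theory).
? write_rules('protein_location_fold1.rules').

The output file name is only an illustration; no such file is part of 
the archive.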

The node limit is currently set to 100 to prevent Yap from crashing.
To obtain the results reported in the above ILP paper, we used an in-house 
modified version of Aleph.
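
If your Yap installation can handle a larger search, the limit can 
probably be inspected and changed through Aleph's standard nodes 
setting. This is a sketch that assumes the limit is set through that 
parameter; the value 1000 is an arbitrary illustration, not a 
recommended or tested value.

? setting(nodes, N).
? set(nodes, 1000).

It is likely safest to issue the set command after read_all, since 
loading a .b file may set the value again.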

Please let us know if you download this dataset, so we can update you if there
are any substantial changes.  Direct any questions you have to:

Mark Goadrich - August 2004
(richm@cs.wisc.edu)