This is a list of the predicates and literals that we use in our Information Extraction Protein Localization task, with detailed definitions for some of them. Not all of these predicates have modes and determinations associated with them, because some are superseded by another predicate. In the definitions below, "target args" refers to the arguments of our protein-localization predicate. Target arg1 is the protein phrase and target arg2 is the location phrase.

Literals
The basic literals we use are for abstracts, sentences, phrases and words. We let A be the PubMed abstract code, S the sentence number within the abstract starting at 1, P the phrase number within a sentence starting at 0, and W the word number within a sentence starting at 0. Abstract literals are denoted as "abA", sentences as "abA_senS", phrases as "abA_senS_phP" and words as "abA_senS_phP_wW". For example, ab1316274_sen1_ph2_w6 denotes word number 6 (the seventh word, since word numbering starts at 0) in the 1st sentence of PubMed abstract 1316274; that word falls in phrase number 2. Note that words are referenced by their place in the sentence, not by their place within the phrase.
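The naming scheme above can be decoded mechanically. The following is a minimal sketch, not part of the original tooling: the function names `parse_literal` and `literal_kind` are hypothetical, and it assumes that the abstract code A is purely numeric, as in the example.

```python
import re

# Pattern for the scheme described above: abA, abA_senS, abA_senS_phP, abA_senS_phP_wW.
# Assumption: the abstract code A consists of digits only.
_ID_PATTERN = re.compile(
    r"^ab(?P<abstract>\d+)"
    r"(?:_sen(?P<sentence>\d+)"
    r"(?:_ph(?P<phrase>\d+)"
    r"(?:_w(?P<word>\d+))?)?)?$"
)


def parse_literal(literal):
    """Split a literal into its numeric parts; levels that are absent are None."""
    match = _ID_PATTERN.match(literal)
    if match is None:
        raise ValueError(f"not a recognised literal: {literal!r}")
    return {
        level: (int(value) if value is not None else None)
        for level, value in match.groupdict().items()
    }


def literal_kind(literal):
    """Return 'abstract', 'sentence', 'phrase' or 'word' for a literal."""
    parts = parse_literal(literal)
    for level in ("word", "phrase", "sentence"):
        if parts[level] is not None:
            return level
    return "abstract"


parts = parse_literal("ab1316274_sen1_ph2_w6")
print(parts["sentence"], parts["phrase"], parts["word"])
print(literal_kind("ab1316274_sen1_ph2"))
```

Because the word number W counts from the start of the sentence, the phrase number is not needed to locate a word; it only records which phrase the word belongs to.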

Other literals we include are the actual text strings (denoted by the argument "string" below) for the words, phrases and sentences, and the fold identification for abstracts, which allows us to compute statistics on the training set without using the test-set predicates.

Basic Predicates
These predicates form the basis of our objects and relations:

Predicate Definition
abstract(abstract) Type predicate, ex abstract(ab1316274).
sentence(sentence) Type predicate, ex sentence(ab1316274_sen1).
phrase(phrase) Type predicate, ex phrase(ab1316274_sen1_ph2).
word(word) Type predicate, ex word(ab1316274_sen1_ph2_w6).
assigned_subfold(abstract, subfold) Records the subfold to which each abstract is assigned; useful for ensuring that statistical predicates are created from the training set only.
different_phrases(phrase, phrase) These two phrases are not the same literal.
different_words(word, word) These two words are not the same literal.
word_ID_to_string(word, string) A mapping of the literal to the actual word content.
phrase_ID_to_string(phrase, string) A mapping of the literal to the actual phrase content.
sentence_ID_to_string(sentence, string) A mapping of the literal to the actual sentence content.

Sentence Structure Predicates
Predicate Definition
sentence_parent(sentence, abstract) Abstracts are "parents" of sentences.
sentence_child(sentence, phrase) Phrases are "children" of sentences.
sentence_descendent(sentence, phrase) Phrases are under sentence.
sentence_descendent(sentence, word) And so are words.
phrase_ancestor(phrase, sentence) Ancestors are parents, parent's parents, etc.
phrase_descendent(phrase, word) Descendents are children, children's children, etc.
phrase_child(phrase, word) Words are "children" of phrases.
phrase_parent(phrase, sentence) Sentences are "parents" of phrases.
phrase_previous(phrase, phrase) The phrase immediately preceding this phrase.
phrase_next(phrase, phrase) The phrase immediately following this phrase.
phrase_before(phrase, phrase) A phrase somewhere before this phrase in the sentence.
phrase_after(phrase, phrase) A phrase somewhere after this phrase in the sentence.
phrase_sibling(phrase, phrase) A phrase either before or after this phrase in the sentence.
word_ancestor(word, phrase) Phrases are "parents" of words.
word_ancestor(word, sentence) Sentences are ancestors of words.
word_parent(word, phrase) Phrases are "parents" of words.
word_previous(word, word) The word immediately preceding this word in the sentence.
word_next(word, word) The word immediately following this word in the sentence.
word_before(word, word) A word somewhere before this word in the sentence.
word_after(word, word) A word somewhere after this word in the sentence.
word_sibling(word, word) A word either before or after this word in the sentence.
word_previous_within_phrase(word, word) The word immediately preceding this word, not crossing phrase boundaries.
word_next_within_phrase(word, word) The word immediately following this word, not crossing phrase boundaries.
word_before_within_phrase(word, word) A word somewhere before this word, not crossing phrase boundaries.
word_after_within_phrase(word, word) A word somewhere after this word, not crossing phrase boundaries.
word_sibling_within_phrase(word, word) A word either before or after this word, not crossing phrase boundaries.
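To make the difference between the sentence-level and within-phrase word relations concrete, here is a small sketch that derives both from a toy sentence. The toy data and the function `word_facts` are hypothetical. It also assumes that the first argument of each predicate is "this word" and the second is the related word, which the definitions above do not state explicitly.

```python
from itertools import combinations

# Hypothetical toy sentence: each inner list is one phrase holding word literals.
# Word numbers run across the whole sentence, as in the naming scheme.
sentence = [
    ["ab1_sen1_ph0_w0", "ab1_sen1_ph0_w1"],
    ["ab1_sen1_ph1_w2"],
    ["ab1_sen1_ph2_w3", "ab1_sen1_ph2_w4"],
]


def word_facts(phrases):
    """Derive word-ordering relations as sets of (this_word, other_word) pairs."""
    words = [w for phrase in phrases for w in phrase]
    phrase_of = {w: i for i, phrase in enumerate(phrases) for w in phrase}
    names = ["word_next", "word_previous", "word_before", "word_after"]
    names += [name + "_within_phrase" for name in names]
    facts = {name: set() for name in names}

    for i, j in combinations(range(len(words)), 2):  # always i < j
        first, second = words[i], words[j]
        # Suffixes to record under: always sentence-level, plus within-phrase if shared.
        suffixes = [""]
        if phrase_of[first] == phrase_of[second]:
            suffixes.append("_within_phrase")
        for suffix in suffixes:
            facts["word_after" + suffix].add((first, second))
            facts["word_before" + suffix].add((second, first))
            if j == i + 1:  # adjacent in the sentence
                facts["word_next" + suffix].add((first, second))
                facts["word_previous" + suffix].add((second, first))
    return facts


facts = word_facts(sentence)
pair = ("ab1_sen1_ph0_w1", "ab1_sen1_ph1_w2")
print(pair in facts["word_next"], pair in facts["word_next_within_phrase"])
```

In this sketch the sibling relations are simply the union of the corresponding before and after sets.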

Part of Speech and Lexical Predicates
Predicate Definition
pp_segment(phrase) This is a prepositional phrase (of, in, with, etc.).
vp_segment(phrase) This is a verb phrase.
adj_segment(phrase) This is an adjective phrase.
np_segment(phrase) This is a noun phrase.
np_conj_segment(phrase) This is a noun conjunctive phrase (Sally and Bob).
isa_np_segment(phrase) Includes both np_segment and np_conj_segment.
c_m(phrase)
art(phrase)
adj(phrase)
prep(phrase)
conj(phrase)
adv(phrase)
n(phrase)
lex(phrase)
part(phrase)
v(phrase)
c_m(word)
art(word)
adj(word)
prep(word)
conj(word)
adv(word)
n(word)
lex(word)
part(word)
v(word)
cop(word)
det(word)
unk(word)
pn(word)
num(word)
ger(word)
inf(word)
aux(word)
novelword(word) This word was not found in the standard UNIX Webster's dictionary.
alphabetic(word) This word contains only letters.
alphanumeric(word) This word contains both numbers and letters.
singleChar(word) There is only one character in this word.
hyphenated(word) There is a hyphen in this word.
all_caps(word) All letters in this word are capitalized.
leading_cap(word) The first letter of this word is capitalized.
internal_cap(word) An internal letter of this word is capitalized.
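The surface-form predicates at the end of this table (alphabetic through internal_cap) can be approximated with simple string tests. The sketch below is illustrative only and is not the original implementation. In particular it assumes that alphanumeric requires at least one letter and at least one digit (other characters are tolerated), that all_caps ignores non-letter characters, and that "internal" means any position after the first.

```python
def surface_predicates(word):
    """Return the names of the surface-form predicates that hold for a word string."""
    letters = [c for c in word if c.isalpha()]
    tests = {
        "alphabetic": word.isalpha(),
        # Assumption: "both numbers and letters" means at least one of each.
        "alphanumeric": bool(letters) and any(c.isdigit() for c in word),
        "singleChar": len(word) == 1,
        "hyphenated": "-" in word,
        # Assumption: only letters are inspected, so "NF-KB" would count as all caps.
        "all_caps": bool(letters) and all(c.isupper() for c in letters),
        "leading_cap": word[:1].isupper(),
        # Assumption: any capital after the first character is "internal".
        "internal_cap": any(c.isupper() for c in word[1:]),
    }
    return {name for name, holds in tests.items() if holds}


for token in ("p53", "NF-kappaB", "DNA", "a"):
    print(token, sorted(surface_predicates(token)))
```

novelword is omitted because it depends on a dictionary lookup that cannot be reproduced from the definition alone.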

Phrase and Sentence Descriptive Predicates
Predicate Definition
first_word_in_phrase(phrase, word) This word is the first word in this phrase.
last_word_in_phrase(phrase, word) This word is the last word in this phrase.
first_phrase_in_sentence(sentence, phrase) This phrase is the first phrase in this sentence.
last_phrase_in_sentence(sentence, phrase) This phrase is the last phrase in this sentence.
short_phrase(phrase) Short phrases have <= 3 child words in a phrase_child(phrase, word) relation.
medium_phrase(phrase) Medium phrases have between 3 and 7 words.
long_phrase(phrase) Long phrases have >= 7 words.
short_sentence(sentence) Short sentences have <= 10 words.
avg_length_sentence(sentence) Average Length sentences have between 10 and 30 words.
long_sentence(sentence) Long sentences have >= 30 words.
few_phrases_in_sentence(sentence) Sentences with fewer than 6 phrases.
several_phrases_in_sentence(sentence) Sentences with between 6 and 18 phrases.
many_phrases_in_sentence(sentence) Sentences with more than 18 phrases.
first_sentence_in_abstract(abstract, sentence) The first sentence in the abstract.
middle_sentence_in_abstract(abstract, sentence) Not the first or the last sentence in the abstract.
last_sentence_in_abstract(abstract, sentence) The last sentence in the abstract.
short_abstract(abstract) Abstracts with fewer than 5 sentences.
medium_abstract(abstract) Abstracts with between 5 and 10 sentences.
long_abstract(abstract) Abstracts with 10 or more sentences.
phrase_contains_go_term(phrase, string, string, word) This phrase contains a word listed in the Gene Ontology (http://www.geneontology.com).
phrase_contains_medDict_term(phrase, string, string, word) This phrase contains a word listed in the Online Medical Dictionary (http://cancerweb.ncl.ac.uk/omd/index.html).
phrase_contains_mesh_term(phrase, string, string, word) This phrase contains a word listed in Medical Subject Headings (http://www.nlm.nih.gov/mesh/meshhome.html).
phrase_contains_mesh_protein(phrase, string, string, word) This phrase contains a word listed in the Medical Subject Headings subheading protein (D12.776).
phrase_contains_mesh_peptide(phrase, string, string, word) This phrase contains a word listed in the Medical Subject Headings subheading peptide (D12.644).
phrase_contains_mesh_cellular_structure(phrase, string, string, word) This phrase contains a word listed in the Medical Subject Headings subheading cellular structure (A11.284).
phrase_contains_some_prep(phrase, word)
phrase_contains_some_art(phrase, word)
phrase_contains_some_adj(phrase, word)
phrase_contains_some_n(phrase, word)
phrase_contains_some_v(phrase, word)
phrase_contains_some_cop(phrase, word)
phrase_contains_some_det(phrase, word)
phrase_contains_some_unk(phrase, word)
phrase_contains_some_pn(phrase, word)
phrase_contains_some_adv(phrase, word)
phrase_contains_some_c_m(phrase, word)
phrase_contains_some_num(phrase, word)
phrase_contains_some_ger(phrase, word)
phrase_contains_some_inf(phrase, word)
phrase_contains_some_conj(phrase, word)
phrase_contains_some_aux(phrase, word)
phrase_contains_some_lex(phrase, word)
phrase_contains_some_part(phrase, word)
phrase_contains_some_marked_up_arg(phrase, arg, word, fold)
phrase_contains_some_unknown_word(phrase, pos, word) This phrase contains a word not found in the standard UNIX Webster's dictionary.
phrase_contains_some_alphabetic(phrase, pos, word) This phrase contains a word with all alphabetic characters.
phrase_contains_some_alphanumeric(phrase, pos, word) This phrase contains a word with alphabetic and numeric characters mixed.
phrase_contains_some_numeric(phrase, pos, word) This phrase contains a word with only numbers.
phrase_contains_some_singlechar_word(phrase, pos, word) This phrase contains a word with only one character.
phrase_contains_some_hyphenated_word(phrase, pos, word) This phrase contains a word with a hyphen.
phrase_contains_some_all_caps_word(phrase, pos, word) This phrase contains a word in which every letter is capitalized.
phrase_contains_some_leading_cap_word(phrase, pos, word) This phrase contains a word with the first letter capitalized.
phrase_contains_some_internal_cap_word(phrase, pos, word) This phrase contains a word with an internal character capitalized.
no_POS_in_phrase(phrase, pos) This phrase has no Parts of Speech of type pos.
one_POS_in_phrase(phrase, pos) This phrase has one Part of Speech of type pos.
few_POS_in_phrase(phrase, pos) This phrase has 0-2 Parts of Speech of type pos.
some_POS_in_phrase(phrase, pos) This phrase has 3-5 Parts of Speech of type pos.
many_POS_in_phrase(phrase, pos) This phrase has 6 or more Parts of Speech of type pos.
no_wordPOS_in_sentence(sentence, pos) This sentence has no Parts of Speech of type pos.
one_wordPOS_in_sentence(sentence, pos) This sentence has one Part of Speech of type pos.
few_wordPOS_in_sentence(sentence, pos) This sentence has 0-3 Parts of Speech of type pos.
some_wordPOS_in_sentence(sentence, pos) This sentence has 4-7 Parts of Speech of type pos.
many_wordPOS_in_sentence(sentence, pos) This sentence has 8 or more Parts of Speech of type pos.
no_phrasePOS_in_sentence(sentence, pos) This sentence has no phrase Parts of Speech of type pos.
one_phrasePOS_in_sentence(sentence, pos) This sentence has one phrase Part of Speech of type pos.
few_phrasePOS_in_sentence(sentence, pos) This sentence has 0-2 phrase Parts of Speech of type pos.
some_phrasePOS_in_sentence(sentence, pos) This sentence has 3-5 phrase Parts of Speech of type pos.
many_phrasePOS_in_sentence(sentence, pos) This sentence has 6 or more phrase Parts of Speech of type pos.
phrase_contains_POS(phrase, word, pos) This phrase contains this word with Part of Speech pos.
phrase_contains_POS_pair(phrase, word, word, pos, pos) This phrase contains these two words and parts of speech.
phrase_contains_POS_triple(phrase, word, word, word, pos, pos, pos) This phrase contains these three words and their parts of speech.
phrase_contains_specific_word(phrase, word, string) This phrase contains a word and the actual text matters.
phrase_contains_specific_word_pair(phrase, word, word, string, string) This phrase contains two words and their actual text matters.
phrase_contains_specific_word_triple(phrase, word, word, word, string, string, string) This phrase contains three words and their actual text matters.
sentence_contains_specific_phrase(sentence, phrase, string) This sentence contains a phrase where the actual text matters.
sentence_contains_specific_word(sentence, phrase, word, string) This sentence contains a word where the actual text matters.
sentence_contains_specific_word_pair(sentence, phrase, phrase, word, word, string, string) This sentence contains two words where the actual text matters.
sentence_contains_specific_word_triple(sentence, phrase, phrase, phrase, word, word, word, string, string, string) This sentence contains three words where the actual text matters.
sentence_contains_POS_pair(sentence, phrase, phrase, word, word, pos, pos) This sentence contains two words with particular Parts Of Speech.
sentence_contains_POS_triple(sentence, phrase, phrase, phrase, word, word, word, pos, pos, pos) This sentence contains three words with particular Parts Of Speech.
sentence_contains_specific_word_POS_pair(sentence, phrase, phrase, word, word, string, pos) This sentence contains two words where the actual text of the first and the Part Of Speech of the second matter.
sentence_contains_specific_POS_word_pair(sentence, phrase, phrase, word, word, pos, string) This sentence contains two words where the Part Of Speech of the first and the actual text of the second matter.
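The no/one/few/some/many count predicates in this table discretize a Part-of-Speech count into named ranges. The sketch below shows one way such a mapping could work using the ranges stated above; the function name `count_predicates` and the table `_BOUNDS` are hypothetical. Because the "few" range starts at 0, a count of 0 or 1 satisfies two predicates at once.

```python
# (largest count still "few", largest count still "some") per predicate family,
# taken from the ranges in the definitions above.
_BOUNDS = {
    "POS_in_phrase": (2, 5),
    "wordPOS_in_sentence": (3, 7),
    "phrasePOS_in_sentence": (2, 5),
}


def count_predicates(count, family):
    """Return the names of the count predicates that hold for a given count."""
    if count < 0:
        raise ValueError("count must be non-negative")
    few_max, some_max = _BOUNDS[family]
    sizes = []
    if count == 0:
        sizes.append("no")
    if count == 1:
        sizes.append("one")
    if count <= few_max:
        sizes.append("few")
    elif count <= some_max:
        sizes.append("some")
    else:
        sizes.append("many")
    return [f"{size}_{family}" for size in sizes]


print(count_predicates(0, "POS_in_phrase"))
print(count_predicates(4, "wordPOS_in_sentence"))
```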

Target Args Predicates
Predicate Definition
target_arg1_before_target_arg2(word), etc. For this task, the protein phrase is before the location phrase.
target_arg1_after_target_arg2(word), etc. For this task, the protein phrase is after the location phrase.
adjacent_target_args(example, dataset, fold) The two target phrases are adjacent in the sentence.
identical_target_args(example, dataset, fold) The two target phrases are the exact same phrase.
few_phrases_before_target_args(example, dataset, fold) There are 0-2 phrases before the target args.
some_phrases_before_target_args(example, dataset, fold) There are 3-5 phrases before the target args.
many_phrases_before_target_args(example, dataset, fold) There are 6 or more phrases before the target args.
few_phrases_between_target_args(example, dataset, fold) There are 0-2 phrases between the target args.
some_phrases_between_target_args(example, dataset, fold) There are 3-5 phrases between the target args.
many_phrases_between_target_args(example, dataset, fold) There are 6 or more phrases between the target args.
few_phrases_after_target_args(example, dataset, fold) There are 0-2 phrases after the target args.
some_phrases_after_target_args(example, dataset, fold) There are 3-5 phrases after the target args.
many_phrases_after_target_args(example, dataset, fold) There are 6 or more phrases after the target args.
few_words_before_target_args(example, dataset, fold) There are 0-3 words before the target args.
some_words_before_target_args(example, dataset, fold) There are 4-9 words before the target args.
many_words_before_target_args(example, dataset, fold) There are 10 or more words before the target args.
few_words_between_target_args(example, dataset, fold) There are 0-3 words between the target args.
some_words_between_target_args(example, dataset, fold) There are 4-9 words between the target args.
many_words_between_target_args(example, dataset, fold) There are 10 or more words between the target args.
few_words_after_target_args(example, dataset, fold) There are 0-3 words after the target args.
some_words_after_target_args(example, dataset, fold) There are 4-9 words after the target args.
many_words_after_target_args(example, dataset, fold) There are 10 or more words after the target args.
before_both_target_phrases(example, dataset, fold, phrase) This phrase is before both target args.
in_between_both_target_phrases(example, dataset, fold, phrase) This phrase is between both target args.
after_both_target_phrases(example, dataset, fold, phrase) This phrase is after both target args.
word_before_both_target_phrases(example, dataset, fold, phrase, word, string) This word is before both target args.
word_in_between_both_target_phrases(example, dataset, fold, phrase, word, string) This word is between both target args.
word_after_both_target_phrases(example, dataset, fold, phrase, word, string) This word is after both target args.
target_arg1_before_target_arg2(example, dataset, fold) The first target (protein phrase) is before the second (location phrase).
target_arg2_before_target_arg1(example, dataset, fold) The second target (location phrase) is before the first (protein phrase).
word_prev_target_arg1(example, dataset, fold, phrase, word, string) This word is before the protein phrase.
word_prev_target_arg2(example, dataset, fold, phrase, word, string) This word is before the location phrase.
word_next_target_arg1(example, dataset, fold, phrase, word, string) This word is after the protein phrase.
word_next_target_arg2(example, dataset, fold, phrase, word, string) This word is after the location phrase.
word_pair_in_between_both_target_phrases(example, dataset, fold, phrase, phrase, word, word, string, string) These words are in between both target args.
pos_pair_in_between_both_target_phrases(example, dataset, fold, phrase, phrase, pos, pos, string, string) These Parts Of Speech are in between both target args.
word_pos_in_between_both_target_phrases(example, dataset, fold, phrase, phrase, word, pos, string, string) This word, Part Of Speech pair is in between both target args.
pos_word_in_between_both_target_phrases(example, dataset, fold, phrase, phrase, pos, word, string, string) This Part Of Speech, word pair is in between both target args.
word_pair_prev_target_arg2(example, dataset, fold, phrase, phrase, word, word, string, string) These two words are before the location phrase.
word_pair_prev_target_arg1(example, dataset, fold, phrase, phrase, word, word, string, string) These two words are before the protein phrase.
word_pair_next_target_arg2(example, dataset, fold, phrase, phrase, word, word, string, string) These two words are after the location phrase.
word_pair_next_target_arg1(example, dataset, fold, phrase, phrase, word, word, string, string) These two words are after the protein phrase.
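As an illustration of how the positional target-arg predicates relate to one another, the sketch below derives the phrase-level ones from the phrase numbers of the two target args. It is not the original code: `target_arg_predicates` is a hypothetical helper, and it assumes that "before" and "after" are measured from the earlier and the later of the two target phrases respectively.

```python
def phrase_bucket(count):
    """Map a phrase count to few (0-2), some (3-5) or many (6 or more)."""
    if count <= 2:
        return "few"
    if count <= 5:
        return "some"
    return "many"


def target_arg_predicates(n_phrases, arg1_index, arg2_index):
    """Names of phrase-level target-arg predicates for one example.

    n_phrases is the number of phrases in the sentence; the two indexes are the
    phrase numbers (starting at 0) of the protein and the location phrase.
    """
    first, last = sorted((arg1_index, arg2_index))
    names = []
    if first == last:
        names.append("identical_target_args")
    elif last - first == 1:
        names.append("adjacent_target_args")
    if arg1_index < arg2_index:
        names.append("target_arg1_before_target_arg2")
    elif arg2_index < arg1_index:
        names.append("target_arg2_before_target_arg1")

    before = first                          # phrases 0 .. first-1
    between = max(last - first - 1, 0)      # phrases strictly between the args
    after = n_phrases - 1 - last            # phrases last+1 .. n_phrases-1
    names.append(f"{phrase_bucket(before)}_phrases_before_target_args")
    names.append(f"{phrase_bucket(between)}_phrases_between_target_args")
    names.append(f"{phrase_bucket(after)}_phrases_after_target_args")
    return names


print(target_arg_predicates(n_phrases=10, arg1_index=3, arg2_index=8))
```

The word-level variants would follow the same pattern with the 0-3, 4-9 and 10-or-more ranges.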

Frequency Predicates
These predicates are all tied to a particular fold, so they can be learned on the training set and evaluated on the test set without any information leakage.

Predicate Definition
phrase_contains_some_arg_10x_word(phrase, arg, pos, word, fold)
phrase_contains_some_arg_5x_word(phrase, arg, pos, word, fold)
phrase_contains_some_arg_2x_word(phrase, arg, pos, word, fold)
phrase_contains_some_arg_halfX_word(phrase, arg, pos, word, fold)
phrase_contains_several_arg_10x_word(phrase, arg, pos, fold)
phrase_contains_several_arg_5x_word(phrase, arg, pos, fold)
phrase_contains_several_arg_2x_word(phrase, arg, pos, fold)
phrase_contains_many_arg_10x_word(phrase, arg, pos, fold)
phrase_contains_many_arg_5x_word(phrase, arg, pos, fold)
phrase_contains_many_arg_2x_word(phrase, arg, pos, fold)
phrase_contains_no_arg_halfX_word(phrase, arg, pos, fold)
phrase_contains_some_between_10x_word(phrase, arg, pos, word, fold)
phrase_contains_some_between_5x_word(phrase, arg, pos, word, fold)
phrase_contains_some_between_2x_word(phrase, arg, pos, word, fold)
phrase_contains_some_between_halfX_word(phrase, arg, pos, word, fold)
phrase_contains_several_between_10x_word(phrase, arg, pos, fold)
phrase_contains_several_between_5x_word(phrase, arg, pos, fold)
phrase_contains_several_between_2x_word(phrase, arg, pos, fold)
phrase_contains_many_between_10x_word(phrase, arg, pos, fold)
phrase_contains_many_between_5x_word(phrase, arg, pos, fold)
phrase_contains_many_between_2x_word(phrase, arg, pos, fold)
phrase_contains_no_between_halfX_word(phrase, arg, pos, fold)
very_high_phrase_log_odds(+phrase, #arg, #fold)
high_phrase_log_odds(+phrase, #arg, #fold)
med_phrase_log_odds(+phrase, #arg, #fold)
positive_high_phrase_log_odds(+phrase, #arg, #fold)
very_rare_word(word, fold)
rare_word(word, fold)
uncommon_word(word, fold)
common_word(word, fold)
very_common_word(word, fold)
only_in_one_sentence(word, fold)
only_in_one_abstract(word, fold)
in_few_sentences(word, fold)
in_few_abstracts(word, fold)
in_several_sentences(word, fold)
in_several_abstracts(word, fold)
in_many_sentences(word, fold)
in_many_abstracts(word, fold)
in_very_many_sentences(word, fold)
in_very_many_abstracts(word, fold)
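The fold argument on the predicates above is what keeps test-set text out of the statistics. The following sketch shows the general idea under stated assumptions: the toy corpus, the function names, and the reading of "fold" as the held-out fold are all hypothetical. Only the two "only_in_one" predicates are sketched, since the cut-offs for rare, common, few, several and many are not given here.

```python
from collections import defaultdict

# Hypothetical toy corpus: (abstract, subfold, sentences); each sentence is a list of words.
corpus = [
    ("ab1", 1, [["p53", "localizes", "to", "the", "nucleus"], ["the", "nucleus"]]),
    ("ab2", 2, [["actin", "in", "the", "cytoplasm"]]),
    ("ab3", 3, [["p53", "binds", "DNA"]]),
]


def training_occurrences(corpus, held_out_fold):
    """Collect, per word, the sentences and abstracts it occurs in, skipping the held-out fold."""
    sentences_with = defaultdict(set)
    abstracts_with = defaultdict(set)
    for abstract, subfold, sentences in corpus:
        if subfold == held_out_fold:
            continue  # never look at text from the held-out fold
        for number, words in enumerate(sentences, start=1):
            for word in words:
                sentences_with[word].add((abstract, number))
                abstracts_with[word].add(abstract)
    return sentences_with, abstracts_with


def only_in_one_sentence(word, sentences_with):
    return len(sentences_with.get(word, ())) == 1


def only_in_one_abstract(word, abstracts_with):
    return len(abstracts_with.get(word, ())) == 1


sentences_with, abstracts_with = training_occurrences(corpus, held_out_fold=3)
print(only_in_one_abstract("p53", abstracts_with))
print(only_in_one_sentence("nucleus", sentences_with))
```

Holding out a different fold changes the answers, which is why each fact carries its fold as an argument.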