Java Wordnet Interface

MIT Java Wordnet Interface

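JWI is a Java toolkit for interacting with Wordnet. With JWI we can look up the words (each carrying a specific sense) that correspond to a word form, together with their synsets, glosses, hypernyms and hyponyms, and so on.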

Getting Started

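The main interface for working with JWI is edu.mit.jwi.IDictionary, and its standard implementation is edu.mit.jwi.Dictionary. A Dictionary can be initialized simply by passing the URL of the directory that holds the Wordnet dictionary files.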

import java.io.File;
import java.io.IOException;
import java.net.URL;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.IWord;
import edu.mit.jwi.item.IWordID;
import edu.mit.jwi.item.POS;

public void testDictionary() throws IOException {
    // construct the URL to the Wordnet dictionary directory
    String wnhome = System.getenv("WNHOME");
    String path = wnhome + File.separator + "dict";
    URL url = new URL("file", null, path);
    // construct the dictionary object and open it
    IDictionary dict = new Dictionary(url);
    dict.open();
    // look up first sense of the word "dog"
    IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN);
    IWordID wordID = idxWord.getWordIDs().get(0);
    IWord word = dict.getWord(wordID);
    System.out.println("Id = " + wordID);
    System.out.println("Lemma = " + word.getLemma());
    System.out.println("Gloss = " + word.getSynset().getGloss());
}
Id = WID-2084071-n-?-dog
Lemma = dog
Gloss = a member of the genus Canis (probably descended from the common
wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night"
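
The lookup above builds the dictionary path by string concatenation, so an unset WNHOME silently turns into the path null/dict. A minimal sketch of a more defensive way to build the URL follows; the class name WordnetLocation and the method dictUrl are hypothetical helpers, not part of JWI, and only the standard library is used. The returned URL could then be passed to new Dictionary(url) as in the example above.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper (not part of JWI): resolves the Wordnet "dict" directory.
final class WordnetLocation {

    // Builds a file: URL for the "dict" directory under the given Wordnet home.
    // Throws IllegalArgumentException if the home is null or blank.
    static URL dictUrl(String wnhome) {
        if (wnhome == null || wnhome.trim().isEmpty()) {
            throw new IllegalArgumentException("Wordnet home is not set (expected WNHOME)");
        }
        File dictDir = new File(wnhome, "dict");
        try {
            // File.toURI() escapes characters such as spaces in the path
            return dictDir.toURI().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot build URL for " + dictDir, e);
        }
    }

    public static void main(String[] args) {
        String wnhome = System.getenv("WNHOME");
        if (wnhome == null) {
            System.out.println("WNHOME is not set");
            return;
        }
        System.out.println("Dictionary URL = " + dictUrl(wnhome));
    }
}
```

Failing early with a clear message is a design choice here; returning an Optional or falling back to a default directory would work equally well.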

To improve speed, JWI 2.2 provides a new IDictionary implementation, edu.mit.jwi.RAMDictionary, which allows the entire dictionary to be loaded into RAM.

import java.io.File;
import java.util.Iterator;

import edu.mit.jwi.IDictionary;
import edu.mit.jwi.IRAMDictionary;
import edu.mit.jwi.RAMDictionary;
import edu.mit.jwi.data.ILoadPolicy;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.IWordID;
import edu.mit.jwi.item.POS;

public void testRAMDictionary(File wnDir) throws Exception {
    // construct the dictionary object and open it
    IRAMDictionary dict = new RAMDictionary(wnDir, ILoadPolicy.NO_LOAD);
    dict.open();
    // do something
    trek(dict);
    // now load into memory
    System.out.print("\nLoading Wordnet into memory...");
    long t = System.currentTimeMillis();
    dict.load(true); // true: block until loading has finished
    System.out.printf("done (%1d msec)\n", System.currentTimeMillis() - t);
    // try it again, this time in memory
    trek(dict);
}

public void trek(IDictionary dict) {
    int tickNext = 0;
    int tickSize = 20000;
    int seen = 0;
    System.out.print("Treking across Wordnet");
    long t = System.currentTimeMillis();
    // visit every index word of every part of speech
    for (POS pos : POS.values()) {
        for (Iterator<IIndexWord> i = dict.getIndexWordIterator(pos); i.hasNext(); ) {
            for (IWordID wid : i.next().getWordIDs()) {
                seen += dict.getWord(wid).getSynset().getWords().size();
                // print a progress dot roughly every tickSize words
                if (seen > tickNext) {
                    System.out.print(".");
                    tickNext = seen + tickSize;
                }
            }
        }
    }
    System.out.printf("done (%1d msec)\n", System.currentTimeMillis() - t);
    System.out.println("In my trek I saw " + seen + " words");
}
Treking across Wordnet...........................done (3467 msec)
In my trek I saw 522858 words
Loading Wordnet into memory...done (5728 msec)
Treking across Wordnet...........................done (205 msec)
In my trek I saw 522858 words

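Note that JWI expects the argument to be the root form of the word; otherwise it may fail to return the correct Wordnet entry. For dog, for example, we should pass dog as the argument, while passing dogs may produce incorrect results. We can use edu.mit.jwi.morph.WordnetStemmer to obtain the root of a word and then look that root up in the dictionary.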
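
A minimal sketch of that stemming step, assuming JWI 2.4.0 is on the classpath and the dictionary passed in has already been opened. The class StemLookup and its lookup method are hypothetical names introduced for illustration; WordnetStemmer and its findStems(String, POS) method come from JWI.

```java
import java.util.List;

import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.POS;
import edu.mit.jwi.morph.WordnetStemmer;

// Hypothetical helper: stems a surface form before looking it up.
final class StemLookup {

    // Returns the index word for the first stem of surfaceForm, or null if
    // no stem is found or the stem is not in the dictionary.
    // Assumes dict has already been opened.
    static IIndexWord lookup(IDictionary dict, String surfaceForm, POS pos) {
        WordnetStemmer stemmer = new WordnetStemmer(dict);
        List<String> stems = stemmer.findStems(surfaceForm, pos);
        if (stems.isEmpty()) {
            return null;
        }
        // Taking the first stem is a simplification; a word may have several.
        return dict.getIndexWord(stems.get(0), pos);
    }
}
```

With this helper, a call such as StemLookup.lookup(dict, "dogs", POS.NOUN) looks up a stem of dogs rather than the surface form itself.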

Retrieve synonyms

That is, retrieve the other words in the synset that corresponds to one particular sense of a term.

public void getSynonyms(IDictionary dict) {
    // look up first sense of the word "dog"
    IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN);
    IWordID wordID = idxWord.getWordIDs().get(0); // 1st meaning
    IWord word = dict.getWord(wordID);
    ISynset synset = word.getSynset();
    // iterate over words associated with the synset
    for (IWord w : synset.getWords()) {
        System.out.println(w.getLemma());
    }
}
dog
domestic_dog
Canis_familiaris

Get the number of synonyms (across all possible synsets)

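// dict is passed in explicitly; the method cannot see a local dictionary otherwise
public static int getSynonymsNumber(IDictionary dict, String term) {
    int count = 0;
    // try every part of speech, since the term may exist under several
    for (POS pos : POS.values()) {
        IIndexWord idxWord = dict.getIndexWord(term, pos);
        if (idxWord == null) continue; // term not found for this part of speech
        for (IWordID wordID : idxWord.getWordIDs()) {
            IWord word = dict.getWord(wordID);
            ISynset synset = word.getSynset();
            // adds the size of the whole synset, which includes the term itself
            count += synset.getWords().size();
        }
    }
    return count;
}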
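
Note that getSynonymsNumber adds up the size of every synset the term belongs to, so the term itself is counted once per synset and a lemma that appears in several synsets is counted repeatedly. If distinct synonyms are wanted instead, one option is to collect lemmas in a set. The sketch below shows only that counting logic on plain lists of lemmas, which stand in for the lemmas obtained from synset.getWords(); the class SynonymCount, the method distinctSynonyms, and the second sample synset are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: counts distinct synonyms from lists of lemmas.
final class SynonymCount {

    // Counts distinct lemmas across all synsets, excluding the query term itself.
    // Each inner list stands in for the lemmas of one synset.
    static int distinctSynonyms(String term, List<List<String>> synsets) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> lemmas : synsets) {
            for (String lemma : lemmas) {
                if (!lemma.equalsIgnoreCase(term)) {
                    seen.add(lemma);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<List<String>> synsets = Arrays.asList(
            Arrays.asList("dog", "domestic_dog", "Canis_familiaris"),
            Arrays.asList("dog", "domestic_dog")); // made-up second synset, to show a repeat
        System.out.println(distinctSynonyms("dog", synsets)); // prints 2
    }
}
```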

Retrieve hypernyms

Synsets are linked to one another by semantic pointers. The most common is the hypernym pointer, which points to a more general synset. We can follow it by calling getRelatedSynsets(IPointer) on the synset.

public void getHypernyms(IDictionary dict) {
    // get the synset
    IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN);
    IWordID wordID = idxWord.getWordIDs().get(0); // 1st meaning
    IWord word = dict.getWord(wordID);
    ISynset synset = word.getSynset();
    // get the hypernyms
    List<ISynsetID> hypernyms = synset.getRelatedSynsets(Pointer.HYPERNYM);
    // print out each hypernym's id and synonyms
    List<IWord> words;
    for (ISynsetID sid : hypernyms) {
        words = dict.getSynset(sid).getWords();
        System.out.print(sid + " {");
        for (Iterator<IWord> i = words.iterator(); i.hasNext(); ) {
            System.out.print(i.next().getLemma());
            if (i.hasNext())
                System.out.print(", ");
        }
        // close the brace once per synset, after all lemmas are printed
        System.out.println("}");
    }
}
SID-2083346-n {canine, canid}
SID-1317541-n {domestic_animal, domesticated_animal}

Lexical pointers and semantic pointers

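Wordnet has two kinds of pointers: semantic pointers, which link synsets, and lexical pointers, which link word forms. Both kinds are enumerated in JWI's Pointer class, but when using them we must pass the right pointer to the right method of the right class. To get the synsets related to a synset by some relation, call ISynset.getRelatedSynsets(IPointer), where the IPointer argument names the relation (for example Pointer.HYPERNYM, used above). To get the words related to a word by some relation, call IWord.getRelatedWords(IPointer), where the IPointer argument again names the relation. If we pass a lexical pointer (such as Pointer.DERIVED) to getRelatedSynsets, we get no results.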

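The Wordnet documentation does not explain these two kinds of relations very clearly, and the distinction is easier to see in JWI's object model. In JWI, lexical pointers exist only between IWord objects, and semantic pointers exist only between ISynset objects. No pointer connects a word to a synset. So we can only look up the hypernyms of a synset, not the hypernyms of a word directly; likewise, we can only look up the derived forms of a word, and a synset has no derived synsets.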
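
A minimal sketch of this asymmetry, assuming JWI 2.4.0 on the classpath and an already opened dictionary. The class PointerKinds and the method printDerived are hypothetical names; the constant Pointer.DERIVATIONALLY_RELATED is used on the assumption that it is the name of the lexical "derivationally related form" pointer in JWI 2.4.0 (the text above refers to it as Pointer.DERIVED, which may be the name in another version).

```java
import java.util.List;

import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.IWord;
import edu.mit.jwi.item.IWordID;
import edu.mit.jwi.item.POS;
import edu.mit.jwi.item.Pointer;

// Hypothetical helper: follows a lexical pointer from a word, then tries the
// same pointer on the word's synset.
final class PointerKinds {

    static void printDerived(IDictionary dict, String lemma, POS pos) {
        IIndexWord idxWord = dict.getIndexWord(lemma, pos);
        if (idxWord == null) {
            return; // lemma not in the dictionary for this part of speech
        }
        IWord word = dict.getWord(idxWord.getWordIDs().get(0)); // 1st meaning

        // Lexical pointer: ask the word, get word IDs back.
        List<IWordID> related = word.getRelatedWords(Pointer.DERIVATIONALLY_RELATED);
        for (IWordID id : related) {
            System.out.println(dict.getWord(id).getLemma());
        }

        // Same pointer on the synset: per the text above, nothing is expected back.
        int onSynset = word.getSynset()
                           .getRelatedSynsets(Pointer.DERIVATIONALLY_RELATED)
                           .size();
        System.out.println("related synsets via lexical pointer: " + onSynset);
    }
}
```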

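One remaining problem is that the Wordnet documentation does not say which pointers are semantic, which are lexical, and which are both. The JWI documentation gives a breakdown, based on usage statistics, of how each pointer is actually used, as shown in the figure below.

[Figure: usage type of each pointer, from the JWI documentation]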

Reference:
JWI 2.4.0 - MIT