Java Wordnet Interface

MIT Java Wordnet Interface

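JWI is a Java toolkit for interacting with WordNet. With JWI we can look up the words (each with a specific sense) that correspond to a word form, along with their synsets, glosses, hypernyms, hyponyms, and so on.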

Getting Started

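The main interface for working with JWI is edu.mit.jwi.IDictionary, and its standard implementation is edu.mit.jwi.Dictionary. A Dictionary can be initialized simply by passing it the URL of the WordNet dictionary directory.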

import java.io.File;
import java.io.IOException;
import java.net.URL;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.IWord;
import edu.mit.jwi.item.IWordID;
import edu.mit.jwi.item.POS;

public void testDictionary() throws IOException {
	// construct the URL to the Wordnet dictionary directory
	String wnhome = System.getenv("WNHOME");
	String path = wnhome + File.separator + "dict";
	URL url = new URL("file", null, path);
	// construct the dictionary object and open it
	IDictionary dict = new Dictionary(url);
	dict.open();
	// look up first sense of the word "dog"
	IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN);
	IWordID wordID = idxWord.getWordIDs().get(0);
	IWord word = dict.getWord(wordID);
	System.out.println("Id = " + wordID);
	System.out.println("Lemma = " + word.getLemma());
	System.out.println("Gloss = " + word.getSynset().getGloss());
}

Id = WID-2084071-n-?-dog
Lemma = dog
Gloss = a member of the genus Canis (probably descended from the common
wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night"
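
The snippet above concatenates WNHOME straight into the path, so an unset variable would silently become the literal path "null/dict". Below is a minimal sketch of a guard for that case. The class name WordnetLocation and the method dictUrl are made up for this illustration and are not part of JWI.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

public class WordnetLocation {
    /**
     * Builds a file: URL for the "dict" directory under the given WordNet home.
     * Hypothetical helper for illustration; rejects a missing home directory early.
     */
    public static URL dictUrl(String wnhome) {
        if (wnhome == null || wnhome.isEmpty()) {
            throw new IllegalArgumentException("WordNet home is not set");
        }
        try {
            return new File(wnhome, "dict").toURI().toURL();
        } catch (MalformedURLException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException("Cannot build a URL for " + wnhome, e);
        }
    }

    public static void main(String[] args) {
        String wnhome = System.getenv("WNHOME");
        if (wnhome == null || wnhome.isEmpty()) {
            System.out.println("WNHOME is not set; nothing to open");
            return;
        }
        System.out.println(dictUrl(wnhome));
    }
}
```

The returned URL can then be handed to new Dictionary(url) exactly as in the snippet above.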

To improve speed, JWI 2.2 added a new IDictionary implementation, edu.mit.jwi.RAMDictionary, which allows the entire dictionary to be loaded into RAM:

public void testRAMDictionary(File wnDir) throws Exception {
	// construct the dictionary object and open it
	IRAMDictionary dict = new RAMDictionary(wnDir, ILoadPolicy.NO_LOAD); 
	dict.open();
	// do something
	trek(dict);
	// now load into memory
	System.out.print("\nLoading Wordnet into memory...");
	long t = System.currentTimeMillis();
	dict.load(true);
	System.out.printf("done (%1d msec)\n", System.currentTimeMillis()-t);
	// try it again, this time in memory
	trek(dict); 
}
	
public void trek(IDictionary dict){
	int tickNext = 0;
	int tickSize = 20000;
	int seen = 0;
	System.out.print("Treking across Wordnet"); 
	long t = System.currentTimeMillis(); 
	for(POS pos : POS.values()) {
		for(Iterator <IIndexWord> i = dict.getIndexWordIterator(pos); i.hasNext(); ) {
			for(IWordID wid : i.next().getWordIDs()){
				seen += dict.getWord(wid).getSynset().getWords().size(); 
				if(seen > tickNext) {
					System.out.print(".");
					tickNext = seen + tickSize; 
				}
			}
		}
	}
	System.out.printf("done (%1d msec)\n", System.currentTimeMillis()-t);
	System.out.println("In my trek I saw " + seen + " words");
}

Treking across Wordnet...........................done (3467 msec)
In my trek I saw 522858 words

Loading Wordnet into memory...done (5728 msec)
Treking across Wordnet...........................done (205 msec)
In my trek I saw 522858 words

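Note that JWI expects the argument to be the root form of the word; otherwise it may fail to return the correct WordNet entry. For dog, for example, we should pass dog, whereas passing dogs may give wrong results. We can use edu.mit.jwi.morph.WordnetStemmer to obtain the root of a word, and then look that root up in the dictionary.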
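
The "find the root first, then look it up" step can be sketched as a small helper that tries each candidate root in turn and keeps the first one the dictionary knows. The class StemThenLookup, the method firstHit, and the toy map are invented for this illustration so that it runs without WordNet installed. The comment shows how the helper might be wired to JWI's WordnetStemmer.findStems and IDictionary.getIndexWord.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class StemThenLookup {
    /**
     * Tries each candidate root in order and returns the first non-null
     * lookup result, or null if no candidate is known.
     */
    public static <T> T firstHit(List<String> candidates, Function<String, T> lookup) {
        for (String candidate : candidates) {
            T hit = lookup.apply(candidate);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // With JWI the wiring would look roughly like this (not run here):
        //   WordnetStemmer stemmer = new WordnetStemmer(dict);
        //   List<String> stems = stemmer.findStems("dogs", POS.NOUN);
        //   IIndexWord idx = firstHit(stems, s -> dict.getIndexWord(s, POS.NOUN));
        // Stand-in data below is a toy index, not real WordNet content.
        Map<String, String> toyIndex = new HashMap<>();
        toyIndex.put("dog", "entry for dog");
        List<String> toyStems = Arrays.asList("dogs", "dog");
        System.out.println(firstHit(toyStems, toyIndex::get)); // prints "entry for dog"
    }
}
```

Returning null when nothing matches mirrors the null check on getIndexWord that the synonym-counting code later in this post relies on.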

Retrieve synonyms

That is, retrieve the other words in the synset that corresponds to one particular sense of a term.

public void getSynonyms(IDictionary dict){
	// look up first sense of the word "dog"
	IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN); 
	IWordID wordID = idxWord.getWordIDs().get(0); // 1st meaning 
	IWord word = dict.getWord(wordID);
	ISynset synset = word.getSynset();
	// iterate over words associated with the synset
	for(IWord w : synset.getWords()) System.out.println(w.getLemma());
}

dog
domestic_dog 
Canis_familiaris

Get the number of synonyms (across all possible synsets)

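// Sums the sizes of all synsets that contain the term, over every part of speech.
// The dictionary is passed in explicitly; the term itself is counted in each synset.
public static int getSynonymsNumber(IDictionary dict, String term) {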
	int count = 0;
	for(POS pos: POS.values()) {
		IIndexWord idxWord = dict.getIndexWord(term, pos); 
		if(idxWord==null) continue;
		for(IWordID wordID : idxWord.getWordIDs()) {
			IWord word = dict.getWord(wordID);
			ISynset synset = word.getSynset();
			count += synset.getWords().size();
		}
	}
	return count;
}
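
The count above adds up synset sizes, so the term itself is counted once per sense, and a lemma shared by several synsets is counted more than once. If the number of distinct synonyms is wanted instead, one possible approach is to collect the lemmas in a set. The sketch below works on plain lists of lemmas so that it runs on its own. The class DistinctSynonyms and the method countDistinct are hypothetical names. With JWI, each inner list would hold the lemmas obtained from synset.getWords().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DistinctSynonyms {
    /** Counts distinct lemmas across all synsets, ignoring the term itself. */
    public static int countDistinct(String term, List<List<String>> synsetLemmas) {
        Set<String> seen = new HashSet<>();
        for (List<String> lemmas : synsetLemmas) {
            for (String lemma : lemmas) {
                if (!lemma.equalsIgnoreCase(term)) {
                    seen.add(lemma.toLowerCase(Locale.ROOT));
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Lemmas of the first noun sense of "dog", as printed earlier in this post.
        List<List<String>> senses = Arrays.asList(
            Arrays.asList("dog", "domestic_dog", "Canis_familiaris"));
        System.out.println(countDistinct("dog", senses)); // prints 2
    }
}
```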

Retrieve hypernyms

Synsets are linked to one another by semantic pointers. The most common of these is the hypernym pointer, which points to a more general synset. We can follow such pointers by calling ISynset.getRelatedSynsets(IPointer).

public void getHypernyms(IDictionary dict) {
	// get the synset
	IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN); 
	IWordID wordID = idxWord.getWordIDs().get(0); // 1st meaning 
	IWord word = dict.getWord(wordID);
	ISynset synset = word.getSynset();
	// get the hypernyms
	List<ISynsetID> hypernyms = synset.getRelatedSynsets(Pointer.HYPERNYM);
	// print out each hypernyms id and synonyms
	List<IWord> words;
	for(ISynsetID sid : hypernyms) {
		words = dict.getSynset(sid).getWords();
		System.out.print(sid + " {");
		for(Iterator<IWord> i = words.iterator(); i.hasNext();) {
			System.out.print(i.next().getLemma());
			if(i.hasNext())
				System.out.print(", ");
		}
		// close the brace once per synset, after all lemmas are printed
		System.out.println("}");
	} 
}

SID-2083346-n {canine, canid}
SID-1317541-n {domestic_animal, domesticated_animal}

Lexical pointers and semantic pointers

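WordNet has two kinds of pointers: semantic pointers between synsets, and lexical pointers between word forms. Both kinds are defined as constants in JWI's Pointer class, but each must be used with the right method of the right class. To get the synsets related to a synset by some relation, call ISynset.getRelatedSynsets(IPointer), where the IPointer argument names the relation (such as the Pointer.HYPERNYM used above). To get the words related to a word by some relation, call IWord.getRelatedWords(IPointer), where the IPointer argument again names the relation. If we pass a lexical pointer (for example Pointer.DERIVED) to getRelatedSynsets, we get no results.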

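These two kinds of relation are not explained very clearly in the WordNet documentation, but JWI's object model makes the distinction concrete. In JWI, lexical pointers exist only between IWord objects, and semantic pointers exist only between ISynset objects. No pointer connects a word to a synset. So we can only look up the hypernyms of a synset, not the hypernyms of a word directly. Likewise, we can only look up the derived forms of a word, since a synset has no derived synsets.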
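
To make the split concrete, here is a toy model in plain Java. It does not use JWI's classes: PointerModel, Synset, Word, and hypernymsOf are all invented for this illustration. It only mirrors the idea that hypernym links hang off synsets while derived-form links hang off words, so finding a word's hypernyms means going through its synset first. The synset IDs are the ones printed earlier in this post.

```java
import java.util.ArrayList;
import java.util.List;

public class PointerModel {
    /** Toy synset: semantic pointers (here, hypernyms) are stored on the synset. */
    public static final class Synset {
        public final String id;
        public final List<Word> words = new ArrayList<>();
        public final List<Synset> hypernyms = new ArrayList<>();
        public Synset(String id) { this.id = id; }
    }

    /** Toy word sense: lexical pointers (here, derived forms) are stored on the word. */
    public static final class Word {
        public final String lemma;
        public final Synset synset;
        public final List<Word> derived = new ArrayList<>();
        public Word(String lemma, Synset synset) {
            this.lemma = lemma;
            this.synset = synset;
            synset.words.add(this); // register this sense with its synset
        }
    }

    /** A word has no hypernym pointer of its own, so we go through its synset. */
    public static List<Synset> hypernymsOf(Word word) {
        return word.synset.hypernyms;
    }

    public static void main(String[] args) {
        Synset canine = new Synset("SID-2083346-n");
        new Word("canine", canine);
        new Word("canid", canine);

        Synset dogSynset = new Synset("SID-2084071-n");
        Word dog = new Word("dog", dogSynset);
        dogSynset.hypernyms.add(canine); // semantic pointer: synset to synset

        for (Synset s : hypernymsOf(dog)) {
            System.out.println(s.id); // prints SID-2083346-n
        }
        // No lexical pointers were added for this toy word, so the list is empty.
        System.out.println(dog.derived.size()); // prints 0
    }
}
```

In JWI itself the corresponding calls are ISynset.getRelatedSynsets(IPointer) for the synset-level links and IWord.getRelatedWords(IPointer) for the word-level links, as described above.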

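One remaining problem is that the WordNet documentation does not say which pointers are semantic, which are lexical, and which can be both. The JWI documentation gives a table, based on actual usage counts, of how each pointer is used, as shown in the figure below.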

Reference:
JWI 2.4.0 - MIT