Lucene(4)_Document analysis

Analysis is the process of converting a field's content into the most basic unit of indexing, the term, which is essentially a token plus its field name. An analyzer encapsulates this operation as a chain of steps that turn text into tokens; these steps may include extracting words, removing punctuation, lowercasing, removing stop words, stemming, and so on. The overall process is called tokenization: tokens are extracted from the text, and each token combined with its field name forms a term.

When using Lucene, choosing a suitable analyzer is critical, and the right choice depends on the language, the domain, and similar factors. Lucene ships with many built-in analyzers that cover common needs, and its building blocks make it straightforward to assemble a custom analyzer when those are not enough.

Using analyzers

Analysis takes place wherever text has to be converted into terms. For the Lucene core this happens in two places: while building the index, and when parsing queries with QueryParser.

Below, four built-in analyzers are applied to two phrases to give an intuitive feel for what analysis does.

Analyzing "The quick brown fox jumped over the lazy dog"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]

As the output shows, the resulting tokens depend on the analyzer used.

  • WhitespaceAnalyzer splits text at whitespace only and does no further processing of the resulting tokens.
  • SimpleAnalyzer splits text at non-letter characters and lowercases the tokens.
  • StopAnalyzer does the same as SimpleAnalyzer and additionally removes stop words.
  • StandardAnalyzer is Lucene's most sophisticated core analyzer, with a fair amount of logic for recognizing certain kinds of tokens such as company names, entities, and e-mail addresses. It also lowercases tokens and removes stop words and punctuation.

Analysis during indexing

During indexing, the content of a document's fields has to be converted into tokens.

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
IndexWriter writer = new IndexWriter(directory, analyzer,
IndexWriter.MaxFieldLength.UNLIMITED);


Typically you instantiate an Analyzer and pass it to the IndexWriter; that analyzer is then used for every document by default. If a particular document needs special treatment, addDocument and updateDocument also accept a per-document analyzer.
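
A minimal sketch of the per-document case (assuming an existing directory; the field name and analyzer choices are just examples), using the IndexWriter.addDocument(Document, Analyzer) overload:

Analyzer defaultAnalyzer = new StandardAnalyzer(Version.LUCENE_30);
IndexWriter writer = new IndexWriter(directory, defaultAnalyzer,
                                     IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.add(new Field("contents", "some text needing special analysis",
                  Field.Store.YES, Field.Index.ANALYZED));

Analyzer specialAnalyzer = new WhitespaceAnalyzer();  // hypothetical choice
writer.addDocument(doc, specialAnalyzer);             // only this document uses it
writer.close();                                       // other documents use defaultAnalyzer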

To make sure a field's text is actually run through the analyzer, pass Field.Index.ANALYZED or Field.Index.ANALYZED_NO_NORMS when creating the field. To index the entire field value as a single token (that is, with no tokenization at all), use Field.Index.NOT_ANALYZED or Field.Index.NOT_ANALYZED_NO_NORMS.

new Field(String, String, Field.Store.YES, Field.Index.ANALYZED)
// creates a tokenized and stored field. Rest assured the original
// String value is stored. But the output of the designated
// Analyzer dictates what's indexed and available for searching.

// The following code demonstrates indexing of a document where
// one field is analyzed and stored, and the second field is
// analyzed but not stored:

Document doc = new Document();
doc.add(new Field("title", "This is the title",
                  Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("contents", "...document contents...",
                  Field.Store.NO, Field.Index.ANALYZED));

Analysis with QueryParser

QueryParser also uses an analyzer to break the query down into terms. The analyzer is handed each contiguous piece of text from the query expression individually, never the whole expression. For example, given:
"president obama" +harvard +professor
QueryParser invokes the analyzer three times: once for "president obama", once for harvard, and once for professor.
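
A sketch of that call (the field name contents and the analyzer variable are assumed); note that parse() throws ParseException:

QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
Query query = parser.parse("\"president obama\" +harvard +professor");
// The analyzer is applied to "president obama", harvard, and professor
// separately; the resulting query can be inspected with:
System.out.println(query.toString("contents"));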

What’s inside an analyzer?

Analyzer is an abstract class and the base class of all analyzers. Only one abstract method has to be implemented; it converts text into a TokenStream:

public TokenStream tokenStream(String fieldName, Reader reader)

The TokenStream is then used to iterate over all of the resulting terms.

Here is the straightforward SimpleAnalyzer class:

public final class SimpleAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseTokenizer(reader);
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader)
      throws IOException {
    Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
    if (tokenizer == null) {
      tokenizer = new LowerCaseTokenizer(reader);
      setPreviousTokenStream(tokenizer);
    } else {
      tokenizer.reset(reader);
    }
    return tokenizer;
  }
}

LowerCaseTokenizer splits text at non-letter characters and lowercases every letter. reusableTokenStream() is an optional method an analyzer can implement for better efficiency: it reuses the TokenStream created earlier by the same thread instead of building a new one.

Token

A token is the basic unit produced during analysis. A token carries its text value along with some metadata, such as its start and end character offsets, its type, and its position increment. After the text is tokenized, each token's position is stored as an increment relative to the previous token, which by default means the tokens are consecutive.

After tokenization, each token is combined with its field name to form a term, which is passed on to the index. The position increment, the start and end offsets, and the payload are the pieces of metadata carried into the index along with the token.

The position increment relates the current token's position to that of the previous token. It is normally 1, meaning each token occupies a unique, consecutive position in the field. Position increments directly affect phrase queries and span queries, because those queries need to know the distance between terms within a field.

A position increment greater than 1 leaves a gap between tokens; the gap can be used to represent words that were removed.
A position increment of 0 places the token at the same position as the previous token. A synonym analyzer can use a 0 increment to inject synonyms at the same position as the original word.
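
As a concrete illustration (a sketch; the reader and stopWords variables are assumed), the Lucene 3.0 StopFilter constructor takes an enablePositionIncrements flag that controls whether removing a stop word leaves such a gap:

// Analyzing "the quick brown" with "the" as a stop word:
//   enablePositionIncrements = true  -> quick gets posIncr = 2 (gap kept)
//   enablePositionIncrements = false -> quick gets posIncr = 1 (gap closed)
TokenStream stream = new StopFilter(true,                        // keep gaps
                                    new LowerCaseTokenizer(reader),
                                    stopWords);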

Dissecting the TokenStream

A TokenStream is a class that produces a sequence of tokens on demand. There are two different kinds of TokenStream: Tokenizer and TokenFilter. Note that a TokenFilter wraps another TokenStream (which may itself be a TokenFilter).

A Tokenizer reads characters from a java.io.Reader and creates tokens, whereas a TokenFilter consumes incoming tokens and produces new ones by adding, removing, or modifying attributes.

When an analyzer returns its TokenStream from tokenStream or reusableTokenStream, it typically starts with a single Tokenizer, which creates the initial token sequence, followed by any number of chained TokenFilters that modify those tokens. This is referred to as the analyzer chain.

The core Tokenizer and TokenFilter classes are listed below.

Class name Description
TokenStream Abstract Tokenizer base class.
Tokenizer TokenStream whose input is a Reader.
CharTokenizer Parent class of character-based tokenizers, with abstract isTokenChar() method. Emits tokens for contiguous blocks when isTokenChar() returns true. Also provides the capability to normalize (for example, lowercase) characters. Tokens are limited to a maximum size of 255 characters.
WhitespaceTokenizer CharTokenizer with isTokenChar() true for all nonwhitespace characters.
KeywordTokenizer Tokenizes the entire input string as a single token.
LetterTokenizer CharTokenizer with isTokenChar() true when Character.isLetter is true.
LowerCaseTokenizer LetterTokenizer that normalizes all characters to lowercase.
SinkTokenizer A Tokenizer that absorbs tokens, caches them in a private list, and can later iterate over the tokens it had previously cached. This is used in conjunction with TeeTokenFilter to “split” a TokenStream.
StandardTokenizer Sophisticated grammar-based tokenizer, emitting tokens for high-level types like email addresses (see section 4.3.2 for more details). Each emitted token is tagged with a special type, some of which are handled specially by StandardFilter.
TokenFilter TokenStream whose input is another TokenStream.
LowerCaseFilter Lowercases token text.
StopFilter Removes words that exist in a provided set of words.
PorterStemFilter Stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri.
TeeTokenFilter Splits a TokenStream by passing each token it iterates through into a SinkTokenizer. It also returns the token unmodified to its caller.
ASCIIFoldingFilter Maps accented characters to their unaccented counterparts.
CachingTokenFilter Saves all tokens from the input stream and can replay the stream once reset is called.
LengthFilter Accepts tokens whose text length falls within a specified range.
StandardFilter Designed to be fed by a StandardTokenizer. Removes dots from acronyms and ’s (apostrophe followed by s) from words with apostrophes.

Here is code that builds one such analyzer chain:

public TokenStream tokenStream(String fieldName, Reader reader) {
  return new StopFilter(true,
                        new LowerCaseTokenizer(reader),
                        stopWords);
}

In this analyzer, a LowerCaseTokenizer produces the initial tokens from the Reader, and those tokens are then processed by a StopFilter.

Seeing analyzers in action

Normally the tokens produced by analysis are used for indexing or searching without ever being displayed. Below we take a closer look at some concrete analysis output.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import java.io.IOException;

public class AnalyzerDemo {
private static final String[] examples = {
"The quick brown fox jumped over the lazy dog",
"XY&Z Corporation - xyz@example.com"
};

private static final Analyzer[] analyzers = new Analyzer[] {
new WhitespaceAnalyzer(),
new SimpleAnalyzer(),
new StopAnalyzer(Version.LUCENE_30),
new StandardAnalyzer(Version.LUCENE_30)
};

public static void main(String[] args) throws IOException {

String[] strings = examples;
if (args.length > 0) { // A
strings = args;
}

for (String text : strings) {
analyze(text);
}
}

private static void analyze(String text) throws IOException {
System.out.println("Analyzing \"" + text + "\"");
for (Analyzer analyzer : analyzers) {
String name = analyzer.getClass().getSimpleName();
System.out.println(" " + name + ":");
System.out.print(" ");
AnalyzerUtils.displayTokens(analyzer, text); // B
System.out.println("\n");
}
}
}

// #A Analyze command-line strings, if specified
// #B Real work done in here

The AnalyzerUtils class below runs the analyzer over the text and prints the resulting tokens directly instead of indexing them.

import junit.framework.Assert;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.StringReader;

// From chapter 4
public class AnalyzerUtils {
public static void displayTokens(Analyzer analyzer,
String text) throws IOException {
displayTokens(analyzer.tokenStream("contents", new StringReader(text))); //A
}

public static void displayTokens(TokenStream stream)
throws IOException {

TermAttribute term = stream.addAttribute(TermAttribute.class);
while(stream.incrementToken()) {
System.out.print("[" + term.term() + "] "); //B
}
}
/*
#A Invoke analysis process
#B Print token text surrounded by brackets
*/

public static int getPositionIncrement(AttributeSource source) {
PositionIncrementAttribute attr = source.addAttribute(PositionIncrementAttribute.class);
return attr.getPositionIncrement();
}

public static String getTerm(AttributeSource source) {
TermAttribute attr = source.addAttribute(TermAttribute.class);
return attr.term();
}

public static String getType(AttributeSource source) {
TypeAttribute attr = source.addAttribute(TypeAttribute.class);
return attr.type();
}

public static void setPositionIncrement(AttributeSource source, int posIncr) {
PositionIncrementAttribute attr = source.addAttribute(PositionIncrementAttribute.class);
attr.setPositionIncrement(posIncr);
}

public static void setTerm(AttributeSource source, String term) {
TermAttribute attr = source.addAttribute(TermAttribute.class);
attr.setTermBuffer(term);
}

public static void setType(AttributeSource source, String type) {
TypeAttribute attr = source.addAttribute(TypeAttribute.class);
attr.setType(type);
}

public static void displayTokensWithPositions
(Analyzer analyzer, String text) throws IOException {

TokenStream stream = analyzer.tokenStream("contents",
new StringReader(text));
TermAttribute term = stream.addAttribute(TermAttribute.class);
PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);

int position = 0;
while(stream.incrementToken()) {
int increment = posIncr.getPositionIncrement();
if (increment > 0) {
position = position + increment;
System.out.println();
System.out.print(position + ": ");
}

System.out.print("[" + term.term() + "] ");
}
System.out.println();
}

public static void displayTokensWithFullDetails(Analyzer analyzer,
String text) throws IOException {

TokenStream stream = analyzer.tokenStream("contents", // #A
new StringReader(text));

TermAttribute term = stream.addAttribute(TermAttribute.class); // #B
PositionIncrementAttribute posIncr = // #B
stream.addAttribute(PositionIncrementAttribute.class); // #B
OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class); // #B
TypeAttribute type = stream.addAttribute(TypeAttribute.class); // #B

int position = 0;
while(stream.incrementToken()) { // #C

int increment = posIncr.getPositionIncrement(); // #D
if (increment > 0) { // #D
position = position + increment; // #D
System.out.println(); // #D
System.out.print(position + ": "); // #D
}

System.out.print("[" + // #E
term.term() + ":" + // #E
offset.startOffset() + "->" + // #E
offset.endOffset() + ":" + // #E
type.type() + "] "); // #E
}
System.out.println();
}
/*
#A Perform analysis
#B Obtain attributes of interest
#C Iterate through all tokens
#D Compute position and print
#E Print all token details
*/

public static void assertAnalyzesTo(Analyzer analyzer, String input,
String[] output) throws Exception {
TokenStream stream = analyzer.tokenStream("field", new StringReader(input));

TermAttribute termAttr = stream.addAttribute(TermAttribute.class);
for (String expected : output) {
Assert.assertTrue(stream.incrementToken());
Assert.assertEquals(expected, termAttr.term());
}
Assert.assertFalse(stream.incrementToken());
stream.close();
}

public static void displayPositionIncrements(Analyzer analyzer, String text)
throws IOException {
TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
while (stream.incrementToken()) {
System.out.println("posIncr=" + posIncr.getPositionIncrement());
}
}

public static void main(String[] args) throws IOException {
System.out.println("SimpleAnalyzer");
displayTokensWithFullDetails(new SimpleAnalyzer(),
"The quick brown fox....");

System.out.println("\n----");
System.out.println("StandardAnalyzer");
displayTokensWithFullDetails(new StandardAnalyzer(Version.LUCENE_30),
"I'll email you at xyz@example.com");
}
}

public static void main(String[] args) throws IOException {
AnalyzerUtils.displayTokensWithFullDetails(new SimpleAnalyzer(),
"The quick brown fox....");
}
1: [the:0->3:word]
2: [quick:4->9:word]
3: [brown:10->15:word]
4: [fox:16->19:word]

As you can see, each token is positioned directly after the previous one, and all of these tokens are of type word.

Attributes
Note that the TokenStream does not explicitly create a Token object containing all of the token's attributes. Instead, you interact with reusable attribute instances that reflect the current token; this design was chosen for extensibility and efficiency.

TokenStream extends AttributeSource. AttributeSource is an efficient, generic mechanism for exposing extensible attributes without runtime casts. Lucene uses a set of predefined attributes during analysis, and you can also add your own attributes by implementing the Attribute interface.

Lucene's built-in token attributes:

Token attribute interface Description
TermAttribute Token’s text
PositionIncrementAttribute Position increment (defaults to 1)
OffsetAttribute Start and end character offset
TypeAttribute Token’s type (defaults to word)
FlagsAttribute Bits to encode custom flags
PayloadAttribute Per-token byte[] payload (see section 6.5)

With this reusable API, you first call addAttribute for each attribute you need; it returns a concrete instance implementing the requested interface. You then call TokenStream.incrementToken() to step through the tokens. It returns true when it has advanced to the next token, at which point the attribute instances you obtained earlier have been updated to that token's values, and you read them through those instances.

TokenStream stream = analyzer.tokenStream("contents",
new StringReader(text));

PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
while (stream.incrementToken()) {
System.out.println("posIncr=" + posIncr.getPositionIncrement());
}

Note that these attribute instances are bidirectional: you can also use them to set attribute values on the current token.

You can also control the position increment, that is, whether the position advances to the next slot. By calling PositionIncrementAttribute.setPositionIncrement(0) you keep the position unchanged, which lets you place several tokens at the position originally occupied by a single one; increments greater than 1 likewise let you skip positions. The normal increment between tokens is 1.
This is how synonyms can be added at the same position as the original word, which is what makes synonym querying work.

Sometimes you need a complete snapshot of the current token so that you can return to it later. Calling captureState records the current state and returns a State object holding all attribute values; calling restoreState later restores it. Note that this is relatively expensive and should generally be avoided.
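
A minimal sketch of that pattern (the SynonymFilter later in this chapter uses it in exactly this way):

AttributeSource.State saved = stream.captureState();  // snapshot all attributes
// ... emit injected tokens or consume other tokens here ...
stream.restoreState(saved);                            // roll attributes back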

The start and end offsets can be recorded in the index through term vectors and are typically used to highlight search results.
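
A sketch of creating a field whose term vectors include positions and offsets (the field name and value are just examples):

doc.add(new Field("contents", "the quick brown fox",
                  Field.Store.YES, Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS));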

You can also set the token type. For example, this is what StandardAnalyzer produces for I'll email you at xyz@example.com:

1: [i'll:0->4:<APOSTROPHE>]
2: [email:5->10:<ALPHANUM>]
3: [you:11->14:<ALPHANUM>]
5: [xyz@example.com:18->33:<EMAIL>]

Token types are also useful for metaphone and synonym analyzers. By default, however, Lucene does not index the token type; it is only used during analysis. If you need the type at search time, you have to record it yourself, for example as a payload using the TypeAsPayload token filter.
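
A sketch of what that could look like, assuming the TypeAsPayloadTokenFilter class from the org.apache.lucene.analysis.payloads package:

public TokenStream tokenStream(String fieldName, Reader reader) {
  // copy each token's type into its payload so the type survives into the index
  return new TypeAsPayloadTokenFilter(
      new StandardFilter(
          new StandardTokenizer(Version.LUCENE_30, reader)));
}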

The order of TokenFilters matters

The order of TokenFilters in the chain matters. When filtering stop words, for instance, LowerCaseFilter should come first and StopFilter second. With the wrong order (StopFilter followed by LowerCaseFilter), a token such as The may not be removed, because StopFilter assumes the text is already lowercased and therefore only matches the.

// correct version
public class StopAnalyzer2 extends Analyzer {

private Set stopWords;

public StopAnalyzer2() {
stopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
}

public StopAnalyzer2(String[] stopWords) {
this.stopWords = StopFilter.makeStopSet(stopWords);
}

public TokenStream tokenStream(String fieldName, Reader reader) {
return new StopFilter(true, new LowerCaseFilter(new LetterTokenizer(reader)),
stopWords);
}
}

// wrong version
public class StopAnalyzerFlawed extends Analyzer {
private Set stopWords;

public StopAnalyzerFlawed() {
stopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
}

public StopAnalyzerFlawed(String[] stopWords) {
this.stopWords = StopFilter.makeStopSet(stopWords);
}

/**
* Ordering mistake here
*/
public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseFilter(
new StopFilter(true, new LetterTokenizer(reader),
stopWords));
}
}

Using the built-in analyzers

Some of Lucene's commonly used analyzers:

Analyzer Steps taken
WhitespaceAnalyzer Splits tokens at whitespace.
SimpleAnalyzer Divides text at nonletter characters and lowercases.
StopAnalyzer Divides text at nonletter characters, lowercases, and removes stop words.
KeywordAnalyzer Treats entire text as a single token.
StandardAnalyzer Tokenizes based on a sophisticated grammar that recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more. It also lowercases and removes stop words.

StopAnalyzer
In addition to basic tokenization and lowercasing, StopAnalyzer removes stop words. It ships with a common set of English stop words, defined by ENGLISH_STOP_WORDS_SET, which by default contains:

"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it", "no", "not", "of", "on",
"or", "such","that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"

StopAnalyzer has overloaded constructors that let you pass in your own stop-word set.
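
For example (a sketch; the word list is arbitrary):

Analyzer defaultStops = new StopAnalyzer(Version.LUCENE_30);  // built-in English set

Set customStops = StopFilter.makeStopSet(new String[] {"the", "a", "an"});
Analyzer customAnalyzer = new StopAnalyzer(Version.LUCENE_30, customStops);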

StandardAnalyzer
StandardAnalyzer is the most sophisticated, most powerful, and most widely used of Lucene's built-in analyzers.

Choosing an analyzer

Most applications don't use any of the built-in analyzers as-is. Instead they build their own analyzer chain, because real applications usually have specific requirements such as a custom stop-word list or special tokenization rules.

The following sections show how to build practical custom analyzers for two common features: sounds-like querying and synonym expansion.

Sounds-like querying

public class MetaphoneAnalyzerTest extends TestCase {
public void testKoolKat() throws Exception {
RAMDirectory directory = new RAMDirectory();
Analyzer analyzer = new MetaphoneReplacementAnalyzer();

IndexWriter writer = new IndexWriter(directory, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.add(new Field("contents", //#A
"cool cat",
Field.Store.YES,
Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(directory);

Query query = new QueryParser(Version.LUCENE_30, //#B
"contents", analyzer) //#B
.parse("kool kat"); //#B

TopDocs hits = searcher.search(query, 1);
assertEquals(1, hits.totalHits); //#C
int docID = hits.scoreDocs[0].doc;
doc = searcher.doc(docID);
assertEquals("cool cat", doc.get("contents")); //#D

searcher.close();
}

/*
#A Index document
#B Parse query text
#C Verify match
#D Retrieve original value
*/

public static void main(String[] args) throws IOException {
MetaphoneReplacementAnalyzer analyzer =
new MetaphoneReplacementAnalyzer();
AnalyzerUtils.displayTokens(analyzer,
"The quick brown fox jumped over the lazy dog");

System.out.println("");
AnalyzerUtils.displayTokens(analyzer,
"Tha quik brown phox jumpd ovvar tha lazi dag");
}
}

The key piece is the MetaphoneReplacementAnalyzer:

public class MetaphoneReplacementAnalyzer extends Analyzer {
public TokenStream tokenStream(String fieldName, Reader reader) {
return new MetaphoneReplacementFilter(
new LetterTokenizer(reader));
}
}
public class MetaphoneReplacementFilter extends TokenFilter {
public static final String METAPHONE = "metaphone";

private Metaphone metaphoner = new Metaphone();
private TermAttribute termAttr;
private TypeAttribute typeAttr;

public MetaphoneReplacementFilter(TokenStream input) {
super(input);
termAttr = addAttribute(TermAttribute.class);
typeAttr = addAttribute(TypeAttribute.class);
}

public boolean incrementToken() throws IOException {
if (!input.incrementToken()) //#A
return false; //#A

String encoded;
encoded = metaphoner.encode(termAttr.term()); //#B
termAttr.setTermBuffer(encoded); //#C
typeAttr.setType(METAPHONE); //#D
return true;
}
}

The core idea is to replace each word with its phonetic root, as computed by the Metaphone algorithm; the implementation used here comes from the Apache Commons Codec project.

Concretely, the filter replaces the text of each incoming token, at the same position, with that token's phonetic root, and sets the token type to METAPHONE.

String encoded;
encoded = metaphoner.encode(termAttr.term());
termAttr.setTermBuffer(encoded);
typeAttr.setType(METAPHONE);
public static void main(String[] args) throws IOException {
MetaphoneReplacementAnalyzer analyzer =
new MetaphoneReplacementAnalyzer();
AnalyzerUtils.displayTokens(analyzer,
"The quick brown fox jumped over the lazy dog");

System.out.println("");
AnalyzerUtils.displayTokens(analyzer,
"Tha quik brown phox jumpd ovvar tha lazi dag");
}

Run through the metaphone encoder, the two sample sentences come out exactly the same:

[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
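
For reference, the encoder can also be called directly (a sketch, assuming Apache Commons Codec is on the classpath); identical codes are what make the match possible:

Metaphone metaphoner = new Metaphone();   // org.apache.commons.codec.language.Metaphone
String cool = metaphoner.encode("cool");
String kool = metaphoner.encode("kool");
// cool and kool encode to the same Metaphone code, which is why the
// query "kool kat" matches the indexed "cool cat"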

In practice, sounds-like querying is only used in special situations, because it tends to match many completely unrelated results as well. Google's strategy, for example, is to fall back on sounds-like suggestions only when the user's query appears to contain a spelling error or returns few or no matches.

Synonym querying

One way to handle synonyms is to have the analyzer insert a token's synonyms into the token stream as it is being processed.

public void testJumps() throws Exception {
TokenStream stream =
synonymAnalyzer.tokenStream("contents", // #A
new StringReader("jumps")); // #A
TermAttribute term = stream.addAttribute(TermAttribute.class);
PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);

int i = 0;
String[] expected = new String[]{"jumps", // #B
"hops", // #B
"leaps"}; // #B
while(stream.incrementToken()) {
assertEquals(expected[i], term.term());

int expectedPos; // #C
if (i == 0) { // #C
expectedPos = 1; // #C
} else { // #C
expectedPos = 0; // #C
} // #C
assertEquals(expectedPos, // #C
posIncr.getPositionIncrement()); // #C
i++;
}
assertEquals(3, i);
}

The SynonymAnalyzer first has to detect tokens that have synonyms, and then insert those synonyms at the same position.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;
import java.io.Reader;

// From chapter 4
public class SynonymAnalyzer extends Analyzer {
private SynonymEngine engine;

public SynonymAnalyzer(SynonymEngine engine) {
this.engine = engine;
}

public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new SynonymFilter(
new StopFilter(true,
new LowerCaseFilter(
new StandardFilter(
new StandardTokenizer(
Version.LUCENE_30, reader))),
StopAnalyzer.ENGLISH_STOP_WORDS_SET),
engine
);
return result;
}
}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;
import java.io.IOException;
import java.util.Stack;
import lia.analysis.AnalyzerUtils;

// From chapter 4
public class SynonymFilter extends TokenFilter {
public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";

private Stack<String> synonymStack;
private SynonymEngine engine;
private AttributeSource.State current;

private final TermAttribute termAtt;
private final PositionIncrementAttribute posIncrAtt;

public SynonymFilter(TokenStream in, SynonymEngine engine) {
super(in);
synonymStack = new Stack<String>(); //#1
this.engine = engine;

this.termAtt = addAttribute(TermAttribute.class);
this.posIncrAtt = addAttribute(PositionIncrementAttribute.class);
}

public boolean incrementToken() throws IOException {
if (synonymStack.size() > 0) { //#2
String syn = synonymStack.pop(); //#2
restoreState(current); //#2
termAtt.setTermBuffer(syn);
posIncrAtt.setPositionIncrement(0); //#3
return true;
}

if (!input.incrementToken()) //#4
return false;

if (addAliasesToStack()) { //#5
current = captureState(); //#6
}

return true; //#7
}

private boolean addAliasesToStack() throws IOException {
String[] synonyms = engine.getSynonyms(termAtt.term()); //#8
if (synonyms == null) {
return false;
}
for (String synonym : synonyms) { //#9
synonymStack.push(synonym);
}
return true;
}
}

/*
#1 Define synonym buffer
#2 Pop buffered synonyms
#3 Set position increment to 0
#4 Read next token
#5 Push synonyms onto stack
#6 Save current token
#7 Return current token
#8 Retrieve synonyms
#9 Push synonyms onto stack
*/

To keep SynonymEngine extensible, it is designed as an interface with a single method:

public interface SynonymEngine {
String[] getSynonyms(String s) throws IOException;
}

Below is a simple test implementation; Lucene also provides a powerful WordNet-based SynonymEngine.

public class TestSynonymEngine implements SynonymEngine {
private static HashMap<String, String[]> map = new HashMap<String, String[]>();

static {
map.put("quick", new String[] {"fast", "speedy"});
map.put("jumps", new String[] {"leaps", "hops"});
map.put("over", new String[] {"above"});
map.put("lazy", new String[] {"apathetic", "sluggish"});
map.put("dog", new String[] {"canine", "pooch"});
}

public String[] getSynonyms(String s) {
return map.get(s);
}
}

The complete test:

public class SynonymAnalyzerTest extends TestCase {
private IndexSearcher searcher;
private static SynonymAnalyzer synonymAnalyzer =
new SynonymAnalyzer(new TestSynonymEngine());

public void setUp() throws Exception {
RAMDirectory directory = new RAMDirectory();

IndexWriter writer = new IndexWriter(directory,
synonymAnalyzer, //#1
IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("content",
"The quick brown fox jumps over the lazy dog",
Field.Store.YES,
Field.Index.ANALYZED)); //#2
writer.addDocument(doc);

writer.close();

searcher = new IndexSearcher(directory, true);
}

public void tearDown() throws Exception {
searcher.close();
}

public void testJumps() throws Exception {
TokenStream stream =
synonymAnalyzer.tokenStream("contents", // #A
new StringReader("jumps")); // #A
TermAttribute term = stream.addAttribute(TermAttribute.class);
PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);

int i = 0;
String[] expected = new String[]{"jumps", // #B
"hops", // #B
"leaps"}; // #B
while(stream.incrementToken()) {
assertEquals(expected[i], term.term());

int expectedPos; // #C
if (i == 0) { // #C
expectedPos = 1; // #C
} else { // #C
expectedPos = 0; // #C
} // #C
assertEquals(expectedPos, // #C
posIncr.getPositionIncrement()); // #C
i++;
}
assertEquals(3, i);
}

/*
#A Analyze with SynonymAnalyzer
#B Check for correct synonyms
#C Verify synonyms positions
*/

public void testSearchByAPI() throws Exception {

TermQuery tq = new TermQuery(new Term("content", "hops")); //#1
assertEquals(1, TestUtil.hitCount(searcher, tq));

PhraseQuery pq = new PhraseQuery(); //#2
pq.add(new Term("content", "fox")); //#2
pq.add(new Term("content", "hops")); //#2
assertEquals(1, TestUtil.hitCount(searcher, pq));
}

/*
#1 Search for "hops"
#2 Search for "fox hops"
*/

public void testWithQueryParser() throws Exception {
Query query = new QueryParser(Version.LUCENE_30, // 1
"content", // 1
synonymAnalyzer).parse("\"fox jumps\""); // 1
assertEquals(1, TestUtil.hitCount(searcher, query)); // 1
System.out.println("With SynonymAnalyzer, \"fox jumps\" parses to " +
query.toString("content"));

query = new QueryParser(Version.LUCENE_30, // 2
"content", // 2
new StandardAnalyzer(Version.LUCENE_30)).parse("\"fox jumps\""); // B
assertEquals(1, TestUtil.hitCount(searcher, query)); // 2
System.out.println("With StandardAnalyzer, \"fox jumps\" parses to " +
query.toString("content"));
}

/*
#1 SynonymAnalyzer finds the document
#2 StandardAnalyzer also finds document
*/
}

Stemming analysis

Field variations

Language analysis