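Implementing a simple search feature
When searching with Lucene, we can either build the query ourselves or use Lucene's QueryParser class to convert user-entered text into a Query object.
Searching for a specific term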
`import junit.framework.TestCase;`
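Parsing a user-entered query expression: QueryParser
Lucene's search methods take a Query object as a parameter. Parsing a query expression means turning the user's query text, for example `mock OR junit`, into the corresponding Query object.
QueryParser offers strong support for query parsing and can handle complex conditional expressions; the Query it produces can be very large and complex.
QueryParser needs an analyzer to split the query text into individual terms.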
`QueryParser parser = new QueryParser(Version matchVersion,`
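Version matchVersion tells Lucene which release to match, so that backward compatibility is preserved.
field names the default field to search. It applies unless the query explicitly says which field to search, as in `title:lucene`.
analyzer specifies the analyzer used to break the query text into terms.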
`public void testQueryParser() throws Exception {`
Query expressions and how QueryParser interprets them
Query expression | Matches documents that |
---|---|
java | Contain the term java in the default field |
java junit | Contain the term java or junit, or both, in the default field |
java OR junit | Contain the term java or junit, or both, in the default field |
+java +junit | Contain both java and junit in the default field |
java AND junit | Contain both java and junit in the default field |
title:ant | Contain the term ant in the title field |
title:extreme -subject:sports | Contain extreme in the title field and don’t have sports in the subject field |
title:extreme AND NOT subject:sports | Contain extreme in the title field and don’t have sports in the subject field |
(agile OR extreme) AND methodology | Contain methodology and must also contain agile and/or extreme, all in the default field |
title:"junit in action" | Contain the exact phrase "junit in action" in the title field |
title:"junit action"~5 | Contain the terms junit and action within five positions of one another, in the title field |
java* | Contain terms that begin with java, like javaspaces, javaserver, java.net, and the exact term java itself |
java~ | Contain terms that are close to the word java, such as lava |
lastmodified:[1/1/09 TO 12/31/09] | Have lastmodified field values between the dates January 1, 2009 and December 31, 2009 |
Using IndexSearcher
Creating an IndexSearcher
`Directory dir = FSDirectory.open(new File("/path/to/index"));`
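IndexReader does the heavy lifting, such as opening the index files and exposing the low-level reader API, while IndexSearcher is much simpler. Opening an IndexReader is costly, so during searching it is best to reuse one IndexReader instance, or a small number of them, and to open a new one only when necessary.
You can also create an IndexSearcher directly from a Directory. In that case the searcher opens its own IndexReader and closes it when the searcher is closed, which is comparatively expensive.
An IndexReader always reads from the snapshot of the index that existed when it was opened. If the index changes and we need the reader to see those changes, we have to open a new IndexReader. If a reader is already open, we can call IndexReader.reopen instead: it picks up all the changes while using far fewer resources than opening a brand-new reader.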
`IndexReader newReader = reader.reopen();`
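If the index has changed, reopen returns a new reader; the program must then close the old reader and create a new searcher. In a real application other threads may still be using the old reader, so we have to take care to keep this thread-safe.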
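The swap logic described above can be modeled without Lucene on the classpath. The following is only a sketch: `FakeIndex`, `FakeReader` and `ReaderHolder` are hypothetical stand-ins invented for illustration, not Lucene classes. `FakeReader.reopen()` merely imitates the behavior described here: it returns the same instance when nothing changed and a new instance otherwise.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Lucene-free model of the reopen-and-swap idiom; every type here is a hypothetical stand-in. */
final class ReopenSketch {

    /** Stand-in for an index: the version number grows with every committed change. */
    static final class FakeIndex {
        private final AtomicLong version = new AtomicLong();
        void commitChange() { version.incrementAndGet(); }
        long currentVersion() { return version.get(); }
    }

    /** Stand-in for IndexReader: it only sees the version that existed when it was opened. */
    static final class FakeReader {
        private final FakeIndex index;
        final long snapshotVersion;
        private boolean closed;

        FakeReader(FakeIndex index) {
            this.index = index;
            this.snapshotVersion = index.currentVersion();
        }

        /** Returns this reader if nothing changed, otherwise a reader on the new snapshot. */
        FakeReader reopen() {
            return index.currentVersion() == snapshotVersion ? this : new FakeReader(index);
        }

        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    /** Holds the reader currently used for searching and swaps it when the index has changed. */
    static final class ReaderHolder {
        private FakeReader current;

        ReaderHolder(FakeReader initial) { this.current = initial; }

        /** Synchronized so that two threads cannot swap (and close) readers at the same time. */
        synchronized FakeReader refresh() {
            FakeReader newReader = current.reopen();
            if (newReader != current) {
                // A real application must first make sure no search is still using the old reader.
                current.close();
                current = newReader;
            }
            return current;
        }
    }

    public static void main(String[] args) {
        FakeIndex index = new FakeIndex();
        ReaderHolder holder = new ReaderHolder(new FakeReader(index));
        FakeReader before = holder.refresh();  // index unchanged: the same reader comes back
        index.commitChange();
        FakeReader after = holder.refresh();   // index changed: old reader closed, new one returned
        System.out.println(before != after && before.isClosed() && !after.isClosed());
    }
}
```

The `synchronized` keyword only prevents two concurrent swaps. It does not protect a search that is still running on the old reader, which is exactly the thread-safety concern raised above.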
Performing searches
Once an IndexSearcher instance exists, call its `search(Query, int)` method to run a search. More advanced variants add filtering and sorting.
search method | when to use |
---|---|
TopDocs search(Query query, int n) | Straightforward searches. The int n parameter specifies how many top-scoring documents to return. |
TopDocs search(Query query, Filter filter, int n) | Searches constrained to a subset of available documents, based on filter criteria. |
TopFieldDocs search(Query query, Filter filter, int n, Sort sort) | Searches constrained to a subset of available documents based on filter criteria, and sorted by a custom Sort object |
void search(Query query, Collector results) | Used when you have custom logic to implement for each document visited, or you’d like to collect a different subset of documents than the top N by the sort criteria. |
void search(Query query, Filter filter, Collector results) | Same as previous, except documents are only accepted if they pass the filter criteria. |
Working with TopDocs
TopDocs method or attribute | Return value |
---|---|
totalHits | Number of documents that matched the search |
scoreDocs | Array of ScoreDoc instances that contains the results |
getMaxScore() | Returns best score of all matches, if scoring was done while searching (when sorting by field, you separately control whether scores are computed) |
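Paging through search results
Most of the time users only look at the first page of results, but paging is still needed. There are two main approaches:
- Gather several pages' worth of results up front and keep the resulting ScoreDocs and IndexSearcher around.
- Send the query again for each page (requerying).
Requerying is usually the better approach because it avoids storing per-user state, which becomes very expensive when there are many users. Lucene queries are usually very fast, and the operating system typically caches recently accessed data, so in many cases the data needed is still in RAM.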
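The arithmetic behind requerying can be sketched in plain Java. This is a minimal illustration under stated assumptions: the `int[]` of ranked document IDs stands in for the `scoreDocs` array of a `TopDocs`, and the names `PagingSketch`, `hitsNeeded` and `pageOf` are hypothetical. With Lucene, the value returned by `hitsNeeded` would presumably be passed as the `n` argument of `search(Query, int)`.

```java
import java.util.Arrays;

/** Page arithmetic for requerying; the int[] stands in for the scoreDocs of a TopDocs. */
final class PagingSketch {

    /** Number of top hits to request so that the given 0-based page is fully covered. */
    static int hitsNeeded(int pageIndex, int pageSize) {
        checkArgs(pageIndex, pageSize);
        return Math.multiplyExact(Math.addExact(pageIndex, 1), pageSize);
    }

    /** Cuts one page out of a ranked result array; returns an empty array past the last hit. */
    static int[] pageOf(int[] rankedDocIds, int pageIndex, int pageSize) {
        checkArgs(pageIndex, pageSize);
        long start = Math.min((long) pageIndex * pageSize, rankedDocIds.length);
        long end = Math.min(start + pageSize, rankedDocIds.length);
        return Arrays.copyOfRange(rankedDocIds, (int) start, (int) end);
    }

    private static void checkArgs(int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19, 3, 88, 61, 5};   // illustrative doc IDs, best match first
        System.out.println(hitsNeeded(1, 3));                         // 6
        System.out.println(Arrays.toString(pageOf(ranked, 1, 3)));    // [3, 88, 61]
    }
}
```

No state survives between requests: each page is recomputed from the page index alone, which is what makes requerying cheap to scale across many users.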
Near-real-time search
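Lucene 2.9 introduced near-real-time search, which lets us quickly search content that was recently added to the index, even if the IndexWriter has not yet written it to disk. Many applications keep indexing new content while they serve searches: they need to keep a long-lived writer open, and searches need to reflect the newly added content. If the writer and the reader live in the same JVM, we can use near-real-time search.
Near-real-time search lets us search newly created segments that have not yet been committed.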
`import org.apache.lucene.util.Version;`
Lucene's scoring mechanism
How Lucene scores
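Lucene similarity scoring formula:
$$
\mathrm{score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t\ \mathrm{in}\ d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t.\mathrm{field}\ \mathrm{in}\ d) \cdot \mathrm{lengthNorm}(t.\mathrm{field}\ \mathrm{in}\ d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)
$$
Here d is the document, t is a term, and q is the query.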
This formula gives the score of a document for a query. The score is the raw score, which is a floating-point number >= 0.0. Typically, if an application presents the score to the end user, it's best to first normalize the scores by dividing all scores by the maximum score for the query. The larger the similarity score, the better the match of the document to the query. By default Lucene returns documents reverse-sorted by this score, meaning the top documents are the best matching ones. Table 3.5 (reproduced at the end of this section) describes each of the factors in the scoring formula.
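The normalization step mentioned above is simple enough to sketch in plain Java. The class and method names below are hypothetical, and the `float[]` of raw scores stands in for the scores one would read from the search results.

```java
/** Divides every raw score by the maximum score, as suggested for user-facing display. */
final class ScoreNormalizer {

    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float score : rawScores) {
            max = Math.max(max, score);
        }
        float[] normalized = new float[rawScores.length];
        if (max == 0f) {
            return normalized;  // empty input or all-zero scores: nothing to scale
        }
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] / max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Illustrative raw scores, best match first.
        float[] normalized = normalize(new float[] {2.0f, 1.0f, 0.5f});
        System.out.println(java.util.Arrays.toString(normalized));  // [1.0, 0.5, 0.25]
    }
}
```

After normalization the best hit always scores 1.0, so normalized values are only comparable within one result list, not across different queries.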
Boost factors are built into the equation to let you affect a query or field’s influence on score. Field boosts come in explicitly in the equation as the boost(t.field in d) factor, set at indexing time. The default value of field boosts, logically, is 1.0. During indexing, a document can be assigned a boost, too. A document boost factor implicitly sets the starting field boost of all fields to the specified value. Field-specific boosts are multiplied by the starting value, giving the final value of the field boost factor. It’s possible to add the same named field to a document multiple times, and in such situations the field boost is computed as all the boosts specified for that field and document multiplied together. Section 2.5 discusses index-time boosting in more detail.
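As a worked instance of this multiplication rule, assume a document boost of 1.5 and a field that is added twice to that document, with boosts 2.0 and 1.2. These numbers are made up for illustration and do not come from the source:

$$
\mathrm{boost}(t.\mathrm{field}\ \mathrm{in}\ d) = \underbrace{1.5}_{\text{document boost}} \times \underbrace{2.0 \times 1.2}_{\text{boosts of the two field instances}} = 3.6
$$

With every boost left at its default of 1.0, the factor is 1.0 and drops out of the score.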
In addition to the explicit factors in this equation, other factors can be computed on a per-query basis as part of the queryNorm factor. Queries themselves can have an impact on the document score. Boosting a Query instance is sensible only in a multiple-clause query; if only a single term is used for searching, changing its boost would impact all matched documents equally. In a multiple-clause Boolean query, some documents may match one clause but not another, enabling the boost factor to discriminate between matching documents. Queries also default to a 1.0 boost factor.
Most of these scoring formula factors are controlled and implemented as a subclass of the abstract Similarity class. DefaultSimilarity is the implementation used unless otherwise specified. More computations are performed under the covers of DefaultSimilarity; for example, the term frequency factor is the square root of the actual frequency. Because this is an “in action” book, it’s beyond the book’s scope to delve into the inner workings of these calculations. In practice, it’s extremely rare to need a change in these factors. Should you need to change them, please refer to Similarity’s Javadocs, and be prepared with a solid understanding of these factors and the effect your changes will have.
It’s important to note that a change in index-time boosts or the Similarity methods used during indexing, such as lengthNorm, requires that the index be rebuilt for all factors to be in sync.
Let’s say you’re baffled as to why a certain document got a good score to your Query. Lucene offers a nice feature to help provide the answer.
Factor | Description |
---|---|
tf(t in d) | Term frequency factor for the term (t) in the document (d): how many times the term t occurs in the document. |
idf(t) | Inverse document frequency of the term: a measure of how “unique” the term is. Very common terms have a low idf; very rare terms have a high idf. |
boost(t.field in d) | Field and document boost, as set during indexing (see section 2.5). You may use this to statically boost certain fields and certain documents over others. |
lengthNorm(t.field in d) | Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor. |
coord(q, d) | Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents. |
queryNorm(q) | Normalization value for a query, given the sum of the squared weights of each of the query terms. |
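To make the factors in the table concrete, here is a small Lucene-free sketch of how they might be computed. Only the square-root form of tf is stated in the text above. The other formulas are assumptions based on my recollection of DefaultSimilarity in the Lucene 3.x line and should be checked against the Similarity Javadocs. The class and method names are hypothetical.

```java
/** Plain-Java sketch of the scoring factors; formulas other than tf are assumptions. */
final class SimilaritySketch {

    /** Term frequency factor: square root of the raw frequency (stated in the text). */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Assumed idf: ln(numDocs / (docFreq + 1)) + 1, so rarer terms score higher. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Assumed length norm: 1 / sqrt(number of terms in the field), favoring short fields. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Assumed coord: fraction of the query terms that the document contains. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Assumed query norm: 1 / sqrt(sum of squared weights of the query terms). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** Contribution of one term: tf * idf^2 * boost * lengthNorm. */
    static float termScore(float freq, int docFreq, int numDocs, float boost, int numTerms) {
        float idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Illustrative numbers: the term occurs 4 times in a 4-term field, 9 of 10 docs contain it.
        float perTerm = termScore(4f, 9, 10, 1.0f, 4);         // 2.0 * 1.0^2 * 1.0 * 0.5 = 1.0
        float score = perTerm * coord(1, 2) * queryNorm(1f);   // 1.0 * 0.5 * 1.0 = 0.5
        System.out.println(score);
    }
}
```

The sketch ignores how the index stores norms, so scores reported by Lucene itself may differ slightly from values computed this way.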