Lucene(1)_Introduction

Introduction

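Lucene is a powerful open-source information retrieval library that makes it easy to add search functionality to an application.

A search application typically consists of the following components:

Of these, Lucene provides a powerful, extensible toolkit for the components shaded dark.

Once the content has been acquired, it must be indexed so that it can be searched efficiently. Before indexing, each document is analyzed, that is, broken down into a set of tokens, and the index is then built over those tokens. Breaking documents into tokens is Lucene's first task, and many problems have to be solved at this step: how to handle phrases, how to handle spelling errors, how to treat synonyms, and so on. Languages such as Chinese do not even mark boundaries between words, so an extra word segmentation step is needed.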
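To make the idea of "breaking a document into tokens" concrete, here is a minimal sketch in plain Java. It is not how Lucene's analyzers are implemented; the class name SimpleTokenizer and its splitting rule are assumptions made for illustration only. It splits on anything that is not a letter or digit and lowercases the result, and it does none of the harder work mentioned above (phrases, spelling, synonyms, Chinese word segmentation).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy analyzer (hypothetical, not a Lucene class): illustrates tokenization only. */
class SimpleTokenizer {

    /** Splits text on runs of non-letter, non-digit characters and lowercases each token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty first piece when the text starts with a delimiter
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, makes, search, easy]
        System.out.println(tokenize("Lucene makes search easy!"));
    }
}
```

Note that a Chinese sentence would come out of this sketch as a single token, because it contains no delimiters; that is exactly the word-boundary problem described above.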

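Lucene ships with many analyzers and makes it easy to build a custom one for our documents.

Once documents have been analyzed, they can be indexed for efficient retrieval. Lucene provides powerful support for this step as well.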
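Why does an index make retrieval efficient? The usual answer is the inverted index: a map from each token to the documents that contain it, so a lookup does not have to scan every document. The sketch below is a toy in-memory version under that assumption; the class TinyInvertedIndex is hypothetical and Lucene's real on-disk index format is far more elaborate.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy in-memory inverted index (hypothetical, not Lucene's index format). */
class TinyInvertedIndex {

    // token -> sorted ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexes one document whose text has already been split into tokens. */
    void add(int docId, Iterable<String> tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the ids of all documents containing the token (empty if none). */
    SortedSet<Integer> lookup(String token) {
        return postings.getOrDefault(token, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(0, Arrays.asList("lucene", "is", "a", "search", "library"));
        index.add(1, Arrays.asList("search", "engines", "use", "an", "index"));
        System.out.println(index.lookup("search")); // prints [0, 1]
        System.out.println(index.lookup("lucene")); // prints [0]
    }
}
```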

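On the search side, a user typically submits a query and the system returns the matching documents. Lucene provides a package called the query parser (QueryParser) for processing user queries. A query may contain boolean operators, phrase queries, or wildcard queries. The next step is to take the parsed query and, using the index built earlier, find the documents that match it. This process is quite complex, but Lucene again provides strong support, making result retrieval, filtering, and sorting easy to implement.

There are three common search models:

Pure boolean model: only checks whether a document matches the query; there is no scoring and no ranking.
Vector space model: the query and the documents are both represented as vectors in a space whose dimensions are tokens; the distance between the vectors measures how well they match and is used for ranking.
Probabilistic model: uses a full probabilistic approach to compute the probability that a document matches the query.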
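The difference between the first two models is easier to see in code. The following is a rough sketch, not Lucene's actual scoring: booleanMatch answers only yes or no, while cosine returns a number that can be used for ranking. Using raw term frequencies as vector components and cosine similarity as the "distance" are simplifying assumptions made here; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Toy versions of two search models (hypothetical, not Lucene's scoring). */
class SearchModels {

    /** Pure boolean model with AND semantics: true iff every query token occurs in the document. */
    static boolean booleanMatch(List<String> query, List<String> doc) {
        return new HashSet<>(doc).containsAll(query);
    }

    /** Counts how often each token occurs. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Vector space model: cosine of the angle between the two term-frequency vectors. */
    static double cosine(List<String> query, List<String> doc) {
        Map<String, Integer> q = termFrequencies(query);
        Map<String, Integer> d = termFrequencies(doc);
        double dot = 0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0);
        }
        double norm = Math.sqrt(sumOfSquares(q)) * Math.sqrt(sumOfSquares(d));
        return norm == 0 ? 0 : dot / norm; // an empty vector matches nothing
    }

    private static double sumOfSquares(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return sum;
    }
}
```

With a query of [lucene, search], a document [lucene, index] fails the boolean AND test but still gets a cosine score of about 0.5, so the vector space model can rank it below a document that contains both tokens instead of discarding it.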

Example

The following code indexes the files ending in .txt in a given directory:

package lia.meetlucene;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.FileReader;

// From chapter 1

/**
 * This code was originally written for
 * Erik's Lucene intro java.net article
 */
public class Indexer {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      throw new IllegalArgumentException("Usage: java " + Indexer.class.getName()
        + " <index dir> <data dir>");
    }
    String indexDir = args[0];         //1
    String dataDir = args[1];          //2

    long start = System.currentTimeMillis();
    Indexer indexer = new Indexer(indexDir);
    int numIndexed;
    try {
      numIndexed = indexer.index(dataDir, new TextFilesFilter());
    } finally {
      indexer.close();
    }
    long end = System.currentTimeMillis();

    System.out.println("Indexing " + numIndexed + " files took "
      + (end - start) + " milliseconds");
  }

  private IndexWriter writer;

  public Indexer(String indexDir) throws IOException {
    Directory dir = FSDirectory.open(new File(indexDir));
    writer = new IndexWriter(dir,                           //3
                 new StandardAnalyzer(Version.LUCENE_30),   //3
                 true,                                      //3
                 IndexWriter.MaxFieldLength.UNLIMITED);     //3
  }

  public void close() throws IOException {
    writer.close();                             //4
  }

  public int index(String dataDir, FileFilter filter)
    throws Exception {

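    File[] files = new File(dataDir).listFiles();
    if (files == null) {   // listFiles() returns null if dataDir is not a readable directory
      throw new IOException(dataDir + " is not a readable directory");
    }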

    for (File f: files) {
      if (!f.isDirectory() &&
          !f.isHidden() &&
          f.exists() &&
          f.canRead() &&
          (filter == null || filter.accept(f))) {
        indexFile(f);
      }
    }

    return writer.numDocs();                     //5
  }

  private static class TextFilesFilter implements FileFilter {
    public boolean accept(File path) {
      return path.getName().toLowerCase()        //6
             .endsWith(".txt");                  //6
    }
  }

  protected Document getDocument(File f) throws Exception {
    Document doc = new Document();
    doc.add(new Field("contents", new FileReader(f)));      //7
    doc.add(new Field("filename", f.getName(),              //8
                Field.Store.YES, Field.Index.NOT_ANALYZED));//8
    doc.add(new Field("fullpath", f.getCanonicalPath(),     //9
                Field.Store.YES, Field.Index.NOT_ANALYZED));//9
    return doc;
  }

  private void indexFile(File f) throws Exception {
    System.out.println("Indexing " + f.getCanonicalPath());
    Document doc = getDocument(f);
    writer.addDocument(doc);                              //10
  }
}

/*
#1 Create index in this directory
#2 Index *.txt files from this directory
#3 Create Lucene IndexWriter
#4 Close IndexWriter
#5 Return number of documents indexed
#6 Index .txt files only, using FileFilter
#7 Index file content
#8 Index file name
#9 Index file full path
#10 Add document to Lucene index
*/

A simple search program:

package lia.meetlucene;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

// From chapter 1

/**
 * This code was originally written for
 * Erik's Lucene intro java.net article
 */
public class Searcher {

  public static void main(String[] args) throws IllegalArgumentException,
        IOException, ParseException {
    if (args.length != 2) {
      throw new IllegalArgumentException("Usage: java " + Searcher.class.getName()
        + " <index dir> <query>");
    }

    String indexDir = args[0];               //1 
    String q = args[1];                      //2   

    search(indexDir, q);
  }

  public static void search(String indexDir, String q)
    throws IOException, ParseException {

    Directory dir = FSDirectory.open(new File(indexDir)); //3
    IndexSearcher is = new IndexSearcher(dir);   //3   

    QueryParser parser = new QueryParser(Version.LUCENE_30,    //4
                     "contents",                               //4
                     new StandardAnalyzer(Version.LUCENE_30)); //4
    Query query = parser.parse(q);              //4   
    long start = System.currentTimeMillis();
    TopDocs hits = is.search(query, 10); //5
    long end = System.currentTimeMillis();

    System.err.println("Found " + hits.totalHits +   //6  
      " document(s) (in " + (end - start) +        // 6
      " milliseconds) that matched query '" +     // 6
      q + "':");                                   // 6

    for(ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = is.doc(scoreDoc.doc);               //7      
      System.out.println(doc.get("fullpath"));  //8  
    }

    is.close();                                //9
  }
}

/*
#1 Parse provided index directory
#2 Parse provided query string
#3 Open index
#4 Parse query
#5 Search index
#6 Write search stats
#7 Retrieve matching document
#8 Display file path
#9 Close IndexSearcher
*/

Core classes of the indexing process

IndexWriter
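This class creates a new index or opens an existing one, and adds, deletes, or updates documents in the index. IndexWriter can be thought of as the class that gives you write access to the index. IndexWriter needs somewhere to store the index, and providing that is the job of Directory.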
Directory
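The Directory class describes where the index is stored. It is an abstract class whose subclasses handle the concrete storage location; in the example above, FSDirectory.open returns a Directory backed by a real file system path.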
Analyzer
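Text must be processed by an Analyzer before it is indexed. The Analyzer is specified in the IndexWriter constructor and is responsible for splitting documents into tokens for indexing. Analyzer is also an abstract class, with many subclasses providing different concrete implementations.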
Document
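A Document represents one document, not the original text file but an abstraction of it: a collection of fields. Fields hold pieces of a document's metadata, such as title, author, creation date, summary, filename, or first paragraph. Each piece of metadata is stored and indexed separately as its own field. The same token can carry a different meaning and importance depending on the field it appears in; a match in the title, for example, matters more. At search time we can also require a token to appear in a particular field. In short, a Document is a container of Field objects, and a Field is a class that holds text content that can be indexed.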
Field
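Every document in the index contains one or more fields. Each field has a name, a value (text), and a set of options that tell Lucene how to index that value. A document may contain several fields with the same name; at index time their values are processed in order, as if they had been concatenated into a single text.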
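The behavior of same-named fields can be pictured with a small model. The sketch below is only an illustration: ToyDocument is a hypothetical class, not Lucene's Document, and joining values with a single space is an assumption standing in for whatever Lucene does internally between values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy document model (hypothetical, not Lucene's Document class). */
class ToyDocument {

    // field name -> values, in the order they were added
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    /** Adds a value; adding the same name twice keeps both values. */
    void add(String name, String value) {
        fields.computeIfAbsent(name, n -> new ArrayList<>()).add(value);
    }

    /** Text an indexer would see for this field: all values, in insertion order. */
    String textToIndex(String name) {
        return String.join(" ", fields.getOrDefault(name, Collections.emptyList()));
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add("title", "Meet Lucene");
        doc.add("author", "Alice");
        doc.add("author", "Bob");
        System.out.println(doc.textToIndex("author")); // prints Alice Bob
    }
}
```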

Core classes of the search process

IndexSearcher
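IndexSearcher is used to search an index. In its most basic use, you pass in a Query object and a top-N parameter, and it returns a TopDocs object containing the results.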

Directory dir = FSDirectory.open(new File("/tmp/index"));
IndexSearcher searcher = new IndexSearcher(dir);
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
searcher.close();

Term
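A Term is the basic unit of search. It closely resembles a Field, except that one is the building block of a query and the other is the building block of a Document. Like a Field, a Term consists of a pair of strings: a name and a value (text).

Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);

The code above finds the top 10 documents whose contents field contains the word lucene.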

Query
A Query object describes what to search for. Query is also an abstract class; its concrete implementations include TermQuery, BooleanQuery, PhraseQuery, and others.
TermQuery
TermQuery is the most basic and simplest query type. It matches documents that contain a specific value in a given field.
TopDocs
TopDocs is a simple container of pointers to the top N ranked search results. For each of the N results it records an int docID and a float score.
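As a rough picture of what "the top N results, each a docID plus a score" means, the sketch below keeps the N best-scoring documents with a bounded min-heap. This is an assumption about one reasonable way to collect top hits, not a description of Lucene's internals; TopNCollector and Hit are hypothetical names, with Hit standing in for the docID/score pair.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy top-N collection (hypothetical, not Lucene's TopDocs implementation). */
class TopNCollector {

    /** One hit: a document id plus its relevance score. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Returns the n highest-scoring hits, best first; scores[i] is the score of document i. */
    static List<Hit> topN(float[] scores, int n) {
        // Min-heap on score: the weakest retained hit sits at the head and is evicted first.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (int docId = 0; docId < scores.length; docId++) {
            heap.add(new Hit(docId, scores[docId]));
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        for (Hit hit : topN(new float[] {0.2f, 0.9f, 0.5f, 0.7f}, 2)) {
            System.out.println(hit.docId + " " + hit.score); // prints "1 0.9" then "3 0.7"
        }
    }
}
```

The heap never holds more than n + 1 entries, so memory stays proportional to N rather than to the number of matching documents.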