Lucene(2)_Index

Document modeling in Lucene

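The document is the basic unit of indexing and searching in Lucene. A document is a container for one or more fields, such as title, keywords, author, summary, and content. The field values are what actually get indexed and searched, after they have been broken into tokens. For example, here is a document that has already been modeled and split into fields: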

title : hello, world
keywords: blog, hello
author: zebangchen
content: I have always believed that the man who has
begun to live more seriously within begins to live
more simply without. In an age of extravagance and
waste, I wish I could show to the world how few the
real wants of humanity are.

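To index raw content, you first convert the data into documents and fields that Lucene understands. At search time, queries run against field values: for example, searching for title:lucene returns every document whose title field contains the term lucene.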

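In addition, each field supports the following options:

The field value can be indexed or left unindexed; an unindexed field cannot be searched.
An indexed field can optionally store term vectors, which amount to a miniature inverted index covering only that field's content.
The field value can be stored separately, so that the original text can be retrieved along with search results.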
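
To make the three options concrete, here is a minimal sketch in plain Java. FieldSketch and DocumentSketch are hypothetical names invented for this illustration; they are not Lucene classes, and the rule that term vectors require an indexed field is modeled as a simple constructor check.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a field and its options; not Lucene's Field class. */
final class FieldSketch {
    final String name;
    final String value;
    final boolean indexed;     // can queries match this field?
    final boolean stored;      // is the original value kept for retrieval?
    final boolean termVector;  // is a per-document term list kept for this field?

    FieldSketch(String name, String value,
                boolean indexed, boolean stored, boolean termVector) {
        if (termVector && !indexed) {
            throw new IllegalArgumentException(
                "term vectors only make sense for an indexed field");
        }
        this.name = name;
        this.value = value;
        this.indexed = indexed;
        this.stored = stored;
        this.termVector = termVector;
    }
}

/** Toy stand-in for a document: an ordered container of fields. */
final class DocumentSketch {
    private final List<FieldSketch> fields = new ArrayList<>();

    DocumentSketch add(FieldSketch field) {
        fields.add(field);
        return this;
    }

    /** Names of the fields a query could match. */
    List<String> searchableFieldNames() {
        List<String> names = new ArrayList<>();
        for (FieldSketch f : fields) {
            if (f.indexed) names.add(f.name);
        }
        return names;
    }

    /** Names of the fields whose original value can be read back. */
    List<String> retrievableFieldNames() {
        List<String> names = new ArrayList<>();
        for (FieldSketch f : fields) {
            if (f.stored) names.add(f.name);
        }
        return names;
    }
}
```

A field such as country can be stored but not indexed (shown with results, never searched), while a large contents field is often indexed but not stored.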

The indexing process


Extracting plain text and creating the document

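The first step of indexing is to get the content from the source and build a document from it. Many sources are not plain text: PDF, XML, HTML full of markup, Microsoft Office documents, and so on. Apache Tika, a separate Apache project rather than part of the Lucene core library, provides tools for extracting plain text from many of these formats.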

Analyzing the document

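The next step is to build the Lucene document and its fields, then hand the document to Lucene by calling addDocument on an IndexWriter. Lucene first analyzes the content of each field, splitting it into tokens. Analysis can include optional steps such as lowercasing and stop-word removal, as well as further token processing such as stemming. The tokens produced for each field are then used to build the index.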
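
The analysis steps above can be sketched in a few lines of plain Java. AnalyzerSketch is a hypothetical class written for this post, not one of Lucene's analyzers; its stop-word list and its "strip a trailing s" stemmer are deliberately crude placeholders for what a real analyzer chain does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: tokenize, lowercase, drop stop words, stem. */
final class AnalyzerSketch {
    // A tiny, made-up stop-word list for illustration only.
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenize on any run of characters that are not ASCII letters or digits.
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty first piece if text starts with a separator
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            tokens.add(stem(token));
        }
        return tokens;
    }

    /** Naive stemmer: drops one plural "s" from words longer than three letters. */
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}
```

With this sketch, "Amsterdam has lots of bridges" becomes the tokens amsterdam, has, lot, bridge. The same chain has to be applied to query text at search time, otherwise a query for "Bridges" would never match the indexed token bridge.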

Adding the document to the index

The tokens produced by analysis, together with the document they came from, are used to build an inverted index.
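
As a rough picture of what "inverted" means, the sketch below maps each term to the list of document numbers that contain it. InvertedIndexSketch is a hypothetical in-memory toy; Lucene's real on-disk structures also record frequencies, positions, and more.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> ascending list of document numbers. */
final class InvertedIndexSketch {
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Indexes one document given its analyzed tokens and returns its number. */
    int addDocument(List<String> tokens) {
        int docId = nextDocId++;
        for (String token : tokens) {
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each document at most once per term, even if the term repeats.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the documents containing the term, or an empty list. */
    List<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}
```

Looking up a term is a single map access, which is why searching does not have to scan the documents themselves.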

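Lucene's index data structures are rich and powerful; only a brief overview is given here. A Lucene index consists of one or more segments. Each segment is a standalone index covering a subset of the documents, and no document belongs to more than one segment. When the in-memory buffer has to be flushed to disk, whether because of memory limits or for some other reason, a new segment is created that holds the index for the documents buffered so far.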

A search visits every segment, then merges the per-segment results and returns them.
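
The flush-then-search behavior can be modeled with a small toy. SegmentedIndexSketch and its maxBufferedDocs parameter are names made up for this sketch; the flush trigger here is a simple document count, whereas the real trigger depends on how IndexWriter is configured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy index made of immutable segments plus one in-memory buffer. */
final class SegmentedIndexSketch {
    private final int maxBufferedDocs;
    private final List<Map<String, Set<Integer>>> segments = new ArrayList<>();
    private Map<String, Set<Integer>> buffer = new HashMap<>();
    private int bufferedDocs = 0;
    private int nextDocId = 0;

    SegmentedIndexSketch(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(String... tokens) {
        int docId = nextDocId++;
        for (String token : tokens) {
            buffer.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        if (++bufferedDocs >= maxBufferedDocs) {
            flush(); // buffer is full: write it out as a new segment
        }
    }

    /** Turns the buffered documents into a new segment. */
    void flush() {
        if (bufferedDocs == 0) {
            return;
        }
        segments.add(buffer);
        buffer = new HashMap<>();
        bufferedDocs = 0;
    }

    int segmentCount() {
        return segments.size();
    }

    /** Visits every flushed segment and concatenates the per-segment hits. */
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, Set<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }
}
```

In this toy, documents that are still in the buffer are invisible to search until flush() runs, which loosely mirrors why changes need a commit before a searcher can see them.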

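A segment usually consists of several files named _X.<ext>, where X is the segment name and the extension identifies which part of the index the file holds (term vectors, stored fields, the inverted index, and so on). With the compound file format enabled, these files are packed into a single file, _X.cfs, which reduces the number of files that have to be open during a search.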

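There is also one special file, the segments file, named segments_<N>. It references all the segments currently in use. When Lucene opens an index for searching, it reads this file first and then opens the files of each segment it references. N is an integer called the generation; it is incremented by one every time a change is committed to the index.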
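
The idea of "the highest generation wins" can be sketched as follows. SegmentsFileSketch is a hypothetical helper, not Lucene code, and it assumes that N is written in base 36 (Character.MAX_RADIX) in the file name; that encoding is my recollection of Lucene's naming scheme, so treat it as an assumption.

```java
import java.util.List;

/** Toy helper for picking the current segments_N file from a directory listing. */
final class SegmentsFileSketch {
    static final String PREFIX = "segments_";

    /** Returns the generation encoded in the name, or -1 if it is not a segments_N file. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            return -1L;
        }
        // Assumption: the generation is encoded in radix 36.
        return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
    }

    /** Returns the segments_N file with the highest generation, or null if none exists. */
    static String currentSegmentsFile(List<String> fileNames) {
        String current = null;
        long bestGeneration = -1L;
        for (String name : fileNames) {
            long generation = generationOf(name);
            if (generation > bestGeneration) {
                bestGeneration = generation;
                current = name;
            }
        }
        return current;
    }

    /** Name of the file the next commit would write. */
    static String nextSegmentsFile(String currentFile) {
        long next = generationOf(currentFile) + 1;
        return PREFIX + Long.toString(next, Character.MAX_RADIX);
    }
}
```

Because each commit writes a new file instead of overwriting the old one, a reader that opened an earlier generation can keep using it while the writer moves on.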

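Over time the index accumulates more and more segments, especially when the application opens and closes the writer frequently. Depending on its configuration, IndexWriter periodically merges some of the segments; which segments are merged is decided by the MergePolicy class.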
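
To give a feel for what a merge decision looks like, here is a deliberately simple rule: whenever there are at least mergeFactor segments, merge the mergeFactor smallest ones into one. MergeSketch and this rule are invented for illustration and are not the algorithm of any actual MergePolicy implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy merge rule operating on segment sizes (document counts). */
final class MergeSketch {
    /**
     * Repeatedly merges the mergeFactor smallest segments while at least
     * mergeFactor segments exist. Returns the resulting sizes in ascending order.
     */
    static List<Integer> maybeMerge(List<Integer> segmentSizes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> sizes = new ArrayList<>(segmentSizes);
        while (sizes.size() >= mergeFactor) {
            Collections.sort(sizes);
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += sizes.remove(0); // remove(int index): take the smallest remaining
            }
            sizes.add(merged);
        }
        Collections.sort(sizes);
        return sizes;
    }
}
```

Merging small segments first keeps the cost of each merge low while still bounding how many segments a search has to visit.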

Basic index operations

Adding documents to an index

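addDocument(Document): adds the document using the default analyzer, the one given when the IndexWriter was created, which determines how plain text is split into tokens (tokenization).
addDocument(Document, Analyzer): adds the document using the given analyzer for tokenization.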

import junit.framework.TestCase;

import lia.common.TestUtil;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.index.Term;

import java.io.IOException;

// From chapter 2
public class IndexingTest extends TestCase {
  protected String[] ids = {"1", "2"};
  protected String[] unindexed = {"Netherlands", "Italy"};
  protected String[] unstored = {"Amsterdam has lots of bridges",
                                 "Venice has lots of canals"};
  protected String[] text = {"Amsterdam", "Venice"};

  private Directory directory;

  protected void setUp() throws Exception {     //1
    directory = new RAMDirectory();

    IndexWriter writer = getWriter();           //2

    for (int i = 0; i < ids.length; i++) {      //3
      Document doc = new Document();
      doc.add(new Field("id", ids[i],
                        Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
      doc.add(new Field("country", unindexed[i],
                        Field.Store.YES,
                        Field.Index.NO));
      doc.add(new Field("contents", unstored[i],
                        Field.Store.NO,
                        Field.Index.ANALYZED));
      doc.add(new Field("city", text[i],
                        Field.Store.YES,
                        Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();
  }

  private IndexWriter getWriter() throws IOException {            // 2
    return new IndexWriter(directory, new WhitespaceAnalyzer(),   // 2
                           IndexWriter.MaxFieldLength.UNLIMITED); // 2
  }

  protected int getHitCount(String fieldName, String searchString)
    throws IOException {
    IndexSearcher searcher = new IndexSearcher(directory); //4
    Term t = new Term(fieldName, searchString);
    Query query = new TermQuery(t);                        //5
    int hitCount = TestUtil.hitCount(searcher, query);     //6
    searcher.close();
    return hitCount;
  }

  public void testIndexWriter() throws IOException {
    IndexWriter writer = getWriter();
    assertEquals(ids.length, writer.numDocs());            //7
    writer.close();
  }

  public void testIndexReader() throws IOException {
    IndexReader reader = IndexReader.open(directory);
    assertEquals(ids.length, reader.maxDoc());             //8
    assertEquals(ids.length, reader.numDocs());            //8
    reader.close();
  }

  /*
    #1 Run before every test
    #2 Create IndexWriter
    #3 Add documents
    #4 Create new searcher
    #5 Build simple single-term query
    #6 Get number of hits
    #7 Verify writer document count
    #8 Verify reader document count
  */
}

Three arguments are needed to create the IndexWriter:

A Directory, where the index is stored.
An analyzer, used to break tokenized fields into tokens before they are indexed.
MaxFieldLength.UNLIMITED, which tells the IndexWriter to index every token in a document.

If the IndexWriter finds no existing index in the Directory, it creates a new one; otherwise it adds to the existing index.

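Once the index has been created or opened, we loop over the source documents and add each one to the index. For each plain-text document we create a Document object, then add all of its fields along with their field options.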

Deleting documents from an index

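deleteDocuments(Term): deletes all documents containing the term.
deleteDocuments(Term[]): deletes all documents containing any of the terms.
deleteDocuments(Query): deletes all documents matching the query.
deleteDocuments(Query[]): deletes all documents matching any of the queries.
deleteAll(): deletes all documents in the index.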

  public void testDeleteBeforeOptimize() throws IOException {
    IndexWriter writer = getWriter();
    assertEquals(2, writer.numDocs()); //A
    writer.deleteDocuments(new Term("id", "1"));  //B
    writer.commit();
    assertTrue(writer.hasDeletions());    //1
    assertEquals(2, writer.maxDoc());    //2
    assertEquals(1, writer.numDocs());   //2   
    writer.close();
  }

  public void testDeleteAfterOptimize() throws IOException {
    IndexWriter writer = getWriter();
    assertEquals(2, writer.numDocs());
    writer.deleteDocuments(new Term("id", "1"));
    writer.optimize();                //3
    writer.commit();
    assertFalse(writer.hasDeletions());
    assertEquals(1, writer.maxDoc());  //C
    assertEquals(1, writer.numDocs()); //C    
    writer.close();
  }

  /*
    #A 2 docs in the index
    #B Delete first document
    #C 1 indexed document, 0 deleted documents
    #1 Index contains deletions
    #2 1 indexed document, 1 deleted document
    #3 Optimize compacts deletes
  */

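After a delete call, the deletion is not carried out immediately; it is held in an in-memory buffer. As with added documents, we must call the writer's commit() or close() for the deletion to actually take effect.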
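
The difference between maxDoc and numDocs in the two tests above can be pictured with a toy model: a committed delete only marks a document slot, and optimizing rewrites the slots without the marked ones. DeletionSketch is a hypothetical class written for this post; it does not reproduce how Lucene stores or applies deletes internally.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of buffered deletes, deletion marks, and compaction. */
final class DeletionSketch {
    private List<String> ids = new ArrayList<>();                   // one slot per document number
    private BitSet deleted = new BitSet();                          // marks applied by commit()
    private final List<String> pendingDeletes = new ArrayList<>();  // buffered, not yet applied

    void add(String id) {
        ids.add(id);
    }

    /** Buffers the delete; nothing changes in the index yet. */
    void delete(String id) {
        pendingDeletes.add(id);
    }

    /** Applies buffered deletes by marking matching slots as deleted. */
    void commit() {
        for (String id : pendingDeletes) {
            for (int doc = 0; doc < ids.size(); doc++) {
                if (ids.get(doc).equals(id)) {
                    deleted.set(doc);
                }
            }
        }
        pendingDeletes.clear();
    }

    /** Rewrites the slots without the deleted documents. */
    void optimize() {
        commit();
        List<String> live = new ArrayList<>();
        for (int doc = 0; doc < ids.size(); doc++) {
            if (!deleted.get(doc)) {
                live.add(ids.get(doc));
            }
        }
        ids = live;
        deleted = new BitSet();
    }

    int maxDoc() { return ids.size(); }                          // slots, including deleted ones
    int numDocs() { return ids.size() - deleted.cardinality(); } // live documents only
    boolean hasDeletions() { return !deleted.isEmpty(); }
}
```

In this model the space taken by a deleted document is reclaimed only when the slots are rewritten, which is the effect the optimize() call has in testDeleteAfterOptimize.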

Updating documents in an index

Lucene cannot update individual fields of a document. Instead, it deletes the old document and then adds a new one to the index.

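updateDocument(Term, Document): first deletes all documents containing the term, then adds the new document using the writer's default analyzer.
updateDocument(Term, Document, Analyzer): same as above, but uses the given analyzer.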


public void testUpdate() throws IOException {

  assertEquals(1, getHitCount("city", "Amsterdam"));

  IndexWriter writer = getWriter();

  Document doc = new Document();                   //A            
  doc.add(new Field("id", "1",
                    Field.Store.YES,
                    Field.Index.NOT_ANALYZED));    //A
  doc.add(new Field("country", "Netherlands",
                    Field.Store.YES,
                    Field.Index.NO));              //A  
  doc.add(new Field("contents",                    
                    "Den Haag has a lot of museums",
                    Field.Store.NO,
                    Field.Index.ANALYZED));       //A
  doc.add(new Field("city", "Den Haag",
                    Field.Store.YES,
                    Field.Index.ANALYZED));       //A

  writer.updateDocument(new Term("id", "1"),       //B
                        doc);                      //B
  writer.close();

  assertEquals(0, getHitCount("city", "Amsterdam"));//C   
  assertEquals(1, getHitCount("city", "Haag"));     //D  
}

/*
  #A Create new document with "Haag" in city field
  #B Replace original document with new version
  #C Verify old document is gone
  #D Verify new document is indexed
*/
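
The delete-then-add semantics can also be shown without Lucene at all. UpdateSketch below is a hypothetical toy that stores documents as field-to-value maps and matches exact values only; it is meant to show why the replacement document must carry every field, not to mirror IndexWriter's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy store where an update is a delete followed by an add. */
final class UpdateSketch {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Convenience builder: doc("id", "1", "city", "Amsterdam"). */
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            d.put(keyValues[i], keyValues[i + 1]);
        }
        return d;
    }

    void addDocument(Map<String, String> document) {
        docs.add(new HashMap<>(document));
    }

    /** Removes every document whose field equals the value; returns how many were removed. */
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Replaces matching documents with the given one: delete first, then add. */
    void updateDocument(String field, String value, Map<String, String> replacement) {
        deleteDocuments(field, value);
        addDocument(replacement);
    }

    int hitCount(String field, String value) {
        int hits = 0;
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                hits++;
            }
        }
        return hits;
    }

    int size() {
        return docs.size();
    }
}
```

If the replacement passed to updateDocument contained only the id field, the city value would simply be gone afterwards, because nothing from the old document is carried over.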
