Autocomplete

Posted on 2016-12-18 | In Developing | Comments:

Autocomplete?

Autocomplete是现代搜索引擎的一个重要功能之一，当用户只输入了query的一部分时我们可以根据输入的这部分预测用户想输入的内容，作为suggestion提供给用户选择，可以有效的节省用户时间和提高搜索体验，同时可能提高搜索质量。

先看一下公司Paul大神实现的awesome的Autocomplete:

最直接的实现方式是，将query log中的queries(query dictionary)构建一棵Trie树（前缀树or字典树)，并且对于每个query都给出一个评分（基于count或者其他更加高级的方式）。然后对于用户已经输入的query，去Trie树中搜索以已经输入的内容为前缀的query。

然而这种方式是非常理想化的，在实际使用中有很多问题。最根本的一个问题是，如果用户的query不是以前缀方式出现在Trie树中，匹配就会失败（用户的query常常会有typos或者query里的词顺序本来就是可以交换的）。

解决方法是，我们不应该拘泥于前缀（暂时先忘记前缀树），而应该选择在query dictionary(query log里的所有query)寻找尽量和用户已经输入的query相似的queries作为suggestions。Query的similarity可以通过很多种方式进行衡量，最常用的包括Levenshtein-distance（编辑距离）或n-gram distances。这里我们讲使用的是Levenshtein-distance。

所以我们的问题现在转换为：

Input: (D, q, n)
D -- queries dictionary(millions of queries or more)
q -- The query input by the current user
n -- Maximum Levenshtein-distance allowed from q

Output: (suggestions)
suggestions -- The queries from D whose Levenshtein-distance are smaller than n

Let's EasyMock

Posted on 2016-12-05 | In java | Comments:

入职之后开发和之前自己的project或者在startup的一个明显不同是，colleague的code review和开发的整个生命周期的维护（设计，实现，测试，qa，部署，AB test，logging）。

在单元测试时，由于很多类的功能都是依赖于其他类的功能，如果在测试一个类是全部引入其依赖的类的对象，会导致高度耦合和复杂度，首先一个问题是，单元测试失败时我们无法定位是哪里的问题，是当前正在测试的类的问题还是其依赖的类的问题。

单元测试的目的是检测单个function的逻辑是否能工作正常，而不是各个类耦合在一起能否一起完成整体功能（有intergration测试）。所以我们需要模拟依赖的各个类，使得其确定能工作正常（设定其行为）。

EasyMock就是一个经常使用的Mock工具，我们可以借助其快速生成我们想要的mock object。

Installation

Maven

<dependency>
  <groupId>org.easymock</groupId>
  <artifactId>easymock</artifactId>
  <version>3.4</version>
  <scope>test</scope>
</dependency>

Mocking

The first Mock Object

假设我们现在有下面的test class：

import org.junit.*;

public class ExampleTest {

  private ClassUnderTest classUnderTest;

  private Collaborator mock;

  @Before
  public void setUp() {
    classUnderTest = new ClassUnderTest();
    classUnderTest.setListener(mock);
  }

  @Test
  public void testRemoveNonExistingDocument() {
    // This call should not lead to any notification
    // of the Mock Object:
    classUnderTest.removeDocument("Does not exist");
  }
}

对于大部分的mock，我们只需要静态import EasyMock的方法

import static org.easymock.EasyMock.*;
import org.junit.*;

public class ExampleTest {
  private ClassUnderTest classUnderTest;
  private Collaborator mock;
}

对于创建我们所需要的Mock object，我们只需要：

创建mock object
设定我们希望的mock的行为(function及其返回结果)
切换对象到replay状态

创建mock object我们可以使用mock方法或者annotations：

@Before
public void setUp() {
  mock = mock(Collaborator.class); // 1
  classUnderTest = new ClassUnderTest();
  classUnderTest.setListener(mock);
}

@Test
public void testRemoveNonExistingDocument() {
  // 2 (we do not expect anything)
  replay(mock); // 3
  classUnderTest.removeDocument("Does not exist");
}

在上面的测试中，我们没有指定任何mock的行为，然后replay it。也就是我们expect no calls，所以在第三步时我们call any of the interface methods时，会得到AssertionError。

java.lang.AssertionError:
  Unexpected method call documentRemoved("Does not exist"):
    at org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:29)
    at org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:44)
    at $Proxy0.documentRemoved(Unknown Source)
    at org.easymock.samples.ClassUnderTest.notifyListenersDocumentRemoved(ClassUnderTest.java:74)
    at org.easymock.samples.ClassUnderTest.removeDocument(ClassUnderTest.java:33)
    at org.easymock.samples.ExampleTest.testRemoveNonExistingDocument(ExampleTest.java:24)
      ...

Using annotations

import static org.easymock.EasyMock.*;
import org.easymock.EasyMockRunner;
import org.easymock.TestSubject;
import org.easymock.Mock;
import org.junit.Test;
import org.junit.runner.RunWith;

@RunWith(EasyMockRunner.class)
public class ExampleTest {

  @TestSubject
  private ClassUnderTest classUnderTest = new ClassUnderTest(); // 2

  @Mock
  private Collaborator mock; // 1

  @Test
  public void testRemoveNonExistingDocument() {
    replay(mock);
    classUnderTest.removeDocument("Does not exist");
  }
}

mock会由runner初始化，可以不用添加setUp方法来初始化mock。

然后有时我们需要其他的runner，这时我们可以JUnit rule来代替runner。

import static org.easymock.EasyMock.*;
import org.easymock.EasyMockRule;
import org.easymock.TestSubject;
import org.easymock.Mock;
import org.junit.Rule;
import org.junit.Test;

public class ExampleTest {

  @Rule
  public EasyMockRule mocks = new EasyMockRule(this);

  @TestSubject
  private ClassUnderTest classUnderTest = new ClassUnderTest();

  @Mock
  private Collaborator mock;

  @Test
  public void testRemoveNonExistingDocument() {
    replay(mock);
    classUnderTest.removeDocument("Does not exist");
  }
}

另外annotation可以下面的optional的element：

type，用于指定mock的类型(nice, strict)。
name，用于指定mock名字。
fieldName，用于specifying the target field name where the mock should be injected.

@Mock(type = MockType.NICE, name = "mock", fieldName = "someField")
private Collaborator mock;

@Mock(type = MockType.STRICT, name = "anotherMock", fieldName = "someOtherField")
private Collaborator anotherMock;

Statistical Machine Translation IBM Models 1 and 2

Posted on 2016-07-13 | In NLP | Comments:

IBM model是统计机器翻译(statistical machine translation, SMT)非常经典的模型，虽然到现在已经非常古老了，但是其中的一些核心idea到现在还在被使用与翻译系统中，比如alignment模型。之前上京都大学的自然语言处理的课程时就有接触过，最近研究中有个子问题可能可以借鉴IBM model的一些idea，所以又IBM model翻出来重新看了一遍，这里做一个记录和总结，主要参考Columbia University由Michael Collins主讲的课程COMS W4705: Natural Language Processing的notes。

Introduction

IBM model被提出时主要是用于解决翻译French到English。

$f = f_1, f_2…f_m $表示一个French句子，长度为$m$，$f_i$ for $j \in {1…m}$ 表示句子中第$i$个词。

同样的，定义$e = e_1, e_2…e_l$表示一个英语句子。

在SMT系统中，假设我们有一个训练数据集，也就是一些翻译的例子，每个例子也就是French句子以及对应的翻译的English句子，$(f^{(k)}, e^{(k)})$，表示第k个翻译例子。

$$\begin{aligned} f^{(k)} = f_{1}^{(k)}, f_{2}^{(k)}...f_{m_k}^{(k)} \\ e^{(k)} = e_{1}^{(k)}, e_{2}^{(k)}...e_{l_k}^{(k)} \end{aligned}$$

The Nosiy-Channel Approch(消息信道模型)

IBM model也是使用noisy-channel模型的一个例子，主要包括两个部分：

A language model，用于计算任何英语句子$e$存在的概率$p(e)$，比如我们可以用trigram language model来计算，或者最简单直接的unigram language model。一般language model可以通过大量的English corpus就能直接估计出来。
A translation model，用于计算给定一个English句子$e$，对应的French句子是$f$的条件概率$p(f|e)$。这个概率的计算方法会在后面的内容介绍，一般是从给定的翻译例子的训练数据中估计出来。给定了English句子$e$后，$p(f|e)$主要由两个选择决定：first，the length of French sentence: $m$；second, a choice of the $m$ words $f_1…f_m$。

这里可能会有很多觉得奇怪，任务不是给定法语句子，要翻译为英语句子嘛，为什么前面的translation model是先给定英语句子，其对应法语句子为$f$的概率。下面会给出解释：

我们最直接的任务的表示是($E$表示所有的英语句子)：

$$\begin{aligned} arg \max_{e\in E} p(e | f) \end{aligned}$$

通过贝叶斯规则我们可以转换问题为：

$$\begin{aligned} arg \max_{e\in E} p(e|f) &= arg \max_{e\in E} \frac{p(e, f)}{p(f)}\\ &= arg \max_{e\in E} \frac {p(e)p(f|e)}{\sum_{e} p(e)p(f|e)} \\ &= arg \max_{e\in E} p(e)p(f|e) \end{aligned}$$

所以问题就转换为了求：

$$\begin{aligned} e^{*} = arg \max_{e\in E} p(e)p(f|e) \end{aligned}$$

也就是前面定义的language model和translation model可以解决的问题。一个可能的翻译$e$的score分为两部分：first，the language-model score $p(e)$，给出一个先验概率用于估计这个句子是个合法的英语句子的概率。second，the translation-model score $p(f|e)$，也就是$f$是$e$的翻译的概率。

之所以把原来的问题转换为求：$e^{*} = arg \max_{e\in E} p(e)p(f|e)$（也就是noisy-channel approach），主要的一个好处是可以引入language model。可以有效的改进翻译结果的fluency和语法的合法程度。

只要解决了提到的两个model的参数估计，就可以得到问题的solution。language-model的score一般比较直接，通过分析大规模的English corpus生成的gram model就可以得到，比如trigram model，bigram model。

Topic Segmentation and alignment of multiple documents

Posted on 2016-07-09 | In research | Comments:

对于网页文档进行更细粒度的分析可以有效改进检索的精度。一般来说一个文档都是有一个主题的，同时一篇文档中可能包含多个sub-topic，比如一篇关于分析日元汇率上升的文档中，可能包含近期日元的走势及分析，进一步可能包含英国脱欧的分析，对日本旅游业的影响，制造业的影响，与其他货币的比较等内容。尽管这篇文档整体主题是关于经济的，然是具体其中也包含了各个子话题。如果可以识别划分关于不同子话题的段落，可以有效的改进检索。同样的，多个文档可能包含这些子话题的段落，如果能将这些文档的段落中具有相同子话题的文档关联起来，可以进一步挖掘文档间的关系用于改进检索，因为有些用户只是关注文档中某一特定子话题的内容。

这个任务就是identifying similar segments from multiple similar documents，一个直观的展示是：

论文Topic Segmentation with Shared Topic Detection and Alignment of Multiple Documents给出了一个基于Mutual information(MI)的解决方法，以及带有term weights的Weighted mutual information(WMI)的解决方案。

问题可以定义为：

terms集合 $T=\lbrace t_1, t_2, …, t_l\rbrace $
documents集合 $D=\lbrace d_1, d_2, …, d_n\rbrace $
文档 $d$ 的sentences $S_d=\lbrace s_1, s_2, …, s_m\rbrace$,其中$m$表示文档 $d$的sentences的个数。
Segments集合（准确的说是topic集合，这里的每个segment不是各个文档的片段，而是一个全局topic）$\widehat{S}=\lbrace \widehat{s_1}, \widehat{s_2}, …, \widehat{s_p}\rbrace $ ,也就是每个文档的$\widehat{s_1}$的topic都是相同的，尽管有些documents的$\widehat{s_1}$是空的，也就是不存在对应的sentence。也就是正如前面所说的，论文中这里指的segment其实就是topic（比较迷惑），而topic数目是要预定义的（尽管这个数字一般是没法确定，取一个大的数字是比较合适的，但是增加计算复杂度。）

所要解决的问题就是对每个文档进行segmentation以及找出多个文档中各个segment的alignment：
$$\begin{aligned} Seg_d : \lbrace s_1, s_2, ..., s_{n_d}\rbrace \to \lbrace \widehat{s_{d1}}, \widehat{s_{d2}}, ..., \widehat{s_{?}}\rbrace \end{aligned}$$

$\lbrace \widehat{s_{d1}}, \widehat{s_{d2}}, ..., \widehat{s_{?}}\rbrace$ 表示这个文档的segmentation，其中每个$\widehat{s}$表示文档的片段。以及各个文档的segmentation的alignment，也就是对于每个文档： $$\begin{aligned} Ali_d=\lbrace \widehat{s_{d1}}, \widehat{s_{d2}}, ..., \widehat{s_{?}}\rbrace \to \lbrace \widehat{s_1}, \widehat{s_2}, ..., \widehat{s_p}\rbrace \end{aligned}$$

(这里原文的表述和公式相当让人迷惑，所以做了调整)

其基本假设本质上还是把sentence作为bag of words处理，并假设term和各个topic关联程度不同。比如toyota会和汽车topic关联比较紧密（也就是在关于汽车topic的段落中toyota出现可能性会比关于文学topic的段落中大，也就是topic-model的基本假设），以此定义了mutual inforamtion(MI)，其实就是clustering中MI的定义。
$$\begin{aligned} I(X;Y) = \sum_{x\in X}\sum_{y\in Y}p(x, y)log\frac{p(x,y)}{p(x)p(y)} \end{aligned}$$

对多个文档进行segmentation以及alignment的问题也就变成了如何划分每个文档以及alignment，使得最终的MI取得最大值，也就是MI就是这个问题的目标函数。具体在这问题中，MI也就是：
$$\begin{aligned} I(\widehat{T};\widehat{S}) = \sum_{\widehat{t}\in \widehat{T}}\sum_{\widehat{s}\in \widehat{S}}p(\widehat{t}, \widehat{s})log\frac{p(\widehat{t},\widehat{s})}{p(\widehat{t})p(\widehat{s})} \end{aligned}$$

Java Wordnet Interface

Posted on 2016-02-12 | In research | Comments:

MIT Java Wordnet Interface

JWI是一个java实现的与Wordnet交互的工具包，通过JWI我们可以搜索词form对应的词汇(具有具体语义)及其synset，gloss，上下级词汇等。

Getting Started

JWI交互的主要接口是edu.mit.jwi.IDictionary，标准实现类为edu.mit.jwi.Dictionary，我们可以通过指定Wordnet所在的位置的url简单的初始化一个Dictionary。

public void testDictionary() throws IOException {
	// construct the URL to the Wordnet dictionary directory
	String wnhome = System.getenv("WNHOME");
	String path = wnhome + File.separator + "dict"; 
	URL url = new URL("file", null, path);
	// construct the dictionary object and open it
	IDictionary dict = new Dictionary(url); dict.open();
	// look up first sense of the word "dog"
	IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN); 
	IWordID wordID = idxWord.getWordIDs().get(0);
	IWord word = dict.getWord(wordID);
	System.out.println("Id = " + wordID); 
	System.out.println("Lemma = " + 	word.getLemma()); 
	System.out.println("Gloss = " + 	word.getSynset().getGloss());
}

Id = WID-2084071-n-?-dog
Lemma = dog
Gloss = a member of the genus Canis (probably descended from the common
wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night"

为了提高速度，JWI2.2提供了一个新的IDicitionary实现类：edu.mit.jwi.RAMDictionary，允许将整个Dicitionary加载到RAM中，

public void testRAMDictionary(File wnDir) throws Exception {
	// construct the dictionary object and open it
	IRAMDictionary dict = new RAMDictionary(wnDir, ILoadPolicy.NO_LOAD); 
	dict.open();
	// do something
	trek(dict);
	// now load into memory
	System.out.print("\nLoading Wordnet into memory...");
	long t = System.currentTimeMillis();
	dict.load(true);
	System.out.printf("done (%1d msec)\n", System.currentTimeMillis()-t);
	// try it again, this time in memory
	trek(dict); 
}
	
public void trek(IDictionary dict){
	int tickNext = 0;
	int tickSize = 20000;
	int seen = 0;
	System.out.print("Treking across Wordnet"); 
	long t = System.currentTimeMillis(); 
	for(POS pos : POS.values()) {
		for(Iterator <IIndexWord> i = dict.getIndexWordIterator(pos); i.hasNext(); ) {
			for(IWordID wid : i.next().getWordIDs()){
				seen += dict.getWord(wid).getSynset().getWords().size(); 
				if(seen > tickNext) {
					System.out.print(".");
					tickNext = seen + tickSize; 
				}
			}
		}
	}
	System.out.printf("done (%1d msec)\n", System.currentTimeMillis()-t); System.out.println("In my trek I saw " + seen + " words");
}

1
2
3

Treking across Wordnet...........................done (3467 msec) In my trek I saw 522858 words
Loading Wordnet into memory...done (5728 msec)
Treking across Wordnet...........................done (205 msec) In my trek I saw 522858 words

需要注意的是，JWI要求作为参数的传入必须是词的词根root，否则可能会无法返回在wordnet正确的词汇，比如dog，我们应该传入dog作为参数，dogs则有可能会导致结果有误。我们可以通过edu.mit.jwi.morph.WordnetStemmer来过来的词root，之后以此输入词典进行查找。

Retrieve synonyms

也就是取得一个term的某种语义意思的synset中的其他词汇。

public void getSynonyms(IDictionary dict){
	// look up first sense of the word "dog"
	IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN); 
	IWordID wordID = idxWord.getWordIDs().get(0); // 1st meaning 
	IWord word = dict.getWord(wordID);
	ISynset synset = word.getSynset();
	// iterate over words associated with the synset
	for(IWord w : synset.getWords()) System.out.println(w.getLemma());
}