Plusaber's Blog


System Design_Evolution of Large-Scale Website Architecture

Posted on 2015-11-05 | In System design | Comments:

Evolution of large-scale website architecture

The technical challenges of a large-scale website come mainly from its huge user base, highly concurrent access, and massive amounts of data. Any simple business becomes hard once it has to handle petabytes of data and hundreds of millions of users, and these are exactly the problems that large-scale website architecture has to solve.

The initial architecture

(Figure: System_design_c1_1)

Since the user base is still small, the application, database, files, and all other resources can simply be placed on a single server.

Separating application services from data services

(Figure: System_design_c1_)

At this point the application and the data need to be separated. The application server processes a large amount of business logic and therefore needs strong computing power; the database server needs fast disk lookups and data caching, so it needs faster disks and more memory; the file server stores and retrieves large numbers of files, so it needs larger disks and more memory. This separation also makes it convenient to scale by adding more servers.

Using caches to improve site performance

(Figure: System_design_c1_3)

Typically 80% of the traffic concentrates on 20% of the data, so caching the hot data in memory can greatly improve the site's response time. Caches come in two kinds, local caches and dedicated distributed cache servers; for scalability a remote distributed cache is generally used, deployed as a cluster of large-memory servers dedicated to caching.
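The cache-aside pattern this describes can be sketched in a few lines. This is only an illustration: a plain dict stands in for a remote distributed cache, and `load_user_from_db` stands in for the real database query (both names are made up):

```python
# Cache-aside sketch: the dict stands in for a remote distributed cache
# (e.g. a memcached/redis cluster); load_user_from_db is a placeholder
# for the real (slow) database lookup.
cache = {}

def load_user_from_db(user_id):
    # placeholder for a slow database query
    return {"id": user_id, "name": "user%d" % user_id}

def get_user(user_id):
    key = "user:%d" % user_id
    if key in cache:              # cache hit: serve hot data from memory
        return cache[key]
    value = load_user_from_db(user_id)
    cache[key] = value            # populate the cache for later requests
    return value
```

With 80% of requests hitting 20% of the data, most calls end on the cache-hit branch and never reach the database.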

Using an application-server cluster to improve the site's concurrent processing capacity

Clustering is the usual means of fully addressing high concurrency and massive data; this is horizontal scaling (scaling out by adding machines, as opposed to vertical scaling, which scales by making a single machine more powerful). For a large website, no server, however powerful, can satisfy continuously growing business demand; in that situation the more appropriate approach is to add another server to share the original server's load.

In this way servers can be added continuously to keep improving system performance, which makes the system scalable. An application-server cluster is also one of the simpler, more mature designs in scalable website architecture.
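As a toy illustration of how requests get spread over such a cluster, here is a round-robin scheduler. The server names are made up, and a real site would use a dedicated load balancer (e.g. nginx or LVS) rather than anything like this:

```python
from itertools import cycle

# Round-robin scheduling over an application-server cluster.
# Adding a machine to this list is exactly the "scale out" step:
# the new server immediately starts taking its share of requests.
servers = ["app1", "app2", "app3"]
_next_server = cycle(servers)

def pick_server():
    # hand each incoming request to the next server in turn
    return next(_next_server)
```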

Database read/write splitting

Once the site uses a cache, the pressure on the database drops considerably, but as users keep growing the database can still become the site's bottleneck under heavy load, so database read/write splitting is needed to reduce it.

When the application server writes data it accesses the master database; the master synchronizes the data to the slave databases through master-slave replication, so that when the application server reads data it can obtain it from a slave.
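The routing rule is just "writes to the master, reads to a slave". A minimal sketch, with strings standing in for real database connections (all names here are illustrative):

```python
import random

# Stand-ins for real connection handles to the master and its replicas.
master = "master-conn"
slaves = ["slave1-conn", "slave2-conn"]

def get_connection(is_write):
    # writes must go to the master so replication can propagate them;
    # reads can be served by any slave to spread the load
    if is_write:
        return master
    return random.choice(slaves)
```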

Using a reverse proxy and a CDN to accelerate responses

A CDN and a reverse proxy are both fundamentally caches. The difference is that the CDN is deployed in the network providers' data centers, while the reverse proxy is deployed in the website's own central data center. When a user's request reaches the central data center it first hits the reverse proxy server; if the reverse proxy has the requested resource cached, it returns it to the user directly.

This speeds up access on the one hand, and reduces the load on the backend servers on the other.

Using a distributed file system and a distributed database

A distributed database is the last resort for splitting a website's data, used only when the data in a single table grows extremely large. The more common way to split a website's data is by business line: deploying the databases of different business lines onto different database servers.

Using NoSQL and search engines

As the site's business becomes more and more complex, its data storage and retrieval needs also become more complex, and the site starts to need non-relational database technologies such as NoSQL and non-database query technologies such as search engines.

Both NoSQL stores and search engines have good support for scalable, distributed deployment. The application servers access all these kinds of data through a unified data-access module, relieving the application of the burden of managing many different data sources.

Business splitting

To cope with increasingly complex business scenarios, a large website uses divide and conquer: the whole site's business is split into different product lines, such as home page, payments, orders, sellers, and buyers, each assigned to a different business team.

Technically the site is likewise split along product lines into many different applications, each deployed independently. The applications can be related through hyperlinks, distribute data to one another through message queues, or, most commonly, form one connected, complete system by accessing the same storage systems.

Distributed services

As the business units are split smaller and smaller and the storage systems grow larger and larger, the overall complexity of the application system increases exponentially and deployment and maintenance become difficult. Since every application has to connect to every database system, the number of connections grows as the square of the number of servers, exhausting database connection resources.

Since many sub-applications need to perform the same business operations, such as user management and product management, these common services can be extracted and deployed independently. These reusable services connect to the databases and provide shared services; the application systems then only need to invoke the common services through distributed service calls to complete their own business operations.

Python_MongoDB

Posted on 2015-10-16 | In Python | Comments:

Basic operations

1. Install PyMongo (MongoDB's Python driver)
pip install pymongo

2.Import pymongo

from pymongo import MongoClient

3.Create a connection

client = MongoClient()
With no arguments, this connects by default to the mongod instance running on local port 27017, MongoDB's default service port.

client = MongoClient("mongodb://mongodb0.example.net:27019")

4.Access Database Objects
Obtain a database instance; a database is simply a group of MongoDB collections. You can name a database that does not exist yet and insert data into it, which implicitly creates the database and the corresponding collection.

db = client.noexist
db.blog.insert({"title": "this is the first blog"})

A database instance can also be obtained dictionary-style:
db = client['primer']

5.Access Collection Objects

A collection corresponds to a table in a traditional database and stores a set of documents with no fixed schema. A document corresponds to a row, but since there is no fixed schema, keys can be omitted and new keys can be added at any time.

table = db.blog
table = db['blog']
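As an illustration of how `find()`-style filters match documents, here is a pure-Python sketch of MongoDB's equality matching, so it runs without a mongod instance (with a live server the equivalent call is simply `db.blog.find({"title": "..."})`):

```python
# Pure-Python sketch of MongoDB-style equality filtering; no server needed.
docs = [
    {"title": "this is the first blog"},
    {"title": "second post", "tags": ["python"]},  # schema-less: extra keys are fine
]

def find(collection, query):
    # a document matches when every key/value pair in the query matches
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]
```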

Python_virtualenv Sandbox

Posted on 2015-10-11 | In Python | Comments:

I recently needed Python sandbox environments during development to isolate different programs, so this post records what I learned about virtualenv, Python's sandboxing tool; see the virtualenv article and the official virtualenv documentation.

When developing several applications at the same time, they may need different versions of the same module; for example, program A needs Jinja 2.6 while program B needs Jinja 2.7. By default, every third-party package installed through pip goes into Python's site-packages directory, which leads to conflicts.

virtualenv can be installed with the following command:

sudo pip install virtualenv

With the virtualenv module we can easily set up an independent Python runtime environment for each application. Assuming we have already switched into project A's directory, a new environment can be created with:

virtualenv --no-site-packages env

The --no-site-packages flag means the new environment will not inherit any of the libraries already installed in the default environment; the --python flag can additionally select the Python version for the environment.

The command above creates an env folder in the current directory, recording the environment's settings and third-party libraries. Enter the environment by running:

source env/bin/activate

After that, any third-party packages you install as usual will only affect this env environment, not the system Python environment.

To leave the env environment, use:

deactivate

virtualenv works by copying the system Python into the virtualenv environment. When you enter a virtualenv environment with source env/bin/activate, virtualenv modifies the relevant environment variables so that the python and pip commands both point to the current virtualenv environment. See the virtualenv reference.

DBpedia Spotlight

Posted on 2015-10-08 | In research | Comments:

DBpedia Spotlight

Introduction

DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes that names of concepts or entities have been mentioned (e.g. “Michael Jordan”), and subsequently matches these names to unique identifiers (e.g. dbpedia:Michael_I._Jordan, the machine learning professor or dbpedia:Michael_Jordan the basketball player). It can also be used for building your solution for Named Entity Recognition, Keyphrase Extraction, Tagging, etc. amongst other information extraction tasks.

Text annotation has the potential of enhancing a wide range of applications, including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.

Take a look at our Known Uses page for other examples of how DBpedia Spotlight can be used. If you use DBpedia Spotlight in your project, please add a link to http://spotlight.dbpedia.org. If you use it in a paper, please use the citation available here.

You can try out DBpedia Spotlight through our Web Application or Web Service endpoints. The Web Application is a user interface that allows you to enter text in a form and generates an HTML annotated version of the text with links to DBpedia. The Web Service endpoints provide programmatic access to the demo, allowing you to retrieve data also in XML or JSON.

Demo web application.

Glossary

  • Context: the context refers to the “the parts of something written or spoken that immediately precede and follow a word or passage and clarify its meaning.”

  • OntologyClass: an ontology class represents a set of resources sharing similar characteristics. Resources can be of several types: Person, Organisation, Location, FloweringPlant, etc. All of these classes are organized in a domain model (i.e. schema, ontology). The “type” or the “ontology class” of a resource comes from this ontology.

  • Phrase Recognition: See Spotting.

  • Resource: a resource is any entity or concept in our target knowledge base (e.g. DBpedia). We take this name from RDF (Resource Description Framework), as a generic name for things, concepts, ideas “that can be identified on the Web, even when they cannot be directly retrieved on the Web.”

  • Spotting: We call Spotting or Phrase Recognition the task of selecting, from some textual document given as input, phrases that should be annotated by the system. This is closely related to Keyphrase Extraction and Named Entity Recognition, for instance. In Keyphrase Extraction, the system tries to guess the “important” phrases, according to some definition of importance. Meanwhile, in Named Entity Recognition, the system focuses on specific entity types (commonly Person, Location and Organization), and the notion of importance is usually irrelevant. We describe some of these and several other strategies for phrase recognition below.

  • SurfaceForm: a surface form is the phrase used to refer to a resource in text. For example: “Barack Obama”, “President Obama” and “Obama” are all surface forms for the resource dbpedia:Obama.

  • Token: each individual element extracted after tokenizing the text more. Tokens are the individual words in the context, or slightly modified versions of these words (e.g. running -> run)

  • Topic: a topic is a broad categorization of knowledge into areas of interest. For example, text can belong to Business, Politics, Sports or Arts topics.

User’s manual

DBpedia Spotlight is a tool for annotating mentions of DBpedia concepts in plain text.

We offer three basic functions: Annotate, Disambiguate and Candidates (Best K). They can be accessed from a Scala/Java API, REST Web Service and from a user interface on the Web (HTML/Javascript). For the Scala/Java API, there are a number of configuration parameters that can be used to instruct the annotation and disambiguation functions. The classes DefaultAnnotator, DefaultDisambiguator and DefaultParagraphDisambiguator offer the configuration that we found to provide the best results. The configuration interface offers ways to control the quality of the output of the two above tasks.

Architecture

The DBpedia Spotlight architecture is composed of the following modules:

  • Web application, a demonstration client (HTML/Javascript interface) that allows users to enter/paste text into a Web browser and visualize the resulting annotated text.
  • Web Service, a RESTful/SOAP? Web API that exposes the functionality of annotating and/or disambiguating entities in text.
  • Annotation Java/Scala API, exposing the underlying logic that performs the annotation/disambiguation.
  • Indexing Java/Scala API, executing the data processing necessary to enable the annotation/disambiguation algorithms used.
  • Evaluation module, where we test disambiguators, log results and use those to train our system to perform better.

External dependencies:

  • DBpedia Extraction Framework, (only for the index module) extracting the necessary data from the Wikipedia dumps.
  • Lucene 2.9.3, providing the low level indexing framework used by DBpedia Spotlight.
  • LingPipe 4.0.0, providing the string matching implementation used for the Spotter module.

System Requirements

  • Java 1.6+
  • Scala 2.9+
  • Spotlight JAR
  • Spotlight Library JARs
  • Lucene disambiguation index
  • Spotter dictionary
  • large RAM to set the heap size big enough for the Spotter (approx. 8G)
  • Maven 3 for the automagic installation of dependencies.

Programmatic usage

If you want to use DBpedia Spotlight in your Java/Scala code, take a look at core/SpotlightFactory to see how you can create your objects, and then look at rest/Candidates.java to see how you can wire them together.

Online Usage

Refer to the User's manual.

Content Negotiation

You can request different types of output by setting the Accept request header. For example, in order to request JSON output, you can add Accept:application/json to the request headers.

One example using cURL:

curl "http://spotlight.dbpedia.org/rest/annotate?text=President%20Michelle%20Obama%20called%20Thursday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20"\
-H "Accept:application/json"

The content types we currently support are:

  • text/html
  • application/xhtml+xml
  • text/xml
  • application/json

The application/xhtml+xml comes with embedded RDFa that you can give to the RDFa Distiller and get RDF triples in Turtle, RDF+XML, etc. as output.

If your input text is long, you may prefer using POST instead of GET.

curl -i -X POST \
-H "Accept:application/json" \
-H "content-type:application/x-www-form-urlencoded" \
-d "disambiguator=Document&confidence=-1&support=-1&text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package" \
http://spotlight.dbpedia.org/rest/annotate/

Please note that you must use content-type application/x-www-form-urlencoded for POST requests.
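The same requests can of course be built from Python; the sketch below only constructs the GET URL with the standard library (the helper name is made up; actually sending the request would additionally need `urllib.request` and the live endpoint):

```python
from urllib.parse import urlencode

# Build the GET URL for the /rest/annotate endpoint. The endpoint and
# parameter names come from the examples above; build_annotate_url is
# just an illustrative helper.
def build_annotate_url(text, confidence, support):
    base = "http://spotlight.dbpedia.org/rest/annotate"
    return base + "?" + urlencode(
        {"text": text, "confidence": confidence, "support": support})

url = build_annotate_url(
    "President Obama called Wednesday on Congress", 0.2, 20)
# to get JSON output, send it with the header Accept: application/json
```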

The following are four examples; each consists of a query URL and its result.

Example 1: without type restriction
http://spotlight.dbpedia.org/rest/annotate?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20

returns the XML

<Annotation text="President Obama called Wednesday on Congress to extend a tax break
for students included in last year's economic stimulus package, arguing that the policy
provides more generous assistance."
confidence="0.2" support="20">
<Resources>
<Resource URI="http://dbpedia.org/resource/Barack_Obama"
support="5761" types="Person,Politician,President" surfaceForm="President Obama" offset="0"
similarityScore="0.31504717469215393" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/United_States_Congress"
support="8569" types="Organisation,Legislature" surfaceForm="Congress" offset="36"
similarityScore="0.2348192036151886" percentageOfSecondRank="0.8635579006818564"/>
<Resource URI="http://dbpedia.org/resource/Tax_break"
support="32" types="" surfaceForm="tax break" offset="57"
similarityScore="0.35041093826293945" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/Student"
support="1701" types="" surfaceForm="students" offset="71"
similarityScore="0.32534149289131165" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/Policy"
support="557" types="" surfaceForm="policy" offset="148"
similarityScore="0.3228176236152649" percentageOfSecondRank="-1.0"/>
</Resources>
</Annotation>
Example 2: with type restriction
http://spotlight.dbpedia.org/rest/annotate?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20&types=Person,Organisation

returns the XML

<Annotation text="President Obama called Wednesday on Congress to extend a tax break
for students included in last year's economic stimulus package, arguing that the policy
provides more generous assistance."
confidence="0.2" support="20" types="Person,Organisation">
<Resources>
<Resource URI="http://dbpedia.org/resource/Barack_Obama"
support="5761" types="Person,Politician,President" surfaceForm="President Obama" offset="0"
similarityScore="0.31504717469215393" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/United_States_Congress"
support="8569" types="Organisation,Legislature" surfaceForm="Congress" offset="36"
similarityScore="0.2348192036151886" percentageOfSecondRank="0.8635579006818564"/>
</Resources>
</Annotation>
Example 3: with SPARQL restriction
http://spotlight.dbpedia.org/rest/annotate?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20&sparql=SELECT+DISTINCT+%3Fx%0D%0AWHERE+%7B%0D%0A%3Fx+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FOfficeHolder%3E+.%0D%0A%3Fx+%3Frelated+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FChicago%3E+.%0D%0A%7D

returns the XML

<Annotation text="President Obama called Wednesday on Congress to extend a tax break
for students included in last year's economic stimulus package, arguing that the policy
provides more generous assistance."
confidence="0.2" support="20"
sparql="SELECT DISTINCT ?x WHERE { ?x a <http://dbpedia.org/ontology/OfficeHolder>; .
?x ?related <http://dbpedia.org/resource/Chicago>; }"
policy="whitelist">
<Resources>
<Resource URI="http://dbpedia.org/resource/Barack_Obama"
support="5761" types="Person,Politician,President" surfaceForm="President Obama" offset="0"
similarityScore="0.2730408310890198" percentageOfSecondRank="-1.0"/>
</Resources>
</Annotation>
Example 4: Candidates Interface

The parameters are the same as in Example 1, but you will send your request to http://spotlight.dbpedia.org/rest/candidates

http://spotlight.dbpedia.org/rest/candidates?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20

returns XML

<annotation text="President Obama on Monday will call for a new minimum tax rate for individuals making more than $1 million a year to ensure that they pay at least the same percentage of their earnings as other taxpayers, according to administration officials. ">
<surfaceForm name="individuals" offset="67">
<resource label="Individual" uri="Individual" contextualScore="0.26683980226516724" percentageOfSecondRank="-1.0" support="312" priorScore="0.0" finalScore="0.26683980226516724"/>
<resource label="The Individuals (New Jersey band)" uri="The_Individuals_%28New_Jersey_band%29" contextualScore="0.011762913316488266" percentageOfSecondRank="-1.0" support="17" priorScore="0.0" finalScore="0.011762913316488266"/>
<resource label="The Individuals (Chicago band)" uri="The_Individuals_%28Chicago_band%29" contextualScore="0.0" percentageOfSecondRank="-1.0" support="0" priorScore="0.0" finalScore="0.0"/>
</surfaceForm>
<surfaceForm name="officials" offset="233">
<resource label="Official" uri="Official" contextualScore="0.1324356347322464" percentageOfSecondRank="-1.0" support="196" priorScore="0.0" finalScore="0.1324356347322464"/>
<resource label="Rugby league match officials" uri="Rugby_league_match_officials" contextualScore="0.04376954212784767" percentageOfSecondRank="-1.0" support="9" priorScore="0.0" finalScore="0.04376954212784767"/>
</surfaceForm>
<surfaceForm name="President Obama" offset="0">
<resource label="Presidency of Barack Obama" uri="Presidency_of_Barack_Obama" contextualScore="0.5634340643882751" percentageOfSecondRank="-1.0" support="134" priorScore="0.0" finalScore="0.5634340643882751"/>
</surfaceForm>
<surfaceForm name="1 million" offset="97">
<resource label="Million" uri="Million" contextualScore="0.527919590473175" percentageOfSecondRank="-1.0" support="492" priorScore="0.0" finalScore="0.527919590473175"/>
</surfaceForm>
<surfaceForm name="percentage" offset="156">
<resource label="Percentage" uri="Percentage" contextualScore="0.6362485885620117" percentageOfSecondRank="-1.0" support="165" priorScore="0.0" finalScore="0.6362485885620117"/>
</surfaceForm>
<surfaceForm name="earnings" offset="176">
<resource label="Income" uri="Income" contextualScore="0.5776156187057495" percentageOfSecondRank="-1.0" support="648" priorScore="0.0" finalScore="0.5776156187057495"/>
</surfaceForm>
<surfaceForm name="taxpayers" offset="194">
<resource label="Tax" uri="Tax" contextualScore="0.7484055757522583" percentageOfSecondRank="-1.0" support="1540" priorScore="0.0" finalScore="0.7484055757522583"/>
<resource label="TaxPayers&apos; Alliance" uri="TaxPayers%27_Alliance" contextualScore="0.12765906751155853" percentageOfSecondRank="-1.0" support="15" priorScore="0.0" finalScore="0.12765906751155853"/>
<resource label="The Taxpayer (Luxembourg)" uri="The_Taxpayer_%28Luxembourg%29" contextualScore="0.024930020794272423" percentageOfSecondRank="-1.0" support="3" priorScore="0.0" finalScore="0.024930020794272423"/>
<resource label="The Taxpayers" uri="The_Taxpayers" contextualScore="0.0" percentageOfSecondRank="-1.0" support="0" priorScore="0.0" finalScore="0.0"/>
</surfaceForm>
</annotation>

Installation

Refer to Installation.

Web service

This page gives an introduction on how to use the DBpedia Spotlight Web Service. The available service endpoints are listed below and described in more details in the User’s Manual.

Spotting

Spotting: takes text as input and recognizes entities/concepts to annotate. Several spotting techniques are available, such as dictionary lookup and Named Entity Recognition (NER).

Disambiguate

Disambiguation: takes spotted text input, where entities/concepts have already been recognized and marked as wiki markup or xml. Chooses an identifier for each recognized entity/concept given the context.

Supported types (POST/GET): XML, JSON, HTML, RDFa, NIF

Annotate

Annotation: runs spotting and disambiguation. Takes text as input, recognizes entities/concepts to annotate and chooses an identifier for each recognized entity/concept given the context.

Supported types (POST/GET): XML, JSON, HTML, RDFa, NIF

Candidates

Similar to annotate, but returns a ranked list of candidates instead of deciding on one. The list contains several properties, described below:

  • support: how prominent is this entity, i.e. number of inlinks in Wikipedia;
  • priorScore: normalized support;
  • contextualScore: score from comparing the context representation of an entity with the text (e.g. cosine similarity with tf-icf weights);
  • percentageOfSecondRank: measures by how much the winning entity has won, taking contextualScore_2ndRank / contextualScore_1stRank, which means the lower this score, the further the first-ranked entity was “in the lead”;
  • finalScore: a combination of all of them.

Supported types (POST/GET): XML, JSON
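To make the relationship between these scores concrete, here is a small sketch that ranks made-up candidates by `finalScore` and derives `percentageOfSecondRank` as defined above (the numbers are illustrative, not real service output):

```python
# Made-up candidate list for one surface form; in a real response these
# values come from the /rest/candidates output.
candidates = [
    {"uri": "TaxPayers_Alliance", "finalScore": 0.128},
    {"uri": "Tax", "finalScore": 0.748},
    {"uri": "The_Taxpayers", "finalScore": 0.0},
]

ranked = sorted(candidates, key=lambda c: c["finalScore"], reverse=True)

# percentageOfSecondRank = runner-up score / winner score:
# the lower it is, the further the winner was "in the lead"
second_rank_pct = ranked[1]["finalScore"] / ranked[0]["finalScore"]
```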

Examples

Example 1: Simple request
  • text= “President Obama called Wednesday on Congress to extend a tax break for students included in last year’s economic stimulus package, arguing that the policy provides more generous assistance.”
  • confidence = 0.2; support=20
  • whitelist all types.
curl http://spotlight.dbpedia.org/rest/annotate \
--data-urlencode "text=President Obama called Wednesday on Congress to extend a tax break
for students included in last year's economic stimulus package, arguing
that the policy provides more generous assistance." \
--data "confidence=0.2" \
--data "support=20"

Example 2: Using SPARQL for filtering

This example demonstrates how to keep the annotations constrained to only politicians related to Chicago.

  • text= “President Obama called Wednesday on Congress to extend a tax break for students included in last year’s economic stimulus package, arguing that the policy provides more generous assistance.”
  • confidence = 0.2; support=20
  • whitelist sparql = SELECT DISTINCT ?politician WHERE { ?politician a <http://dbpedia.org/ontology/OfficeHolder> . ?politician ?related <http://dbpedia.org/resource/Chicago> }
curl http://spotlight.dbpedia.org/rest/annotate \
--data-urlencode "text=President Obama called Wednesday on Congress to extend a tax break
for students included in last year's economic stimulus package, arguing
that the policy provides more generous assistance." \
--data "confidence=0.2" \
--data "support=20" \
--data-urlencode "sparql=SELECT DISTINCT ?x WHERE { ?x a <http://dbpedia.org/ontology/OfficeHolder> . ?x ?related <http://dbpedia.org/resource/Chicago> . }"

Notice: Due to system resource restrictions, for this demo we only use the first 2000 results returned for each query (the default for the public DBpedia SPARQL endpoint). However, you are welcome to download the software and data and install them on your own server for real-world use cases.

Attention: Make sure to encode your SPARQL query before adding it as the value of the sparql parameter - see java.net.URLEncoder.encode().


Run from a JAR

This page describes how to run DBpedia Spotlight on your own server using a pre-packaged JAR. We assume that you are running these commands on a bash command line (Linux) and have wget, curl and java installed.

Requirements

  • Java 1.6+
  • RAM of appropriate size for the spotter lexicon you need

Quickstart

The commands below will help you to obtain a pre-packaged lightweight deployment to get you started.

Lucene:
wget http://spotlight.dbpedia.org/download/release-0.6/dbpedia-spotlight-quickstart-0.6.5.zip
unzip dbpedia-spotlight-quickstart-0.6.5.zip
cd dbpedia-spotlight-quickstart-0.6.5/
./run.sh

Older jars are downloadable from: https://github.com/dbpedia-spotlight/dbpedia-spotlight/downloads

Statistical:
wget http://spotlight.sztaki.hu/downloads/version-0.1/en.tar.gz
wget http://spotlight.sztaki.hu/downloads/version-0.1/dbpedia-spotlight.jar
tar xvf en.tar.gz
java -jar dbpedia-spotlight.jar /data/spotlight/en/model_en http://localhost:2222/rest

Test your installation

In order to test your new installation, run:

curl http://localhost:2222/rest/annotate \
-H "Accept: text/xml" \
--data-urlencode "text=Brazilian state-run giant oil company Petrobras signed a three-year technology and research cooperation agreement with oil service provider Halliburton." \
--data "confidence=0" \
--data "support=0"

Now you can study more about how to call your newly installed Web Service, which parameters are accepted, etc. here.

Upgrade your models

Lucene:

The files you’ve downloaded above contain only a very small subset of the DBpedia resources. They are used to demonstrate DBpedia Spotlight in a lightweight environment. Please see our Downloads for more information on other alternatives that are more useful in real world scenarios. See below one example.

First rename your small model files:

mv data/index data/index-small
mv data/spotter.dict data/spotter-small.dict

Now obtain new copies with larger models:

cd data
wget http://spotlight.dbpedia.org/download/release-0.5/context-index-compact.tgz
tar zxvf context-index-compact.tgz
mv index-withSF-withTypes-compressed index
wget http://spotlight.dbpedia.org/download/release-0.4/surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary.gz
gunzip surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary.gz
mv surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary spotter.dict

If you are using the largest spotter dict, you may need to increase the java heap space — e.g. -Xmx10G in your command line.

Statistical:

We offer only the complete model with this option. You can download the newest models from http://spotlight.sztaki.hu/downloads/

Two backend versions

Statistical backend

Refer to Statistical backend.

Lucene backend

Refer to Lucene backend.

Retrieval & Clustering Evaluation

Posted on 2015-10-08 | In research | Comments:

Retrieval evaluation

Evaluation Criteria of Unranked Retrieval

-              Retrieved             Not retrieved
Relevant       TP (true positive)    FN (false negative)
Non-relevant   FP (false positive)   TN (true negative)

Precision and Recall

$Precision = \frac{TP}{TP+FP}, \quad Recall = \frac{TP}{TP+FN}$

Precision-Recall Curve

Precision and recall are usually negatively correlated: the higher the recall, the lower the precision. The precision-recall curve plots recall on the x-axis against precision on the y-axis, showing how the two trade off.

F-Score

A measure that combines precision and recall:

$F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2+1)PR}{\beta^2 P + R}, \quad \beta^2 = \frac{1-\alpha}{\alpha}$

We usually take $\beta=1$, i.e. $\alpha=0.5$, which gives the harmonic mean, written $F_1$; taking $\beta=1,2,3,\dots$ gives $F_1, F_2, F_3, \dots$:

$F_1 = \frac{2PR}{P+R}$
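The F-score is easy to transcribe into code; a direct implementation of the standard $F_\beta$ definition:

```python
# F_beta combines precision p and recall r; beta = 1 gives the
# harmonic mean F1, larger beta weights recall more heavily.
def f_beta(p, r, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

For example, with precision 1.0 but recall 0.5, F1 is only 2/3, showing how the harmonic mean punishes the weaker of the two.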

Precision for ranked retrieval

P@k and R@k

P@k and R@k denote the precision and recall within the top k results; for example, P@10 is the precision of the top 10 results.

MRR

MRR stands for mean reciprocal rank. The reciprocal rank is the reciprocal of the rank position of the first correct result for a query, i.e. $RR=\frac{1}{rank}$. MRR is the mean RR over a set of queries Q:

$MRR = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$

Wikipedia gives a good example:

query   results                 correct response   rank   reciprocal rank
cat     catten, cati, cats      cats               3      1/3
torus   toril, tori, toruses    tori               2      1/2
virus   viruses, virii, viri    viruses            1      1

$MRR=\frac{\frac 13 + \frac 12 + 1}{3} \approx 0.61$
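The example can be reproduced in a couple of lines:

```python
# Ranks of the first correct result for the three example queries
# (cat -> 3, torus -> 2, virus -> 1).
ranks = [3, 2, 1]
mrr = sum(1.0 / r for r in ranks) / len(ranks)
```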

MAP

Mean average precision. The MRR above is appropriate when only the first correct result matters; if there are multiple correct results and all of them count, MAP should be used instead.

Suppose query q has m correct documents (or query suggestions, etc.).

$RankSet_i$ denotes the set of all documents ranked at or above the position of correct document i. For example, if correct document i=3 is ranked 7th, then $RankSet_3$ is documents 1-7, and $P(RankSet_i)$ is the precision of that set. The average precision of query q is

$AP(q) = \frac{1}{m}\sum_{i=1}^{m} P(RankSet_i)$

MAP is the mean of AP over multiple queries:

$MAP = \frac{1}{|Q|}\sum_{q \in Q} AP(q)$
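In code, average precision can be computed from the ranked list of relevance judgments; the example judgments below are made up:

```python
# rels[i] is True when the document at rank i+1 is a correct result.
def average_precision(rels):
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i   # precision of the top-i set at each correct doc
    return total / hits if hits else 0.0

# MAP is then just the mean of average_precision over all queries.
ap = average_precision([True, False, True, False])
```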

DCG and nDCG

DCG:Discounted cumulative gain
nDCG:Normalized Discounted cumulative gain

Each document is assigned a gain score, and we want higher-scoring documents to be ranked earlier. Suppose a ranking has gains $G=[5,2,3,0,10,3]$.

The cumulative gain can be computed as:

$CG_p = \sum_{i=1}^{p} G_i$

Adding a discount factor that penalizes lower positions more heavily:

$DCG_p = \sum_{i=1}^{p} \frac{G_i}{\log_2(i+1)}$

There exists an ideal ordering $G' = [10,5,3,3,2,0]$ that maximizes the DCG score; its DCG is called the ideal DCG (IDCG).

Normalizing the actual ranking's DCG by this maximal score gives the nDCG:

$nDCG_p = \frac{DCG_p}{IDCG_p}$
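The example ranking above can be scored directly, using the $\log_2(i+1)$ discount:

```python
from math import log2

# DCG with the log2(i+1) discount; nDCG divides by the DCG of the
# ideal (descending) ordering of the same gains.
def dcg(gains):
    return sum(g / log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    return dcg(gains) / dcg(sorted(gains, reverse=True))

score = ndcg([5, 2, 3, 0, 10, 3])
```

Placing the gain-10 document down at position 5 costs a lot of discounted gain, so this ranking scores well below the ideal ordering's 1.0.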

Other Criteria of Effective IR

  • diversity
  • credibility
  • comprehensibility

Clustering Evaluation

See Evaluation of clustering.

© 2014 – 2019 Plusaber