<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ankit Jain &#187; lucene</title>
	<atom:link href="http://ankitjain.info/ankit/tag/lucene/feed/" rel="self" type="application/rss+xml" />
	<link>http://ankitjain.info/ankit</link>
	<description>» It’s all about Ankit and Web! «</description>
	<lastBuildDate>Thu, 02 Jun 2011 16:54:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Lucene StopWords</title>
		<link>http://ankitjain.info/ankit/2009/05/27/lucene-search-ignore-word-list/</link>
		<comments>http://ankitjain.info/ankit/2009/05/27/lucene-search-ignore-word-list/#comments</comments>
		<pubDate>Tue, 26 May 2009 20:17:10 +0000</pubDate>
		<dc:creator>Ankit</dc:creator>
				<category><![CDATA[Programming/Code]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://ankitjain.info/ankit/?p=314</guid>
		<description><![CDATA[Lucene is an open-source, high-performance full-text search engine and has libraries for almost all well-known languages (Java, C#, PHP, Python, C). This post describes stopwords in a full-text search engine (Lucene). (Lucene can also be used to index database table rows. The advantage of using Lucene search instead of (database [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucene.apache.org/" target="_blank">Lucene</a> is an open-source, high-performance full-text search engine and has libraries for almost all well-known languages (Java, C#, PHP, Python, C). This post describes stopwords in a full-text search engine (Lucene).</p>
<p>(Lucene can also be used to index database table rows. The advantage of using Lucene instead of the database software&#8217;s built-in full-text search engine is that Lucene ranks search results by relevance. For example, assume you have a product table with &lt;title, description&gt; fields and you want matches in &#8216;title&#8217; to rank higher than matches in &#8216;description&#8217;.)</p>
<p>A <strong>stopword</strong> is a word that carries no significant meaning in a keyword-based search system (e.g. Google). Lucene ships with a set of such words for English, and they are simply ignored while analyzing/tokenizing text. You can find them in the <code>org/apache/lucene/analysis/StopAnalyzer.java</code> file, declared as the <code>StopAnalyzer.ENGLISH_STOP_WORDS</code> constant.</p>
<blockquote><p><code>public static final String[] ENGLISH_STOP_WORDS = {<br />
    "a", "an", "and", "are", "as", "at", "be", "but", "by",<br />
    "for", "if", "in", "into", "is", "it",<br />
    "no", "not", "of", "on", "or", "such",<br />
    "that", "the", "their", "then", "there", "these",<br />
    "they", "this", "to", "was", "will", "with"<br />
  };</code></p></blockquote>
<p>You can also specify your own <strong>stopwords</strong> when indexing text. Use <a href="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardAnalyzer.html#StandardAnalyzer(java.util.Set)"  target="_blank">StandardAnalyzer&#8217;s constructor</a> and pass a set of words as the argument. These words will be ignored while indexing.</p>
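<p>To make the effect of a stopword list concrete, here is a minimal plain-Java sketch of what dropping stopwords does to a piece of text. It is only an illustration and does not use or reproduce Lucene code: the class <code>StopWordSketch</code> and the methods <code>isStopWord</code> and <code>removeStopWords</code> are hypothetical names, and the tokenizing rule (lowercase, split on anything that is not a letter or digit) is an assumption that is much simpler than what a real analyzer does.</p>

```java
import java.util.Locale;

// Hypothetical illustration only; this is NOT Lucene code.
class StopWordSketch {

    // The default English list quoted from StopAnalyzer above.
    static final String[] ENGLISH_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"
    };

    // Returns true if the token appears in the given stopword list.
    static boolean isStopWord(String token, String[] stopWords) {
        for (String word : stopWords) {
            if (word.equals(token)) {
                return true;
            }
        }
        return false;
    }

    // Lowercases the text, splits it on runs of non-alphanumeric characters
    // (an assumed, simplified tokenizer) and drops every stopword.
    // The surviving tokens are returned joined by single spaces.
    static String removeStopWords(String text, String[] stopWords) {
        StringBuilder kept = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || isStopWord(token, stopWords)) {
                continue;
            }
            if (kept.length() != 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox is in the garden";

        // Default English list: prints "quick brown fox garden"
        System.out.println(removeStopWords(text, ENGLISH_STOP_WORDS));

        // Custom list: prints "the fox is in the garden"
        String[] custom = {"quick", "brown"};
        System.out.println(removeStopWords(text, custom));
    }
}
```

<p>The second call shows why a custom list matters: once you supply your own words, only those are dropped, so common words such as "the" are kept unless you add them yourself.</p>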
<p>- ankit</p>
]]></content:encoded>
			<wfw:commentRss>http://ankitjain.info/ankit/2009/05/27/lucene-search-ignore-word-list/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

