Lucene StopWords

Lucene is an open-source, high-performance full-text search engine, with libraries for almost all well-known languages (Java, C#, PHP, Python, C). This post describes stopwords in a full-text search engine, using Lucene as the example.

(Lucene can also be used to index database table rows. The advantage of using Lucene instead of the database software’s built-in full-text search engine is that Lucene ranks search results by relevance. For example, if you have a product table with <title, description> fields, you can give matches in ‘title’ a higher rank than matches in ‘description’.)

A stopword is a word that carries no significant meaning in a keyword-based search system (e.g. Google). Lucene ships with a set of such words for English, and they are simply ignored while analyzing/tokenizing text. You can find them under org/apache/lucene/analysis/, declared as the StopAnalyzer.ENGLISH_STOP_WORDS constant.

public static final String[] ENGLISH_STOP_WORDS = {
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
};
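
To make "ignored while analyzing/tokenizing" concrete, here is a minimal plain-Java sketch of the idea. It is not Lucene's analyzer code: the class name StopWordFilterSketch, the tokenize helper and the simple split-on-non-alphanumerics rule are all assumptions made for illustration. Only the word list comes from the constant above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative model of stopword removal; NOT Lucene's implementation.
class StopWordFilterSketch {

    // Same words as the ENGLISH_STOP_WORDS list shown above.
    static final Set<String> ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Lowercase the text, split on anything that is not a letter or digit,
    // and keep only the tokens that are not stopwords.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && !ENGLISH_STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, box]
        System.out.println(tokenize("The quick brown fox is in the box"));
    }
}
```

Note that a phrase made up entirely of stopwords, such as "to be or not to be", produces no tokens at all in this sketch.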

You can also specify your own stopwords while indexing text. Pass a set of words as an argument to StandardAnalyzer’s constructor; these words will be ignored while indexing.
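
The exact StandardAnalyzer constructor signature differs between Lucene versions, so rather than guess at one, the sketch below models the effect in plain Java: a tiny inverted index that skips a caller-supplied stopword set at indexing time. The class CustomStopWordIndexSketch and its add and search methods are hypothetical names, not part of Lucene.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative model of indexing with custom stopwords; NOT Lucene's API.
class CustomStopWordIndexSketch {

    private final Set<String> stopWords;  // expected to be lowercase
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    CustomStopWordIndexSketch(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Index a document: stopwords never reach the postings map.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty() || stopWords.contains(term)) {
                continue;
            }
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the ids of the documents that contain the term.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        Set<String> custom = new HashSet<>(Arrays.asList("lucene", "search"));
        CustomStopWordIndexSketch index = new CustomStopWordIndexSketch(custom);
        index.add(1, "Lucene search engine");
        index.add(2, "Database search with the engine");

        System.out.println(index.search("engine")); // prints [1, 2]
        System.out.println(index.search("search")); // prints [] because "search" was never indexed
        System.out.println(index.search("the"));    // prints [2] because "the" is not in the custom set
    }
}
```

In this sketch the custom set replaces the default list instead of extending it, which is why "the" ends up indexed. If you want both, merge your words with the English list before passing the set in.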

– ankit
