分词器和过滤器

分词器和过滤器

Posted by dym on February 5, 2014

本文迁移自老博客,原始链接为 https://seven.blog.ustc.edu.cn/%e5%88%86%e8%af%8d%e5%99%a8%e5%92%8c%e8%bf%87%e6%bb%a4%e5%99%a8/

lucene内置的标准分词器可以对英文和中文进行分词,由于英文可以直接根据空格进行分词,所以lucene直接使用空格对英文进行分词就可以达到目的;lucene内置的中文分词器不大智能,将汉字两两的进行划分,使用这种方法可能会破坏句子的原意。 我使用的是开源的中文分词包MMAnalyzer,网上找了一下这个分词包的大概优点是: 1、支持英文、数字、中文(简体)混合分词 2、常用的数量和人名的匹配 3、超过22万词的词库整理 4、实现正向最大匹配算法 5、词典的动态扩展 6、分词效率: 第一次分词需要1-2秒(读取词典),之后速度基本与Lucene自带分词器持平。 MMAnalyzer分词器的具体使用方法: Analyzer analyzer = new MMAnalyzer(); 然后用这个analyzer的analyzer.tokenStream()来对某一个字符串进行分词 分词技术对于搜索引擎的开发是非常关键的,因为分词的粒度决定了索引如何建立和如何对用户输入的关键词句进行切分,如果切分的不得当,会导致所建立的索引是不准确的,而且对用户的搜索词句的语义也把握不准。另外,索引阶段和搜索阶段使用的分词器一定要相同,这是显而易见的,否则会很不准确。 数学之美中说过:中文分词以统计语言模型为基础,经过几十年的发展与完善,今天基本上可以看做一个已经解决的问题。书中提到了分词的层次概念,即使用基本词表和复合词表以及各自对应的语言模型。首先使用基本词表和语言模型对句子进行小颗粒度的分词,然后使用复合词表以及对应的语言模型进行二次分词。这应该是目前最好的方式。

lucene的过滤器是基于布尔运算的,具体就是使用了java的位操作,看了lucene的源代码,发现其实就是对BitSet进行操作,于是可以仿照源代码写一些过滤功能。

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;
public class yinyueFilter extends Filter{ 
    public BitSet bits(IndexReader reader) throws IOException {  
	BitSet bitSet = new BitSet(reader.maxDoc());
	bitSet.set(0, bitSet.size()-1,false);//设置位队列的每一位都问TRUE
	Term term = new Term("kind","mp3");
	TermDocs termDocs = reader.termDocs(term);//从索引中取出含term的文档
	while(termDocs.next()){
	    bitSet.set(termDocs.doc(), true);
        }	 
	Term term1=new Term("kind","wma");
	TermDocs termDocs1 = reader.termDocs(term1);
        while(termDocs1.next()){
	    bitSet.set(termDocs1.doc(), true);
        }	 
        return bitSet;
        }
}

调用时可以使用如下方法:

yinyueFilter filter = new yinyueFilter();
hits = searcher.search(multiKeys,filter);

对于搜索引擎中的高级搜索,也就是定制的搜索,这里可能会用到很多组合的限制条件,其实质就是对二进制位的各种逻辑操作。下面是我自己的简易搜索引擎中高级搜索部分的过滤器实现,使用的还是通过对Bitset对象的逻辑运算来对索引进行组合过滤,这里的过滤条件只粗略的选取了一部分,最终通过将各种过滤条件的位集合进行逻辑与操作得到最终的结果集:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;
class gaojiFilter extends Filter {
	private static final long serialVersionUID = 1L;

	public BitSet bits(IndexReader reader) throws IOException {
		BitSet bitSet = new BitSet(reader.maxDoc());
		if (!author.isEmpty()) {
			System.out.println(author);
			bitSet.set(0, bitSet.size() - 1, false);
			Term term = new Term("author", author);
			TermDocs termDocs = reader.termDocs(term);//从索引中取出含term的文档
			while (termDocs.next()) {
				bitSet.set(termDocs.doc(), true);
			}
		} else if (author.isEmpty()) {
			bitSet.set(0, bitSet.size() - 1, true);
			System.out.print("success");
		}
		BitSet bitSet1 = new BitSet(reader.maxDoc());
		if (!publisher.isEmpty())

		{
			System.out.print(publisher);
			System.out.println("nima");
			bitSet1.set(0, bitSet1.size() - 1, false);
			Term term1 = new Term("publisher", publisher);
			TermDocs termDocs1 = reader.termDocs(term1);
			while (termDocs1.next()) {
				bitSet1.set(termDocs1.doc(), true);
			}
		} else if (publisher.isEmpty()) {
			System.out.print("usb");
			bitSet1.set(0, bitSet1.size() - 1, true);
		}
		BitSet bitSet2 = new BitSet(reader.maxDoc());
		bitSet2.set(0, bitSet2.size() - 1, true);
		if (nokey != null) {
			Term term2 = new Term("keywords", nokey);
			TermDocs termDocs2 = reader.termDocs(term2);
			while (termDocs2.next()) {
				bitSet2.set(termDocs2.doc(), false);
			}
			Term term3 = new Term("title", nokey);
			TermDocs termDocs3 = reader.termDocs(term3);
			while (termDocs3.next()) {
				bitSet2.set(termDocs3.doc(), false);
			}
			Term term4 = new Term("describe", nokey);
			TermDocs termDocs4 = reader.termDocs(term4);
			while (termDocs4.next()) {
				bitSet2.set(termDocs4.doc(), false);
			}
		}
		BitSet bitSet3 = new BitSet(reader.maxDoc());
		bitSet3.set(0, bitSet3.size() - 1, false);// 设置位队列的每一位都问TRUE
		if (kind.equals("picture")) {
			Term term = new Term("kind", "jpg");
			TermDocs termDocs = reader.termDocs(term);
			while (termDocs.next()) {
				bitSet3.set(termDocs.doc(), true);
			}
			Term term1 = new Term("kind", "gif");
			TermDocs termDocs1 = reader.termDocs(term1);
			while (termDocs1.next()) {
				bitSet3.set(termDocs1.doc(), true);
			}
		} else if (kind.equals("document")) {
			Term term = new Term("kind", "doc");
			TermDocs termDocs = reader.termDocs(term);
			while (termDocs.next()) {
				bitSet3.set(termDocs.doc(), true);
			}
			Term term1 = new Term("kind", "xls");
			TermDocs termDocs1 = reader.termDocs(term1);
			while (termDocs1.next()) {
				bitSet3.set(termDocs1.doc(), true);
			}
			Term term2 = new Term("kind", "ppt");
			TermDocs termDocs2 = reader.termDocs(term2);
			while (termDocs2.next()) {
				bitSet3.set(termDocs2.doc(), true);
			}
			Term term3 = new Term("kind", "pdf");
			TermDocs termDocs3 = reader.termDocs(term3);
			while (termDocs3.next()) {
				bitSet3.set(termDocs3.doc(), true);
			}
			Term term4 = new Term("kind", "txt");
			TermDocs termDocs4 = reader.termDocs(term4);
			while (termDocs4.next()) {
				bitSet3.set(termDocs4.doc(), true);
			}
		} else if (kind.equals("mp3")) {
			Term term = new Term("kind", "mp3");
			TermDocs termDocs = reader.termDocs(term);
			while (termDocs.next()) {
				bitSet3.set(termDocs.doc(), true);
			}
			Term term1 = new Term("kind", "wma");
			TermDocs termDocs1 = reader.termDocs(term1);
			while (termDocs1.next()) {
				bitSet3.set(termDocs1.doc(), true);
			}
		} else if (kind.equals("vedio")) {
			Term term = new Term("kind", "mp4");
			TermDocs termDocs = reader.termDocs(term);
			while (termDocs.next()) {
				bitSet3.set(termDocs.doc(), true);
			}
			Term term1 = new Term("kind", "flv");
			TermDocs termDocs1 = reader.termDocs(term1);
			while (termDocs1.next()) {
				bitSet3.set(termDocs1.doc(), true);
			}
			Term term2 = new Term("kind", "rmvb");
			TermDocs termDocs2 = reader.termDocs(term2);
			while (termDocs2.next()) {
				bitSet3.set(termDocs2.doc(), true);
			}
		} else if (kind.equals("all")) {
			bitSet3.set(0, bitSet3.size() - 1, true);
		}
		BitSet finalbitSet = new BitSet(reader.maxDoc());
		for (int i = 0; i < finalbitSet.size(); i++) {
			finalbitSet.set(i,
					bitSet.get(i) && bitSet1.get(i) && bitSet2.get(i)
							&& bitSet3.get(i));
		}
		return finalbitSet;
	}
}

以上的过滤器功能并不是很强,但表明了lucene中过滤器的实现思想,这种思想不止可以用在搜索引擎的搜索结果过滤上,还可以用在其他很多涉及集合操作的实际问题上。