Lucene 4.9: Analyzer

w62268458


Blog category:

lucene

lucene java

Inspecting the tokens an analyzer produces

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class AnalyzerTest {
	@Test
	public void analyzer() throws IOException {
		String text = "小笑话_总统的房间 Room .txt";
		// try-with-resources closes the stream and the analyzer even if analysis throws
		try (Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
				TokenStream tokenStream = analyzer.tokenStream("name", text)) {
			OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);

			tokenStream.reset(); // must be called before the first incrementToken()
			while (tokenStream.incrementToken()) {
				// Dump every attribute of the current token
				System.out.println("token: " + tokenStream.reflectAsString(true));
				System.out.println("token start offset: " + offsetAtt.startOffset());
				System.out.println("token end offset: " + offsetAtt.endOffset());
			}
			tokenStream.end(); // signals that the end of the stream was reached
		}
	}
	
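	/**
	 * Tests analyzer output. The analyzers compared here:
	 * 	WhitespaceAnalyzer splits on whitespace and applies no other normalization to the tokens.
	 * 	SimpleAnalyzer splits on non-letter characters and lowercases the tokens; digits are dropped as a result.
	 * 	StopAnalyzer additionally removes common words such as "a", "the" and "an"; a custom stop-word list can be supplied.
	 * 	StandardAnalyzer is Lucene's built-in standard analyzer; it lowercases tokens and removes stop words and punctuation.
	 * 	CJKAnalyzer handles Chinese, Japanese and Korean text; its results for Chinese are mediocre.
	 * 	SmartChineseAnalyzer handles Chinese somewhat better, but is hard to extend.
	 * @throws IOException if reading from the token stream fails
	 */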
	@Test
	public void testCharTermAttribute() throws IOException {
		String text = "小笑话_总统的房间 Room .txt";
		// Swap in another analyzer to compare its output, for example:
		//   new StandardAnalyzer(Version.LUCENE_4_9)
		//   new CJKAnalyzer(Version.LUCENE_4_9), which needs import org.apache.lucene.analysis.cjk.CJKAnalyzer
		try (Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_4_9);
				TokenStream tokenStream = analyzer.tokenStream("name", text)) {
			CharTermAttribute termAtt = tokenStream.addAttribute(CharTermAttribute.class);

			tokenStream.reset(); // must be called before the first incrementToken()
			while (tokenStream.incrementToken()) {
				System.out.println(termAtt.toString()); // text of the current token
			}
			tokenStream.end();
		}
	}
}
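
To make the differences between the analyzers listed in the Javadoc above more concrete, here is a rough plain-Java sketch of their splitting rules. It does not use Lucene and is not how Lucene implements them. The class name AnalyzerSketch, the method names and the three-word stop list are all made up for illustration, and the bigram method assumes that CJKAnalyzer-style output means overlapping two-character tokens and only handles characters in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnalyzerSketch {
    // Hypothetical stop-word list for illustration only; Lucene ships its own default set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    /** WhitespaceAnalyzer-like rule: split on whitespace, no other normalization. */
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** SimpleAnalyzer-like rule: split on non-letters and lowercase; digits vanish. */
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                // A non-letter ends the current token
                out.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    /** StopAnalyzer-like rule: the simple rule plus stop-word removal. */
    static List<String> stop(String text) {
        List<String> out = simple(text);
        out.removeAll(STOP_WORDS);
        return out;
    }

    /** CJKAnalyzer-like rule (assumed): overlapping two-character tokens over one CJK run. */
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        if (run.length() == 1) {
            out.add(run); // a lone character is kept as a single token
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The Room 101 .txt";
        System.out.println(whitespace(text)); // [The, Room, 101, .txt]
        System.out.println(simple(text));     // [the, room, txt]
        System.out.println(stop(text));       // [room, txt]
        System.out.println(bigrams("总统的房间")); // [总统, 统的, 的房, 房间]
    }
}
```

The bigram output hints at why CJKAnalyzer is only so-so for Chinese: it emits every adjacent character pair, whether or not the pair is a real word, whereas SmartChineseAnalyzer tries to segment actual words.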


2015-01-28 09:35