Tokenizing a Piece of Text with Lucene 3 and IKAnalyzer

Programming / posted by houtizong 3 years ago · 94

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Tokenizes a piece of text.
 * @author Administrator
 */
public class IkAnalyzerWord {

  private String resource;
  private List<String> result = new ArrayList<String>();

  public IkAnalyzerWord(String resource) throws IOException {
    this.resource = resource;
    analyzer();
  }

  private void analyzer() throws IOException {
    Analyzer analyzer = new IKAnalyzer();
    TokenStream ts = analyzer.tokenStream("*", new StringReader(resource));

    // public <A extends Attribute> A addAttribute(Class<A> attClass)
    // The caller must pass in a Class<? extends Attribute> value.
    // This method first checks if an instance of that class is already in this
    // AttributeSource and returns it. Otherwise a new instance is created,
    // added to this AttributeSource and returned.
    ts.addAttribute(TermAttribute.class);

    while (ts.incrementToken()) {
      // public <A extends Attribute> A getAttribute(Class<A> attClass)
      // The caller must pass in a Class<? extends Attribute> value.
      // Returns the instance of the passed in Attribute contained in this
      // AttributeSource.
      TermAttribute ta = ts.getAttribute(TermAttribute.class);

      // term() returns the Token's term text.
      result.add(ta.term());
    }
  }

  public List<String> getResult() {
    return this.result;
  }

  public static void main(String[] args) throws IOException {
    IkAnalyzerWord ik = new IkAnalyzerWord("今天的大风终于小了，但是又起雾了今天的大风终于小了，但是又起雾了");
    System.out.println(ik.getResult());
  }
}
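The comments in the listing describe a get-or-create lookup keyed by class: addAttribute returns the instance already registered for that class, or creates and registers a new one. The sketch below is a toy model of that idea in plain Java, not Lucene code. The names ToyAttributeSource and ToyTerm are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not Lucene code) of a get-or-create attribute lookup keyed by class. */
class ToyAttributeSource {

    private final Map<Class<?>, Object> attributes = new HashMap<Class<?>, Object>();

    /** Returns the instance already registered for attClass, or creates and registers one. */
    <A> A addAttribute(Class<A> attClass) {
        Object existing = attributes.get(attClass);
        if (existing == null) {
            try {
                existing = attClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot instantiate " + attClass.getName(), e);
            }
            attributes.put(attClass, existing);
        }
        return attClass.cast(existing);
    }

    /** Returns the registered instance, or throws if none has been added. */
    <A> A getAttribute(Class<A> attClass) {
        Object existing = attributes.get(attClass);
        if (existing == null) {
            throw new IllegalArgumentException("Attribute not present: " + attClass.getName());
        }
        return attClass.cast(existing);
    }

    /** Hypothetical attribute type used only for this demo. */
    static class ToyTerm {
        String text = "";
    }

    public static void main(String[] args) {
        ToyAttributeSource source = new ToyAttributeSource();
        ToyTerm first = source.addAttribute(ToyTerm.class);
        ToyTerm second = source.addAttribute(ToyTerm.class);
        ToyTerm fetched = source.getAttribute(ToyTerm.class);
        // All three references point at the same object.
        System.out.println(first == second && second == fetched);
    }
}
```

Because the same instance comes back every time, the listing could also keep the value returned by addAttribute and read it inside the loop, rather than calling getAttribute on each iteration.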

Running IkAnalyzerWord prints the following (I had configured a stopword dictionary):

[大风, 终于, 小了, 雾, 大风, 终于, 小了, 雾]


