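Harbin University of Science and Technology

School of Software and Microelectronics

Lab Report

(2020-2021, Second Semester)

Course:	Information Search Technology
Class:	Software 18-1
Student ID:	1814010130
Name:	Zhang Lihui (张立辉)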

Experiment:	Experiment 2				Major:	Software Engineering
Name:	Zhang Lihui	Student ID:	1814010130		Class:	Software 18-1

I. Objectives:

1. Learn to call the Lucene API.
2. Compare different tokenization methods (analyzers).
3. Learn how to preprocess text for retrieval.

II. Content:

Examples of several Lucene analyzers
Lucene tokenization

III. Equipment and software environment:

Edition: Windows 10 Home, Chinese edition
Version: 20H2
OS build: 19042.928
elasticsearch-6.5.1
kibana-6.5.1-windows-x86_64
logstash-7.12.0

IV. Procedure and results:

Examples of several Lucene analyzers

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class VariousAnalyzers {
    // Sample text, roughly: "The People's Republic of China, China for short, has a population of 1.3 billion,"
    private static String str = "中华人民共和国简称中国，拥有13亿人口，";

    public static void main(String[] args) throws IOException {
        // Run the same text through each analyzer in turn.
        // printAnalyzer closes the analyzer it is given, so a new one is created each time.
        Analyzer analyzer = new StandardAnalyzer();
        System.out.println("Standard analyzer: " + analyzer.getClass());
        printAnalyzer(analyzer);

        analyzer = new WhitespaceAnalyzer();
        System.out.println("Whitespace analyzer: " + analyzer.getClass());
        printAnalyzer(analyzer);

        analyzer = new SimpleAnalyzer();
        System.out.println("Simple analyzer: " + analyzer.getClass());
        printAnalyzer(analyzer);

        analyzer = new CJKAnalyzer();
        System.out.println("Bigram (CJK) analyzer: " + analyzer.getClass());
        printAnalyzer(analyzer);

        analyzer = new KeywordAnalyzer();
        System.out.println("Keyword analyzer: " + analyzer.getClass());
        printAnalyzer(analyzer);

        // The no-argument StopAnalyzer constructor is available in Lucene 7.x and earlier;
        // later releases expect an explicit stop-word set.
        analyzer = new StopAnalyzer();
        System.out.println("Stop-word analyzer: " + analyzer.getClass());
        printAnalyzer(analyzer);
    }

    // Prints every token the analyzer produces for str, separated by "|".
    public static void printAnalyzer(Analyzer analyzer) throws IOException {
        StringReader reader = new StringReader(str);
        // The first argument is a field name; "content" is an arbitrary placeholder.
        try (TokenStream toStream = analyzer.tokenStream("content", reader)) {
            CharTermAttribute teAttribute = toStream.addAttribute(CharTermAttribute.class);
            toStream.reset();
            while (toStream.incrementToken()) {
                System.out.print(teAttribute.toString() + "|");
            }
            toStream.end();
        }
        System.out.println("\n");
        analyzer.close();
    }
}
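
The listing above labels CJKAnalyzer as the bigram ("二分法") analyzer. To make that idea concrete, here is a minimal sketch in plain Java of overlapping two-character tokens over a run of Chinese characters. It is not Lucene's implementation: the class name BigramSketch and the method bigrams are hypothetical, and the sketch assumes the input is a single unbroken run of characters from the Basic Multilingual Plane with no punctuation or digits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-written overlapping-bigram splitter.
// BigramSketch and bigrams are hypothetical names, not part of Lucene.
class BigramSketch {

    // Returns overlapping two-character tokens for one run of characters.
    // A run of length 1 is returned as a single one-character token.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.isEmpty()) {
            return tokens;
        }
        if (run.length() == 1) {
            tokens.add(run);
            return tokens;
        }
        // Slide a window of width 2 across the run, one character at a time.
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // First seven characters of the sample text used in this report.
        System.out.println(String.join("|", bigrams("中华人民共和国")));
    }
}
```

For the seven-character run 中华人民共和国 the sketch yields six tokens: 中华|华人|人民|民共|共和|和国. Overlapping pairs need no dictionary, at the cost of producing some pairs that are not real words.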

Lucene tokenization

// Listing 2-1: Lucene tokenization
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StdAnalyzer {
    // Chinese sample, roughly: "The People's Republic of China, China for short, is a country of 1.3 billion people"
    private static String strCh = "中华人民共和国简称中国，是一个有13亿人口的国家";
    private static String strEn = "Dogs can not achieve a place,eyes can reach;";

    public static void main(String[] args) throws IOException {
        System.out.println("StandardAnalyzer on Chinese text");
        stdAnalyzer(strCh);
        System.out.println("StandardAnalyzer on English text");
        stdAnalyzer(strEn);
    }

    // Tokenizes str with StandardAnalyzer and prints the tokens separated by "|".
    public static void stdAnalyzer(String str) throws IOException {
        StringReader reader = new StringReader(str);
        // The first argument of tokenStream is a field name; "content" is an arbitrary placeholder.
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream toStream = analyzer.tokenStream("content", reader)) {
            CharTermAttribute teAttribute = toStream.addAttribute(CharTermAttribute.class);
            toStream.reset();
            System.out.println("Tokens:");
            while (toStream.incrementToken()) {
                System.out.print(teAttribute.toString() + "|");
            }
            toStream.end();
        }
        System.out.println("\n");
    }
}
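
The preprocessing that an analyzer applies to English text before indexing can be approximated in plain Java. The sketch below lowercases the input, splits it on anything that is not a letter or digit, and drops stop words. It is a simplification, not what StandardAnalyzer does internally: the class PreprocessSketch, the method preprocess, and the five-word stop list are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: lowercase, split on non-alphanumerics, remove stop words.
// PreprocessSketch and preprocess are hypothetical names, not part of Lucene.
class PreprocessSketch {

    // A tiny stop list chosen for this example; it is not Lucene's list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "not", "is"));

    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase first so that "Dogs" and "dogs" become the same term.
        String lower = text.toLowerCase(Locale.ROOT);
        // Any run of characters that are neither letters nor digits is a separator.
        for (String piece : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty() && !STOP_WORDS.contains(piece)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The English sample sentence from Listing 2-1.
        System.out.println(String.join("|",
                preprocess("Dogs can not achieve a place,eyes can reach;")));
    }
}
```

With this stop list the sample sentence reduces to dogs|can|achieve|place|eyes|can|reach. A real analyzer may tokenize differently, so the output of the Lucene listing should be taken from an actual run and not from this sketch.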

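V. Summary:

Through this experiment I learned to call the Lucene API, compared different tokenization methods, and learned how to preprocess text for retrieval.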

Grade:		Instructor:		Date (year/month/day):

zlhui • April 29, 2021