How can I get the list of unique terms from a specific field in Lucene?
Question
I have an index built from a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in Java?
Recommended answer
You're looking for term vectors (the set of all the words in a field together with the number of times each word was used; stop words are excluded if the analyzer removed them at index time). In Lucene 3.x, call IndexReader's getTermFreqVector(docid, field) for each document in the index and add the returned terms to a HashSet. This works only if the field was indexed with term vectors enabled.
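The union step can be sketched in plain Java. This does not call Lucene: it assumes you have already pulled one String[] of terms per document (for example from TermFreqVector.getTerms() in Lucene 3.x), with null standing in for a document that has no stored term vector. The class name TermVectorUnion and the method uniqueTerms are made up for illustration, and a TreeSet is used instead of a HashSet only to make the printed order predictable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class TermVectorUnion {

    /**
     * Unions the per-document term arrays into one sorted set.
     * A null entry stands for a document with no stored term vector.
     */
    static Set<String> uniqueTerms(List<String[]> perDocumentTerms) {
        Set<String> unique = new TreeSet<String>();
        for (String[] terms : perDocumentTerms) {
            if (terms == null) {
                continue; // no term vector for this document: skip it
            }
            unique.addAll(Arrays.asList(terms));
        }
        return unique;
    }

    public static void main(String[] args) {
        // Hypothetical per-document terms; the second document has no term vector.
        List<String[]> docs = Arrays.asList(
                new String[] {"lucene", "index", "term"},
                null,
                new String[] {"term", "field"});
        System.out.println(uniqueTerms(docs)); // [field, index, lucene, term]
    }
}
```

Note that this approach visits every document, so its cost grows with the number of documents rather than with the number of distinct terms.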
The alternative is to use terms() and keep only the terms of the field you're interested in:
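// Lucene 3.x API; `index` is an already opened Directory.
IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
try {
    while (terms.next()) {
        final Term term = terms.term();
        // Keep only the terms that belong to the field of interest.
        if (term.field().equals("field_name")) {
            uniqueTerms.add(term.text());
        }
    }
} finally {
    terms.close();
    reader.close();
}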
This is not the optimal solution: you read and then discard the terms of all the other fields. Lucene 4 has a class, Fields, whose terms(field) method returns the terms of a single field only.
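To see why a per-field lookup is cheaper than filtering a full enumeration, here is a small model in plain Java. It is not Lucene code: it assumes a term dictionary kept sorted by field and then by term text, represented as a TreeSet of "field + separator + text" keys. The names FieldTermScan, key and termsForField are hypothetical. The lookup jumps to the first key of the requested field and stops at the field boundary, so terms of other fields are never read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class FieldTermScan {
    // Separator that sorts before any ordinary character in a field name.
    private static final char SEP = '\u0000';

    /** Builds the dictionary key for a (field, text) pair. */
    static String key(String field, String text) {
        return field + SEP + text;
    }

    /** Collects the term texts of one field by seeking to it instead of scanning every field. */
    static List<String> termsForField(NavigableSet<String> dictionary, String field) {
        String prefix = field + SEP;
        List<String> result = new ArrayList<String>();
        // tailSet starts the iteration at the first key >= prefix.
        for (String k : dictionary.tailSet(prefix, true)) {
            if (!k.startsWith(prefix)) {
                break; // reached the next field: stop reading
            }
            result.add(k.substring(prefix.length()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary with three fields.
        NavigableSet<String> dictionary = new TreeSet<String>(Arrays.asList(
                key("author", "smith"),
                key("body", "index"),
                key("body", "lucene"),
                key("body", "term"),
                key("title", "lucene")));
        System.out.println(termsForField(dictionary, "body")); // [index, lucene, term]
    }
}
```

Because the dictionary already stores each term once, the result needs no further deduplication.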