Getting term frequencies in Lucene

2023-06-29 · Java development questions


Question

Is there a fast and easy way to get term frequencies from a Lucene index without going through the TermVectorFrequencies class? That approach takes far too long on large collections.

What I mean is: is there something like TermEnum that exposes not just the document frequency but the term frequency as well?

UPDATE: Using TermDocs is way too slow.

Recommended answer

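Use TermDocs to get the term frequency for a given document. As with the document frequency, you obtain the TermDocs from an IndexReader by passing the term you are interested in.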
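As a rough illustration of that access pattern, here is a minimal sketch that models one term's postings in memory. `PostingsSketch` and `totalTermFreq` are hypothetical names used only for this example and are not part of Lucene; the `next()`, `doc()` and `freq()` methods deliberately mirror the `TermDocs` methods of the same names in Lucene 3.x, where the real cursor would come from `IndexReader.termDocs(Term)`.

```java
/** Simplified in-memory model of one term's postings; an assumption for illustration, not Lucene's classes. */
final class PostingsSketch {
    private final int[] docs;   // IDs of documents containing the term, in ascending order
    private final int[] freqs;  // freqs[i] = number of occurrences of the term in docs[i]
    private int pos = -1;       // cursor position; -1 means "before the first posting"

    PostingsSketch(int[] docs, int[] freqs) {
        if (docs.length != freqs.length) {
            throw new IllegalArgumentException("docs and freqs must have the same length");
        }
        this.docs = docs;
        this.freqs = freqs;
    }

    /** Advances to the next posting; returns false once the list is exhausted. */
    boolean next() {
        if (pos + 1 >= docs.length) {
            return false;
        }
        pos++;
        return true;
    }

    /** Document ID at the current cursor position. */
    int doc() {
        return docs[pos];
    }

    /** Term frequency within the current document. */
    int freq() {
        return freqs[pos];
    }

    /** Document frequency: how many documents contain the term. */
    int docFreq() {
        return docs.length;
    }

    /** Collection-wide term frequency: the sum of the per-document frequencies. */
    static long totalTermFreq(PostingsSketch postings) {
        long total = 0;
        while (postings.next()) {
            total += postings.freq();
        }
        return total;
    }
}
```

The document frequency is known without walking the list, whereas the collection-wide term frequency in this model requires one pass over every posting for the term, which is why it costs more than a TermEnum-style lookup.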

You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.

If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order: skipping forward is fine, but you can't jump back and forth in the document list efficiently.
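To make the "iterate in order" advice concrete, the following sketch looks up frequencies for several documents in a single forward pass over a postings list. It is a plain-Java model under stated assumptions, not Lucene code: `ForwardLookupSketch` and `freqsFor` are hypothetical names, and the inner loop only imitates the forward-skipping behavior that `TermDocs.skipTo(int)` provides in Lucene 3.x.

```java
/** Forward-only lookup over a postings list; an illustrative model, not Lucene's implementation. */
final class ForwardLookupSketch {

    /**
     * Returns the term frequency for each target document, or 0 when the
     * document does not contain the term.
     *
     * @param docs          document IDs containing the term, ascending
     * @param freqs         freqs[i] is the term frequency in docs[i]
     * @param sortedTargets document IDs to look up, ascending
     */
    static int[] freqsFor(int[] docs, int[] freqs, int[] sortedTargets) {
        int[] result = new int[sortedTargets.length];
        int pos = 0; // the cursor only ever moves forward
        for (int i = 0; i < sortedTargets.length; i++) {
            int target = sortedTargets[i];
            if (i > 0 && target < sortedTargets[i - 1]) {
                // Going backwards would force a restart from the beginning of the list.
                throw new IllegalArgumentException("targets must be in ascending order");
            }
            // Skip forward to the first posting whose document ID is >= target.
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            result[i] = (pos < docs.length && docs[pos] == target) ? freqs[pos] : 0;
        }
        return result;
    }
}
```

Sorting the document IDs before the lookup keeps the whole batch to one pass over the list; unsorted targets would need the cursor to be reset for every backward step.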

Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally, I would look for a better algorithm to achieve my objective, or provide better hardware: lots of memory, either to hold a RAMDirectory or to give to the OS for its own file cache.
