Get term frequencies in Lucene
Question
Is there a fast and easy way to get term frequencies from a Lucene index without going through the TermVectorFrequencies class? That approach takes an awfully long time for large collections.
What I mean is: is there something like TermEnum that exposes not just the document frequency but the term frequency as well?
UPDATE: Using TermDocs is way too slow.
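Recommended answer
Use TermDocs to get the term frequency for a given document. As with the document frequency, you obtain the TermDocs from an IndexReader by passing in the term of interest.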
You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where the frequencies for each term are listed in document order.
If that's "too slow", make sure that you've optimized your index so that multiple segments are merged into a single segment. Iterate over the documents in order (skipping forward is fine, but you can't jump back and forth in the document list efficiently).
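To make the in-order constraint concrete, here is a minimal sketch of a forward-only postings cursor, written in plain Java with no Lucene dependency. It is a simplified model, not Lucene's implementation: the class `PostingsSketch`, its `Cursor`, and the `int[]` encoding are all hypothetical names chosen for illustration. The encoding loosely follows the delta scheme described for older ".frq" files (the document gap is shifted left by one bit, and a set low bit means the frequency is 1), which is an assumption worth checking against the file-format documentation for your Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of one term's postings list. Not Lucene code.
final class PostingsSketch {

    /** Encodes (docId, freq) pairs, which must be sorted by docId, as gaps. */
    static int[] encode(int[] docIds, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - previousDoc;
            previousDoc = docIds[i];
            if (freqs[i] == 1) {
                out.add((delta << 1) | 1); // odd code: frequency is 1
            } else {
                out.add(delta << 1);       // even code: frequency follows
                out.add(freqs[i]);
            }
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }

    /** Forward-only cursor: doc IDs are rebuilt by summing gaps. */
    static final class Cursor {
        private final int[] data;
        private int pos = 0;
        private int doc = 0;
        private int freq = 0;

        Cursor(int[] data) {
            this.data = data;
        }

        /** Moves to the next entry; returns false when the list is exhausted. */
        boolean next() {
            if (pos >= data.length) {
                return false;
            }
            int code = data[pos++];
            doc += code >>> 1;
            freq = (code & 1) == 1 ? 1 : data[pos++];
            return true;
        }

        /**
         * Advances at least once, stopping at the first entry with
         * doc >= target. This sketch scans linearly; there is no way to
         * move backwards without re-reading the list from the start.
         */
        boolean skipTo(int target) {
            while (next()) {
                if (doc >= target) {
                    return true;
                }
            }
            return false;
        }

        int doc() {
            return doc;
        }

        int freq() {
            return freq;
        }
    }

    /** Sums the per-document frequencies in a single sequential pass. */
    static long totalFrequency(int[] encoded) {
        Cursor cursor = new Cursor(encoded);
        long total = 0;
        while (cursor.next()) {
            total += cursor.freq();
        }
        return total;
    }

    public static void main(String[] args) {
        int[] encoded = encode(new int[] {3, 7, 8}, new int[] {1, 4, 2});
        Cursor cursor = new Cursor(encoded);
        while (cursor.next()) {
            System.out.println("doc=" + cursor.doc() + " freq=" + cursor.freq());
        }
        System.out.println("total=" + totalFrequency(encoded));
    }
}
```

Because each document ID is only known as an offset from the previous one, a backward jump means restarting the scan, which is why a single sequential pass over each term's postings is the cheap access pattern.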
Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally, I would look for a better algorithm to achieve my objective, or provide better hardware: lots of memory, either to hold a RAMDirectory or to give to the OS for its own file cache.