Question
I'm building a Lucene index and adding documents.
I have a field that is multi-valued; for this example I'll use categories.
An item can have many categories. For example, jeans can fall under Clothing, Pants, Men's, Women's, etc.
When adding the field to a document, do commas make a difference? Will Lucene simply ignore them? If I change the commas to spaces, will there be a difference? Does this automatically make the field multi-valued?
String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
categoriesForItem = categoriesForItem.replaceAll(",", " ").trim(); // not sure if to remove comma
doc.add(new StringField("categories", categoriesForItem , Field.Store.YES)); // doc is a Document
Am I doing this correctly, or is there another way to create multi-valued fields?
Any help or advice is appreciated.
Recommended answer
This would be a better way to index a multi-valued field for each document:
String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call
String[] categories = categoriesForItem.split(",");
for (String category : categories) {
    // trim() drops the space left after each comma; each add() contributes one value to the field
    doc.add(new StringField("categories", category.trim(), Field.Store.YES)); // doc is a Document
}
Whenever multiple fields with the same name appear in one document, both the inverted index and the term vectors logically append the tokens of those fields to one another, in the order the fields were added.
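Also, during the analysis phase, two different values of an analyzed field (a TextField, for example) can be separated by a position increment through the analyzer's position increment gap (Analyzer.getPositionIncrementGap() in Lucene). Lucene's default gap is 0, so the analyzer has to return a larger value for the separation to take effect. Let me explain why this is needed.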
Suppose the field "categories" in document D1 has two values, "foo bar" and "foo baz". If you run the phrase query "bar foo", D1 should not come up. This is ensured by adding an extra position increment between two values of the same field.
If you concatenate the field values yourself and rely on the analyzer to split them into multiple values, "bar foo" would return D1, which would be incorrect.
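To make the "bar foo" example concrete, here is a minimal sketch in plain Java that simulates token positions, with and without a gap between values. It does not use Lucene at all: the class PositionGapDemo and the helpers indexValues and matchesPhrase are hypothetical names invented for this illustration, and the whitespace tokenization is a simplifying assumption, not how any particular analyzer behaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a toy model of term positions, not Lucene's implementation.
final class PositionGapDemo {

    // Splits each value on whitespace and records the position of every term.
    // `gap` extra positions are skipped between consecutive values of the field.
    static Map<String, List<Integer>> indexValues(List<String> values, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int next = 0;
        boolean first = true;
        for (String value : values) {
            if (!first) {
                next += gap; // the simulated position increment gap
            }
            first = false;
            for (String term : value.trim().split("\\s+")) {
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(next);
                next++;
            }
        }
        return positions;
    }

    // True if `second` occurs at the position immediately after `first`,
    // which is what an exact two-word phrase query requires.
    static boolean matchesPhrase(Map<String, List<Integer>> positions,
                                 String first, String second) {
        List<Integer> firstPositions =
                positions.getOrDefault(first, Collections.<Integer>emptyList());
        List<Integer> secondPositions =
                positions.getOrDefault(second, Collections.<Integer>emptyList());
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("foo bar", "foo baz");

        // Gap 0 behaves like concatenating the values: "bar foo" wrongly matches.
        System.out.println(matchesPhrase(indexValues(values, 0), "bar", "foo"));   // prints true

        // A gap of 100 keeps the values apart: "bar foo" no longer matches,
        // while a phrase inside a single value still does.
        System.out.println(matchesPhrase(indexValues(values, 100), "bar", "foo")); // prints false
        System.out.println(matchesPhrase(indexValues(values, 100), "foo", "baz")); // prints true
    }
}
```

With a gap of 0, "bar" sits at position 1 and the second "foo" at position 2, so the phrase matches across the value boundary. With a gap of 100, the second "foo" moves to position 102 and the cross-value match disappears.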