Best way to test for existing string against a large list of comparables
Problem description
Suppose you have a list of acronyms that define a value (ex. AB1,DE2,CC3) and you need to check a string value (ex. "Happy:DE2|234") to see if an acronym is found in the string. For a short list of acronyms I would usually create a simple RegEx that used a separator (ex. (AB1|DE2|CC3) ) and just look for a match.
But how would I tackle this if there are over 30 acronyms to match against? Would it make sense to use the same technique (ugly), or is there a more efficient and elegant way to accomplish this task?
Keep in mind the example acronym list and example string are not the actual data format that I am working with, rather just a way to express my challenge.
BTW, I read a related SO question but didn't think it applied to what I was trying to accomplish.
EDIT: I forgot to include my need to capture the matched value, hence the choice to use Regular Expressions...
Personally I don't think 30 is particularly large for a regex so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:
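using System;
using System.Linq;
using System.Text.RegularExpressions;

var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
// Escape each acronym so any regex metacharacters in it are matched literally.
var regex = new Regex(string.Join("|", acronyms.Select(Regex.Escape)), RegexOptions.Compiled);
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
    Console.WriteLine(match.Value);
// prints ZZAB and CD (the leftmost match wins, and matches never overlap)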
So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms, I would do some testing; who knows what kind of optimizations are already built into the regex engine. You'll also benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue, keep it simple.
On the other hand, regex may have other limitations. For example, by default, if you have the acronyms AB, BC and CD, it will return only two of them as matches in "ABCD", because successive matches do not overlap. So it's good at telling you that an acronym is present, but you need to be careful about catching multiple matches.
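If overlapping matches matter, one common workaround is to wrap the alternation in a lookahead with a capturing group, so each match consumes no characters. The following is only a sketch of that idea; the class and method names (OverlappingMatches, Find) are made up for illustration and are not from the original answer. Note that it still reports at most one acronym per starting position.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class OverlappingMatches
{
    public static List<string> Find(string text, IEnumerable<string> acronyms)
    {
        // Escape each acronym so it is matched literally.
        var alternation = string.Join("|", acronyms.Select(Regex.Escape));

        // The lookahead (?=...) matches without consuming input, so the
        // engine advances one character at a time and overlaps are found.
        var regex = new Regex("(?=(" + alternation + "))");

        var results = new List<string>();
        foreach (Match m in regex.Matches(text))
        {
            // Group 1 holds the acronym seen at this position.
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

For example, OverlappingMatches.Find("ABCD", new[] { "AB", "BC", "CD" }) yields AB, BC and CD, whereas the plain alternation yields only AB and CD.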
When performance became an issue for me (> 10,000 items) I put the 'acronyms' in a HashSet and then searched each substring of the text (from min acronym length to max acronym length). This was ok for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.
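The HashSet approach described above might look roughly like the sketch below. It is a reconstruction under assumptions, not the code I actually used: the names AcronymScanner and FindAll are invented here, and the comparison is assumed to be ordinal and case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the HashSet substring scan.
public static class AcronymScanner
{
    public static List<string> FindAll(string text, IEnumerable<string> acronyms)
    {
        // Assumption: exact, case-sensitive comparison.
        var set = new HashSet<string>(acronyms, StringComparer.Ordinal);
        var found = new List<string>();
        if (set.Count == 0) return found;

        int minLen = set.Min(a => a.Length);
        int maxLen = set.Max(a => a.Length);

        // Try every substring whose length lies between the shortest
        // and the longest acronym, starting at every position.
        for (int start = 0; start < text.Length; start++)
        {
            for (int len = minLen; len <= maxLen && start + len <= text.Length; len++)
            {
                var candidate = text.Substring(start, len);
                if (set.Contains(candidate)) found.Add(candidate);
            }
        }
        return found;
    }
}
```

Each lookup is a hash probe, so the cost grows with the text length times the spread of acronym lengths rather than with the number of acronyms. Unlike the regex, this reports overlapping hits too: for "ZZZABCDZZZ" with AB, BC, CD and ZZAB it finds ZZAB, AB, BC and CD.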