Count CLOB Duplicates in a large Oracle Table(计算大型 Oracle 表中的 CLOB 重复项)
问题描述
我有一个 Oracle 数据库表 LOG_MESSAGES
,其中有一个名为 MESSAGE
的 CLOB 列.某些行包含相同的 MESSAGE
.
I have an Oracle database table LOG_MESSAGES
with a CLOB column called MESSAGE
. Some of the rows contain the same MESSAGE
.
对于每个至少有一个重复的 MESSAGE
,我想知道重复的数量.相当多的这些 CLOB 很大(> 100 kB),因此转换为 VARCHAR2
是不可能的.由于GROUP BY
等很多传统方法不适用于CLOB
,请大神指教一下?
For each MESSAGE
which has at least a duplicate, I'd like to know the number of duplicates. Quite a number of these CLOBs are huge (> 100 kB), so converting to VARCHAR2
is out of question. Since many traditional methods such as GROUP BY
do not work with CLOB
, could someone please enlighten me?
有关信息,该表非常大(大约 1 TB).因此,非常感谢优化方法.
For information, the table is very large (around 1 TB). So an optimised method would very much be appreciated.
预先感谢您的帮助.
推荐答案
我认为这个问题经常被问到,但不幸的是,似乎没有完美的方法来做到这一点.不过有一些方法可以很好地工作.
I think this question gets asked a lot but unfortunately there doesn't seem to be a perfect way of doing this. There are ways that work just fine though.
搜索clob group by"或clob distinct",你会在这个网站上看到几个点击.
Search for "clob group by" or "clob distinct" and you will see several hits just on this website.
一种方法是编写一个 PL/SQL 脚本,在表中的所有 clob 之间执行 DBMS_LOB.COMPARE
但效率可能是 O(n^2) 的顺序,即会让你的目的变得很慢.
One way would be to write a PL/SQL script that does a DBMS_LOB.COMPARE
between all clobs in the table but the efficiency would probably be in the order of O(n^2) which would make it really slow for your purpose.
另一种被广泛接受的方法是使用 dbms_crypto 获取 clob 的散列值(我认为这允许对 clob 进行散列),然后对散列值进行分组.存在哈希冲突的可能性,但概率很小.我读了大约 2^80(不过数字可能是错误的).这不会像第一种方法那么慢,但计算散列也需要不可忽略的时间.
Another approach that is well accepted is to take a hash value of the clob using dbms_crypto (i think that allows hashing on clobs) and then group by on the hash values. There is a possibility of hash collision, but the probability is minute. I read somewhere around 2^80 (number might be wrong though). This won't be as slow as the first approach but calculating a hash would also take non-negligible time.
我建议先尝试散列方法,如果这看起来太慢,请寻找替代方法.
I would suggest try the hash approach first and if that seems too slow, look for alternatives.
这篇关于计算大型 Oracle 表中的 CLOB 重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:计算大型 Oracle 表中的 CLOB 重复项


基础教程推荐
- SQL Server:只有 GROUP BY 中的最后一个条目 2021-01-01
- 在 VB.NET 中更新 SQL Server DateTime 列 2021-01-01
- ERROR 2006 (HY000): MySQL 服务器已经消失 2021-01-01
- 无法在 ubuntu 中启动 mysql 服务器 2021-01-01
- SQL Server 2016更改对象所有者 2022-01-01
- Sql Server 字符串到日期的转换 2021-01-01
- SQL Server 中单行 MERGE/upsert 的语法 2021-01-01
- 将数据从 MS SQL 迁移到 PostgreSQL? 2022-01-01
- 如何在 SQL Server 的嵌套过程中处理事务? 2021-01-01
- 使用pyodbc“不安全"的Python多处理和数据库访问? 2022-01-01