MySQL: quickly remove duplicates from a big database

Problem description

I've got a big MySQL database (more than a million rows) that is messed up by duplicates. I think they could make up 1/4 to 1/2 of the whole database. I need to get rid of them quickly (I mean in terms of query execution time). Here's how the table looks:
id (index) | text1 | text2 | text3
The text1 & text2 combination should be unique. If there are any duplicates, only one row should remain, preferably one whose text3 is NOT NULL. Example:

1 | abc | def | NULL  
2 | abc | def | ghi  
3 | abc | def | jkl  
4 | aaa | bbb | NULL  
5 | aaa | bbb | NULL  

...becomes:

1 | abc | def | ghi   #(doesn't really matter whether id:2 or id:3 survives)
2 | aaa | bbb | NULL  #(if there's no NOT NULL text3, NULL will do)

New ids could be anything; they do not depend on the old table's ids.
I've tried things like:

-- Keep one row per (text1, text2) pair, then swap the tables.
CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.
While they work on small databases, the query execution time on mine is just huge (it never got to the end, even after more than 20 minutes).

Is there any faster way to do that? Please help me solve this problem.

Recommended answer

I believe this will do it, using on duplicate key + ifnull():

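-- Empty copy of the original table, with the same columns and indexes.
create table tmp like yourtable;

-- The unique key is what makes duplicate (text1, text2) pairs collide.
alter table tmp add unique (text1, text2);

-- On a collision, keep the existing text3 unless it is NULL.
-- tmp.text3 is qualified so it cannot be confused with yourtable.text3.
insert into tmp select * from yourtable
    on duplicate key update text3=ifnull(tmp.text3, values(text3));

-- Swap the tables, then drop the old data.
rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;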

This should be much faster than anything that requires GROUP BY, DISTINCT, a subquery, or even ORDER BY. It doesn't even require a filesort, which would kill performance on a large temporary table. It still requires a full scan over the original table, but there's no avoiding that.
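
To see how the merge rule behaves on the five sample rows from the question, here is a minimal sketch. It is not the MySQL statement itself: it assumes Python's built-in sqlite3 module (SQLite 3.24 or newer) so that it can run without a server, and SQLite's ON CONFLICT ... DO UPDATE with COALESCE and excluded.text3 stands in for MySQL's ON DUPLICATE KEY UPDATE with IFNULL and VALUES(text3). The function name dedupe, the ROWS constant, and the ORDER BY id are additions for illustration only, and the sketch says nothing about MySQL performance.

```python
import sqlite3

# Sample rows from the question: (id, text1, text2, text3).
ROWS = [
    (1, "abc", "def", None),
    (2, "abc", "def", "ghi"),
    (3, "abc", "def", "jkl"),
    (4, "aaa", "bbb", None),
    (5, "aaa", "bbb", None),
]


def dedupe(rows):
    """Copy rows into a table with a unique (text1, text2) key, merging duplicates."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE yourtable "
        "(id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT)"
    )
    con.executemany("INSERT INTO yourtable VALUES (?, ?, ?, ?)", rows)

    # Same columns as the source, plus the unique key that makes duplicates collide.
    con.execute(
        "CREATE TABLE tmp "
        "(id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT, "
        "UNIQUE (text1, text2))"
    )

    # SQLite analog of INSERT ... SELECT ... ON DUPLICATE KEY UPDATE.
    # "WHERE true" is required by SQLite's parser when an upsert follows a SELECT.
    # On a collision, keep the stored text3 unless it is NULL.
    con.execute(
        """
        INSERT INTO tmp
        SELECT id, text1, text2, text3 FROM yourtable WHERE true ORDER BY id
        ON CONFLICT (text1, text2) DO UPDATE
        SET text3 = COALESCE(tmp.text3, excluded.text3)
        """
    )

    result = con.execute(
        "SELECT id, text1, text2, text3 FROM tmp ORDER BY id"
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in dedupe(ROWS):
        print(row)
```

With the rows processed in id order, the surviving rows keep the id of the first row seen for each pair, and text3 is the first non-NULL value encountered, so the abc/def pair ends up with ghi and the aaa/bbb pair stays NULL.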
