
Remove duplicate rows in MySQL

I have a table with the following fields:

id (Unique)
url (Unique)
title
company
site_id

Now, I need to remove rows having same title, company and site_id. One way to do it will be using the following SQL along with a script (PHP):

SELECT title, site_id, location, id, count( * ) 
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1

After running this query, I can remove duplicates using a server side script.

But I want to know if this can be done using only an SQL query.

Quick question: do you always want duplicates of (title, company, site_id) not to exist? If so, I'd set up a constraint in the database to enforce that title, company, and site_id are unique together. That would mean you wouldn't need a cleanup process, and it only takes a single line of SQL.
Please refer to this Stack Overflow link. It worked like a charm for me.
I can recommend this solution (posted in another thread): stackoverflow.com/a/4685232/195835
You can also check this answer.
I don't think your snippet (using HAVING), which identifies the rows to be deleted, works: it will delete all the rows that have duplicates, whereas I think the requirement is to keep one.

styvane

A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:

ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);

This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this...


Interesting, but the assumptions the IGNORE clause makes when removing those duplicates are a concern and might not match your needs. Does having incorrect values truncated to the closest acceptable match sound good to you?
Just for the record, if you're using InnoDB you may have an issue with this: there is a known bug with ALTER IGNORE TABLE on InnoDB tables.
The aforementioned bug @DarkMantis referred to, and its solution.
For InnoDB tables, execute the following query first: set session old_alter_table=1;
The IGNORE clause of ALTER TABLE is no longer supported as of MySQL 5.7.4: dev.mysql.com/doc/refman/5.7/en/alter-table.html
0b10011

If you don't want to alter the column properties, then you can use the query below.

Since you have a column which has unique IDs (e.g., auto_increment columns), you can use it to remove the duplicates:

DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND (`a`.`title` = `b`.`title` OR `a`.`title` IS NULL AND `b`.`title` IS NULL)
    AND (`a`.`company` = `b`.`company` OR `a`.`company` IS NULL AND `b`.`company` IS NULL)
    AND (`a`.`site_id` = `b`.`site_id` OR `a`.`site_id` IS NULL AND `b`.`site_id` IS NULL);

In MySQL, you can simplify it even more with the NULL-safe equal operator (aka "spaceship operator"):

DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND `a`.`title` <=> `b`.`title`
    AND `a`.`company` <=> `b`.`company`
    AND `a`.`site_id` <=> `b`.`site_id`;
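The effect of the NULL handling in the two statements above can be checked with a small sketch. It uses Python's built-in sqlite3 module purely as a runnable stand-in for MySQL (that choice, the sample rows and the helper name older_duplicates are assumptions of this example, not part of the answer). SQLite spells the NULL-safe comparison IS where MySQL uses <=>.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, company TEXT, site_id INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [
        (1, "Dev", None, 1),    # duplicate pair whose company is NULL
        (2, "Dev", None, 1),
        (3, "Dev", "Acme", 1),  # duplicate pair without NULLs
        (4, "Dev", "Acme", 1),
    ],
)


def older_duplicates(op):
    """Return ids of rows that have a duplicate with a higher id, comparing columns with `op`."""
    sql = f"""
        SELECT DISTINCT a.id
        FROM jobs AS a
        JOIN jobs AS b
          ON a.id < b.id
         AND a.title {op} b.title
         AND a.company {op} b.company
         AND a.site_id {op} b.site_id
        ORDER BY a.id
    """
    return [row[0] for row in conn.execute(sql)]


plain = older_duplicates("=")       # plain equality: NULL = NULL is not true
null_safe = older_duplicates("IS")  # SQLite's counterpart of MySQL's <=>

print(plain)      # [3]
print(null_safe)  # [1, 3]
conn.close()
```

With plain equality the NULL pair is never matched, so only the NULL-safe form finds both rows that would be deleted.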

This solution is not working properly for me. I created some duplicate records and it reported something like (20 rows affected), but running it again reported (4 rows affected), and so on until it reached (0 rows affected), which is kind of suspicious. Here is what works best for me; it's almost the same, but it works in one run. I edited the solution.
@Nassim: You must be doing something different from this answer because it works perfectly for me (in MySQL).
For anyone who was confused like me: the NULL comparison terms are needed because NULL does not equal NULL in MySQL. If the relevant columns are guaranteed not to be NULL, you can leave these terms out.
Yes, the accepted answer is no longer valid since MySQL 5.7, so this should really be the accepted answer: it's universal and does not require creating a temporary table either.
VERY SLOW if there are MANY copies of a given record (e.g. 100 to be reduced to 1), and many records with that condition. Recommend stackoverflow.com/a/4685232/199364 instead. IMHO, ALWAYS use the linked approach; it's an inherently faster technique.
Andomar

MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:

create temporary table tmpTable (id int);

insert  into tmpTable
        (id)
select  id
from    YourTable yt
where   exists
        (
        select  *
        from    YourTable yt2
        where   yt2.title = yt.title
                and yt2.company = yt.company
                and yt2.site_id = yt.site_id
                and yt2.id > yt.id
        );

delete  
from    YourTable
where   ID in (select id from tmpTable);

From Kostanos' suggestion in the comments: The only slow query above is DELETE, for cases where you have a very large database. This query could be faster:

DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
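As a rough check of the two-step idea above (collect the ids to delete in a temporary table, then delete them), here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows are made up for illustration, and SQLite does not have the MySQL restriction that motivates the temporary table, so this only demonstrates which rows end up selected and kept.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (id INTEGER PRIMARY KEY, title TEXT, company TEXT, site_id INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [
        (1, "Dev", "Acme", 1),
        (2, "Dev", "Acme", 1),
        (3, "Dev", "Acme", 1),
        (4, "QA", "Initech", 2),
        (5, "QA", "Initech", 2),
        (6, "Ops", "Globex", 3),
    ],
)

# Step 1: collect the ids of rows that have a duplicate with a higher id.
# The primary key on the temporary table gives the later lookup an index.
conn.execute("CREATE TEMP TABLE tmpTable (id INTEGER PRIMARY KEY)")
conn.execute(
    """
    INSERT INTO tmpTable (id)
    SELECT id
    FROM YourTable yt
    WHERE EXISTS (
        SELECT *
        FROM YourTable yt2
        WHERE yt2.title = yt.title
          AND yt2.company = yt.company
          AND yt2.site_id = yt.site_id
          AND yt2.id > yt.id
    )
    """
)
doomed_ids = [row[0] for row in conn.execute("SELECT id FROM tmpTable ORDER BY id")]

# Step 2: delete exactly those ids; the highest id of each group survives.
conn.execute("DELETE FROM YourTable WHERE id IN (SELECT id FROM tmpTable)")
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM YourTable ORDER BY id")]

print(doomed_ids)     # [1, 2, 4]
print(remaining_ids)  # [3, 5, 6]
conn.close()
```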

@andomar, this works fine except when one of the fields in the where clause contains nulls. Example: sqlfiddle.com/#!2/983f3/1
Is the INSERT an expensive query? I'm wondering because it times out in my MySQL database.
The only slow query here is the DELETE one, in cases where you have a big database. This query could be faster: DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
@Kostanos Not just the DELETE, but also the INSERT into the temporary table took me a long time. So an index on the temporary table could help a lot, at least it did for me: create index tmpTable_id_index on tmpTable (id)
If your tables are large, it's worthwhile adding an index with: create temporary table tmpTable (id int, PRIMARY KEY (id));
César Revert-Gomar

Deleting duplicates from MySQL tables is a common issue that's generally the result of a missing constraint that would have prevented those duplicates beforehand. But this common issue usually comes with specific needs that require specific approaches. The approach should differ depending on, for example, the size of the data, which duplicated entry should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.

There are also some specificities of MySQL itself, such as not being able to reference the same table in a FROM clause when performing a table UPDATE (it'll raise MySQL error #1093). This limitation can be overcome by using an inner query with a temporary table (as suggested in some approaches above). But this inner query won't perform especially well when dealing with big data sources.

However, a better approach does exist to remove duplicates, that's both efficient and reliable, and that can be easily adapted to different needs.

The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from your former table into the new one, while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and skips the need of using an inner query to search for duplicates and a temporary table that should be kept in memory (thus fitting big data sources too).

This is how it can be achieved. Given we have a table employee, with the following columns:

employee (id, first_name, last_name, start_date, ssn)

In order to delete the rows with a duplicate ssn column, and keeping only the first entry found, the following process can be followed:

-- create a new tmp_employee table
CREATE TABLE tmp_employee LIKE employee;

-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);

-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

Technical explanation

Line #1 creates a new tmp_employee table with exactly the same structure as the employee table

Line #2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates

Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_employee table, while ignoring duplicated entries

Line #4 renames tables, so that the new employee table holds all the entries without the duplicates, and a backup copy of the former data is kept on the backup_employee table

⇒ Using this approach, 1.6M records were reduced to 6k in less than 200s.
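The four steps can be tried out on a toy table. This is only a sketch under stated assumptions: it runs on Python's built-in sqlite3 module rather than MySQL, the employee table is cut down to three made-up columns, the copy is declared by hand because SQLite has no CREATE TABLE ... LIKE, and INSERT OR IGNORE is SQLite's spelling of INSERT IGNORE.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, ssn TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ann", "111"), (2, "Bob", "222"), (3, "Anne", "111"), (4, "Cy", "333"), (5, "Rob", "222")],
)

# Steps 1 and 2: an empty copy of the table that carries the UNIQUE constraint.
conn.execute(
    "CREATE TABLE tmp_employee (id INTEGER PRIMARY KEY, first_name TEXT, ssn TEXT UNIQUE)"
)

# Step 3: copy rows in id order; rows that would violate the UNIQUE
# constraint are skipped, so the first row seen for each ssn wins.
conn.execute("INSERT OR IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id")

# Step 4: swap the tables, keeping the original data as a backup.
conn.execute("ALTER TABLE employee RENAME TO backup_employee")
conn.execute("ALTER TABLE tmp_employee RENAME TO employee")

kept = conn.execute("SELECT id, first_name, ssn FROM employee ORDER BY id").fetchall()
backup_count = conn.execute("SELECT COUNT(*) FROM backup_employee").fetchone()[0]

print(kept)          # [(1, 'Ann', '111'), (2, 'Bob', '222'), (4, 'Cy', '333')]
print(backup_count)  # 5
conn.close()
```

Changing ORDER BY id to ORDER BY id DESC in step 3 would keep the last row of each group instead of the first.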

Chetan, following this process, you could quickly and easily remove all your duplicates and create a UNIQUE constraint by running:

CREATE TABLE tmp_jobs LIKE jobs;

ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);

INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;

RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;

Of course, this process can be further modified to adapt it for different needs when deleting duplicates. Some examples follow.

✔ Variation for keeping the last entry instead of the first one

Sometimes we need to keep the last duplicated entry instead of the first one.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

On line #3, the ORDER BY id DESC clause gives the last IDs priority over the rest

✔ Variation for performing some tasks on the duplicates, for example keeping a count on the duplicates found

Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

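-- employee has no n_duplicates column, so its initial value (0) is supplied explicitly
INSERT INTO tmp_employee SELECT employee.*, 0 FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;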

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

On line #3, a new column n_duplicates is created

On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter). The INSERT INTO ... ON DUPLICATE KEY UPDATE query can be used to perform different types of updates for the duplicates found.

✔ Variation for regenerating the auto-incremental field id

Sometimes we use an auto-incremental field and, in order to keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn) SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

On line #3, instead of selecting all the fields on the table, the id field is skipped so that the DB engine generates a new one automatically

✔ Further variations

Many further modifications are also doable depending on the desired behavior. As an example, the following queries will use a second temporary table to, besides 1) keep the last entry instead of the first one; and 2) increase a counter on the duplicates found; also 3) regenerate the auto-incremental field id while keeping the entry order as it was on the former data.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

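-- employee has no n_duplicates column, so its initial value (0) is supplied explicitly
INSERT INTO tmp_employee SELECT employee.*, 0 FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;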

CREATE TABLE tmp_employee2 LIKE tmp_employee;

INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates) SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;

DROP TABLE tmp_employee;

RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;

The issue with this is that you need the space to duplicate your data.
Thank you @Cesar for sharing this solution. I was inspired by it and got out of my troubles with two messed-up tables that had no indexes and no keys. I just had to be careful to use the correct ORDER BY while importing data, so as not to overwrite valuable info with duplicate records that had no values. But yes, I fixed both tables!
Nimantha

If the IGNORE statement doesn't work, as in my case, you can use the statements below:

CREATE TABLE your_table_deduped LIKE your_table;


INSERT your_table_deduped
SELECT *
FROM your_table
GROUP BY index1_id,
         index2_id;

RENAME TABLE your_table TO your_table_with_dupes;

RENAME TABLE your_table_deduped TO your_table;

#OPTIONAL
ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);

#OPTIONAL
DROP TABLE your_table_with_dupes;

Works great if you have InnoDB tables with foreign key constraints.
@magdmartin, but won't foreign constraints prevent table deletion?
IGNORE statement didn't work for me and this worked great on deduping 5 million records. Cheers.
Mostafa -T

There is another solution :

DELETE t1 FROM my_table t1, my_table t2 WHERE t1.id < t2.id AND t1.my_field = t2.my_field AND t1.my_field_2 = t2.my_field_2 AND ...

How is this different from @rehriff's answer, which he submitted 6 months earlier?
@LawrenceDol I guess it's a bit more readable. Also, I think his answer was not the same at the time I answered; I think it got edited later.
Hmm. It takes too long for me, even though the number of records was not big!
Giacomo1968

A solution that is simple to understand and works with no primary key:

Add a new boolean column:

alter table mytable add tokeep boolean;

Add a constraint on the duplicated columns AND the new column:

alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);

Set the boolean column to true. This will succeed on only one of the duplicated rows because of the new constraint:

update ignore mytable set tokeep = true;

Delete the rows that have not been marked as tokeep:

delete from mytable where tokeep is null;

Drop the added column:

alter table mytable drop tokeep;

I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.
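To see why only one row per group gets marked, here is a minimal sketch of the same steps on Python's built-in sqlite3 module, used as a stand-in for MySQL. The sample rows are made up, CREATE UNIQUE INDEX replaces ADD CONSTRAINT, UPDATE OR IGNORE is SQLite's spelling of UPDATE IGNORE, and the final "drop the column" step is left out. Like MySQL, SQLite treats NULLs as distinct in a unique index, which is what lets the index be created while the duplicates still exist.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (mycol1 TEXT, mycol2 INTEGER)")  # no primary key
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("a", 1), ("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3)],
)

# New column starts out NULL on every row.
conn.execute("ALTER TABLE mytable ADD COLUMN tokeep BOOLEAN")

# Unique index over the duplicated columns plus the marker column.
conn.execute("CREATE UNIQUE INDEX preventdupe ON mytable (mycol1, mycol2, tokeep)")

# Try to mark every row; rows whose update would violate the unique index
# are skipped, so exactly one row per (mycol1, mycol2) group gets marked.
conn.execute("UPDATE OR IGNORE mytable SET tokeep = 1")
marked = conn.execute("SELECT COUNT(*) FROM mytable WHERE tokeep = 1").fetchone()[0]

# Remove everything that was not marked.
conn.execute("DELETE FROM mytable WHERE tokeep IS NULL")
remaining = conn.execute(
    "SELECT mycol1, mycol2 FROM mytable ORDER BY mycol1, mycol2"
).fetchall()

print(marked)     # 3
print(remaining)  # [('a', 1), ('b', 2), ('c', 3)]
conn.close()
```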


This worked really well in MySQL 5.7, where the accepted solution doesn't work anymore.
This is just amazing, thank you so much
Andreas

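This will delete the duplicate rows with the same values for title, company and site_id. Because rows with the lower id are the ones deleted, the row with the highest id in each group is kept and all other duplicates are removed (use t1.id > t2.id instead to keep the first occurrence):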

DELETE t1 FROM tablename t1
INNER JOIN tablename t2 
WHERE 
    t1.id < t2.id AND
    t1.title = t2.title AND
    t1.company=t2.company AND
    t1.site_ID=t2.site_ID;

faisalbhagat

If you have a large table with a huge number of records, the solutions above will not work or will take too much time. In that case, there is a different solution:

-- Create temporary table

CREATE TABLE temp_table LIKE table1;

-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(title, company,site_id);

-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM table1;

-- Rename and drop
RENAME TABLE table1 TO old_table1, temp_table TO table1;
DROP TABLE old_table1;

Eduardo Rascon

I have this query snippet for SQL Server, but I think it can be used in other DBMSs with little changes:

DELETE
FROM Table
WHERE Table.idTable IN  (  
    SELECT MAX(idTable)
    FROM Table
    GROUP BY field1, field2, field3
    HAVING COUNT(*) > 1)

I forgot to tell you that this query doesn't remove the row with the lowest id of the duplicated rows. If this works for you try this query:

DELETE
FROM jobs
WHERE jobs.id IN  (  
    SELECT MAX(id)
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING COUNT(*) > 1)
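One thing worth checking with this pattern is what happens when a group has three or more copies: deleting only MAX(id) removes one row per group per run. The sketch below compares it with a "keep MIN(id), delete the rest" variant. It is only an illustration: it runs on Python's built-in sqlite3 module as a stand-in, the helper name remaining_ids and the sample rows are made up, and the location column is left out because the table in the question does not list it.

```python
import sqlite3

# Three copies of one job plus one job without duplicates.
ROWS = [
    (1, 1, "Acme", "Dev"),
    (2, 1, "Acme", "Dev"),
    (3, 1, "Acme", "Dev"),
    (4, 2, "Initech", "QA"),
]


def remaining_ids(delete_sql):
    """Load ROWS into a fresh in-memory table, run delete_sql once, return the surviving ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, site_id INTEGER, company TEXT, title TEXT)"
    )
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)", ROWS)
    conn.execute(delete_sql)
    ids = [row[0] for row in conn.execute("SELECT id FROM jobs ORDER BY id")]
    conn.close()
    return ids


# Deletes only the highest id of each duplicated group.
max_only = remaining_ids(
    """
    DELETE FROM jobs
    WHERE id IN (
        SELECT MAX(id) FROM jobs
        GROUP BY site_id, company, title
        HAVING COUNT(*) > 1
    )
    """
)

# Keeps the lowest id of every group and deletes all other rows.
keep_min = remaining_ids(
    """
    DELETE FROM jobs
    WHERE id NOT IN (
        SELECT MIN(id) FROM jobs
        GROUP BY site_id, company, title
    )
    """
)

print(max_only)  # [1, 2, 4]  (ids 1 and 2 are still duplicates)
print(keep_min)  # [1, 4]
```

In MySQL either statement would still need the subquery wrapped in a derived table to avoid error 1093; SQLite has no such restriction, which is why the sketch can run them directly.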

That won't work if there's more than two duplicates of a group.
Unfortunately, MySQL does not allow you to select from the table you are deleting from: ERROR 1093: You can't specify target table 'Table' for update in FROM clause
To solve the "You can't specify target table 'Table' for update in FROM..." error, use: DELETE FROM Table WHERE Table.idTable IN ( SELECT MAX(idTable) FROM (SELECT * FROM Table) AS tmp GROUP BY field1, field2, field3 HAVING COUNT(*) > 1) which forces MySQL to create a temporary table. However, it is very slow on large datasets; in such cases, I recommend Andomar's code, which is much faster.
alockwood05

The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.

CREATE TABLE tempTableName LIKE tableName;  
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);  
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;  
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName; 
DROP TABLE tempTableName;  

Your 4th line should say TRUNCATE TABLE tableName and 5th line should say INSERT INTO tableName SELECT * FROM tempTableName;
Saad Mirza

Delete duplicate rows using DELETE JOIN statement

MySQL provides you with the DELETE JOIN statement that you can use to remove duplicate rows quickly.

The following statement deletes duplicate rows and keeps the highest id:

DELETE t1 FROM contacts t1
    INNER JOIN
contacts t2 WHERE
t1.id < t2.id AND t1.email = t2.email;

Rico Nguyen

I found a simple way. (keep latest)

DELETE t1 FROM tablename t1 INNER JOIN tablename t2 
WHERE t1.id < t2.id AND t1.column1 = t2.column1 AND t1.column2 = t2.column2;

Why is this not working on MySQL? It gives "Unexpected token near t1".
I've used it many times, and it worked every time. Show me your actual script.
artemiuz

Simple and fast for all cases:

CREATE TEMPORARY TABLE IF NOT EXISTS _temp_duplicates AS (SELECT dub.id FROM table_with_duplications dub GROUP BY dub.field_must_be_uniq_1, dub.field_must_be_uniq_2 HAVING COUNT(*)  > 1);

DELETE FROM table_with_duplications WHERE id IN (SELECT id FROM _temp_duplicates);

Error Code: 1055. Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'dub.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
You could disable this strict check ("hard control") via sql_mode; see stackoverflow.com/questions/23921117/disable-only-full-group-by
GMB

As of version 8.0 (2018), MySQL finally supports window functions.

Window functions are both handy and efficient. Here is a solution that demonstrates how to use them to solve this assignment.

In a subquery, we can use ROW_NUMBER() to assign a position to each record in the table within column1/column2 groups, ordered by id. If there are no duplicates, the record will get row number 1. If duplicates exist, they will be numbered by ascending id (starting at 1).

Once records are properly numbered in the subquery, the outer query just deletes all records whose row number is not 1.

Query :

DELETE FROM tablename
WHERE id IN (
    SELECT id
    FROM (
        SELECT 
            id, 
            ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id) rn
        FROM tablename
    ) t
    WHERE rn > 1
)
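The numbering step is easier to follow with concrete rows. Below is a minimal sketch that runs the same kind of query on Python's built-in sqlite3 module, which also supports window functions (SQLite 3.25 or later); it is a stand-in for MySQL 8.0, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for MySQL 8.0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "a", "x"), (5, "b", "z")],
)

# Number the rows inside each (column1, column2) group by ascending id.
numbered = conn.execute(
    """
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
    FROM tablename
    ORDER BY id
    """
).fetchall()

# Delete every row that is not the first of its group.
conn.execute(
    """
    DELETE FROM tablename
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
            FROM tablename
        ) t
        WHERE rn > 1
    )
    """
)
remaining = [row[0] for row in conn.execute("SELECT id FROM tablename ORDER BY id")]

print(numbered)   # [(1, 1), (2, 2), (3, 1), (4, 3), (5, 1)]
print(remaining)  # [1, 3, 5]
conn.close()
```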

Interesting. Alternatively, for performance, we could create a temporary table with the subquery, but filtering for rn = 1. Then we drop the original and rename.
Francesco

I keep visiting this page anytime I google "remove duplicates from mysql", but for me the IGNORE solutions don't work because I have InnoDB MySQL tables.

This code works better every time:

CREATE TABLE tableToclean_temp LIKE tableToclean;
ALTER TABLE tableToclean_temp ADD UNIQUE INDEX (fontsinuse_id);
INSERT IGNORE INTO tableToclean_temp SELECT * FROM tableToclean;
DROP TABLE tableToclean;
RENAME TABLE tableToclean_temp TO tableToclean;

tableToclean = the name of the table you need to clean

tableToclean_temp = a temporary table created and deleted


Anthony Vipond

This solution will move the duplicates into one table and the uniques into another.

-- speed up creating uniques table if dealing with many rows
CREATE INDEX temp_idx ON jobs(site_id, company, title, location);

-- create the table with unique rows
INSERT jobs_uniques SELECT * FROM
    (
    SELECT * 
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING count(1) > 1
    UNION
    SELECT *
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING count(1) = 1
) x;

-- create the table with duplicate rows
INSERT jobs_dupes 
SELECT * 
FROM jobs
WHERE id NOT IN
(SELECT id FROM jobs_uniques);

-- confirm the difference between uniques and dupes tables
SELECT COUNT(1)
AS jobs, 
(SELECT COUNT(1) FROM jobs_dupes) + (SELECT COUNT(1) FROM jobs_uniques)
AS sum
FROM jobs;

Why did you take the union and not just SELECT * FROM jobs GROUP BY site_id, company, title, location?
fcdt

Delete duplicate rows with the DELETE JOIN statement:

DELETE t1 FROM table_name t1
JOIN table_name t2
WHERE
    t1.id < t2.id AND
    t1.title = t2.title AND t1.company = t2.company AND t1.site_id = t2.site_id;

Why t1.id < t2.id rather than t1.id <> t2.id? That will prevent human error if t1 and t2 are switched around.
There have also been a few answers with DELETE...JOIN syntax already, with both natural and inner joins. This doesn't seem to add anything new.
Arun Solomon

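To delete the duplicate records in a table (note that rowid is an Oracle-style pseudocolumn; in MySQL, use the table's own unique id column instead):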

delete from job s 
where rowid < any 
(select rowid from job k 
where s.site_id = k.site_id and 
s.title = k.title and 
s.company = k.company);

or

delete from job s 
where rowid not in 
(select max(rowid) from job k 
where s.site_id = k.site_id and
s.title = k.title and 
s.company = k.company);

Giacomo1968

Here is what I used, and it works:

create table temp_table like my_table;

t_id is my unique column

insert into temp_table (id) select id from my_table GROUP by t_id;
delete from my_table where id not in (select id from temp_table);
drop table temp_table;

A
Abdul Rehman

To remove duplicate records so that a set of columns (e.g. COL1, COL2, COL3) is unique (suppose the three-column unique constraint was missed in the table structure and multiple duplicate entries have been inserted into the table):

DROP TABLE IF EXISTS TABLE_NAME_copy;
CREATE TABLE TABLE_NAME_copy LIKE TABLE_NAME;
INSERT INTO TABLE_NAME_copy
SELECT * FROM TABLE_NAME
GROUP BY COLUMN1, COLUMN2, COLUMN3; 
DROP TABLE TABLE_NAME;
ALTER TABLE TABLE_NAME_copy RENAME TO TABLE_NAME;

Hope this helps other developers.


Giacomo1968

I have a table where I forgot to add a primary key on the id column, though it has auto_increment on the id. One day, someone replayed the MySQL binlog on the database, which inserted some duplicate rows.

I removed the duplicate rows as follows:

Select the distinct duplicated rows and export them:

select T1.* from table_name T1 inner join (select count(*) as c,id from table_name group by id) T2 on T1.id = T2.id where T2.c > 1 group by T1.id;

Delete the duplicated rows by id.

Insert the rows from the exported data.

Then add the primary key on id.


Skully

This is perfect if you are trying to delete one of the duplicates and leave the other. Note that without subqueries you would get a #1093 error.

DELETE FROM table_name
WHERE id IN (
    SELECT * FROM (SELECT n.id FROM table_name n
    WHERE n.column2 != "value"
    GROUP BY n.column HAVING COUNT(n.column ) > 1) x
)

Michael Tel

I like to be a bit more specific as to which records I delete so here is my solution:

delete
from jobs c1
where not c1.location = 'Paris'
and  c1.site_id > 64218
and exists 
(  
select * from jobs c2 
where c2.site_id = c1.site_id
and   c2.company = c1.company
and   c2.location = c1.location
and   c2.title = c1.title
and   c2.site_id > 63412
and   c2.site_id < 64219
)

Syed Amir Bukhari

You can easily delete the duplicate records with this code:

$qry = mysql_query("SELECT * from cities");
while($qry_row = mysql_fetch_array($qry))
{
$qry2 = mysql_query("SELECT * from cities2 where city = '".$qry_row['city']."'");

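if(mysql_num_rows($qry2) > 1){
    // Reset for each city; otherwise rows collected for earlier cities pile up and are deleted below
    $city_arry = array();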
    while($row = mysql_fetch_array($qry2)){
        $city_arry[] = $row;

        }

    $total = sizeof($city_arry) - 1;
        for($i=1; $i<=$total; $i++){


            mysql_query( "delete from cities2 where town_id = '".$city_arry[$i][0]."'");

            }
    }
    //exit;
}

This is very bad form- database tasks should be done in the DB, where they are much much faster, instead of sending data constantly between php/mysql because you know one better than the other.
Sunil

I had to do this with text fields and came across the limit of 100 bytes on the index.

I solved this by adding a column, storing an MD5 hash of the fields in it, and then doing the alter.

ALTER TABLE `table` ADD `merged` VARCHAR( 40 ) NOT NULL ;
UPDATE `table` SET `merged` = MD5(CONCAT(`col1`, `col2`, `col3`));
ALTER IGNORE TABLE `table` ADD UNIQUE INDEX idx_name (`merged`);