Find duplicate records in MySQL

ChrisP

The key is to rewrite this query so that it can be used as a subquery.

SELECT firstname, 
   lastname, 
   list.address 
FROM list
   INNER JOIN (SELECT address
               FROM   list
               GROUP  BY address
               HAVING COUNT(id) > 1) dup
           ON list.address = dup.address;
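
To see what this join returns, here is a small sketch that runs the same query shape against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for MySQL here, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE list ("
    "id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, address TEXT)"
)
# Made-up sample data: three people share one address, one does not.
conn.executemany(
    "INSERT INTO list (firstname, lastname, address) VALUES (?, ?, ?)",
    [
        ("John", "Smith", "1 Main St"),
        ("Jane", "Smith", "1 Main St"),
        ("Bob", "Jones", "2 Oak Ave"),
        ("Ann", "Lee", "1 Main St"),
    ],
)

# The derived table `dup` holds each address that occurs more than once;
# joining back to `list` returns every row carrying such an address.
rows = conn.execute(
    """
    SELECT firstname, lastname, list.address
    FROM list
       INNER JOIN (SELECT address
                   FROM   list
                   GROUP  BY address
                   HAVING COUNT(id) > 1) dup
               ON list.address = dup.address
    ORDER BY list.id
    """
).fetchall()

for row in rows:
    print(row)
```

All three rows sharing "1 Main St" come back, while the single "2 Oak Ave" row is filtered out.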

Be careful with subqueries. They can be very bad for performance. If this needs to run often and/or with lots of duplicate records, I would consider moving the processing out of the database and into a dataset.

It's an uncorrelated subquery, so it shouldn't be too bad, assuming neither query alone is poorly designed.

Lovely. Guess this is the syntax around "ERROR 1248 (42000): Every derived table must have its own alias"

This is the right idea, but again, as below, this only works if the addresses are guaranteed to be standardized...

+1 with this query you can find duplicates but also triplicates, quadruplicates..... and so on

DaveShaw

SELECT date FROM logs group by date having count(*) >= 2

This was the easiest working query to use with Laravel. Just had to add ->having(DB::raw('count(*)'), '>=', 2) to the query. Many thanks!

Be careful with this answer. It returns only one of the duplicates. If you have more than 2 copies of the same record you won't see them all, and after deleting the record returned you will still have duplicates in your table.

Why >=2? Just use HAVING COUNT(*) > 1

@TerryLin Considering that this doesn't actually solve the originally stated problem (which was how to return all the duplicates) I disagree.

Can someone explain to me why this is so highly upvoted? It looks almost exactly like the first code in the original question, which the questioner says is inadequate. What am I missing?

Amal K

Why not just INNER JOIN the table with itself?

SELECT a.firstname, a.lastname, a.address
FROM list a
INNER JOIN list b ON a.address = b.address
WHERE a.id <> b.id

A DISTINCT is needed if the address could exist more than two times.

I tested this too, and it was almost 6 times slower than the accepted solution in my situation (latest MySQL, table of 120,000 rows). This might be because it requires a temporary table; run EXPLAIN on both to see the differences.

I changed the last part of the query to WHERE a.id > b.id to filter out newer duplicates only, that way I can do a DELETE directly on the result. Switch the comparison to list the older duplicates.

This took 50 seconds to run, @doublejosh's answer took .13 seconds.

I must add that this query returns repeated rows despite the WHERE clause: if an address appears three times, each of its rows is output twice. If it appears four times, I believe each row will be output three times.
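
The row multiplication described in the comments can be checked with a quick sketch. It uses Python's sqlite3 with an in-memory database as a stand-in for MySQL, and the sample rows are invented for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list (id INTEGER PRIMARY KEY, address TEXT)")
# Made-up data: one address stored three times, another stored once.
conn.executemany(
    "INSERT INTO list (address) VALUES (?)",
    [("1 Main St",), ("1 Main St",), ("1 Main St",), ("2 Oak Ave",)],
)

# Plain self-join: each row pairs with every OTHER row sharing its address.
plain = conn.execute(
    """
    SELECT a.id, a.address
    FROM list a
    INNER JOIN list b ON a.address = b.address
    WHERE a.id <> b.id
    ORDER BY a.id
    """
).fetchall()

# DISTINCT collapses the repeats so each duplicated row appears once.
distinct = conn.execute(
    """
    SELECT DISTINCT a.id, a.address
    FROM list a
    INNER JOIN list b ON a.address = b.address
    WHERE a.id <> b.id
    ORDER BY a.id
    """
).fetchall()

print(len(plain), len(distinct))
```

With three copies of an address, every copy matches the two others, so the plain join yields six rows and the DISTINCT version three.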

I tested this in leetcode "leetcode.com/problems/duplicate-emails". It was faster compared to the sub-query.

Erick Martim

I tried the best answer chosen for this question, but it confused me somewhat. I actually needed that just on a single field from my table. The following example from this link worked out very well for me:

SELECT COUNT(*) c,title FROM `data` GROUP BY title HAVING c > 1;

Works like a charm!

Tudor

Isn't this easier :

SELECT *
FROM tc_tariff_groups
GROUP BY group_id
HAVING COUNT(group_id) >1

?

Worked for me where I just had to process ~10,000 duplicate rows in order to make them unique; much faster than loading all 600,000 rows.

very much easier

Easier, but solves a slightly different problem. Accepted answer shows ALL rows of each duplicate. This answer shows ONE row of each duplicate, because that's how GROUP BY works.
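
A minimal sketch of that difference, using Python's sqlite3 with an in-memory database as a stand-in for MySQL. The id and name columns and all sample values are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tc_tariff_groups ("
    "id INTEGER PRIMARY KEY, group_id INTEGER, name TEXT)"
)
# Made-up data: group 10 appears three times, group 30 twice, group 20 once.
conn.executemany(
    "INSERT INTO tc_tariff_groups (group_id, name) VALUES (?, ?)",
    [(10, "a"), (10, "b"), (10, "c"), (20, "d"), (30, "e"), (30, "f")],
)

# GROUP BY / HAVING: one summary row per duplicated group_id.
groups = conn.execute(
    """
    SELECT group_id, COUNT(*)
    FROM tc_tariff_groups
    GROUP BY group_id
    HAVING COUNT(group_id) > 1
    ORDER BY group_id
    """
).fetchall()

# Joining back to the table: every row belonging to a duplicated group_id.
all_rows = conn.execute(
    """
    SELECT t.id, t.group_id, t.name
    FROM tc_tariff_groups t
    INNER JOIN (SELECT group_id
                FROM tc_tariff_groups
                GROUP BY group_id
                HAVING COUNT(group_id) > 1) dup
            ON t.group_id = dup.group_id
    ORDER BY t.id
    """
).fetchall()

print(groups)
print(all_rows)
```

The grouped query reports two groups, while the join lists all five rows that belong to them.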

dakshbhatt21

select `cityname` from `codcities` group by `cityname` having count(*)>=2

This is similar to the query you asked for; it works and it's easy too. Enjoy!!!

doublejosh

Find duplicate users by email address with this query...

SELECT users.name, users.uid, users.mail, from_unixtime(created)
FROM users
INNER JOIN (
  SELECT mail
  FROM users
  GROUP BY mail
  HAVING count(mail) > 1
) dupes ON users.mail = dupes.mail
ORDER BY users.mail;

To find the actual duplicate you only need the inner query. This is way faster than the other answers.

George G

We can also find duplicates based on more than one field. For those cases you can use the format below.

SELECT COUNT(*), column1, column2 
FROM tablename
GROUP BY column1, column2
HAVING COUNT(*)>1;

Martijn Pieters

Finding duplicate addresses is much more complex than it seems, especially if you require accuracy. A MySQL query is not enough in this case...

I work at SmartyStreets, where we do address validation and de-duplication and other stuff, and I've seen a lot of diverse challenges with similar problems.

There are several third-party services which will flag duplicates in a list for you. Doing this solely with a MySQL subquery will not account for differences in address formats and standards. The USPS (for US address) has certain guidelines to make these standard, but only a handful of vendors are certified to perform such operations.

So, I would recommend the best answer for you is to export the table into a CSV file, for instance, and submit it to a capable list processor. One such is LiveAddress which will have it done for you in a few seconds to a few minutes automatically. It will flag duplicate rows with a new field called "Duplicate" and a value of Y in it.

+1 for seeing the difficulty involved in matching address strings, though you may want to specify that the OP's "duplicate records" question isn't complex in itself, but is when comparing addresses
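
To illustrate why exact string matching misses address duplicates, here is a rough Python sketch that normalizes addresses before grouping them. The abbreviation map and sample rows are hypothetical, and this is nowhere near real USPS standardization; it only shows the shape of the problem.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation map; a real standardizer covers far more cases.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}


def normalize(address):
    """Lowercase, drop punctuation, collapse whitespace, shorten suffixes."""
    words = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


# Made-up rows: ids 1 and 2 are the same place written two different ways.
addresses = [
    (1, "123 Main Street"),
    (2, "123  main st."),
    (3, "45 Oak Avenue"),
]

# Group row ids by normalized address, then keep groups with 2+ members.
groups = defaultdict(list)
for row_id, address in addresses:
    groups[normalize(address)].append(row_id)

duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)  # {'123 main st': [1, 2]}
```

A GROUP BY on the raw address column would treat rows 1 and 2 as different values, which is the gap this answer is pointing at.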

jerdiggity

Another solution would be to use table aliases, like so:

SELECT p1.id, p2.id, p1.address
FROM list AS p1, list AS p2
WHERE p1.address = p2.address
AND p1.id != p2.id

All you're really doing in this case is taking the original list table, creating two pretend tables -- p1 and p2 -- out of that, and then performing a join on the address column (line 3). The 4th line makes sure that a record is never matched against itself, so only rows that share an address with a different record show up in your results.

Works nice. If the WHERE is checking with LIKE then apostrophes are found as well. Makes the query slower, but in my case it is a one-timer.

Chad Birch

Not going to be very efficient, but it should work:

SELECT *
FROM list AS o
WHERE (SELECT COUNT(*)
        FROM list AS i
        WHERE i.address = o.address) > 1;

this works better than other queries, thanks

Quassnoi

This will select duplicates in one table pass, no subqueries.

SELECT  *
FROM    (
        SELECT  ao.*, (@r := @r + 1) AS rn
        FROM    (
                SELECT  @_address := 'N'
                ) vars,
                (
                SELECT  *
                FROM
                        list a
                ORDER BY
                        address, id
                ) ao
        WHERE   CASE WHEN @_address <> address THEN @r := 0 ELSE 0 END IS NOT NULL
                AND (@_address := address ) IS NOT NULL
        ) aoo
WHERE   rn > 1

This query actually emulates ROW_NUMBER(), which is present in Oracle and SQL Server.

See the article in my blog for details:

Analytic functions: SUM, AVG, ROW_NUMBER - emulating in MySQL.

Not to nitpick, but FROM (SELECT ...) aoo is a subquery :-P

Martin Tonev

This will also show you how many duplicates you have and will order the results, without joins:

SELECT  `Language` , id, COUNT( id ) AS how_many
FROM  `languages` 
GROUP BY  `Language` 
HAVING how_many >=2
ORDER BY how_many DESC

perfect because it still says how many entries are duplicated

GROUP BY only lists ONE of each duplicate. Suppose there are THREE? Or FIFTY?

Ghostman

 SELECT firstname, lastname, address FROM list
 WHERE 
 Address in 
 (SELECT address FROM list
 GROUP BY address
 HAVING count(*) > 1)

Tried this one too, but seems to just hang. Believe the return from the inner query does not satisfy the IN parameter format.

What do you mean it doesn't satisfy the IN parameter format? All IN needs is that your subquery returns a single column. It's really pretty simple. It's more likely that your subquery runs on a column that isn't indexed, so it's taking an inordinate amount of time. If it's taking a long time, I would suggest breaking it into two queries: run the subquery first into a temporary table, create an index on it, then run the full query with the subquery selecting your duplicate field from the temporary table.
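
The two-step approach from the comment above could look roughly like this. It is a sketch using Python's sqlite3 with an in-memory database standing in for MySQL (where the first step would be CREATE TEMPORARY TABLE); the table name dup_addresses, the index name and the sample rows are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE list (id INTEGER PRIMARY KEY, firstname TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO list (firstname, address) VALUES (?, ?)",
    [("John", "1 Main St"), ("Jane", "1 Main St"), ("Bob", "2 Oak Ave")],
)

# Step 1: materialize the duplicated addresses once, in a temporary table.
conn.execute(
    """
    CREATE TEMP TABLE dup_addresses AS
    SELECT address FROM list GROUP BY address HAVING COUNT(*) > 1
    """
)

# Step 2: index the temporary table so the lookup in step 3 is cheap.
conn.execute("CREATE INDEX idx_dup_address ON dup_addresses (address)")

# Step 3: run the outer query against the small, indexed temporary table.
rows = conn.execute(
    """
    SELECT id, firstname, address
    FROM list
    WHERE address IN (SELECT address FROM dup_addresses)
    ORDER BY id
    """
).fetchall()

print(rows)
```

Whether this is actually faster than the single query depends on the data and the engine, so it is worth comparing both with EXPLAIN on the real table.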

I was worried IN required a comma separated list rather than a column, which was just wrong. Here's the query that worked for me:

SELECT users.name, users.uid, users.mail, from_unixtime(created) FROM users INNER JOIN ( SELECT mail FROM users GROUP BY mail HAVING count(mail) > 1 ) dup ON users.mail = dup.mail ORDER BY users.mail, users.created;

mabarroso

select * from table_name t1 inner join (select distinct <attribute list> from table_name as temp)t2 where t1.attribute_name = t2.attribute_name

For your table it would be something like

select * from list l1 inner join (select distinct address from list as list2)l2 where l1.address=l2.address

This query will give you all the distinct address entries in your list table... I am not sure how this will work if you have any primary key values for name, etc..

Sam

Fastest duplicate removal procedure:

/* create temp table with one primary column id */
INSERT INTO temp(id) SELECT MIN(id) FROM list GROUP BY (isbn) HAVING COUNT(*)>1;
DELETE FROM list WHERE id IN (SELECT id FROM temp);
DELETE FROM temp;

This obviously deletes only the first record from each group of duplicates.
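
Because each pass removes only one row per group, values stored three or more times need several passes. Here is a sketch of repeating the procedure until nothing is left to delete, using Python's sqlite3 as a stand-in for MySQL. The helper table is named dup_ids here instead of temp, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list (id INTEGER PRIMARY KEY, isbn TEXT)")
conn.execute("CREATE TABLE dup_ids (id INTEGER PRIMARY KEY)")
# Made-up data: isbn A stored three times, B twice, C once.
conn.executemany(
    "INSERT INTO list (isbn) VALUES (?)",
    [("A",), ("A",), ("A",), ("B",), ("B",), ("C",)],
)

passes = 0
while True:
    # Collect the lowest id of every isbn that still has duplicates.
    conn.execute(
        "INSERT INTO dup_ids (id) "
        "SELECT MIN(id) FROM list GROUP BY isbn HAVING COUNT(*) > 1"
    )
    # Delete those rows, then empty the helper table for the next pass.
    deleted = conn.execute(
        "DELETE FROM list WHERE id IN (SELECT id FROM dup_ids)"
    ).rowcount
    conn.execute("DELETE FROM dup_ids")
    if deleted == 0:
        break
    passes += 1

remaining = conn.execute("SELECT id, isbn FROM list ORDER BY id").fetchall()
print(passes, remaining)
```

With this data two passes are needed, and the row with the highest id of each isbn survives.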

slm

Personally this query has solved my problem:

SELECT `SUB_ID`, COUNT(SRV_KW_ID) as subscriptions FROM `SUB_SUBSCR` group by SUB_ID, SRV_KW_ID HAVING subscriptions > 1;

This query shows every subscriber ID that occurs more than once in the table for the same SRV_KW_ID, along with the number of duplicates found.

These are the table columns:

| Field         | Type        | Null | Key | Default | Extra          |
| SUB_SUBSCR_ID | int(11)     | NO   | PRI | NULL    | auto_increment |
| MSI_ALIAS     | varchar(64) | YES  | UNI | NULL    |                |
| SUB_ID        | int(11)     | NO   | MUL | NULL    |                |
| SRV_KW_ID     | int(11)     | NO   | MUL | NULL    |                |

Hope it will be helpful for you too!

Lalit Patel

SELECT t.*,(select count(*) from city as tt where tt.name=t.name) as count FROM `city` as t where (select count(*) from city as tt where tt.name=t.name) > 1 order by count desc

Replace city with your Table. Replace name with your field name

Ghostman

    SELECT *
    FROM (SELECT  address, COUNT(id) AS cnt
    FROM list
    GROUP BY address
    HAVING ( COUNT(id) > 1 )) AS dup;

tim

I use the following:

SELECT * FROM mytable
WHERE id IN (
  SELECT id FROM mytable
  GROUP BY column1, column2, column3
  HAVING count(*) > 1
)

Usman Yaqoob

    Find duplicate Records:

    Suppose we have table : Student 
    student_id int
    student_name varchar
    Records:
    +------------+---------------------+
    | student_id | student_name        |
    +------------+---------------------+
    |        101 | usman               |
    |        101 | usman               |
    |        101 | usman               |
    |        102 | usmanyaqoob         |
    |        103 | muhammadusmanyaqoob |
    |        103 | muhammadusmanyaqoob |
    +------------+---------------------+

    Now we want to see duplicate records
    Use this query:


   select student_name,student_id ,count(*) c from student group by student_id,student_name having c>1;

+---------------------+------------+---+
| student_name        | student_id | c |
+---------------------+------------+---+
| usman               |        101 | 3 |
| muhammadusmanyaqoob |        103 | 2 |
+---------------------+------------+---+

David

SELECT id, count(*) as c
 FROM `list`
GROUP BY id HAVING c > 1

This returns each repeated id together with the number of times it occurs, or nothing if no id is repeated.

Change the id in the GROUP BY (e.g. to address) and it will return the number of times each address is repeated, identified by the first id found with that address.

SELECT id, count(*) as c
 FROM `list`
GROUP BY address HAVING c > 1

I hope it helps. Enjoy ;)

Ganesh Krishnan

To quickly see the duplicate rows you can run a single simple query

Here I am querying the table and listing all duplicate rows with same user_id, market_place and sku:

select user_id, market_place,sku, count(id)as totals from sku_analytics group by user_id, market_place,sku having count(id)>1;

To delete the duplicate rows you have to decide which row you want to delete, e.g. the one with the lower id (usually older) or one picked by some date information. In my case I just want to delete the lower id, since the newer id holds the latest information.

First double check if the right records will be deleted. Here I am selecting the record among duplicates which will be deleted (by unique id).

select a.user_id, a.market_place,a.sku from sku_analytics a inner join sku_analytics b where a.id< b.id and a.user_id= b.user_id and a.market_place= b.market_place and a.sku = b.sku;

Then I run the delete query to delete the dupes:

delete a from sku_analytics a inner join sku_analytics b where a.id< b.id and a.user_id= b.user_id and a.market_place= b.market_place and a.sku = b.sku;

Backup, Double check, verify, verify backup then execute.
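
The check-then-delete sequence can be rehearsed on throwaway data first. The sketch below uses Python's sqlite3 as a stand-in for MySQL; since SQLite has no multi-table DELETE, the delete step is rewritten as DELETE ... WHERE id IN (subquery), which is a swapped-in form and not the statement above. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sku_analytics ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, market_place TEXT, sku TEXT)"
)
# Made-up data: one triplicate (ids 1-3), one unique row, one pair (ids 5-6).
conn.executemany(
    "INSERT INTO sku_analytics (user_id, market_place, sku) VALUES (?, ?, ?)",
    [
        (1, "US", "X"), (1, "US", "X"), (1, "US", "X"),
        (1, "US", "Y"),
        (2, "DE", "X"), (2, "DE", "X"),
    ],
)

# Dry run: ids that have a newer twin (a.id < b.id) and would be deleted.
to_delete = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT a.id
        FROM sku_analytics a
        INNER JOIN sku_analytics b
                ON a.user_id = b.user_id
               AND a.market_place = b.market_place
               AND a.sku = b.sku
               AND a.id < b.id
        ORDER BY a.id
        """
    )
]

# Delete exactly those ids (SQLite-friendly form of the multi-table DELETE).
conn.execute(
    """
    DELETE FROM sku_analytics
    WHERE id IN (SELECT a.id
                 FROM sku_analytics a
                 INNER JOIN sku_analytics b
                         ON a.user_id = b.user_id
                        AND a.market_place = b.market_place
                        AND a.sku = b.sku
                        AND a.id < b.id)
    """
)

remaining = [
    row[0] for row in conn.execute("SELECT id FROM sku_analytics ORDER BY id")
]
print(to_delete, remaining)
```

Only the highest id of each duplicated combination remains, which matches the intent of keeping the newest row.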

Chandan Mistry

SELECT * FROM bookings WHERE DATE(created_at) = '2022-01-11' AND code IN ( SELECT code FROM bookings GROUP BY code HAVING COUNT(code) > 1 ) ORDER BY id DESC

Kar.ma

Most of the answers here don't cope with the case when you have MORE THAN ONE duplicate result and/or when you have MORE THAN ONE column to check for duplications. When you are in such case, you can use this query to get all duplicate ids:

SELECT address, email, COUNT(*) AS QUANTITY_DUPLICATES, GROUP_CONCAT(id) AS ID_DUPLICATES
    FROM list
    GROUP BY address, email
    HAVING COUNT(*)>1;

https://i.stack.imgur.com/ZQG7C.png
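
For anyone who wants to try the GROUP_CONCAT query without a MySQL server, here is a sketch that runs it through Python's sqlite3 (SQLite also provides GROUP_CONCAT). The sample rows are made up, and the order of ids inside the concatenated string is not relied on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE list (id INTEGER PRIMARY KEY, address TEXT, email TEXT)"
)
# Made-up data: ids 1-2 and ids 4-6 repeat the same (address, email) pair.
conn.executemany(
    "INSERT INTO list (address, email) VALUES (?, ?)",
    [
        ("1 Main St", "a@example.com"),
        ("1 Main St", "a@example.com"),
        ("1 Main St", "b@example.com"),
        ("2 Oak Ave", "c@example.com"),
        ("2 Oak Ave", "c@example.com"),
        ("2 Oak Ave", "c@example.com"),
    ],
)

rows = conn.execute(
    """
    SELECT address, email, COUNT(*) AS QUANTITY_DUPLICATES,
           GROUP_CONCAT(id) AS ID_DUPLICATES
    FROM list
    GROUP BY address, email
    HAVING COUNT(*) > 1
    ORDER BY address, email
    """
).fetchall()

# Turn the comma-separated id string into a sorted list of integers.
duplicates = {
    (address, email): sorted(int(i) for i in ids.split(","))
    for address, email, _count, ids in rows
}
print(duplicates)
```

The pair that differs only by email (id 3) is correctly left out, since both columns take part in the GROUP BY.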

If you want to list every result as a single line, you need a more complex query. This is the one I found working:

CREATE TEMPORARY TABLE IF NOT EXISTS temptable AS (    
    SELECT GROUP_CONCAT(id) AS ID_DUPLICATES
    FROM list
    GROUP BY address, email
    HAVING COUNT(*)>1
); 
SELECT d.* 
    FROM list AS d, temptable AS t 
    WHERE FIND_IN_SET(d.id, t.ID_DUPLICATES) 
    ORDER BY d.id;

https://i.stack.imgur.com/ZycST.png

aad

select address from list where address = any (select address from (select address, count(id) cnt from list group by address having cnt > 1 ) as t1) order by address

The inner subquery returns the rows with duplicate addresses, then the outer subquery returns the address column for those addresses. The outer subquery must return only one column because it is used as the operand of the '= any' operator.

Community

Powerlord's answer is indeed the best, and I would recommend one more change: use LIMIT to make sure the db does not get overloaded:

SELECT firstname, lastname, list.address FROM list
INNER JOIN (SELECT address FROM list
GROUP BY address HAVING count(id) > 1) dup ON list.address = dup.address
LIMIT 10

It is a good habit to use LIMIT if there is no WHERE and when making joins. Start with small value, check how heavy the query is and then increase the limit.

how is this contributing anything to anything?
