Login

I have a table with a very large amount of rows. Duplicates are not allowed but due to a problem with how the rows were created I know there are some duplicates in this table.
I need to eliminate the extra rows from the perspective of the key columns. Some other columns may have *slightly* different data but I do not care about that. I still need to keep one of these rows however. SELECT DISTINCT won't work because it operates on all columns and I need to suppress duplicates based on the key columns.

How can I delete the extra rows but still keep one efficiently?

Example query:

DELETE FROM Table
WHERE ID NOT IN
(
SELECT MIN(ID)
FROM Table
GROUP BY Field1, Field2, Field3, ...
)
Here `fields` are column on which you want to group the duplicate rows.

You didn't say what version you were using, but in SQL 2005 and above, you can use a common table expression with the [OVER Clause][1]. It goes a little something like this:

WITH cte AS (
SELECT[foo], [bar],
row_number() OVER(PARTITION BY foo, bar ORDER BY baz) AS [rn]
FROM TABLE
)
DELETE cte WHERE [rn] > 1

Play around with it and see what you get.

(Edit: In an attempt to be helpful, someone edited the `ORDER BY` clause within the CTE. To be clear, you can order by anything you want here, it needn't be one of the columns returned by the cte. In fact, a common use-case here is that "foo, bar" are the group identifier and "baz" is some sort of time stamp. In order to keep the latest, you'd do `ORDER BY baz desc`)

[1]:

[To see links please register here]

Here's my twist on it, with a runnable example. **Note** this will only work in the situation where `Id` is unique, and you have duplicate values in other columns.

DECLARE @SampleData AS TABLE (Id int, Duplicate varchar(20))

INSERT INTO @SampleData
SELECT 1, 'ABC' UNION ALL
SELECT 2, 'ABC' UNION ALL
SELECT 3, 'LMN' UNION ALL
SELECT 4, 'XYZ' UNION ALL
SELECT 5, 'XYZ'

DELETE FROM @SampleData WHERE Id IN (
SELECT Id FROM (
SELECT
Id
,ROW_NUMBER() OVER (PARTITION BY [Duplicate] ORDER BY Id) AS [ItemNumber]
-- Change the partition columns to include the ones that make the row distinct
FROM
@SampleData
) a WHERE ItemNumber > 1 -- Keep only the first unique item
)

SELECT * FROM @SampleData

And the results:

Id Duplicate
----------- ---------
1 ABC
3 LMN
4 XYZ

Not sure why that's what I thought of first... definitely not the simplest way to go but it works.

phacochoerus965645

tetrode543

triploidite10548

rosenthal404