How do I find duplicate records?
I have an employee table. I need to find and delete all the records that
have the ssn and taxyear records duplicated. For example:
rec name ssn taxyear
--- ---- --- -------
1 John Doe 111-11-1111 1999
2 John Doe 111-11-1111 2000
3 John Doe 111-11-1111 1999
I need to find and delete record number 3.
Can anyone help? Thanks!
Re: How do I find duplicate records?
Nancy,
Try the following:
declare @recNum int
select @recNum = emp1.rec
from employee emp1
inner join employee emp2
on emp1.name = emp2.name
and emp1.ssn = emp2.ssn
and emp1.taxyear = emp2.taxyear
where emp1.rec > emp2.rec
delete from employee
where rec = @recNum
good luck!
"Nancy" <t@home.com> wrote:
>
>I have an employee table. I need to find and delete all the records that
>have the ssn and taxyear records duplicated. For example:
>
>rec name ssn taxyear
>--- ---- --- -------
>1 John Doe 111-11-1111 1999
>2 John Doe 111-11-1111 2000
>3 John Doe 111-11-1111 1999
>
>I need to find and delete record number 3.
>Can anyone help? Thanks!
Re: How do I find duplicate records?
A SQL Solution: Deleting Duplicate Data with a Single SQL Statement Using
RANK()
The Oracle 8i analytic function RANK() allows you to rank each item in a
group. (For more information about RANK(), see my 10-Minute Solution, "Performing
Top-N Queries in Oracle.") In our case, we are using this function to assign
dynamically sequential numbers in the group of duplicates sorted by the primary
key. With RANK(), grouping is specified in the PARTITION BY clause and sort
order for ranking is specified in the ORDER BY clause:
SELECT ID, LastName, FirstName,
RANK() OVER (PARTITION BY LastName,
FirstName ORDER BY ID) SeqNumber
FROM Customers
ORDER BY LastName, FirstName;
Listing 7 shows the output of the above query.
Bingo! Now, values in the SeqNumber column, assigned by RANK(), allow you
to separate all duplicate rows (SeqNumber > 1) from non-duplicates (SeqNumber
= 1) and retrieve only those rows you want to delete:
SELECT ID, LastName, FirstName
FROM
(SELECT ID, LastName, FirstName,
RANK() OVER (PARTITION BY LastName,
FirstName ORDER BY ID) AS SeqNumber
FROM Customers)
WHERE SeqNumber > 1;
Listing 8 shows the output of the above code. It contains seven duplicate
rows that have to be deleted. I tested this code on the Customers data set
with 500,000 rows total and 45,000 duplicates, and it took only 77 seconds
to count the duplicates.
Now you are ready to delete the duplicates by issuing the SQL DELETE command.
Here is the first version of it, which executed (for me) in 135 seconds:
DELETE
FROM CUSTOMERS
WHERE ID IN
(SELECT ID
FROM
(SELECT ID, LastName, FirstName,
RANK() OVER (PARTITION BY LastName,
FirstName ORDER BY ID) AS SeqNumber
FROM Customers)
WHERE SeqNumber > 1);
You may notice that the last two statements rank all the rows in the table,
which is inefficient. Let's improve the last SQL SELECT statement by applying
RANK() only to the groups of duplicates instead of all rows.
The following syntax is much more efficient, even though it's not as concise
as the last SELECT ID above:
SELECT ID, LastName, FirstName
FROM
(SELECT ID, LastName, FirstName,
RANK() OVER (PARTITION BY LastName,
FirstName ORDER BY ID) AS SeqNumber
FROM
(SELECT ID, LastName, FirstName
FROM Customers
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM Customers
GROUP BY LastName, FirstName
HAVING COUNT(*) > 1)))
WHERE SeqNumber > 1;
Counting the duplicates now took only 26 seconds, which amounted to a 67
percent performance gain.
Here is the improved SQL DELETE statement, which uses the improved SELECT
statement as a subquery:
DELETE
FROM Customers
WHERE ID IN
(SELECT ID
FROM
(SELECT ID, LastName, FirstName,
RANK() OVER (PARTITION BY LastName,
FirstName ORDER BY ID) AS SeqNumber
FROM
(SELECT ID, LastName, FirstName
FROM Customers
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM Customers
GROUP BY LastName, FirstName
HAVING COUNT(*) > 1)))
WHERE SeqNumber > 1);
Now it took only 47 seconds to find and delete 45,000 duplicates from 500,000
rows, compared to the 135 seconds it took in my first version of DELETE.
That's a significant performance gain (65 percent).
By comparison, the DeleteDuplicates stored procedure clocked in at 56 seconds,
which is a little slower (19 percent) than just the SQL statement.
Deleting Duplicate Rows When There's No Primary Key
--------------------------------------------------------------------------------
Although it's a sign of bad database design, you may have a table with no
primary key. In that case, you can use this technique to delete duplicate
rows. Read on...
Replacing the PL/SQL stored procedure with my single SQL statement will get
you much more concise code and may improve your performance because there
is no overhead caused by the PL/SQL-to-SQL context switch in the stored procedure.
However, the performance comparison results between the SQL statement and
the PL/SQL procedure may vary, depending on the data set size and percentage
of duplicates. I would expect the PL/SQL procedure to get faster or even
outperform the SQL statement if the number of duplicates is relatively small—i.e.,
1 to 3 percent of all rows in the table.
What if your table doesn't have a primary key? You can use another technique
as well (see sidebar).
RANK()'s Additional Capabilities
The RANK() function allows you to select the row per group of duplicates
you want to keep. Let's say you need to keep the latest (or earliest) record
determined by the value in the RecDate field. In this case you just need
to include RecDate in the ORDER BY clause of RANK() in order to sort duplicates
within each group by RecDate in DESCending (or ASCending) order, and then
by ID.
Here is the syntax for keeping the latest record per group:
DELETE
FROM Customers
WHERE ID IN
(SELECT ID
FROM
(SELECT ID, LastName, FirstName,
RANK() OVER (PARTITION BY LastName,
FirstName ORDER BY RecDate DESC, ID) AS SeqNumber
FROM
(SELECT ID, LastName, FirstName, RecDate
FROM Customers
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM Customers
GROUP BY LastName, FirstName
HAVING COUNT(*) > 1)))
WHERE SeqNumber > 1);
The flexibility of this technique also allows you to control how many rows
per group you want to keep in the table. Let's say you have a database with
promotional or rebate list information and you have common business conditions
to enforce, such as "limit five entries per household" or "limit three rebates
per person." By specifying the number of rows to keep (3) in the WHERE and
HAVING clauses, your SELECT statement will do the job again and delete all
excessive (more than 3) rebate entries per person:
DELETE
FROM Customers
WHERE ID IN
(SELECT ID
FROM
(SELECT ID, LastName, FirstName,
RANK() OVER (PARTITION BY LastName,
FirstName ORDER BY ID) AS SeqNumber
FROM
(SELECT ID, LastName, FirstName
FROM Customers
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM Customers
GROUP BY LastName, FirstName
HAVING COUNT(*) > 3)))
WHERE SeqNumber > 3);
As you can see, using the RANK() function allows you to eliminate duplicates
in a single SQL statement and gives you more capabilities by extending the
power of your queries.