How do I find duplicate records?


DevX Home    Today's Headlines   Articles Archive   Tip Bank   Forums   

Results 1 to 3 of 3

Thread: How do I find duplicate records?

Hybrid View

  1. #1
    Nancy Guest

    How do I find duplicate records?


    I have an employee table. I need to find and delete all the records that
    have the ssn and taxyear records duplicated. For example:

    rec name ssn taxyear
    --- ---- --- -------
    1 John Doe 111-11-1111 1999
    2 John Doe 111-11-1111 2000
    3 John Doe 111-11-1111 1999

    I need to find and delete record number 3.
    Can anyone help? Thanks!

  2. #2
    amy Guest

    Re: How do I find duplicate records?


    Nancy,
    Try the following:

    declare @recNum int

    select @recNum = emp1.rec
    from employee emp1
    inner join employee emp2
    on emp1.name = emp2.name
    and emp1.ssn = emp2.ssn
    and emp1.taxyear = emp2.taxyear
    where emp1.rec > emp2.rec

    delete from employee
    where rec = @recNum

    good luck!



    "Nancy" <t@home.com> wrote:
    >
    >I have an employee table. I need to find and delete all the records that
    >have the ssn and taxyear records duplicated. For example:
    >
    >rec name ssn taxyear
    >--- ---- --- -------
    >1 John Doe 111-11-1111 1999
    >2 John Doe 111-11-1111 2000
    >3 John Doe 111-11-1111 1999
    >
    >I need to find and delete record number 3.
    >Can anyone help? Thanks!



  3. #3
    Rajesh Guest

    Re: How do I find duplicate records?


    A SQL Solution: Deleting Duplicate Data with a Single SQL Statement Using
    RANK()
    The Oracle 8i analytic function RANK() allows you to rank each item in a
    group. (For more information about RANK(), see my 10-Minute Solution, "Performing
    Top-N Queries in Oracle.") In our case, we are using this function to assign
    dynamically sequential numbers in the group of duplicates sorted by the primary
    key. With RANK(), grouping is specified in the PARTITION BY clause and sort
    order for ranking is specified in the ORDER BY clause:
    SELECT ID, LastName, FirstName,
    RANK() OVER (PARTITION BY LastName,
    FirstName ORDER BY ID) SeqNumber
    FROM Customers
    ORDER BY LastName, FirstName;
    Listing 7 shows the output of the above query.
    Bingo! Now, values in the SeqNumber column, assigned by RANK(), allow you
    to separate all duplicate rows (SeqNumber > 1) from non-duplicates (SeqNumber
    = 1) and retrieve only those rows you want to delete:

    SELECT ID, LastName, FirstName
    FROM
    (SELECT ID, LastName, FirstName,
    RANK() OVER (PARTITION BY LastName,
    FirstName ORDER BY ID) AS SeqNumber
    FROM Customers)
    WHERE SeqNumber > 1;
    Listing 8 shows the output of the above code. It contains seven duplicate
    rows that have to be deleted. I tested this code on the Customers data set
    with 500,000 rows total and 45,000 duplicates, and it took only 77 seconds
    to count the duplicates.
    Now you are ready to delete the duplicates by issuing the SQL DELETE command.
    Here is the first version of it, which executed (for me) in 135 seconds:


    DELETE
    FROM CUSTOMERS
    WHERE ID IN
    (SELECT ID
    FROM
    (SELECT ID, LastName, FirstName,
    RANK() OVER (PARTITION BY LastName,
    FirstName ORDER BY ID) AS SeqNumber
    FROM Customers)
    WHERE SeqNumber > 1);
    You may notice that the last two statements rank all the rows in the table,
    which is inefficient. Let's improve the last SQL SELECT statement by applying
    RANK() only to the groups of duplicates instead of all rows.
    The following syntax is much more efficient, even though it's not as concise
    as the last SELECT ID above:

    SELECT ID, LastName, FirstName
    FROM
    (SELECT ID, LastName, FirstName,
    RANK() OVER (PARTITION BY LastName,
    FirstName ORDER BY ID) AS SeqNumber
    FROM
    (SELECT ID, LastName, FirstName
    FROM Customers
    WHERE (LastName, FirstName) IN
    (SELECT LastName, FirstName
    FROM Customers
    GROUP BY LastName, FirstName
    HAVING COUNT(*) > 1)))
    WHERE SeqNumber > 1;
    Counting the duplicates now took only 26 seconds, which amounted to a 67
    percent performance gain.
    Here is the improved SQL DELETE statement, which uses the improved SELECT
    statement as a subquery:

    DELETE
    FROM Customers
    WHERE ID IN
    (SELECT ID
    FROM
    (SELECT ID, LastName, FirstName,
    RANK() OVER (PARTITION BY LastName,
    FirstName ORDER BY ID) AS SeqNumber
    FROM
    (SELECT ID, LastName, FirstName
    FROM Customers
    WHERE (LastName, FirstName) IN
    (SELECT LastName, FirstName
    FROM Customers
    GROUP BY LastName, FirstName
    HAVING COUNT(*) > 1)))
    WHERE SeqNumber > 1);
    Now it took only 47 seconds to find and delete 45,000 duplicates from 500,000
    rows, compared to the 135 seconds it took in my first version of DELETE.
    That's a significant performance gain (65 percent).
    By comparison, the DeleteDuplicates stored procedure clocked in at 56 seconds,
    which is a little slower (19 percent) than just the SQL statement.

    Deleting Duplicate Rows When There's No Primary Key

    --------------------------------------------------------------------------------

    Although it's a sign of bad database design, you may have a table with no
    primary key. In that case, you can use this technique to delete duplicate
    rows. Read on...


    Replacing the PL/SQL stored procedure with my single SQL statement will get
    you much more concise code and may improve your performance because there
    is no overhead caused by the PL/SQL-to-SQL context switch in the stored procedure.
    However, the performance comparison results between the SQL statement and
    the PL/SQL procedure may vary, depending on the data set size and percentage
    of duplicates. I would expect the PL/SQL procedure to get faster or even
    outperform the SQL statement if the number of duplicates is relatively small—i.e.,
    1 to 3 percent of all rows in the table.

    What if your table doesn't have a primary key? You can use another technique
    as well (see sidebar).

    RANK()'s Additional Capabilities
    The RANK() function allows you to select the row per group of duplicates
    you want to keep. Let's say you need to keep the latest (or earliest) record
    determined by the value in the RecDate field. In this case you just need
    to include RecDate in the ORDER BY clause of RANK() in order to sort duplicates
    within each group by RecDate in DESCending (or ASCending) order, and then
    by ID.

    Here is the syntax for keeping the latest record per group:

    DELETE
    FROM Customers
    WHERE ID IN
    (SELECT ID
    FROM
    (SELECT ID, LastName, FirstName,
    RANK() OVER (PARTITION BY LastName,
    FirstName ORDER BY RecDate DESC, ID) AS SeqNumber
    FROM
    (SELECT ID, LastName, FirstName, RecDate
    FROM Customers
    WHERE (LastName, FirstName) IN
    (SELECT LastName, FirstName
    FROM Customers
    GROUP BY LastName, FirstName
    HAVING COUNT(*) > 1)))
    WHERE SeqNumber > 1);
    The flexibility of this technique also allows you to control how many rows
    per group you want to keep in the table. Let's say you have a database with
    promotional or rebate list information and you have common business conditions
    to enforce, such as "limit five entries per household" or "limit three rebates
    per person." By specifying the number of rows to keep (3) in the WHERE and
    HAVING clauses, your SELECT statement will do the job again and delete all
    excessive (more than 3) rebate entries per person:
    DELETE
    FROM Customers
    WHERE ID IN
    (SELECT ID
    FROM
    (SELECT ID, LastName, FirstName,
    RANK() OVER (PARTITION BY LastName,
    FirstName ORDER BY ID) AS SeqNumber
    FROM
    (SELECT ID, LastName, FirstName
    FROM Customers
    WHERE (LastName, FirstName) IN
    (SELECT LastName, FirstName
    FROM Customers
    GROUP BY LastName, FirstName
    HAVING COUNT(*) > 3)))
    WHERE SeqNumber > 3);
    As you can see, using the RANK() function allows you to eliminate duplicates
    in a single SQL statement and gives you more capabilities by extending the
    power of your queries.


Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
HTML5 Development Center
 
 
FAQ
Latest Articles
Java
.NET
XML
Database
Enterprise
Questions? Contact us.
C++
Web Development
Wireless
Latest Tips
Open Source


   Development Centers

   -- Android Development Center
   -- Cloud Development Project Center
   -- HTML5 Development Center
   -- Windows Mobile Development Center