How To Protect A Database From Duplicate Data


Have you ever seen the kitchen section of a fast-food restaurant? If you have, I can assure you that the phrase "order out of chaos" has taken on a whole new meaning for you. Regardless of how high-end a restaurant is, there's a lot going on in the kitchen to make sure you don't spend too long complaining about how slow your order is 😂.


SIMILARITIES BETWEEN SOFTWARE ENGINEERING AND RESTAURANTS

Restaurants are a great way to explain the frontend, the backend, and APIs (as I've done before), and they're also a great way to illustrate what the internet is like. Underneath sleek applications and beautiful websites are countless backend processes running non-stop, quietly making sure the show goes on.

Because of all this activity, from worker services to other applications that run constantly, there's a real chance that duplicate data makes it into the database. Duplicate data is something software engineers worry about a lot. There are plenty of implementations that guard against it; however, there are those like me who feel duplicate data should be stopped at the very place it's sent to... the database itself.


A RELATABLE EXPLANATION

The best way to describe a database is to imagine an Excel sheet. Each column represents a data property: your name, your email address, and your other registration details each get a column of their own. This means that when you search for or modify data, you'll most likely query the database using a column name as one of the query details.
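As a quick sketch (the Users table and column names here are made up for illustration), querying and updating by column name looks something like this:

-- Look up a row by filtering on the emailAddress column
SELECT name, emailAddress
FROM Users
WHERE emailAddress = 'jane@example.com';

-- Modify a row the same way, using a column value as the filter
UPDATE Users
SET name = 'Jane Doe'
WHERE emailAddress = 'jane@example.com';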

In a database (a data table, to be more precise) we have what we call unique keys: values that are unique to a specific individual or row of data. For example, while your name isn't unique to just you, your email address, user ID, and so on are. We use unique keys to prevent duplicate data because every row belonging to a particular individual or item carries the same unique key, and unique keys identify very specific records. The unique key is then used as a parameter to instruct the database to reject any new value that carries a key it has already seen.
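Before we get to stored procedures, it's worth noting that you can also ask the database engine to enforce uniqueness directly when the table is created. A minimal sketch (again with made-up table and column names) might look like this:

-- The PRIMARY KEY on id and the UNIQUE constraint on emailAddress
-- tell the database to reject any new row that repeats either value.
CREATE TABLE Users (
    id           uniqueidentifier PRIMARY KEY,
    name         varchar(50),
    emailAddress varchar(50) UNIQUE,
    comment      varchar(1000)
);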

The SQL code snippet below illustrates how to guard against duplicate data in a SQL table using a stored procedure.


A SAMPLE IMPLEMENTATION

CREATE PROCEDURE [dbo].[InsertDemoRecord]
-- The parameters below match the columns on the data table.
-- Note that I just made these names up, but they're not much different from the columns on a real data table.
    @id uniqueidentifier,
    @name varchar(50),
    @comment varchar(1000),
    @emailAddress varchar(50)
AS
BEGIN
    SET NOCOUNT ON

    -- "[dbo].[DemoDatabase]" is just a data table I made up; put the name of the data table you're working on here.
    -- The WHERE clause below is where you specify the unique identifier that is used as the filter for duplicate values.
    IF NOT EXISTS (SELECT 1 FROM [dbo].[DemoDatabase] WHERE id = @id)
    BEGIN
        -- This inserts the parameter values in the second bracket into the data columns listed in the first.
        INSERT INTO [dbo].[DemoDatabase] (id, name, comment, emailAddress)
        VALUES (@id, @name, @comment, @emailAddress)

        -- Return the key of the row that was just inserted so the caller knows the insert happened.
        SELECT @id AS InsertedId
    END
END
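As a rough usage sketch (using the made-up procedure and table names above), calling the procedure twice with the same id shows the duplicate filter in action: the first call inserts the row, the second is silently skipped.

DECLARE @newId uniqueidentifier = NEWID();

-- First call: no row with this id exists yet, so the insert runs.
EXEC [dbo].[InsertDemoRecord]
    @id = @newId,
    @name = 'Jane Doe',
    @comment = 'First comment',
    @emailAddress = 'jane@example.com';

-- Second call with the same id: the IF NOT EXISTS check fails, so nothing is inserted.
EXEC [dbo].[InsertDemoRecord]
    @id = @newId,
    @name = 'Jane Doe',
    @comment = 'Duplicate attempt',
    @emailAddress = 'jane@example.com';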


FINALLY

Ultimately, the ability to guard a database and keep the data inside it as clean as possible is part of what separates great tech companies from tech companies that struggle. Data is everything nowadays, and usable data is invaluable. Duplicate data might seem harmless on a small scale, but when you're dealing with millions of records it will cost you a lot in storage space, servers, and the time spent cleaning it up. And that's if you're lucky, and the duplicates don't lead to lost or mismatched customer data in a way that bleeds cash.