[AccessD] New article

Jim Dettman jimdettman at verizon.net
Tue Oct 5 08:11:06 CDT 2010


Susan,

<<I'll get hate mail for the natural/surrogate key comments, but ... that's
Okay. >>

  Well you certainly have a mix of apples and oranges in there on point
#6<g>. 

  And not really to get into it all again, but I posted a link to an EE
article I wrote on the topic a while back.  Not sure if anyone was able to
read it or not, so I've pasted it in below.  It looses all the formatting,
so some of the points are not emphasized, but you'll get the drift.  And you
don't get to see the comments made on the article, which you would get
on-line.

  Actually, here's the link again:

http://www.experts-exchange.com/Microsoft/Development/MS_Access/A_2041-The-g
reat-PK-debate-Natural-Keys-vs-Surrogates-again.html

  and I would be interested to hear if non-members can read the article
on-line or not.

  The article basically sums up where we arrived at on this list many years
back in terms of discussing primary keys.

FWIW,
Jim.

The great PK debate: Natural Keys vs. Surrogates (again)

I titled this "The great PK debate: Natural Keys vs. Surrogates (again)",
because on almost any developer list or forum that deals with databases,
this topic always comes up at one point or another and usually with very
strong opinions on both sides.  What started me on this article was that
several days ago I tripped over an article on primary key's that was just
flat out wrong.  

After poking around a bit on the net I realized that over the years, a lot
of myths and misconceptions have grown up around this topic.  It seems that
as the years go by, more and more gets published and discussed on this
topic, but things only get cloudier.  For example, are auto number fields
really a surrogate key?  Can they function as a primary key?   I hope that
through this article I will be able to clarify and explain fully enough the
answers to questions such as those and hopefully, another great PK debate
will not ensue (or at least if it does, you'll have plenty of ammo for the
debate<g>).

When I was looking at articles out on the net, one thing that struck me
about almost everything I read is that not many started off on the right
foot. So the first thing to clear up is what relational theory actually is.
Relational theory (the big "R" as some call it), was developed by E.F.
(Tedd) Codd while working at IBM.  His first paper "Derivability,
Redundancy, and Consistency of Relations" was published in 1969 as a
research paper. While Codd was working on ways to store data in a data bank,
what he actually did is in a very scientific and mathematical way describe a
theory of data organization, based on a branch of mathematics that deals
with sets of data.  In a nutshell, it was the overall concept that data
could be modelled in a very logical and rigorous way. 

So R theory is not concerned with how data is physically stored nor does it
only apply to computer systems. What R theory is concerned with and only
with is the logical organization of data in order to provide information. To
make that clearer, I can apply relational theory to data kept on a chalk
board, with sticky notes on a wall, or even with pen and paper. With R
theory, it is the meaning of the data and how it is organized that is
important and nothing else.  As you read on, it is imperative that you keep
this distinction in mind. 

That brings us to Misconception #1; relational theory does not exist because
of databases; it is something that is applied to them. And, Misconception
#2; R theory deals with the relationships between your tables.  

To a certain extent R Theory is applied to the database to deal with the
relationships between the tables, well, when you discuss normalization,
which we won't get into here. More fundamentally it applies to something
else and is talking about something else, which is a relation, and that is
why most articles start off on the wrong foot, failing to recognise the
relation.

A relation represents a set of data where the given set all pertains to the
same type of "thing" or entity.  In a relation, the data is laid out in rows
and columns.  The columns, each which represent an attribute of the entity
(describes it in some way), and the rows (which are called tupples)
represent a specific instance of a set of attributes in the relation.  Now
if you stripped away all the mumbo jumbo technical terms and use database
terms instead, you would see that I have just described a table.  The
attributes are the fields and the tupples are the records. 

So let's start now with where most other articles pick up and that is with
the classic example of a customer list.  Since our relation is about
customers and for the sake of simplicity, we'll stick to the following
attributes in the relation:

Name
Address
City
State
Zip
Phone Number

One of the fundamental things I didn't mention above about forming a
relation is that each tupple needs to be unique. If by combining all the
attributes, you cannot uniquely identify a given tupple, then you don't have
enough attributes in your relation. It is the combination of one or more
attributes, which will uniquely identify a tupple that will become the
primary key. Can you have more than one primary key? No!  Can you have more
than one possible primary key?  Certainly and these are called candidate
keys, one of which will be designated as the primary key.

Each grouping of one or more attributes that can uniquely identify a tupple
is called a super key.  There are a multitude of super keys in the above and
going to the extreme, would be combining every attribute to form a key.

But when you work with data, there are a few attributes that a primary key
should have:

1. It should be as stable as possible.
2. It should be as minimal as possible.
3. It should be as familiar as possible.

Given that, we certainly would not want to use most of the super keys in a
relation.  Keeping the above in mind, let's take a look at our example.
Would name alone suffice to uniquely identify each row?  Probably not as you
could easily have "ABC Company" in two different cities.  How about Name and
Address?  Well that would be better, but might still not be unique enough.
You might have two "ABC Company" on "33 Main street", but in different
cities.  What about combining Name, Address, City, and State?  That might
work.  How about Name, Address, and Zip?   Possibly.  What about the phone
number by itself?  Yes, that would probably work too.

So we have three candidate keys, but which one should be the primary key?
Again, looking at attributes a good primary key should have, phone number is
probably the best bet.  Phone numbers typically don't change unless you move
so it would be fairly stable and it certainly is shorter then using either
of the other candidate two keys.  Last, it should be familiar to anyone that
works there. But cannot the phone number change?   Sure and most would say
that this means that it cannot be a primary key!  

Which of course brings us to Misconception #3; Primary keys can never
change. Primary keys can and will change; companies move, phone numbers can
change, etc.  There is nothing within relational theory that says a primary
key cannot change.  Again, keep in mind that still we are talking about the
logical representation of data.  We have not moved on to how it is
physically stored or issues with that.

Up to this point, we've been discussing what are commonly referred to as
Natural Keys.  That is the key is derived from the existing attributes.

Early database designs used the method above to choose a key for a table.
But rather quickly, it was discovered that the computer systems we had could
not keep up performance wise as the keys had a tendency to become fairly
long (when forming joins between relations).  This is the point where
surrogate keys came into use.   Many believe that a surrogate key should be
meaningless and have no connection with the data in the row.  This partly
came about because of data warehousing.   When warehousing data, it becomes
imperative that a key assigned to a row never changes. Some database
designers even go to the point of saying that a meaningless surrogate should
never be displayed to the user.  But does a meaningless number used as a
surrogate work as a primary key?  Is it really a surrogate?  Let's look at
an example where the company has a customer table that looks like this:

tblCustomers
CustID - Identity  - Primary Key
Name - Text 
Address - Text
City - Text
State - Text
Zip - Text
Phone Number - Text

So our CustID attribute is now just a number that goes up by one for each
new record.  It's never given to the customer and it does uniquely identify
each row in the table.   Now imagine that I'm a customer calling to place an
order:

  Customer:   Hi, I'm calling to place an order.
  Sales Rep:   Have you done business with us before?
  Customer:   Yes
  Sales Rep:     Great.  I can just pick you from the list..uh, what's the
company name?
  Customer:      ABC Company
  Sales Rep:     Oh...well I have six companies listed with that
name...where are you located?
  Customer:   New York City, NY
  Sales Rep:     Wow! Believe it or not, there are three ABC Companies in
NYC!...what's the address?
  Customer:   7th avenue and 28th street
  Sales Rep:   OK great...this must be you... is your phone number
210-699-9999?
  Customer:   Yes
  Sales Rep:     OK, what did you want to order?

You've just seen in action the difference between a true primary key and one
that is not even though it is labelled as a "primary key" in the table
design.  While CustID does serve to identify the record uniquely within the
relation physically, it does not serve to identify a specific customer
within the relation logically.  In order to do that, the sales rep used the
natural attributes of customers to ensure that the correct customer was
being chosen.  The attributes he used would have been a super key, possibly
a candidate key, the real primary key, but it was not the "primary key" key
that is currently in the table design.

"Surrogate" means "to take the place of", but as we have just seen, a
meaningless key does not take the place of a primary key from a logical
point of view.  Yes, it does uniquely identify a row in a table, but only in
a physical sense, not in terms of the data.  And how could it, since it has
no connection with the data in that row. 

Which brings us to Misconception #4; A meaningless identity or auto number
column can serve as a primary key.   A meaningless key is simply a tag or
pointer, but in regards to relational theory, it cannot be a primary key.

So is there anything that could be called a true surrogate key?  Well what
if we took that meaningless CustID and handed it to the customer when they
first started doing business with us?  In doing that, we would give it
meaning.  It is now known to us and the customer and would never be given to
another customer.  We could now use it to identify a customer uniquely in a
logical context, so yes, it would serve as a primary key even though it is
an artificial (non-natural) attribute.  A subtle difference to be sure, but
it does make a difference.

I can see the question now; Ok the above is all well and good and I now
understand the differences, but where does that leave me in developing
applications?

Well first, like 99% of the developers out there at this point (including
myself), you're going to use meaningless keys (note that I did not call it a
surrogate though<g>) in your databases.   But with understanding the above,
you also now realize that it might mean:

1. You might need to maintain additional indexes and/or code to form a
constraint based on a primary key - how does your database ensure that you
don't have a customer entered twice?
2. You might want to make changes in your user interface - How can I present
data to the user as efficiently as possible in order to uniquely identify a
specific instance of something?
3. Are there places where a surrogate key can be used?
4. Are there places where I might not want to use a meaningless key?

By asking yourself these questions and with the understanding gained from
the above, you now should be able to answer them.  For example, if we take
number four, there is one place where I think it is just down right silly to
add a meaningless key; that is in a many to many linking table.

Given:

tblBooks 
BookID - Identity - PK
BookTitle - Text
ISBNNumber - Text

tblAuthors
AuthorID - Identity - PK
LastName - Text
FirstName - Text

And that many authors can author more then one book and a book can have one
or more authors, then you use a linking table to form the many to many
relationship like this:

tblAuthorsAndBooks - One record per Author / Book combination
AuthorBookID - Identity - PK
AuthorID - Long - CK1-A
BookID - Long - CK1-B 

"CK" stands for candidate key.  Looking at this, including AuthorBookID just
for the sake of having a meaningless key is a waste.  The AuthorID/BookID
combination must be unique, so we need a constraint (usually done with an
index) on it anyway.   The table should look like this:

tblAuthorsAndBooks - One record per Author / Book combination
AuthorID - Long - PK-A
BookID - Long - PK-B

Another place I think it is a waste to add a meaningless key is in a simple
lookup table which has a code and a description as the only attributes:

tblCreditCodes
Code - Text - PK
Description - Text

The Code in of itself is the primary key and can be used as the key.  So why
would you want to do this:

tblCreditCodes
CreditID - Identity - PK
Code - Text - CK1
Description - Text

Well one reason mentioned for using surrogates or meaningless keys was for
performance.  If this was going to be a large table and the Code attribute
was bigger then a long (4 bytes), then it might make sense to use an
identity field in the table.    But given that, then maybe I should change
my user interface and just present a Description and have a table like this:

tblCreditCodes
CreditID - Identity - PK
Description - Text

Again though realizing I will need to maintain an additional index on
Description so that I can't enter the same thing twice.  As you can see, the
important point about all this is that by understanding the underlying
concepts vs the real world challenges that we all face can in the end help
you design and develop better databases and applications.

In conclusion, I hope you found this article enlightening or at the very
least given the subject matter and your judgment of my points, entertaining.
Comments are more then welcome.
 

-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Susan Harkins
Sent: Monday, October 04, 2010 5:25 PM
To: Access Developers discussion and problem solving
Subject: [AccessD] New article

http://blogs.techrepublic.com.com/msoffice/?p=3904&tag=leftCol;post-3904

I'll get hate mail for the natural/surrogate key comments, but ... that's
Okay. 

Susan H. 
-- 
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com





More information about the AccessD mailing list