Wednesday, February 29, 2012

The Fountain Of Youth?

The office is empty today due to the 4 inches of slush that we received overnight.  Hardy Minnesotans indeed!

To entertain myself, I've been sitting here pondering the question - if you were born on February 29th, do you age slower than the rest of us?  With the help of some T-SQL and a virtual numbers table, I have found conclusive proof that this is true.  You do indeed age slower than normal people.

For example, let's assume you were born on February 28th, 1964.  You would have turned 48 yesterday.  We can verify that with this statement:

DECLARE @DateOfBirth DATE = '02/28/1964';

WITH Years_CTE (YearNum)
AS
 (
  SELECT YearNum = 1
  UNION ALL
  SELECT Years_CTE.YearNum + 1
  FROM Years_CTE
  WHERE Years_CTE.YearNum < 100
 )
SELECT SUM(CASE WHEN DATEADD(YEAR, Years_CTE.YearNum, @DateOfBirth) <= GETDATE() AND DATEPART(DAY, DATEADD(YEAR, Years_CTE.YearNum, @DateOfBirth)) = DATEPART(DAY, @DateOfBirth) THEN 1 ELSE 0 END)
FROM Years_CTE

What if you're one of those weirdos who was born on February 29th of that same year?  You, my friend, just turned 12 today.  This we can prove with the same statement as before:

DECLARE @DateOfBirth DATE = '02/29/1964';

WITH Years_CTE (YearNum)
AS
 (
  SELECT YearNum = 1
  UNION ALL
  SELECT Years_CTE.YearNum + 1
  FROM Years_CTE
  WHERE Years_CTE.YearNum < 100
 )
SELECT SUM(CASE WHEN DATEADD(YEAR, Years_CTE.YearNum, @DateOfBirth) <= GETDATE() AND DATEPART(DAY, DATEADD(YEAR, Years_CTE.YearNum, @DateOfBirth)) = DATEPART(DAY, @DateOfBirth) THEN 1 ELSE 0 END)
FROM Years_CTE

We have thus proven that leap-year babies do indeed age slower.

Go forth, pre-teen, you have your whole life ahead of you.  Why are you wasting it reading this blog?

Monday, February 20, 2012

20 Abuses: SELECT * - The Lazy Man's Performance Killer

If you've been doing T-SQL development for a while, or read any good T-SQL books or blogs, you've likely seen the advice/warnings/best practices that say not to use "SELECT *" in your code.  It's been said a million times in a million different places.  If you're already in the know, there's nothing for you to see here.  If, on the other hand, you're one of the inexperienced people out there, new to T-SQL, the people that this "20 Abuses" series is aimed at, keep reading.

Let's start by clarifying just what SELECT * does.  Quite simply, it returns the entire columnset from a table.  It's an easy way to say "I want the whole nut, the whole enchilada, I'm not interested in being polite or conservative, just let me have everything!".  That's fine for an ad-hoc query, against a system that isn't busy, someplace where a momentary performance hit won't matter.  Everywhere else, you need to be careful and understand the unintended side-effects that you might be causing.  The last thing you want to do is choke your production system because you got greedy.

First and foremost, SELECT * can be dangerous.  Using it carelessly can lead to incorrect data being returned, from a view in particular.  I've covered that little problem in a separate post.  We're not going to rehash that here  Here, we're going to focus on the performance implications, specifically when it comes to indexing.  My examples are based on AdventureWorks, so if you have it installed, you can follow along, or just rely on my screenshots.

We all know what indexes are for.  They exist to make it easy for SQL Server to locate rows in a table that meet the criteria that you've specified.  The ideal index will contain all of the columns that your query needs, both for filtering and for returning as part of the resultset.  The problem with SELECT * is that you're asking for every column.  It's unlikely that any index (aside from the clustered index, i.e. the table itself) will contain every column.  It can be done, but it's terribly inefficient, wastes space, and probably indicates a table that isn't properly clustered to begin with.

Take a look at the indexes that are defined on the Sales.SalesOrderHeader table:


We can see that there is an index on the SalesPersonID column.  Assuming a simple query that filters on SalesPersonID, this index should be used to quickly zero in on the rows that we're looking for.  Let's try an example to see for sure.  Let's look for all of the SalesOrderID values for sales rep 277.

SELECT SalesOrderID
FROM Sales.SalesOrderHeader
WHERE SalesPersonID = 277

We can see from the query plan that the index was used, and was the target of an efficient index seek.


Watch what happens though if we make one tiny little change.  We get lazy and decide that typing out the name(s) of our desired columns is too hard.

SELECT *
FROM Sales.SalesOrderHeader
WHERE SalesPersonID = 277

This query produces a very different query plan.  In fact, compared to the plan from the first query, this one is 99 TIMES more expensive.  Ouch!


If that's not bad enough, let's compare the I/O demands of the two queries.


The second query, the one where we got lazy, generates a whopping 228 times the reads of the first query.  This example is against a small table.  Imagine the impact on your production system when run against a table with 40 million rows in it.  You could be looking at a real performance killer here.

As I said earlier, this has been covered a million different ways.  There is no shortage of expert advice out there telling you to avoid the use of SELECT *.  I'm not going to attempt to rehash it all.  My goal here was simply to show a clear example of the performance impact that can result from its use.

Don't be lazy.  Keep your resultsets narrow, grab only the columns that you really need.  Your users will thank you, your DBA will thank you, and your system will thank you.

Friday, February 17, 2012

20 Abuses: Procedurally Going In Circles

The past few years have seen the rise of an strange beast on Minnesota's roads - the roundabout.  If you've done much driving in Minnesota, particularly in the Twin Cities, you can imagine that some of these drivers are absolutely flummoxed (I've always want to use that word in a blog post) by these creations.  Some folks have reportedly gotten confused and spent days driving round-and-round the same roundabout, unable to figure out how to exit.  It's not that a roundabout is hard to navigate, it's quite simple to get from one side to the other.  Some people just can't resist the urge to go in circles.  Sort of like some T-SQL developers.  Usually the people who grew up writing code using procedural languages who are now venturing into database-land for the first time.

There's nothing inherently wrong with procedural coding, it is in fact quite appropriate for certain cases.  It usually gets the job done, and is easily readable by most anyone.  The issue is one of efficiency.

Think of your data as a pile of dirt.  Your job is to move that pile from one place to another, from a table or set of tables into a report.  There are a variety of tools to choose from for moving a pile of dirt.  If the pile is small, you might choose a shovel, and that might actually be the best tool for the job.  If the pile is more substantial, you might want to use a different tool.  Something big and yellow, capable of moving the entire pile of dirt in one pass.  You could, of course, stick with the shovel and follow the process of bend/scoop/turn/throw/bend/scoop/turn/throw, moving the pile one scoop at a time, but you're going to get sweaty and dirty, and you'll be sore in the morning.

What does any of this have to do with SQL Server?  Patience, young data-juggler, I'm going to show you.  If you have a copy of AdventureWorks available, you can try the following examples yourself.  If not, just follow along with my screenshots.

Let's propose that we've been asked to produce a report, showing all sales reps with a quote greater than $250,000, and for each of those reps, the total dollar amount for shipped items during the first half of 2008.  Sticking with our dirty metaphor and the notion of going in circles, let's look at a couple of different approaches for producing this report.

For the first approach, we're going to use a shovel.  We're going to dig into our pile of data, removing one row at a time, turning, then throwing that row into our process.  We're going to repeat this process (going in circles) until we've reached the bottom of the pile.  We start by pulling out the sales reps that we're interested in, those with a quote greater than $250,000.  We turn, then "throw" those sales reps into a temp table.  Then we start going in circles.  For each of those sales reps that we threw into that temp table, we're going to scoop up the sales information for that rep, total up the amount, and then throw that total into the temp table.  Once we find out way out of the circle, we're going to display the contents of the temp table.

-- A procedural method
CREATE TABLE #SalesPersonShipTotals(SalesPersonID INT, SalesQuota MONEY, TotalShipped MONEY);

INSERT INTO #SalesPersonShipTotals (SalesPersonID, SalesQuota)
SELECT BusinessEntityID, SalesQuota FROM Sales.SalesPerson WHERE SalesQuota > 250000;

DECLARE @SalesPersonID INT;
DECLARE @TotalShipped MONEY;

DECLARE SalesPersonCursor CURSOR FOR SELECT SalesPersonID FROM #SalesPersonShipTotals;

OPEN SalesPersonCursor;

FETCH NEXT FROM SalesPersonCursor INTO @SalesPersonID;

WHILE @@FETCH_STATUS = 0
 BEGIN
 
 SELECT @TotalShipped = SUM(SubTotal) FROM Sales.SalesOrderHeader 
 WHERE SalesPersonID = @SalesPersonID AND ShipDate >= '2008-01-01' AND ShipDate < '2008-07-01';

 UPDATE #SalesPersonShipTotals SET TotalShipped = @TotalShipped WHERE SalesPersonID = @SalesPersonID;
 
 FETCH NEXT FROM SalesPersonCursor INTO @SalesPersonID;

 END

CLOSE SalesPersonCursor;
DEALLOCATE SalesPersonCursor;

SELECT SalesPersonID, SalesQuota, TotalShipped FROM #SalesPersonShipTotals ORDER BY SalesPersonID;

Make sense?  Good.  If you're thinking "I wouldn't have done it that way", even better, you're a step ahead of me.

A real man, the guy driving that big yellow machine, would take a completely different tact.  Hopefully you and your shovel are out of the way.  This guy is going to come in, scoop up the entire pile of data in one pass, ready to dump wherever you want.

-- A set-based method
SELECT
 SalesPersonID  = SalesPerson.BusinessEntityID,
 SalesQuota   = SalesPerson.SalesQuota,
 TotalShipped  = SUM(SalesOrderHeader.SubTotal)
FROM Sales.SalesPerson
INNER JOIN Sales.SalesOrderHeader
 ON SalesPerson.BusinessEntityID = SalesOrderHeader.SalesPersonID
WHERE SalesPerson.SalesQuota > 250000
 AND SalesOrderHeader.ShipDate >= '2008-01-01'
 AND SalesOrderHeader.ShipDate < '2008-07-01'
GROUP BY SalesPerson.BusinessEntityID, SalesPerson.SalesQuota
ORDER BY SalesPerson.BusinessEntityID;

This is nothing extraordinary, just a couple of joins and an aggregate.  Basic T-SQL stuff.  Both methods return the same results:


Some of you are scoffing at this point, thinking "If both methods return the correct results, why do I care which one to use?  What's the difference?".  I won't tell your DBA that you said that.  I will show you what the difference is.

The second method, the one using the joins, produces a nice, clean query plan.  A couple of fast index seeks piped through a couple other steps, pretty straightforward.  Nobody will complain about a query plan like this one.



The first method produces a much different query plan:


I won't bother explaining it, the picture says it all.  You can clearly see that we went in circles a few times, running the same steps repeatedly.  Wasteful and inefficient.

Let's go a little deeper and look at what this means in terms of server I/O.  Lots of I/O generally doesn't translate into good performance.  The second method, using joins and aggregation, hits three tables, resulting in a trivial number of reads against the database:


When we look at the I/O stats for the first method, we can again clearly see that we're repeating some of the same work over and over again.  There's a lot more I/O generated, lots of repeated hits against the same tables.  Bend, scoop, turn, throw, bend, scoop, turn, throw...


Have I made my point?  I hope so.  If it's not clear, that point is "always think in terms of sets".  Always try to process an entire data set as a whole instead of repeating work over and over again.  Your DBA's will appreciate it, your end-users will appreciate it, and your code will be more efficient and faster!


Wednesday, February 15, 2012

Linked Up With LinkedIn?

The Meme15 assignment for February, put forth by Jason Strate (blog | @StrateSQL),  is "Why and how do you use LinkedIn?"

Hmmmm....  Good question.  How DO I use LinkedIn?

- I advertise new blog posts on LinkedIn - same as I do on Twitter and Google+

- I connect with friends, coworkers, and other acquaintances - same as I do on Twitter, Facebook, and Google+

That's it.  I can't think of another way in which I use LinkedIn.  Is it really just another Twitter or Facebook?  Not quite.  Some of the things that I might post on Facebook, I would never dream of posting on LinkedIn.  No dumb jokes, no random "thought just popped into my head" status updates, no photos.  I'm more discerning about what I put there.  Why?  Why do I care to treat LinkedIn differently?

I think the simplest explanation is that LinkedIn is my professional Internet presence.  Sure, I have this blog, but there is a certain "character" that I'm after with this blog.  It's mine, the rules are mine, and this is where I get to unleash some of my sarcasm while still staying somewhat professionally relevant.  The blog is not a resume'.  You can't determine my work history by reading the blog, at least not entirely.  You can't determine who I know or who I've worked with by reading the blog.  Those things come from LinkedIn.

LinkedIn also makes it easy for me to keep track of where my associates are working.  If I find out that my company is hiring a network admin, and I worked with a guy five years ago who was a great network admin, I might look him up through LinkedIn and let him know about the opening.  It keeps me passively connected with people from my professional past.  You won't find my mother in my LinkedIn connections, or that guy from high school 25 years ago who now has as much in common with me as a turnip.  Those people are confined to Facebook, where there is very little overlap with my professional life.

Someday, I suppose, LinkedIn might prove useful in finding a job, but I can't see that far over the horizon.  I have no plans to leave my current employer, I'm quite content where I am.  That doesn't stop the recruiters from finding me on LinkedIn and dropping an email or a phone call to let me know about the "greatest opportunity ever", but right now, I'm not interested.

Sunday, February 5, 2012

Time Flies

Ever notice how quickly time gets away?  One minute, you have all the time in the world.  The next, you're wondering where it all went.  Happens with kids, life, hobbies, taxes, weekends, holidays, and it happens with work.  We've all been there, willing the clock to slow down so that we can complete that project that's due tomorrow.

The #MemeMonday topic for this month is "working against deadlines", and the assignment is to share experiences, stories, tips, etc., about getting the work done on time.  As the supreme being that I am, I never find myself in the position of racing the clock, but I'll see if I can make up enough stuff to fill up a blog post.  I spent a year doing the work of four DBA's (long story for another day) in a 24x7 operation, I know a thing or two about time pressures.

First rule, and I find this gets easier as you become older more experienced, is to try to avoid getting behind in the first place.  It's not always possible, due to circumstances beyond your control.  Part of effective time management is learning to anticipate those circumstances and how to mitigate them.  As an experienced DBA, you'll begin to get a feel for the "big picture", and what unforeseen tasks may surface as part of a project.  Application upgrade coming at the end of the month?  We're going to need a database backup, maybe some additional database space allocated, maybe it would be a good time to do some database maintenance that you haven't been able to get done.

Second rule is to minimize distractions.  This can be difficult in an office setting - some people simply don't grasp the concept of concentration, nor do they realize how disruptive that "quick question" is to someone trying to write code.  We have sort of an unwritten rule in my office.  If someone is wearing headphones and has a chair blocking the entrance to their cube, it means "DO NOT DISTURB!".

This works well for 95% of the office population, but there are a select few who don't get it and will disturb anyway.  I've even had one fellow move the chair and tap me on the shoulder because I was ignoring him.  If something like this won't fly in your workplace, remove yourself altogether.  Find an empty office to work in, or an unoccupied cube far away from your own, someplace nobody will know to look for you.  Better yet, work from home.  Turn off the IM program, close Outlook, turn off the phone.  Speaking of Outlook - if you send me an email, there's NO NEED to walk over and tell me that you just sent me an email.  That's a pet peeve of mine, as anyone who works closely with me knows already.

Should the first two rules be broken, the third and final rule is to do whatever it takes.  Sometimes working late, all night, maybe all weekend, simply can't be avoided.  Nobody wants to do it.  In my role (always on call), after-hours work is typically reactionary, something broke or is running poorly.  The weekends of our monthly production deployments usually involve me in some fashion.  I don't often find myself working after-hours to meet a deadline, but it HAS happened before.  You do what you have to do.  I'm fortunate to work in an environment where the need for going "above and beyond" is recognized as necessary, but is offset by having an extremely flexible work schedule.  Working from home is extremely common within my group, and rarely challenged.  We are permitted to work whatever schedule we prefer - myself, I'm in the office by 6:00am and home by 4:00pm most days.  A fitting trade in my opinion.

There you have it, my take on working against deadlines.  Some people won't agree with me, I can live with that.  If you're part of the "don't bother me after 5:00pm" crowd, never fear, the rest of us have you covered.


Thursday, February 2, 2012

20 Abuses: SELECT * - Doom With A View

Views are great.  We all love views, right?  Don't like the looks of that query with 28 tables all joined together?  Hide the joins in a view, then your query becomes much simpler.  Have fun figuring out that query plan a year from now, after you've forgotten what you've done.  Views are excellent tools for creating complex performance issues.

Views can also be downright dangerous if used improperly.  Want to see how?

Let's create a table:

CREATE TABLE Employees
 (
  EmployeeID   INT    NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  EmployeeName  VARCHAR(255) NOT NULL,
  Address1   VARCHAR(255) NOT NULL,
  Address2   VARCHAR(255) NULL,
  City    VARCHAR(25)  NOT NULL,
  State    CHAR(2)   NOT NULL,
  ZipCode    CHAR(5)   NOT NULL
 );

Feed it some data:

INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Charlie Brown', '123 Rice Street', NULL, 'St. Paul', 'MN', '55122');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Sheldon Cooper', '2455 Orange Court Drive', NULL, 'Pasadena', 'CA', '91051');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Awesome Bill', '14 Swamp Road', NULL, 'Dawsonville', 'GA', '30534');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Stephen King', '19 Turtleback Lane', NULL, 'Lovell', 'ME', '04051');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Al Bundy', '9764 Jeopardy Lane', NULL, 'Chicago', 'IL', '60623');











Then create a view on the table:

CREATE VIEW EmployeesView
AS
SELECT *
FROM Employees;

Simple enough.  Why would you do this in real life?  I don't know, but it happens.  Let's also overlook the poor design of the table, this is solely for purposes of demonstration.  You wouldn't design a name/address table quite like this in real life.

So, we have our view now, and somebody, maybe you, creates a report or maybe an Intranet page, meant to serve as an employee directory.  This report queries the view, pulling out the employee name, using a query like this one:

SELECT EmployeeName FROM EmployeesView;















Easy enough, this isn't rocket science.  This report is made available for public consumption, and soon becomes a vital tool to the company.  Congratulations, you've made a contribution.  Life is good.

Until....

The decision is made to change the design of the Employees table.  Perhaps you've hired a new data architect or DBA who has pointed out some flaws.  Maybe you've just come to realize the error of your ways, and are trying to fix things now.  The point is, the table structure is changing.  The current "EmployeeName" column is going to be split into separate columns for FirstName and LastName, as it should have been in the first place.  A script is prepared that will make the necessary DDL and DML changes.

ALTER TABLE Employees
 ADD FirstName VARCHAR(255), LastName VARCHAR(255);

UPDATE Employees
SET
 FirstName = SUBSTRING(EmployeeName, 1, CHARINDEX(' ', EmployeeName) - 1),
 LastName = SUBSTRING(EmployeeName, CHARINDEX(' ', EmployeeName) + 1, 255);

ALTER TABLE Employees
 DROP COLUMN EmployeeName;









This goes to production on Saturday night as part of your monthly deployment outage.  You haven't a care in the world.  Your automated tests didn't report any problems, so everything must be OK.  You've forgotten all about that "vital" employee directory that made you famous.  That changes around 6:00am Sunday morning, when you are given a very unexpected reminder, in the form of a phone call.  On the other end of that call is your on-call tech, who has been fielding angry phone calls all night long.  The employee directory is not working correctly, and people are unhappy.  Confused, you take a look at the directory yourself, and sure enough, it's broken.  Instead of a list of employee names, you're seeing a list of street addresses.  What happened?

You dig in, find the query that the report is using, which hasn't changed::

SELECT EmployeeName FROM EmployeesView;

You run that query in Management Studio, and adding to your confusion, you see this:












You've just been bitten by "SELECT *".  When you originally defined the view, you didn't want to type out the column list, so you took the easy way out and used "*" as the column list.  I won't call you lazy, but...

The problem with this is that view, when declared in this way, doesn't care about the column names in the underlying table.  It knows that there were 7 columns in the original table, so it "reserves" a spot for 7 columns within it's output.  It remembers the column names as they existed at creation, and their ordinal position, and that's all it cares about.  When you select from the view, it returns "table column 1" as "output column 1", with the name that it remembers from its original creation.  In the case of this example, "EmployeeName" was the second ordinal column in the table.  Until this view is regenerated, it will always return the second ordinal column of the table as "EmployeeName".

How do we fix this and prevent it from happening again?

We drop the view and recreate it, specifying exactly the columns that we want to be included in the view.  If you want every column returned, list every column in the column list.  This will not prevent the underlying table from changing, but it WILL cause an error to be generated when you next query the view.  This would have made you aware of the error before this code made it to production.

Want proof?  Let's start over.

Drop the view and the table:

DROP VIEW EmployeesView;
DROP TABLE Employees;

Create the table:

CREATE TABLE Employees
 (
  EmployeeID   INT    NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  EmployeeName  VARCHAR(255) NOT NULL,
  Address1   VARCHAR(255) NOT NULL,
  Address2   VARCHAR(255) NULL,
  City    VARCHAR(25)  NOT NULL,
  State    CHAR(2)   NOT NULL,
  ZipCode    CHAR(5)   NOT NULL
 );

Fill it with data:

INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Charlie Brown', '123 Rice Street', NULL, 'St. Paul', 'MN', '55122');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Sheldon Cooper', '2455 Orange Court Drive', NULL, 'Pasadena', 'CA', '91051');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Awesome Bill', '14 Swamp Road', NULL, 'Dawsonville', 'GA', '30534');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Stephen King', '19 Turtleback Lane', NULL, 'Lovell', 'ME', '04051');
INSERT INTO Employees (EmployeeName, Address1, Address2, City, State, ZipCode)
 VALUES ('Al Bundy', '9764 Jeopardy Lane', NULL, 'Chicago', 'IL', '60623');









Create the view, PROPERLY this time:

CREATE VIEW EmployeesView
AS
SELECT EmployeeName
FROM Employees;

Quick test, just to make sure:

SELECT EmployeeName FROM EmployeesView;















Repeat the DDL/DML changes:

ALTER TABLE Employees
 ADD FirstName VARCHAR(255), LastName VARCHAR(255);
GO

UPDATE Employees
SET
 FirstName = SUBSTRING(EmployeeName, 1, CHARINDEX(' ', EmployeeName) - 1),
 LastName = SUBSTRING(EmployeeName, CHARINDEX(' ', EmployeeName) + 1, 255);
GO

ALTER TABLE Employees
 DROP COLUMN EmployeeName;
GO









How's the view look now?

SELECT EmployeeName FROM EmployeesView;






Oh, look, now we're getting an error.  Not good, but much better than "phantom" data.  This will show up in testing, hopefully, steering you away from your impending doom.  For now...