I'm awaiting the availability of an updated .NET/.COM zone file for performance demonstration purposes (e.g. many of the samples for part III use the whole of the .COM/.NET DNS directories as performance samples). This is public data that people can replicate themselves, rather than confidential internal or client data, or manufactured data, so I thought it a good foundation.
I hope to finish up this series in the next couple of days.
While I remain committed to SQL Server, after a lengthy evaluation and feature analysis process, we will start advocating MySQL 5 to some clients under a set number of conditions, both on the Windows and the Linux platforms. While MySQL was previously lacking some critical foundational features, v5 ups the ante considerably, filling a particular niche in the solution spectrum. This is a part of a wider trend at yafla of embracing some of the "alternative" platforms, outside of our normal Microsoft-enabled comfort zone, where it benefits our customers and the solution.
Due to licensing conditions we still won't be targeting our applications at MySQL.
We still advocate PostgreSQL, another excellent open source RDBMS, for some scenarios.
I've received a couple of fantastic comments about troubles that people have faced adding items from here to their del.icio.us bookmarks, namely because Radio Userland uses a constant title for all entries (and del.icio.us automatically uses the title, so three different entries get the same title if you fail to manually override its choice). The common title problem was one of the reasons I created the notables static listing, though of course that listing is just a subsection of entries.
To help with this issue, I've added quicklinks below each entry to add it to your del.icio.us bookmarks, furl bookmarks, to Digg it or to Reddit it (which will link to an existing entry if one is already on there), and to check for Technorati links (there are seldom Technorati links because most of the readers here aren't bloggers, or they aren't the sort of bloggers that comment on every site they visit. I'd get a big boost in the Technorati rankings if I started pandering to the incestuous blogging community). I've mirrored these items to the static section as well.
[The following repost of "legacy" yafla content is preparation for the long awaited publishing of part III, which will be published through this medium. To give consistency, I'm reposting Pt I and II in this format]
Much of the material covered in this outing will be old hat for a lot of developers, but is nonetheless worth a recap -- Even among the pros there remains misunderstandings and conflicting information about the fundamentals of databases, and the true magnitude of impact they have on systems. I've intentionally authored this series conversationally, as opposed to a "high impact hit list!", however if you'd just like a brief summary list you can find one at the end.
This needs to be addressed as there is a growing camp of "data wasters" that erroneously believe that the larger the amount of waste, the more Enterprise Ready a solution is (capitalization used derisively). I step into this quagmire knowing full well that this section will yield me some "you're a dummy!" responses from some newly earned adversaries, however that's a price I'm willing to pay if I can save but one byte tree.
Minimize the size of your data. Don't use GUIDs where they aren't necessary (e.g. where you don't really need global uniqueness/replication), and don't use a bigint where an int or a smallint will suffice. Don't use a smallint where a tinyint will suffice or an nvarchar where a varchar would be fine. Use the smallest type that is reasonable for the field. Don't invent vague packing technicalities or native type size issues in an attempt to justify oversized data.
Of course you should plan for realistic growth, and I'm not advocating that you use a tinyint to store your CustomerID field, but keep the rational, real world in mind when designing your applications - are you really going to exceed 2 billion users? Is there going to be more than 32767 languages in your application? Is it likely that we're going to a new calendaring system that might have 2 billion months?
Evaluate if it might be an acceptable compromise to simply use these large types in your façade while actually storing smaller types in the actual database. This would give you improved performance, and would allow you to easily upsize your data types in the future in the unlikely event that it becomes necessary.
Clearly there are cases where large data types are legimately warranted, however too many database architects abscond themselves of any responsibility for efficiency by making everything a GUID or a bigint "just in case" (GUIDs have a substantial creation cost as well, in addition to the obvious storage and I/O costs. While GUIDs once used the available network card MAC address as the foundation and generally sequentially increased in value on each new GUID, in current Windows variants GUIDs are basically random numbers -- used as a clustered primary key they can lead to endless data reordering. NOTE: See Sequential GUIDs in SQL Server for solutions for this problem).
Why does it matter? In real enterprise apps before you know it there are tens or hundreds of millions of rows throughout your database, and these rows need to be read from, and written to, the glacially slow storage subsystem constantly -- given this finite resource, doesn't it seem logical that 1MB of I/O carrying 20,000 records is better than 1MB of I/O carrying only 5,000 records? Of course it is. Isn't it better that 10% of your database can fit in the memory cache rather than just 5%? Of course.
Don't be lulled into a false justification of large data types by running ridiculously small benchmarks, where all of the data exists in the memory cache and the I/O is dwarfed by the computational element of the query, yielding "only" a performance hit of 10% or so with larger types -- when your database gets to real enterprise size, size really does matter. That cluster indexed GUID primary key not only makes the row bigger, it makes every non-clustered index bigger (and thus slower) as well, and when the weakest link SAN is running at 100%, you'll regret every wasteful byte.
Many SQL Server performance problems are rooted in missing or inappropriate indexes, or alternately unused indexes. This is often true for databases thrown together by front-end experts unhappily tasked with supplying the back-end database, just as it's often the case for those carefully crafted by highly-focused database professionals.
An understanding of indexes, and a focus on their application, is paramount for high performance databases. Not only is it critical to create the right indexes, it's important to craft your access to properly utilize the indexes that are there.
Indexes in reference books serve the same purpose (and share many of the same traits) as those in the database world -- by referencing the index you can follow a shortcut to a particular piece of information, seeking directly to that specific page, versus going from page to page scanning the contents. In the SQL Server world these sorts of indexes are called non-clustered indexes (or secondary indexes) -- they are a subset of the data ordered for a specific purpose, containing a pointer to where the real data row can be found.
In the Northwind database an example of a non-clustered index is ShipPostalCode on the Orders table. This index sorts by the ShipPostalCode, and may be used for a query such as the following.
SELECT * FROM Orders WHERE ShipPostalCode = '05022'
If you take a look at the execution plan (Ctrl-K or choose the option "Show Execution Plan" under the Query drop-down in Query Analyzer. When you execute a query with this option enabled a new tab, Execution Plan, will appear beside the results tab), you can see that an index seek took place, and then a bookmark lookup to find the actual data row. If we run SET STATISTICS IO ON on the connection before running the above query, we'll get some statistics on the IO used to satisfy the query, which will be as follows.
Table 'Orders'. Scan count 1, logical reads 4, physical reads
0, read-ahead reads 0.
Compare this to the following query, which simulates having no index.
SELECT * FROM Orders WITH(INDEX(0)) WHERE ShipPostalCode =
'05022'
In this case the execution shows a full table scan, and our IO statistics reports the following.
Table 'Orders'. Scan count 1, logical reads 21, physical
reads 0, read-ahead reads 0.
Without the index there was significantly more I/O, and the differential would be vastly worse if this were a large enterprise-sized table instead of a small sample table. To make matters much, much worse such a table scan will block on every single exclusive page or row lock on the table, waiting for the data to be unlocked just to be able to verify if the wanted information is contained within, while the index seek knows that the locked data isn't the data that it's looking for and is unaffected. Try running the above two queries in two separate query analyzer windows while the following script is running in yet another (increase the WAITFOR delay if you can't jump between them all within 30 seconds).
BEGIN TRAN
UPDATE Orders SET OrderDate = '1997-08-27'
WHERE OrderID = 10647
WAITFOR DELAY '00:00:30'
ROLLBACK TRAN
The first query, using the index, instantly returns the result regardless of the row lock, which is logical given that the row being locked is not what the query is looking for, while the second query, not using the index (which can happen because no index exists, or the index isn't deemed the best choice) blocks until the other connection's lock is released. Not using an index is not only vastly less efficient, it can significantly worsen blocking problems as a database scales (or rather tries to scale).
Returning to our analogy, many books take it a step further and order the content, making the data itself an index of sorts. A cookbook might sort by main category and then dish name, while a phone book famously sorts by [city, last name, first name] (in that order). Thus if you want to search based upon the sorted data, it's extremely efficient -- in the example of the phone book you can very quickly seek to the desired city and last name, scanning a small number of records for the desired person. In the SQL Server world this sort of index is called a clustered index (the sorting of the data itself), and for obvious reasons you can only have one clustered index in a book, or on a table. The primary benefit of a clustered index is that all of the table data is immediately available for every index match -- no dereferencing is necessary.
In the Northwind database consider the following query.
SELECT * FROM Orders JOIN [Order Details] ON Orders.OrderID =
[Order Details].OrderID WHERE Orders.ShipPostalCode =
'05022'
If you look at the execution plan, the Order Details data is grabbed via a very efficient clustered index seek.
Clustered indexes aren't all milk and honey, though. For instance imagine that you're the hard working typesetter maintaining the layout of the phone book, and you've carefully arranged all of the entries on the respective pages. Every time a new entry comes in and doesn't coincidentally fit right at the end of the sort order, or someone changes the information on an existing row in a way that alters the sort order ("Smith" changes his name to "Jones"), you need to reorganize some pages to make space. This same data-churn problem occurs with both non-clustered and clustered indexes, but clustered indexes exacerbate the problem given that it contains the entirety of the row data.
Of course you could plan for this by keeping a bit of blank space on each of your pages to facilitate at least a couple of changes, which is what the fillfactor is used for in SQL Server (a lower fillfactor leaves more empty space but reduces the true data density - insert performance is improved, but read performance is diminished. A high fill factor increases the real data density and thus read performance, but increases the likelihood of inserts requiring page splits. Note that fillfactor only applies on index creation, and whenever you defragment/rebuild your indexes), however this can be a serious performance issue for out-of-order inserted data, or frequently changing cluster indexed fields. It's for this reason that many developers use the monotonically increasing identity field as their cluster index. Historically there was a worry that having multiple inserts all going to the same "place" in the index, at the end, would lead to scalability killing contention at this hot spot. SQL Server has logic to deal with identity fields and effectively eliminates this hot spot issue.
Another problem with clustered indexes is their girth (they contain the entire row data). This is largely irrelevant if you're seeking to specific records, or where you actually plan on using all of the data after a lookup, however if you are querying a range of data (for instance the first name of all of the people with the last name "Forbes" in the city of "Oakville") the query engine will read in the entire row contents for each matching record using a range scan, extracting only the requested data. In our example phone book there is so little extra data that it's a minor overhead, however in many large real-world tables this can have a serious performance impact.
Consider if instead we had a secondary index that was sorted by City, Last Name, First Name, the query engine could very efficiently scan past only the small index entries.
This brings up a very important point -- Some indexes contain enough information that you don't even need to go to the content, your query being satisfied by the index itself. Consider an index in a tour book that sorts famous attractions by their name, and the country and city that they were located in, pointing to the page where further information regarding it could be found. If you just want to know what city the Accademia dell' Arte del Disegno in Italy is found in, a quick seek through the index will tell you that it's in Florence. In this instance the index was a "covering index", in that it fully covered our request and we didn't need to dereference to the complete topical information. This is often the most efficient query mechanism of all.
Consider the following variation of a query we ran earlier.
SELECT OrderID FROM Orders WHERE ShipPostalCode =
'05022'
This will efficiently use the ShipPostalCode index to find the specific record, and because the query is fully satisfied by the index itself, the costly bookmark lookup is avoided, and IO is minimal.
Table 'Orders'. Scan count 1, logical reads 2, physical reads
0, read-ahead reads 0.
The observant will note that OrderID isn't actually in the index ShipPostalCode, or at least it doesn't appear to be. The trick is that all of the fields that are the sort fields for the clustered index, if one is defined, are automatically added as data fields to every other index on the table. This can be blessing in cases where you want one of the clustered fields and suddenly the non-clustered index is a covering index, but it also needs to be weighed against the fact that it makes every other index larger, and thus less "data dense".
Small covering index seeks are the most efficient method of pulling data, and it's a good reason to ensure that you are only pulling the specific fields that you actually need from any given table, preferably with a covering index. Range scans are also highly efficient in many situations, and are usually used when the wanted entries in an index are consecutive, such as when you search BETWEEN two dates against a date index, though because of the previously mentioned bookmark lookup costs range scans are generally only seen if the index is fully covering, or against the clustered index.
Even in the case where a full scan is necessary, indexes might still be fully covering. Consider if the following index were added to the Customer table.
CREATE INDEX CountryCity ON
Customers(Country,City,Address)
This index of course contains Country, City, Address, but as mentioned above it also contains CustomerID because it's in the clustered index. Of course if we query on Country, or Country and City, or Country, City, and Address, an efficient seek or range scan might be used to pull the matching records.
SELECT CustomerID FROM Customers WHERE Country =
'Canada'
What if instead we wanted to search only on the address? In that case the index can't be searched in order because it sorts by country and then city and address, and thus a particular address could exist anywhere in the index.
SELECT CustomerID FROM Customers WHERE Address = '43 rue St.
Laurent'
You might be surprised to see that it still used the index, albeit this time it's an inefficient scan rather than a seek. The index was used because it covered the query (all predicates and returned columns), and because the index is only a subset of the data it requires less I/O to scan the entire index than it does to scan the entire table data.
In a prior example we ran a query that required a bookmark lookup to satisfy the query (thus it did not have a covering index). The query was as follows.
SELECT * FROM Orders WHERE ShipPostalCode = '05022'
If you look at the execution plan for this query you can see that it seeks the "lookups" in the index, and then does a bookmark lookup against the clustered index (which is the actual table data). In this case there is only a single row to return, but even still the bookmark lookup cost is estimated to account for 50% of the cost of the query.
The cost of bookmark lookups, where an item is found in the index but it isn't a covering index, is the reason why many people are surprised to find that SQL Server has ignored what they believe are perfect indexes and instead table scanned ("Why isn't it using my index! ARGHHH!!!"). Consider the following query.
SELECT * FROM Orders WHERE ShipPostalCode = '24100'
Looking at the query plan you can see that it actually did a cluster index scan (which is a table scan on a table with a clustered index) instead of using our index, and the subtree cost is 0.0530.
This might seem perplexing because we seem to have a perfectly satisfactory index, however let's do the same query again, this time using a query hint to force it to use our index.
SELECT * FROM Orders WITH(INDEX(ShipPostalCode)) WHERE
ShipPostalCode = '24100'
If you look at the query plan you can see that it used our index, as demanded, but this time the bookmark lookups account for 80% of the query time. Our subtree cost comes in at 0.0564 -- more than it was doing a table scan!
In this case this was only 10 records of a total of 830 (1.2%) yet still it opted to do a full table scan rather than using our index. Many developers have been perplexed in this situation, wondering why SQL Server was avoiding their beautiful index, but it's doing it for a very valuable reason - it was cheaper than indirectly looking up each piece of data through bookmarks.
Of course we could have avoided bookmark lookups by using the index as a covering index if possible by using a query like the following, presuming this was all the data we needed to extract from the table.
SELECT OrderID FROM Orders WHERE ShipPostalCode =
'24100'
Now it uses our index, is super efficient, and has a subtree cost of only 0.0064. In larger databases the difference can be the tremendous by avoiding both the bookmark lookup and the table scan.
So how did the query engine guess how many rows would match a criteria to choose which method (whether by index and bookmark lookup, or table scan) to use to most efficiently satisfy the query? That's where something called distribution statistics comes into play. Statistics are a representative set of the data that are used by the query engine to make a best-guess plan for how to most efficiently serve the data. You can view the statistics for a given index, in the following case for the index ShipPostalCode, via the DBCC SHOW_STATISTICS command.
DBCC SHOW_STATISTICS('Orders','ShipPostalCode')
Due to the limited number of discrete ShipPostalCode values in the table, the statistics are entirely accurate in this case. In a more realistic database, with thousands or millions of rows, statistics start to become much more of an estimation (with an ever increasing margin of error). These estimations can lead to entirely wrong assumptions by the query engine in some edge cases, such as where it thinks a given set of predicates will yield thousands of rows when really it might yield only a couple.
Statistics can also fail for multi-field indexes. In this case the selectivity of the first field is used, so in the case of the index we created earlier (Country, City, Address), due to the fact that the country has a low selectivity (there are lots of entries for each country), the index will often be ignored, even though the city and address combination is highly unique. For this reason it is generally recommended that the most selective field comes first in your index, so in the case of that index the fields would be Address, City, and then Country. This is debateable because it also makes the index more single purpose -- it no longer serves an efficient purpose for less-granular searches like just Country, or Country/City. This needs to be evaluated on a case by case basis, and truly wouldn't be an issue for fully-normalized tables.
It should also be noted that ensuring that your statistics are as accurate as possible is critical. SQL Server includes automatic statistic updates, on by default, where it will attempt to do data sampling and update statistics when it feels they are out of date. Nonetheless it is a best practice to schedule full statistic updating at regular intervals (at a minimum weekly), preferably using the WITH FULLSCAN option so it is as accurate as possible. The standard database maintenance plan includes a step for statistic updating, and allows you to choose the amount of data to sample.
Regardless of all of the above, there will be cases where you may find that your statistics are up to date, your indexes are optimal, yet SQL Server is still incorrectly choosing not to use your index. In this case it may be an unfortunate reality that an index hint needs to be added to the query to politely (or rather sternly) request that it reconsider. Obviously this should be a last resort.
So you've created beautiful indexes, and you've ensured that your query only pulls the necessary data from each table, using covering indexes where possible to avoid costly bookmark lookups. You pull up the execution plan to find that the query engine is entirely ignoring your index. There are several reasons why this could happen.
Consider the following query.
SELECT OrderID FROM Orders WHERE LEFT(ShipPostalCode,4) =
'0502'
Fairly simple query, and from the looks of it one might think it'd be an efficient covered index seek. Upon execution you'll discover that actually it was an inefficient scan. Consider the following instead.
SELECT OrderID FROM Orders WHERE ShipPostalCode LIKE
'0502%'
In this case the query is executed as a highly efficient index seek. I have had cases where this tiny difference reduced an enterprise report from running for literally hours to a matter of seconds.
The reason is that in the former the indexed field was hidden within a function. The query engine can't predict what the result of the function will be, so it's forced to evaluate it for every row to see what pops out. LIKE is a first class comparison, it knows how it behaves, so the query engine can actually optimize against it. There are countless cases where people hide criteria fields in functions unnecessarily, and the result is massive, unexpected inefficiency.
The most common example of this mistake is using DATEADD/DATEDIFF to pull rows within a certain period of time -- instead of pre-calculated a fixed demarcation (i.e. precalculating GetDate() - 3 years) and then doing a direct comparison with the row data, developers are forcing whole table scans with wasteful date computations on every single row. For instance consider a query to report news items that have occurred within the past 12 hours from a hypothetical news table .
DECLARE @CurrentTime
SET @CurrentTime = GetDate()
SELECT * FROM NewsStories WHERE
DATEDIFF(hh,NewsDate,@CurrentTime)<12
Guaranteed to be terribly inefficient, yet it's overwhelmingly common. The query engine can much more effectively optimize the following variant.
DECLARE @StartTime
SET @StartTime = DATEADD(hh,-12,GetDate())
SELECT * FROM NewsStories WHERE NewsDate > @StartTime
At the outset I advocated that you minimize space usage (increasing real data density). The goal wasn't to try to fit that database on a floppy disk, but rather to minimize the amount of I/O necessary to satisfy a given query, as I/O is the weakest link of most enterprise systems. There are design choices, such as adding additional indexes, that actually increase the size of your database on disk yet reduce the I/O necessary for certain queries, and these are usually very worthwhile trade-offs.
Another powerful technique you can use to trade disk space for improved database performance is indexed calculated columns. There are countless variations, but I'll cover one scenario that is fairly commonly used -- report counts by month for a given year. In the case of the Orders table this could be achieved via the following query.
SELECT YEAR(OrderDate) AS [Year], MONTH(OrderDate) AS
[Month], COUNT(*) AS [Monthly Orders]
FROM Orders
WHERE YEAR(OrderDate)=1997
GROUP BY YEAR(OrderDate), MONTH(OrderDate)
Instead of adhoc decomposing the date into month and year constituents, consider adding them as computed columns.
ALTER TABLE dbo.Orders ADD
OrderDateYear AS CONVERT(smallint,YEAR(OrderDate)),
OrderDateMonth AS CONVERT(tinyint,MONTH(OrderDate))
Now we can change our query to the following.
SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month],
COUNT(*) AS [Monthly Orders]
FROM Orders
WHERE OrderDateYear=1997
GROUP BY OrderDateYear, OrderDateMonth
By itself we've done nothing for the query efficiency (in fact it is actually less efficient as it's applying the where predicate after building the set), though we've achieved a bit of "code re-use". However we now have the foundations for some powerful indexed computed columns.
CREATE NONCLUSTERED INDEX IX_OrderDateDecomposed ON
dbo.Orders
(
OrderDateYear,
OrderDateMonth
) ON [PRIMARY]
GO
Now the query referencing these computed columns is dramatically more efficient. Even better, these indexed computed columns haven't decreased the real data density of the table because they're only materialized in the index. NOTE: Ensure that queries that don't need these computed fields don't pull them explicitly or implicitly via the wasteful * column selector, as it'll unnecessarily calculate each of the computed fields for each row.
An indexed view, sometimes referred to as a materialized view, is a sort of indexed computed columns on steroids, taking the idea of storing computed results to the next level. In a previous example we improved the performance of some data aggregation logic by grouping on some indexed computed columns. We can take a variant of that and create a view out of it.
CREATE VIEW dbo.OrdersByMonth
WITH SCHEMABINDING
AS
SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month],
COUNT_BIG(*) AS [Monthly Orders]
FROM dbo.Orders
GROUP BY OrderDateYear, OrderDateMonth
As it is the view is acting as nothing more than a template for queries against the underlying table, and is only of benefit for code reuse (which itself is a very worthwhile goal). We can take that a step further by materializing this view so the actual results are stored, and changes to the underlying table is automatically reflected in the aggregates. The following command creates the indexed view.
CREATE UNIQUE CLUSTERED INDEX IX_OrdersByMonth ON
OrdersByMonth(Year,Month)
Now the following query will be satisfied by the indexed view, using the pre-computed values, rather than recalculating for every query.
SELECT [Year],[Month],[Monthly Orders] FROM OrdersByMonth
WHERE Year=1997
In the end the resource usage for our monthly order count query has dropped by about 90% over the original query, and this is for a tiny sample database. In the real-world the differential can be extraordinary.
Indexed views do have some downsides, such as the automatic maintenance that occurs whenever the underlying data changes, however they can be an extraordinarily powerful tool in your arsenal and deserve further research if your platform supports it.
[The following repost of "legacy" yafla content is preparation for the long awaited publishing of part III, which will be published through this medium. To give consistency, I'm reposting Pt I and II in this format]
Inefficiency is a gluttonous thief. It burglarizes your server rooms at all hours of the day and night, demanding virtually limitless hardware sacrifices to satiate its endless thirst for clock cycles and disk rotations. In return it punishes your users with reduced performance and reduced satisfaction, and devastates your solution's scalability.
This inefficiency, materialized in the form of slow performance, is one of the primary causes of system abandonment. This is particularly troublesome in the SQL Server world where many systems servicing large user bases often run on low cost server boxes that leave little margin for performance waste. Many organizations have tossed out their SQL Server solution running on a $3000 PC because the performance wasn't satisfactory (not achieving so-called 'Enterprise' performance), to replace it with a multi-million dollar mainframe solution, overcoming embarrassing inefficiency with brute force.
Several years back, in a moment of nerdish bravado, I made a foolish blanket statement that I could reduce the runtime of virtually any element of a non-trivial SQL Server database solution by 95% (thus improving the performance by about 20x), doing so through some rudimentary changes requiring nothing more than some analysis, minor code changes (changing the underlying code, but not the functionality), indexing, and file group changes. To my surprise, and even greater dismay, this number actually proved to be remarkably accurate: From giant multi-hour organization wide reports, to simple security procedures run hundreds of times a minute, the obvious low hanging fruit alone often improved performance by 10x or more. With a little bit of elbow grease it has proven extraordinarily common to improve performance by 20x or more, significantly improving responsiveness and load handling of the respective systems at minimal cost.
The remarkable thing is that these weren't systems implemented by bad developers - many of them were extraordinary developers who implemented a lot of tricks and techniques that I've co-opted and added to my own bag of techniques. Instead there seems to be a dearth of real information on developing for performance in SQL Server, leaving many to guess about the best approach, not to mention that there isn't enough attention paid to performance efficiency in enterprise solutions. Many seem to be under the false impression that gross inefficiency requiring massive clusters to perform trivial tasks merits a capital-E Enterprise designation.
In software development there's an oft-referenced vice known as 'premature optimization'. This is the tendency to prematurely focus on code performance while code is still young and awkwardly growing, before the critical performance weaknesses have been identified and measured. The end result of this misguided effort is often convoluted code that is difficult to understand and maintain (for instance code including inline assembly or using specialized system hacks in seldom called edge functions). This is often a mistake of inexperienced programmers that haven't had the perfectionist engineering streak beaten out of them.
Consider also that performance truly isn't a concern for the vast majority of code in most client-side applications - it likely doesn't matter if the code that validates an input box in a Windows Forms application takes 3ms or 70ms to complete. As the processing is decentralized and isn't impacting other users who might be running the application elsewhere on the planet, it is basically making use of 'free' clock cycles available on the client PC, and generally is imperceptible to the user. If one thousand different users were running the application simultaneously, they're running it on a thousand powerful PCs, effectively throwing a massive 'cluster' at the problem. In other words, you can overcome application inefficiency on the client side through massive computational excess and a endless ability to scale-out. Even in cases where there is worthwhile performance issues identified, for example an image processing algorithms that takes several seconds to perform an operation, it's often best to wait until the project nears a release and the code has settled, at which point you can send a commando performance team to profile and then selectively improve the slowest sections of code that will have the most beneficial impact, focusing on the lowest hanging fruit, yielding a bounty of quick wins. (Taking one for the team because there's no I in team, and no cliche unworthy)
Enterprise databases, or any centralized system for that matter, are entirely different beasts - performance is one of the critical elements of these systems, and performance problems are one of the primary reasons why solutions are abandoned or re-architected. Consider that every clock cycle wasted on a shared resource, such as a database server, impacts the performance of the overall system and every other user. In most environments there is a massive asymmetry between the computational capability of client machines, and the computational capability of a shared system, such as a database server. There are usually some fixed financial and technological limits to the amount of hardware that a system can scale to, so your database server running on a lowly Dell two-way server is desperately trying to keep up with the demands of 500 user workstations pounding away at it. Even though Google is clustered on purportedly thousands of machines, they still have to develop efficiently to be able to economically service millions of users in a timely manner.
Thus, while it might seem irrelevant when taken alone that your stored procedure saturates the resource, taking 200ms to return a simple list of values to populate a drop list for Joe User, imagine 100 users all opening that form at the same time putting a shared demand on the database system. The performance impact starts to become significant and adversely affects the usability (and credibility) of the system. This is exacerbated by the fact that simultaneous performance demands aren't merely additive on shared resources, but rather contention and task sharing often means that these issues snowball into much more than the sum of the parts.
You should consider the performance of your database from day one with every table you add, every index you create, every trigger you concoct, and every relationship you define. While the misguided will argue that this amounts to premature optimization (as Ralph Waldo Emerson observed, a foolish consistency is the hobgolbin of little minds, and the belief that any performance concerns are premature is just such a foolish consistency), the reality is that the performance of a database system is largely defined by the fundamental design of the system, and as the system grows it becomes much more difficult and costly to solve fundamental performance problems. Furthermore, once an enterprise system reaches production even the simplest performance change, such as adding an index, requires complex analysis to determine how it impacts other parts of the system, or that it satisfies what could be hundreds of procedures accessing the object.
The cynical will wonder how one can predict the future when designing a database system, but the reality is that the access patterns are usually obvious by the time you're starting designing tables - you know how the tables relate, what data will be searched, how often you'll be selecting the records versus modifying them, and how big the fields and records should be. Use this information effectively when developing the tables to choose the appropriate clustered and secondary indexes, to minimize the size of each record, and to write efficient SQL. Don't leave it for a maintenance programmer to reverse engineer the system and apply best guesses in a moment of crisis in the future.
Part II and III will introduce a variety of common performance pitfalls and panaceas in the SQL Server world, touching upon (but not limited to) the following:
Tagged: [SQL Server], [SQL-Server]
I've received some great feedback regarding the entry on setting up a MediaWiki install on Windows. Many of the comments were kind words of thanks (which I really appreciate. Knowing that it helps people is my greatest motivation), and others helpfully suggested improvements to the instructions.
As an example of comment-driven improvements, my instructions have you installing the GNU diff utilities, in particular for the diff3.exe utility, however the MediaWiki setup scripts don't properly find it (e.g. as the instructions are currently written the GNU diff utilities are completely unused, although they can still be useful in your day-to-day travails). This is because a prior revision included fairly involved changes to the MediaWiki config/index.php script so it would properly locate diff3 on the Windows platform, as it is currently Unix-centric and doesn't look for the proper executable, not to mention that it parses the PATH environment variable incorrectly . After receiving two comments that those steps were a little too complex, however, I removed that section.
My goal was to get people experimenting with MediaWiki, or even just wikis in general, so diff3 functionality really wasn't critical. I pared the instructions accordingly. Similarly one early draft included the building and installation of a PHP memory cache to improve performance, but that too is unnecessary to simply try out the product.
Another line of comments involved asking:
To answer this I really need to describe the philosophy of this blog, along with my resistance to "technology alliances".
In the byline of this blog I describe my philosophy as "pragmatic software development", and this really drives my recommendations. In this case there are a lot of development shops that are Windows-centric, with little or no UNIX/Linux experience, yet MediaWiki is one of the best, most featurer rich, "standard" wiki products out there. Choosing a solution that leveraged what shops already know with the best solution is a pragmatic approach.
Which brings me to my general philosophy towards Microsoft, as comments indicating that I'm either a Microsoft hater, or a Microsoft drone parroting the corporate line, have hit my inbox over the short history of this blog.
I am not subservient to Microsoft.
Unlike many Microsoft technology advocates (I truly love both SQL Server, and .NET, and I think they're remarkable solutions), I have no desire to ever work for Microsoft (Microsoft has some top notch, world-class talent, and I've met and worked with a lot of great talent from there, but they also have their share of both jerks and duds). I'm not going to praise their every move in hopes that I'll get noticed. yafla, my consulting/ISV company, has chosen to avoid any partnerships or tying to the Microsoft brand because we don't want to become another drone "consulting" company single-mindedly acting as a third-party sales force for Microsoft, desperately racking up Microsoft partner points by pushing less-than-optimal solutions on customers. We didn't choose to use .NET for our software because we're hoping to nestle into the Microsoft family -- we chose it on technical merit, and a pragmatic analysis of our current and prospective clients.
We work for our clients and ourselves, not Microsoft. This is a very important mantra for our services, and for the technology of our software, and if Microsoft wants their products to get recommended to our clients, and their technology to the foundation of our software, they need to make great products at competitive prices. No sales gladhanding, or sad career dreaming, is going to change that.
Am I saying that Microsoft solutions are second rate? Of course there are examples of Microsoft products that are terrible, and customers are being misled into buying buzzword-laden atrocities because a Microsoft partner is hoping to get invited to the next Microsoft dinner party. Yet there are also Microsoft solutions that are extraordinary. Windows 2003R2 is a superlative operating system, and where you need the breadth of its functionality, it can be well worth the money. Microsoft Small Business Server can be an amazing package of value for some small organizations, within the constraints of the product. Other times, however, if you have the appropriate skills, a Linux machine is the best choice, along with a stack of the many available free or close to free server products on that platform. Sometimes IIS 6 is the superior solution for a problem, while other times Apache would be your best bet. Sometimes PHP and MySQL is a great solution, and other times C#/ASP.NET with SQL Server is the perfect combo.
I don't blindly assume the Microsoft product to be the best, but neither do I automatically presume it to be second rate. Instead I evaluate on merit, and propose solutions based upon the customer and their needs.
To do otherwise would be just biased noise, and wouldn't be to the service of clients and peers.
Tagged: [Software Development], [Programming], [Software-Development]
I've been doing this as a somewhat regularly updated blog for just over half a year now, and the results have been extremely satisfying: I get about ~2500 direct unique visitors on an average day (increasing 2-6x when something ends up being a meme-of-the-day on sites like Reddit or Digg, and of course many read via aggregators), search engine referrals are up to 200 or so a day, and viewing the "who's on" list is a laundry list of influential corporations and locations across the globe.
It does feed my ego a little bit seeing visitors from various governments, the CIA, nuclear research labs, just about every large financial company, and visitors from every end of the globe. My numbers aren't huge, but it's a perfect composite of influential and knowledgeable readers.
The most popular entries thus far are as follows (I'm providing the static version links where possible):
Effectively Integrating Into Software Development Teams
Optimal Software Development Processes and Practices
Spelling Matters
Everyone Is Above Average - The Overpopulated Top 2%
I've tried to minimize the number of entries (outside of the personal category, though this anniversary one being an exception) to keep the noise as low as possible -- if you're using a reader it won't constantly pretend there's new content when I'm just adding a peanut gallery comment about someone else's blog -- though on the flip side that means that I've delayed various .NET and SQL entries until they're "perfect".
Perhaps I might have to find a compromise somewhere in between.