Tuesday, February 21 2006

[The following repost of "legacy" yafla content is preparation for the long awaited publishing of part III, which will be published through this medium. To give consistency, I'm reposting Pt I and II in this format]

Examples given in this series will reference the sample database "Northwind". It can be installed via the script found on the Microsoft website

The Basics? We're All Professionals Here!

Much of the material covered in this outing will be old hat for a lot of developers, but is nonetheless worth a recap -- Even among the pros there remains misunderstandings and conflicting information about the fundamentals of databases, and the true magnitude of impact they have on systems. I've intentionally authored this series conversationally, as opposed to a "high impact hit list!", however if you'd just like a brief summary list you can find one at the end.

Data Size

This needs to be addressed as there is a growing camp of "data wasters" that erroneously believe that the larger the amount of waste, the more Enterprise Ready a solution is (capitalization used derisively). I step into this quagmire knowing full well that this section will yield me some "you're a dummy!" responses from some newly earned adversaries, however that's a price I'm willing to pay if I can save but one byte tree.

Minimize the size of your data. Don't use GUIDs where they aren't necessary (e.g. where you don't really need global uniqueness/replication), and don't use a bigint where an int or a smallint will suffice. Don't use a smallint where a tinyint will suffice or an nvarchar where a varchar would be fine. Use the smallest type that is reasonable for the field. Don't invent vague packing technicalities or native type size issues in an attempt to justify oversized data.

Of course you should plan for realistic growth, and I'm not advocating that you use a tinyint to store your CustomerID field, but keep the rational, real world in mind when designing your applications - are you really going to exceed 2 billion users? Is there going to be more than 32767 languages in your application? Is it likely that we're going to a new calendaring system that might have 2 billion months?

Evaluate if it might be an acceptable compromise to simply use these large types in your façade while actually storing smaller types in the actual database. This would give you improved performance, and would allow you to easily upsize your data types in the future in the unlikely event that it becomes necessary.

Clearly there are cases where large data types are legimately warranted, however too many database architects abscond themselves of any responsibility for efficiency by making everything a GUID or a bigint "just in case" (GUIDs have a substantial creation cost as well, in addition to the obvious storage and I/O costs. While GUIDs once used the available network card MAC address as the foundation and generally sequentially increased in value on each new GUID, in current Windows variants GUIDs are basically random numbers -- used as a clustered primary key they can lead to endless data reordering. NOTE: See Sequential GUIDs in SQL Server for solutions for this problem).

Why does it matter? In real enterprise apps before you know it there are tens or hundreds of millions of rows throughout your database, and these rows need to be read from, and written to, the glacially slow storage subsystem constantly -- given this finite resource, doesn't it seem logical that 1MB of I/O carrying 20,000 records is better than 1MB of I/O carrying only 5,000 records? Of course it is. Isn't it better that 10% of your database can fit in the memory cache rather than just 5%? Of course.

Don't be lulled into a false justification of large data types by running ridiculously small benchmarks, where all of the data exists in the memory cache and the I/O is dwarfed by the computational element of the query, yielding "only" a performance hit of 10% or so with larger types -- when your database gets to real enterprise size, size really does matter. That cluster indexed GUID primary key not only makes the row bigger, it makes every non-clustered index bigger (and thus slower) as well, and when the weakest link SAN is running at 100%, you'll regret every wasteful byte.

Indexes

Many SQL Server performance problems are rooted in missing or inappropriate indexes, or alternately unused indexes. This is often true for databases thrown together by front-end experts unhappily tasked with supplying the back-end database, just as it's often the case for those carefully crafted by highly-focused database professionals.

An understanding of indexes, and a focus on their application, is paramount for high performance databases. Not only is it critical to create the right indexes, it's important to craft your access to properly utilize the indexes that are there.

A Bookshelf Full of Examples

Non-Clustered Indexes

Indexes in reference books serve the same purpose (and share many of the same traits) as those in the database world -- by referencing the index you can follow a shortcut to a particular piece of information, seeking directly to that specific page, versus going from page to page scanning the contents. In the SQL Server world these sorts of indexes are called non-clustered indexes (or secondary indexes) -- they are a subset of the data ordered for a specific purpose, containing a pointer to where the real data row can be found.

In the Northwind database an example of a non-clustered index is ShipPostalCode on the Orders table. This index sorts by the ShipPostalCode, and may be used for a query such as the following.

SELECT * FROM Orders WHERE ShipPostalCode = '05022'

If you take a look at the execution plan (Ctrl-K or choose the option "Show Execution Plan" under the Query drop-down in Query Analyzer. When you execute a query with this option enabled a new tab, Execution Plan, will appear beside the results tab), you can see that an index seek took place, and then a bookmark lookup to find the actual data row. If we run SET STATISTICS IO ON on the connection before running the above query, we'll get some statistics on the IO used to satisfy the query, which will be as follows.

Table 'Orders'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

Compare this to the following query, which simulates having no index.

SELECT * FROM Orders WITH(INDEX(0)) WHERE ShipPostalCode = '05022'

In this case the execution shows a full table scan, and our IO statistics reports the following.

Table 'Orders'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0.

Without the index there was significantly more I/O, and the differential would be vastly worse if this were a large enterprise-sized table instead of a small sample table. To make matters much, much worse such a table scan will block on every single exclusive page or row lock on the table, waiting for the data to be unlocked just to be able to verify if the wanted information is contained within, while the index seek knows that the locked data isn't the data that it's looking for and is unaffected. Try running the above two queries in two separate query analyzer windows while the following script is running in yet another (increase the WAITFOR delay if you can't jump between them all within 30 seconds).

BEGIN TRAN

UPDATE Orders SET OrderDate = '1997-08-27'
WHERE OrderID = 10647

WAITFOR DELAY '00:00:30'

ROLLBACK TRAN

The first query, using the index, instantly returns the result regardless of the row lock, which is logical given that the row being locked is not what the query is looking for, while the second query, not using the index (which can happen because no index exists, or the index isn't deemed the best choice) blocks until the other connection's lock is released. Not using an index is not only vastly less efficient, it can significantly worsen blocking problems as a database scales (or rather tries to scale).

Clustered Indexes

Returning to our analogy, many books take it a step further and order the content, making the data itself an index of sorts. A cookbook might sort by main category and then dish name, while a phone book famously sorts by [city, last name, first name] (in that order). Thus if you want to search based upon the sorted data, it's extremely efficient -- in the example of the phone book you can very quickly seek to the desired city and last name, scanning a small number of records for the desired person. In the SQL Server world this sort of index is called a clustered index (the sorting of the data itself), and for obvious reasons you can only have one clustered index in a book, or on a table. The primary benefit of a clustered index is that all of the table data is immediately available for every index match -- no dereferencing is necessary.

In the Northwind database consider the following query.

SELECT * FROM Orders JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID WHERE Orders.ShipPostalCode = '05022'

If you look at the execution plan, the Order Details data is grabbed via a very efficient clustered index seek.

Clustered indexes aren't all milk and honey, though. For instance imagine that you're the hard working typesetter maintaining the layout of the phone book, and you've carefully arranged all of the entries on the respective pages. Every time a new entry comes in and doesn't coincidentally fit right at the end of the sort order, or someone changes the information on an existing row in a way that alters the sort order ("Smith" changes his name to "Jones"), you need to reorganize some pages to make space. This same data-churn problem occurs with both non-clustered and clustered indexes, but clustered indexes exacerbate the problem given that it contains the entirety of the row data.

Of course you could plan for this by keeping a bit of blank space on each of your pages to facilitate at least a couple of changes, which is what the fillfactor is used for in SQL Server (a lower fillfactor leaves more empty space but reduces the true data density - insert performance is improved, but read performance is diminished. A high fill factor increases the real data density and thus read performance, but increases the likelihood of inserts requiring page splits. Note that fillfactor only applies on index creation, and whenever you defragment/rebuild your indexes), however this can be a serious performance issue for out-of-order inserted data, or frequently changing cluster indexed fields. It's for this reason that many developers use the monotonically increasing identity field as their cluster index. Historically there was a worry that having multiple inserts all going to the same "place" in the index, at the end, would lead to scalability killing contention at this hot spot. SQL Server has logic to deal with identity fields and effectively eliminates this hot spot issue.

Another problem with clustered indexes is their girth (they contain the entire row data). This is largely irrelevant if you're seeking to specific records, or where you actually plan on using all of the data after a lookup, however if you are querying a range of data (for instance the first name of all of the people with the last name "Forbes" in the city of "Oakville") the query engine will read in the entire row contents for each matching record using a range scan, extracting only the requested data. In our example phone book there is so little extra data that it's a minor overhead, however in many large real-world tables this can have a serious performance impact.

Consider if instead we had a secondary index that was sorted by City, Last Name, First Name, the query engine could very efficiently scan past only the small index entries.

Covering Indexes

This brings up a very important point -- Some indexes contain enough information that you don't even need to go to the content, your query being satisfied by the index itself. Consider an index in a tour book that sorts famous attractions by their name, and the country and city that they were located in, pointing to the page where further information regarding it could be found. If you just want to know what city the Accademia dell' Arte del Disegno in Italy is found in, a quick seek through the index will tell you that it's in Florence. In this instance the index was a "covering index", in that it fully covered our request and we didn't need to dereference to the complete topical information. This is often the most efficient query mechanism of all.

Consider the following variation of a query we ran earlier.

SELECT OrderID FROM Orders WHERE ShipPostalCode = '05022'

This will efficiently use the ShipPostalCode index to find the specific record, and because the query is fully satisfied by the index itself, the costly bookmark lookup is avoided, and IO is minimal.

Table 'Orders'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

The observant will note that OrderID isn't actually in the index ShipPostalCode, or at least it doesn't appear to be. The trick is that all of the fields that are the sort fields for the clustered index, if one is defined, are automatically added as data fields to every other index on the table. This can be blessing in cases where you want one of the clustered fields and suddenly the non-clustered index is a covering index, but it also needs to be weighed against the fact that it makes every other index larger, and thus less "data dense".

Small covering index seeks are the most efficient method of pulling data, and it's a good reason to ensure that you are only pulling the specific fields that you actually need from any given table, preferably with a covering index. Range scans are also highly efficient in many situations, and are usually used when the wanted entries in an index are consecutive, such as when you search BETWEEN two dates against a date index, though because of the previously mentioned bookmark lookup costs range scans are generally only seen if the index is fully covering, or against the clustered index.

Even in the case where a full scan is necessary, indexes might still be fully covering. Consider if the following index were added to the Customer table.

CREATE INDEX CountryCity ON Customers(Country,City,Address)

This index of course contains Country, City, Address, but as mentioned above it also contains CustomerID because it's in the clustered index. Of course if we query on Country, or Country and City, or Country, City, and Address, an efficient seek or range scan might be used to pull the matching records.

SELECT CustomerID FROM Customers WHERE Country = 'Canada'

What if instead we wanted to search only on the address? In that case the index can't be searched in order because it sorts by country and then city and address, and thus a particular address could exist anywhere in the index.

SELECT CustomerID FROM Customers WHERE Address = '43 rue St. Laurent'

You might be surprised to see that it still used the index, albeit this time it's an inefficient scan rather than a seek. The index was used because it covered the query (all predicates and returned columns), and because the index is only a subset of the data it requires less I/O to scan the entire index than it does to scan the entire table data.

Statistics and Bookmarks

In a prior example we ran a query that required a bookmark lookup to satisfy the query (thus it did not have a covering index). The query was as follows.

SELECT * FROM Orders WHERE ShipPostalCode = '05022'

If you look at the execution plan for this query you can see that it seeks the "lookups" in the index, and then does a bookmark lookup against the clustered index (which is the actual table data). In this case there is only a single row to return, but even still the bookmark lookup cost is estimated to account for 50% of the cost of the query.

The cost of bookmark lookups, where an item is found in the index but it isn't a covering index, is the reason why many people are surprised to find that SQL Server has ignored what they believe are perfect indexes and instead table scanned ("Why isn't it using my index! ARGHHH!!!"). Consider the following query.

SELECT * FROM Orders WHERE ShipPostalCode = '24100'

Looking at the query plan you can see that it actually did a cluster index scan (which is a table scan on a table with a clustered index) instead of using our index, and the subtree cost is 0.0530.

This might seem perplexing because we seem to have a perfectly satisfactory index, however let's do the same query again, this time using a query hint to force it to use our index.

SELECT * FROM Orders WITH(INDEX(ShipPostalCode)) WHERE ShipPostalCode = '24100'

If you look at the query plan you can see that it used our index, as demanded, but this time the bookmark lookups account for 80% of the query time. Our subtree cost comes in at 0.0564 -- more than it was doing a table scan!

In this case this was only 10 records of a total of 830 (1.2%) yet still it opted to do a full table scan rather than using our index. Many developers have been perplexed in this situation, wondering why SQL Server was avoiding their beautiful index, but it's doing it for a very valuable reason - it was cheaper than indirectly looking up each piece of data through bookmarks.

Of course we could have avoided bookmark lookups by using the index as a covering index if possible by using a query like the following, presuming this was all the data we needed to extract from the table.

SELECT OrderID FROM Orders WHERE ShipPostalCode = '24100'

Now it uses our index, is super efficient, and has a subtree cost of only 0.0064. In larger databases the difference can be the tremendous by avoiding both the bookmark lookup and the table scan.

So how did the query engine guess how many rows would match a criteria to choose which method (whether by index and bookmark lookup, or table scan) to use to most efficiently satisfy the query? That's where something called distribution statistics comes into play. Statistics are a representative set of the data that are used by the query engine to make a best-guess plan for how to most efficiently serve the data. You can view the statistics for a given index, in the following case for the index ShipPostalCode, via the DBCC SHOW_STATISTICS command.

DBCC SHOW_STATISTICS('Orders','ShipPostalCode')

Due to the limited number of discrete ShipPostalCode values in the table, the statistics are entirely accurate in this case. In a more realistic database, with thousands or millions of rows, statistics start to become much more of an estimation (with an ever increasing margin of error). These estimations can lead to entirely wrong assumptions by the query engine in some edge cases, such as where it thinks a given set of predicates will yield thousands of rows when really it might yield only a couple.

Statistics can also fail for multi-field indexes. In this case the selectivity of the first field is used, so in the case of the index we created earlier (Country, City, Address), due to the fact that the country has a low selectivity (there are lots of entries for each country), the index will often be ignored, even though the city and address combination is highly unique. For this reason it is generally recommended that the most selective field comes first in your index, so in the case of that index the fields would be Address, City, and then Country. This is debateable because it also makes the index more single purpose -- it no longer serves an efficient purpose for less-granular searches like just Country, or Country/City. This needs to be evaluated on a case by case basis, and truly wouldn't be an issue for fully-normalized tables.

It should also be noted that ensuring that your statistics are as accurate as possible is critical. SQL Server includes automatic statistic updates, on by default, where it will attempt to do data sampling and update statistics when it feels they are out of date. Nonetheless it is a best practice to schedule full statistic updating at regular intervals (at a minimum weekly), preferably using the WITH FULLSCAN option so it is as accurate as possible. The standard database maintenance plan includes a step for statistic updating, and allows you to choose the amount of data to sample.

Regardless of all of the above, there will be cases where you may find that your statistics are up to date, your indexes are optimal, yet SQL Server is still incorrectly choosing not to use your index. In this case it may be an unfortunate reality that an index hint needs to be added to the query to politely (or rather sternly) request that it reconsider. Obviously this should be a last resort.

Actually Using Indexes

So you've created beautiful indexes, and you've ensured that your query only pulls the necessary data from each table, using covering indexes where possible to avoid costly bookmark lookups. You pull up the execution plan to find…that the query engine is entirely ignoring your index. There are several reasons why this could happen.

Consider the following query.

SELECT OrderID FROM Orders WHERE LEFT(ShipPostalCode,4) = '0502'

Fairly simple query, and from the looks of it one might think it'd be an efficient covered index seek. Upon execution you'll discover that actually it was an inefficient scan. Consider the following instead.

SELECT OrderID FROM Orders WHERE ShipPostalCode LIKE '0502%'

In this case the query is executed as a highly efficient index seek. I have had cases where this tiny difference reduced an enterprise report from running for literally hours to a matter of seconds.

The reason is that in the former the indexed field was hidden within a function. The query engine can't predict what the result of the function will be, so it's forced to evaluate it for every row to see what pops out. LIKE is a first class comparison, it knows how it behaves, so the query engine can actually optimize against it. There are countless cases where people hide criteria fields in functions unnecessarily, and the result is massive, unexpected inefficiency.

The most common example of this mistake is using DATEADD/DATEDIFF to pull rows within a certain period of time -- instead of pre-calculated a fixed demarcation (i.e. precalculating GetDate() - 3 years) and then doing a direct comparison with the row data, developers are forcing whole table scans with wasteful date computations on every single row. For instance consider a query to report news items that have occurred within the past 12 hours from a hypothetical news table .

DECLARE @CurrentTime

SET @CurrentTime = GetDate()

SELECT * FROM NewsStories WHERE DATEDIFF(hh,NewsDate,@CurrentTime)<12

Guaranteed to be terribly inefficient, yet it's overwhelmingly common. The query engine can much more effectively optimize the following variant.

DECLARE @StartTime

SET @StartTime = DATEADD(hh,-12,GetDate())

SELECT * FROM NewsStories WHERE NewsDate > @StartTime

Database Cheat Sheet

Indexed Computed Columns

At the outset I advocated that you minimize space usage (increasing real data density). The goal wasn't to try to fit that database on a floppy disk, but rather to minimize the amount of I/O necessary to satisfy a given query, as I/O is the weakest link of most enterprise systems. There are design choices, such as adding additional indexes, that actually increase the size of your database on disk yet reduce the I/O necessary for certain queries, and these are usually very worthwhile trade-offs.

Another powerful technique you can use to trade disk space for improved database performance is indexed calculated columns. There are countless variations, but I'll cover one scenario that is fairly commonly used -- report counts by month for a given year. In the case of the Orders table this could be achieved via the following query.

SELECT YEAR(OrderDate) AS [Year], MONTH(OrderDate) AS [Month], COUNT(*) AS [Monthly Orders]
FROM Orders
WHERE YEAR(OrderDate)=1997
GROUP BY YEAR(OrderDate), MONTH(OrderDate)

Instead of adhoc decomposing the date into month and year constituents, consider adding them as computed columns.

ALTER TABLE dbo.Orders ADD
OrderDateYear AS CONVERT(smallint,YEAR(OrderDate)),
OrderDateMonth AS CONVERT(tinyint,MONTH(OrderDate))

Now we can change our query to the following.

SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month], COUNT(*) AS [Monthly Orders]
FROM Orders
WHERE OrderDateYear=1997
GROUP BY OrderDateYear, OrderDateMonth

By itself we've done nothing for the query efficiency (in fact it is actually less efficient as it's applying the where predicate after building the set), though we've achieved a bit of "code re-use". However we now have the foundations for some powerful indexed computed columns.

CREATE NONCLUSTERED INDEX IX_OrderDateDecomposed ON dbo.Orders
(
OrderDateYear,
OrderDateMonth
) ON [PRIMARY]
GO

Now the query referencing these computed columns is dramatically more efficient. Even better, these indexed computed columns haven't decreased the real data density of the table because they're only materialized in the index. NOTE: Ensure that queries that don't need these computed fields don't pull them explicitly or implicitly via the wasteful * column selector, as it'll unnecessarily calculate each of the computed fields for each row.

Indexed Views

[This functionality only exists in the Enterprise and Developer edition of SQL Server 2000]

An indexed view, sometimes referred to as a materialized view, is a sort of indexed computed columns on steroids, taking the idea of storing computed results to the next level. In a previous example we improved the performance of some data aggregation logic by grouping on some indexed computed columns. We can take a variant of that and create a view out of it.

CREATE VIEW dbo.OrdersByMonth
WITH SCHEMABINDING
AS
SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month], COUNT_BIG(*) AS [Monthly Orders]
FROM dbo.Orders
GROUP BY OrderDateYear, OrderDateMonth

As it is the view is acting as nothing more than a template for queries against the underlying table, and is only of benefit for code reuse (which itself is a very worthwhile goal). We can take that a step further by materializing this view so the actual results are stored, and changes to the underlying table is automatically reflected in the aggregates. The following command creates the indexed view.

CREATE UNIQUE CLUSTERED INDEX IX_OrdersByMonth ON OrdersByMonth(Year,Month)

Now the following query will be satisfied by the indexed view, using the pre-computed values, rather than recalculating for every query.

SELECT [Year],[Month],[Monthly Orders] FROM OrdersByMonth WHERE Year=1997

In the end the resource usage for our monthly order count query has dropped by about 90% over the original query, and this is for a tiny sample database. In the real-world the differential can be extraordinary.

Indexed views do have some downsides, such as the automatic maintenance that occurs whenever the underlying data changes, however they can be an extraordinarily powerful tool in your arsenal and deserve further research if your platform supports it.

Summary

  • Keep your rows as small as possible to maximize real data density.
  • All tables should have a clustered index except in rare exception situations. Normally this will be a single field primary key.
  • Small clustered indexes keep non-clustered indexes small, increasing real data density.
  • Clustered indexes help make other indexes covering indexes.
  • Queries serviced by covering indexes are extremely efficient.
  • Avoid hiding criteria fields in functions -- indexes will not be used for them.
  • Consider indexed computed columns where appropriate
  • If you shelled out the cash for Enterprise Edition, seriously evaluate how indexed views fit in your solutions
  • Index, index, index! Only in extremely rare cases are the additional update and insertion costs associated with maintaining indexes heavier than the benefits.
  • Understand execution plans. Evaluate them regularly.
 SQL 
   
Tuesday, February 21 2006

As a software developer, I'm generally drawn to complexity, and I generally view "hard" problems as much more worthwhile, from a business perspective, than easy problems.

For instance I never considered many of the "successful" .COM-type ideas as worthwhile, and I naturally feel the same about many of those percolating up via Bubble 2.0, because they either seemed too obvious, or too easily duplicated (aside from the fact that most of them haven't a hope in heck of ever having a sustainable revenue model...but that's beside the point -- If you can get Yahoo or Google to buy you out with big dollars, it doesn't matter if your flighty userbase would never pay a penny, and that they would never tolerate a single ad impression).

IMG_3824

As such I discarded a lot of "neat ideas", only to see them bring success to someone else. I was keeping my eye out for something both unique and difficult.

Yet often it's the simplest of things that hold the most value to people.

Just over a year ago, late December 2004, I was working on some custom C++ JPEG parsing logic (scary pointers and all), and noticed that some of my test files were bloated up with a lot of extraneous data. We're talking 4KB JPEGs that had an extra 60KB+ of data appended to them. Apart from EXIF data -- useful at times, but completely irrelevant at other times -- Adobe Photoshop was particularly notorious for stuffing files full of worthless application blocks.

Many of the images you find on the web have all of this extraneous info, slowing transfer times, increasing server loads, hiking bandwidth bills, causing pestilence and suffering (okay I'm going a little overboard).

As such, I gathered up a ridiculously small amount of the parsing code, compiled it, and "released" it as the rashly named PureJPEG. I expected it would see maybe a dozen downloads from nitpicking webmaster looking to ultra-optimize their user's experience.

Over the past year, over 40,000 people have downloaded this utility directly from yafla. It's been mirrored on quite a few sites (particularly in Russia, for some reason), so I have no idea how many worldwide have downloaded it, but presumably a greater number still.

I would have never imagined that something so simple would have filled such a niche, and I'm a little embarrassed about how trivial of a micro-project it really was, but there it is. As far as little utilities go, it's been a stunning success.

   
Wednesday, February 22 2006

I've been contemplating doing a Firefox Extension as a personal enrichment technology project (obviously extensions aren't a potential revenue market), and have played around with the rudiments: Tremendously powerful, and very easy to take each step, and while there are specifics that I haven't nailed down, most of the enabling functionality for the following requirements seems to be there.

IMG_4078

Before I waste the time I thought I'd put a general query to see if anyone has seen anything like this (searches yield nothing that looks similar).

  • Record and report the time spent on all sites, or selected sites (e.g. defined by regex patterns, or groups of regex patterns), allowing for various reporting levels. e.g. domain, subdomain, folder, document, etc. Reporting could be facilitated via the new SVG or canvas functionality.
  • Based upon the recorded time, allow the user to "ration" their own access. e.g. "Limit me to 20 minutes of Slashdot per day" or "Limit me to 60 minutes of Reddit, Digg, Slashdot, and JoelOnSoftware per week".

Like most software developers, I peruse the various technology sites during the day, and in the interest of good time management, not to mention that such a tool could be useful for billing purposes, I want to have an accurate grasp of exactly how much time I spend at specific sites, or a group of sites for that matter. Apart from status bar indicators of the time spent, and perhaps time "remaining", the product could be more interesting if it actively blocked further access to that site until you indicated moral weakness and selected the override feature.

Come across anything like this?

   
Thursday, February 23 2006

I've always wanted to mention Quake 2 on here, so here goes.

One of the big advantages .NET brought to desktop applications, at least in regards to official Microsoft dogma, was XCopy deployments: Instead of long, convoluted setups installing dozens of components into the system, shared libraries into system folders, and registry settings in the registry (and maybe some win.ini settings for real historical fun), XCopy allows you to build applications that can be "installed" by simply copying a folder.

Everything old is new again. Years back this was the DOS way, and in many cases it's still the UNIX way.

Some time back my favourite game on the Windows platform -- I did, and though the time is extraordinarily rare still do, play "shooter" games -- was Quake2, in particular with the ActionQuake mod. This game lived in my computing ecosystem for quite a while, not only because it was great fun, but also because the application existed as a directory "island" of sorts: John Carmack and crew had disregarded the Windows developer guidelines, ignoring centralized libraries, components, and registry settings, and stored settings in little config files in the application directory, with mods, extensions and libraries appeared in that directory or subdirectories.

I could "install" the application on a new machine, including all of my settings and extensions, just by copying the Quake2 directory to the new PC. Similarly, if I installed a new harddrive I could move it there and it was fully functional immediately. No complaints about missing libraries, or ridiculous dependencies upon fixed drive letters or fixed paths.

It just worked, and happily adapted to wherever it found itself.

This seemingly trivial "feature" made this application live on my hard drive far longer than it would have otherwise. Without the hassle of reinstalling the app everytime I upgraded machines or reinstalled the OS --  a nuisance that led to many apps getting left behind -- which would also have meant reinstalling all of the patches and mods, and then laborious reconfiguring the settings to a close proximity of my tastes, it just migrated with me. It was always there waiting to provide a quick diversion during a time of thought.

With XCopy deployment, Microsoft has shown that they've seen the light, and have realized that the whole "tentacles throughout the system" approach has been a terrible mistake.

With IIS 7 we might finally see the same sort of benefit for web applications. As it is IIS is a bit of a mixed-up configuration mess, with many directory-specific settings being stored in the IIS metabase existing somewhere else (on Windows 2003 it's a convenient, well-documented XML file that you can find at %sysdir%\\inetsrv\\metabase.xml) -- mime types, directory and file security, where virtual directories/web apps start, cache settings, etc. With IIS you can't simply XCopy the app and have full configuration, but instead you have to have appropriate permissions (generally administratrive) to open the IIS administrative console and set, for instance, that your CSS files should be cacheable for a day but the frequently rewritten XML file shouldn't be cached all, which tells IIS to add the appropriate HTTP headers to each.

Compare this to Apache, which lets you configure the vast majority of directory and file specific settings via .htaccess files in each directory, saving the "system wide" httpd.conf for settings that are truly web server wide. Configuration is logical and single point, and an application can be migrated with virtually all related setup with tremendous ease. A remote developer with access to only his folder of the web app has the ability to configure things as they should be configured without ridiculously requiring administrative rights.

IIS 7 adds this sort of functionality, moving a lot of the virtual directory and folder/file settings into files that you put in your web app file structure (obviously it won't, or rather won't under ideal conditions, allow these files to be read by web users). No more hydra-setup where half the setup exists over there and half exists over here.

Everything old is new again.

   
Monday, February 27 2006

Congratulations to the Canadian Olympic Team on a remarkable showing! Not only the amazing medal haul (24 medals), but the tremendous number of winners who placed 4th or 5th, and all of the great athletes who did their best in representing this country.

It has definitely been a huge turnaround from the "disappointment" of Olympics past, and foretells a great 2010 winter Olympics in Vancouver.

   
Monday, February 27 2006

Most users rely upon a small number of passwords — often only one — that they use everywhere.

For their corporate domain account, their blog, their photo site, their email, their banking and PayPal accounts, and their discussion groups, one key opens them all. Despite the incredible risks involved with this practice, it is more prevalent than ever.

Password reuse is often a learned habit of jaded users who've been bitten by the "lost password" bug a few too many times, especially against sites that they seldom visit. Automated password management systems, such as those built into just about every browser now, help lubricate the usage of more secure passwords, but their single–PC implementations and data loss potential has many shying from trusting them.

It's so much easier to remember one or two password derivatives than it is to remember dozens.

Of course, few will actually admit to recycling passwords like this, and instead it's the guy using unique 20 character random sequences that are most likely to speak out. Impromptu prodding of acquaintances, clients, and contacts, along with the results of several recent security surveys, has me convinced that these security best-practice aficionados are the exception, and a large number of users, perhaps even a majority, are dangerously reusing the same password prolifically. Look at the outrage and fear when a site like Reddit loses their user's plaintext passwords (it is hardly alone in having this happen). Of course the users could simply change their password, but the source of the outrage was because this was the one key for many of these users.

If someone discovers your password on site A, there's a very good probability they can use it to access site B, and C, and D, and E, and so on (especially when your username is your email address). The security of your online reality relies upon faith in a lot of people who you shouldn't have faith in. The weakest link can make it all unravel.

On sites that I've architected, I've tried to minimize the potential damage of a hypothetical exploit by eliminating the target surface area.

Following this philosophy, not only do I not want to store your password — of course I only store a hash and not your original password, which should be a universal practice even among "low value" sites — but I never want your probably-reused password ever hitting the site in the first place.

Instead of sending your password to be hashed on the server, I want it hashed on the client end, before it even gets sent down the wire.

This wouldn't be a possibility without JavaScript, however the functionality of JavaScript has become pretty much mandatory, and is growingly becoming necessary for even the basic functionality of the site. As such, the account creation and logon system incorporates functionality that hashes a combination of your username, password, and the domain on the client end, passing through the hash to the server as your "password". As a secondary benefit, the server can generate single-use variants (salts of sorts) which it provides with the form. If such a variant is provided, after the client script has created the hash, it then hashes the first hash with the variant, which the server can do as well, offering basic line encryption as well presuming that the server is tracking the variants, and ensuring they are server provided and not reused (it doesn't replace SSL, and there are still avenues for man-in-the-middle attacks and untrusted remote servers masquerading as the official site if they overrode DNS, however it's a step in the right direction where SSL can't or won't be used, which is the case for the vast majority of sites out there).

The SHA1 algorithm is easy to demonstrate (albeit also easier to crack), so in the example I used the excellent SHA1 implementation by Paul Johnston. Implementing it was trivial, and a simple example demonstrating how to use it for this purpose follows.

Note that this is purely an example, and in the wild you would want to use something much more computationally demanding, such as multi-rounds of blowfish. Something that makes brute force matching much more unreasonable.

<script language="JavaScript" src="sha1.js" type="text/javascript"></script>
...
<form name="loginForm" id="loginForm">
<input id="passwordHash" type="hidden" value="">
Email Address: <input id="emailAddress" type="text" size="20"><br/>
Password: <input id="password" type="password" size="20"><br/>
<input type=button onclick="DoLogon();" value="Logon" />
<input type=hidden id="variant" value="" />
</form>
<script language="JavaScript">

function DoLogon()
{
  var username = document.getElementById("emailAddress").value;
  var password = document.getElementById("password").value;

  var hashString = password + '|' + username + '|' + document.domain;
  var hash = hex_sha1(hashString);
 
  /* Variant - trivial "encryption" if the server has provided
     a tracked, single use pseudo-salt. */
  var variant = document.getElementById("variant").value;
  if (variant.length > 0)
  {
    hash = hex_sha1(hash + "|" + variant);
  }

  document.getElementById("passwordHash").value = hash;
 
  /* Remove the password element from the form before submitting */
  document.getElementById("loginForm").removeChild(document.getElementById("password"));
   
  /* Submit the form. */
  document.loginForm.submit();
}

Voila. Now I never know that you use the password 4muppet8 on every site, and instead I only ever see a unique hash specific for this domain. If you accidentally enter the login credentials for another site (this is a huge security risk that catches people frequently), my logs will never betray what it was. If someone mirrored this page on another site for spoofing purposes, and they weren't smart enough to modify the javascript, even if the user entered their credentials for my site they still wouldn't see it (because it is hashed against their domain. e.g. paypalspoof.com).

Of course this scheme still suffers a critical weakness: If, somehow, a nefarious agent could replace the server side scripts, and somehow my remote server validation scripts failed, they could simply alter it to pass through the original password. While that scenario is far more remote and unlikely than the already remote and unlikely database delving or line monitoring, it does demonstrate why the optimal situation would be intrinsic browser support: Instead of creating a site-specific custom script to secure and individualize the password for a specific domain, which allows users to reuse passwords without actually giving the password to any specific site, the browser should support this functionality directly, and it should be uniquely evident in the UI when such a secure password element is in use.

In addition to the password input type, there should be a secure password type (with obvious, non-spoofable graphical indicators that it is a secure password box) as a basic HTML element, automatically incorporating this sort of enhancement. HTTP already supports digest autentication, which is similar, but unfortunately it is incompatible with the form logon approach commonly used, not to mention that it has its own failings.

   
Wednesday, March 01 2006

I have absolutely no association with SiteGround outside of being one of many customers, however I've had good success with their MediaWiki hosting so I thought it worth a mention: For a very competitive price, SiteGround will get you going with your MediaWiki in no time (I get no kickbacks or commissions for mentioning this, but propose it as an alternative to the previous entry regarding setting up MediaWiki on IIS). Their service is quick and the hosting is reliable, and they're somewhat of MediaWiki experts.

While in the spirit of giving shout-outs, http://www.isqsolutions.com/ is a fantastic host (again I get no benefit from recommending them). They host yafla, and have done so to a close-to-perfect degree of reliability. Their offerings are feature rich and their prices are competitive, and they're very responsive to support issues. They're located here in the Greater Toronto Area.

   


About the Author
Dennis Forbes Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 15 years.





 
Earlier EntriesLater Entries

Dennis Forbes