Planning for High Availability in the Enterprise
Lesson 1: Assessing Database Availability Requirements1. Calculating Availability(1) A = (F – ( MR + R )) / F A: Availability F: Mean time between failures MR: Mean time to recognize R : Mean time to repair(2) Availability is determined by how quickly you can recover a system after it has suffered a failure. Some people call this disaster recovery. Recovering a system form a failure is both disaster recovery and high availability.2. High Availability Framework(1) A basic roadmap for availability that can be found with the framework. 3. Availability Requirements(1) Gather availability requirement is the first step of planning any availability architecture. However, this is the most difficult element in the entire process. The difficulty is in trying to obtain realistic availability requirements.(2) The interview process that you go through should result in the following pieces of information for each application. Average number of users affected Type of user affected: internal or external How the application is used Which segment of the biz is affected What percentage of overall biz operation is affected Maximum amount of time the application can be offline before the biz is irreparable damaged.4. Barriers to Availability(1) Process failure The first level of protection is controlling physical access to machines The second level of protection is controlling physical access to the server room The third level of protect is change control, which is one of the most-often overlooked processes within an IT environment Change control processes exist to ensure that any change to an OS is tested as well as documented. You should test all service pack and hot fixes to ensure compatibility with your system configuration. There are two paths for changing control: planed and emergency.(2) User Errors. User errors are the second most common cause of system being offline. This is the most difficult category of availability issues to resolve because user errors are not technical issues. The only solution to this type of problem is to restore a backup to a secondary server, extract the data, and merge it back into the database.(3) Natural and Man-made Disasters Tornadoes Hurricanes and typhoons Earthquakes Tidal waves and tsunamis Volcanic eruptions Fires Lightning Floods Nuclear explosions and meteors(4) Hardware Failure(5) Software Failure(6) Network Failure(7) Security Failure5. Cost of Downtime(1) The total cost of downtime is determined for each application within an environment. This value is derived by evaluating the percentage of the total revenue for which a particular application is responsible.(2) Calculating downtime cost for applications that are directly responsible for generating revenue is reasonably straightforward.(3) Cost of downtime for applications that do not directly generate revenue but support biz operations are a bit more difficult to calculate.(4) After you have the figures for the downtime cost, multiply them by a factor of 3.(5) By factoring all the intangible aspects – such as loss of image, bad publicity, potential loss of staff, etc – you can better evaluate the costs associated with not having systems available.6. Budget Restrictions Lesson 2: Design Database Storage1. RAID Configurations(1) RAID 0 (2) RAID 1 (3) RAID 5 (4) Mirror of stripes (5) Stripe of Mirrors (6) SAN Configuration. You can configure a SAN with any combination of RAID levels previous discussed. You can also configure the storage within the array by using either a file system approach or a full disk approach.(7) File system configuration. In file system configuration, small portions of many disks drives are combined together into a storage unit, which is then presented to the OS as if it were a single disk drive. (8) Full disk configuration. You can construct a full disk configuration by configuring one or more disks in their entirely into a storage unit. 2. Tempdb Placement(1) The tempdb database is considered by many to be a “throwaway” database. Although you should nerver use tempdb to store persistent data because it is lost when the instance is restarted, it is still critical to the operation of a SQL Server.(2) If the storage array of a user database were to fail, the database on that storage array would be affected. However, all other database would still be accessible. If the storage array for tempdb or any system databases fails, the entire instance of SQL Server becomes unavailable.(3) At a minimum, you need to place tempdb on a storage array that has minimum redundancy, such as RAID 5 array.(4) If tempdb becomes a bottleneck within the system, you should move it to a dedicated storage array to reduce the disk connection.3. Filegroups (1) Filegroups are the most overlooked structure within a database.(2) Add filegroup and move tempdb databaseLesson 3: Planning a Database Migration Strategy
1. Standby Categories(1) You can have three different categories of standby technology: hot, warm and cold. The standby category that a particular technology fits into does not have to do with the following: Time Proximity Scalability Complexity How fast the failover operation is(2) Cold standby. Is not relevant for high-availability architectures. This is a spare server that you can employ if the primary server fails. It does not necessarily have any software installed or hardware configured and is not receiving any updates or data(3) Warm standby. If it requires a manual process to fail over. Even if a technology automatically detects an error, if the technology does not have a built-in mechanism to automatically fail over, it is a warm standby(4) Hot standby. Provides automatic failure detection as well as an automatic failover.(5) SampleMirroring operation state Standby categoryHigh performance Warm standbyHigh protection Warm standbyHigh availability Hot standbyIf witness unavailable Warm standby2. Process Change(1) The most significant step that can be taken for availability is to continuously evaluate and make improvements to processes and operational procedures.(2) Continuously reviewing and monitoring security helps ensure that database don’t become unavailable because of security breaches. Implement DDL triggers to notify an administrator any time a permission is granted or revoked directly using a GRANT statement or indirectly through role membership Implement DDL triggers to notify administrators when logins, database users, schemas, or certificates are created, modified, or dropped. All DDL events that enable encryption should trigger a notification so that DBAs can ensure that the proper master keys and certificates are backed up as well as created on any standby system(3) Tracking the creation, alteration, or deletion of any object within your SQL Server instances ensure that DBAs are always aware of any changes to databases and can address unauthorized changes immediately.(4) If you do not have change control processes in place, you need to implement at least a system to document the changes as a first step. Your change control procedure needs to undergo periodic review to ensure that current procedures still meet biz objectives.(5) At a minimum, you need to review all processes and procedure within your environment that allow high availability to occur every three months.(6) Database backups need to be periodically reviewed to ensure that restore operations can still occur within an acceptable time frames.3. Technology Implementation(1) Deploy a log shipping or database mirroring solution(2) Replication solution(3) Failover clustering solution(4) A migration cannot be started unless you already have very detailed documentation. Detailed documentation that lists all the security accounts, security settings, configuration options, databases, instance-level objects, DTS packages, SSIS packages, remote/linked servers, and so on have to be compiled