Windows Failover Cluster Process and Guidelines


Windows 2012 R2 failover and multi-subnet failover clustering

 

Purpose

This document provides the policies and processes to:

  1. Plan, implement and deploy a failover cluster on Windows Server 2012 R2/2008 R2
  2. Manage, maintain and secure a failover cluster, including a multi-site cluster
  3. Identify the challenges and risks, and troubleshoot common problems

Scope

This document describes the requirements and recommendations for a failover cluster running Windows Server 2012 R2/2008 R2. It also describes the steps to plan, design and implement a multi-site cluster, along with the checklists to be followed.

Checklist

Below is the checklist for a failover cluster (multi-subnet or same subnet). Each item is described in the later sections.

  • Review the hardware and infrastructure requirements
  • Review the network and storage requirements
  • Review the disaster recovery and failback requirements
  • Review the quorum configuration requirements
  • Complete the server and cluster configuration
  • Configure the heartbeat and DNS settings
  • Configure the failback settings and a sequence of preferred owners for each clustered service or application, per design
  • Re-run the cluster validation and make the necessary fixes
  • Test failover of the clustered service or application, including failover between sites

Responsibility

The following teams are responsible for completing the failover cluster build.

Server Engineering/Operation Team

  • Review, develop and implement the failover cluster plan
  • Rack and stack the physical servers and update their firmware
  • Complete the physical and software configuration for the failover cluster
  • Fix all vulnerabilities identified by the security team
  • Complete the same-subnet and multi-subnet failover and failback DR tests

Network Operation Team

  • Management and heartbeat network configuration
  • VLAN creation and Network port configuration, if needed

Enterprise Storage Team

  • Assess the storage and storage replication requirements
  • Configure and allocate Storage

Enterprise Security Team

  • Review failover clustering design
  • Open Required ports between failover cluster nodes
  • Vulnerability assessment

Application and/or DB administrators

  • Configure the Failover cluster for application
  • Test application level DR activities

Failover Cluster Requirements 

The sections below describe the hardware, software, network and storage requirements for building a failover cluster, for both same-subnet and multi-subnet failover clusters.

Hardware requirements for a failover cluster 

For the basic server hardware recommendations for Windows Server 2012, refer to https://technet.microsoft.com/en-us/library/jj134246(v=ws.11).aspx. However, the best server configuration depends on the application and services installed on the server.

Use the same hardware configuration on all nodes of the failover cluster, unless it is a multi-subnet failover cluster.

Each physical server should have at least two NICs with network connections of at least 1 Gbps, one for the management network and one for the heartbeat network. NIC teaming is recommended for the management and client-facing network.

Physical access to storage should use Fibre Channel, Serial Attached SCSI (SAS) or iSCSI. All HBAs should be installed with the latest firmware and drivers, and all nodes must have the same MPIO or device-specific module (DSM) software installed.

All disks must be formatted with NTFS, using either the MBR or GPT partition style.

Software requirements for a failover cluster  

The servers in a failover cluster must run the same version of Windows Server (Windows Server 2012 R2 or 2008 R2) on the same processor architecture (x64-based). They should also have the same software updates (patches) and service packs.
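One way to check the "same software updates" requirement is to compare the list of installed updates exported from each node. The sketch below is only an illustration: the node names and update identifiers are placeholders, and how the inventory is collected is left open.

```python
def missing_updates(installed_by_node):
    """Return, per node, the updates found on another node but absent on this one."""
    # Union of every update seen on any node in the cluster.
    all_updates = set().union(*installed_by_node.values())
    gaps = {}
    for node, updates in installed_by_node.items():
        missing = all_updates - set(updates)
        if missing:
            gaps[node] = sorted(missing)
    return gaps


# Hypothetical inventory: node names and update IDs are placeholders.
inventory = {
    "Site1Node1": {"KB-EXAMPLE-1", "KB-EXAMPLE-2", "KB-EXAMPLE-3"},
    "Site1Node2": {"KB-EXAMPLE-1", "KB-EXAMPLE-2"},
    "Site2Node1": {"KB-EXAMPLE-1", "KB-EXAMPLE-2", "KB-EXAMPLE-3"},
}

for node, gaps in missing_updates(inventory).items():
    print(f"{node} is missing: {', '.join(gaps)}")
```

An empty result means every node reports the same set of updates.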

It is recommended to install the following hotfixes on all cluster nodes. Download the required hotfixes from the links below and install them.

Windows Server 2012 R2 – https://support.microsoft.com/en-us/kb/2784261

Windows Server 2008 R2  – https://support.microsoft.com/en-us/kb/2545685

Network infrastructure and domain account requirements 

The following network infrastructure and an administrative account with domain permissions are required to configure failover clustering.

Network settings and IP addresses 
All nodes must use identical network adapters with identical communication settings on those adapters. With Windows Server 2008 R2 or later, it is recommended to keep IPv6 enabled for the best cluster communication.

If a private/heartbeat network is used, it should be separated from the rest of the network infrastructure, and each private network must use a unique subnet.

DNS

All servers must register their names in DNS for name resolution. DNS is also required to register the cluster name and other resource names. The user who configures the cluster must have the privilege to create DNS entries in the corresponding zones on the DNS server.

Active Directory

All nodes in the cluster must be member servers of the same Active Directory domain. The user account used to create and configure the cluster and its resources must have the privilege to create computer objects in the domain. Unless the account is a domain admin, it must be delegated the Create Computer Objects and Read All Properties permissions in the domain.

For the computer objects of other cluster resources (for example a SQL failover cluster) that are created by application or DB administrators, a domain administrator can pre-create the computer object and associate the cluster resource with it through the cluster name computer object.

To do this, create the computer object with the same name that will be used when creating the cluster resource, and grant the cluster name Full Control over that computer object.

For example: Cluster01 is the cluster name computer object and the DB administrator wants to create the SQL FC resource Cluster_Res01 on the Windows failover cluster. As administrator, create a computer object "Cluster_Res01", open its Security page, add Cluster01 (select Computers as the object type in order to search for a computer object) and grant it "Full Control" over the newly created computer object.

Storage and Disks Requirement

Storage and disk requirements depend on the application and services configured, and on the DR scenario for multi-subnet clustering.

It is recommended to have all disks imported, formatted with NTFS (MBR or GPT) and configured with a primary partition.

The allocation unit (cluster) size and partition offset vary with the application. For SQL Server and clustering, it is recommended to format all shared disks with a 64 KB allocation unit size and a 1024 KB partition offset.
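The point of pairing a 1024 KB offset with a 64 KB allocation unit is that the partition start is an exact multiple of the allocation unit, so clusters do not straddle storage boundaries. A minimal sketch of that arithmetic follows. The 63-sector (31.5 KB) legacy offset is included only as a contrasting case and is not taken from this document.

```python
KIB = 1024


def is_aligned(partition_offset_bytes, allocation_unit_bytes):
    """True when the partition starts on an allocation-unit boundary."""
    return partition_offset_bytes % allocation_unit_bytes == 0


recommended_offset = 1024 * KIB   # 1024 KB offset from the guideline above
allocation_unit = 64 * KIB        # 64 KB allocation unit size
legacy_offset = 63 * 512          # 63 sectors of 512 bytes (contrast case)

print(is_aligned(recommended_offset, allocation_unit))  # True: 1048576 / 65536 = 16
print(is_aligned(legacy_offset, allocation_unit))       # False: 32256 is not a multiple of 65536
```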

Windows failover clustering and SQL Server failover clustering recommend basic disks over dynamic disks.

The organization normally uses the following allocation sizes for SQL DB disks:

  • Data disks: 64 KB block size
  • Backup disks: 64 KB
  • Log disks: 8 KB
  • TempDB data disks: 8 KB
  • TempDB log disks: 8 KB

Firewall and Security Requirements

All nodes in the failover cluster, especially in a multi-subnet cluster, must be able to communicate with each other on the ports below.

TCP             UDP
50519           50519
54660           54660
52874           52874
59095           59095
1433            1433
5005            5005
139             3343
445             49152-65535
135
49152-65535
3343
5022
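Once the firewall rules are in place, the TCP side of this list can be spot-checked from each node. The helper below is a minimal sketch using only the Python standard library. It tests TCP connections only (UDP ports cannot be verified this way), and a port reports closed if nothing is listening on it yet, even when the firewall allows it.

```python
import socket


def check_tcp_ports(host, ports, timeout=2.0):
    """Try a TCP connection to each port and return {port: reachable}."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name-resolution failures.
            results[port] = False
    return results
```

For example, check_tcp_ports("Site2Node1", [135, 445, 3343, 1433, 5022]) run on a primary site node would report which of those ports answer on the (placeholder) node Site2Node1.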

 

The file share witness server must be reachable from all cluster nodes on the following SMB ports:

445
139
135

 

Quorum Best Practices and Requirement

Use a node-plus-witness quorum (Node and Disk Majority or Node and File Share Majority), especially for an even number of cluster nodes, and decide whether a quorum disk or a file share witness (FSW) suits the failover cluster.

For multi-subnet clustering, it is recommended to use a file share witness in this situation.

Microsoft recommends placing the FSW in a third site that has a direct connection to both cluster sites. Otherwise, place it in the primary or secondary site depending on the priority between high availability and disaster recovery.

Avoid hosting the FSW on a cluster node or on virtual machines in the same cluster.
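The reason a witness matters for an even number of nodes comes down to majority arithmetic: the cluster stays up only while more than half of the votes are present. The sketch below assumes a simple static model with one vote per node and one for the witness, and ignores any dynamic vote adjustment the cluster may perform.

```python
def votes_needed(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def tolerated_failures(nodes, witness=False):
    """How many voters can be lost while a majority still remains."""
    total = nodes + (1 if witness else 0)
    return total - votes_needed(total)


for nodes in (2, 3, 4):
    print(
        f"{nodes} nodes: tolerates {tolerated_failures(nodes)} failure(s) alone, "
        f"{tolerated_failures(nodes, witness=True)} with a witness"
    )
```

With 2 or 4 nodes the witness raises the number of tolerated failures by one. With 3 nodes it adds nothing, which is why the witness is recommended mainly for even node counts.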

Failover cluster Configuration

This section describes a typical multi-subnet failover cluster design for disaster recovery and failover.

This design mainly targets disaster recovery rather than high availability.

Server Hardware configuration

Once the hardware configuration is decided, the next task is racking and cabling the physical servers in the datacenter, depending on how failover and high availability are planned.

Configure the server for remote management and high performance, and enable Static High Performance and OS Control mode in the BIOS/Platform Configuration.

http://h20566.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c00300430

http://h17007.www1.hpe.com/docs/iss/proliant_uefi/UEFI_Gen9_060216/s_set_hp_power_regulator.html

https://www.hyper-v-server.de/hypervisor/cisco-ucs-blade-settings-fr-hyper-v/

Server Configuration for Failover Clustering

Install Windows Server 2008 R2 Enterprise or Windows Server 2012 R2 Standard/Datacenter edition for the failover cluster.

Configure two NICs, one for the public network and the other for the private network.

  1. The public network should be routable across all networks and the private network should not be routable.
  2. For multi-subnet failover clustering, the private/heartbeat network should communicate between the cluster nodes but not with other networks in the server environment.
  3. Make sure IPv6 is enabled and NetBIOS over TCP/IP is disabled.

Configure the local disk based on the application requirement and allocate storage.

Change the server power setting to High Performance:

Run "POWERCFG.EXE /S SCHEME_MIN" at a Command Prompt as administrator.

Install and configure server management and other standard software based on the build configuration standard document.

  • Antivirus – Symantec Endpoint Protection (SEP) is the corporate standard, installed through the SEPM server console. Make sure MDF, LDF and other SQL Server files are exempted from scanning
  • Configuration manager – System Center Configuration Manager (SCCM) is the standard configuration management tool. It also handles many other operations management tasks such as software distribution, software and hardware inventory and Windows update management
  • Backup – Veritas NetBackup is the corporate standard backup system. Add backup exceptions for the MDF/LDF/NDF file types
  • Server monitoring solution – Microsoft System Center Operations Manager (SCOM) is the standard monitoring and historical performance data tool. SCOM monitors the operating system and services installed on the servers. The operations team installs the SCOM agent and configures it to report to the corresponding management servers
  • Event log collector – SNARE, QRadar and similar tools are the standard log collection and forensic analysis tools. The operations team installs and configures the SNARE agent to send events to the corresponding LCPs
  • Server hardware management – The operations team uses HP-SIM to monitor and troubleshoot hardware issues. Configure the server to send hardware events to HP-SIM

 

Install the .NET Framework 3.5 and Failover Clustering features on the Windows server. Run the PowerShell commands below to install these features:

  1. Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
  2. Add-WindowsFeature NET-Framework-Core -Source c:\temp\sxs (first copy the sxs folder from the sources folder of the Windows installation media to c:\temp)

Create the failover cluster, named as per the organization's naming standard.

  1. Run the validation tests using the PowerShell command "Test-Cluster -Node Site1Node1, Site1Node2, Site2Node1, Site2Node2"
  2. Create the cluster using "New-Cluster -Name cluster1 -Node Site1Node1, Site1Node2, Site2Node1, Site2Node2 -StaticAddress <Site1StaticIP>,<Site2StaticIP> -NoStorage"

Add storage to the cluster, then import and create the shared disks using the commands below.

  1. Bring all offline disks online: Get-Disk | where {$_.OperationalStatus -eq "Offline"} | foreach {Set-Disk -Number $_.Number -IsOffline $false}
  2. Clear the read-only flag on the online disks: Get-Disk | where {$_.IsReadOnly -eq $true} | foreach {Set-Disk -Number $_.Number -IsReadOnly $false}
  3. Verify the disk status by running Get-Disk | select Number,IsReadOnly,IsOffline
  4. Either import all disks and create the volumes manually using Disk Management, or use the PowerShell command "Get-Disk | Where-Object {$_.PartitionStyle -eq 'RAW'} | Initialize-Disk -PartitionStyle GPT -PassThru | New-Partition -AssignDriveLetter -UseMaximumSize | Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 -Confirm:$false". A RAW disk must be initialized (GPT is used here, MBR is also valid) before a partition can be created on it.

Add the shared disks to the cluster using either Failover Cluster Manager (steps 1 and 2) or PowerShell (step 3):

  1. Open Failover Cluster Manager, navigate to Disks and select Add Disk
  2. Select the disks to be added to the cluster and select OK
  3. Alternatively, add a disk to the cluster using PowerShell: "Get-ClusterAvailableDisk | ?{$_.Number -eq 1} | Add-ClusterDisk"

Label the disks on the cluster and add the storage to the corresponding cluster resource.

For multi-subnet clusters only: on the heartbeat network, add a persistent route with a custom gateway for the private network (10.10.10.x are primary site IPs and 10.20.20.x are secondary site IPs).

Run "route -p add 10.20.20.0 mask 255.255.255.0 10.10.10.1" on the primary site cluster nodes.

Run "route -p add 10.10.10.0 mask 255.255.255.0 10.20.20.1" on the secondary site cluster nodes.
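The two route commands mirror each other: each site routes the other site's heartbeat subnet through its own local gateway. The sketch below generates them from a site table and checks that each gateway sits inside its own subnet and that the subnets do not overlap. The site names "primary" and "secondary" and the /24 prefix length (taken from the 255.255.255.0 mask) are the only inputs. Nothing here runs the commands.

```python
import ipaddress


def heartbeat_route_commands(sites):
    """sites maps a site name to (heartbeat subnet in CIDR form, local gateway)."""
    commands = {}
    for name, (cidr, gateway) in sites.items():
        local_net = ipaddress.ip_network(cidr)
        if ipaddress.ip_address(gateway) not in local_net:
            raise ValueError(f"gateway {gateway} is outside {local_net} at site {name}")
        site_commands = []
        for other_name, (other_cidr, _) in sites.items():
            if other_name == name:
                continue
            remote_net = ipaddress.ip_network(other_cidr)
            if remote_net.overlaps(local_net):
                raise ValueError(f"{remote_net} overlaps {local_net}")
            # Route the remote heartbeat subnet through the local gateway.
            site_commands.append(
                f"route -p add {remote_net.network_address} "
                f"mask {remote_net.netmask} {gateway}"
            )
        commands[name] = site_commands
    return commands


sites = {
    "primary": ("10.10.10.0/24", "10.10.10.1"),
    "secondary": ("10.20.20.0/24", "10.20.20.1"),
}
for site, cmds in heartbeat_route_commands(sites).items():
    for cmd in cmds:
        print(f"{site}: {cmd}")
```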

Label the cluster networks and configure the private and public networks.

Configure the quorum disk or file share witness on the failover cluster.

On a normal (same-subnet) failover cluster, reconfigure the cluster quorum with a shared quorum disk: "Set-ClusterQuorum -NodeAndDiskMajority 'Cluster Disk 8'". The disk number varies depending on the cluster disk used for quorum.

On multi-subnet clusters, reconfigure the cluster with a file share witness quorum: "Set-ClusterQuorum -NodeAndFileShareMajority \\fileserver\fsw".

Check for all applicable security updates for the server and reboot it. During the reboot, also confirm that ownership of the cluster name and IP address changes as required.

Run a vulnerability assessment tool to find any outstanding vulnerabilities and fix them.

Hand over the cluster to the application or DBA team for SQL failover cluster or AlwaysOn configuration, assist them with clustering and grant the required access.

Active Directory Object for SQL FC Cluster and AlwaysOn

The AD objects below need to be created in Active Directory, with the permissions needed for the cluster network name to join the pre-created computer objects.

  1. Cluster network name: This is created automatically while creating the failover cluster (if a domain admin or an account with computer object management rights is used).
  2. SQL failover cluster network name: Since it is created by SQL setup, the computer object must be pre-created using the name already agreed between the SQL DBA, Hosting, Network and Operations teams.
  3. SQL AlwaysOn cluster network name (availability group): As with the SQL FCI name, the computer object must be pre-created using the name already agreed between the SQL DBA, Hosting, Network and Operations teams.

NB: All pre-created Active Directory computer objects should grant Full Control permission to the respective cluster network name.

Connection Timeouts in Multi-subnet

By default, the SQL client libraries try all IP addresses returned by the DNS lookup, one after another, until either a connection is made, all of the IP addresses have been exhausted, or the connection timeout threshold has been reached. This creates a problem when DNS returns an offline IP address first and the record stays cached for some time: the application then sees intermittent timeouts when connecting to the databases.
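The effect of address ordering can be seen in a small simulation of that sequential behavior. This is a simplified model, not the client library's actual code: the 21-second cost of a failed attempt, the 15-second login timeout and the two listener IP addresses are illustrative assumptions.

```python
def time_to_connect(addresses, per_attempt_timeout, login_timeout):
    """Simulate trying each address in order.

    addresses: list of (ip, is_online) in the order DNS returned them.
    Returns the seconds spent before connecting, or None if the login
    timeout is reached or every address has been tried without success.
    """
    elapsed = 0.0
    for _ip, is_online in addresses:
        if is_online:
            return elapsed
        # An offline address costs a full per-attempt timeout.
        elapsed += per_attempt_timeout
        if elapsed >= login_timeout:
            return None
    return None


# Hypothetical listener addresses, one per site; only the secondary is online.
offline_first = [("10.10.10.50", False), ("10.20.20.50", True)]
online_first = [("10.20.20.50", True), ("10.10.10.50", False)]

print(time_to_connect(offline_first, per_attempt_timeout=21, login_timeout=15))  # None
print(time_to_connect(online_first, per_attempt_timeout=21, login_timeout=15))   # 0.0
```

The same pair of addresses either connects immediately or times out, depending only on the order in which they are returned, which matches the intermittent behavior described above.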

To fix this, two settings must be corrected on the AlwaysOn cluster resource name in the Windows failover cluster: RegisterAllProvidersIP and HostRecordTTL.

RegisterAllProvidersIP

This parameter determines whether the Windows cluster registers all of the IP addresses the AG depends on, or only the one active IP address. When set to 1 (default), the clustered resource is created with all of the IP addresses the AG depends on registered in DNS. When set to 0, only the one active IP address (the IP address that is online) is registered in DNS.

HostRecordTTL

This parameter governs how long (in seconds) cached DNS entries are kept on a client OS before they expire, forcing the client OS to query the DNS server again to obtain the current IP address. By default this value is 1200; change it to 300.

  1. Run "Get-ClusterResource <Always On Cluster Name> | Get-ClusterParameter RegisterAllProvidersIP, HostRecordTTL" to get the current values
  2. Run the PowerShell command "Get-ClusterResource <Always On Cluster Name> | Set-ClusterParameter RegisterAllProvidersIP 0" to register only the online IP address
  3. Run "Get-ClusterResource <Always On Cluster Name> | Set-ClusterParameter HostRecordTTL 300" to change the TTL value to 300 seconds
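What the two parameters change from a client's point of view can be sketched as follows. The model is deliberately simple: the listener IP addresses are hypothetical, and the worst case assumes a client cached the record just before a failover.

```python
def dns_records(listener_ips, register_all_providers_ip):
    """listener_ips maps each listener IP to True when it is online.

    With RegisterAllProvidersIP = 1 every IP is registered in DNS;
    with 0 only the online IP is registered.
    """
    if register_all_providers_ip == 1:
        return sorted(listener_ips)
    return sorted(ip for ip, online in listener_ips.items() if online)


def worst_case_stale_minutes(host_record_ttl_seconds):
    """Longest time a client may keep using a cached record, in minutes."""
    return host_record_ttl_seconds / 60


# Hypothetical listener addresses; the secondary site currently owns the AG.
ips = {"10.10.10.50": False, "10.20.20.50": True}

print(dns_records(ips, 1))              # both addresses are returned
print(dns_records(ips, 0))              # only the online address is returned
print(worst_case_stale_minutes(1200))   # 20.0 minutes with the default TTL
print(worst_case_stale_minutes(300))    # 5.0 minutes with the reduced TTL
```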

It is also recommended to change CrossSubnetThreshold, RouteHistoryLength and SameSubnetThreshold to the values given in the PowerShell commands in this section.

Ref:

https://blogs.msdn.microsoft.com/clustering/2012/11/21/tuning-failover-cluster-network-thresholds/

https://technet.microsoft.com/en-us/library/dd197562%28v=ws.10%29.aspx?f=255&MSPPError=-2147217396

The PowerShell commands are as follows:

$Cluster = Get-Cluster

$Cluster.CrossSubnetThreshold = 5

$Cluster.RouteHistoryLength = 10

$Cluster.SameSubnetThreshold = 5
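A threshold is a count of consecutive missed heartbeats, so how long it takes to declare a node down also depends on the heartbeat interval. The sketch below shows that relationship. The 1000 ms interval is an assumption made for illustration and should be checked against the cluster's own SameSubnetDelay and CrossSubnetDelay values.

```python
def failure_detection_seconds(delay_ms, threshold):
    """Seconds of silence before a node is considered down."""
    return delay_ms * threshold / 1000


ASSUMED_DELAY_MS = 1000  # assumed heartbeat interval, not taken from this document

print(failure_detection_seconds(ASSUMED_DELAY_MS, 5))   # 5.0
print(failure_detection_seconds(ASSUMED_DELAY_MS, 10))  # 10.0
```

Raising a threshold makes the cluster more tolerant of brief network interruptions between sites, at the cost of slower detection of a real failure.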

Disaster Recovery Situation

Scenario:

Both nodes at the primary site and the file share witness are down (the primary site is down).

Expected Result:

The cluster name and the cluster nodes at the secondary site go down because there are not enough cluster votes. The AlwaysOn AG, if present, is also down.

Recovery:

Log on to a node at the secondary site.

Open a PowerShell command line as administrator and import the FailoverClusters module.

Stop the cluster service on both secondary nodes: "(Get-Service -ComputerName <Site2-Node0x> -Name ClusSvc).Stop()" (ClusSvc is the service name of the Cluster Service).

Force start the cluster on one node by running Start-ClusterNode -Name <Site2-Node0x> -FixQuorum

Verify that the cluster is up on that node by running "Get-ClusterNode -Name <Site2-Node0x>"

Start the cluster service on the other node by running Start-ClusterNode -Name <Site2-Node0y> -PreventQuorum

Change the quorum vote settings so that the secondary site nodes hold a vote and the primary site nodes do not:

"(Get-ClusterNode -Name <Site2-Node0x>).NodeWeight=1"

"(Get-ClusterNode -Name <Site1-Node0x>).NodeWeight=0"

When bringing the primary site back up, start each primary site cluster node with PreventQuorum by running Start-ClusterNode -Name <Site1-Node0x> -PreventQuorum
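The vote arithmetic behind this scenario can be checked with a short sketch. It assumes a four-node cluster (two nodes per site, named as in the Test-Cluster example) with a file share witness, and assumes that the NodeWeight change removes the votes of the primary site nodes. It models vote counting only and not the forced start itself.

```python
def has_quorum(node_weights, online_nodes, witness_configured, witness_online):
    """True when the online voters hold a strict majority of all votes."""
    total = sum(node_weights.values()) + (1 if witness_configured else 0)
    live = sum(node_weights[node] for node in online_nodes)
    if witness_configured and witness_online:
        live += 1
    return live > total / 2


secondary_nodes = ["Site2Node1", "Site2Node2"]

# Before recovery: every node has one vote, primary site and witness are down.
weights_before = {"Site1Node1": 1, "Site1Node2": 1, "Site2Node1": 1, "Site2Node2": 1}
print(has_quorum(weights_before, secondary_nodes, True, False))  # False: 2 of 5 votes

# After the NodeWeight change: primary site nodes no longer vote.
weights_after = {"Site1Node1": 0, "Site1Node2": 0, "Site2Node1": 1, "Site2Node2": 1}
print(has_quorum(weights_after, secondary_nodes, True, False))   # True: 2 of 3 votes
```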

References:

https://technet.microsoft.com/en-us/library/33adaa5b-a6d3-4db3-a053-67b85ba7023d

http://houseofbrick.com/improve-performance-as-part-of-a-sql-server-install/

https://support.microsoft.com/en-us/kb/2784261

https://support.microsoft.com/en-us/kb/2545685

https://technet.microsoft.com/en-us/library/dd197430

https://blogs.technet.microsoft.com/meamcs/2013/11/09/microsoft-windows-multi-site-failover-cluster-best-practices/

https://blogs.msdn.microsoft.com/jimmymay/2008/12/04/disk-partition-alignment-sector-alignment-for-sql-server-part-4-essentials-cheat-sheet/

https://blogs.msdn.microsoft.com/jimmymay/2009/05/08/disk-partition-alignment-sector-alignment-make-the-case-save-hundreds-of-thousands-of-dollars/

 
