Managing The Data Center At The Rack Level

A datacenter is a bustling place of activity, very similar to a beehive. Just as every bee in a hive has a different task to perform, every IT asset in a datacenter has a different task to perform. The IT infrastructure within a datacenter is a hub with very complex relationships between the three primary resources that support it – space, power and cooling. Managing IT assets in a datacenter is certainly an uphill task, even though a wide range of datacenter management tools is available. Ever considered why? Is it because of the sheer number of assets, or is it due to the complexity of juggling power and cooling resources in a non-homogeneous environment that supports a 2kW rack as well as a 22kW rack?

A wise approach to a complex situation is to break it down into smaller parts and then find the part that is causing the problem. Hence, when addressing the complex environment in a data center, the ideal way is to break it into more manageable segments – the racks.

Managing at the Rack level
With the increase in power consumption at the rack level, measuring and monitoring power and cooling at the rack level becomes increasingly important for managing data center capacity. Information about power and cooling at the data center or room level can be useful in answering general capacity questions. However, accurate answers to questions about changes to the IT infrastructure require information at the rack level. Detailed rack-level power and cooling data is the most useful information for capacity management.

Does rack density affect capacity planning?

Capacity planning in a data center is usually done by data center managers who base their decisions on the amount of power, cooling and space required to support the IT infrastructure hosted within the data center. This planning very often is based on the concept of ‘power density’. Power density can have different interpretations as mentioned below:

• Power consumption of IT equipment / Area used by racks
• Power consumption of IT equipment / Area used by racks and their clearances
• Power consumption of IT equipment / Data center space
• Power consumption of IT, power and cooling equipment / Data center space
• Power consumption / Number of racks

The first four definitions express power density in watts/ft2 or watts/m2. The fifth expresses it in kW/rack. Assuming that the data center has a homogeneous environment, where each rack uses the same amount of power, any of the above definitions can be used for capacity planning as long as it is consistently applied.
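The five definitions can give very different numbers for the same room. The short Python sketch below computes each one; all figures (the areas, the 900kW facility load) are hypothetical values chosen for illustration, not measurements from a real data center.

```python
def power_density(power_kw, divisor):
    """Return power density: power divided by an area or a rack count."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return power_kw / divisor

# Hypothetical figures, for illustration only.
it_power_kw = 600.0                  # power drawn by IT equipment
facility_power_kw = 900.0            # IT plus power and cooling equipment
rack_area_m2 = 75.0                  # footprint of the racks alone
rack_and_clearance_area_m2 = 250.0   # racks plus their clearances
data_center_area_m2 = 500.0          # whole data center space
rack_count = 100

densities = {
    "IT power / rack area (kW/m2)":
        power_density(it_power_kw, rack_area_m2),
    "IT power / rack and clearance area (kW/m2)":
        power_density(it_power_kw, rack_and_clearance_area_m2),
    "IT power / data center space (kW/m2)":
        power_density(it_power_kw, data_center_area_m2),
    "IT, power and cooling / data center space (kW/m2)":
        power_density(facility_power_kw, data_center_area_m2),
    "IT power / number of racks (kW/rack)":
        power_density(it_power_kw, rack_count),
}

for label, value in densities.items():
    print(f"{label}: {value:.2f}")
```

Whichever definition is chosen, the same one has to be used every time, otherwise the figures cannot be compared.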

However, the power consumption of racks can vary significantly, depending on their type. For example, a patch panel rack draws 0kW of power, while a blade server rack can draw 20kW of power or even more. Let's take a hypothetical situation where a data center has 100 racks and a total power consumption of 600kW. We can calculate the average power consumption by dividing the total power consumed by the number of racks, therefore:

Average power/rack = 600kW/100 racks = 6kW per rack

If the data center is designed to support 6kW per rack, the following figure illustrates the issues that can arise:

Managing to an average power consumption across diverse rack types can cause two types of problems. Firstly, there is the issue of stranded resources. For example, if a rack consumes only 2kW of power but is supplied with 6kW, then 4kW of the power and cooling supplied to the rack is not consumed and is wasted (stranded). This over-provisioning lowers the load on the power and cooling equipment, which reduces overall energy efficiency.
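The effect of designing to the average can be sketched in a few lines of Python. The rack names and per-rack draws below are hypothetical; the 6kW design value is the average from the example above. Racks drawing less than the design value strand capacity, while racks drawing more are short of supply.

```python
def provisioning_gaps(rack_draw_kw, design_kw_per_rack):
    """Split racks into stranded capacity and shortfall versus the design value."""
    stranded = {}   # rack -> kW supplied but not used
    shortfall = {}  # rack -> kW needed beyond the design value
    for rack, draw in rack_draw_kw.items():
        gap = design_kw_per_rack - draw
        if gap > 0:
            stranded[rack] = gap
        elif gap < 0:
            shortfall[rack] = -gap
    return stranded, shortfall

# Hypothetical rack draws in kW, for illustration only.
racks = {"patch-panel": 0.0, "storage": 2.0, "general": 6.0, "blade": 20.0}

stranded, shortfall = provisioning_gaps(racks, design_kw_per_rack=6.0)
print("Stranded kW by rack:", stranded)
print("Shortfall kW by rack:", shortfall)
```

In this sketch the patch panel and storage racks strand 10kW between them, and the blade rack needs 14kW more than the design provides.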

Secondly, a more serious issue is the increased risk of shutdown due to inadequate power and cooling distribution to the racks. Having sufficient power and cooling in the data center as a whole does not prevent downtime at the rack level. Power consumption generates heat, and an increase in power consumption generates more heat, leading to the creation of ‘hot spots’ within the data center. There are numerous ways to reduce hot spots, including additional cooling, cold or hot aisle containment, and segmenting the data center by density, but all of these methods require data at the rack level. Without rack-level information on how much power and cooling each rack requires, a data center manager cannot make an informed decision as to which of these methods will work best in a given situation.

What should be measured at the rack level?

The key areas that need to be considered are:
• How much power is the rack consuming in total?
• How much power is the rack PDU providing?
• How much power is each IT device consuming?
Environmental Sensors
• What is the inlet temperature at the front of the rack?
• What is the outlet temperature at the back of the rack?
• What is the humidity of the air being supplied to the IT equipment?
• What is the pressure differential between the hot and cold aisles?
• Are the front and rear rack doors open?
• Are the front and rear rack doors locked?

When determining what data to collect and how often to collect it, the important question is how the information will be used. Will the data be used for creating alerts or for viewing trends for capacity planning? If the data is to be used for capacity planning, and the data center manager wants to understand changes over an extended period of time (a month or longer), collecting the data once every hour is more than sufficient. If the data is to be analyzed over a shorter period of time (a day), then data snapshots should be collected more often.

When dealing with power data, the following three parameters should be considered:

• Total rack power: important for knowing how much power is consumed at the rack and how much cooling is needed to cool the rack
• Rack PDU power: important for calculating the total rack power and for alerting to potential overload conditions
• IT device power: provides a greater level of detail at the device level, but metering it can add considerably to the cost of the rack PDU. This information can also be obtained by communicating with the device itself using IPMI (Intelligent Platform Management Interface) or manufacturer-specific baseboard management interfaces such as iLO or iDRAC.
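As a rough illustration of how these readings fit together, the Python sketch below sums per-PDU load into a rack total and flags PDUs that are close to their rating. The PduReading type, the PDU names, the readings and the 80% alert threshold are all assumptions made for this example; real thresholds depend on the PDU and the local electrical rules.

```python
from dataclasses import dataclass

@dataclass
class PduReading:
    """One rack PDU reading (hypothetical structure for this sketch)."""
    name: str
    load_kw: float   # power the PDU is currently delivering
    rated_kw: float  # rated capacity of the PDU

def rack_total_kw(readings):
    """Total rack power is the sum of the load on every PDU in the rack."""
    return sum(r.load_kw for r in readings)

def overloaded_pdus(readings, threshold=0.8):
    """Names of PDUs loaded above threshold * rating (threshold is an assumption)."""
    return [r.name for r in readings if r.load_kw > threshold * r.rated_kw]

# Hypothetical readings, for illustration only.
readings = [
    PduReading("pdu-a", load_kw=2.1, rated_kw=5.0),
    PduReading("pdu-b", load_kw=4.3, rated_kw=5.0),
]

total = rack_total_kw(readings)
alerts = overloaded_pdus(readings)
print(f"Rack total: {total:.1f} kW")
print("PDUs above alert threshold:", alerts)
```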

Temperature readings at the rack are important to ensure that the rack inlet temperature (the temperature of the air being drawn into the IT equipment) is sufficiently low to properly cool the equipment. Just as with power, there are multiple levels of temperature monitoring which can be done at the rack level. In the case of temperature, however, the level of monitoring comes down to the number of sensors and their placement within the rack. One typical choice is to have three temperature sensors at the front of the rack – low, medium and high – in order to measure the inlet temperature variance across the front of the rack. Some choose to measure the rear temperature, particularly if using a hot-aisle containment system. Some choose to measure temperature at every other rack or only for selected racks. All of these are valid choices.
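With three inlet sensors, the variance across the front of the rack can be summarized as the hottest reading and the spread between the hottest and coolest positions. The Python sketch below shows one way to do this; the sensor readings and the 27°C inlet limit are assumptions for illustration, and the limit should come from the equipment vendor or the site's own policy.

```python
def inlet_summary(temps_c, max_inlet_c=27.0):
    """Summarize inlet readings keyed by sensor position (limit is an assumption)."""
    hottest = max(temps_c, key=temps_c.get)
    spread = max(temps_c.values()) - min(temps_c.values())
    return {
        "hottest_position": hottest,
        "max_c": temps_c[hottest],
        "spread_c": spread,
        "over_limit": temps_c[hottest] > max_inlet_c,
    }

# Hypothetical readings in degrees Celsius at the front of one rack.
summary = inlet_summary({"low": 20.5, "medium": 23.0, "high": 28.0})
print(summary)
```

A large spread with the hottest reading at the top of the rack is a common sign that warm exhaust air is recirculating over the rack into the cold aisle.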

Humidity readings at the rack are important to ensure that there is sufficient moisture in the air being drawn into the IT equipment. If the humidity is too high, there is a risk of condensation forming on the equipment. If the humidity is too low, there is an increased risk of static electricity. Unlike temperature sensors, there is typically only a single humidity sensor used to measure the humidity at the front of the rack.

The pressure differential between the front and back of the rack is an important indication of the ability of the air to move from the cold aisle to the hot aisle, cooling the IT equipment in the process. If the pressure differential is too low, the rack will only be able to handle a certain load without exceeding a particular temperature threshold. This pressure differential can be measured with a pressure monitor at the front and the rear of the rack.

Security at the rack is important if only certain people are allowed to interact with equipment within the rack. The two important measurements are the lock status of the rack (if employing an electronic locking mechanism) and the door open status.

Inventory at the Rack Level
A data center can contain thousands of assets, from servers, storage and network devices to infrastructure support equipment such as PDUs, UPSs and cooling units. Keeping track of these assets is an ongoing task faced by data center managers. A Digital Realty Trust survey found that only 26% of data center managers could locate a server that had gone down within minutes. Only 58% could locate the server within 4 hours and 20% required more than a day. The inability to locate equipment in the data center increases the mean time to repair (MTTR) for the equipment and decreases the overall availability.
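The link between MTTR and availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The Python sketch below applies it; the MTBF of 10,000 hours and the repair times are hypothetical values, not figures from the survey.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical values: same server, different times to locate and repair it.
mtbf_hours = 10_000.0
for mttr_hours in (0.5, 4.0, 24.0):
    pct = availability(mtbf_hours, mttr_hours) * 100
    print(f"MTTR {mttr_hours:>4} h -> availability {pct:.3f}%")
```

Every hour spent simply finding the failed server is added to MTTR, so availability falls even though the equipment itself is no less reliable.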

At the rack level, there are three primary resources, which must be considered when trying to determine whether the rack can support a new asset:
• Is there enough contiguous space to house the asset?
• Is there sufficient redundant power for the asset?
• Is there enough cooling to remove the heat generated by the asset?
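These three checks can be sketched in Python as follows. The function names, the 42U rack height and all the sample values are assumptions for illustration. Redundant power is modeled here as a rack with two feeds where a single feed must be able to carry the whole load, which is one common arrangement but not the only one.

```python
def largest_free_run(occupied_units, rack_height_u=42):
    """Length in rack units of the longest contiguous free block."""
    longest = current = 0
    for unit in range(1, rack_height_u + 1):
        if unit in occupied_units:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest

def can_host(asset_height_u, asset_kw, occupied_units, rack_load_kw,
             feed_capacity_kw, cooling_capacity_kw, rack_height_u=42):
    """Check space, redundant power and cooling for a new asset."""
    new_load_kw = rack_load_kw + asset_kw
    return {
        # The asset needs one contiguous block, not just enough free units.
        "space": largest_free_run(occupied_units, rack_height_u) >= asset_height_u,
        # Assumption: either feed alone must carry the full rack load.
        "power": new_load_kw <= feed_capacity_kw,
        # Nearly all power drawn by IT equipment ends up as heat to remove.
        "cooling": new_load_kw <= cooling_capacity_kw,
    }

# Hypothetical rack: units 1-20 and 23-40 are occupied, leaving two 2U gaps.
occupied = set(range(1, 21)) | set(range(23, 41))
result = can_host(asset_height_u=4, asset_kw=0.8, occupied_units=occupied,
                  rack_load_kw=3.0, feed_capacity_kw=5.0,
                  cooling_capacity_kw=6.0)
print(result)
```

In this example the rack has 4U free in total and enough power and cooling, but the 4U asset still does not fit because the free space is split into two 2U gaps. This is why the inventory has to be accurate down to the rack unit.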

Comprehensive management of IT equipment must consider space, power and cooling at the rack level. Therefore, the inventory of the assets in the rack is required down to the rack unit. This is typically done through a manual data entry and audit process. There are drawbacks to this method, however. Manual entry is an arduous, time-consuming process that is typically rife with errors. In the Computer Associates technology brief Striving to Achieve 100% Data Accuracy: The Challenge for Next Generation Asset Management (Watson & Fulton, 2009), the authors point out the difficulty in maintaining the accuracy of this information. The authors state that “Manual tracking with pen and clipboard, or even spreadsheets is time consuming and highly error-prone. Organizations can typically expect a 10% error rate in manual data entry due to typing and transcribing errors.”

There are other options for asset inventory at the rack level that can reduce or eliminate the manual process and its inherent errors. The two primary methods are radio frequency identification (RFID) and “tethered” solutions. A tethered solution ties an asset to a rack location by means of a cable or some other physical connection.

At the end it all comes down to managing intelligently
Even with the proper tools, managing a data center is a difficult task due to the number of assets and the complex relationships between space, power, and cooling which must be balanced. The problem can be simplified by breaking the large, complex data center environment into more manageable segments – the racks. Doing this simplifies the overall management while providing the added benefit of more closely monitoring the resource requirements of the IT equipment.

Mahesh Trivedi – Senior Vice President and Head at Netmagic