ZCS Operational Best Practices - Operational Structure and Guidelines
Zimbra Operational Philosophy
Over the past several years, Zimbra has been dedicated to the pursuit of Operational Excellence in and around its product line. Each Zimbra service offering, working in partnership with our customers, seeks to evaluate the operational health and performance of individual customer site installations. The complex nature of implementing and managing operations for the ZCS and multi-vendor stack components requires that the assessment methodology include a comprehensive process and technical analysis. This analysis focuses on Best Practice methods to effectively manage and administer applications and services, from mid-sized enterprises to large carrier-class deployments, while improving operational efficiencies, minimizing downtime and increasing scalability.
Each Operational Assessment (or Audit) offers an improvement plan that is specific to a particular customer environment. Once the environment is evaluated based on the assessment methodology criteria, it is compared to industry Best Practices and targeted improvement recommendations are made. Each recommendation is prioritized according to its ability to impact the following criteria:
- Reliability. The ability to perform under stated conditions for a stated period of time. Avoid Unplanned Downtime.
- Recoverability. The ability to easily bypass and recover from a component failure and to restore services as quickly as possible. Reduce Recovery Time.
- Serviceability. The ability to perform effective problem determination, diagnostics, and repair. Reduce Planned Downtime.
- Manageability. The ability to create and maintain an environment that limits the negative impact people may have on the system, and that limits each of the times described in the other three Service Availability elements above.
Within a ZCS environment, intersecting technology layers are assembled to build the overall services that are provided by the application. One aspect of these layers would be Linux System Administration. This is considered a core system layer of the environment, and requires System Administration tasks be performed on each component within the overall service layer.
Another aspect would be the redundant (N+1) load-balanced components of the ZCS. These would be considered front-end architectural components of the environment, including the Proxy Services for POP, IMAP and HTTP, the MTA (Mail Transfer Agent), and any Anti-Abuse or edge components. Where there are various applications and software components providing a redundant service, each of the integrated third-party or extended ZCS components is directly dependent on the core layer in which it sits (Operating System, Network, Storage, etc.). Table 2.1 provides a conceptual view of the various layers involved in supporting a ZCS environment.
Task Assignments across Technology Layers
Best Practice dictates that operational processes and procedures be clearly defined and documented in an Operations Manual. The Operations Manual should be regularly reviewed and updated and made available to all operations staff via a central repository. A customer’s Operations Manual includes all management escalation paths and roles and responsibilities. It includes, for example: all operational process and procedure workflows, task descriptions and schedules, metrics and Key Performance Indicators (KPI), and reporting criteria used for process and application status as well as trending information.
Each process in the Operations Manual must be defined using standardized naming conventions, workflow documentation tools and process content structure. Each documented process must have clear ownership and responsibilities and a designated backup who can manage the process in the event the primary owner is unavailable. Application specific requirements and thresholds must be defined and provided to the process owners, and must be incorporated into the Operations Manual as Standard Operating Procedure (SOP).
Table 2.2 provides a conceptual view of the process tasks within the intersecting technology layers of a ZCS environment.
The Operational Staffing outlined here is designed to help identify the technical and management skills required to manage and operate a high availability messaging service using Zimbra Collaboration Suite. It takes into consideration the operational staffing needs of a 7 x 24 messaging service including the planning, engineering, and operations phases of the service lifecycle. Also provided are estimated headcount levels for the various messaging functions. In this environment, ZCS systems can be operated by as few as 2-3 support engineers depending on the size and complexity of the environment. This section looks at the broader operational requirements for delivering quality service.
The Messaging Organizational Structure
The Messaging team is structured around three primary objectives:
- Operations is the day-to-day management and support of the production system. The team is responsible for operating the messaging service to defined service levels using documented processes. When there are system events, this team plays a pivotal role in addressing the issue and/or coordinating resources to restore service and state quickly.
- Engineering is responsible for the technical aspects of capacity, performance, enhancements and growth management to handle existing and new service requirements. In addition, they compile business requirements, develop / augment the service architecture, and introduce and manage change in a predictable way.
- The planning process is essential for developing new capabilities, introducing new services and handling growth requirements in a predictable way. One of the key objectives of the planning team is to ensure that the right resources are available and coordinated to achieve business objectives. The focus should be to drive the development of new capabilities in order to enable new services and functionality within the project planning outlines, and to ensure that the planning encompasses full integration of all added and legacy components.
The actual number of staff required to execute the above functions is driven by:
- Number of systems (node counts), deployment size, message volume, users, integration points and service levels
- Complexity of system and services offered
- Existing in-house technical and management skills
- Plans to launch new services
A production messaging system that handles millions of messages and users daily will need dedicated Operations, Engineering, and planning functions. Messaging is a complex service to operate successfully. Due to the nature of messaging services, there is constant movement of large amounts of data. The traffic tends to be spiky, and various user interactions can create unpredictable traffic patterns.
Within the above-defined structure, the groups are also responsible for performing the following functional roles.
- The Operations team is responsible for managing the service to defined service levels, using documented procedures. They are the primary entry point for all messaging service issues and escalations from Tier I. They are responsible for the daily operational tasks and reporting. Operations should build a strong, cohesive relationship with Engineering, sharing operational experiences, performance, and capacity information.
- It is important that the Operations team is a dedicated staff, and although they escalate to engineering, they have the skills and capabilities to allow the Engineering team to focus on new services, features and growth management.
- The Engineering team includes architecture, design, testing and deployment members. Engineering also includes third level support. Post deployment, typically an architect is needed to handle new service requirements and growth planning. Additionally, a design engineer is needed to handle system modifications. Also, many customers have existing quality assurance functions to test changes to the customer's provisioning, billing and monitoring processes, as well as the core ZCS platform.
- The engineering staffing requirements are directly related to the amount of custom application development that a service provider requires. For example, the service may require extensive customization for web features or reporting. In most environments Engineering will integrate the mail system with provisioning, billing, backup/recovery and monitoring systems.
- Typically, a project manager or lead is needed when launching a new messaging service. This individual can be the engineering or operations manager. The role of the project manager is to coordinate the system implementation with the customer and vendor resources. The project manager should coordinate all phases of the project and consider operational requirements like staffing, training and procedures.
Tiered Support Model
Most customers employ a tiered support model, with an entry-level tier that provides monitoring and basic troubleshooting 7 x 24, an operational support tier, and an engineering tier. Tier I typically monitors a wide array of systems supporting many services. It is essential that Tier I have quality tools, documentation, and escalation criteria. A high-level example of a tiered support model is provided below.
Tier I is the primary entry point into support. It monitors systems and services 7 x 24, posts information regarding infrastructure / application issues, and provides initial diagnostics / validation of problem issues. If possible, Tier I will resolve the issue. If the problem requires more than Tier I resources / skills, Tier I should follow the documented escalation procedures for the application. With the proper training and tools, Tier I staff can cover ZCS monitoring. In addition to Tier I services, some environments use monitors to directly page (alert) Tier II operations staff of critical alarms and/or thresholds.
Tier II is the Messaging Operations team. This team owns the health of the messaging service and is responsible for managing the system to defined service levels. In addition to the daily administration of the environment, Tier II is the first escalation point when messaging system problems occur. They typically diagnose the situation and provide services to resolve the issue, or engage various internal and external teams to ensure system functionality is restored.
Tier III is the Messaging Engineering team. The Engineering team provides support for difficult system problems that require in-depth understanding of the messaging application and product integrations. This team leads the servicing and approval of code-level changes (patches) that address or resolve issues. They should work closely with the Tier II teams to share and mentor the core technical strengths of the various support teams. Although Engineering is the escalation point for Operations (Tier II), their primary focus should be on new service and feature delivery.
Root Cause Analysis procedures should confirm that the appropriate support tier handled the situation. Frequent escalations outside of the responsible support tier can be the result of inadequate skills, training, and/or documentation, all of which can have an impact on quality service delivery.
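As a rough illustration of the escalation path and the Root Cause Analysis check described above, the following sketch models the three tiers and measures how often incidents were resolved above their responsible tier. The `Incident` record, its field names, and the helper functions are hypothetical and are not part of ZCS or any Zimbra tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Escalation order described in the text: Tier I -> Tier II -> Tier III.
ESCALATION_PATH = ["Tier I", "Tier II", "Tier III"]


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    summary: str
    responsible_tier: str  # tier that should have resolved the issue
    resolved_by: str       # tier that actually resolved it


def next_tier(current: str) -> Optional[str]:
    """Return the next escalation point, or None when already at Tier III."""
    idx = ESCALATION_PATH.index(current)
    if idx + 1 < len(ESCALATION_PATH):
        return ESCALATION_PATH[idx + 1]
    return None


def escalated_beyond_owner(incident: Incident) -> bool:
    """True when the incident was resolved above its responsible tier."""
    resolved = ESCALATION_PATH.index(incident.resolved_by)
    responsible = ESCALATION_PATH.index(incident.responsible_tier)
    return resolved > responsible


def out_of_tier_rate(incidents: List[Incident]) -> float:
    """Fraction of incidents resolved above their responsible tier."""
    if not incidents:
        return 0.0
    flagged = sum(escalated_beyond_owner(i) for i in incidents)
    return flagged / len(incidents)
```

A rising `out_of_tier_rate` over successive review periods could be one signal that skills, training, or documentation in the responsible tier need attention.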
Tier Support Summary
The Tier II Operations team should be structured and staffed to manage and administer the various components of the ZCS environment 7 x 24 (coverage / on-call). They should work with Tier III Engineering and Tier I on the creation and maintenance of troubleshooting and escalation procedures. All teams should have a clear understanding of the roles, responsibilities, and coverage within each support tier.
The table below provides an example of a ZCS support coverage matrix that would be provided to a Tier I monitoring team.
The Operations function can initially be collapsed into two or three headcount depending on the environment. Typically, customers will have one or two application specialists for ZCS (which includes the MySQL Database). An additional system specialist operates the hardware and network platform. It is recommended that these areas of expertise be cross-trained in order to provide the necessary coverage.
As the number of systems and the complexity grow, more Operations resources can be added. Since ZCS can be managed in one place and many of the administration procedures can be automated, ZCS Operations resources grow minimally as the user count and message volume grow. This is a significant business benefit because it reduces service cost over time.
Typically, the number of systems in use drives the Operations requirements. It is wise to use vertical scaling to minimize the number of systems required as the service grows. Typically, when the number of users/traffic doubles, customers find that ZCS Operations requirements increase by 20-25%.
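The doubling rule above can be turned into a rough planning estimate. The sketch below assumes that the 20-25% increase compounds with each doubling of users/traffic, which the text does not state, and the function names are illustrative only.

```python
import math
from typing import Tuple


def ops_effort_multiplier(growth_factor: float, increase_per_doubling: float) -> float:
    """Estimate how Operations effort scales when users/traffic grow.

    Assumption (not from the source text): the per-doubling increase
    compounds, so each doubling multiplies effort by (1 + increase).
    """
    if growth_factor <= 0:
        raise ValueError("growth_factor must be positive")
    doublings = math.log2(growth_factor)
    return (1.0 + increase_per_doubling) ** doublings


def ops_effort_range(growth_factor: float,
                     low: float = 0.20,
                     high: float = 0.25) -> Tuple[float, float]:
    """Return the (low, high) effort multipliers for the 20-25% rule."""
    return (ops_effort_multiplier(growth_factor, low),
            ops_effort_multiplier(growth_factor, high))


if __name__ == "__main__":
    for growth in (2, 4, 8):
        low, high = ops_effort_range(growth)
        print(f"{growth}x users/traffic: {low:.2f}x to {high:.2f}x Operations effort")
```

Under this assumption, quadrupling the user base would raise Operations effort by well under half, which is consistent with the claim that Operations resources grow minimally with user count.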
Sample Staffing Organization
Larger service providers typically break each service layer (network, hardware, software) into dedicated teams and then roll up these departments into the respective engineering and operations organizations. Smaller ISPs or enterprises might place an entire messaging team including planning, engineering and operations under one manager.
As long as people are trained and have the right tools, clear roles and responsibilities, and a proper system design, any combination of these approaches may work. The key is to have all the roles identified and staffed, even if the same person is performing multiple roles.
- A. Messaging Management
- Responsibilities include the overall health and operational readiness of the messaging system. The manager must provide the right resource levels for staffing, training, hardware, processes, and tools to successfully operate the service. Also, the manager must work with the product management team to understand service requirements and provide estimates on capital and operating costs to achieve the business objectives.
- B. Project Management
- Responsibilities include implementing new projects required by product management, engineering, or operations into the production environment. The project manager coordinates all internal and external resources. It is not uncommon for this organizational function to own change management as well as service level management processes. Also, project management may lead system and risk analysis exercises as well as process development.
- C. Engineering
- The Engineering team is responsible for engineering and implementing changes into the production environment in a predictable way. These responsibilities include system architecture, design, functional specifications, test cases, deployment and change plans, capacity planning, and Tier III support. The team should consider technical as well as operational requirements when making changes.
- Note: The size of the design engineering team is proportional to the application development activity. Implementation of customized software applications and the relevant hardware (including SAN) will require more design resources.
- D. Operations
- The Operations team is responsible for achieving defined service levels using repeatable, documented processes. The team is responsible for the overall health of the system, including monitoring, system administration, and problem management. The Operations team should be relatively self-sufficient to minimize the impact on the Engineering team.
- Note: The size of the support engineering team is a direct function of how many systems must be supported and how complex the technical environment is, including network integration points (SAN, etc.).
Sample Staffing Matrix
The staffing matrix maps the sample organization to the staffing requirements for “medium” to “large” deployments. This matrix assumes the goal is to achieve all four levels of operational readiness. (See Operational Structure and Guidelines > Operational Maturity).
|Area||Function||Roles||Staff||Monthly Hours||Maturity Level|
|B||Project Management||Project Manager||1||160||2,3|
|C||Engineering||Architect (1), Engineer (2)||3||480||3,4|
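The two rows above imply a working assumption of 160 hours per person per month (1 staff maps to 160 hours, 3 staff to 480). A minimal sketch of that arithmetic, using hypothetical names, may help when extending the matrix to other head counts:

```python
# Inferred from the matrix rows above: 1 staff -> 160 hours, 3 staff -> 480 hours.
HOURS_PER_PERSON_PER_MONTH = 160


def monthly_hours(staff: int) -> int:
    """Monthly effort in hours for a given head count."""
    if staff < 0:
        raise ValueError("staff must be non-negative")
    return staff * HOURS_PER_PERSON_PER_MONTH


# Only the rows shown in the matrix above: (area, function, staff).
SAMPLE_ROWS = [
    ("B", "Project Management", 1),
    ("C", "Engineering", 3),
]

if __name__ == "__main__":
    for area, function, staff in SAMPLE_ROWS:
        print(f"{area} {function}: {staff} staff, {monthly_hours(staff)} hours/month")
```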
Sample Service Matrix with Estimated Level of Effort (Hrs / Month)
Before a service provider makes an investment in staffing the messaging team, the service provider should consider what service maturity level will meet its business goals in providing the messaging service. Different markets (e.g. consumer, business or public/edu/government) and market conditions (required quality levels) will determine the maturity level needed. Also, the need to introduce new services and handle growth will drive the need for business planning and architecture (Level 4) as well as engineering change (Level 3) capabilities.
In launching and/or maintaining a messaging service, all activities are required. A provider can choose to outsource higher-level activities (Level 3 and 4) to Zimbra Professional Services, either to mentor and build in-house expertise or, on period contracts, to act as an integrated provider. Achieving high maturity levels allows a carrier to provide high-quality messaging service as a core competency, establishing a competitive advantage that can be used to quickly move into new markets, handle growth, or wholesale services to other service providers.
Operational Maturity Levels
Level 1. Maintain
Level 1 is the basic ability to operate the service and restore service and state when an outage occurs. The service provider is staffed to provide basic hardware and software administration, monitoring, and troubleshooting. Additionally, staff deals with day-to-day troubleshooting as well as problem management. There is no predictable service level, planning, or growth management. Typically, the engineering team operates the service or is very involved in routine troubleshooting and problem management.
For service providers, this is an undesirable operations level because no matter how much capital is spent, the overall service level is unpredictable and often of poor quality. The service provider does not have an operations core competency that can be leveraged for premium services such as managed messaging or value-added telephony services such as wireless notification or universal messaging.
Level 2. Measure and Predict
At Level 2, the service provider can predictably manage the quality of service and make investments to achieve the targeted service level. Additional staff is required to develop and test processes, measure service levels, and plan for new services and growth. The team undertakes routine post-mortems, reviews outage events for root cause, provides proactive risk analysis, and develops plans to improve all areas.
Level 3. Design and Implement Change
At Level 3, service providers can take control of service improvements and changes needed to respond to new business requirements and growth. These changes can be introduced in a predictable way with minimal impact to the user community. Typically, service providers outsource initial mail system deployment to Zimbra Professional Services or to a certified Zimbra Partner. Over time, the service providers build the competency to manage change.
Level 3 staffing includes system and component design engineers, testers, and project management staff. The engineering team should also consider process and staffing impacts on the Operations staff. For example, will a new service increase staffing requirements, expertise level, or tools required to operate the service?
Level 4. Meet Business Goals
Level 4 is the ability to map business requirements to a service’s architecture and design. Typically, when launching or enhancing a messaging service, product marketing does the business analysis and develops service requirements. Product marketing may work with the engineering and operations teams to assess deployment costs including staffing, system and facilities requirements. Upon approval, these business plans become the basis for the service requirements.
Architects take the service requirements and develop a service’s architecture and design. The service architecture includes technology, people and process required to operate the service to defined service levels. Once plans are finalized, the architect works closely with the engineering and operations teams to introduce the change in a predictable way to minimize service disruption to customers.