Hitachi Data Systems

Document Revision Level

Revision    Date             Description
1.0         February 2007    Initial Release

Source Documents for this Revision

Hitachi Data Systems Architect - Business Continuity Certification Prep Course
Hitachi Universal Replicator Three Data Center Multi-Target Planning and Proof of Concept, by Roy Strouse

Audience

This document is intended for storage and system administrators responsible for designing and implementing business continuity solutions based on Hitachi Data Systems storage systems.

Contributors

The information included in this document represents the expertise of a number of skilled practitioners within Hitachi Data Systems in the successful delivery of remote replication solutions to our customers. In addition, a broad range of additional skills and talents working across many disciplines within Hitachi Data Systems provided many hours of effort on quality assurance of this material.

Thanks go to:

Douglas E. Babcock: Managing Principal, Business Continuity Mainframe, Global Solutions Services
Christina Gallagher: Account Consultant, Sales
Tom Goulding: Solution Architect, Global Solution Services
John Hickman: Field Product Manager, Enterprise Solutions
Vladimir Igolnikov: Principal Technical Consultant, Global Solution Services
Carl Isenburg: Principal Technical Consultant, Global Solution Services
Gian Jagai: Technical Consultant, Global Solution Services
Bill Martin: Product Manager, Disaster Recovery Program Products, Global Solutions Strategy and Development
Chris Myers: Technical Consultant, Global Solution Services
Keith O'Toole: Technical Consultant, Global Solution Services
Roselinda R Schulman, CBCP: Director, Data Protection, Global Solution Business Line
Leland Sindt: Technical Consultant, Global Solution Services
Roy Strouse: Advanced Technical Consultant, Solutions Sales & Services
Alex Tse: Solutions Service Practice Manager, Global Solutions Services

Delivery of this guide was managed by Technical Marketing, Global
Solutions Strategy and Development.

Contact Information

All documents in this series can be found on www.hds.com.

Please send comments on this document. Make sure to include the document title and revision. Please refer to specific section(s) and paragraph(s) whenever possible.

E-mail: TM@hds.com
Fax: 408-562-5477
Mail: Technical Marketing, M/S 34-88, Hitachi Data Systems, 750 Central Expressway, Santa Clara, CA 95050-2627

Thank you! (All comments become the property of Hitachi Data Systems Corporation.)

Contents

Introduction
Problem Description
Technical Issues and Challenges
Synchronous and Asynchronous Replication
The Rolling Disaster Challenge
Write Order Fidelity
Data Consistency
Evaluation Criteria
Data Consistency
Cost versus Benefit
RPO and RTO
Making the Choice
Solution Overview
Synchronous Remote Replication
Asynchronous Remote Replication
Three Data Center Configurations
Large Data Center Configurations
Local Data Copies
Replication Management
Business Impact and Benefits
Disaster Recovery and Business Continuity
Data Migration
Data Center Relocation
Hitachi TrueCopy™ Heterogenous Remote Replication Software
Hitachi Universal Replicator Software
Hitachi ShadowImage™ In-System Replication Software
Hitachi Business Continuity Manager Software
Hitachi HiCommand® Replication Monitor Software
Reference Architecture
Architectural Alternatives
Implementation Planning
Data Collection Tools and Processes
Risk Analysis and Remote Copy Planning and Design Services
Remote Copy Expert Assistant
Resource Requirements
Bandwidth and Replication Paths
Redundancy Requirements
Processing Capacity on the Storage Systems
Channel Extension
Buffer Capacity
Storage for Replica Copies
Software Requirements
Command Device
Sizing & Design
Remote Copy Planning and Design Service
Data Gathering - Business Requirements
Data Gathering - Environmental Requirements
Data Gathering - Technical Requirements
Installation & Configuration
Copy Paths
Journal Groups
Beginning with Hitachi Business Continuity Manager for Mainframe Replication
Configuring Business Continuity Manager
Device Address Domain
Beginning with Hitachi Command Control Interface for Open Systems Replication
HORCM
CCI Poll Values
Naming Conventions for DEV_GROUPs and DEV_NAMEs
Automation
Scripting Interface
Configuration Testing
Daily Operations
Application Interaction
Disaster Recovery Testing
Scheduling Disaster Recovery Testing
Scaling for Growth and Change
Optimization & Tuning
Initial Copy
Hitachi Universal Replicator Reverse Resync
Appendix A References
Appendix B Glossary

Best Practices Library

Guidelines for Hitachi TrueCopy™ Remote Replication Software and Hitachi Universal Replicator Software

By Hitachi Data Systems Technical Marketing

Introduction

The world has changed significantly in the past few years. Devastating terrorist acts and threats, the seemingly increased frequency of widespread power-grid disruptions, and the emergence of regulatory requirements for infrastructure protection are all placing stringent, yet necessary, data protection requirements on many organizations. Regardless of the industry, as more and more businesses operate in a 24/7 environment (especially large enterprises where global operations are the norm), they need an increasingly competitive edge to maintain profitability and stay in business. In this complex and challenging global environment, well-planned business continuity or proven disaster recovery practices for nonstop data availability have become critical to organizations if they are to survive any type of outage.

Problem Description

Most information technology related disruptions are actually locally contained events, such as data corruption, viruses and human error, as opposed to physical disasters such as fires, earthquakes and hurricanes. These events occur frequently and pose a more common threat to businesses than physical disasters. However, because they are less visible to the general public, these disruptions may be taken less seriously. The real challenge facing IT management lies in getting the organization to think proactively and to deploy best practices and technologies that can be leveraged to maximize business operations instead of adopting a reactive "fix-it" posture.

The true test of a sound IT infrastructure is its ability to prevent outages from occurring in the first place and to minimize the effects of those incidents when they do occur.
Companies today must follow the continuous business paradigm, which combines high-availability solutions with advanced disaster recovery techniques. The ultimate goal is to manage both planned and unplanned situations with minimal or zero disruption.

When an unplanned event does occur, the ideal scenario is:

- Recovery happens almost automatically with no loss of data
- Costs of the solution and resources are minimal
- Impact to the production environment is zero

While technology is moving forward at a rapid pace to reach this ideal scenario, many other business and technology concerns exist, including some significant trade-offs dictated by technology, budgets and personnel resources.

Technical Issues and Challenges

Options for replicating data for business continuity generally fall into one of two categories: synchronous (real-time) or asynchronous (near real-time). Business managers must weigh the distance over which replication takes place against the possible impact on application performance. They must also evaluate the critical replication issues posed by single-event or rolling disasters, including latency, sequence (or write order) fidelity, and data consistency. Using the results of this analysis, business managers can select and implement the right business continuity and recovery solution for their organization's needs.

Synchronous and Asynchronous Replication

Synchronous replication ensures that a remote copy of the data is identical to the primary copy at the time the primary copy is created or updated. In synchronous replication, an I/O update operation is not considered done until completion is confirmed at both the primary and mirrored sites. If the operation fails to complete at the remote site, the actions taken are determined by replication and other software settings, such as clustering and path failover.

One benefit of synchronous replication is that data can be recovered quickly.
After a disruption at the primary site, business operations at the remote site can begin immediately with a consistent copy of the data. Only I/Os in flight at the instant of disruption may be lost. Because neither the primary nor the remote site will have a record of those transactions, business processing rolls back to the last commonly confirmed state.

The drawback to synchronous replication is its distance limitation. Fibre Channel, the primary enterprise storage transport protocol, can theoretically extend over several hundred kilometers. However, latency quickly becomes an application problem as propagation delays lengthen with increased distance. Propagation delays can significantly slow down an application by forcing it to wait for confirmation of each storage write operation. This means the practical distance for synchronous replication on a busy system depends on the application's response time tolerance and other factors, and typically ranges from 20 to 100 miles (about 32 to 160 kilometers), which is not far enough to be clear of a wide-area or regional disaster.

To mitigate the distance limitations of synchronous methods, asynchronous technologies have been developed. They implement a buffering mechanism that accumulates write operations for subsequent transmission after I/O completion has been acknowledged to the host. By eliminating the wait for a response from the remote site for each I/O, this approach removes the propagation delay that hinders synchronous copy techniques from the application's write path.

The main benefit of asynchronous replication is the ability to place the secondary storage system at a long distance from the primary storage system without impacting the application at the primary site. Implementations of this replication strategy can extend to thousands of kilometers.

The downside to asynchronous replication is the potential for data loss between the primary and remote sites.
Because of the slight time lag between data being stored at the primary and remote sites, updates lost in flight during an outage can mean the remote center cannot pick up operations instantly at the point the primary site failed. In such a situation, the caching, sequence numbering, time stamps, and other techniques that asynchronous replication uses to automatically preserve write order fidelity and data integrity at the remote site are essential.

The Rolling Disaster Challenge

A rolling disaster occurs when an unplanned outage event takes place over a span of time, anywhere from a few minutes to several hours. During a rolling disaster, not all systems, storage and network connections fail at precisely the same moment. In this situation, a system may still be able to process transactions and issue updates to primary storage devices, but due to earlier failures, updates may not replicate successfully to the secondary site.

Rolling disasters pose a challenge because they may result in corrupted and unusable data at the remote site, requiring difficult and very lengthy recovery processes. To protect against rolling disasters, a data replication technology must be able to freeze remote replicas at a point in time prior to or during the onset of the outage. This ability to create point-in-time images of data is what differentiates remote copy technology from simple mirroring.

Because the remote and local I/O of a synchronous replication succeed or fail together, this replication approach does not introduce data inconsistencies following a disaster. Rolling disasters are primarily a challenge for remote asynchronous replication, and one of the principal areas of concern is write order fidelity.
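
The distance limit described under Synchronous and Asynchronous Replication can be made concrete with some rough arithmetic. The sketch below is illustrative only: it assumes light travels through optical fibre at roughly 200,000 km/s (about 5 microseconds per kilometer, one way), and the function names, the `round_trips` parameter and the local service time are hypothetical values chosen for the example, not figures from any Hitachi product. Real links add switching, protocol and channel-extension overheads on top of this.

```python
# Rough estimate of the propagation delay that synchronous replication adds
# to every write. All names and default values here are assumptions.

# Assumption: about 5 microseconds per kilometer of fibre, one way.
FIBRE_US_PER_KM = 5.0


def sync_write_penalty_ms(distance_km, round_trips=1):
    """Propagation delay in milliseconds added to one synchronous write.

    Each round trip crosses the link twice: the write goes out and the
    acknowledgement comes back. How many round trips a write needs depends
    on the protocol, so it is left as a parameter.
    """
    one_way_us = distance_km * FIBRE_US_PER_KM
    return one_way_us * 2 * round_trips / 1000.0


def max_serial_writes_per_second(distance_km, local_service_ms, round_trips=1):
    """Upper bound on strictly serial writes per second for one I/O stream."""
    total_ms = local_service_ms + sync_write_penalty_ms(distance_km, round_trips)
    return 1000.0 / total_ms


if __name__ == "__main__":
    for km in (10, 50, 160, 1000):
        penalty = sync_write_penalty_ms(km)
        rate = max_serial_writes_per_second(km, local_service_ms=0.5)
        print(f"{km:>5} km: +{penalty:.2f} ms per write, "
              f"at most {rate:.0f} serial writes/s")
```

Under these assumptions a 160 km link adds about 1.6 ms to each write and a 1,000 km link about 10 ms, which is why an application that issues dependent writes one after another slows down sharply as the synchronous distance grows, while an asynchronous copy does not pay this cost on the write path.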

Write Order Fidelity

Database and file managers maintain very complex internal data structures, including indexes, structured data tables, directories, logs and so forth. Database applications should be ACID compliant when writing data to disk. In particular, they should provide atomicity, whereby database transactions follow the atomic rule: if one part fails, the whole transaction fails. Careful write sequencing and strict adherence to write dependencies then allow file systems and databases to preserve the integrity of these internal structures no matter what I/O activity is in progress when a failure occurs. Each write is carefully sequenced so that, at any point in time, a correct file system or database state can be recreated.

Resynchronization refers to the process of updating the remote data copy following a planned or unplanned suspension. Traditional remote replication technologies track changes in storage system cache during normal paired operation, and build and maintain a record of changed data if the cache overflows during an unexpectedly high change rate or if a replication link fails. During resynchronization, data in the remote storage system may not be updated in the same sequence as it was written in the primary storage system; it will therefore not be consistent until resynchronization completes. If the primary storage system were to fail, or if access to it were lost during this process, there would not be any consistent data available for the application to continue running. Taking a local copy of the data as part of the recovery process, before starting the resynchronization, could alleviate this data consistency exposure.
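
The resynchronization exposure described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of how any particular storage system works: it assumes the "record of changed data" is a simple set of changed track numbers with no ordering information, and that resynchronization copies those tracks in track-number order. The function names and the two-track volume are hypothetical.

```python
# Toy model: a changed-track record remembers WHICH tracks changed, not the
# ORDER in which they were written. All names here are hypothetical.

def states_in_write_order(initial, writes):
    """Every state the primary volume passed through, oldest first."""
    state = dict(initial)
    history = [dict(state)]
    for track, data in writes:
        state[track] = data
        history.append(dict(state))
    return history


def bitmap_resync_steps(remote, primary, changed_tracks):
    """Yield the remote state after each track is copied, in track order."""
    remote = dict(remote)
    for track in sorted(changed_tracks):
        remote[track] = primary[track]
        yield dict(remote)


initial = {2: "old-data", 9: "old-log"}
# Dependent writes: the log record (track 9) is written BEFORE the data (track 2).
writes = [(9, "log: begin T1"), (2, "data for T1")]

history = states_in_write_order(initial, writes)
primary = history[-1]
changed = {track for track, _ in writes}  # the order of the writes is lost here

steps = list(bitmap_resync_steps(initial, primary, changed))

# Midway through the resync the remote holds the new data but the old log,
# a state the primary never passed through.
midway_consistent = steps[0] in history
# Only when the resync completes does the remote match a real primary state.
final_consistent = steps[-1] in history
```

In this model `midway_consistent` is `False` and `final_consistent` is `True`: the remote copy is unusable for restart until the last track has been copied. The state held before the resync began (`initial`) is a state the primary really passed through, which is why preserving a copy of it before resynchronizing keeps a consistent fallback available.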
A replication solution must address write order fidelity, ensuring that remote writes are made in the same order as those at the primary site. To ensure the integrity of asynchronously replicated data in a rolling disaster, replication technology must employ techniques to automatically preserve write order fidelity at the remote site.

Data Consistency

In the context of data replication, data consistency represents the ability to recover from a failure or disruptive event. A fundamental concept of data consistency that enables quick recovery is the "dependent write": the pervasive logic among the complex data structures that make up databases, file systems and similar software, which determines the sequence in which writes are issued. A dependent write is a data update that cannot be executed until a previous write, on which it depends, has been executed. It is this logic that preserves the integrity and consistency of the data and allows systems and applications to restart after a sudden failure.

There are three types of data consistency. They have different implications at different levels of the application and data architecture, where the meaning of the data can have different logical dependencies. These are I/O, transaction, and application consistency.

I/O consistency, or crash recovery consistency, refers to data that is not necessarily transaction consistent, but is still in a restartable state. If the writing of data is interrupted in the middle of a transaction and fails to complete, the resulting data is left in an I/O consistent state provided the sequence of dependent writes has been maintained; the data is recoverable. When the application is restarted, the data will be rolled back or rolled forward to a transaction consistent state.
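
One common way to preserve write order fidelity, and therefore I/O consistency, at the remote site is to tag each write with a sequence number and apply writes strictly in that order, holding back anything that arrives early. The sketch below illustrates that general idea only. The class name, its fields and the in-memory "volume" are hypothetical, and it does not represent the internal design of any Hitachi product.

```python
import heapq


class OrderedApplier:
    """Applies replicated writes at the remote site strictly in sequence order.

    Hypothetical illustration: sequence numbers start at 1 and are assigned
    by the primary site in the order the host issued the writes.
    """

    def __init__(self):
        self.next_seq = 1   # next sequence number that may be applied
        self.pending = []   # min-heap of (seq, block, data) that arrived early
        self.volume = {}    # block -> data: the remote copy
        self.applied = []   # sequence numbers in the order they were applied

    def receive(self, seq, block, data):
        if seq < self.next_seq:
            return  # duplicate of a write that was already applied
        heapq.heappush(self.pending, (seq, block, data))
        # Apply every write that is now contiguous with what was applied.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, blk, payload = heapq.heappop(self.pending)
            self.volume[blk] = payload
            self.applied.append(self.next_seq)
            self.next_seq += 1


if __name__ == "__main__":
    remote = OrderedApplier()
    remote.receive(2, "data", "row for T1")   # arrives early, held back
    remote.receive(3, "log", "commit T1")     # arrives early, held back
    print(remote.volume)                      # still empty
    remote.receive(1, "log", "begin T1")      # gap filled, all three applied
    print(remote.applied, remote.volume)
```

Because nothing is applied past a gap, the remote volume always reflects a prefix of the primary's write history. If the link fails while writes are still missing, the remote copy stops at an earlier but I/O consistent point rather than at a mixture of old and new dependent writes.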

A transaction is a logical unit of work that may include hundreds or thousands of updates. Transaction consistency is achieved when an application is shut down (quiesced), or when the application, database or other system component rolls back or rolls forward after a restart. Restarts could result from a sudden power failure, system crash or other disaster.

An application may be made up of many different types of data, such as multiple database components as well as flat files. Application consistency is the state in which individual components have each been recovered to a transaction consistent state. Collectively, the components need to be synchronized based on the application requirements.

The difference between the state of the data when the primary site fails and the state from which operations resume at the remote site represents the Recovery Point Objective (RPO, explained later in this document), which will always be kept to a minimum in Hitachi Data Systems asynchronous replication implementations.

Evaluation Criteria

Evaluation of any data replication solution should consider the following:

Data Consistency

Does the remote copy technology provide I/O consistency, which is a bedrock requirement to successfully use the replica copies?

Cost versus Benefit

When evaluating the cost of business continuity solutions, the greatest cost component is usually the bandwidth needed to support remote replication. The greatest benefit is maintaining a minimal RPO. This balance is illustrated in Figure 1.

[Figure 1: Recovery Time Versus Cost. The chart plots cost against Recovery Time Objective (minutes, hours, days). It shows the cost of the solution and time to recover against the cost of outage over time, with an acceptable cost/time window marked for online revenue-producing applications and for back office and batch applications.]

Each application must be evaluated separately to identify the costs versus benefits. As more is spent on bandwidth, the replica copy can be more up to date. The ideal solution allows IT management to balance costs against benefits and maximize return on investment.

RPO and RTO

Fundamental to developing an effective disaster recovery solution is identifying a business's risk tolerance as expressed in terms of Recovery Point Objectives and Recovery Time Objectives (RPO and RTO, respectively). RPO represents the worst-case time between the interruption in operations and the last recoverable backup, where potentially lost data weighs against cost. RTO represents the time to resume operations after the interruption.

[Figure 2: Recovery Point and Recovery Time Objectives. A timeline marking the disaster, the RPO and the RTO.]

Evaluation of risk tolerance is stated in terms of how much data must be recovered to resume operations (the RPO) and the outage duration (the RTO). Does the remote copy technology allow for minimal RPO? Batch copy techniques usually cost less and often promise lower bandwidth requirements by extending RPO. Is this a false economy?
Hitachi technology balances RPO against bandwidth costs while also providing mechanisms for bandwidth management.

Making the Choice

Clearly, remote storage replication for recovery and business continuity requires more than just shipping data over a network. The selection process starts with an assessment of the potential risks and their probability.

Two-data-center (2DC) replication strategies are viable for most in-region recovery (for example, serving as a hot site for campus-level or metro-level server clusters) and for out-of-region recovery sites where propagation delays are not an issue.

Synchronous replication provides very fast recovery time (low RTO) and good data currency (low RPO). If your organization cannot tolerate any data loss and operations must be resumed quickly following an outage, synchronous replication is likely to be the best choice. Of course, the decision must also factor in how far the data has to be replicated to clear any likely disaster zone, balanced against how much degradation of application performance can be tolerated.

Asynchronous replication provides better protection against regional disasters, albeit with less favorable RPO. If your organization can tolerate being down while the last few transactions are reconstructed, or cannot tolerate the performance impact of synchronous propagation delays, asynchronous replication may prove to be a less costly option.

When synchronous and asynchronous replication are combined into "three data center" (3DC) replication solutions, maximum protection and flexibility in recovery are achieved. A three data center strategy offers the best of both worlds: fast recovery and excellent data currency for local site failures, combined with advanced protection from regional disasters.
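
The RPO and RTO definitions used in the evaluation criteria above reduce to simple timeline arithmetic, which can be useful when reviewing the results of a disaster recovery test. The following is a minimal sketch: the function names, the timestamps and the objective values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_consistent_copy, disaster, operations_resumed):
    """Return (data-loss window, downtime) for one outage as timedeltas.

    The data-loss window is what an RPO bounds; the downtime is what an
    RTO bounds.
    """
    if not last_consistent_copy <= disaster <= operations_resumed:
        raise ValueError("timestamps must be in chronological order")
    return disaster - last_consistent_copy, operations_resumed - disaster


def meets_objectives(loss, downtime, rpo, rto):
    """True when the measured loss and downtime are within both objectives."""
    return loss <= rpo and downtime <= rto


if __name__ == "__main__":
    # Hypothetical outage: the remote copy was 30 seconds behind when the
    # primary site failed, and operations resumed 45 minutes later.
    loss, downtime = recovery_metrics(
        last_consistent_copy=datetime(2007, 2, 1, 11, 59, 30),
        disaster=datetime(2007, 2, 1, 12, 0, 0),
        operations_resumed=datetime(2007, 2, 1, 12, 45, 0),
    )
    ok = meets_objectives(loss, downtime,
                          rpo=timedelta(minutes=1), rto=timedelta(hours=1))
    print(f"data-loss window {loss}, downtime {downtime}, within objectives: {ok}")
```

Running the same check per application, each with its own objectives, mirrors the advice that each application must be evaluated separately: a batch workload might accept hours for both values, while an online revenue-producing application might not.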

Solution Overview

Consistent with a long history of delivering technologically advanced storage-based solutions for the enterprise, Hitachi Data Systems offers a diverse portfolio of Application Optimized Storage™ solutions for business continuity. While providing the right type of access and availability of data to applications is essential, just as important is the strategy used to protect the data itself. This strategy should be based on business requirements, including mitigating risk, regulatory compliance, and employment of best practices.

For organizations with demanding heterogeneous data replication needs for business continuity or improved IT operations, Hitachi Universal Replicator and Hitachi TrueCopy™ Heterogenous Remote Replication software with synchronous and asynchronous capabilities provide the enterprise-class performance associated with storage-system-based replication while delivering truly resilient business continuity without the need for redundant servers or replication appliances.

Both Hitachi Universal Replicator software and Hitachi TrueCopy™ Remote Replication software deliver simplified remote data replication across Hitachi TagmaStore® Universal Storage Platform and Hitachi TagmaStore Network Storage Controller internal and externally-attached storage. Both Universal Replicator and TrueCopy support Geographically Dispersed Parallel Sysplex (GDPS), an IBM service offering for system failover, workload balancing, and data mirroring.

Synchronous Remote Replication

For distances within the same metropolitan area, Hitachi TrueCopy™ Heterogenous Remote Replication software with synchronous capabilities provides a no-data-loss, rapid restart solution.
TrueCopy Remote Replication Synchronous software yields the highest degree of data integrity because its real-time copies are the same as the originals.

Asynchronous Remote Replication

Hitachi Universal Replicator software and Hitachi TrueCopy™ Asynchronous software can be deployed for wide-area disaster protection across virtually any distance. TrueCopy Asynchronous software delivers premier data integrity with minimal performance impact on the primary system. Able to operate at any distance, TrueCopy software supports fast restarts and recovery by preserving the proper database update sequence for each transaction: it places sequence numbers and timestamps in each data record to ensure proper sequencing and data integrity during transmission and recovery.

Universal Replicator software for the Universal Storage Platform and the Network Storage Controller provides advanced replication among all of the storage systems certified for external attachment to these two storage platforms, permitting data to be copied from any supported device to any other supported device, regardless of operating system or protocol differences. Using industry-leading controller-based virtualization, the Hitachi storage platforms enable this single replication tool to operate against all heterogeneous storage resources in a tiered infrastructure. This significantly reduces the complexity and cost of replicating data, both locally and over long distances.

In a unique asynchronous implementation, Universal Replicator software at the primary site writes designated records to cache and a specific set of disk journal volumes. The remote Universal Storage Platform then reads the records from the journal cache or volumes, offloading the primary system by pulling them across the communication link, instead of making the primary system push them as in most other approaches.
By writing records to journals instead of keeping them solely in storage system cache, Universal Replicator software does not consume available cache. This frees resources for production transactions, eliminates the most common cause of asynchronous replication failure, and permits the replication bandwidth to be sized towards average utilization instead of peak demand.

Three Data Center Configurations

For enterprise environments, Universal Replicator ensures availability of up-to-date copies of data in dispersed locations by leveraging synchronous capabilities of Hitachi TrueCopy Remote Replication software. Three data center configurations, illustrated under "Reference Architecture" below, include the following:

• Three Data Center Cascade replicates data from the primary site to an intermediate site via TrueCopy Synchronous software and then to a third remote location with Universal Replicator software.
• Three Data Center Multi-Target simultaneously copies data from a central location to a hot stand-by site via TrueCopy Synchronous software and to a third site via Universal Replicator.
• Three Data Center Multi-Target with Delta Resync supports recovery of the remote site from the synchronous copies of journal data at the hot stand-by site if the primary site has failed.

Large Data Center Configurations

Universal Replicator 4x4 in the mainframe environment supports a single consistency group spanning up to four storage systems at either or both primary and remote sites in any "NxN" combination up to 4x4: 3x3, 2x1, etc. In this configuration, the Universal Replicator Extended Consistency Group feature supports up to 16,000 volumes in a single consistency group for each storage system in the complex.

Universal Replicator 1x1, 2x2, 3x3 and 4x4 configurations would therefore support up to 16,000, 32,000, 48,000 and 64,000 volumes, respectively, in one consistency group.
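The consistency group limits quoted above scale linearly with the number of storage systems per site. A minimal Python sketch of that arithmetic follows, assuming the 16,000-volume per-system limit and the four-system ceiling stated in this section; the constant and function names are illustrative only and do not correspond to any Hitachi tool or interface.

```python
# Illustrative only: names below are hypothetical, not part of any Hitachi product.
VOLUMES_PER_SYSTEM = 16_000   # per-system limit quoted in this section
MAX_SYSTEMS_PER_SITE = 4      # Universal Replicator 4x4 ceiling quoted in this section


def max_consistency_group_volumes(systems_per_site: int) -> int:
    """Upper bound on volumes in one consistency group for an NxN configuration."""
    if not 1 <= systems_per_site <= MAX_SYSTEMS_PER_SITE:
        raise ValueError("systems_per_site must be between 1 and 4")
    return systems_per_site * VOLUMES_PER_SYSTEM


if __name__ == "__main__":
    for n in range(1, MAX_SYSTEMS_PER_SITE + 1):
        print(f"{n}x{n}: up to {max_consistency_group_volumes(n):,} volumes")
```

The sketch covers only the symmetric NxN cases listed in the text; the source does not state how the limit applies to unmatched combinations such as 2x1.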
Local Data Copies

Hitachi ShadowImage™ In-System Replication software plays a key role in many of the recommended remote replication architectures by creating disk-based data copies within a single Hitachi storage system. ShadowImage software provides a safeguard for mission-critical application consistency, near-instant recovery from data corruption, and "point-in-time" (PIT) data copies for immediate and nondisruptive access and sharing of information for decision support, test and development, or to optimize tape backup operations.

Replication Management

Hitachi Business Continuity Manager software for IBM® z/OS® offers centralized, enterprise-wide replication management for IBM z/OS mainframe environments. Through a single, consistent interface based on familiar TSO/ISPF full-screen panels, Business Continuity Manager software automates Hitachi Universal Replicator, Hitachi ShadowImage™ In-System Replication, and Hitachi TrueCopy™ Remote Replication software operations, accessing key replication metrics with built-in performance monitoring. Business Continuity Manager presents views of the status of all enterprise-wide replication objects in real time and provides automatic notification of key events, such as pair state transitions, timeout thresholds, and other system events.

Hitachi HiCommand® Replication Monitor software simplifies administration of the entire suite of Hitachi replication products for open systems and mainframe environments with a single, easy-to-use display for monitoring and visualizing volume replication configurations and status information. It streamlines storage administration and replication management functions by interfacing with Hitachi Device Manager software for Hitachi storage systems and replication software.
This allows storage administrators to get a visual reference for data under replication management, as well as a point-in-time status indicator of replicated pairs including recovery point.

Business Impact and Benefits

Deployment of Hitachi Data Systems' remote replication technologies delivers business benefits through a variety of common applications, among them:

Disaster Recovery and Business Continuity

Remote copy technology can greatly reduce the potential losses incurred during a large-scale disaster, because a nearly simultaneous duplicate of production disk resources can be located at sites far enough removed from production facilities to be unaffected by disasters such as floods, hurricanes, terrorism, earthquakes, and similar events. By failing over to the remote facility, processing can be quickly resumed at the recovery facility with minimal loss of information.

Data Migration

Moving production data from one storage system to another can be enormously disruptive and require significant service outages as well as planning and logistic effort. Remote copy can be used to substantially reduce the complexity of moving data to target devices, and to shorten the outages needed when production applications cut over to the new environment.

Data Center Relocation

Long-distance data center relocation can require long outages if storage devices are off-line while they are de-installed, transported to their new home, and then finally brought on-line. With remote copy, outage durations are significantly shorter, since an up-to-date copy can be created at the new facility while production applications are still on-line.

Further, the suite of Hitachi Data Systems remote replication technologies delivers asynchronous and synchronous remote replication of data from one storage system to another for everyday uptime improvement and rapid recovery in the event of an outage.
Specifically:

Hitachi TrueCopy™ Heterogeneous Remote Replication Software
• Supports fast restarts and recovery by ensuring proper database update sequences for each transaction during transmission between enterprise storage systems
• Improves service levels by reducing planned and unplanned downtime of customer-facing applications

Hitachi Universal Replicator Software
• Reduces cache utilization and maximizes the use of transmission-line bandwidth by leveraging performance-optimized disk-based journals
• Reduces costs, requiring only one product to provide asynchronous copy services for use across all attached storage systems
• Can significantly reduce RPO through advanced point-in-time recovery capabilities afforded by the use of journaling technology
• Maintains protection of data in the event of total network outages (depending on journal size and workload write rate) for swift recovery upon resumption of network connectivity by eliminating the need for a full volume copy to recover the remote site, therefore reducing the exposure to lost data
• Ensures availability of up-to-date copies of data in dispersed locations by leveraging synchronous capabilities of Hitachi TrueCopy Remote Replication software, including replication to multiple data centers as well as to both remote and hot standby data centers
• Eliminates the need for the difficult-to-understand and difficult-to-implement recovery procedures required by other products and solutions
• Provides single pane-of-glass heterogeneous replication through the virtualization capabilities of the Universal Storage Platform, allowing the replication of any volume hosted on any supported externally attached storage system

In addition, in the event of a site-wide failure at the primary site, the journal at the intermediate Three Data Center Cascade site can continue to propagate the remaining I/O to the remote site.
Three Data Center Multi-Target minimizes failure points because each replication leg is independent of the other.

And similar to Universal Replicator Three Data Center Cascade, in the event of a site-wide failure at the production site, the journal at the synchronous site in the Three Data Center Multi-Target with Delta Resync environment can continue to propagate the remaining I/O to the remote site.

Hitachi ShadowImage™ In-System Replication Software
• Shortens restart and recovery times with the consistency-group function, which provides multivolume, point-in-time copies for applications and databases that share or span multiple volumes.
• Dramatically reduces the time needed to recover from data corruption through the ShadowImage QuickRestore feature, which allows an immediate restore to a disk-resident, point-in-time data copy.
• Replicates large data volumes without impacting service levels, timing out, or affecting performance levels.
• Enables normal backup operations on a copy of up-to-date production data while critical applications continue to run unaffected

Hitachi Business Continuity Manager Software
• Dramatically reduces recovery times by automating complex disaster recovery and planned outage functions
• Allows proactive problem avoidance and optimum performance to ensure that service-level objectives are met or exceeded by providing access to critical system performance metrics and thresholds
• Eliminates hours of tedious input and costly human error when configuring and protecting complex, mission-critical applications and data through its auto-discovery capability
• Universal Replicator and ShadowImage In-System Replication software support Business Continuity Manager's mainframe ATTIME Split functionality, allowing real-time ShadowImage software replication of Universal Replicator software's remote volumes without suspending the Universal Replicator pairs.
Hitachi HiCommand® Replication Monitor Software
• Monitors data currency and recovery points for both open and mainframe environments
• Offers advanced replication system status reporting and built-in capabilities for monitoring and managing replicated volumes for active problem avoidance
• Provides an enhanced topological-like view of user-selected copy groups for simplified management

Reference Architecture

The following tables (Table 1 through Table 8) detail the spectrum of Hitachi Data Systems' remote replication architectures from simplest to most advanced.

Table 1: Real-time Copy
RPO with Site Wide Instant Disaster: Low RPO (0 for TrueCopy Sync, minutes for TCA)
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: No recovery during DR testing unless additional point in time copies are added
Host response time: Sensitive to bandwidth, cache, remote storage system health, and distance if using synchronous copy
Application Interaction Complexity: Application quiesce required to automate point in time split
Bandwidth: High. Size link to peak workload
Storage Capacity Requirements (Primary to Secondary): 1:2
Front End Director processor requirements per storage system: >2 dependent on workload
Manageability Ranking: Simplest
Implementation Complexity Ranking (1=lowest, 5=highest): 1

Table 2: Point-In-Time Three-Copy Model
RPO with Site Wide Instant Disaster: High RPO (hours)
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: RPO = Test duration + resynch duration
Host response time: Sensitive to resynch timing (with initial copy as an upper bound), copy pace
Application Interaction Complexity: Application quiesce required before suspending TrueCopy pair(s)
Bandwidth: Lower. Size link to peak RPO average + safety margin for production activity during resynch
Storage Capacity Requirements (Primary to Secondary): 1:2
Front End Director processor requirements per storage system: >2 dependent on workload
Manageability Ranking: Simple
Implementation Complexity Ranking (1=lowest, 5=highest): 3

Table 3: Point-In-Time Four-Copy Model
RPO with Site Wide Instant Disaster: High RPO (hours)
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: RPO = Test duration + resynch duration
Host response time: Sensitive to ShadowImage placement and ShadowImage resynch duration
Application Interaction Complexity: Application quiesce required to automate point in time split
Bandwidth: Lowest. Size to peak RPO average (compared to Batch DR, TrueCopy pair does not need to catch up to production updates)
Storage Capacity Requirements (Primary to Secondary): 2:2
Front End Director processor requirements per storage system: >2 dependent on workload
Manageability Ranking: More detailed
Implementation Complexity Ranking (1=lowest, 5=highest): 4

Table 4: Universal Replicator Asynchronous Replication
RPO with Site Wide Instant Disaster: Flexible RPO (minutes to hours)
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: No recovery during DR testing unless additional point in time copies are added
Host response time: Sensitive to journal throughput and journal placement
Application Interaction Complexity: Application quiesce required to automate point in time split
Bandwidth: Flexible. Size to peak workload or to peak RPO average. Journals can be used to decrease bandwidth requirements by increasing RPO
Storage Capacity Requirements (Primary to Secondary): ~1.3:2.3. Roughly, when using Universal Replicator, every three parity groups of production volumes will require one parity group dedicated to journals. Actual requirements are dependent on specific workload.
Front End Director processor requirements per storage system: >4 dependent on workload
Manageability Ranking: Moderate. Adding workload or volumes requires re-evaluation of journal configuration

Table 5: Universal Replicator Three Data Center Cascade
RPO with Site Wide Instant Disaster: RPO as low as 0
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: No recovery during DR testing unless additional point in time copies are added
Host response time: Sensitive to bandwidth, cache, remote storage system health, journal throughput, journal placement and distance
Application Interaction Complexity: Application quiesce required to automate point in time split
Bandwidth: Highest. Size to peak for TrueCopy links
Storage Capacity Requirements (Primary to Secondary): 1:~1.3:~2.3. Roughly, when using Universal Replicator, every three parity groups of production volumes will require one parity group dedicated to journals. Actual requirements are dependent on specific workload.
Front End Director processor requirements per storage system: >2 for TrueCopy link dependent on workload; >4 for Universal Replicator link dependent on workload
Manageability Ranking: Most detailed. Adding workload or volumes requires re-evaluation of journal configuration
Implementation Complexity Ranking (1=lowest, 5=highest): 5

Table 6: Universal Replicator Three Data Center Multi-Target
RPO with Site Wide Instant Disaster: RPO as low as 0
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: No recovery during DR testing unless additional point in time copies are added
Host response time: Sensitive to bandwidth, cache, remote storage system health, journal throughput, journal placement and distance
Application Interaction Complexity: Application quiesce required to automate point in time split
Bandwidth: Highest. Size to peak for TrueCopy links
Storage Capacity Requirements (Primary to Secondary): 1:~1.3:~1.3. Roughly, when using Universal Replicator, every three parity groups of production volumes will require one parity group dedicated to journals. Actual requirements are dependent on specific workload.
Front End Director processor requirements per storage system: >2 for TrueCopy link dependent on workload; >4 for Universal Replicator link dependent on workload
Manageability Ranking: Most detailed. Adding workload or volumes requires re-evaluation of journal configuration
Implementation Complexity Ranking (1=lowest, 5=highest): 5

Table 7: Universal Replicator Three Data Center Multi-Target with Delta Resync
RPO with Site Wide Instant Disaster: RPO 0
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: No recovery during DR testing unless additional point in time copies are added
Host response time: Sensitive to bandwidth, cache, remote storage system health, journal throughput, journal placement and distance
Application Interaction Complexity: Application quiesce required to automate point in time split
Bandwidth: Highest. Size to peak for TrueCopy all links
Storage Capacity Requirements (Primary to Secondary): 1:1:1. Roughly, when using Universal Replicator, every three parity groups of production volumes will require one parity group dedicated to journals. Actual requirements are dependent on specific workload.
Front End Director processor requirements per storage system: >2 for TrueCopy link dependent on workload; >4 for Universal Replicator link dependent on workload
Manageability Ranking: Most detailed. Adding workload or volumes requires re-evaluation of journal configuration
Implementation Complexity Ranking (1=lowest, 5=highest): 5

Table 8: Universal Replicator Three Data Center Four-by-Four (4x4)
RPO with Site Wide Instant Disaster: Flexible RPO (minutes to hours)
Recovery from Logical Corruption: Planned local or remote Point in Time recovery images
Recovery from Logical Corruption during Disaster Recovery Testing: No recovery during DR testing
Host response time: Sensitive to bandwidth, cache, remote storage system health, journal throughput, journal placement and distance
Application Interaction Complexity: Application quiesce required to automate point in time split
Bandwidth: Flexible. Size to peak workload or to peak RPO average. Journals can be used to decrease bandwidth requirements by increasing RPO
Storage Capacity Requirements (Primary to Secondary): 1:1. Roughly, when using Universal Replicator, every three parity groups of production volumes will require one parity group dedicated to journals. Actual requirements are dependent on specific workload.
Front End Director processor requirements per storage system: Consistency groups can span directors at either or both locations, unmatched; 1 to 4 for Universal Replicator link(s) dependent on workload
Manageability Ranking: Most detailed. Adding workload or volumes requires re-evaluation of journal configuration
Implementation Complexity Ranking (1=lowest, 5=highest): 3

Architectural Alternatives

Table 9 below presents the costs and benefits of several common options to the reference architectures defined in Tables 1 through 8 above.
Table 9: Design Variations for Replication Architectures

Option: Additional ShadowImage In-System Replication software copies at recovery site
Benefits:
· DR testing without impact to RPO
Costs:
· Additional storage capacity for additional ShadowImage copies
· Incremental increase to ShadowImage software licenses
· Additional scripting to accommodate additional pairs

Option: Additional ShadowImage In-System Replication software copies at production site
Benefits:
· Local recovery from logical corruption
Costs:
· Additional storage capacity for additional ShadowImage copies
· Incremental increase to ShadowImage software licenses
· Additional scripting to accommodate additional pairs

Option: Symmetric/bi-directional remote copy links
Benefits:
· Provides for reverse replication to support post-outage failback
Costs:
· Sufficient ports on storage system and channel extenders must be available to support bi-directional replication

Option: Fully symmetric/bi-directional configuration (remote copy links plus recovery volumes)
Benefits:
· Ability to move applications to the recovery facility while still maintaining all recovery options
· Reduces administrative complexity because operations can be performed from either site with minimal change in procedures
· Simplifies and accelerates the failback process
Costs:
· Additional storage necessary to provide symmetric ShadowImage configurations at both the primary and secondary sites
· Sufficient ports on storage system and channel extender must be available to support bi-directional replication

Option: Hitachi Copy on Write Snapshot software
Benefits:
· Up to 64 Point-In-Time Copies
· Ability to reduce storage requirements to support additional copies (dependent on locality of write activity)
Costs:
· Host I/O performance impact
· Pool overflow conditions will prevent recovery from secondary volumes
· May be inappropriate for use as testing volumes due to pool capacity considerations

Implementation Planning

Data Collection Tools and Processes

Successful implementation planning involves collection of data for a
comprehensive view of the replication environment.

In addition to business statistical data, Hitachi Data Systems representatives may employ or recommend the following tools in the solution planning process:

Risk Analysis and Remote Copy Planning and Design Services

Hitachi Data Systems Global Solution Services provides a number of thorough planning, implementation and integration services for the data replication environment. Key among them for objectively evaluating business requirements and risks for distance replication are the Risk Analysis and Remote Copy Planning and Design services. Starting with the Risk Analysis Service will help identify and quantify the probability of loss occurrence and expected losses for the company.

Hitachi Data Systems Global Solutions Services helps determine which risks are acceptable and which require mitigation. It will develop a risk model to evaluate critical exposures, examine regional and local risks to calculate expected frequency and probable losses, and prioritize risks based on likelihood of occurrence. The following deliverables are included in the scope of this service:

• An IT Infrastructure and Facility Risk/Exposure Analysis
• A Business Unit and Facility Risk/Exposure Analysis
• A Risk Analysis Report covering local and regional risks and exposures, a prioritized matrix of exposures and expected occurrences, and a suggested approach for mitigating and addressing the most pressing risks and exposures, and
• An Engagement Findings Executive Presentation summarizing the Risk Analysis Report

With the most critical risks and exposures identified, the Remote Copy Planning and Design Service applies data replication best practices to produce a detailed study of the existing environment and a documented high-level strategy for implementing the most appropriate and cost-effective distance replication solution.
Deliverables from the Remote Copy Planning and Design Service include:

• An audit of host and storage environment hardware and software to be included in the replication environment
• A report of workload and performance characteristics of the volumes in the replication environment
• Documented objectives for the replication environment
• Documented mechanisms and techniques to support achievement of those objectives
• A strategic recommendation report providing feedback and recommendations for an overall approach, schedule and key success factors, and
• A recommended configuration identifying volumes, copy groups, update frequency and other management criteria

Remote Copy Expert Assistant

Remote Copy Expert Assistant (RCEA) is a tool used by Hitachi Data Systems representatives to automate and streamline tasks needed to deliver remote copy services such as the Remote Copy Planning and Design Service.

RCEA encompasses the following steps:

Data Collection

Workload, performance and configuration metrics are collected at the host and storage levels. Ideally, data should be collected spanning a full four- to six-week business cycle to include standard data processing peaks like month or quarter end; collection timeframes for multiple servers should overlap.

• For the Mainframe
  - SAS can be used for data analysis
  - RMF Magic is a third-party analysis tool
• For Open Systems
  - RCEA from the Tools Competency Center uses common data collection scripts
  - Excel can be used to compile data manually

Data Processing and Analysis

The data is then transported to Hitachi Data Systems, where it is secured and workloads are modeled to identify an optimal replication solution.

RCEA also allows Hitachi Data Systems representatives to do combined storage and host data analysis.

Resource Requirements

Bandwidth and Replication Paths

In many respects, replication traffic is processed much like any other workload on a storage system.
Write I/O on the primary device is transferred across a wire to the secondary device. For Hitachi remote copy products, this traffic uses SAN Fibre Channel connections between SCSI initiator ports at the production facility and SCSI target ports on the secondary storage system.

Those Fibre Channel paths have a specific bandwidth capacity, and multiple connections may be necessary to ensure sufficient capacity. It is an industry best practice to dedicate separate bandwidth offering the highest Quality of Service to data replication. TrueCopy Synchronous software requires high bandwidth and low latency.

TrueCopy Asynchronous and Universal Replicator software require less bandwidth and will tolerate some latency. However, variation in latency over time, or "jitter," should be kept to a minimum. To maintain a continuous replica copy, bandwidth must exceed the average write workload that occurs during any given RPO interval, subject to the capacity limitations of the buffering mechanism.

This means that if an organization wants to maintain an RPO of twenty minutes, the twenty-minute interval with the greatest write activity must be identified. With that, the bandwidth and buffer capacity required to keep up with this traffic can be calculated. In practice, an absolute peak interval cannot be identified; data on hand can be used to revise resource requirements upward to accommodate gaps in data and uncertainty due to assumptions made during the assessment process.

Further, in calculating bandwidth requirements for synchronous replication, network latency and protocol conversion are added twice to the initial storage response time, since data transmission via Fibre Channel protocol requires two round trips: one for command and one for data. Network bandwidth recommendations are offered through the Remote Copy Planning and Design Service, but the final network choice is the customer's responsibility.
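The two sizing rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by RCEA or the Remote Copy Planning and Design Service: the function names, the fixed-interval sample format, and the optional headroom factor are all assumptions introduced here, and the latency inputs are assumed to be per-round-trip figures.

```python
from typing import Sequence


def peak_window_bytes(samples: Sequence[int], window: int) -> int:
    """Largest total of any `window` consecutive samples (sliding-window sum)."""
    if window < 1 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    total = sum(samples[:window])
    peak = total
    for i in range(window, len(samples)):
        total += samples[i] - samples[i - window]  # slide the window by one sample
        peak = max(peak, total)
    return peak


def min_bandwidth_mbps(samples: Sequence[int], sample_seconds: int,
                       rpo_seconds: int, headroom: float = 0.0) -> float:
    """Lower bound on link bandwidth (Mbit/s) for a target RPO.

    `samples` holds bytes written per collection interval (hypothetical format).
    `headroom` is an assumed fractional margin for gaps and uncertainty in the data.
    """
    if rpo_seconds % sample_seconds != 0:
        raise ValueError("rpo_seconds must be a multiple of sample_seconds")
    window = rpo_seconds // sample_seconds
    peak = peak_window_bytes(samples, window)
    return peak * 8 / rpo_seconds / 1e6 * (1 + headroom)


def sync_write_response_ms(storage_ms: float, network_ms: float,
                           conversion_ms: float) -> float:
    """Synchronous write response: network latency and protocol conversion are
    added twice, once per round trip (command, then data)."""
    return storage_ms + 2 * (network_ms + conversion_ms)


if __name__ == "__main__":
    # Made-up per-minute write volumes: quiet, a 20-minute burst, quiet again.
    per_minute = [60_000_000] * 10 + [600_000_000] * 20 + [60_000_000] * 10
    print(min_bandwidth_mbps(per_minute, sample_seconds=60, rpo_seconds=1200))
    print(sync_write_response_ms(storage_ms=1.0, network_ms=2.0, conversion_ms=0.5))
```

The bandwidth result is a floor rather than a recommendation: the text notes that a true peak interval cannot be identified in practice, which is why the sketch exposes a headroom parameter instead of treating the measured peak as exact.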
Table 10 below weighs the pros, cons and applications for many available bandwidth options.

Table 10: Network Connectivity Options for Remote Replication
Columns: Connectivity; Pros; Cons; Used for

Dark Fiber: · Bandwidth · Highest Quality of Service · Cost · Availability · Complexity · SAN extension and Synchronous replication
DWDM: · Bandwidth · High Quality of Service · Cost · SAN extension and Synchronous replication
Optical Carrier Networks: · Quality (packet loss and latency) · Availability · Cost efficiency · Could be shared or dedicated for different network services
Ethernet (IP) Networks: · Lowest cost · Shared with other data services · Highest availability · Require Fibre protocol conversion · Highest protocol overhead · Latency jitter due to routing · Most widely used network services

Redundancy Requirements

In addition, it's important to identify redundancy requirements. As a best practice, replication over distance should occur over redundant independent wide area network (WAN) circuits. In the event of a telco outage, a separate independent circuit must be available that will accommodate the production write workload.

Processing Capacity on the Storage Systems

Remote Copy operations incur overhead on the storage system; additional processor cycles are consumed to service the additional steps necessary. On Hitachi storage systems, this additional workload occurs within the processors for the front-end director ports, which are also used to service host I/O. Sufficient front-end director ports should be allocated to accommodate the production write workload alongside additional requirements for handling remote copy operations.

Channel Extension

As mentioned above, Hitachi replication traffic i