Database as a Service, High Availability, Horizontal Scalability and Multi-master Architecture

DISCLAIMER: This post intended to aggregate research performed while evaluating various DB solution fitting specific requirements. While it may be a bit lengthy, I hope it would give the reader good starting point while deciding which path to take.

SkySQL was reviewed in more details, and this article presents in-depth review of “last-mile” considerations. The intent is not to force reader into particular solution, but to share all accumulated information.

Introduction

While deciding about changes to database layer at the core of your system many questions and concerns are being raised.

  • Will my system perform as expected after migration?
  • How much effort is required to migrate existing solution?
  • Will it require serious code refactoring?
  • How a new DB solution will fit existing CI/CD practices?
  • What would be impact on product budget in the near future and a long term?
  • Is it available on current cloud platform? Can it be cloud provider agnostic?
  • Can my DevOps team support it and/or will require extensive additional training?

Team needs to take a few things in considerations to ensure smooth transition and minimize downtime, while gaining as much from the migration.

Project’s OKRs

Let’s establish more narrow goals and assess our milestones:

  • Evaluate a Database-as-a-Service vs database as an appliance in the cloud.
  • Ensure High Availability for the database layer.
  • Add Horizontal Scalability and Multi-Primary support.
  • Minimize impact on existing application while minimizing adjustments to the code (in-place drop-in is ideal, i.e. keep standard connectivity to database through non-API DB access layer, if possible).
  • Ability to integrate into existing CI/CD workload.

For purpose of our specific research a few additional requirements also apply:

  • Existing solution uses MariaDB database engine with Primary and Replicas in place.
  • Downtime should be no more than few hours.

Let’s review each requirement, its benefits, and then options available with Maria DB engine.

Database-as-a-Service

“Database-as-a-Service (DBaaS) is one of the fastest-growing cloud markets. According to Gartner, 75% of all databases will be running in the cloud by 2022. The first relational DBaaS offerings were created by cloud service providers – Amazon RDS, Microsoft Azure Database and Google Cloud SQL. These services started with MariaDB, MySQL and/or PostgreSQL.”

Benefits

  • Save money by combining hardware, hosting, software and support cost into one package.
  • Rapid Provisioning of DB instances takes shorter time due to automation and streamlining of the associated activities.
  • Outsourcing of administration and monitoring reduces on-premise IT team activity requirements and optimizes time to market and results in further cost savings.
  • Enhanced Security created through “by-design” security practices helping with compliance and security breaches. Off-site hosting lowers complexity of on-site security requirements.
  • Improved monitoring and tracking of time, space and resource consumption.
  • Server Space requirements are reduced and optimized.
  • Scalability is achieved through advanced provisioning via provider’s offering.

High availability / Always-On Architecture

“High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.”

To facilitate High Availability and Always-On Architecture number of considerations need to be taken.

MariaDB already provides number of options (reference):

  1. MariaDB Replication – MaxScale’s MariaDB Monitor (mariadbmon) detects primary server failure and promotes the most up-to-date replica based on Global Transaction ID (GTID), waits for that replica to execute any transactions in its relay log, and begins routing queries to it.
  2. MaxScale’s Galera Monitor (galeramon) is used to minimize application impact upon server failure. Additionally, MaxScale may be used to assign primary and replica roles to database instances within a cluster to support read/write traffic splitting and to minimize the risk of certification failures.
  3. Multi-Node Enterprise ColumnStore – MaxScale’s MariaDB Monitor (mariadbmon) determines which node is the primary server, and automatically performs fail-over upon primary server failure.
  4. MariaDB Xpand – Xpand divides tables and indexes into slices, and it distributes those slices among all the nodes. Xpand maintains replicas of each slice, so that it can recover from a node failure without loss of data.

Replication approach is the most common one today and works well with non-distributed processing.  However, it puts certain limitations on how architecture can scale horizontally.

Multi Master Architecture / Horizontal Scalability / Continuous Availability

Horizontal scalability: The ability to increase capacity of the server by connecting multiple hardware or software entities so that they work as a single logical unit. HS can be achieved with the help of clustering, distributed file system, and load balancing.

Continuous availability: The infrastructure (or the applications running on it) cannot be interrupted at all. Essentially, there is no allowance for any outage, either planned or unplanned.

This is where multi-master approach (ex: MaxScale’s Galera Cluster / option #2 above) comes in.

“Galera Cluster turns the popular MySQL and MariaDB server into a multi-master replication enabled solution with an automated member provisioning and joining process and no-data-loss guarantee with a proper configuration”.

With maturity of MariaDB Xpand Distributed SQL architecture, MaxScale can be substituted to achieve even greater MMA flexibility.

Concept of MMA is not new and is used with distributed AD controllers, multi zone data centers, etc.

Here is a an old introduction video which provides some top level details about how Galera architecture extends classical Replication approach – Galera Cluster for MySQL introduction – YouTube

Official information on Galera can be found here – https://mariadb.com/kb/en/what-is-mariadb-galera-cluster/

All MariaDB Galera features listed below are now rolled into Enterprise cluster setup and are supported natively.

Let’s review 3 workflows:

Node Failover

MMR can be used to protect the availability of a mission critical database. For example, a multi-master replication environment can replicate data in your database to establish a fail-over site should the primary site become unavailable due to system or network outages. Such a fail-over site can also serve as a fully functional database to support application access when the primary site is concurrently operational.

Database can be configured for automatic fail-over, which enables main node to failover to a different master site if the first master site fails.

Load Balancing and Multi-Master Replication

MMR is useful for transaction processing applications that require multiple points of access to database information with the following goals:

  • Distributing a heavy application load.
  • Ensuring continuous availability.
  • Providing support for edge computing efficiency.
  • Protect from geographical impacts.

Main goal of MMR is to ensure that data is available when and where you need it. The following describes several different scenarios that have different information delivery requirements.

Applications that have application load distribution requirements commonly include customer service-oriented applications.

What is MaxScale

Maria DB MaxScale is a “DB conductor” which enables database access orchestration and automatic load balancing and have the following characteristics:

  • High availability, scalability, and security.
  • Simplifying application development by providing unified direct client connections, native MariaDB look & feel.
  • Transparent load balancer with support of “Write to primary, read through replicas” distribution.
  • Automatic membership control, and primary recovery on primary loss.

Other Considerations

When comparing various DB options, it is important to note that there could be simply a difference in DB engines which will drive your decision. As an example, while MySQL and MariaDB are close in theory, and MariaDB supports mySQL backward compatibility mode, engines are different and software must be tested to ensure smooth transition between two. Furthermore, after version 10.2 MariaDB no longer considered drop-in replacement for MySQL.

DBaaS Solutions

We have established the foundation, let’s look at what is offered in a cloud. There are more options available, but we keep it short for the purposes of this research.

  • Cloud SQL – mySQL / SQL Server / PostgeSQL – hosted in GCP by Google team. HA* options. No multi-master support. No Maria DB support.
  • SkySQL MariaDB – hosted in GCP/AWS by MariaDB managed services. HA and MMS/MMR.
  • AWS Aurora – mySQL / PostgeSQL – hosted in AWS by AWS team. HA and MMS.
  • Azure Database for MariaDB/MySQL – MariaDB/MySQL – hosted in AWS by AWS team. HA enabled. Scalable.
  • Azure Cosmos – fully managed NoSQL DB – hosted in Azure. HA and MMS.
  • Google Cloud Spanner – non-traditional RDMS as data access requires API abstraction.  HA enabled.

MariaDB/MySQL based DBaaS Solutions Side-by-Side

Cloud SQL Features

Cloud SQL second generation is announced Aug 1, 2016 for general availability and introduces some changes not applicable in previous iteration.

  • Fully managed with 99.95% availability
  • Integrated
  • Based on MySQL 8.0.18 / 5.7 – slightly behind from official but not terribly (8.0.23 is officially available). As of time of this research, CloudSQL no longer includes information about current build used in official release notes.
  • Reliable – SSAE 16, ISO 27001, and PCI DSS compliant and supports HIPAA compliance
  • Offered by Google, but is lesser offering than SkySQL or Cloud Spanner.
  • No Memcache support
  • Read replicas are supported, but not as clustered solution. Application must maintain direct connection to a read replica and perform load balancing internally.
  • CloudSQL High Availability configuration
    Note: The HA standby instance cannot be used for read queries. Read replicas cannot have HA mode enabled.
    HA is achieved by disk replication between Zones to a standby instance. In order for standby instance be brought online, engine on standby would be brought online and would process any pending binlog transactions which may take up to 3 minutes.

HA in both Amazon RDS and GCP CloudSQL implemented via Shared Storage topology (DISK level synchronous replication across different zones). Actually, on failure of Primary server watchdog will switch run instance which is in standby state. During boot MariaDB/MySQL service will analyze binary log and rollback all uncommitted transactions. So, that 2-3 minutes are time required for boot and applying binary log to data. For my opinion 2-3 minutes is very optimistic time. Probably GCP is using Shared Storage topology because it is much cheaper (MaxScale analog – MySQL Router is available under dual license, probably this obstacle don’t let to integrate it as solution for cloud providers).

HA in SkySQL implemented via MaxScale. So it is not correct to compare CloudSQL and SkySQL side-by-side. Also additional MaxScale layer means that if it fails no matter how many replicas are in place, DB will stay unavailable until MaxScale instance is recovered. SkySQL now offers HA option for MaxScale now to address the issue.

SkySQL Features and Benefits

SkySQL is MariaDB’s answer to CloudSQL and builds on top of extensive enterprise knowledge Maria DB team acquired over the years.

SkySQL is managed solution which expands MariaDB, MaxScale and Galera Cluster offering to the cloud and includes all features from individual products. It is available in GCP and AWS allowing for easy hosting migration.

The solution is managed by the SkySQL team – professional services arm of MariaDB Corporation.

SkySQL Features

First announced March 31, 2020. GA announced May 12.

Current versions:

  • MariaDB Enterprise Server – 10.4.14-8
  • MaxScale – 2.5.3
  • ColumnStore – 5.4.1
  • Xpand – used in Distributed SQL configuration. Would resemble CloudSpanner in behaviour without API overhead
  • Galera 4.0+ – offered as an option but can be overlooked by proper tuning of MaxScale cluster or via Distributed SQL configuration.

Other features:

  • Fully managed with 99.95% – 99.995% availability
  • Cluster Nodes split across zones within the same region with DR nodes in different region if necessary.
  • TLS 1.2/1.3 required for all connections

Please refer to attached October 2020 SkySQL Announcement for additional details.

SkySQL usage reviewed below in Transactional usage and offered in 4 configurations:

  • Transactional Standalone – 1 node
  • HA – primary and 2 replicas – 3 nodes
  • Galera – 3 nodes

SkySQL Benefits

  • Always up to date: SkySQL is kept current and updated to stay in line with current products available
  • Cloud Distributed: SkySQL includes a fully distributed cloud data warehouse that now provides massively parallel processing (MPP) for scalability and high availability on large datasets.
  • End-to-end security: All SkySQL databases are secure by default, built from the ground up to provide the ultimate security in the cloud. SkySQL now also enforces secure SSL/TLS connections for any database access, avoiding exposure of data due to insecure defaults or configuration choices. AD integration is available on request.
  • Disaster Recovery: SkySQL combines nightly backups, persistent disks, point-in-time rollback and cross-region replication to minimize the recovery point and recovery time objectives (RPO/RTO) of mission-critical, production databases.
  • Expanded monitoring: SkySQL monitoring shows the status and all vital metrics for database instances and is highly customizable. Health state of the nodes within defined scalability strategies.
  • Radically reduced complexity for application development: SkySQL provides a single connection point for applications rather than exposing individual database instances, primaries or high availability replicas. SkySQL manages read/write-splitting, seamless fail-over and application session migration. The result is fault tolerance and efficient use of resources that is completely transparent to developers and end users.

SkySQL for Distributed SQL

Alternative to SkySQL cluster and Galera is SkySQL for Distributed SQL.

  • Based on Xpand technology.
  • Requires min 3 nodes.
  • Single node failure without loss of data. Auto healing without intervention.
  • Elastic scale-up/scale-down infrastructure topology.

SkySQL SLA Considerations

Official Service Downtime metrics and Service Credits

  • 99.95% service availability in multi node configuration – ex: 22 min max over 30 days
  • if exceeded 99.95% Service Credits will be applied per calculated monthly uptime percentage prorate.
  • Excludes impact of GCP and GKE engine.

SkySQL Support Options

As presented in Pricing section several tiers are available:

  • Normal response time 4-8 hours for Standard tier.
  • SkyDBA / Premium – in minutes / self managed / self driven – additional cost.
  • DBA backup recovery – normal – severity 1 – 8 min avg, 30 min SLA – immediate action upon reporting / detection.
  • GCP Kybernetes engine takes on recovery of the lost node and will rejoin the cluster – does not require ticket.
  • Patches are applied every 3 months with 2 weeks advanced notice with ability to delay.
  • Patches are first applied to replicas. Time can be agreed on.
  • Testing instance can be spin off for advance evaluation.
  • Enterprise binaries are available for download for on prem. testing. Provided for free to SkySQL clients.
  • 7 days daily backup option is included. Backup retention policy can be extended for additional pass-through cost of GCS storage. Nearline storage class is used.
  • AD authentication for access can be implemented during on-boarding.

SkySQL vs CloudSQL

SkySQL vs CloudSQL comparison provided by (by SkySQL team)

Light recap – read above for more detailed comparison.

  • CloudSQL is mySQL based vs MariaDB in SkySQL
  • CloudSQL caps write throughput forcing to upgrade instance type.
  • CloudSQL only allows Read replicas
  • In addition to Primary/replica supported by CloudSQL, SkySQL offers Multi-master and Distributed SQL configuration based on MariaDB Enterprise Server nodes for transactional access.
  • CloudSQL HA option is based on Hot/Spare two node configuration. While “Hot” node is alive, all traffic routed to it. Spare is used for auto healing when main node fails. Per documentation failover can take up 2 minutes to complete transition. May require transaction replay if caught in-flight.
    SkySQL failover takes less than a second. No transaction replay if caught in-flight required.
  • CloudSQL does not offer automatic load balancing.

Expect things to be slower

While analyzing stats from your staging setup before making final decision, do not overlook the fact that you are looking at Apples and Oranges. While DBaaS engine may be a drop-in for your project and not many physical changes are required to keep it running, one thing is important – you are changing IT architecture. Even with GCP dedicated interconnect, you are still looking at virtualized environments and new considerations need to be made for network pipes length, edge networks, regions and zones. Don’t be dissuaded, by making few important, and may be long postponed, changes to your current topology and data patterns, you should be able regain “lost” speed.

Conclusion

This article makes attempt to give reader enough information and references to make a best choice suitable for your application configuration while satisfying your CI/CD requirements and I hope you find it useful. Google Cloud, Azure and AWS offer many different database options and it is important you choose DBaaS solution which makes your project a success. Good luck.

This entry was posted in Technology. Bookmark the permalink.

Leave a Reply

%d bloggers like this: