SQL(RDBMS) vs No SQL(non RDBMS): How do I decide
Choosing Databases while designing solution is most crucial step as if we choose wrong one it might be very painful to revert that decision.
I have tried to call out some factor to consider while deciding the SQL and no-SQL DBs. In this blog I have also talked about some unique characteristics of choices available.
Factors to consider SQL vs NO-SQL
Horizontal vs vertical scaling
SQL DBs are designed to be scaled up which means, when needing extra capacity more hardware needs to be added to same machine which will become constraint after certain points. Scaling out to multiple machines with SQL is possible but it complex and requires extensive effort (partitioning, sharding, clustering, etc.) and cost. This is called vertical scaling.
In NO SQL DBs we can have one database working on multiple nodes. Scaling out (or back in) means we can easily add and remove nodes. This makes NoSQL a perfect match for the cloud. Because it can scale out, you will be maximizing the scalability benefits of the cloud.
Rigid structured Data vs flexible structured data
The ability to store huge amounts of data in a flexible way makes NoSQL faster to develop. It allows for rapid changes to the database schema as the scope evolves and requirements change.
Again, this doesn’t mean SQL is slow. If your data is highly structured and you anticipate minimal change then there’s probably no reason to use NoSQL.
Complex data access
With SQL you can build one script that retrieves and presents your data. NoSQL doesn’t support relations between data types. Running queries in NoSQL is doable, but much slower.
Big transactions requirement
SQL databases are a better fit for heavy duty or complex transactions because it’s more stable and ensure data integrity. Executing stored procedure and triggers could be useful. In No SQL updating multiple entities as part of single transaction might be as easy as in SQL and does not have good support for stored procedures and triggers.
Prediction for data size
If its difficult to predict volume of data at start and constantly new features are being added then NO SQL should be given preference.
Data integrity
If 100% data integrity and consistency is needed then SQL DBs should be the choice. NO SQL DBs mostly guarantee eventual consistency because of network partitions.
Every big system design will have SQL and NO SQL both kind of databases to provide entire solution. Based on what could be the size of entity and use cases for it will decide whether SQL or NO-SQL what will be the best fit.
2 important terms to know before we deep dive into explaining few feature of DBs.
OLTP(Online transaction processing)
An OLTP system captures and maintains transaction data in a database. Each transaction involves individual database records made up of multiple fields or columns. Examples include banking and credit card activity or retail checkout scanning.
In OLTP, the emphasis is on fast processing, because OLTP databases are read, written, and updated frequently. If a transaction fails, built-in system logic ensures data integrity.
OLAP(Online analytical processing)
OLAP applies complex queries to large amounts of historical data, aggregated from OLTP databases and other sources, for data mining, analytics, and business intelligence projects. In OLAP, the emphasis is on response time to these complex queries. Each query involves one or more columns of data aggregated from many rows. Examples include year-over-year financial performance or marketing lead generation trends. OLAP databases and data warehouses give analysts and decision-makers the ability to use custom reporting tools to turn data into information. Query failure in OLAP does not interrupt or delay transaction processing for customers, but it can delay or impact the accuracy of business intelligence insights.
Most popular SQL DBs
MySQL
- Open source RDBMS with two licensing models: free Community Server and proprietary Enterprise Server.
- One of most popular RDBMS after Oracle.
- Supports all properties of RDBMS DBs and could be good fit there you are considering using RDBMs DB.
- Multi-model database and supports both structured data (SQL) and semi-structured data (JSON).
- Used where one database is required for both OLTP and OLAP workload.
- Not good where Multi-Master ACID transaction is a must-have feature.
Oracle
- Oracle is the number one commercially supported database and one of the widely used RDBMS overall.
- Offers ACID transactional guarantee. In terms of CAP, it offers immediate Consistency as a single Server.
- Used where one database is required for both OLTP and OLAP workload.
- It’s not open source hence have a good licensing cost associated with it.
- Multi-model database and supports both structured data (SQL) and semi-structured data (JSON).
- Not fit for unstructured data.
- Not good where Multi-Master ACID transaction is a must-have feature.
Postgres SQL
- PostgreSQL is one of the oldest open-source and free databases
- Offers ACID transactional guarantee. In terms of CAP, it offers immediate Consistency as a single Server.
- Citus Data, a PostgreSQL extension, offers much-coveted Distributed SQL features, i.e., a database with the transactional guarantee of an SQL database and horizontal scaling of NoSQL database.
- Multi-model database and supports both structured data (SQL) and semi-structured data (JSON).
- Not good where one database is required for both OLTP and OLAP workload.
- Not good where Multi-Master ACID transaction is a must-have feature.
Microsoft SQL Server
- Proprietary RDBMS with diverse licenses: Developer and Express (free), Standard, and Enterprise (commercial).
- Offers ACID transactional guarantee. In terms of CAP, it offers immediate Consistency as a single Server.
- Can be considered if development platform is strongly coupled with other Microsoft Products.
- Good to use if Azure Cloud is the preferred public Cloud Technology with a preference for Managed SQL database.
- Good to use if company already has MS-SQL databases and wants to lift-and-shift them in Cloud.
- Not good where Multi-Master ACID transaction is a must-have feature.
- Data is Semi-structured, i.e., JSON data, and needs advanced query operations.
- Not good where one database is required for both OLTP and OLAP workload.
DB2
- Db2 is a proprietory RDBMS with an ACID transactional guarantee. It has different licenses: Community (free), Standard, and Advanced (commercial).
- Used where one database is required for both OLTP and OLAP workload.
- Has Native support for AI. Hence Good fit if AI functionality in the database is required for competitive advantage.
- Not good where Multi-Master ACID transaction is a must-have feature.
Most popular No SQL DBs
No-SQL DBs space is much more complicated then SQL DBs because it has so many classificatons further.
Document DBs: Every record is stored as document and can have its own structure.
MongoDB
- Open-Core NoSQL Document Database (BSON) with various licenses: Community Server (free), Enterprise Server (commercial), Atlas (managed in Cloud).
- Document database that stores the data in JSON-like documents .
- It is most popular NO SQL DB.
- MongoDB also supports various methods of searching such as geographical searching, text searching, graph searching, etc.
- Offers horizontal scaling via Auto-Sharding and has built-in replication via primary-secondary nodes. In terms of CAP, it is CP (Consistent and Partition tolerant).
- Data is Document or semi-structured(e.g. JSON, XML) with advanced query features.
- Not good if multi-Master write is needed for write-heavy loads as it provides consistency in distributed environment.
Apache CouchDB
- Document DB, uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
- The CouchDB file layout follows all the features of ACID properties.
- The main reason for the popularity of CouchDB is a map/reduce system, which allows optimisation in combining of data.
- CouchDB can replicate to devices like smartphones that have a feature to go offline and handle data sync for you when the device is back online.
- CouchDB guarantees eventual consistency to provide both availability and partition tolerance.
- All items have a unique URI(Unique Resource Identifier) that gets exposed via HTTP. It uses the HTTP methods like POST, GET, PUT, and DELETE for the four basic CRUD (Create, Read, Update, Delete) operations on all resources.
- Arbitrary queries are expensive.
- There’s a bit of extra space overhead with CouchDB compared to most alternatives.
- It doesn’t support transactions.
Column Databases: Columnar databases have been called the future of business intelligence (BI). They’re often used in data warehouses, the structured data repositories that businesses use to support corporate decision-making. Most of the time OLAP use cases are dependent on specific columns hence for faster access column DBs could be more efficient. One of the main difference between row and column DBs is the way it stores data on hard disks and how it gets accessed which affects the performance.
Apache Cassandra
- Used as OLAP Database (e.g., Data Warehouse) to handle the gigantic volume of data. Also used as a time-series database.
- Offers linear horizontal scaling and one of the most scalable databases with automatic sharding. In terms of CAP, it is AP (Available and Partition Tolerant).
- Decentralized database (Leaderless) with automatic replication and multi-datacenter replication. As a result, it is fault-tolerant with no single point of failure.
- Has user-friendly and SQL like query language: Cassandra Query Language (CQL).
- Used for As a time series database with massive amount of read and write operations.
- Good for high durable use cases when data-loss is acceptable.
- Not good as a primary database (OLTP) needing an ACID transaction guarantee.
- If Data is frequently deleted, then its performance degrades (“Tombstones”).
Cache Databases: One of the most popular is Redis. To know more about cache DBs read my other blogs:
Graph Databases: A graph database stores nodes and relationships instead of tables, or documents. Data is stored just like you might sketch ideas on a whiteboard. Your data is stored without restricting it to a pre-defined model, allowing a very flexible way of thinking about and using it. One of most popular in this space is Neo4j.
I have covered the most important points to decide which kinds of DBs could be useful in certain cases. Have also covered some of the most popular DBs in that space. Databases are ever evolving space and if you find I have missed something here please add into comments for me to include here.
Good References:
https://towardsdatascience.com/top-10-databases-to-use-in-2021-d7e6a85402ba
https://www.geeksforgeeks.org/top-10-open-source-nosql-databases-in-2020/