Cassandra - Nejati Notes

**What it is:**

- A **highly scalable, distributed NoSQL database**.
- Designed for **high availability and fault tolerance** with no single point of failure.
- Open-source (now an Apache project), originally developed at Facebook.

**Core Architecture & Concepts:**

- **Decentralized (Peer-to-Peer):** All nodes in a cluster are equal; there is no master/slave hierarchy. Any node can accept a request and coordinate it.
- **Distributed:** Data is partitioned and replicated across multiple nodes (and potentially data centers).
  - **Replication Factor:** Defines how many copies of each piece of data are stored; set per keyspace.
  - **Consistency Level:** Tunable for each read and write (e.g., `ONE`, `QUORUM`, `ALL`); trades availability and latency against data accuracy.
- **Scalability:**
  - **Linear Scalability:** Throughput increases roughly in proportion to the number of nodes.
  - **Horizontal Scaling:** Scale out by adding more commodity servers.
- **Data Distribution:**
  - **Partition Key:** Determines which nodes store the data (data is sharded based on a hash of the partition key).
  - **Snitches:** Determine network topology for efficient request routing.
  - **Gossip Protocol:** Nodes exchange state information with each other.
- **Write Path:** Writes are very fast. Data is appended to a commit log on disk and written to an in-memory `memtable`. When a memtable is full, it is flushed to immutable `SSTables` (Sorted String Tables) on disk.
- **Read Path:** Reads can involve multiple nodes depending on the consistency level. Data is retrieved from memtables and SSTables and merged. Bloom filters and various caches optimize read performance.

**Data Model:**

- **Keyspace:** The outermost container for data, similar to a schema or database in relational systems.
- **Tables (Column Families):** Contain rows of data.
- **Rows:** Identified by a unique **Primary Key**.
  - **Partition Key:** The first part of the primary key; determines data distribution.
  - **Clustering Columns (Optional):** The remaining parts of the primary key; determine the on-disk sort order of data within a partition.
- **Columns:** Rows in the same table do not all have to set the same columns; unset columns are simply not stored (flexible, sparse schema).
- **Query-Driven Design:** Data modeling is optimized for specific query patterns, not for relationships as in an RDBMS. Denormalization is common.

**CQL (Cassandra Query Language):**

- SQL-like syntax for interacting with Cassandra (creating keyspaces/tables, inserting/updating/deleting/selecting data).
- Lacks the joins, referential integrity, and complex transactions found in SQL.

**Key Advantages/Use Cases:**

- **Massive datasets:** Handles petabytes of data.
- **High write throughput:** Excellent for applications with heavy write loads (e.g., IoT, logging, time-series data).
- **Always-on availability:** Critical for applications that cannot tolerate downtime.
- **Geographical distribution:** Supports multi-data center deployments for disaster recovery and reduced latency.
- **Scalability on demand:** Nodes can be added or removed without taking the cluster offline.
- Use cases: Time-series data, user activity tracking, messaging systems, e-commerce catalogs, recommendation engines.

**Important Considerations:**

- **Data Modeling Complexity:** Requires careful planning based on query patterns.
- **Eventual Consistency:** Consistency is tunable, but stronger consistency levels cost latency and availability.
- **No Joins/Transactions (in the traditional RDBMS sense):** Application logic may need to handle these.
- **Operational Overhead:** Managing a distributed system can be complex, though tools are improving.
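
**Sketch: Partition Key to Nodes**

To make the Partition Key and Replication Factor notes concrete, here is a minimal Python sketch of a token ring. It is an illustration under assumptions, not Cassandra's implementation: `Ring`, `token`, and the node names are hypothetical, MD5 stands in for Cassandra's default Murmur3 partitioner, each node gets a single token (no virtual nodes), and replicas are simply the next nodes clockwise, ignoring racks and data centers.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string to a position on a 2**64 ring (MD5 is a stand-in hash)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: one token per node, replicas chosen clockwise."""

    def __init__(self, nodes, replication_factor):
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication_factor must be between 1 and len(nodes)")
        self.replication_factor = replication_factor
        # Sort nodes by their token so the ring can be searched with bisect.
        self._entries = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._entries]

    def replicas(self, partition_key: str):
        """Return the nodes that would hold a copy of this partition."""
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._entries)
        return [
            self._entries[(start + i) % len(self._entries)][1]
            for i in range(self.replication_factor)
        ]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
    for key in ("sensor-1", "sensor-2", "user:42"):
        print(key, "->", ring.replicas(key))
```

Because placement depends only on the hash of the partition key, any node can compute where a row lives without asking a central coordinator, which is what makes the peer-to-peer design workable.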
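
**Sketch: Consistency Level Arithmetic**

The trade-off behind tunable consistency can be shown with a little arithmetic. A quorum is `floor(RF / 2) + 1` replicas, and a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts add up to more than the replication factor. The helper names below are hypothetical, and the sketch covers only `ONE`, `QUORUM`, and `ALL` in a single data center.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_overlaps_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return w + r > replication_factor


def tolerated_failures(level: str, replication_factor: int) -> int:
    """Replicas that can be down while the operation still succeeds."""
    return replication_factor - required_acks(level, replication_factor)


if __name__ == "__main__":
    rf = 3
    for w, r in (("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")):
        print(f"write={w:<6} read={r:<6} overlap={read_overlaps_write(w, r, rf)} "
              f"write tolerates {tolerated_failures(w, rf)} down")
```

With a replication factor of 3, `QUORUM` reads and writes overlap while still tolerating one unavailable replica, whereas `ALL` gives overlap but fails as soon as any replica is down.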
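
**Sketch: Write Path and Read Path**

The write and read paths described under Core Architecture can be modeled with a toy single-node storage engine. This is a simplified sketch, assuming a hypothetical `ToyStorageEngine` class: the commit log is an in-memory list rather than a file, SSTables are plain sorted dicts, and reads take the newest value by lookup order instead of comparing write timestamps. Bloom filters, caches, and compaction are left out.

```python
class ToyStorageEngine:
    """Toy model of one node's storage: commit log + memtable + SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # stand-in for the append-only log on disk
        self.memtable = {}     # in-memory writes not yet flushed
        self.sstables = []     # flushed tables, oldest first, never modified

    def write(self, key, value):
        # Record the write durably first, then apply it in memory.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into a new sorted, immutable SSTable."""
        if not self.memtable:
            return
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs the log for crash recovery.
        self.commit_log.clear()

    def read(self, key):
        # Newest data first: the memtable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


if __name__ == "__main__":
    engine = ToyStorageEngine(memtable_limit=2)
    engine.write("a", 1)
    engine.write("b", 2)   # reaches the limit and triggers a flush
    engine.write("a", 3)   # newer value lives in the memtable
    print(engine.read("a"), engine.read("b"), len(engine.sstables))
```

Writes never touch existing SSTables, which is why they are cheap; the cost moves to reads, which may have to consult several structures to find the newest value.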
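
**Sketch: Partitions and Clustering Order**

The Data Model notes can be pictured as a map from partition key to a list of rows kept sorted by clustering column. The sketch below is an assumption-laden illustration, with a hypothetical `ToyTable` class and a single clustering column; it shows why a range query inside one partition is cheap while a query that ignores the partition key is not supported by this layout.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and kept sorted by one clustering column."""

    def __init__(self):
        # partition key -> list of (clustering value, columns dict), sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering, columns):
        rows = self._partitions[partition_key]
        keys = [c for c, _ in rows]
        i = bisect.bisect_left(keys, clustering)
        if i < len(rows) and rows[i][0] == clustering:
            # Same primary key: merge columns, like an upsert.
            rows[i] = (clustering, {**rows[i][1], **columns})
        else:
            rows.insert(i, (clustering, dict(columns)))

    def select(self, partition_key, start=None, end=None):
        """Return rows of one partition, optionally sliced by clustering range."""
        rows = self._partitions.get(partition_key, [])
        return [
            (c, cols)
            for c, cols in rows
            if (start is None or c >= start) and (end is None or c <= end)
        ]


if __name__ == "__main__":
    readings = ToyTable()
    readings.insert("sensor-1", 3, {"temp": 20})
    readings.insert("sensor-1", 1, {"temp": 18})
    readings.insert("sensor-1", 2, {"temp": 19})
    readings.insert("sensor-1", 2, {"humidity": 40})  # row 2 sets an extra column
    print(readings.select("sensor-1", start=2))
```

Choosing the sensor as the partition key and the reading time as the clustering column is an example of query-driven design: the table is laid out for the query "readings of one sensor in a time range" rather than for general-purpose access.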