**What it is:**
- A **highly scalable, distributed NoSQL database**.
- Designed for **high availability and fault tolerance** with no single point of failure.
- Open source; originally developed at Facebook and now maintained as an Apache Software Foundation project.
**Core Architecture & Concepts:**
- **Decentralized (Peer-to-Peer):** All nodes in a cluster are equal; there is no master/slave hierarchy. Any node can accept a client request and act as its coordinator, forwarding it to the replicas that own the data.
- **Distributed:** Data is partitioned and replicated across multiple nodes (and potentially data centers).
- **Replication Factor:** Defines how many copies (replicas) of each piece of data are stored; it is configured per keyspace.
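- **Consistency Level:** Tunable per read and per write (e.g., `ONE`, `QUORUM`, `ALL`). It sets how many replicas must respond before the operation succeeds, trading availability and latency against how up to date the returned data is guaranteed to be.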
- **Scalability:**
  - **Linear Scalability:** Throughput grows roughly in proportion to the number of nodes.
  - **Horizontal Scaling:** Scale out by adding more commodity servers.
- **Data Distribution:**
  - **Partition Key:** A hash of the partition key (its token) determines which nodes store the data, so rows are sharded across the cluster by that hash.
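- **Snitches:** Tell each node which data center and rack the other nodes belong to, so that requests can be routed efficiently and replicas can be spread across failure domains.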
- **Gossip Protocol:** Nodes periodically exchange state information (such as which nodes are up or down) with each other, so no central coordinator is needed.
- **Write Path:** Writes are fast because they avoid random disk I/O. Each write is appended to a commit log on disk (for durability) and applied to an in-memory `memtable`. When a memtable fills up, it is flushed to disk as an immutable `SSTable` (Sorted String Table).
- **Read Path:** The coordinator contacts as many replicas as the consistency level requires. Each replica merges data from its memtable and SSTables. Bloom filters let it skip SSTables that cannot contain the requested partition, and caches reduce disk access further.
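
The interplay of partition key, hashing, and replication factor described above can be sketched as a toy token ring. This is a minimal illustration under stated assumptions, not Cassandra's implementation: `token_for`, `TokenRing`, and the node names are hypothetical, MD5 stands in for the partitioner's hash function, each node owns a single token (real clusters typically assign many virtual nodes per server), and replicas are simply the next nodes clockwise, ignoring racks and data centers.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is only a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy ring in which every node owns exactly one token (assumption)."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Place each node on the ring at the token derived from its name.
        self.ring = sorted((token_for(name), name) for name in nodes)
        self.tokens = [token for token, _ in self.ring]

    def replicas(self, partition_key):
        """Return the nodes that store this partition, walking clockwise."""
        # First node whose token is >= the key's token; wraps around at the end.
        start = bisect.bisect_left(self.tokens, token_for(partition_key))
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


ring = TokenRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("sensor-42")  # three distinct nodes, stable across calls
```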
**Data Model:**
- **Keyspace:** The outermost container for data, similar to a schema or database in relational systems.
- **Tables (called column families in older versions):** Contain rows of data.
- **Rows:** Identified by a unique **Primary Key**.
  - **Partition Key:** The first part of the primary key (one or more columns); determines data distribution.
  - **Clustering Columns (Optional):** The remaining parts of the primary key; determine the on-disk sort order of rows within a partition.
- **Columns:** Columns are declared in the table definition, but a row does not have to set a value for every non-key column, so rows can be sparse.
- **Query-Driven Design:** Tables are designed around the queries the application will run, rather than around normalized relationships as in an RDBMS. Denormalization (storing the same data in several tables) is common.
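
To make the roles of the partition key and clustering columns concrete, here is a small in-memory sketch. It assumes a hypothetical sensor-readings table whose partition key is a sensor ID and whose single clustering column is a timestamp string; `ToyTable` and `range_query` are illustrative names, not part of any Cassandra API. Rows with the same partition key are grouped together and kept sorted by the clustering value, which is what makes a range query inside one partition cheap. Writing to an existing primary key updates that row instead of adding a duplicate.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and kept sorted by clustering value."""

    def __init__(self):
        # partition key -> sorted list of (clustering value, column dict)
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering, columns):
        rows = self.partitions[partition_key]
        keys = [value for value, _ in rows]
        i = bisect.bisect_left(keys, clustering)
        if i < len(rows) and rows[i][0] == clustering:
            # Same primary key: merge the new columns into the existing row.
            rows[i] = (clustering, {**rows[i][1], **columns})
        else:
            rows.insert(i, (clustering, columns))

    def range_query(self, partition_key, start, end):
        """Rows of one partition with start <= clustering value <= end."""
        rows = self.partitions.get(partition_key, [])
        keys = [value for value, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return rows[lo:hi]


readings = ToyTable()
readings.insert("sensor-1", "2024-01-01T00:10", {"temp": 20.5})
readings.insert("sensor-1", "2024-01-01T00:00", {"temp": 20.1})
readings.insert("sensor-2", "2024-01-01T00:05", {"temp": 18.0})
readings.insert("sensor-1", "2024-01-01T00:10", {"humidity": 40})  # updates a row
result = readings.range_query("sensor-1", "2024-01-01T00:00", "2024-01-01T00:10")
```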
**CQL (Cassandra Query Language):**
- SQL-like syntax for interacting with Cassandra (creating keyspaces/tables, inserting/updating/deleting/selecting data).
- Lacks the joins, referential integrity (foreign keys), and complex transactions of relational SQL databases.
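
Because CQL has no joins, combining data from two tables falls to the application (or is avoided by denormalizing at write time). The sketch below shows the application-side variant. The lists stand in for rows returned by two separate `SELECT` statements; the table shapes, column names, and `attach_user_names` are hypothetical and chosen only for illustration.

```python
def attach_user_names(orders, users):
    """Combine two separately fetched result sets in application code."""
    # Index one side by its key so each lookup is a dictionary access.
    names_by_id = {user["user_id"]: user["name"] for user in users}
    return [
        {**order, "user_name": names_by_id.get(order["user_id"])}
        for order in orders
    ]


# Stand-ins for rows returned by two independent queries.
users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Grace"}]
orders = [
    {"order_id": 100, "user_id": 2, "total": 30},
    {"order_id": 101, "user_id": 1, "total": 12},
    {"order_id": 102, "user_id": 3, "total": 7},  # no matching user row
]
joined = attach_user_names(orders, users)
```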
**Key Advantages/Use Cases:**
- **Massive datasets:** Handles petabytes of data.
- **High write throughput:** Excellent for applications with heavy write loads (e.g., IoT, logging, time-series data).
- **Always-on availability:** Critical for applications that cannot tolerate downtime.
- **Geographical distribution:** Supports multi-data center deployments for disaster recovery and reduced latency.
- **Scalability on demand:** Nodes can be added or removed without taking the cluster offline.
- **Use cases:** Time-series data, user activity tracking, messaging systems, e-commerce catalogs, recommendation engines.
**Important Considerations:**
- **Data Modeling Complexity:** Requires careful planning based on query patterns.
- **Eventual Consistency:** Consistency is tunable, but stronger levels (such as `QUORUM` or `ALL`) need more replicas to respond, which raises latency and reduces availability when nodes are down.
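- **No Joins/Transactions (in the traditional RDBMS sense):** The application may need to combine results from several queries itself, or rely on denormalized tables that already hold the data in the required shape.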
- **Operational Overhead:** Managing a distributed system can be complex, though tools are improving.
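
The consistency trade-off can be made concrete with a little arithmetic. A quorum is a majority of the replicas, and a read is guaranteed to overlap with the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The helper names below are hypothetical, and only the three levels mentioned in these notes are modeled, for a single data center.

```python
def quorum(replication_factor: int) -> int:
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """How many replicas must respond at a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def reads_see_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when the read set and write set must share at least one replica."""
    written = replicas_required(write_level, rf)
    read = replicas_required(read_level, rf)
    return written + read > rf


def tolerated_failures(level: str, rf: int) -> int:
    """How many replicas can be down while the level can still be met."""
    return rf - replicas_required(level, rf)


# With a replication factor of 3: QUORUM on both sides overlaps (2 + 2 > 3)
# and still survives one failed replica; ONE/ONE survives two but may read
# stale data; ALL survives none.
strong = reads_see_latest_write("QUORUM", "QUORUM", 3)
weak = reads_see_latest_write("ONE", "ONE", 3)
```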