Kinesis Data Stream - Nejati Notes

Kinesis Data Streams is a fully managed, serverless streaming data service that you can use to capture, process, and store data streams at any scale.

- event-driven architectures
- real-time processing
- traffic bursts
- reading from a variety of sources:

![[Pasted image 20250306021124.png]]

> [!info]
> The cost of Kinesis Data Streams depends on factors such as the data volume, number of streams, and required data processing capacity.

### Core Components & Terms:

- **Kinesis Data Stream:** This is the fundamental unit. It's a **set of shards**, and it carries a **stream of data records**. Think of it as a pipeline for your data.
- **Data Record:** This is the **unit of data** stored in a Kinesis data stream. It's composed of a **sequence number**, a **partition key**, and a **data blob** (your actual data, up to 1 MB).
    - **Sequence Number:** A unique identifier for each record within its shard. Sequence numbers generally increase over time.
    - **Partition Key:** Chosen by the data producer, this key is used to **group data into different shards** within a stream. All records with the same partition key go to the same shard, which is crucial for ordering and parallel processing.
    - **Data Blob:** Your actual data, serialized into bytes.
- **Shard:** A **uniquely identified sequence of data records** within a stream. It's the **base throughput unit** of a Kinesis data stream.
    - Each shard provides a fixed capacity: 1 MB/second and 1,000 records/second for writes, and 2 MB/second and 5 transactions/second for reads.
    - The number of shards determines the stream's overall capacity. You can increase or decrease the number of shards (resharding) to scale throughput.
- **Retention Period:** Data records are stored in shards for a configurable amount of time, from **24 hours (default) up to 365 days**. After this period, records are deleted.

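
The shard limits and the partition-key routing described above can be sketched in a few lines of Python. This is a minimal illustration, not AWS code: `shards_needed` and `shard_index_for_key` are hypothetical helper names, and the routing helper assumes the shards split the 128-bit hash key space into equal ranges (true for a freshly created stream, not necessarily after resharding). Kinesis itself hashes the partition key with MD5 and matches the result against each shard's hash key range.

```python
import hashlib
import math

# Per-shard limits as quoted in the notes above.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0

# MD5 digests are 128-bit integers, which is the size of the hash key space.
HASH_KEY_SPACE = 2 ** 128


def shards_needed(write_mb_per_s: float, records_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )


def shard_index_for_key(partition_key: str, shard_count: int) -> int:
    """Hypothetical helper: map a partition key to a shard index.

    Assumes the shards divide the hash key space evenly; real streams
    expose each shard's actual hash key range, which may be uneven.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_KEY_SPACE // shard_count
    # Clamp so the few leftover hash keys at the top fall into the last shard.
    return min(hash_key // range_size, shard_count - 1)


if __name__ == "__main__":
    # 3.5 MB/s in, 1,200 records/s in, 10 MB/s out: reads are the bottleneck.
    print(shards_needed(3.5, 1200, 10))
    # The same key always lands on the same shard, which preserves per-key ordering.
    print(shard_index_for_key("user-42", 4))
```

Because the mapping depends only on the key, a producer that sends most traffic under one partition key will overload a single shard regardless of how many shards the stream has.
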
![[Pasted image 20250305224857.png]]

### Kinesis vs SQS+SNS

Amazon Kinesis is ideal for real-time processing of streaming big data because of features like record ordering and replayability. It is not suited to messaging semantics or per-message delay, which are better handled by Amazon Simple Queue Service (Amazon SQS).

|Feature|Amazon Kinesis|Amazon SQS (+SNS)|
|:--|:--|:--|
|**Primary Use**|Real-time data streaming & analysis|Application decoupling, task distribution, notifications|
|**Data Model**|Continuous stream of ordered records|Discrete messages|
|**Consumption**|Multiple consumers can read the same data|Typically one consumer per message (from SQS)|
|**Ordering**|Guaranteed within a shard (Kinesis Data Streams)|Guaranteed with SQS FIFO; best-effort for Standard SQS|
|**Persistence**|24 hours to 365 days (replayable)|Up to 14 days (SQS); messages are deleted once processed|
|**Throughput**|Very high (MBs/GBs per sec)|High, but generally lower than Kinesis|
|**Fan-out**|Built-in for Kinesis streams|Achieved by SNS fanning out to multiple SQS queues|
|**Processing**|Real-time, continuous processing|Asynchronous, often batch or individual task processing|
|**Complexity**|Can be more complex to set up and manage shards|Generally simpler, especially SQS|

### Kinesis vs Firehose

| **Feature** | **Kinesis Data Streams** | **Kinesis Firehose** |
| ------------------- | ---------------------------------------------------- | ------------------------------------------------- |
| **Primary Purpose** | Real-time custom processing | Automated data delivery to destinations |
| **Data Retention** | 24 hours to 365 days (configurable) | No retention; data is buffered, then delivered to the destination |
| **Processing** | Custom code (Lambda, Spark, etc.) | Built-in batch transformations (optional Lambda) |
| **Scalability** | Manual scaling via shards (provisioned mode) | Fully automatic, no capacity planning |
| **Consumers** | Multiple consumers (e.g., Lambda, Kinesis Analytics) | Single destination (e.g., S3, Redshift) |
| **Latency** | Sub-second (real-time) | 60+ seconds (near-real-time due to batching) |
| **Cost Model** | Per shard hour + PUT payload units | Per GB ingested + transformation fees |
| **Use Case** | Real-time analytics, multi-stage pipelines | Log aggregation, ETL to data lakes/warehouses |

### Development

Developing against Kinesis Data Streams usually involves the Kinesis Producer Library (KPL) for producing data and the Kinesis Client Library (KCL) for consuming it.

> [!info]
> Kinesis can integrate with CloudWatch to monitor the health, performance, and usage of your streams.

> [!tip]
> One way to save costs in Kinesis is to combine small messages into fewer, larger records (each still less than 1 MiB) before writing them to the shards.

### to Lambda

A common way to use Kinesis Data Streams with Lambda is to process the stream in a serverless manner. The stream is connected to the Lambda service, which checks the shards every `n` milliseconds for new records. If any are available, it sends them to the Lambda function.

![[Pasted image 20250306023500.png]]
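
On the function side, the flow above could look roughly like the following Python handler. This is a sketch under assumptions: the function name `handler` and the idea that producers write UTF-8 JSON are illustrative choices, not something these notes specify. The event shape (`Records[].kinesis.data`, base64-encoded) follows the format Lambda uses for Kinesis batches.

```python
import base64
import json


def handler(event, context):
    """Decode every Kinesis record in the batch and report how many were handled."""
    processed = 0
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Lambda delivers the data blob base64-encoded.
        payload = base64.b64decode(kinesis["data"])
        # Assumption: the producer serialized each record as UTF-8 JSON.
        message = json.loads(payload)
        print(kinesis["partitionKey"], kinesis["sequenceNumber"], message)
        processed += 1
    return {"processed": processed}


if __name__ == "__main__":
    # Hand-built sample event for a local dry run (values are made up).
    sample = {
        "Records": [
            {
                "kinesis": {
                    "partitionKey": "device-1",
                    "sequenceNumber": "1",
                    "data": base64.b64encode(json.dumps({"temp": 21}).encode()).decode(),
                }
            }
        ]
    }
    print(handler(sample, None))
```

Records from one shard arrive in sequence-number order within a batch, so per-key ordering survives as long as the handler processes the list front to back.
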