There is an awesome project here to process real-time data streams, and it is a great way to learn the basics of the Kinesis family. As always, I summarise the project and add my own comments, hints and tips.
It is based on the fictional WildRydes ride-sharing organisation, where customers can order a ride on one of our fleet of unicorns.
Each unicorn has been fitted with a sensor which sends back location and health information once per second to our operations centre.
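To make the data concrete, here is a minimal sketch in Python of what one simulated sensor reading might look like. The field names are my assumption, based on the columns that show up later in the Athena queries (name, statustime, healthpoints, plus a location); the workshop's producer script may use different names.

```python
import json
from datetime import datetime, timezone

def make_reading(name, latitude, longitude, healthpoints):
    """Build one simulated per-second sensor reading.
    Field names are assumed; check the workshop's producer script."""
    return {
        "name": name,
        "statustime": datetime.now(timezone.utc).isoformat(),
        "latitude": latitude,
        "longitude": longitude,
        "healthpoints": healthpoints,
    }

reading = make_reading("Shadowfax", 47.61, -122.33, 250)
print(json.dumps(reading))
```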
The first step is to create a Kinesis Data Stream. We supply a name and configure the number of shards; in this case, one shard will suffice.
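Why is one shard enough? A single Kinesis shard accepts up to 1,000 records per second or 1 MB per second of writes, whichever limit is hit first. A quick back-of-the-envelope check in Python, assuming each unicorn sends one roughly 200-byte record per second (the record size is my estimate):

```python
# Per-shard write limits, from the Kinesis service quotas.
MAX_RECORDS_PER_SEC = 1000
MAX_BYTES_PER_SEC = 1_000_000

def shards_needed(unicorns, record_bytes=200, records_per_sec=1):
    """Estimate how many shards a fleet needs for ingest."""
    rate = unicorns * records_per_sec
    throughput = rate * record_bytes
    by_records = -(-rate // MAX_RECORDS_PER_SEC)   # ceiling division
    by_bytes = -(-throughput // MAX_BYTES_PER_SEC)
    return max(by_records, by_bytes, 1)

print(shards_needed(1))      # one unicorn easily fits in 1 shard
print(shards_needed(5000))   # a much larger fleet would need more
```

At one record per second, a single unicorn uses a tiny fraction of one shard's capacity, which is why the project never needs to scale the stream.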
As with many projects that involve some code, this one creates a Cloud9 IDE as an environment in which to run a supplied producer script that sends data to the stream. If you have not used Cloud9 before, that's no problem: instructions are supplied, and the IDE is available in minutes.
Once the producer script is running, a consumer script displays the once-per-second output from each unicorn.
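The supplied scripts are a black box, but a producer along these lines could be written with boto3. This is my own sketch, not the workshop's code, and it uses a fake client so it runs without AWS credentials; in a real producer you would pass `boto3.client("kinesis")` instead.

```python
import json

def send_reading(kinesis, stream_name, reading):
    """Put one JSON reading onto the stream.
    Using the unicorn's name as the partition key keeps all of
    one unicorn's readings on the same shard, in order."""
    return kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["name"],
    )

# Stand-in client so the sketch runs without AWS access.
class FakeKinesis:
    def __init__(self):
        self.records = []
    def put_record(self, **kwargs):
        self.records.append(kwargs)
        return {"ShardId": "shardId-000000000000",
                "SequenceNumber": str(len(self.records))}

client = FakeKinesis()
send_reading(client, "wildrydes", {"name": "Shadowfax", "healthpoints": 250})
print(client.records[0]["PartitionKey"])  # Shadowfax
```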
A supplied dashboard, running somewhere in AWS, graphically displays the movement of the unicorns. We supply the dashboard with a Cognito Identity Pool ID to give it unauthenticated access to the stream.
In Kinesis, you can monitor the stream. After a while, the number of incoming records and bytes should become stable.
It can be tricky to interpret the data. Hover the mouse over the text that says "Incoming Records Limit" and click the "x" to remove it, so that the number of incoming records is displayed. Hover the mouse over "IncomingRecords" to see that it is 300 over the last 5-minute interval. This screenshot was taken when there was one unicorn flying, generating data at one-second intervals, so that makes 300 records in 5 minutes.
A similar graph shows the number of bytes: about 58 KB over a 5-minute interval (the flat part of the graph).
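A quick sanity check on those two numbers: one unicorn at one record per second over a 5-minute window gives 300 records, and roughly 58 KB over the same window works out to something under 200 bytes per record.

```python
unicorns = 1
records = unicorns * 1 * 60 * 5   # 1 record/sec for 5 minutes
bytes_in = 58_000                 # approximate, read off the graph
print(records)                    # 300
print(round(bytes_in / records))  # ~193 bytes per record
```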
You could leave the project there, and if you have not seen Kinesis Data Streams before, you already have an idea of what it can do.
If you want to continue, you create an Amazon Kinesis Data Analytics application that reads from the Amazon Kinesis stream and calculates the total distance travelled by each unicorn.
Referring to the architecture diagram above, you specify the input data stream, and a second data stream for the output.
The application automatically discovers the "schema". We copy and paste some supplied SQL code that aggregates and transforms the data to calculate the total distance travelled, and sends the result once per minute to the output stream.
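The workshop supplies the actual SQL, but conceptually the application sums the distance between consecutive readings for each unicorn within each one-minute window. Here is a rough Python equivalent of that aggregation, using the haversine formula; this is my own illustration, and the workshop's SQL may compute distance differently.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * asin(sqrt(a))

def total_distance_m(points):
    """Sum the distances between consecutive readings."""
    return sum(haversine_m(*a, *b) for a, b in zip(points, points[1:]))

# Three consecutive per-second readings for one unicorn.
track = [(47.6100, -122.3300), (47.6101, -122.3300), (47.6102, -122.3300)]
print(round(total_distance_m(track)))  # ~22 metres
```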
A supplied consumer script shows the results from the second stream:
Notice the time. Subsequent records will appear exactly on the minute.
The next part of the project is to create a Lambda function, triggered by the output stream above, which writes the records to DynamoDB as they arrive.
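The workshop supplies the Lambda code, but the shape of such a function is worth seeing. Kinesis delivers records to Lambda base64-encoded, so the handler decodes each one before writing it out. This sketch just returns the parsed items; the real function would call `put_item` on a DynamoDB table instead (the item layout here is illustrative, not the workshop's).

```python
import base64
import json

def handler(event, context=None):
    """Decode each Kinesis record in the batch.
    A real handler would write each item to DynamoDB, e.g.
    table.put_item(Item=item), instead of returning them."""
    items = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        items.append(json.loads(payload))
    return items

# Simulate one incoming batch, following the documented
# Kinesis-to-Lambda event shape.
fake_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(
            json.dumps({"name": "Shadowfax", "distance": 1234}).encode()
        ).decode()}}
    ]
}
print(handler(fake_event))
```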
Here I am using the console-based query editor to query for the per-minute items for a particular unicorn and date.
The last part of the project demonstrates Firehose: we want to save the raw sensor data from the initial stream into S3, for later analysis or ad-hoc queries using Athena.
We create a Firehose Delivery Stream, selecting the original source stream containing the per-second sensor data and a bucket for the output, and specifying a delivery frequency, where we choose 60 seconds. So every 60 seconds, a new file containing the per-second data will be created in S3.
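With a 60-second buffer interval and one unicorn producing a roughly 200-byte record every second (the record size is my estimate from the stream metrics earlier), each S3 object will hold about 60 records and be on the order of 12 KB:

```python
buffer_seconds = 60      # the Firehose buffer interval we chose
records_per_sec = 1      # one unicorn
record_bytes = 200       # rough estimate from the stream metrics

records_per_file = buffer_seconds * records_per_sec
print(records_per_file)                   # 60 records per S3 object
print(records_per_file * record_bytes)    # ~12000 bytes per object
```

Firehose also has a buffer size threshold (in MB) and delivers when either limit is hit; at this tiny data rate, the 60-second interval will always fire first.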
To demonstrate ad-hoc queries with Athena, we create an "external table", which tells Athena about the format of the data in S3.
Now we can do queries from within the Athena console. Here are a few ideas I tried out:
```sql
select distinct name FROM wildrydes
select count(distinct name) FROM wildrydes
select * FROM wildrydes where name='Shadowfax'
select * FROM wildrydes where healthpoints<160
select name,statustime FROM wildrydes where healthpoints<160
select name,statustime,healthpoints FROM wildrydes where healthpoints<160 order by statustime desc
select name,statustime,healthpoints FROM wildrydes where name='Shadowfax' and healthpoints<160 order by statustime desc
```
The project has a clean-up step, as always. You do *not* want to leave the Kinesis stuff running! However, what I found is that the Lambda/S3/DynamoDB side is, as always, nearly zero cost for this type of project, so if you want to come back to the project quickly you can leave some things in place:
- Delete the Firehose Delivery Stream.
- Delete the Data Analytics application.
- Delete the two Data Streams.
Then when coming back to the project, the whole thing can be quickly brought up in a few steps:
- Recreate the two Data Streams.
- Start Cloud9 and the producer script.
- Recreate the Kinesis Analytics application.
- Delete and recreate the Lambda trigger.
- Recreate the Firehose Delivery Stream.
Currently, due to the pandemic, the fleet of unicorns is not actually flying, so for the moment all the above incoming data is only simulated.
Sorry about that.