In this post we explain the proper replacement for the ISMS log-storm anti-pattern described in the previous post. We identify the specific infrastructure aspects that need adjusting and derive a general rule for avoiding log storms across different implementations.
The problem with the anti-pattern presented in the first part of this series was the feedback loop that fed the Lambda execution events back into the local CloudTrail audit trail. Registering these events in the audit trail is in many cases necessary to maintain accountability for the code executed within our environment.
The general idea of routing logs for centralization is sound. We just need to look at the one factor that determines how much logging 'overhead' it will cost us in a specific implementation.
The factor in question (let's call it factor S) is the number of second-order log entries needed to process one original entry. Its value depends on the specific design. If we use one Lambda function to put the log entry into an S3 bucket in another account, S is 1.
But, for the sake of example, let's analyze another implementation in which we:
1. execute a Lambda function that routes the entry to the second account
2. execute a second Lambda function that routes the entry to a specific log group (e.g. based on the application that produced it)
3. execute a third Lambda function that transforms the log entry and feeds it to the SIEM
In this implementation S is 3: each original log entry produces 3 second-order entries from the Lambda function executions. One of them feeds back into the audit trail in the original account, and the remaining two land in the second account.
The log storm begins to rise when these second-order log entries are in turn subjected to the same log processing and routing functions, generating another 9 third-order entries. The process does not stop there; it goes on ad infinitum, accumulating unnecessary entries over weeks and months and driving ITSec infrastructure costs up.
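To make the growth concrete, here is a minimal sketch that models the feedback loop as a geometric series: each order of entries is S times larger than the previous one. The function names are illustrative, and the model assumes every second-order entry is routed through the same pipeline again.

```python
def entries_per_order(s: float, orders: int) -> list:
    """Entries produced at each order; order 0 is the single original entry."""
    return [s ** n for n in range(orders + 1)]


def total_entries(s: float, orders: int) -> float:
    """Total entries stored after following the feedback loop for `orders` rounds."""
    return sum(entries_per_order(s, orders))


if __name__ == "__main__":
    # Compare the three-Lambda design (S = 3), a single Lambda (S = 1)
    # and a batched design (S = 1/200) over the first three rounds.
    for s in (3, 1, 1 / 200):
        print(f"S={s}: per order {entries_per_order(s, 3)}, "
              f"total {total_entries(s, 3):.4f}")
```

With S = 3 the first two rounds already give 1 + 3 + 9 = 13 stored entries per original entry. With S = 1 the loop still never ends, adding one entry per round. Only when S is below 1 does the total stay bounded, by 1 / (1 - S).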
The way out of this is of course reducing the S factor.
But it has to be done in a smart way. Remember that you need to account for all of the log entries that will be fed back into the CloudTrail audit trail. So, if you decide to create a queue that collects X log entries before executing the routing function, you might think that S is now:
S = 1/X,
because the routing Lambda is executed only once per X entries (for example S = 1/200). But remember to account for the function that puts each entry into the queue. If this function is executed for each entry (e.g., as a CloudWatch Logs subscription target), the actual value is:
S = 1 + 1/X,
which is slightly worse than the S = 1 we started with.
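The accounting can be sketched as a sum over every function in the pipeline, where each function contributes the number of invocations it needs per original entry. This is an illustrative helper, not part of any AWS API; it counts one enqueue invocation per entry plus one routing invocation per X entries.

```python
from fractions import Fraction


def s_factor(invocations_per_entry):
    """Sum the logged invocations each pipeline stage needs per original entry."""
    return sum(invocations_per_entry, Fraction(0))


if __name__ == "__main__":
    x = 200  # assumed queue batch size

    # Naive estimate: only the routing function, run once per X entries.
    print("routing only:      ", s_factor([Fraction(1, x)]))

    # Actual value: the enqueue function runs for every entry as well.
    print("enqueue + routing: ", s_factor([Fraction(1), Fraction(1, x)]))

    # The three-Lambda chain from the earlier example.
    print("three-Lambda chain:", s_factor([Fraction(1)] * 3))
```

The takeaway is that batching only helps if no stage in front of the batch is itself invoked, and logged, once per entry.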
So far, the best solution I have found for reducing the second-order log entries is the one based on CloudWatch Logs destinations and a Kinesis Data Stream.
The overall architecture of the Kinesis-based solution is the following:
The EC2 workload (1) is the source of the original log events. The events are fed to the Kinesis Data Stream instance (3) in the destination account with the help of a CloudWatch Logs destination.
Finally, the Kinesis Data Stream collects the entries and, at defined intervals, calls the appropriate Lambda functions for routing and transformation.
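As a rough sketch of how this wiring might look, the function below creates a CloudWatch Logs destination in front of the stream, allows the source account to subscribe, and attaches a log group to it. It assumes two boto3 CloudWatch Logs clients (`boto3.client("logs")`), one per account, passed in by the caller. All names, ARNs, and account IDs are placeholders, and the stream and IAM role are assumed to exist already; this is not the author's actual configuration.

```python
import json


def wire_cross_account_logs(dest_logs, src_logs, *, destination_name,
                            stream_arn, role_arn, source_account_id,
                            log_group_name):
    """Connect a source-account log group to a Kinesis stream via a destination.

    dest_logs / src_logs are CloudWatch Logs clients for the destination and
    source accounts (assumed to be created with boto3.client("logs")).
    """
    # 1. Destination account: create the destination in front of the stream.
    destination = dest_logs.put_destination(
        destinationName=destination_name,
        targetArn=stream_arn,
        roleArn=role_arn,  # role CloudWatch Logs assumes to write to the stream
    )["destination"]

    # 2. Destination account: let the source account subscribe to it.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": source_account_id},
            "Action": "logs:PutSubscriptionFilter",
            "Resource": destination["arn"],
        }],
    }
    dest_logs.put_destination_policy(
        destinationName=destination_name,
        accessPolicy=json.dumps(policy),
    )

    # 3. Source account: subscribe the log group; no Lambda is involved here.
    src_logs.put_subscription_filter(
        logGroupName=log_group_name,
        filterName=f"{destination_name}-filter",
        filterPattern="",  # empty pattern forwards every event
        destinationArn=destination["arn"],
    )
    return destination["arn"]
```

Passing the clients in as arguments keeps the credentials handling for the two accounts outside the function, and makes it possible to exercise the call sequence with stub clients before touching real infrastructure.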
The advantages of this implementation are the following.
First of all, CloudWatch Logs destinations do not create additional second-order entries. They don't need to, because their functionality is simple and known; it is impossible, for example, to hide a backdoor in such an instance. They can also be easily subscribed to at the log-group level.
Furthermore, a Kinesis Data Stream in combination with a Delivery Stream lets you define the buffer size or timeout interval for event processing, so that the additional transformation function is executed once for all the entries (tens, hundreds, thousands) currently held in the buffer.
This way, we can significantly reduce the S factor and eliminate the potential for log storms.
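The batching effect can be illustrated with a minimal Lambda handler sketch. It assumes the function is attached to the Kinesis Data Stream through an event source mapping, and that the records were written by a CloudWatch Logs subscription, which delivers them as base64-encoded, gzip-compressed JSON. The `make_record` helper is hypothetical and exists only to build a local test event; the actual routing logic is left as a placeholder.

```python
import base64
import gzip
import json


def make_record(log_group, messages, message_type="DATA_MESSAGE"):
    """Build a synthetic Kinesis record (hypothetical helper for local runs)."""
    payload = {
        "messageType": message_type,
        "logGroup": log_group,
        "logStream": "demo-stream",
        "logEvents": [
            {"id": str(i), "timestamp": 0, "message": m}
            for i, m in enumerate(messages)
        ],
    }
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))
    return {"kinesis": {"data": data.decode("ascii")}}


def handler(event, context=None):
    """Process a whole batch of log entries in a single invocation."""
    entries = []
    for record in event["Records"]:
        raw = gzip.decompress(base64.b64decode(record["kinesis"]["data"]))
        payload = json.loads(raw)
        if payload.get("messageType") != "DATA_MESSAGE":
            continue  # skip control messages sent by CloudWatch Logs
        for log_event in payload["logEvents"]:
            entries.append({
                "logGroup": payload["logGroup"],
                "message": log_event["message"],
            })
    # Routing and transformation of `entries` would go here (placeholder).
    return {"invocations": 1, "entries": len(entries)}


if __name__ == "__main__":
    event = {"Records": [
        make_record("/app/web", ["a", "b", "c"]),
        make_record("/app/api", ["d", "e"]),
    ]}
    print(handler(event))
```

One invocation, and therefore one second-order audit entry, covers every log entry in the batch, so this stage contributes 1/N to S for a batch of N entries.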
A generalized Kinesis-based solution can be found in the AWS Solutions Library.
In this post we discussed a solution to the Log Storm cloud security anti-pattern, which was introduced in the previous post.
The problem with the anti-pattern is that it creates a feedback loop that leads to excessive logging, generating unnecessary entries over weeks and months, and driving up ITSec infrastructure costs.
The solution to this anti-pattern involves reducing the number of second-order log entries. We presented a sample solution based on AWS Kinesis.
Make sure you check out our Secure Architecture assessment service and Security Controls consulting package to find out how we can help you improve your security operations!
Contact us!