Logging is a critical component of our operational infrastructure. A single call from a support team member can have us spend hours looking through log files across multiple clusters of servers. A reliable, secure, and scalable log aggregation solution not only makes all the difference during a crunch-time debugging session, but can also help with anomaly detection, surfacing problems that would be hard to find from within the application.
In search of such a solution I explored the ELK stack (Elasticsearch, Logstash, and Kibana) and the EKK stack (Amazon Elasticsearch Service, Amazon Kinesis, and Kibana). It became increasingly clear that the solution had to be cloud based, which would eliminate the heavy lifting of deploying, managing, monitoring, and scaling the log aggregation solution.
The solution also needed to be secure, as the logs are a critical piece of any application’s web components. I tried Elastic Cloud (ELK stack) for a while and then finally decided to give the EKK stack a try. One big reason I chose the EKK stack over the ELK stack was that we were already using a lot of AWS services, so it was easy to set up the stack. Secondly, security is handled very well in the AWS ecosystem. Cost was also a factor in the decision.
With the EKK stack, you can focus on analyzing logs and debugging your application instead of managing and scaling the system that aggregates the logs. In this blog I will take you through the process by which we analyze our web server logs.
Amazon Elasticsearch Service is a popular search and analytics engine that provides real-time application monitoring and log and click-stream analytics. Amazon Kinesis Agent is an easy-to-install standalone Java software application that collects and sends data. Amazon Kinesis Firehose provides the easiest way to load streaming data into AWS. In the middle of the Kinesis pipeline I also have a Lambda function that processes the logs and converts them into JSON, making the data easier to store in Elasticsearch and analyze in Kibana. Kibana is used for discovery and visualization.
To begin with, the application writes an entry to its local access logs as users work through different pages on the server. These access logs are monitored by an agent (the AWS Kinesis Agent), installed to create a stream of data between the application server and the back-end storage (Elasticsearch in this case). The agent is responsible for securely copying the logs at periodic intervals to a transformation and processing system. In our case that system is an AWS Lambda function, which formats the incoming log data.
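As a rough illustration of the agent side, the Kinesis Agent reads its settings from a JSON file (by default /etc/aws-kinesis/agent.json) in which each "flow" maps a file pattern to a Firehose delivery stream. The sketch below generates a minimal configuration of that shape. The log path and the stream name are hypothetical placeholders, not the values used in our setup.

```python
import json


def build_agent_config(file_pattern: str, delivery_stream: str) -> dict:
    """Build a minimal Kinesis Agent configuration with a single Firehose flow."""
    return {
        "flows": [
            {
                # Files the agent tails for new log entries.
                "filePattern": file_pattern,
                # Firehose delivery stream that receives the entries.
                "deliveryStream": delivery_stream,
            }
        ]
    }


# Hypothetical path and stream name, for illustration only.
config = build_agent_config("/var/log/httpd/access_log*", "weblog-delivery-stream")
agent_json = json.dumps(config, indent=2)
print(agent_json)
```

The printed JSON is what would be saved as the agent configuration file before starting the agent.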
The Lambda function is a serverless piece of the AWS architecture that invokes a script to break the log entries down into key-value pairs in a JSON document. Once transformed, the data is passed on to the Elasticsearch storage service, where it is stored in an index. A separate index is created to hold the data for each month, so traversing the data becomes easier.
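A transformation function of this kind can be sketched as follows. This is not our production script. It assumes the access logs use the Apache combined log format, and the JSON field names (client_ip, verb, request, and so on) are chosen here for illustration. The overall shape follows the Firehose data transformation contract: each incoming record carries a recordId and base64-encoded data, and each returned record carries the same recordId, a result of "Ok" or "ProcessingFailed", and base64-encoded output.

```python
import base64
import json
import re

# Assumption: access logs are in Apache combined log format.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Turn one access log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    doc = match.groupdict()
    doc["status"] = int(doc["status"])
    # A "-" in the size column means no body was sent.
    doc["bytes"] = 0 if doc["bytes"] == "-" else int(doc["bytes"])
    return doc


def lambda_handler(event, context):
    """Firehose transformation entry point: raw log lines in, JSON documents out."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        doc = parse_line(line)
        if doc is None:
            # Hand unparseable lines back unchanged and flag them as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        payload = json.dumps(doc).encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```

Converting the status code and byte count to integers lets Elasticsearch treat them as numeric fields, which is useful for range filters and sums in Kibana.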
The data is then accessed via Kibana dashboards based on an index pattern. Kibana discovers every field in the index and the field’s associated core type as recorded by Elasticsearch.
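To make the relationship between monthly indices and the Kibana index pattern concrete, here is a small sketch. It assumes a naming scheme of a base name followed by the year and month. The base name "weblogs" is hypothetical, and the exact names in a real deployment depend on how index rotation is configured.

```python
from datetime import datetime, timezone


def monthly_index_name(base: str, when: datetime) -> str:
    """Name of the index holding documents for the month that contains `when`."""
    return f"{base}-{when:%Y-%m}"


def index_pattern(base: str) -> str:
    """Wildcard pattern that matches every monthly index for `base`."""
    return f"{base}-*"


# Hypothetical base name, for illustration only.
print(monthly_index_name("weblogs", datetime(2017, 10, 10, tzinfo=timezone.utc)))
print(index_pattern("weblogs"))
```

With a scheme like this, a single wildcard pattern in Kibana covers all months, while a query limited to one month only has to touch one index.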
Needless to say, I was startled by some of the findings from the process. They might not be extraordinary, but they are quite informative about user and application behavior. Here I will go over some of the findings and analysis from the logs. There are many such findings, but to keep this post brief I will go over just a few here.
Top requests on the server
This gives a good perspective on the general usage of pages within our application. A further drill-down allowed us to figure out the usage of certain pages and pinpoint some user habits in the application.
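Behind a chart like this, Kibana issues a terms aggregation against Elasticsearch. A minimal sketch of such a request body is below. The field name "request.keyword" is an assumption: it presumes the request path is stored under a field called request with a default keyword sub-field, which may not match the actual mapping.

```python
import json


def top_requests_query(field: str = "request.keyword", size: int = 10) -> dict:
    """Build a search body that returns the `size` most frequent values of `field`."""
    return {
        # No individual hits are needed, only the aggregation buckets.
        "size": 0,
        "aggs": {
            "top_requests": {
                "terms": {"field": field, "size": size}
            }
        },
    }


query = top_requests_query()
print(json.dumps(query, indent=2))
```

Swapping the field for the status code or the user agent would give the equivalent of the other charts in this post.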
Summary by Response from server
This is a high-level chart of the responses the server sends. Because the application has been running for so many years, with several versions of the client application in distribution, certain links that are no longer available on the server are still being accessed, and this graph allowed us to pinpoint them. I was startled to find that some requests hitting the server get an HTTP 404 response.
Find request/response by subdomain
Now, at the core of our infrastructure is a shared/co-location model of hosting multiple clients on the same set of servers. This makes it incumbent upon us to give each of them a similar experience. This chart lets us discover high-usage users, which would, in the future, help us plan infrastructure upgrades according to which clusters see high usage and which see low usage.
Find usage by browser
Who accesses the server most? Are those Android clients, iOS clients, or web users? Only log analysis can tell us that. With technology so spread out, this graph gives us the exact usage of the application by browser. This, in turn, also lets us drill down into issues and isolate them to a particular browser or device. Secondly, it helps the engineering team keep a close eye on the behavior of the application on an untested browser.
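The grouping behind such a chart can be approximated with a simple heuristic on the user-agent string. The sketch below is only a rough substring check written for illustration, not the logic Kibana uses, and the three category names are my own. Real user-agent parsing has many more edge cases.

```python
from collections import Counter


def classify_client(user_agent: str) -> str:
    """Roughly bucket a user-agent string into Android, iOS, or Web."""
    ua = user_agent.lower()
    if "android" in ua:
        return "Android"
    if "iphone" in ua or "ipad" in ua or "ipod" in ua:
        return "iOS"
    return "Web"


def usage_by_client(user_agents) -> Counter:
    """Count requests per client category."""
    return Counter(classify_client(ua) for ua in user_agents)


# Made-up sample user agents, for illustration only.
sample = [
    "Mozilla/5.0 (Linux; Android 8.0; Pixel 2) AppleWebKit/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]
counts = usage_by_client(sample)
print(counts)
```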
Find access by Region
This is just one of those interesting-to-plot graphs that gives you a clear picture of the geography of the users. Once we combine the access logs from all our data centers into aggregated graphs, this could help us add resources and plan capacity for those servers.
This is still a work in progress and hasn’t been put into use in our production infrastructure. The main emphasis of this task was to understand the AWS cloud infrastructure that can be used to stream log data, the cost associated with it, and the information it generates.
This will be extended to other aspects of the application, such as Post processing and API usage. It could also be used in customer support situations to track whether a particular version of the application causes many errors for end users. In the end, the one thing we cannot deny is that “Logs don’t lie.”