A pocket guide to service health and performance for AWS API Gateway, Lambda and SQS

Published on: Sun Jul 31 2022


Goals

  • ✅ Understand the difference between metrics, tracing and logging

  • ✅ Know the AWS CloudWatch metrics to monitor for AWS API Gateway, Lambda and SQS

  • ✅ Know which AWS tools to use for each category

  • ✅ Know when to use metrics or tracing or logging

Content

Introduction

This is a quick guide on managing service health and performance for the infrastructure we’ve built.

Software development is not just everything leading up to a deployment; everything that happens after it is just as important.

That is where the rubber meets the road and where the real learning happens.

The details discussed in this guide are quite specific to AWS (i.e. the metrics and tooling). However, the concepts still apply to other cloud vendors.

In addition, some techniques described here are specific to Node.js. The fundamentals still apply to other runtimes, though the implementation may differ.

We’ll go over the following:

  • Metrics
  • Tracing
  • Logging

Let’s dive right in!

Metrics

Metrics provide insights into the systems within your infrastructure, whether for service health, performance or simply a better understanding of how they are used.

Metrics let you see the bigger picture, which also means they are more general and less specific.

Here are some of the metrics to keep an eye on for each component.

Note: The metrics listed here are not exhaustive. Which metrics you monitor will depend on your use case and scenario.

Note: AWS CloudWatch collects these metrics out of the box, so no additional configuration is required!

However, it can be useful to create a dashboard for your infrastructure.

Let’s take a look at the metrics.

API Gateway Metrics

API Gateway CloudWatch metrics

Metrics:

  • 4XXError - Number of client-side errors (4xx responses)
  • 5XXError - Number of server-side errors (5xx responses)
  • Latency - Time from when API Gateway receives a request from the client until it returns the response to the client, including the time spent in the backend
  • IntegrationLatency - Time from when API Gateway relays the request to the backend until it receives the backend’s response
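
As a rough illustration of how these four metrics combine, here is a minimal TypeScript sketch that turns one period of datapoints into error rates and an estimate of the time spent inside API Gateway itself. The `ApiGatewaySample` shape and the `summarizeApi` helper are hypothetical names, and the total request count is assumed to come from API Gateway’s `Count` metric, which is not listed above.

```typescript
// Hypothetical shape: one period of API Gateway datapoints, already fetched.
interface ApiGatewaySample {
  count: number; // Count (Sum): total requests in the period (assumed input)
  clientErrors: number; // 4XXError (Sum)
  serverErrors: number; // 5XXError (Sum)
  latencyMs: number; // Latency (Average)
  integrationLatencyMs: number; // IntegrationLatency (Average)
}

interface ApiGatewaySummary {
  clientErrorRate: number;
  serverErrorRate: number;
  gatewayOverheadMs: number;
}

function summarizeApi(s: ApiGatewaySample): ApiGatewaySummary {
  // Guard against dividing by zero when the API received no traffic.
  const rate = (n: number): number => (s.count === 0 ? 0 : n / s.count);
  return {
    clientErrorRate: rate(s.clientErrors),
    serverErrorRate: rate(s.serverErrors),
    // Latency includes the backend time, so the difference approximates
    // the time spent in API Gateway itself.
    gatewayOverheadMs: s.latencyMs - s.integrationLatencyMs,
  };
}
```

A rising server error rate points at the backend, while a growing gap between Latency and IntegrationLatency suggests the time is being spent in the gateway rather than in your code.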


SQS Metrics

SQS CloudWatch metrics

Metrics:

  • NumberOfMessagesSent - Number of messages added to the queue
  • SentMessageSize - Size in bytes of messages added to the queue (SQS enforces a maximum message size, 256 KB at the time of writing)
  • NumberOfMessagesDeleted - Number of messages deleted from the queue (normally after processing finishes)
  • NumberOfMessagesReceived - Number of messages returned to consumers reading from the queue
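
To make these counters more concrete, the sketch below derives three simple signals from one period of SQS datapoints. The `SqsSample` shape, the `summarizeQueue` helper and the field names are hypothetical, and the size limit defaults to the 256 KB figure quoted above; treat it as a parameter rather than a guarantee.

```typescript
// Hypothetical shape: one period of SQS datapoints, already fetched.
interface SqsSample {
  sent: number; // NumberOfMessagesSent (Sum)
  received: number; // NumberOfMessagesReceived (Sum)
  deleted: number; // NumberOfMessagesDeleted (Sum)
  maxSentBytes: number; // SentMessageSize (Maximum)
}

interface SqsSummary {
  backlogGrowth: number;
  receivedNotDeleted: number;
  sizeLimitUsed: number;
}

// 256 KB, the limit quoted in this guide; pass your own value if it differs.
const ASSUMED_SIZE_LIMIT_BYTES = 256 * 1024;

function summarizeQueue(
  s: SqsSample,
  sizeLimitBytes: number = ASSUMED_SIZE_LIMIT_BYTES
): SqsSummary {
  return {
    // Positive: producers add messages faster than consumers finish them.
    backlogGrowth: s.sent - s.deleted,
    // Positive: some messages were read but not deleted in this period
    // (still being processed, or failed and likely to be retried).
    receivedNotDeleted: s.received - s.deleted,
    // Fraction of the size limit used by the largest message in the period.
    sizeLimitUsed: s.maxSentBytes / sizeLimitBytes,
  };
}
```

A backlog that keeps growing across periods usually means the consumer cannot keep up, which is a cue to look at the consuming Lambda function’s metrics next.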


AWS Lambda Metrics

AWS Lambda CloudWatch metrics

Metrics:

  • Invocations - Number of times the function is invoked (includes both successes and errors)

  • Errors - Number of invocations that result in a function error

  • Throttles - Number of invocation requests that get throttled (meaning no concurrency was available to scale up, for example because a concurrency limit was reached)

Concurrency

  • ConcurrentExecutions - Number of function instances that are processing requests

  • ProvisionedConcurrencyInvocations - Number of times the function is invoked on provisioned concurrency

  • ProvisionedConcurrencySpilloverInvocations - Number of times the function is invoked on standard concurrency because all provisioned concurrency is in use
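
The Lambda counters are easier to reason about as ratios. The following is a minimal sketch, not a definitive formula: `LambdaSample` and `summarizeFunction` are hypothetical names, and the throttle rate assumes that throttled requests are not counted in Invocations, so they are added back to get the total number of attempts.

```typescript
// Hypothetical shape: one period of Lambda datapoints (all Sum statistics).
interface LambdaSample {
  invocations: number; // Invocations
  errors: number; // Errors
  throttles: number; // Throttles
  provisionedInvocations: number; // ProvisionedConcurrencyInvocations
  spilloverInvocations: number; // ProvisionedConcurrencySpilloverInvocations
}

interface LambdaSummary {
  errorRate: number;
  throttleRate: number;
  spilloverShare: number;
}

// Returns 0 instead of NaN when there was no traffic.
function ratio(part: number, whole: number): number {
  return whole === 0 ? 0 : part / whole;
}

function summarizeFunction(s: LambdaSample): LambdaSummary {
  return {
    errorRate: ratio(s.errors, s.invocations),
    // Assumption: throttled requests never became invocations,
    // so total attempts = invocations + throttles.
    throttleRate: ratio(s.throttles, s.invocations + s.throttles),
    // Share of provisioned-concurrency traffic that overflowed
    // onto standard concurrency.
    spilloverShare: ratio(
      s.spilloverInvocations,
      s.provisionedInvocations + s.spilloverInvocations
    ),
  };
}
```

A high spillover share hints that the provisioned concurrency setting is too low for the traffic, so more requests risk hitting cold starts.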

💡 Note: Concurrency matters for AWS Lambda functions because, depending on the usage patterns, cold starts can drastically impact performance.

Want to read more? Check out my analysis in “Why You Should Not Use Serverless For Everything”.

When to use metrics?

Metrics should be the first thing you look at to get high-level insights into your infrastructure.

They are analogous to a car dashboard: they give you an overview of what is happening.

This can be for performance, errors or just to better understand the usage patterns.
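
To carry the car dashboard analogy over to CloudWatch, here is a hedged sketch that builds a dashboard body containing the metrics covered in this guide. `DashboardTargets` and `buildDashboardBody` are hypothetical names, the resource names are placeholders, and the sketch assumes a REST API (which uses the `ApiName` dimension; HTTP APIs use `ApiId` instead). It only builds the JSON structure and does not call AWS.

```typescript
// Placeholder inputs: replace with the names of your own resources.
interface DashboardTargets {
  region: string;
  apiName: string; // API Gateway REST API name (ApiName dimension)
  functionName: string; // Lambda function name (FunctionName dimension)
  queueName: string; // SQS queue name (QueueName dimension)
}

interface MetricWidget {
  type: "metric";
  width: number;
  height: number;
  properties: {
    title: string;
    region: string;
    stat: string;
    period: number;
    metrics: string[][]; // [namespace, metric, dimensionName, dimensionValue]
  };
}

function buildDashboardBody(t: DashboardTargets): { widgets: MetricWidget[] } {
  // One helper so every widget shares the same size, region and period.
  const widget = (title: string, stat: string, metrics: string[][]): MetricWidget => ({
    type: "metric",
    width: 12,
    height: 6,
    properties: { title, region: t.region, stat, period: 300, metrics },
  });
  return {
    widgets: [
      widget("API Gateway errors", "Sum", [
        ["AWS/ApiGateway", "4XXError", "ApiName", t.apiName],
        ["AWS/ApiGateway", "5XXError", "ApiName", t.apiName],
      ]),
      widget("API Gateway latency", "Average", [
        ["AWS/ApiGateway", "Latency", "ApiName", t.apiName],
        ["AWS/ApiGateway", "IntegrationLatency", "ApiName", t.apiName],
      ]),
      widget("Lambda invocations", "Sum", [
        ["AWS/Lambda", "Invocations", "FunctionName", t.functionName],
        ["AWS/Lambda", "Errors", "FunctionName", t.functionName],
        ["AWS/Lambda", "Throttles", "FunctionName", t.functionName],
      ]),
      widget("SQS messages", "Sum", [
        ["AWS/SQS", "NumberOfMessagesSent", "QueueName", t.queueName],
        ["AWS/SQS", "NumberOfMessagesReceived", "QueueName", t.queueName],
        ["AWS/SQS", "NumberOfMessagesDeleted", "QueueName", t.queueName],
      ]),
    ],
  };
}
```

Serializing the result with `JSON.stringify` gives a string that could be supplied as the dashboard body when creating a CloudWatch dashboard, whether through the console’s source editor or the PutDashboard API.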

Metric Tool


Tracing

Trace Map:

AWS X-Ray trace map of AWS Lambda and AWS SQS

Tracing:

AWS X-Ray trace of AWS Lambda and AWS SQS

Metrics give us signals that something is wrong; we can then drill down into traces or logs.

Tracing is where it gets more specific and granular, so we can better understand the performance of each sub-system within the infrastructure.

It provides insight into the performance of a specific execution path of a request.

Often, this is useful for understanding which sub-system is causing a bottleneck or hitting a performance limit.
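
To show the idea of locating a bottleneck from a trace, here is a small sketch that works on timing data only. It is not the X-Ray API: the `Segment` shape and `findBottleneck` helper are hypothetical, and the sketch assumes a flat list of non-nested segments, one per sub-system, for a single request.

```typescript
// Hypothetical, simplified view of one traced request.
interface Segment {
  name: string; // sub-system that handled this part of the request
  startMs: number;
  endMs: number;
}

interface Bottleneck {
  name: string;
  durationMs: number;
  shareOfTrace: number; // fraction of the end-to-end time
}

function findBottleneck(segments: Segment[]): Bottleneck | undefined {
  if (segments.length === 0) return undefined;
  let traceStart = segments[0].startMs;
  let traceEnd = segments[0].endMs;
  let slowest = segments[0];
  for (const seg of segments) {
    // Track the end-to-end window of the whole trace.
    traceStart = Math.min(traceStart, seg.startMs);
    traceEnd = Math.max(traceEnd, seg.endMs);
    // Keep the segment with the longest duration seen so far.
    if (seg.endMs - seg.startMs > slowest.endMs - slowest.startMs) {
      slowest = seg;
    }
  }
  const durationMs = slowest.endMs - slowest.startMs;
  const totalMs = traceEnd - traceStart;
  return {
    name: slowest.name,
    durationMs,
    shareOfTrace: totalMs === 0 ? 0 : durationMs / totalMs,
  };
}
```

Given segments for API Gateway, the ingestion function, the queue and the process-queue function, the result names the slowest hop and how much of the request it accounts for, which is the question a trace view answers visually.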

When to use tracing?

Use tracing when you want to better understand the performance of the sub-systems within your infrastructure.

This will help you identify bottlenecks or performance limits.

It also helps you visualize the paths that sampled requests take through the sub-systems.

Tracing Tool

Logging

AWS Invoke logs initial

Logging is similar to tracing in that it describes a specific request or event.

It is an invaluable debugging and error tracking tool in a production setting.

The most efficient setup uses some sort of centralized logging system (e.g. the Elastic (ELK) Stack or CloudWatch Logs Insights).

That means the code has to set up logging in a way that lets every log entry be related back to the request that produced it.

A good way to do this is to attach a transaction ID or correlation ID to each request, so that all the logs for that request can be tied to one ID.

This makes it easy to drill down into the logs for a specific issue by just searching by the ID.
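
Here is a minimal sketch of the correlation ID idea, assuming structured JSON log lines. The names `createRequestLogger`, `findByCorrelationId` and `Sink` are hypothetical, and the log destination is injected so the example stays self-contained.

```typescript
type LogLevel = "info" | "warn" | "error";
type Sink = (line: string) => void;
type RequestLogger = (
  level: LogLevel,
  message: string,
  extra?: { [key: string]: unknown }
) => void;

// Returns a logger that stamps every entry with the request's correlation ID.
function createRequestLogger(correlationId: string, sink: Sink): RequestLogger {
  return (
    level: LogLevel,
    message: string,
    extra: { [key: string]: unknown } = {}
  ): void => {
    // Spread `extra` first so it cannot overwrite the core fields.
    sink(
      JSON.stringify({
        ...extra,
        level,
        message,
        correlationId,
        timestamp: new Date().toISOString(),
      })
    );
  };
}

// Local stand-in for "search the centralized logs by ID".
function findByCorrelationId(lines: string[], correlationId: string): string[] {
  return lines.filter((line) => {
    const entry = JSON.parse(line) as { correlationId?: string };
    return entry.correlationId === correlationId;
  });
}
```

In a Lambda function, the sink could simply be `(line) => console.log(line)`, since output written to the console ends up in CloudWatch Logs. One possible source for the ID is the request ID that API Gateway attaches to each request, passed along in the SQS message so both functions log the same value; that wiring is an assumption about your setup rather than something this sketch does for you.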

The AWS infrastructure we are working with has the following logs:

  • API Gateway logs
  • AWS Lambda logs (Ingestion & Process-Queue functions)

Custom filters

On occasion, if some events are important enough, you can create custom filters that tail the logs to help you understand what is going on in production.

You can even convert the counts of these log events into metrics that can be alerted on.
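
CloudWatch Logs offers metric filters as a managed feature for this. To illustrate the underlying idea only, the sketch below counts log events that contain a given term and groups the counts per minute, which is the kind of series an alarm could then watch. `LogEvent`, `MetricPoint` and `countMatchesPerMinute` are hypothetical names, and the plain substring match is a simplification of real filter patterns.

```typescript
interface LogEvent {
  timestampMs: number; // epoch milliseconds
  message: string;
}

interface MetricPoint {
  minuteStartMs: number;
  count: number;
}

// Counts events whose message contains `term`, grouped into one-minute buckets.
function countMatchesPerMinute(events: LogEvent[], term: string): MetricPoint[] {
  const counts: { [minute: number]: number } = {};
  for (const event of events) {
    if (event.message.indexOf(term) === -1) continue; // not a match
    const minute = Math.floor(event.timestampMs / 60000) * 60000;
    counts[minute] = (counts[minute] || 0) + 1;
  }
  // Emit the buckets in chronological order.
  return Object.keys(counts)
    .map((key) => Number(key))
    .sort((a, b) => a - b)
    .map((minute) => ({ minuteStartMs: minute, count: counts[minute] }));
}
```

Minutes with no matching events are simply absent from the result, so anything consuming this series would need to decide how to treat missing data.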

Most of the time you won’t need this, but sometimes the out-of-the-box metrics may not be enough for your use case.

When to use logging?

Use logging if you need to better understand the events within the sub-systems for a particular request.

Think of it like leaving breadcrumbs in order to track down issues.

It is very useful when you are debugging errors in production!

Logging Tool


Conclusion

That’s it! I hope you found this guide helpful.

Let’s do a quick recap!

Recap:

  • Metrics - Use this for a high-level overview of the infrastructure

  • Tracing - Use this to better understand the performance of the sub-systems and the paths that sampled requests take across them

  • Logging - Use this to better understand the specific events occurring in the sub-systems during a request

In general, for service health and performance, start with metrics, then move on to tracing or logging for more granular insights depending on the use case.

In a follow-up guide, we will get hands-on and see how to apply some of the things we discussed here, including setting up tracing and logging for the infrastructure we built in the technical series.

Stay tuned!



Jerry Chang 2022. All rights reserved.