<h2 id="goals"><a href="#goals" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Goals</h2>
<ul>
<li>
<p>✅ Understand the difference between metrics, tracing and logging</p>
</li>
<li>
<p>✅ Know which Amazon CloudWatch metrics to monitor for API Gateway, Lambda and SQS</p>
</li>
<li>
<p>✅ Know which AWS tools to use for each category</p>
</li>
<li>
<p>✅ Know when to use metrics or tracing or logging</p>
</li>
</ul>
<h2 id="content"><a href="#content" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Content</h2>
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#metrics">Metrics</a>
<ul>
<li><a href="#api-gateway-metrics">API Gateway Metrics</a></li>
<li><a href="#sqs-metrics">SQS Metrics</a></li>
<li><a href="#aws-lambda">AWS Lambda</a></li>
</ul>
</li>
<li><a href="#tracing">Tracing</a></li>
<li><a href="#logging">Logging</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<h2 id="introduction"><a href="#introduction" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Introduction</h2>
<p>This is a quick guide on managing service health and performance for the infrastructure we’ve built.</p>
<p>Software development is not just everything leading up to the deployment; what happens after it is just as important.</p>
<p>That is where the rubber meets the road and where the real learning happens.</p>
<p>The details discussed in this guide are quite specific to AWS (i.e. the metrics and tooling); however, the concepts still apply to other cloud vendors.</p>
<p>In addition, some techniques described here are specific to Node.js, but the fundamentals still apply to other runtimes even if the implementation differs.</p>
<p><strong>We’ll go over the following:</strong></p>
<ul>
<li>Metrics</li>
<li>Tracing</li>
<li>Logging</li>
</ul>
<p>Let’s dive right in!</p>
<h2 id="metrics"><a href="#metrics" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Metrics</h2>
<p>Metrics provide insights into the systems within your infrastructure, whether for service health, performance or simply a better understanding of how they behave.</p>
<p>Metrics allow you to see the bigger picture, which also means they are more general and less specific.</p>
<p>Here are some of the metrics to keep an eye on in each component.</p>
<p><strong>Note:</strong> The metrics listed here are not exhaustive; which ones you monitor will depend on your use case and scenario.</p>
<blockquote class="common">
<b>Note:</b> Amazon CloudWatch collects these metrics out of the box, so no additional configuration is required!
<p>However, it can be useful to create a dashboard for your infrastructure.</p>
</blockquote>
<p>Let’s take a look at the metrics.</p>
<h3 id="api-gateway-metrics"><a href="#api-gateway-metrics" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>API Gateway Metrics</h3>
<img src="/images/webhook-cloudwatch-metrics-api-gateway.png" alt="API gateway Cloudwatch metrics" style="width:100%">
<p><strong>Metrics:</strong></p>
<ul>
<li><code class="language-">4XXError</code> - Number of client-side errors</li>
<li><code class="language-">5XXError</code> - Number of server-side errors</li>
<li><code class="language-">Latency</code> - Time between API Gateway receiving a request from the client and returning the response to the client; this includes the integration latency (client -> API Gateway)</li>
<li><code class="language-">IntegrationLatency</code> - Time between API Gateway relaying the request to the backend and receiving the backend's response (API Gateway -> Backend)</li>
</ul>
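<p>As a rough sketch of how the metrics above could be requested programmatically, the Python snippet below builds the query list that the CloudWatch <code class="language-">GetMetricData</code> API accepts. It only builds plain dictionaries and makes no AWS calls. The API name <code class="language-">webhook-api</code>, the 5-minute period and the choice of statistics are assumptions for illustration, and the <code class="language-">ApiName</code> dimension assumes a REST API.</p>
<pre><code class="language-python"># Sketch: build CloudWatch GetMetricData queries for the API Gateway metrics above.
# The API name, period and statistics are illustrative assumptions.

API_GATEWAY_STATS = {
    "4XXError": "Sum",
    "5XXError": "Sum",
    "Latency": "Average",
    "IntegrationLatency": "Average",
}


def build_api_gateway_queries(api_name, period_seconds=300):
    """Return one MetricDataQueries entry per metric in API_GATEWAY_STATS."""
    queries = []
    for index, (metric_name, stat) in enumerate(API_GATEWAY_STATS.items()):
        queries.append({
            "Id": f"m{index}",  # query ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "ApiName", "Value": api_name}],
                },
                "Period": period_seconds,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


queries = build_api_gateway_queries("webhook-api")  # hypothetical API name
print([q["MetricStat"]["Metric"]["MetricName"] for q in queries])
</code></pre>
<p>With boto3 installed and credentials configured, a list like this could be passed as <code class="language-">MetricDataQueries</code> to the CloudWatch client's <code class="language-">get_metric_data</code> call, together with a start and end time.</p>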
<h4 id="-helpful-reference"><a href="#-helpful-reference" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>📝 Helpful reference:</h4>
<ul>
<li><a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-metrics-and-dimensions.html" target="_blank" rel="nofollow noopener noreferrer">AWS - API gateway metrics</a></li>
</ul>
<h3 id="sqs-metrics"><a href="#sqs-metrics" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>SQS Metrics</h3>
<img src="/images/webhook-cloudwatch-metrics-sqs.png" alt="SQS Cloudwatch metrics" style="width:100%">
<p><strong>Metrics:</strong></p>
<ul>
<li><code class="language-">NumberOfMessagesSent
</code> - Number of messages added to the queue</li>
<li><code class="language-">SentMessageSize</code> - Size of messages added to the queue, in bytes (SQS has a message size limit, 256 KB at the time of writing)</li>
<li><code class="language-">NumberOfMessagesDeleted</code> - Number of messages deleted from the queue (after processing finishes)</li>
<li><code class="language-">NumberOfMessagesReceived
</code> - Number of messages read from the queue</li>
</ul>
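<p>One way to read these counters together is to compare how many messages were sent with how many were deleted over the same window: if more are sent than deleted, the backlog is growing. The Python sketch below illustrates that arithmetic. The function names and sample numbers are made up for illustration; real values would come from CloudWatch.</p>
<pre><code class="language-python"># Sketch: compare SQS sent vs deleted counts to see whether consumers keep up.
# The sample numbers are made up; real values would come from CloudWatch.

def backlog_growth(sent_counts, deleted_counts):
    """Return how many more messages were sent than deleted over the window."""
    return sum(sent_counts) - sum(deleted_counts)


def describe_queue(sent_counts, deleted_counts):
    growth = backlog_growth(sent_counts, deleted_counts)
    if growth > 0:
        return f"backlog grew by {growth} messages"
    return "consumers are keeping up"


# Five-minute sums of NumberOfMessagesSent and NumberOfMessagesDeleted (hypothetical).
sent = [120, 150, 180]
deleted = [118, 130, 140]
print(describe_queue(sent, deleted))
</code></pre>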
<h4 id="-helpful-reference-1"><a href="#-helpful-reference-1" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>📝 Helpful reference:</h4>
<ul>
<li><a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html" target="_blank" rel="nofollow noopener noreferrer">AWS - SQS metrics</a></li>
</ul>
<h3 id="aws-lambda"><a href="#aws-lambda" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>AWS Lambda</h3>
<img src="/images/webhook-cloudwatch-metrics-lambda.png" alt="AWS Lambda Cloudwatch metrics" style="width:100%">
<ul>
<li>
<p><code class="language-">Invocations
</code> - Number of times the function is called (includes both success and errors)</p>
</li>
<li>
<p><code class="language-">Errors</code> - Number of invocations that result in a function error</p>
</li>
<li>
<p><code class="language-">Throttles</code> - Number of invocation requests that are throttled (no concurrency was available because the function's or the account's concurrency limit was reached)</p>
</li>
</ul>
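<p>These three counters are most useful in combination. As a minimal Python sketch, the snippet below derives an error rate from <code class="language-">Errors</code> and <code class="language-">Invocations</code> and flags any throttling. The function names and the 1% threshold are assumptions for illustration, not AWS recommendations.</p>
<pre><code class="language-python"># Sketch: derive an error rate and a throttle flag from Lambda metric sums.
# The 1% threshold is an arbitrary assumption, not an AWS recommendation.

def error_rate(invocations, errors):
    """Fraction of invocations that ended in an error (0.0 when idle)."""
    if invocations == 0:
        return 0.0
    return errors / invocations


def health_summary(invocations, errors, throttles, max_error_rate=0.01):
    rate = error_rate(invocations, errors)
    return {
        "error_rate": rate,
        "errors_above_threshold": rate > max_error_rate,
        "throttled": throttles > 0,
    }


print(health_summary(invocations=2000, errors=50, throttles=3))
</code></pre>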
<h4 id="concurrency"><a href="#concurrency" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Concurrency</h4>
<ul>
<li>
<p><code class="language-">ConcurrentExecutions
</code> - Number of function instances that are handling the requests</p>
</li>
<li>
<p><code class="language-">ProvisionedConcurrencyInvocations</code> - Number of times the function is invoked on provisioned concurrency</p>
</li>
<li>
<p><code class="language-">ProvisionedConcurrencySpilloverInvocations</code> - Number of times the function is invoked on standard concurrency because all provisioned concurrency is in use</p>
</li>
</ul>
<blockquote class="common">
<b>💡 Note:</b> Concurrency matters for AWS Lambda functions because, depending on the usage patterns, cold starts can drastically impact performance.
<p>Want to read more? Check out my analysis, <a href="https://www.linkedin.com/feed/update/urn:li:activity:6940693321933611009/" target="_blank">“Why You Should Not Use Serverless For Everything”</a>.</p>
</blockquote>
<blockquote class="common">
<h3 id="when-to-use-metrics-"><a href="#when-to-use-metrics-" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>When to use metrics?</h3>
<p>Metrics should be the first thing you look at to get high-level insights into your infrastructure.</p>
<p>They are analogous to a car dashboard: they give you an overview of what is happening.</p>
<p>That overview can cover performance, errors or simply the usage patterns.</p>
</blockquote>
<h3 id="metric-tool"><a href="#metric-tool" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Metric Tool</h3>
<ul>
<li><a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html" target="_blank" rel="nofollow noopener noreferrer">AWS Cloudwatch</a></li>
</ul>
<h4 id="-helpful-reference-2"><a href="#-helpful-reference-2" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>📝 Helpful reference:</h4>
<ul>
<li><a href="https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html#monitoring-metrics-invocation" target="_blank" rel="nofollow noopener noreferrer">AWS - Lambda invocation metrics</a></li>
<li><a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html" target="_blank" rel="nofollow noopener noreferrer">AWS - Concurrency configuration</a></li>
</ul>
<h2 id="tracing"><a href="#tracing" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Tracing</h2>
<p><strong>Trace Map:</strong></p>
<img src="/images/webhook-service-trace-map.png" alt="AWS X-ray trace map of AWS Lambda and AWS SQS" style="width:100%">
<p><strong>Tracing:</strong></p>
<img src="/images/aws-xray-normal.png" alt="AWS X-ray trace of AWS Lambda and AWS SQS" style="width:100%">
<p>Metrics signal that there is a problem; from there we can drill down into traces or logs.</p>
<p>Tracing is where it gets more specific and granular, so we can better understand the performance of each sub-system within the infrastructure.</p>
<p>It provides insights into the performance of a specific execution path of a request.</p>
<p>Often, this is useful for understanding which sub-system is the bottleneck or is limiting performance.</p>
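<p>To make the idea of finding a bottleneck concrete, here is a small Python sketch that takes a simplified trace and reports the slowest segment. The segments are hand-written samples laid out sequentially for simplicity; they borrow only the <code class="language-">name</code>, <code class="language-">start_time</code> and <code class="language-">end_time</code> fields, and the segment names and timings are assumptions rather than real X-Ray output.</p>
<pre><code class="language-python"># Sketch: find the slowest segment in a simplified trace.
# The segments are hand-written samples that only borrow the
# name/start_time/end_time fields; real X-Ray segments carry more data.

def segment_durations(segments):
    """Map each segment name to its duration in milliseconds."""
    return {
        s["name"]: round((s["end_time"] - s["start_time"]) * 1000)
        for s in segments
    }


def slowest_segment(segments):
    durations = segment_durations(segments)
    name = max(durations, key=durations.get)
    return name, durations[name]


# Times are in seconds relative to the start of the request (hypothetical).
trace = [
    {"name": "api-gateway", "start_time": 0.000, "end_time": 0.050},
    {"name": "ingestion-lambda", "start_time": 0.050, "end_time": 0.250},
    {"name": "sqs", "start_time": 0.250, "end_time": 0.275},
    {"name": "process-queue-lambda", "start_time": 0.275, "end_time": 1.125},
]
print(slowest_segment(trace))
</code></pre>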
<blockquote class="common">
<h3 id="when-to-use-tracing-"><a href="#when-to-use-tracing-" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>When to use tracing?</h3>
<p>Use tracing when you want to better understand the performance of the sub-systems within your infrastructure.</p>
<p>This will help you identify bottlenecks or performance limits.</p>
<p>It also helps you visualize the paths that sampled requests take through the sub-systems.</p>
</blockquote>
<h3 id="tracing-tool"><a href="#tracing-tool" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Tracing Tool</h3>
<ul>
<li><a href="https://aws.amazon.com/xray/" target="_blank" rel="nofollow noopener noreferrer">AWS X-Ray</a></li>
</ul>
<h2 id="logging"><a href="#logging" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Logging</h2>
<img src="/images/aws-lambda-invoke-logs-3.png" alt="AWS Invoke logs initial" style="width:100%">
<p>Logging is similar to tracing in that it describes a specific request or event.</p>
<p>It is an invaluable debugging and error tracking tool in a production setting.</p>
<p>The most efficient setup uses a centralized logging system (e.g. the Elastic (ELK) Stack or CloudWatch Logs Insights).</p>
<p>That means the code has to set up logging so that every log entry can be related to the request that produced it.</p>
<p>A good way to do this is to assign a transaction ID or correlation ID to each request, so that all the logs for a request can be tied to that ID.</p>
<p>This makes it easy to drill down into the logs for a specific issue by just searching by the ID.</p>
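<p>A minimal sketch of the correlation ID idea, written in Python, might look like the following: reuse the ID the caller sent if there is one, otherwise generate one, and stamp it on every log line for the request. The <code class="language-">x-correlation-id</code> header name, the JSON field names and the queue name are assumptions for illustration.</p>
<pre><code class="language-python"># Sketch: attach a correlation ID to every log line for a request.
# The "x-correlation-id" header name and the field names are assumptions.
import json
import uuid


def get_correlation_id(headers):
    """Reuse the caller's correlation ID if present, otherwise create one."""
    return headers.get("x-correlation-id") or str(uuid.uuid4())


def format_log(correlation_id, level, message, **fields):
    """Return one JSON log line that carries the correlation ID."""
    entry = {"correlationId": correlation_id, "level": level, "message": message}
    entry.update(fields)
    return json.dumps(entry)


def handle_request(headers):
    correlation_id = get_correlation_id(headers)
    lines = [
        format_log(correlation_id, "INFO", "request received"),
        format_log(correlation_id, "INFO", "message queued", queue="webhook-queue"),
    ]
    for line in lines:
        print(line)
    return lines


handle_request({"x-correlation-id": "abc-123"})
</code></pre>
<p>Searching the centralized logs for <code class="language-">abc-123</code> would then return both lines for this request.</p>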
<p><strong>The AWS infrastructure we are working with has the following logs:</strong></p>
<ul>
<li>API gateway logs</li>
<li>AWS Lambda logs (Ingestion &#x26; Process-Queue functions)</li>
</ul>
<h3 id="custom-filters"><a href="#custom-filters" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Custom filters</h3>
<p>If some log events are important enough, you can create custom filters that watch the logs for them to help you understand what is going on in production.</p>
<p>You can even convert the counts of these log events into metrics that can be alerted on.</p>
<p>Most of the time you won’t need this, but sometimes the out-of-the-box metrics are not enough for your use case.</p>
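<p>As a hedged Python illustration of turning log events into a metric, the snippet below builds the parameters for a CloudWatch Logs metric filter that counts log events whose JSON <code class="language-">level</code> field is <code class="language-">ERROR</code>. It only builds a dictionary. The log group name, filter name, namespace and the assumption that logs are written as JSON with a <code class="language-">level</code> field are all hypothetical.</p>
<pre><code class="language-python"># Sketch: parameters for a CloudWatch Logs metric filter that counts ERROR logs.
# The log group, filter name, namespace and the JSON "level" field are assumptions.

def build_error_metric_filter(log_group_name, namespace):
    """Return keyword arguments shaped for the PutMetricFilter API."""
    return {
        "logGroupName": log_group_name,
        "filterName": "error-count",
        # Matches JSON log events whose "level" field equals "ERROR".
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": namespace,
                "metricValue": "1",  # each matching log event adds 1
            }
        ],
    }


params = build_error_metric_filter("/aws/lambda/process-queue", "WebhookService")
print(params["filterPattern"])
</code></pre>
<p>With boto3 installed and credentials configured, these parameters could be passed to the CloudWatch Logs client's <code class="language-">put_metric_filter</code> call, and an alarm could then be set on the resulting metric.</p>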
<blockquote class="common">
<h3 id="when-to-use-logging-"><a href="#when-to-use-logging-" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>When to use logging?</h3>
<p>Use logging if you need to better understand the events within the sub-systems for a particular request.</p>
<p>Think of it like leaving breadcrumbs in order to track down issues.</p>
<p>It is very useful when you are debugging errors in production!</p>
</blockquote>
<h3 id="logging-tool"><a href="#logging-tool" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Logging Tool</h3>
<ul>
<li><a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html" target="_blank" rel="nofollow noopener noreferrer">AWS Cloudwatch Logs</a></li>
</ul>
<h4 id="-helpful-reference-3"><a href="#-helpful-reference-3" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>📝 Helpful reference:</h4>
<ul>
<li><a href="https://microsoft.github.io/code-with-engineering-playbook/observability/correlation-id/" target="_blank" rel="nofollow noopener noreferrer">Microsoft CSE - Observability, correlation ID</a></li>
<li><a href="https://hilton.org.uk/blog/microservices-correlation-id" target="_blank" rel="nofollow noopener noreferrer">Microservice - correlation ID</a></li>
<li><a href="https://github.com/goldbergyoni/nodebestpractices/blob/master/sections/production/assigntransactionid.md" target="_blank" rel="nofollow noopener noreferrer">Node best practices - transactionid</a></li>
</ul>
<h2 id="conclusion"><a href="#conclusion" aria-hidden="true" tabindex="-1"><span class="icon icon-link"></span></a>Conclusion</h2>
<p>That’s it! I hope you found this guide helpful.</p>
<p>Let’s do a quick recap!</p>
<p><strong>Recap:</strong></p>
<ul>
<li>
<p><strong>Metrics</strong> - Use this for a high-level overview of the infrastructure</p>
</li>
<li>
<p><strong>Tracing</strong> - Use this to better understand performance of the sub-systems and the path taken across sub-systems in the sampled requests</p>
</li>
<li>
<p><strong>Logging</strong> - Use this to better understand the specific events occurring in the sub-systems during a request</p>
</li>
</ul>
<p>In general, for service health and performance, start with metrics, then move on to tracing or logging for more granular insights depending on the use case.</p>
<p>In a follow-up guide, we will get hands-on and apply some of the things discussed here, including setting up tracing and logging for the infrastructure we built in the technical series.</p>
<p>Stay tuned!</p>


A quick guide on tracing, logging and metrics for our AWS infrastructure

Jerry Chang


A pocket guide to service health and performance for AWS API gateway, Lambda, SQS
