<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Reliability Enablers: Observability focus]]></title><description><![CDATA[Covering concepts related to observability (aka o11y)]]></description><link>https://read.srepath.com/s/observability</link><image><url>https://substackcdn.com/image/fetch/$s_!hjhf!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ee1dc2-77bf-4ffa-b056-f66dac8ad0d0_128x128.png</url><title>Reliability Enablers: Observability focus</title><link>https://read.srepath.com/s/observability</link></image><generator>Substack</generator><lastBuildDate>Sat, 25 Apr 2026 02:06:07 GMT</lastBuildDate><atom:link href="https://read.srepath.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ash P]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[srepath@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[srepath@substack.com]]></itunes:email><itunes:name><![CDATA[Ash Patel]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ash Patel]]></itunes:author><googleplay:owner><![CDATA[srepath@substack.com]]></googleplay:owner><googleplay:email><![CDATA[srepath@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ash Patel]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[More telemetry makes reliability worse (until you fix the loop)]]></title><description><![CDATA[Every reliability engineer eventually learns the same painful truth: you can have a thousand dashboards showing you xyz and still miss the real 
signal.]]></description><link>https://read.srepath.com/p/more-telemetry-makes-reliability</link><guid isPermaLink="false">https://read.srepath.com/p/more-telemetry-makes-reliability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 23 Dec 2025 15:05:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4dc47b11-b51a-4875-b488-48a59c97282f_1280x869.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every reliability engineer eventually learns the same painful truth: you can have a thousand dashboards showing you <em>xyz</em> and still miss the real signal.</p><p>This might feel like an insurmountable hurdle at first glance.</p><p>One of those &#8220;it is what it is&#8221; situations. After all:</p><p>The more data we collect &#8594; the more noise we face &#8594; the less trust we have in our alerts &#8594; the slower we respond &#8594; the more incidents worsen &#8594; the more data we collect to compensate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZhJL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZhJL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 424w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 1272w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png" width="907" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:907,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://read.srepath.com/i/180281648?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZhJL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 424w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 1272w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In MIT&#8217;s world of System Dynamics, this noise amplification problem is what we&#8217;d call a <em>reinforcing 
loop</em>.</p><p>Information overload spirals upward as the loop keeps reinforcing itself, metaphorically snowballing as it goes. But here&#8217;s the thing&#8230;</p><p>Observability (o11y) isn&#8217;t just telemetry.</p><p>It&#8217;s also who interprets, triages, and learns from the telemetry: a <em>balancing loop</em>.</p><p>In a healthy system, every new signal that enters should trigger an equal and opposite stabilizing action, essentially a check-and-balance. That&#8217;s the balancing loop at work.</p><p>For example, when noise increases, teams should automatically slow alert creation or tighten signal thresholds until trust recovers.</p><p>When signal quality improves, they can loosen up again.</p><p>Without that feedback control, the system loses balance, and the painful reinforcing loop that I mentioned earlier takes over.</p><p>If your team doesn&#8217;t trust the data, or worse, doesn&#8217;t have time to translate it, your observability system isn&#8217;t truly &#8220;seeing everything&#8221;.</p><p>That&#8217;s why engineers with Staff+ potential treat incident retros and observability reviews like process tuning. They ask:</p><ul><li><p>Who sees which o11y signals, and when?</p></li><li><p>What incentives drive our attention to o11y signals?</p></li><li><p>Where does learning from outputs feed back into o11y design?</p></li></ul><p>Small interventions like taking the time to prune unhelpful alerts can have an outsized impact in the long run because they restore the balancing loop between data and actionability.</p><p>This should be your takeaway: reliability improves when observability helps people adjust the impact they have on the system, rather than just showing them the outputs of their services.</p>]]></content:encoded></item><item><title><![CDATA[How to Resolve Bad Observability Data Quality]]></title><description><![CDATA[Observability, data, and quality. If these 3 words don&#8217;t mean much to you, stop reading. Really. 
It means this guide is not for you. You can enjoy some cat videos on YouTube.]]></description><link>https://read.srepath.com/p/how-to-resolve-bad-observability</link><guid isPermaLink="false">https://read.srepath.com/p/how-to-resolve-bad-observability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 18 Jun 2024 12:14:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BTuZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We previously covered 3 observability data flow issues that can plague incident response and other SRE activities.</p><p>There&#8217;s another issue: data quality. Imagine going into an incident and being unsure about the reliability of the data you&#8217;re using.</p><p>That&#8217;s what can happen when you have observability data quality issues.</p><p>The scary thing is that it&#8217;s really easy to end up with low-quality data.</p><p>We&#8217;ll first explore how data quality issues manifest in observability.</p><p>We&#8217;ll then look at the specific problems in data quality.</p><p>And finally, we&#8217;ll work through a few potential solutions.</p><p>Sound good? Let&#8217;s get started.</p><h2>Unpacking poor data quality</h2><p>In plain English, poor data quality means you risk getting <em>dodgy or unreliable</em> data.</p><p>It&#8217;s like trying to fill a swimming pool from a ground well that you don&#8217;t know how to manage properly.</p><p>In this scenario:</p><ul><li><p>the swimming pool is your observability data lake <em>and</em></p></li><li><p>the water is observability data</p></li></ul><p>You might be able to fill your observability data lake or warehouse from the unverified water source, but the pipeline also fills it with swirls of dirt and sludge.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BTuZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!BTuZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BTuZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!BTuZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!BTuZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BTuZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg" width="350" height="350" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:350,&quot;bytes&quot;:233913,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!BTuZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BTuZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!BTuZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!BTuZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6f851f-4349-4138-8138-ce873c573614_1024x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The data is tainted with errors or inaccuracies.</p><p><strong>End result and risk:</strong> your observability data lake fills up with dirty or useless data, making it hard to trust the insights you gain from it.</p><p>Poor data quality can manifest because of issues like high data cardinality, noisy data, and a weak sampling strategy.</p><p>Let&#8217;s now explore a few of these data quality problems&#8230;</p><h2>High data cardinality</h2><p>High cardinality is when a field in your dataset has a large number of distinct values.</p><p>It can happen if you pick a field whose values are almost all unique, like user IDs. That&#8217;s never recommended. It can also happen for more strategic fields like instance ID.</p><p>High cardinality can make your data quality issues worse by making it harder to handle duplications and inconsistencies. 
Bottlenecks can result in weak data output, which will give you a lo-fi picture of a hi-fi software system.</p><p>The key here is to avoid excessive cardinality data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RNSU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RNSU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 424w, https://substackcdn.com/image/fetch/$s_!RNSU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 848w, https://substackcdn.com/image/fetch/$s_!RNSU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 1272w, https://substackcdn.com/image/fetch/$s_!RNSU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RNSU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png" width="636" height="615" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:636,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:669885,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RNSU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 424w, https://substackcdn.com/image/fetch/$s_!RNSU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 848w, https://substackcdn.com/image/fetch/$s_!RNSU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 1272w, https://substackcdn.com/image/fetch/$s_!RNSU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfedd4a3-48d6-4de8-96a4-4ace27b540af_636x615.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I explore this idea in more detail in the article <em>The Cardinality Conundrum in Observability</em>.</p><h2>Noisy data</h2><p>A system running at scale can generate a gargantuan amount of data.</p><p>But it&#8217;s almost certain that you don&#8217;t need all of this data to understand your system&#8217;s behavior.</p><p>Noisy data is what you get when an excessive amount of data overshadows the meaningful signals you 
need to solve problems.</p><p>For example, a logging system might track all events, including routine informational messages and a myriad of overcautious warnings.</p><p>You&#8217;re essentially wading through a flood of logs to find the important error messages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-J2n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-J2n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 424w, https://substackcdn.com/image/fetch/$s_!-J2n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 848w, https://substackcdn.com/image/fetch/$s_!-J2n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 1272w, https://substackcdn.com/image/fetch/$s_!-J2n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-J2n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png" width="397" height="376.00777604976673" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:643,&quot;resizeWidth&quot;:397,&quot;bytes&quot;:201130,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-J2n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 424w, https://substackcdn.com/image/fetch/$s_!-J2n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 848w, https://substackcdn.com/image/fetch/$s_!-J2n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 1272w, https://substackcdn.com/image/fetch/$s_!-J2n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf7ca6-9493-4e92-9c53-5742a506ceb2_643x609.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You can try to resolve this by filtering out noise at collection time or at analysis time.</p><p>You can also set up better thresholds and apply better sampling.</p><h2>Weak sampling strategy</h2><p>Sampling is a common practice in observability. 
You are essentially picking out small slivers of your full observability data to analyze and query.</p><p>This is useful when <em>all</em> the data would be <em>too much</em> to handle.</p><p>You can try to complete the puzzle of your full data through strategic sampling.</p><p>But a bad sampling strategy can hurt your observability data quality.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fDJJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fDJJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fDJJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fDJJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!fDJJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fDJJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg" width="360" height="360" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:676,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:262157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fDJJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fDJJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fDJJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!fDJJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45c1250c-249a-4d2d-b744-c53d820f770c_676x676.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s explore a few of the characteristics of a bad sampling strategy:</p><h3>Non-random sampling</h3><p>It&#8217;s surprisingly easy to fall into this.</p><p>You might be used to picking out data samples from certain time periods such as business hours.</p><p>Or you might pick samples based on a fixed set of criteria.</p><p>Either way, you are not allowing random sampling to occur.</p><p>End result: skewed insights that don&#8217;t accurately represent your system&#8217;s behavior.</p><h3>Sampling too small</h3><p>The right sample size depends on the size of the dataset you have.</p><p>For example, taking 10 entries out of a 1,000-entry database can give you very skewed results.</p><p>For small datasets of under 100 entries, you could viably sample 20-50% of the data.</p><p>For larger datasets, you can reduce the sample size to 5-10% of the data.</p><p>For mega datasets with 100,000+ entries &#8212; like the ones you typically see in systems at scale &#8212; the sample size can be as low as 1%.</p><p>Keep in mind that these numbers are guidelines. Use engineering judgment for your context.</p><h3>Ignoring dependencies</h3><p>It&#8217;s easy to overlook the fact that your system is made up of many different parts connected together.</p><p>Ignore these connections at your peril. 
Or at least your sampling&#8217;s peril.</p><p>Having an idea of the relationships between parts lets you sample effectively.</p><p>Imagine you&#8217;re responsible for diagnosing latency issues in a web application with components like web servers, databases, and caching systems.</p><p>You already know that these components depend on each other.</p><p>Their interactions can affect the overall performance of the application.</p><p><strong>Here&#8217;s what can happen without considering these dependencies:</strong></p><p>You randomly select data points to analyze without considering how the components interplay.</p><p>You notice spikes in latency but struggle to understand why they occur.</p><p>That makes the problem hard to solve.</p><p><strong>Here&#8217;s what can happen if you consider dependencies:</strong></p><p>Latency in your web application is highly affected by interactions between the web server and the database.</p><p>So you focus your sampling on periods when there are simultaneous spikes in web server usage and database queries.</p><p>A few telltale signs come up when you sample moments when all 3 components are under higher load.</p><p>You discover that certain database queries are causing delays in the web server response time.</p><p>This is something you can work on resolving.</p><h2>Wrapping up</h2><p>As you now know, data quality in observability is an ongoing issue.</p><p>It requires handling challenges like high cardinality, noisy data, and weak sampling strategies.</p><p>Some of the highlights of what I suggested include avoiding excessive cardinality, implementing effective noise reduction, and considering dependencies in sampling.</p><p>Maybe then you can get more accurate insights for solving problems in your complex system.</p>]]></content:encoded></item><item><title><![CDATA[How to Solve 3 Data Flow Issues in Observability]]></title><description><![CDATA[Going into an incident, the last thing you want is an incomplete picture of what&#8217;s going on under the hood. That&#8217;s what can happen when you have observability data flow issues.]]></description><link>https://read.srepath.com/p/how-to-solve-3-data-flow-issues-in</link><guid isPermaLink="false">https://read.srepath.com/p/how-to-solve-3-data-flow-issues-in</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Fri, 07 Jun 2024 12:03:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sieJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sieJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!sieJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 424w, https://substackcdn.com/image/fetch/$s_!sieJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 848w, https://substackcdn.com/image/fetch/$s_!sieJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 1272w, https://substackcdn.com/image/fetch/$s_!sieJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sieJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png" width="1017" height="712" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:1017,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155170,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!sieJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 424w, https://substackcdn.com/image/fetch/$s_!sieJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 848w, https://substackcdn.com/image/fetch/$s_!sieJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 1272w, https://substackcdn.com/image/fetch/$s_!sieJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1f2df3-e860-46d5-968c-0298a2373e9e_1017x712.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The scary thing is &#8212; it&#8217;s really easy to fall into this situation.</p><p>We&#8217;ll first explore how data flow issues manifest in observability.</p><p>We&#8217;ll then look at the specific problems in data flow.</p><p>And finally, we&#8217;ll work through a few potential solutions.</p><p>Sound good? Let&#8217;s get started.</p><h2>Unpacking poor data flow</h2><p>In plain English, you risk not getting <em>enough</em> data.</p><p>It&#8217;s like trying to fill a swimming pool with water using leaky or clogged piping.</p><p>In this scenario:</p><ul><li><p>the swimming pool is your observability data lake</p></li><li><p>the water is observability data <em>and</em></p></li><li><p>the clogged piping is your observability data pipeline.</p></li></ul><p>With leaky piping, the data you need might not be coming through properly to fill the pool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y8FA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y8FA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Y8FA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Y8FA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Y8FA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y8FA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg" width="288" height="288" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:288,&quot;bytes&quot;:893081,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y8FA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Y8FA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Y8FA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Y8FA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe5825b-cd4e-4d2c-afb0-a49f23a431e4_2048x2048.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>End result and risk:</strong> your big observability data pool isn&#8217;t as full as it should be, leaving you with an incomplete picture.</p><p>Before we dive into specific issues, let&#8217;s do a refresher on how observability data flows&#8230;</p><h2>Refresher on data flow in observability</h2><p>I mentioned in the guide Key Observability Concepts Explained that observability data flows through 4 stages. It flows like so:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CfRM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CfRM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 424w, https://substackcdn.com/image/fetch/$s_!CfRM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 848w, https://substackcdn.com/image/fetch/$s_!CfRM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 1272w, https://substackcdn.com/image/fetch/$s_!CfRM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CfRM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png" width="1456" height="409" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67147,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CfRM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 424w, https://substackcdn.com/image/fetch/$s_!CfRM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 848w, https://substackcdn.com/image/fetch/$s_!CfRM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 1272w, https://substackcdn.com/image/fetch/$s_!CfRM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bee2122-bc53-4342-8dde-87dc11590f33_1515x426.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I also mentioned in that guide that the arrows carry meaning.</p><p>They denote how easy (or not) the flow is from one stage to the next.</p><p>Let&#8217;s explore these flow relationships:</p><h3><strong>Instrumentation &#8658; Ingestion</strong></h3><p>This stage is when the data goes from instrumented code and components <em>via collectors</em> to ingestion at the central observability system.</p><p>It <em>should</em> be straightforward but can get a little tricky if you have a large variety of data sources, poorly formatted data, or real-time requirements.</p><h3><strong>Ingestion &#8658; Storage</strong></h3><p>This stage is less tricky as the data is already in a format that databases like and simply needs to flow into storage.</p><p>But you can 
get issues here when there is a large volume of data to push or data coming in at high velocity.</p><h3><strong>Storage &#8658; Usage</strong></h3><p>You can expect a myriad of issues at this stage because data retrieval can be a pain.</p><p>Query complexity can make time-to-usability slower than you&#8217;d like. On top of that, visualizations need intensive processing power.</p><p>&#8505;&#65039; Data flow is not the only observability data issue. Check out this guide on the <em>other</em> major issue afflicting observability data: data quality.</p><p>This is a high-level view of how the data flows and some kinks that can slow things down.</p><p>But there are more serious issues that can plague data flow.</p><p>Let&#8217;s explore them now&#8230;</p><h2>What are the 3 main data flow issues?</h2><p>We can put data flow issues into 3 buckets:</p><ol><li><p>Siloed Data</p></li><li><p>Latency &amp; Delays</p></li><li><p>Incomplete Data</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2nON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2nON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 424w, https://substackcdn.com/image/fetch/$s_!2nON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 848w, 
https://substackcdn.com/image/fetch/$s_!2nON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 1272w, https://substackcdn.com/image/fetch/$s_!2nON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2nON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png" width="1278" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87519,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2nON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 424w, https://substackcdn.com/image/fetch/$s_!2nON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 848w, 
https://substackcdn.com/image/fetch/$s_!2nON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 1272w, https://substackcdn.com/image/fetch/$s_!2nON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980e0a82-e278-4a9d-afcb-abee70c178ac_1278x818.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We&#8217;ll explore each of these buckets in detail in a moment.</p><p>Before we do that, let me explain the star rating system I will use later on to guide you through the tasks inside each bucket.</p><p>I will rate each task on a 5-star (&#9733;&#9733;&#9733;&#9733;&#9733;) scale for the following:</p><ul><li><p><strong>ease of doing</strong> &#8212; how easy it would be to implement <em>and</em> keep doing</p></li><li><p><strong>cost</strong> &#8212; factoring in engineer time and potential vendor costs</p></li><li><p><strong>impact</strong> &#8212; what level of difference it can make to data flow in observability systems</p></li></ul><p>(Spoiler alert: nothing gets 5 stars for impact because that is reserved for tasks with exceptionally high or transformative impact.)</p><p>My ratings are based on the assumptions that:</p><ul><li><p>the work will get assigned to SREs with at least 1-3 years of experience</p></li><li><p>the team aims to get open-source tooling rather than pay vendors from Day 1 <em>and</em></p></li><li><p>observability is considered pivotal to providing system insights</p></li></ul><p>For each of the 3 issue types, I&#8217;ll give a quick overview then break down the associated tasks.</p><p>Let&#8217;s begin!</p><h2>Overview of incomplete data</h2><p>Incomplete data means that your observability data has gaps. The size of those gaps depends on the number of issues that are plaguing it. 
By issues, I mean:</p><ol><li><p>technical errors</p></li><li><p>misconfigurations and</p></li><li><p>infrastructure limitations</p></li></ol><h2>How to reduce incomplete observability data</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vs47!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vs47!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vs47!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vs47!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vs47!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vs47!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg" width="1359" height="983" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:983,&quot;width&quot;:1359,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vs47!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vs47!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vs47!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vs47!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019438c-e51d-404d-bfee-7f4183577b06_1359x983.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Summary concept map for solving incomplete observability data issues</p><p>Let&#8217;s look into 6 ways you can resolve the above 3 issues:</p><h3><strong>Reduce technical errors</strong></h3><ol><li><p><strong>Check your error logs</strong></p><p>Step 1 is having a robust error logging process in the first place. You should then have a simple mechanism for checking errors, exceptions, and failures.</p><p><em>Ease of Doing</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>Setting up a robust error-logging process is manageable, and checking errors is relatively easy but consistently doing so requires manual effort</p></li></ul><p><em>Cost:</em> &#9733;&#9733;&#9734;&#9734;&#9734;</p><ul><li><p>The cost of setting up error logging is reasonable. Manual checks incur some operational overhead but don&#8217;t take up excessive time</p></li></ul><p><em>Impact:</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>Regular checking of error logs has a high impact on identifying and resolving issues promptly, providing valuable insights into system health</p></li></ul></li><li><p><strong>Monitor for errors</strong></p><p>Implement monitoring tools to track and alert when technical errors occur in the collection stage. 
This helps identify issues in real time.</p><p><em>Ease of Doing:</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>Monitoring tools involve some setup and configuration, which is moderately straightforward, but only if you&#8217;re already familiar with monitoring tools.</p></li></ul><p><em>Cost:</em> &#9733;&#9733;&#9733;&#9734;&#9734;</p><ul><li><p>Monitoring tools may come with vendor costs, but the investment is justifiable because it provides the real-time insights needed to assure reliability</p></li></ul><p><em>Impact:</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>Real-time identification of errors has a high impact on system reliability, enabling proactive issue resolution and minimizing downtime.</p></li></ul></li></ol><h3><strong>Reduce misconfigurations</strong></h3><ol><li><p><strong>Audit configurations.</strong></p><p>Set a regular cadence to check if your system is configured correctly.</p><p><em>Ease of Doing:</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>After the initial troubleshooting and training period, conducting regular audits of configurations becomes more straightforward</p></li></ul><p><em>Cost:</em> &#9733;&#9733;&#9734;&#9734;&#9734;</p><ul><li><p>The cost of manual audits is low to moderate in most situations, so long as your Director of SRE doesn&#8217;t get involved</p></li></ul><p><em>Impact:</em> &#9733;&#9733;&#9733;&#9734;&#9734;</p><ul><li><p>Regular audits are like the annual health checkup that helps maintain system reliability</p></li></ul></li><li><p><strong>Automated audits.</strong></p><p><em>Ease of Doing:</em> &#9733;&#9733;&#9733;&#9734;&#9734;</p><ul><li><p>Automated audits may require some scripting and/or integration work. You gotta code it, so it takes a moderate level of effort for an SRE with coding skills</p></li></ul><p><em>Cost:</em> 
&#9733;&#9733;&#9733;&#9734;&#9734;</p><ul><li><p>Automation tools come with costs from engineering time, but the efficiency gained in automated checks justifies the investment</p></li></ul><p><em>Impact:</em> <em>&#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Automated audits enhance efficiency by proactively identifying and resolving misconfigurations, contributing to system stability</p></li></ul></li></ol><h3><strong>Overcome infrastructure limitations</strong></h3><ol><li><p><strong>Monitor for bottlenecks.</strong></p><p>Use performance monitoring on the observability system itself to find where the bottlenecks are</p><p><em>Ease of Doing:</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>SREs are adept at running performance monitoring tools and addressing bottlenecks</p></li></ul><p><em>Cost:</em> &#9733;&#9733;&#9733;&#9734;&#9734;</p><ul><li><p>The cost of performance monitoring tools varies, as does the time taken to implement and support them, but it&#8217;s worth the impact</p></li></ul><p><em>Impact:</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>Getting rid of bottlenecks can have a high impact on data flow</p></li></ul></li><li><p><strong>Scalability planning.</strong></p><p>Scale the observability infrastructure to handle increases in data volume if the growth shows a consistent pattern</p><p><em>Ease of Doing:</em> &#9733;&#9733;&#9733;&#9734;&#9734;</p><ul><li><p>SREs with at least 1-2 years of experience should be familiar with capacity planning and the system dynamics that govern it</p></li></ul><p><em>Cost:</em> <em>&#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>The cost of scalability planning involves strategic investment in infrastructure to handle increased data volumes</p></li></ul><p><em>Impact:</em> &#9733;&#9733;&#9733;&#9733;&#9734;</p><ul><li><p>Proper scalability planning means the system will be able to handle increased observability data volumes</p></li></ul></li></ol><p>A few notes:</p><ul><li><p>With all of 
the above tasks, you are mitigating a problem that has already occurred</p></li><li><p>The order in which you tackle the tasks depends on the problems you find in your observability system&#8217;s data flow</p></li><li><p>It&#8217;s not easy to prevent any of these except perhaps persistent infrastructure scaling issues through autoscaling (super tricky work)</p></li></ul><p>That&#8217;s why you should remain vigilant of errors and misconfigurations in the observability system</p><h2>Overview of latency and delays</h2><p>You&#8217;ll find that one of the most useful aspects of modern observability systems is that they can give us real-time insights.</p><p>But this implies your data pipeline is capable of pushing real-time data through all stages in a timely fashion. Latency and delays in data flow will hinder this.</p><p>But the good news is that latency and delay issues can be prevented with good practices, which we will cover now:</p><h2>How to reduce latency &amp; delay issues</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uCv1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uCv1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uCv1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uCv1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uCv1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!uCv1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg" width="1456" height="870" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:870,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uCv1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uCv1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uCv1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uCv1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72072036-6c35-4c14-82b3-8337ab232145_1525x911.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Summary concept map for solving observability latency and delay issues</p><p>We are trying to improve the timeliness of data and minimize bottlenecks. 
To do that, we can:</p><ol><li><p><strong>Implement real-time monitoring.</strong></p><p>Not for your software, but for the observability system itself, to see how its data flow is performing at various points</p><p><em>Ease of Doing: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Real-time monitoring for the observability system&#8217;s data flow itself is generally achievable with modern monitoring tools</p></li></ul><p><em>Cost: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>The cost is moderate because it involves vendor costs and the possibility of needing additional infrastructure</p></li></ul><p><em>Impact: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Real-time monitoring significantly boosts the observability system&#8217;s ability to detect issues promptly for swift resolution</p></li></ul></li><li><p><strong>Improve load balancing.</strong></p><p>Implement load-balancing strategies including but not limited to:</p><ul><li><p>Weighted round-robin balancing: distribute more requests to nodes weighted as having more processing capacity</p></li><li><p>Least connections balancing: route requests to the nodes with the fewest active connections at a given time</p></li><li><p>Geographic load balancing: route requests to the nearest node to minimize latency</p></li></ul><p><em>Ease of Doing: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>SREs usually stay on top of a few load-balancing strategies for actual application workloads, but it can take time to review the observability system architecture</p></li></ul><p><em>Cost: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>Costs are mainly associated with engineering time to set up load-balancing configurations but might also include commercial load-balancing tools</p></li></ul><p><em>Impact: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Improving load balancing has a significant impact on system performance through more efficient distribution of 
requests, lower latency, and fewer overloads</p></li></ul></li><li><p><strong>Optimize data pipelines.</strong></p><p>Set a regular cadence to review and enhance data processing pipelines to reduce bottlenecks and streamline processes</p><p><em>Ease of Doing:</em> &#9733;&#9733;&#9734;&#9734;&#9734;</p><ul><li><p>Optimizing data pipelines requires a systematic approach to reviewing data processing stages and making adjustments to configurations and workflows</p></li></ul><p><em>Cost: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>The cost is moderate because it mainly involves engineer time for the initial review work and then periodic reviews and enhancements</p></li></ul><p><em>Impact: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Optimizing data pipelines has a significant impact on reducing pipeline bottlenecks</p></li></ul></li></ol><h2>Overview of siloed data</h2><p>Observability systems pull data from multiple sources including applications, infrastructure, networks, and more.</p><p>Because all of these components are separate, there&#8217;s a risk that your observability data ends up sitting in siloes next to each component.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KMP-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KMP-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 424w, 
https://substackcdn.com/image/fetch/$s_!KMP-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 848w, https://substackcdn.com/image/fetch/$s_!KMP-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 1272w, https://substackcdn.com/image/fetch/$s_!KMP-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KMP-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png" width="710" height="663" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:663,&quot;width&quot;:710,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28679,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KMP-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 424w, 
https://substackcdn.com/image/fetch/$s_!KMP-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 848w, https://substackcdn.com/image/fetch/$s_!KMP-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 1272w, https://substackcdn.com/image/fetch/$s_!KMP-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7abdabbe-7f03-4f2a-8329-3e9acf914d2e_710x663.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This defeats the whole purpose of observability systems which are supposed to bring together data from multiple disparate sources to give you a big picture.</p><p>Lord help you if you have to solve this problem because doing so almost turns you into a data engineer! Almost&#8230;</p><h2>How to prevent siloed data</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rwro!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rwro!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rwro!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rwro!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rwro!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rwro!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg" width="1217" height="883" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:883,&quot;width&quot;:1217,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102597,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rwro!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rwro!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rwro!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rwro!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda5c80aa-5b3c-4ab1-95ef-8ed45c5fd492_1217x883.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Summary concept map for busting siloes in observability data</p><p>Siloed data requires bringing in data tools and techniques to unite data. 
You may need to:</p><ol><li><p><strong>Regularly audit existing integrations.</strong></p><p>Define policies that guide the data your system should generate and then regularly check against them.</p><p><em>Ease of Doing: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Regularly auditing existing integrations is straightforward once you have properly defined your policies</p></li></ul><p><em>Cost: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>The cost is mainly engineer time spent on the initial integration inspections, creating the policy, and then the ongoing checks</p></li></ul><p><em>Impact: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>Regular audits are a simple, moderate-impact way to solve data flow issues stemming from integrations</p></li></ul></li><li><p><strong>Centralize metadata management.</strong></p><p>Metadata can be useful for identifying various sources of data to pull into the central observability system, so keep a system that tracks it.</p><p><em>Ease of Doing: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Tracking metadata is a well-documented practice and can be done with ready-to-use tooling like the AWS Glue Data Catalog and Apache Atlas</p></li></ul><p><em>Cost: &#9733;&#9733;&#9734;&#9734;&#9734;</em></p><ul><li><p>The cost can be reasonable and mainly involves engineer time to set up and maintain the metadata management system</p></li></ul><p><em>Impact: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>Centralizing metadata management is impactful for identifying and tracking data sources so you can organize the data and push it in the right direction</p></li></ul></li><li><p><strong>Implement data integration middleware.</strong></p><p>This is not as uncommon as you may think: observability systems can stream data through tooling like Kafka</p><p><em>Ease of Doing: &#9733;&#9733;&#9734;&#9734;&#9734;</em></p><ul><li><p>Setting up and configuring middleware tools comes with a degree of 
complexity. It&#8217;s a cakewalk for SREs who know Kafka, but those without this skill might need more time</p></li></ul><p><em>Cost: &#9733;&#9733;&#9733;&#9734;&#9734;</em></p><ul><li><p>The cost is moderate as it involves engineer time <s>to learn Kafka</s> for setup and maintenance of middleware tooling, a lot of which doesn&#8217;t have vendor fees</p></li></ul><p><em>Impact: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>This integration can have a high level of impact if you identify data flow issues due to disparate sources</p></li></ul></li><li><p><strong>Explore data federation models.</strong></p><p>You could implement techniques that help you aggregate data from various sources in real time, as needed</p><p><em>Ease of Doing: &#9733;&#9733;&#9734;&#9734;&#9734;</em></p><ul><li><p>The concept of data federation models is not super complex, but you need some background with it to hit the ground running</p></li></ul><p><em>Cost: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>You may need to invest in tools or technologies that support data federation (like customized open source), but it is a strategic investment if you end up needing to do it</p></li></ul><p><em>Impact: &#9733;&#9733;&#9733;&#9733;&#9734;</em></p><ul><li><p>Data federation models have a significant impact on aggregating real-time data from various sources</p></li></ul></li></ol><h2>Wrapping up</h2><p>This is kind of meta, but observability systems need their own observability: <em>meta-observability</em>! I mentioned monitoring several times as a way to prevent or resolve data flow issues.</p><p>The solutions I&#8217;ve mentioned are just springboards for you to explore each area in more detail on your own. If you have any specific questions, feel free to ask.</p><p>It&#8217;s important that we address the challenges I&#8217;ve outlined. 
That&#8217;s what it will take to make observable data flow as smooth and pure as freshly melted butter.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K5hN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K5hN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K5hN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K5hN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K5hN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K5hN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg" width="474" height="382.19505494505495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1174,&quot;width&quot;:1456,&quot;resizeWidth&quot;:474,&quot;bytes&quot;:629335,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K5hN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K5hN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K5hN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K5hN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed677a-f3ce-4b53-9af3-58c0ca69edae_2007x1618.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Get to know OpenTelemetry without the confusion]]></title><description><![CDATA[OpenTelemetry is the 2nd most popular CNCF project ever. It can boost your observability and is not rocket science once you get the hang of it. But it feels like it early on. Let's demystify OTel.]]></description><link>https://read.srepath.com/p/get-to-know-opentelemetry-without</link><guid isPermaLink="false">https://read.srepath.com/p/get-to-know-opentelemetry-without</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Thu, 23 May 2024 12:15:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fLTD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What does OpenTelemetry solve?</h2><p>It&#8217;s a framework that promises to solve several messy problems in observability.</p><p>But at its core, it solves one thing unlike anything else out there.</p>
<p>Fragmented instrumentation and collection.</p><p>It can instrument all kinds of services for different kinds of data like metrics, logs, and traces and then process them for you.</p><p>In other words, OpenTelemetry can make it painless to get the data you need on how your software system is performing.</p><p><strong>It helps you generate and capture telemetry data that contributes to better observability.</strong></p><p>You&#8217;ll also hear some people say it democratizes the process of collecting observability data.</p><p>In plain English, it can cut the risk of vendor lock-in. We&#8217;ll talk about this later.</p><h2>What was observability like before OpenTelemetry?</h2><p>Viktor Farcic of DevOps Toolkit has said that &#8220;OpenTelemetry can help fix the observability mess.&#8221;</p><p>His words make sense when you think of having to configure and maintain a multitude of tools just to get your telemetry data.</p><p>Before OpenTelemetry, it was not unusual to find 5 to 10 different tools and methods for collecting telemetry data in one software system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fLTD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!fLTD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 424w, https://substackcdn.com/image/fetch/$s_!fLTD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 848w, https://substackcdn.com/image/fetch/$s_!fLTD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 1272w, https://substackcdn.com/image/fetch/$s_!fLTD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fLTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png" width="714" height="502" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:502,&quot;width&quot;:714,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:528591,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!fLTD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 424w, https://substackcdn.com/image/fetch/$s_!fLTD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 848w, https://substackcdn.com/image/fetch/$s_!fLTD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 1272w, https://substackcdn.com/image/fetch/$s_!fLTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d0c3161-021d-4182-9daf-11babfbb3051_714x502.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Before OpenTelemetry, there was a mix and match of tooling that collected disparate telemetry data.</p><p>You had a variety of SDKs, collectors, protocols, APIs, and agents trying to work together.</p><p>Each new tool added a layer of complexity because Tool A is configured differently from Tool B, which is configured differently from Tool C, and so on.</p><p>So engineers had to stay on top of it all.</p><p>You&#8217;ll hear people describe this situation as a lack of standardization and interoperability. Big words highlighting big problems.</p><p>The lack of standardization meant engineers struggled to get consistent outcomes across different languages, frameworks, and environments.</p><p>The worst part was that, even if you put in the hard work to keep everything running, you still ended up with fragmented data.</p><p>You then faced serious challenges in integrating the data and all kinds of other data flow issues.</p><p>By unifying how data is instrumented, collected, and exported, OpenTelemetry promises to change this for good.</p><p>Its interoperability also allows you to change vendors for your querying and visualization work.</p><h2>Key benefits of OpenTelemetry</h2><ol><li><p>Reduces fragmentation from disparate observability tooling</p></li><li><p>Reduces the risk of vendor lock-in</p></li><li><p>Offers comprehensive observability covering logs, metrics, and traces</p></li><li><p>Ensures a consistent approach to observability across the organization</p></li><li><p>Works well within complex cloud-native environments</p></li></ol><h2>How does OpenTelemetry prevent vendor lock-in?</h2><p>You might 
hear from people that OTel makes vendor lock-in a thing of the past.</p><p>Or at least makes it easier for you to switch out components because nothing is proprietary.</p><p>The absence of proprietary elements means that users can switch out instrumentation libraries, collectors, or exporters with ease.</p><p>The key idea here is that you aren&#8217;t tied to any vendor&#8217;s specific tech stack.</p><p>You don&#8217;t have to recode or re-instrument your services when you decide to change tooling or vendors.</p><p>Which might be a nice thing when you&#8217;re looking to try something else.</p><p>Now you might be thinking: why would an observability vendor want this to happen?</p><p>It could be that they think OpenTelemetry will enable organizations to instrument more services, leading to a larger volume of observability data.</p><p>Most commercial plans are priced based on data storage and usage. And there you are.</p><p>It&#8217;s still a win for engineers because you don&#8217;t have to use commercial solutions. You could go with unsupported open source, but that rarely works for actual organizations.</p><h2>Where does OpenTelemetry collect data from?</h2><p>The simple answer is almost everywhere you can think of.</p><p>OpenTelemetry pulls data from:</p><ul><li><p>your frontend and backend, with support for a whole bunch of languages like Go, Java, Python, Ruby, JavaScript, and even Erlang</p></li><li><p>containerized environments like Docker and Kubernetes</p></li><li><p>all the major cloud providers including AWS, Azure, and GCP</p></li><li><p>existing observability tools like Prometheus, Jaeger, and more</p></li></ul><h2>What kind of data does OpenTelemetry collect?</h2><p>It can handle the 3 core signals of observability at varying levels of compatibility.</p><p>By core signals, I mean logs, metrics, and traces.</p><p>Traces are the best-covered signal across different programming languages.</p><p>That might have something to do with the much older OpenTracing folding into OTel.</p><p>OpenTelemetry&#8217;s working groups are also investigating other signal types like continuous profiling.</p><p>That&#8217;s when you continuously collect data on the application&#8217;s runtime. That might cover aspects like CPU usage, memory usage, function calls, and more.</p><h2>What is OpenTelemetry&#8217;s architecture like?</h2><p>In the fewest possible words, I&#8217;d say OpenTelemetry is a <strong>loosely coupled framework</strong>.</p><p>You can use all of it or some of it. It&#8217;s super flexible!</p><p>You could just use the SDK to instrument your services and then send the data straight to tracing backends like Jaeger or Zipkin.</p><p>Or you could go the other way around and use only the OTel collector. When would you do this? 
Let&#8217;s say you have Kafka as a middleware to stream data from various services.</p><p>You could stream data into the collector to push out to your exporter and then the visualization tool of choice.</p><p>For simplicity, I&#8217;m going to break OTel into 3 separate areas: instrumentation, deployment, and integration.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QGdo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QGdo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 424w, https://substackcdn.com/image/fetch/$s_!QGdo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 848w, https://substackcdn.com/image/fetch/$s_!QGdo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 1272w, https://substackcdn.com/image/fetch/$s_!QGdo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QGdo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png" width="1456" height="284" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:284,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21273,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QGdo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 424w, https://substackcdn.com/image/fetch/$s_!QGdo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 848w, https://substackcdn.com/image/fetch/$s_!QGdo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 1272w, https://substackcdn.com/image/fetch/$s_!QGdo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08a7af0c-e65a-4271-8f16-54fe9d0643dc_2423x473.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Let&#8217;s talk about instrumentation first because&#8230; that&#8217;s the first part of the process.</p><h2>(Phase 1 of 3) Instrumentation</h2><p>This initial phase marks where you instrument (or place) OpenTelemetry code within your services.</p><p>It gives your applications and components the ability to emit telemetry data.</p><p>Like I mentioned 
earlier, OTel has most of the popular programming languages covered.</p><p>Your job at this point is to incorporate the right SDK or instrumentation libraries into your application components.</p><p>Depending on what kind of data you need from the specific application, you will need to:</p><ul><li><p>create spans for trace data</p></li><li><p>instrument to collect performance metrics and/or</p></li><li><p>add logging statements to generate logs</p></li></ul><p>You can do all of this using API calls within the SDK you&#8217;re using.</p><p>You can choose how you instrument your system&#8217;s components in 3 modes:</p><ol><li><p>auto instrumentation</p></li><li><p>manual instrumentation</p></li><li><p>a hybrid of manual and auto instrumentation</p></li></ol><p>The choice you make will make a huge difference to the amount of time and energy you spend on the initial OTel setup.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lxda!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lxda!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 424w, https://substackcdn.com/image/fetch/$s_!Lxda!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 848w, https://substackcdn.com/image/fetch/$s_!Lxda!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Lxda!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lxda!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png" width="1097" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1097,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10091,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lxda!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 424w, https://substackcdn.com/image/fetch/$s_!Lxda!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 848w, https://substackcdn.com/image/fetch/$s_!Lxda!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Lxda!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef52d29a-3096-411f-bb45-00bf6f0acf8a_1097x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Let&#8217;s explore each of these in more detail:</p><h3><strong>Auto Instrumentation</strong></h3><p>This is when the instrumentation mechanism automatically attaches to your application&#8217;s runtime and injects code to capture telemetry data.</p><p><strong>When would you do this?</strong></p><ul><li><p>When you want to instrument your application quickly without making extensive changes to the code</p></li><li><p>When your application or other component follows common patterns that work well with auto instrumentation</p></li><li><p>When you need to capture telemetry data from a service that you don&#8217;t directly control e.g. a cloud provider&#8217;s database</p></li></ul><p><strong>When might you decide against it?</strong></p><ul><li><p>Auto instrumentation can have limitations in capturing domain-specific metrics (e.g. conversion rate in eCommerce) or custom metrics that fit only your use case</p></li></ul><h3><strong>Manual Instrumentation</strong></h3><p>This is when you add the instrumenting code to the application yourself, without any automation.</p><p><strong>When would you do this?</strong></p><ul><li><p>When you need fine-grained control over what and how you instrument, like the custom metrics I mentioned, or when your application doesn&#8217;t follow standard patterns</p></li><li><p>It can also come in useful with legacy systems or unique architectures</p></li></ul><p><strong>When might you decide against it?</strong></p><ul><li><p>If you&#8217;re pressed for time or don&#8217;t have the resources to do it because manual instrumentation implies manual work</p></li><li><p>When you don&#8217;t have direct access to the codebase of the service you need to instrument</p></li></ul><h3><strong>Hybrid of Manual and Auto Instrumentation</strong></h3><p>A mix of the automatic and manual modes I highlighted earlier.</p><p><strong>When would you do this?</strong></p><ul><li><p>When you have a clear idea of which components are eligible for auto instrumentation and which are more suitable for manual tweaks</p></li><li><p>A hybrid approach is useful for transitioning from manual to auto instrumentation or vice versa</p></li></ul><p><strong>When might you decide against it?</strong></p><ul><li><p>When you lack the planning capability to assign the right mode of instrumentation for components</p></li><li><p>This process requires consistent maintenance work as components evolve or completely change</p></li></ul><p>Instrumenting your application doesn&#8217;t push the data out for collection at this point.</p><p>That comes in the next phase: deployment.</p><h3>Advice for instrumenting complex software systems</h3><p>Before you start to instrument your code, you need to know how OpenTelemetry will best complement your system.</p><p>So you need to audit your 
stack.</p><p>Have a clear mental or (ideally) written model outlining:</p><ol><li><p>Languages in your software system</p></li><li><p>Kinds of signals needed for each component or service (group them to make it easier)</p></li><li><p>Which protocols you will use (OpenTelemetry&#8217;s default OTLP or another)</p></li><li><p>Which analytics tools you will use</p></li></ol><h3>What does OTel instrumented code look like?</h3><p>OpenTelemetry&#8217;s CNCF ambassadors like Adriana Villela are better people to follow for specific examples of OpenTelemetry instrumentation.</p><p><a href="https://www.srepath.com/making-sense-opentelemetry-observability-adriana-villela/">Listen to our podcast episode interview where she discusses OpenTelemetry</a>.</p><p>I&#8217;ll show you a simple example to get you thinking about instrumentation mechanics.</p><p>Below you can see 3 different code samples.</p><p>We are instrumenting a Node.js application with OpenTelemetry for tracing and exporting the span data to Jaeger, a dedicated tracing tool.</p><p><strong>The first code sample installs OpenTelemetry packages for our Node.js application:</strong></p><pre><code><code>npm install \
  @opentelemetry/api \
  @opentelemetry/core \
  @opentelemetry/instrumentation \
  @opentelemetry/instrumentation-express \
  @opentelemetry/instrumentation-http \
  @opentelemetry/sdk-trace-node \
  @opentelemetry/sdk-trace-base \
  @opentelemetry/resources
</code></code></pre><p><strong>This second code sample installs the exporter needed to send data to Jaeger:</strong></p><pre><code><code>npm install @opentelemetry/exporter-jaeger
</code></code></pre><p><strong>This third code sample sits inside our Node.js application and shows the instrumentation steps:</strong></p><pre><code><code>// Step 1: Import the OpenTelemetry tracing SDK
// (assumes the 1.x SDK packages; later major versions changed some of these calls)
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor, ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

// Step 2: Import Jaeger exporter
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

// Step 3: Create and configure a TracerProvider with Jaeger exporter
// The service name is set on the provider's resource, not on the exporter
const tracerProvider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'example-service' }),
});
tracerProvider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter())); // For console logging
// Export to Jaeger (the endpoint assumes a Jaeger collector running locally)
tracerProvider.addSpanProcessor(new SimpleSpanProcessor(new JaegerExporter({ endpoint: 'http://localhost:14268/api/traces' })));
tracerProvider.register();

// Step 4: Instrumentation - Import instrumentations for popular libraries
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Step 5: Register instrumentations with the tracer provider
// (pass instances, not classes, and do this before your app loads http or express)
registerInstrumentations({
  tracerProvider,
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    // Add more instrumentations as needed
  ],
});

// Step 6: Use OpenTelemetry APIs in your code
const { context, trace } = require('@opentelemetry/api');

// Function to simulate a simple HTTP request
async function makeHttpRequest() {
  const span = trace.getTracer('example-tracer').startSpan('makeHttpRequest');

  // Simulate some work
  await new Promise(resolve =&gt; setTimeout(resolve, 100));

  span.end();
}

// Sample application logic
async function main() {
  const span = trace.getTracer('example-tracer').startSpan('main');

  // Simulate some work
  await makeHttpRequest();

  span.end();
}

// Step 7: Execute the application
main();
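</code></code></pre><p>One thing to watch in the sample above: main and makeHttpRequest each call startSpan, which does not by itself make the new span the active one, so the two spans are not linked as parent and child. OpenTelemetry&#8217;s API offers tracer.startActiveSpan() for that. To show the idea without any OTel packages, here is a toy sketch of active-span tracking built on the AsyncLocalStorage class that ships with Node.js. It is not the OpenTelemetry implementation, and withSpan, finished, and runDemo are made-up names used only for illustration.</p><pre><code><code>// Toy model of active-span tracking. This is NOT the OpenTelemetry API;
// withSpan, finished, and runDemo are hypothetical names for illustration.
const { AsyncLocalStorage } = require('node:async_hooks');

const activeSpan = new AsyncLocalStorage(); // holds the span that is currently active
const finished = [];                        // spans are recorded here when they end
let nextId = 1;

// Start a span, keep it active while fn runs, and record it once fn settles.
async function withSpan(name, fn) {
  const parent = activeSpan.getStore();     // undefined when no span is active
  const span = { id: nextId++, name: name, parentId: parent ? parent.id : null };
  try {
    return await activeSpan.run(span, fn);
  } finally {
    finished.push(span);
  }
}

// The child needs no reference to its parent: it finds it in the active context.
function makeHttpRequest() {
  return withSpan('makeHttpRequest', function () {
    return new Promise(function (resolve) { setTimeout(resolve, 10); });
  });
}

// Runs the two nested spans and resolves with the recorded spans.
async function runDemo() {
  await withSpan('main', makeHttpRequest);
  return finished;
}

// Usage: runDemo().then(function (spans) { console.log(spans); });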
</code></code></pre><h2>(Phase 2 of 3) Deployment</h2><p>After your application has been successfully instrumented, OpenTelemetry needs to run alongside the application&#8217;s runtime.</p><p>It can do this in various ways including sidecars, daemonsets, gateways, etc.</p><p>You might hear about these as &#8220;Collectors&#8221;.</p><p>This is a critical step that cannot be messed up.</p><p>It&#8217;s the point when telemetry data is gathered during the execution of your application.</p><p>Now that we&#8217;ve covered deployment, let&#8217;s look at the integration phase:</p><h2>(Phase 3 of 3) Integration</h2><p>This phase involves receiving, transforming, and then pushing the data out for human use.</p><p>Some vendors will have an &#8220;Observability Agent&#8221; or &#8220;Agent&#8221; for short that will do all of this.</p><p>You can mix and match depending on your programming language, OS, etc.</p><p>We can break OpenTelemetry&#8217;s integration phase into components like:</p><ol><li><p>Receivers</p></li><li><p>Processors</p></li><li><p>Exporters</p></li><li><p>OTLP</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3LUN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3LUN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 424w, https://substackcdn.com/image/fetch/$s_!3LUN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 848w, 
https://substackcdn.com/image/fetch/$s_!3LUN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 1272w, https://substackcdn.com/image/fetch/$s_!3LUN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3LUN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png" width="1303" height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1303,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3LUN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 424w, https://substackcdn.com/image/fetch/$s_!3LUN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 848w, 
https://substackcdn.com/image/fetch/$s_!3LUN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 1272w, https://substackcdn.com/image/fetch/$s_!3LUN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d6d8cfe-5f27-4804-acc3-7c31c9532375_1303x529.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s talk about each of these components now:</p><h3>1. 
Receivers</h3><p>Receivers collect raw telemetry data from your instrumented applications or system components.</p><p>They are the entry points for your incoming observability data.</p><h3>2. Processors</h3><p>Processors transform the raw telemetry data taken in by receivers.</p><p>You can transform the data by filtering, sampling, and enriching it.</p><h3>3. Exporters</h3><p>Exporters take your processed telemetry data and send it to external systems. This could be your observability dashboard, for example.</p><p>You can also direct the data to storage for future querying or visualization needs.</p><h3>4. OTLP</h3><p>OTLP is less a stage that does anything to the data and more the protocol that moves it around. It stands for OpenTelemetry Protocol.</p><p>In a way, this is the secret sauce that keeps the data consistent. It&#8217;s the common language for communicating within the OTel framework.</p><p>There are 2 ways the data can be encoded for transport: as protobuf or as JSON.</p><p>Protobuf is a compact and efficient way of encoding structured data, so it comes as the default.</p><p>JSON is available if you want the data to be human readable, but it will increase the data size.</p><p>Keep in mind that you don&#8217;t have to use OTLP at all for transporting the data.</p><p>You can use other protocols like Jaeger Thrift for tracing, Prometheus exposition format for metrics, and more.</p><h2>Challenges with taking up OpenTelemetry</h2><h3>Not all of it will work well with your architecture</h3><p>The most important thing to keep in mind is that OTel is made up of various components, and not every component is stable or even available.</p><p>For example, as of December 29th, 2023, logs were not available for Go-based services.</p><p>But you can be confident in OTel&#8217;s tracing capabilities, as that part is mature across most languages.</p><p>You can easily check if OTel will work with your various system components.</p><p>Go to <a 
href="http://opentelemetry.io/status">opentelemetry.io/status</a> and follow the instructions there.</p><p>You will see that various OTel components are marked as stable, experimental, in development, or not yet started.</p><p>Once a component is stable, you&#8217;re good to use it in most situations because stable implies:</p><ol><li><p><strong>long-term support</strong> &#8212; the component is well-tested and ready for production use</p></li><li><p><strong>dependency isolation</strong> &#8212; designed to minimize dependencies and provide clear APIs</p></li><li><p><strong>backwards compatibility</strong> &#8212; future updates will strive to avoid breaking existing functionality</p></li></ol><h3>Some systems just won&#8217;t work with OpenTelemetry</h3><p>Now, I said that a stable component can be used in most situations.</p><p>There are some situations where you might not be able to, or might not want to, use it.</p><p>The first instance is when you&#8217;re looking to instrument a legacy system.</p><p>You&#8217;ll have to consider whether that component will work effectively with your legacy system.</p><p>Another instance is when you have a low-latency system like a trading platform.</p><p>OpenTelemetry adds some performance overhead, so it&#8217;s worth measuring how it affects your system&#8217;s latency before you commit to it.</p><h3>OpenTelemetry has a learning curve</h3><p>While it&#8217;s a lot easier than learning 5 or even 10 different tools, OpenTelemetry is still intricate and you will need to learn how to use it well.</p><p>A few things I&#8217;d recommend:</p><ol><li><p>deep dive into the documentation on the OpenTelemetry website</p></li><li><p>talk with developer advocates at the observability provider you use</p></li><li><p>reach out to CNCF ambassadors or maintainers who are focused on OpenTelemetry</p></li></ol><p>While I&#8217;m on that train of thought, a quick shout out to the hardworking maintainers who look after OpenTelemetry&#8217;s 
codebase and are working to make every component stable.</p>]]></content:encoded></item><item><title><![CDATA[Intro to logs, metrics, and tracing]]></title><description><![CDATA[This introductory guide is a deeper look into logs, metrics, and traces than what&#8217;s inside the What is Observability? 
guide.]]></description><link>https://read.srepath.com/p/intro-to-logs-metrics-and-tracing</link><guid isPermaLink="false">https://read.srepath.com/p/intro-to-logs-metrics-and-tracing</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Sat, 11 May 2024 12:15:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1A-w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Logs, metrics, and traces are considered the 3 golden pillars of observability.</p><p>I&#8217;d say they are the minimum you need for effective observability.</p><p>There are a lot more signals you can pick up from your systems, but these 3 are the core ones you should master before you move on to others.</p><h2>Logs</h2><h3>What are logs?</h3><p>Logs are text-based records of events and messages that your software system emits.</p><h3>Why would I use logs?</h3><p>Logging is an important activity in developing software.</p><p>Debugging software is much harder without knowing what kind of error messages your software is producing.</p><p>You can also create automated alerts that trigger when a specific log level, e.g. WARN or ERROR, occurs.</p><p>More modern logging systems can even support your efforts to collect metrics and traces, which we&#8217;ll get to later.</p><h3>What kinds of logs exist?</h3><p>Here are some examples:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1A-w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1A-w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 424w, 
https://substackcdn.com/image/fetch/$s_!1A-w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 848w, https://substackcdn.com/image/fetch/$s_!1A-w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 1272w, https://substackcdn.com/image/fetch/$s_!1A-w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1A-w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png" width="570" height="647.0112589559877" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1109,&quot;width&quot;:977,&quot;resizeWidth&quot;:570,&quot;bytes&quot;:57676,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1A-w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 424w, 
https://substackcdn.com/image/fetch/$s_!1A-w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 848w, https://substackcdn.com/image/fetch/$s_!1A-w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 1272w, https://substackcdn.com/image/fetch/$s_!1A-w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecfc79e-669c-4bac-8585-07a8ce9532b9_977x1109.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>What is found in a log?</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PJIO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PJIO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 424w, https://substackcdn.com/image/fetch/$s_!PJIO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 848w, https://substackcdn.com/image/fetch/$s_!PJIO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 1272w, https://substackcdn.com/image/fetch/$s_!PJIO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PJIO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png" width="1311" height="825" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1311,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87293,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PJIO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 424w, https://substackcdn.com/image/fetch/$s_!PJIO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 848w, https://substackcdn.com/image/fetch/$s_!PJIO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 1272w, https://substackcdn.com/image/fetch/$s_!PJIO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070380a3-e58b-48f9-8037-0d176f99ae65_1311x825.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You will typically find the following elements in a log:</p><ul><li><p>Timestamp &#8212; the time the event or message occurred</p></li><li><p>Source identifier &#8212; this could be your microservice&#8217;s name or instance ID</p></li><li><p>Levels &#8212; helps you know the importance of a record e.g. 
INFO, DEBUG, WARN, ERROR, etc.</p></li><li><p>Description &#8212; gives further details of the event or message</p></li></ul><h3>What do the various log levels mean?</h3><p>INFO means you are getting details about the normal functioning of your app or system.</p><p>DEBUG helps developers trace an execution flow for troubleshooting purposes.</p><p>WARN indicates that something is starting to go wrong or deviate from normal.</p><p>ERROR indicates that there is some kind of issue or failure related to the source.</p><h3>How is log data structured?</h3><p>For most of its existence, log data has been unstructured text stored in plain text files.</p><p>Here&#8217;s an example of unstructured log data:</p><pre><code><code>2022-01-08 15:30:00 | Error | Application crashed
2022-01-08 15:35:20 | Info  | User logged in
2022-01-08 15:40:45 | Warning | Disk space low
</code></code></pre><p>The modern way is to represent log data as structured JSON.</p><p>Here&#8217;s an example of JSON-based log data:</p><pre><code><code>[
  {
    "timestamp": "2023-01-08T15:30:00",
    "level": "Error",
    "message": "Database overload!"
  },
  {
    "timestamp": "2023-01-08T15:35:20",
    "level": "Info",
    "message": "User logged out"
  },
  {
    "timestamp": "2023-01-08T15:40:45",
    "level": "Warning",
    "message": "Disk space approaching 90%"
  }
]
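</code></code></pre><p>If your existing logs look like the pipe-delimited example earlier, a small parser can turn each line into the JSON shape above. Here&#8217;s a minimal sketch in JavaScript. The <code>parseLogLine</code> helper is hypothetical (it isn&#8217;t part of any logging library), and it assumes every line follows the <code>timestamp | level | message</code> layout with no extra pipes in the message:</p><pre><code><code>// Hypothetical helper, not from any logging library.
// Assumes each line looks like "timestamp | level | message"
// and that the message itself contains no "|" character.
function parseLogLine(line) {
  const parts = line.split("|").map(function (part) {
    return part.trim();
  });
  if (parts.length !== 3) {
    return null; // line does not match the assumed layout
  }
  return {
    // Swap the first space for "T", as in the JSON example's timestamps
    timestamp: parts[0].replace(" ", "T"),
    level: parts[1],
    message: parts[2]
  };
}

// Usage: convert one unstructured line and print it as JSON
const entry = parseLogLine("2022-01-08 15:40:45 | Warning | Disk space low");
console.log(JSON.stringify(entry, null, 2));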
</code></code></pre><h3>How do you gather logs on your systems?</h3><p>Many popular languages like Ruby, Python, and JavaScript have built-in or widely used logging libraries, so be sure to check them out.</p><p>You can also add logging functionality through external libraries like Log4j (for Java).</p><p>OpenTelemetry also has SDKs for instrumenting logging in several popular languages.</p><h3>3 golden rules for maintaining your logs</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g4eZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g4eZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 424w, https://substackcdn.com/image/fetch/$s_!g4eZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 848w, https://substackcdn.com/image/fetch/$s_!g4eZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 1272w, https://substackcdn.com/image/fetch/$s_!g4eZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!g4eZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png" width="1159" height="566" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1159,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37745,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g4eZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 424w, https://substackcdn.com/image/fetch/$s_!g4eZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 848w, https://substackcdn.com/image/fetch/$s_!g4eZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 1272w, https://substackcdn.com/image/fetch/$s_!g4eZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc28a38a-1359-4fe6-8189-33d875f5bf5f_1159x566.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s unpack each of these a little further:</p><p><strong>Rule 1: Push all logs into a central repo</strong></p><p>Life was simpler when monoliths were around. 
You could get all of a monolith&#8217;s log data in one place.</p><p>But it&#8217;s no longer so simple now that most software systems have distributed components like VMs, databases, and more.</p><p>A fragmented view of the system&#8217;s performance is one of the biggest risks from log data being held near each of these components.</p><p>It also increases the complexity of solving log issues like storage and security.</p><p>Log aggregation can help push all the fragmented log data into a centralized repository.</p><p>This can drastically simplify storage and management of logs.</p><p><strong>Rule 2: Stay on top of log storage</strong></p><p>The simple aspect of this rule is <em>the more log data you store, the harder it gets to manage.</em></p><p>Logs can come with an operational overhead in terms of capacity planning work, storage costs, and complexity.</p><p>I know of companies that are spending high six figures every month to store their log data when they could cut those costs down to low five figures by applying good techniques.</p><p>Techniques like:</p><ul><li><p><strong>Log compression</strong> &#8212; reducing the log data size using compression algorithms</p></li><li><p><strong>Log retention policies</strong> &#8212; having a set policy for when log entries should get purged ensures ongoing control of log data size</p></li><li><p><strong>Automated log file reduction</strong> &#8212; automatically trim older log entries at certain time intervals or file size limits</p></li></ul><p><strong>Rule 3: Keep your logs secure</strong></p><p>Logs can contain sensitive information like error and warning messages that can tell bad actors about your system&#8217;s architecture and its inner (non)workings.</p><p>They can work out the breadth of your system and where they can find the weakest spots to infiltrate with measures like brute force attacks, evasion-based testing, and more.</p><p>You can apply some of these methods to reduce your logs&#8217; security 
risk:</p><ol><li><p><strong>Sanitize logs regularly</strong> &#8212; clean up any sensitive information, e.g. user input or tokens, from logs as soon as it is no longer useful for debugging and incident handling</p></li><li><p><strong>Restrict your logging levels</strong> &#8212; sensitive levels like DEBUG (which can expose details that point to a vulnerability) should have restricted access</p></li><li><p><strong>Regular audits of logs</strong> &#8212; check your logs to make sure that sensitive information is not staying on long-term and that access is restricted</p></li></ol><h3>Log data lifecycle</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WK_l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WK_l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 424w, https://substackcdn.com/image/fetch/$s_!WK_l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 848w, https://substackcdn.com/image/fetch/$s_!WK_l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 1272w, https://substackcdn.com/image/fetch/$s_!WK_l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WK_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png" width="807" height="569" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:569,&quot;width&quot;:807,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17495,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WK_l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 424w, https://substackcdn.com/image/fetch/$s_!WK_l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 848w, https://substackcdn.com/image/fetch/$s_!WK_l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 1272w, https://substackcdn.com/image/fetch/$s_!WK_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a09a7e-68c8-47f1-976b-836aeb6a9657_807x569.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>3 ways to make analyzing log data easier</h3><p>Logs are a firehose of data since they collect it from all kinds of components in your system.</p><p>You need to use a few analysis techniques to make sure you get good value out of them. 
Techniques like:</p><ol><li><p><strong>Search and filtering</strong></p><ul><li><p>Helps you sift through large volumes of logs quickly by searching for, or filtering down to, entries that match keywords, time ranges, or levels</p></li><li><p>Tools that can help include Splunk and Elasticsearch (often run as part of the ELK stack)</p></li><li><p>As an example, you could find every entry with the level ERROR across the system so that you can track down issues faster</p></li></ul></li><li><p><strong>Parsing</strong></p><ul><li><p>This is very useful if your logs are unstructured rather than JSON-based</p></li><li><p>You can break log data into meaningful attributes to extract valuable insights</p></li><li><p>Of course, you can always convert your logs to JSON to make them easier to process</p></li></ul></li><li><p><strong>Trend analysis</strong></p><ul><li><p>You will benefit most from your log data when you can identify patterns and trends within it</p></li><li><p>One of the easier ways to do this is to visualize the data on dashboards and graphs</p></li><li><p>You can then see how key numbers, like the count of ERROR-level entries, are moving over time</p></li></ul></li></ol><h2>Metrics</h2><h3>What are metrics?</h3><p>Metrics are measurements that indicate how a certain aspect of your software system is performing.</p><p>They are always number-based i.e. 
quantitative measurements.</p><h3>Why do I need metrics?</h3><p>Metrics are important data for your software system.</p><p>They can help guide your decision-making as to how to improve your system.</p><p>You can use metrics to guide capacity planning efforts which is a big part of Site Reliability Engineering (SRE) work.</p><p>They also come in very handy during incident response to see where issues are occurring, which once again is a huge part of SRE work.</p><p>They are the main data contributor to alerts.</p><h3>What kind of metrics exist?</h3><p>Here are some examples of metrics:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dezs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dezs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 424w, https://substackcdn.com/image/fetch/$s_!Dezs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 848w, https://substackcdn.com/image/fetch/$s_!Dezs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 1272w, https://substackcdn.com/image/fetch/$s_!Dezs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Dezs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png" width="895" height="881" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d587fa38-c17a-4a32-999d-f3de17da8002_895x881.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:881,&quot;width&quot;:895,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39246,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dezs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 424w, https://substackcdn.com/image/fetch/$s_!Dezs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 848w, https://substackcdn.com/image/fetch/$s_!Dezs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 1272w, https://substackcdn.com/image/fetch/$s_!Dezs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd587fa38-c17a-4a32-999d-f3de17da8002_895x881.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You can develop and consume metrics for:</p><ul><li><p>applications</p></li><li><p>infrastructure</p></li><li><p>networks and other system components</p></li></ul><p>You will encounter 4 types of measures when it comes to metrics:</p><ul><li><p><strong>Counters</strong> &#8212; measure the number of times a particular action or event occurs e.g. http_requests_total gives a count of the number of HTTP requests</p></li><li><p><strong>Gauges</strong> &#8212; provide a snapshot of the current level of a metric at a particular time e.g. memory_usage_mb gives a measurement of memory usage at a particular time</p></li><li><p><strong>Timers</strong> &#8212; measure the duration or latency of an event or process e.g. 
request_duration measures the length of time it takes to initiate and complete a request</p></li><li><p><strong>Histograms</strong> &#8212; provide a distribution of values broken down into buckets or set intervals e.g. response_time_histogram shows the breakdown of times falling into 0ms, 10ms, 20ms, etc</p></li></ul><h3>What is found in a metric?</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sb0M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sb0M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 424w, https://substackcdn.com/image/fetch/$s_!sb0M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 848w, https://substackcdn.com/image/fetch/$s_!sb0M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 1272w, https://substackcdn.com/image/fetch/$s_!sb0M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sb0M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png" width="659" height="311" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:659,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sb0M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 424w, https://substackcdn.com/image/fetch/$s_!sb0M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 848w, https://substackcdn.com/image/fetch/$s_!sb0M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 1272w, https://substackcdn.com/image/fetch/$s_!sb0M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149b4dda-3192-4842-aa41-5dbc39cb31bc_659x311.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You will find that metrics are composed of various key:value pairs, like those above.</p><p>The example above shows a metric called &#8220;Page Load Time&#8221; that is measured in seconds, with an average of 2.5 seconds and an acceptable limit of 3 seconds. Because the average is below the 3-second limit, the status is considered optimal.</p><p>You will rarely see a metric this simple in your system, but I wanted to keep it simple to explain the breakdown of a metric.</p><p>You will more likely see metrics like this:</p><pre><code><code>{
  "metric_name": "http_requests",
  "region": "us-east",
  "http_method": "GET",
  "status_code": 200,
  "response_time": 25.3,
  "bucket": "low",
  "timestamp": 1642267200
}
</code></code></pre><h3>How do you gather metrics from your system?</h3><p>You can use various open-source and commercial tools for gathering metrics.</p><p>Open-source options include Prometheus, Grafana, and more.</p><p>You can also use commercial tools like Datadog and New Relic.</p><p>OpenTelemetry once again has stable SDKs to instrument for metrics in several popular languages. You can read an intro guide to OpenTelemetry here.</p><h3>3 golden rules for building your metrics</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qfZE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qfZE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 424w, https://substackcdn.com/image/fetch/$s_!qfZE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 848w, https://substackcdn.com/image/fetch/$s_!qfZE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 1272w, https://substackcdn.com/image/fetch/$s_!qfZE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qfZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png" width="971" height="566" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:971,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35932,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qfZE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 424w, https://substackcdn.com/image/fetch/$s_!qfZE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 848w, https://substackcdn.com/image/fetch/$s_!qfZE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 1272w, https://substackcdn.com/image/fetch/$s_!qfZE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f6601f-db82-4f1b-ab08-3b868a196a2f_971x566.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s unpack each of these a little further:</p><p><strong>Rule 1: Define your metrics clearly</strong></p><p>Each metric should serve a specific purpose: helping you solve a problem, handle a known incident type, or meet a business need.</p><p>A metric can consist of multiple key:value pairs, each describing a specific aspect of it.</p><p>You can precisely define your metrics by stating your key:value pairs as:</p><ol><li><p>what is being measured e.g. latency</p></li><li><p>how it&#8217;s calculated e.g. time elapsed from start to end of request <em>and</em></p></li><li><p>the unit of measurement e.g. 
milliseconds</p></li></ol><p>A metric can get complicated fast as the number of key:value pairs grows.</p><p>Pay special attention to the cardinality of each key, i.e. how many distinct values it can take.</p><p>High-cardinality keys like user IDs can cause a surge in data, leading to storage issues.</p><p><strong>Rule 2: Establish baselines and thresholds</strong></p><p>It&#8217;s all well and good to define what data you want to collect, but a lot of the time you want the system to <em>tell you</em> when a metric is drifting away from optimal.</p><p>You can do this by establishing baselines and thresholds.</p><p>Your baseline number captures what you consider normal and comfortable for the specific measurement.</p><p>Thresholds set the level where you start getting red flags about the metric you&#8217;re collecting.</p><p>Let&#8217;s look at the metric example below:</p><pre><code><code>{
  "metric_name": "http_requests",
  "region": "us-east",
  "http_method": "GET",
  "status_code": 200,
  "response_time": 25.3,
  "timestamp": 1642267200,
  "baseline": {
    "average_response_time": 20,
    "acceptable_status_codes": [200, 201, 204]
  },
  "threshold": {
    "response_time": 30,
    "error_rate": 5
  }
}
</code></code></pre><p>The baseline sets an average response time of 20ms.</p><p>It also lists the 2xx status codes (200, 201, 204) that signify successful requests.</p><p>Now, let&#8217;s talk thresholds.</p><p>If we see the response time going beyond 30ms, that&#8217;s a signal that something might be off.</p><p>But that&#8217;s not all: if this happens and we also notice error codes repeating more than 5 times (the error_rate threshold), it&#8217;s a double whammy, and we consider it a problem.</p><p>Now you might be thinking: why can&#8217;t I just set thresholds in my alerting tool or monitoring system? You can, and that&#8217;s what a lot of people do.</p><p>But setting the threshold within the metric itself keeps multiple people from setting different, conflicting alerts for the same metric over time. That can reduce the risk of alert fatigue.</p><p><strong>Rule 3: Know the context of your metrics</strong></p><p>Context is your secret decoder ring.</p><p>It turns raw data into meaningful insights.</p><p>Without the right context, you might think your site is crashing. 
But in reality, it's just handling the Black Friday rush like a champ.</p><p>When you have context, you can distinguish normal behavior from the unusual.</p><p>You can look into the context in terms of:</p><ul><li><p>Day or time</p></li><li><p>Special events</p></li><li><p>Typical user behavior</p></li></ul><h3>Metrics data lifecycle</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H4uv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H4uv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 424w, https://substackcdn.com/image/fetch/$s_!H4uv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 848w, https://substackcdn.com/image/fetch/$s_!H4uv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 1272w, https://substackcdn.com/image/fetch/$s_!H4uv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H4uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png" width="811" height="699" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:699,&quot;width&quot;:811,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21228,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H4uv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 424w, https://substackcdn.com/image/fetch/$s_!H4uv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 848w, https://substackcdn.com/image/fetch/$s_!H4uv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 1272w, https://substackcdn.com/image/fetch/$s_!H4uv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14e4b13-5a4a-4898-8b4c-3ab0182b474f_811x699.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>How metrics are consumed</h3><ol><li><p><strong>Alerting</strong></p><ul><li><p>By far the most critical use of metrics</p></li><li><p>Alerts help identify system and application issues in real time as they occur</p></li><li><p>They continuously evaluate metric values and trigger when thresholds are crossed</p></li></ul></li><li><p><strong>Dashboards</strong></p><ul><li><p>One of the most common ways metrics are consumed</p></li><li><p>They can provide real-time and historical views of metrics through charts, counts, and graphs</p></li><li><p>You can use them to help resolve chronic or wide-spanning issues</p></li></ul></li><li><p><strong>Machine learning and data analysis</strong></p><ul><li><p>One of the more promising uses of metrics data</p></li><li><p>Can uncover hidden patterns and predict trends without human intervention</p></li><li><p>Helps you get deeper insights into system behavior than manual analysis alone</p></li></ul></li></ol><p>There are a 
lot more areas that use metrics data, but the above 3 can take most engineers very far in their work.</p><h2>Tracing</h2><h3>What are traces?</h3><p>Traces are detailed records that show every step a request or process takes in your system, helping you understand and fix any issues.</p><p>In essence, you are following the lifecycle of a request or process.</p><h3>Why would I use traces?</h3><p>Traces can help you find bottlenecks within a particular process or request chain.</p><p>By nature, a request chain can contain multiple requests. That can make it very hard to pinpoint where an error is occurring.</p><p>Tracing breaks the chain down so you can see how each individual request performs as it starts and hands off to the next request.</p><p>This makes it a lot easier to work out where in the request chain you are experiencing excessive latency, for example.</p><p>It could mean finding the issue with a database that forms a critical step in a request.</p><p>In another situation, it could be a particular piece of code that&#8217;s causing trouble.</p><h3>What kind of traces exist?</h3><p>You can find traces in two forms, and one can fit inside the other.</p><p>You can trace internally within a service e.g. within a backend service.</p><p>You can trace across multiple services e.g. 
from frontend to backend to database and back.</p><p>The latter is known as distributed tracing and is the more common way to run tracing.</p><h3>What is found in a trace</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dQgm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dQgm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 424w, https://substackcdn.com/image/fetch/$s_!dQgm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 848w, https://substackcdn.com/image/fetch/$s_!dQgm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 1272w, https://substackcdn.com/image/fetch/$s_!dQgm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dQgm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png" width="695" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:695,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dQgm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 424w, https://substackcdn.com/image/fetch/$s_!dQgm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 848w, https://substackcdn.com/image/fetch/$s_!dQgm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 1272w, https://substackcdn.com/image/fetch/$s_!dQgm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b295bc4-d1e6-4992-8338-750dd26f2693_695x393.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Original image via <a href="http://jaegertracing.io">jaegertracing.io</a></p><p>The core element of a trace is the span.</p><p>Spans are called the &#8220;unit of work&#8221; within a trace. They measure the duration of an activity, i.e. how long a particular action takes from start to completion.</p><p>In most situations, a span&#8217;s duration is measured in milliseconds. Each span comes with a span ID, which makes it easy to dig into later on.</p><p>Spans can be children of other spans. 
In the above example, the various spans under <em>Inventory</em> are children of the <em>Inventory</em> span above them.</p><h3>How do you gather tracing data from your system?</h3><p>Open-source options include Jaeger and Zipkin.</p><p>Tracing is an area that OpenTelemetry excels in since it incorporates the very mature OpenTracing framework.</p><p>You should be able to find stable SDKs to instrument for tracing in several popular languages.</p><h3>3 golden rules for managing your tracing</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k4U6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k4U6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 424w, https://substackcdn.com/image/fetch/$s_!k4U6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 848w, https://substackcdn.com/image/fetch/$s_!k4U6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 1272w, https://substackcdn.com/image/fetch/$s_!k4U6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!k4U6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png" width="976" height="566" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:976,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k4U6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 424w, https://substackcdn.com/image/fetch/$s_!k4U6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 848w, https://substackcdn.com/image/fetch/$s_!k4U6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 1272w, https://substackcdn.com/image/fetch/$s_!k4U6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6a1bb85-efb5-4f76-b148-fc828203af75_976x566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s unpack each of these a little further:</p><p><strong>Rule 1: Define your tracing context</strong></p><p>This means you need to set unique identifiers and propagate them consistently across your request. 
You need these so every span can be tied to the request it belongs to.</p><p>You can&#8217;t link spans together without this work, which makes it hard to get full visibility into a particular request.</p><p>So to do this right and define the context:</p><ol><li><p>Set a unique trace ID for each incoming request</p></li><li><p>Give each component within the request a unique span ID</p></li><li><p>Propagate these identifiers by including them in headers or context objects</p></li></ol><p><strong>Rule 2: Instrument thoughtfully</strong></p><p>Tools like OpenTelemetry make it easy for you to instrument everything for tracing, but that doesn&#8217;t mean you should.</p><p>You want to gain meaningful insights from important requests.</p><p>You don&#8217;t want to overwhelm your system with excessive data.</p><p>So how do you instrument thoughtfully? Here&#8217;s how:</p><ol><li><p>Work out which services and components are critical to understanding your system&#8217;s behavior</p></li><li><p>Instrument only the ones that will help you:</p><ol><li><p>understand system performance</p></li><li><p>diagnose bottlenecks or</p></li><li><p>improve latency</p></li></ol></li><li><p>Avoid the temptation to instrument <em>every</em> component, so your tracing data doesn&#8217;t get noisy</p></li></ol><p><strong>Rule 3: Correlate traces with logs &amp; metrics</strong></p><p>One of the most important aspects of observability is that it integrates multiple data types to give you a bigger, stronger picture of the system.</p><p>Doing so can help you identify root causes more effectively and often faster.</p><p>Not doing so leaves tracing data on its own when it could really benefit from log data to support its context.</p><p>So how do you make this integration happen? 
Here&#8217;s how:</p><ol><li><p>Ensure that logs include trace IDs</p></li><li><p>Be consistent with how you tag components and services so you can link up their traces, logs, and metrics</p></li><li><p>Leverage tracing and logging libraries that integrate with observability platforms</p></li></ol><h2>How logs, metrics and traces work in unison</h2><p>Picture logs, metrics, and traces as the dynamic trio in the backstage of your software.</p><p>There are a bunch of other signals, but these 3 are the core foundation of your observability work.</p><p>Logs keep a record of every event, significant or not.</p><p>Metrics help you keep a scorecard on system health.</p><p>Traces map out the journey of each request from start to finish.</p><p>Bringing these three together is like weaving a compelling story &#8212; traces connect the dots, metrics add the numbers, and logs provide the backstage commentary.</p><p>It's your backstage pass to understanding and fine-tuning your system.</p><p>This trio has your back when you need to work out, &#8220;What&#8217;s going wrong here?&#8221;</p><p>They are your ticket to more visibility during troubleshooting and improvement work.</p>]]></content:encoded></item><item><title><![CDATA[What is observability? [Key concepts explained]]]></title><description><![CDATA[Last week, I sent you an article about the cardinality conundrum in observability. That brought up a few questions. The one I didn't expect was, "What is observability, really?" So let's find out...]]></description><link>https://read.srepath.com/p/observability-key-concepts</link><guid isPermaLink="false">https://read.srepath.com/p/observability-key-concepts</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Thu, 25 Apr 2024 12:30:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!54oj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>A simple definition of observability</h3><p>If I had to explain observability to you for the first time, I&#8217;d say it like this:</p><blockquote><p>Observability is a practice that helps us collect, refine and use data on how our software system is behaving. We can query or visualize patterns in the data to solve problems.</p></blockquote><p>You can also call it o11y to save writing observability over and over again.</p><p>Why the <em>11</em>s? 
It&#8217;s the number of characters between the o and y in observability.</p><p>I sometimes pronounce it as &#8220;Olly&#8221; to people who I think would get the context.</p><h3>My favorite &#8220;textbook&#8221; &#129299; definition of observability</h3><p>If I were to make things a little harder and give you a textbook definition of observability, I&#8217;d pick this one:</p><blockquote><p>Observability is the measure of a system's ability to allow operators or engineers to understand its internal state and behavior. It involves the collection, analysis, and visualization of relevant telemetry data, such as logs, metrics, traces, and events. This data helps facilitate effective monitoring, debugging, and performance optimization of the software system.</p></blockquote><p>Both definitions give us an idea of what observability can offer, but only one will cut the mustard on Reddit. I&#8217;ll let you guess which one.</p><h3>One other factor is important</h3><p>Observability is designed to support the <strong>need for real-time access to data-based insights</strong>. 
It is also well suited to the nature of cloud-based software.</p><h2>What kinds of questions can observability answer?</h2><p>&#8220;Important ones&#8221; is going to be my upfront response!</p><p>All the effort behind observability is to get solid data to answer critical questions like:</p><ul><li><p>Are there issues affecting users&#8217; ability to complete requests?</p></li><li><p>How do we NOT fly blind when an outage or security incident occurs?</p></li><li><p>How can we scale our system to meet changing demand?</p></li><li><p>Are there any bottlenecks in our system&#8217;s performance and, if so, where?</p></li><li><p>Is our service meeting SLOs?</p></li><li><p>How do we provide evidence of service uptime in an SLA dispute?</p></li><li><p>How can we investigate system performance or uptime issues faster?</p></li></ul><h2>The big picture of observability</h2><p>It's a complex area with much to do, a lot of data generated, and a fair risk of messing up.</p><p><em>But it's worth it.</em> Observability doesn&#8217;t just help us respond to incidents with the confidence of data.</p><p>It also supports works that enhance the system. 
Works like improving other focus areas like system design, release engineering, performance tuning, and more.</p><p>SREpath's <em>Island of Observability</em> highlights the big picture of this focus area.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!54oj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!54oj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 424w, https://substackcdn.com/image/fetch/$s_!54oj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 848w, https://substackcdn.com/image/fetch/$s_!54oj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!54oj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!54oj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg" width="1217" height="1046" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1046,&quot;width&quot;:1217,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:275523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!54oj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 424w, https://substackcdn.com/image/fetch/$s_!54oj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 848w, https://substackcdn.com/image/fetch/$s_!54oj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!54oj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42896e9-94ef-41b0-9634-f509a1e5c8dc_1217x1046.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Observability vs monitoring vs APM</h2><p>I&#8217;ve heard a myriad of engineers in different roles mix up the terms just above.</p><p>They think monitoring is observability or that their APM tool can fully handle observability.</p><p>Nope!</p><p>APM and monitoring are older paradigms from simpler times.</p><p>They don&#8217;t encompass the breadth of observability, but all 3 of these terms are interrelated. We&#8217;ll get to that in a moment.</p><p>I think it&#8217;ll help to define what each paradigm stands for first. Let&#8217;s start with APM.</p><h3>What does APM mean?</h3><p>APM stands for <strong>Application Performance Monitoring</strong>. It focuses on the application layer and debugging any problems within applications.</p><p>In other words, APM tooling gives you data on how your application performs at a given time. It does not cover other parts of the system like infrastructure or networks.</p><p>This model worked well for monolithic architectures where the APM tool was part of the application&#8217;s code.</p><p>Can you see how this would be problematic to do in cloud-native architecture?</p><p>You might have heard of companies like Splunk and New Relic. 
They started off with APM offerings but are now also focused on observability for the cloud-native reason.</p><p><strong>APM is a subset of monitoring</strong>, so let&#8217;s define monitoring.</p><h3>What does monitoring mean?</h3><p>Monitoring within a software system is all about picking up its health data.</p><p>Prometheus is an example of a tool solely focused on monitoring.</p><p>You can gather data like error rate, traffic levels, latency, and saturation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ix6_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ix6_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 424w, https://substackcdn.com/image/fetch/$s_!Ix6_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 848w, https://substackcdn.com/image/fetch/$s_!Ix6_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 1272w, https://substackcdn.com/image/fetch/$s_!Ix6_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ix6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png" width="605" height="473.2737169517885" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:503,&quot;width&quot;:643,&quot;resizeWidth&quot;:605,&quot;bytes&quot;:7911,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ix6_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 424w, https://substackcdn.com/image/fetch/$s_!Ix6_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 848w, https://substackcdn.com/image/fetch/$s_!Ix6_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 1272w, https://substackcdn.com/image/fetch/$s_!Ix6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53541834-abbf-44b6-8f5a-ce7ae44c311f_643x503.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>I chose this example because latency is my favorite monitoring data &#8212; maybe I should have worked as a performance engineer &#129300;</p><p>By the way, these are known as the 4 golden signals in the Site Reliability Engineering (2016) book.</p><p><strong>Think of monitoring as something that&#8217;s always reactive</strong>. It&#8217;s like a radar that lets you see what&#8217;s in your field of view <em>at that point in time</em>.</p><p>The key purpose of monitoring for a long time has been to support predefined alerting.</p><p>I&#8217;ll share an analogy with you in a moment that will highlight this alerting use case, but I need to explain observability again for that. 
Let me throw a bombshell first&#8230;</p><p><strong>Monitoring is a part of observability systems</strong>, so let&#8217;s define observability again now that we are comparing it with others.</p><h3>What is observability when compared to APM and monitoring?</h3><p>Unlike APM, observability can collect data from everywhere in the software system. This includes applications, databases, storage, networks, and more.</p><p>Observability factors in the complex nature of cloud computing that APM does not.</p><p>Unlike monitoring, observability captures data from a proactive standpoint and helps engineers anticipate where issues will arise. It also gives context to alerts when issues occur.</p><p>Keep this in mind: <strong>APM is a part of monitoring. Monitoring is a part of observability.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BHtT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BHtT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 424w, https://substackcdn.com/image/fetch/$s_!BHtT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 848w, https://substackcdn.com/image/fetch/$s_!BHtT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BHtT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BHtT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png" width="1456" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29467,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BHtT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 424w, https://substackcdn.com/image/fetch/$s_!BHtT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 848w, https://substackcdn.com/image/fetch/$s_!BHtT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BHtT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55685efe-47f9-44a1-9cc1-c9f0967b38d8_1553x880.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Non-software story to highlight observability vs monitoring</h3><p>Pavan Elthepu gave a brilliant example in the context of a hospital situation.</p><p>Imagine a bedbound patient is connected to a heart rate monitor.</p><p>The monitor&#8217;s heart rate reading suddenly spikes, causing an alarm to sound in the hospital ward.</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HFq-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HFq-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 424w, https://substackcdn.com/image/fetch/$s_!HFq-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 848w, https://substackcdn.com/image/fetch/$s_!HFq-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 1272w, https://substackcdn.com/image/fetch/$s_!HFq-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HFq-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png" width="354" height="264.96363636363634" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:494,&quot;width&quot;:660,&quot;resizeWidth&quot;:354,&quot;bytes&quot;:46923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HFq-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 424w, https://substackcdn.com/image/fetch/$s_!HFq-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 848w, https://substackcdn.com/image/fetch/$s_!HFq-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 1272w, https://substackcdn.com/image/fetch/$s_!HFq-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff76db0ba-9a6a-4f6b-9df2-ec02252f66d6_660x494.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>A nearby doctor rushes to the scene to investigate the issue.</p><p>She looks at the monitor and notices the spike in heart rate, but that&#8217;s all it can tell her.</p><p>What should her next step be to prevent the situation from worsening?</p><p>The doctor is well-seasoned and starts looking into medication charts and historical health data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_muv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!_muv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_muv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_muv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_muv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_muv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg" width="270" height="351.86746987951807" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:498,&quot;resizeWidth&quot;:270,&quot;bytes&quot;:75855,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!_muv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_muv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_muv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_muv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702be725-5719-4db1-81e9-25f29fbe2ef7_498x649.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>She finds that the patient had been administered a new pain management drug recently.</p><p>Her conclusion is that the drug is causing an increase in heart rate.</p><p>The nursing staff then work to administer countering treatments to resolve the issue.</p><p>In this situation, <strong>the heart rate monitor is analogous to monitoring</strong> in that it gathers data to give reactive alerts.</p><p>The <strong>medication charts and historical health data are analogous to observability data</strong> in that they give context to the alert. They helped pinpoint the underlying issue.</p><p>You can see from this analogy that observability can be extremely useful for adding context during incidents like outages and security breaches.</p><h2>What does observability data look like?</h2><p>You might have heard of <strong>logs, metrics, traces, and events</strong>.</p><p>These are the main types of data you will find in an observability system.</p><p>Each has its use case for helping you investigate and resolve system problems.</p><h3>Log data definition and example</h3><p>Logs are recorded messages or events the system generates, providing a chronological record of its activities.</p><p>Here&#8217;s a simplified example of what logs can look like:</p><pre><code><code>2023-12-01 15:30:45 [INFO] User 123 logged in successfully.
2023-12-01 15:32:10 [ERROR] Database connection timeout.
2023-12-01 15:33:25 [WARNING] Disk space is running low (90% used).
</code></code></pre><h3>Metrics data definition and example</h3><p>Metrics are quantitative measures of various aspects of the system and its underlying performance.</p><p>Here&#8217;s a simplified example of what metrics can look like:</p><pre><code><code>cpu_usage_percent: 85
memory_used_mb: 1200
http_requests_total: 1500
error_rate: 0.05
</code></code></pre><h3>Trace data definition and example</h3><p>Traces are data that capture each step of a request as it moves through different components of a distributed system.</p><p>A trace consists of data points known as spans, which <a href="https://www.notion.so/Intro-to-logs-metrics-and-tracing-8ca70d2a6b4e4140b186bf675a281651?pvs=21">we explore in another guide that deep dives into logs, traces, and metrics</a>.</p><p>Here&#8217;s a simplified example of what trace span data can look like:</p><pre><code><code>Trace ID: 12345
  Span 1: Service A (Start Time: 15:30:00.000, Duration: 500ms)
    Span 2: Service B (Start Time: 15:30:00.050, Duration: 300ms)
      Span 3: Database Query (Start Time: 15:30:00.070, Duration: 100ms)
    Span 4: Service C (Start Time: 15:30:00.100, Duration: 200ms)
  Span 5: Service D (Start Time: 15:30:00.120, Duration: 150ms)
</code></code></pre><h3>Event data definition and example</h3><p>Event data provides fine-grained details about notable occurrences in the system.</p><p>Each event includes a timestamp, event type, contextual information, and a payload.</p><p>Events are crucial supporting data for incident response and root cause analysis.</p><p>Here&#8217;s a simplified example of what event data can look like:</p><pre><code><code>{
  "timestamp": "2023-05-01T12:34:56Z",
  "eventType": "UserLogin",
  "context": {
    "userId": "12345",
    "location": "Homepage"
  },
  "payload": {
    "status": "Success",
    "ipAddress": "192.168.1.100"
  },
  "source": "AuthService"
}
</code></code></pre><h2>Observability data can come from multiple sources</h2><p><a href="https://www.catchpoint.com/asset/2023-sre-report">Catchpoint&#8217;s 2023 SRE Report</a> found that over 53.5% of its 550+ respondents in the reliability space worked in an environment with 3 or more sources feeding their observability system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NfBz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NfBz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 424w, https://substackcdn.com/image/fetch/$s_!NfBz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 848w, https://substackcdn.com/image/fetch/$s_!NfBz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 1272w, https://substackcdn.com/image/fetch/$s_!NfBz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NfBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png" width="1240" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NfBz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 424w, https://substackcdn.com/image/fetch/$s_!NfBz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 848w, https://substackcdn.com/image/fetch/$s_!NfBz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 1272w, https://substackcdn.com/image/fetch/$s_!NfBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1856634d-f491-4d5b-8633-690b8e1ffac1_1240x356.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h3>What&#8217;s the significance of this finding?</h3><p>It&#8217;s the recognition that having multiple data sources is crucial for obtaining a comprehensive and accurate understanding of the system&#8217;s dynamics.</p><p>The danger of relying on a single source of truth (which IT vendors love selling) is that you miss the complete picture of what&#8217;s happening.</p><h3>What kinds of sources can we get data flowing in from?</h3><p>Here are 5 examples that are most relevant to software in production:</p><ol><li><p>applications</p></li><li><p>infrastructure</p></li><li><p>network</p></li><li><p>front-end UX monitoring</p></li><li><p>client-side device monitoring</p></li></ol><p>Imagine having only infrastructure observability data and no information about how applications behave. 
Such a scenario would mean doing guesswork to pinpoint issues.</p><p>This is why it&#8217;s best to cover infrastructure and applications <em>at a minimum</em>.</p><h2>Where is all this observability data going?</h2><h3>Observability data follows a journey to usefulness</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1mGA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1mGA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 424w, https://substackcdn.com/image/fetch/$s_!1mGA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 848w, https://substackcdn.com/image/fetch/$s_!1mGA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 1272w, https://substackcdn.com/image/fetch/$s_!1mGA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1mGA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png" width="1456" height="409" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67147,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1mGA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 424w, https://substackcdn.com/image/fetch/$s_!1mGA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 848w, https://substackcdn.com/image/fetch/$s_!1mGA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 1272w, https://substackcdn.com/image/fetch/$s_!1mGA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03fb2378-23e8-4711-ae3e-2135bcaa53ad_1515x426.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>&#9888;&#65039;&nbsp;<em>The shape of the arrows means something. I&#8217;ll explain after we go through the flow process.</em></p><h3><strong>Let&#8217;s do a simplified rundown of what happens at each stage:</strong></h3><ol><li><p><strong>Instrumentation</strong></p></li></ol><ul><li><p>You insert monitoring code (listeners) into the services you want to collect metrics from</p></li><li><p>This code gathers data about the service&#8217;s performance, errors, etc.</p></li><li><p>Examples of tooling at this stage include Prometheus and OpenTelemetry</p></li></ul><p> &#8505;&#65039; <strong>A quick aside on OpenTelemetry.</strong> The CNCF (Cloud Native Computing Foundation) is <em>the</em> open-source cloud computing tooling network, and OpenTelemetry is the second most contributed-to project there. Its purpose is to make instrumenting your code for data collection painless. 
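</p><p>To make the instrumentation stage more concrete, here is a minimal sketch of what &#8220;monitoring code&#8221; wrapped around a service function could look like. It uses only the Python standard library. The <code>instrument</code> decorator, the <code>METRICS</code> registry, the metric names, and the <code>checkout</code> function are all hypothetical names made up for illustration. This is not the OpenTelemetry or Prometheus API, which would handle this job (and much more) in a real setup.</p><pre><code>
```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory registry, for illustration only. A real setup would
# hand this job to a Prometheus client library or the OpenTelemetry SDK.
METRICS = defaultdict(float)


def instrument(name):
    """Wrap a function so every call records a count, errors, and duration."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            METRICS[f"{name}_calls_total"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure, then let the caller see the error as usual.
                METRICS[f"{name}_errors_total"] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is timed.
                METRICS[f"{name}_duration_seconds_sum"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrument("checkout")  # "checkout" is an illustrative service name
def checkout(cart_total):
    if cart_total == 0:
        raise ValueError("empty cart")
    return "order placed"


checkout(42)
try:
    checkout(0)
except ValueError:
    pass  # the wrapper has already counted this error

print(dict(METRICS))
```
</code></pre><p>Each call to the wrapped function bumps a call counter, failed calls also bump an error counter, and the elapsed time is added to a running total. These are the kind of raw numbers that the ingestion stage picks up next.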
</p><ol start="2"><li><p><strong>Ingestion</strong></p></li></ol><ul><li><p>The listener code then sends collected metrics data to a data collection service</p></li><li><p>These services are often called collectors. They organize and process the incoming data</p></li><li><p>Examples of tooling at this stage include Prometheus, Kafka, Fluentd</p></li></ul><ol start="3"><li><p><strong>Storage</strong></p></li></ol><ul><li><p>The metrics data then gets pushed to designated databases for storage</p></li><li><p>These databases are often designed to store time-series data, common in monitoring and observability systems</p></li><li><p>Examples of tooling at this stage include Prometheus, OpenTSDB, InfluxDB</p></li></ul><ol start="4"><li><p><strong>Usage</strong></p></li></ol><ul><li><p>This is the final stage, where you can put the data to use for several purposes</p></li><li><p>You can use it for alerting, creating dashboards, and supporting machine querying</p></li><li><p>This helps you stay on top of system behavior, performance, and issues</p></li></ul><p>As I mentioned under the last image, data flow issues can happen in observability.</p><p>I will cover them in more depth in an upcoming guide called How to Solve Poor Data Flow in Observability.</p><h2>How is observability data stored?</h2><p>Most modern observability tools work with time series databases (TSDBs).</p><p>These databases help us understand what a metric is up to at specific moments in time.</p><p>You might catch Comp Sci PhDs calling this the <strong>"temporal aspect"</strong> of the data.</p><h3>Components of a time-series database (TSDB)</h3><p>In a TSDB, time series data is organized into individual <em>series</em>.</p><p>Definition: a sequence of data points ordered by time. 
Hence time series.</p><p>Each data point usually has a timestamp and one or more corresponding values.</p><p>Here&#8217;s an example of what time series data looks like:</p><table><thead><tr><th>Metric name</th><th>Dimensions (labels)</th><th>Timestamp</th><th>Value</th></tr></thead><tbody><tr><td>http_requests_total</td><td>{status="200", method="GET"}</td><td>2023-11-24 09:00:00</td><td>1010</td></tr><tr><td>http_requests_total</td><td>{status="404", method="GET"}</td><td>2023-11-24 09:00:00</td><td>239</td></tr><tr><td>http_requests_total</td><td>{status="200", method="GET"}</td><td>2023-11-24 09:00:01</td><td>1028</td></tr><tr><td>http_requests_total</td><td>{status="500", method="GET"}</td><td>2023-11-24 09:00:02</td><td>10383</td></tr></tbody></table><p>The above example captures what time series data could have looked like for an eCommerce site at the beginning of the Black Friday sale on November 24th, 2023.</p><p>Let&#8217;s break it down:</p><ul><li><p>At exactly 9 am, they had 1010 successful &#8220;200 OK&#8221; requests and 239 unsuccessful &#8220;Page not found&#8221; 404 requests.</p></li><li><p>A second later, they had 1028 successful &#8220;200 OK&#8221; requests.</p></li><li><p>But only a second after that, they had 10383 requests that got a &#8220;500 internal server error&#8221; response, possibly because traffic exceeded what the servers could handle.</p></li></ul><h3>Series? Huh?</h3><p>Think of a <em>series</em> like this. 
It is the unique combination of a metric name (what you are trying to measure) and the key-value pairs that describe it.</p><p>The cherry on top is a timestamp that marks: &#8220;This is what happened at this point.&#8221;</p><p>That&#8217;s why it&#8217;s called a time-series database &#128521;</p><p>You can also think of each new key-value pair as <em>adding a dimension.</em></p><p>You will hear these terms being used in observability conversations a lot:</p><blockquote><p>Don&#8217;t add a new <em>dimension</em> to your <em>series</em> unless you know its <em>cardinality</em> is worth it.</p></blockquote><p>You can explore the <a href="https://www.notion.so/Solving-The-Cardinality-Conundrum-in-Observability-b63637ed67ed4066864506850c56dbb1?pvs=21">important concept of cardinality here</a>.</p><h3>Let&#8217;s break down a series</h3><p>Remember, it has the metric name and the key-value pairs that describe it.</p><p>Here&#8217;s a common example of a metric name:</p><p><strong>http_request_duration_seconds</strong></p><p>Dimensions can include:</p><ul><li><p><code>method</code> (e.g., GET, POST)</p></li><li><p><code>status_code</code> (e.g., 200, 404, 500)</p></li><li><p><code>endpoint</code> (e.g., /api/v1/resource)</p></li><li><p><code>server</code> (e.g., server-1, server-2)</p></li></ul><p>Each series reports its value on its own line, like:</p><ol><li><p><code>http_request_duration_seconds{method="GET", status_code="200", endpoint="/api/v1/resource", server="server-1"}</code></p><ul><li><p>Duration: 0.235 seconds</p></li></ul></li><li><p><code>http_request_duration_seconds{method="POST", status_code="404", endpoint="/api/v1/user", server="server-2"}</code></p><ul><li><p>Duration: 0.540 seconds</p></li></ul></li><li><p><code>http_request_duration_seconds{method="GET", status_code="500", endpoint="/api/v1/resource", server="server-3"}</code></p><ul><li><p>Duration: 1.120 seconds</p></li></ul></li></ol><h2>How is observability data being used?</h2><p>It&#8217;s going 
places, don&#8217;t you worry about that. I mean real practical uses.</p><p>We are ingesting, storing, and then querying it for 3 purposes:</p><ol><li><p><strong>Alerting</strong> &#8212; helps us respond to outages and other incidents</p></li><li><p><strong>Dashboards</strong> &#8212; give SREs and other rockstars a big-picture view of system health</p></li><li><p><strong>Intelligence</strong> &#8212; machine learning models analyze the data to predict trends and guide proactive work</p></li></ol><h2>How does observability add value to software operations?</h2><p>As I mentioned earlier, observability is often used as a conduit to doing other software operations activities better. It can help with:</p><ol><li><p><strong>Incident response &#8212;</strong> Observability provides real-time insights into system behavior, which helps minimize the impact of incidents through rapid identification and resolution of issues</p></li><li><p><strong>Improving system design &#8212;</strong> By analyzing observability data, system architects and engineers can gain a deep understanding of system internals and enhance the system&#8217;s design</p></li><li><p><strong>Capacity planning &#8212;</strong> Capacity planners like SREs can use observability data to see trends in resource usage to make informed decisions for scaling systems up or down</p></li><li><p><strong>Assuring smooth releases.</strong> Observability can help identify and cut the risk of potential issues that can cause post-release incidents, resulting in smoother software deployments.</p></li><li><p><strong>Performance tuning &#8212;</strong> Observability data can help pinpoint system bottlenecks and inefficiencies, which performance engineers and SREs can then use to optimize components</p></li></ol><h2>What are some risks in observability practice?</h2><p>I mentioned several data flow issues earlier, but many of these can be kept in check by staying on top of the constant upgrades in observability tooling.</p><p>Some other issues are also top of mind in observability.</p><p>The most critical one I&#8217;ve encountered is the high cardinality issue.</p><p>I wrote a full guide on the cardinality conundrum to give you a view of what it means.</p><p>The high cost of observability systems (some reaching 7 figures) is also a bugbear for many software-dependent organizations.</p><p>Data quality issues can also arise. They can be directly related to the flow issues I mentioned earlier, but sometimes they&#8217;re a beast of their own.</p><p>And finally, the one I&#8217;ve been asked about more than once is alert fatigue. Poorly configured alerting can cause burnout in engineers from excessive cognitive load.</p><p>Whether it&#8217;s purely an observability-related issue is up for debate. I&#8217;ll leave it here.</p><h2>Wrapping up</h2><p>Observability is not as straightforward as it may be marketed to be.</p><p>There are a lot of nuances in making it work well and using the data to its full potential.</p><p>Keep the data flowing well, and it will serve your reliability work just as well.</p>]]></content:encoded></item><item><title><![CDATA[Solving Observability's Cardinality Conundrum]]></title><description><![CDATA[Introduction]]></description><link>https://read.srepath.com/p/observability-cardinality-conundrum</link><guid isPermaLink="false">https://read.srepath.com/p/observability-cardinality-conundrum</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 09 Apr 2024 12:00:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fd726afa-fea9-4584-8816-54ee574df401_1024x540.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><p>Cardinality is a term you&#8217;ll hear over and over again if you&#8217;re looking into how to do observability.</p><p>And especially if you are talking with vendors! They love this topic!</p><p>A lot of people have been thinking about high cardinality for a while, and for good reason.</p><p>Because high cardinality can cost you a LOT of money and time if you go about it wrong&#8230; but really, what we want to do is <strong>cut down</strong> <em><strong>excessive</strong></em><strong> cardinality</strong>.</p><p>The kind that doesn&#8217;t add value to your querying and intelligence.</p><p>That&#8217;s why dealing with cardinality is not straightforward. Cutting labels willy-nilly is not the answer. 
We&#8217;ll get onto ways to deal with this later on.</p><p>But we should first talk about what cardinality actually means.</p><h2>What is cardinality?</h2><p>Because it&#8217;s not something you think about every day&#8230; unless you&#8217;re an observability engineer or vendor.</p><p>The last time I heard &#8220;cardinality&#8221; being used this often was in my SQL classes, which was a while back.</p><p>So I did a refresher and went down a rabbit hole of math, logic, and all that fun stuff.</p><p>Cardinality refers to <em><strong>how many unique values there are in a data set.</strong></em></p><p>You&#8217;re essentially looking for how diverse &#8212; and ultimately complex &#8212; the data is.</p><h3>How do you differentiate between low and high cardinality data?</h3><ul><li><p>Low cardinality means low complexity with few dimensions to data. This is fine for analyzing aggregate data but lacks granularity that engineers often seek to solve system problems</p></li><li><p>High cardinality contains more data dimensions. This lets you slice and dice data for more detailed analysis, but you then have to deal with complexity issues.</p></li></ul><p>By dimensions, I mean attributes like method, status_code, instance_id, etc.</p><p><strong>Let&#8217;s run through a simple example to cement the difference:</strong></p><p>Let&#8217;s say you have the table <em>fruits</em> with only the fruit <em>apple</em> in your database. 
We want to add a <em>color</em> key to identify each apple in your table.</p><p>Now, the data on apples is low cardinality if you only find red or green apples like so:</p><p>But it becomes higher cardinality if you have apples with colors like orange, blue, red, pink, green, violet, black etc.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZNbn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZNbn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 424w, https://substackcdn.com/image/fetch/$s_!ZNbn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 848w, https://substackcdn.com/image/fetch/$s_!ZNbn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 1272w, https://substackcdn.com/image/fetch/$s_!ZNbn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZNbn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Low vs high cardinality data example visual&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Low vs high cardinality data example visual" title="Low vs high cardinality data example visual" srcset="https://substackcdn.com/image/fetch/$s_!ZNbn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 424w, https://substackcdn.com/image/fetch/$s_!ZNbn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 848w, https://substackcdn.com/image/fetch/$s_!ZNbn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 1272w, https://substackcdn.com/image/fetch/$s_!ZNbn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eb2ed1-2439-4109-a02e-c9360acd104e_1024x540.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The more color options you have for <em>apple</em>, the higher the cardinality of the <em>color</em> data.</p><h3>Examples of high cardinality</h3><p>Here are some examples of high cardinality data:</p><ul><li><p>email addresses 
(never append these to metrics!)</p></li><li><p>user IDs (<em>&#8220;Observability engineer: why is this in my beautiful TSDB?!&#8221;</em>)</p></li><li><p>IP addresses (sometimes appended for AppSec purposes)</p></li><li><p>instance names (fair use case for identifying instance issues)</p></li><li><p>pods (like the kind you find within Kubernetes)</p></li></ul><h2>3 key benefits of high cardinality in observability data</h2><p>High cardinality enables these traits within observability data:</p><h3>1. Granularity</h3><p>With granular data, you can slice and dice to deeper and deeper levels to precisely pinpoint where issues like performance degradation and outages are happening.</p><p>For example, with each <code>500 error</code>, you may want to dig a little further.</p><p>With a high cardinality metric, you can dig into the 5xx errors for <code>GET</code>, <code>POST</code>, and <code>DELETE</code> requests.</p><h3>2. Segmentation</h3><p>Segmentation helps you collect disparate data and organize it into chunks for easier human processing.</p><p>Examples include:</p><ul><li><p>payments by age group e.g. 18-24, 25-34, 35-49, etc.</p></li><li><p>payments by market region e.g. North America, South East Asia, LatAm, EU, etc.</p></li></ul><h3>3. Performance evaluation</h3><p>High cardinality means you can go beyond &#8220;success&#8221; and &#8220;fail&#8221; for responses.</p><p>You can classify responses into performance levels to improve reliability for different instances and groups.</p><p>Imagine what you can work out by putting responses into buckets scored from 0 to 10.</p><p><strong>Before we continue, I will assume that you know the answers to the following questions:</strong></p><p>What does observability data look like?</p><p>How does observability data flow?</p><p>Where does observability data get stored, and how?</p><h2>How do observability data types fare in terms of cardinality?</h2><p>Metrics are the main data type when we think about cardinality. 
But let&#8217;s still cover each of the 3 main observability data types to see how cardinality affects it:</p><h3>High cardinality in logs</h3><p>Logs of yesteryear are typically less affected by high cardinality than metrics and traces.</p><p>Some engineers still use unstructured logs for small-scale systems.</p><p>At the commercial scale, modern logs are structured or at least shifting toward that. JSON shows the most promise here.</p><p>So why shift to a structured format? A few things come into play like:</p><ul><li><p>Readability</p></li><li><p>Better ability to group or segment logs by keys for more refined analytical outcomes</p></li><li><p>Faster reads, as less string manipulation or processing has to take place when reading the aggregated log data</p></li></ul><p>You can derive metrics from logs now!</p><p>Logs in a commercial setting can experience high cardinality, just like metrics and traces.</p><h3>High cardinality in metrics</h3><p>Among the trio, metrics are the most significantly affected by high cardinality.</p><p>Metrics get complicated by the fact that they can have many dimensions.</p><p>Each new dimension has a multiplier effect: the series count is multiplied by the number of values that dimension can take.</p><p>This multiplication is what drives the high cardinality we are talking about.</p><p>You benefit in one way from this through more granularity that helps deeper analysis.</p><p>But then you also impact storage, querying, and visualization performance.</p><h3>High cardinality in traces</h3><p>Traces can also be affected by high cardinality.</p><p>Picture a trace as the bus route from New York to Los Angeles and back.</p><p>This route can be divided into sections &#8212; when you plan a trip along this route you may want to stop for a breather, food, or a bathroom break.</p><p>Each section of this route or round trip is equivalent to what in tracing we would call a span.</p><p>A common dimension for spans is the time 
or duration of that specific section of work.</p><p>We can add more detail or context for each span by appending metadata to the span.</p><p>You get high cardinality within tracing when you attach metadata like user IDs and specific names of what happened to your spans.</p><p>All this once again impacts storage and querying performance.</p><h2>Calculating the impact of cardinality within a metric</h2><p>Adding a dimension to a metric does not in itself cause high cardinality.</p><p>It&#8217;s how many unique values that key can take that determines this.</p><p>So adding a 6th dimension barely increases cardinality if it&#8217;s a boolean or a &#8220;success&#8221;/&#8220;fail&#8221; type, since it at most doubles the series count.</p><p>But if it were something like instance ID and you have 100 or 1,000 instances, it would multiply the series count by that much.</p><p>We do not need to calculate cardinality unless we are sitting in a math class.</p><p>We are more interested in how many time series a metric produces.</p><h3>Calculating time series count for a metric</h3><p>Let&#8217;s run through a simple example.</p><p>Say you have a metric called <em>network_latency_distribution</em> covering 100 instances with 10 buckets, 10 possible response codes, and 10 network paths.</p><pre><code>Calculating the series would look like this:
= 100 instances * 10 buckets * 10 response codes * 10 paths
= 100,000 series
</code></pre><p>This is a reasonable number of series, but things can get out of hand as you add more dimensions.</p><p><strong>Adding a dimension like </strong><em><strong>region</strong></em><strong> can significantly increase cardinality.</strong></p><pre><code>Say we have 6 regions to choose from. The calculation would become:
= 100 instances * 10 buckets * 10 response codes * 10 paths * 6 regions
= 600,000 series (! &#128533;)
</code></pre><p>Okay. It doesn&#8217;t look excessive, but it can significantly increase querying time!</p><p>I&#8217;ll share some query time data with you in a minute.</p><p><strong>The dimension of </strong><em><strong>pods</strong></em><strong> (in containerized environments) can increase cardinality by a huge factor, too.</strong></p><pre><code>Say we have a low 1,000 pods in action. The calculation would become:
= 100 instances * 10 buckets * 10 response codes * 10 paths * 1,000 pods
= 100,000,000 series (!! &#128550;)
</code></pre><p><strong>Excessive cardinality comes into the picture when you add a dimension like user_id.</strong></p><pre><code>Even with a modest user count of 10,000, the time series count blows out with:
= 100 instances * 10 buckets * 10 response codes * 10 paths * 10,000 users
= 1,000,000,000 series (!!! &#128561;)
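</code></pre><p>The four calculations above can be sketched in a few lines of Python. This is a minimal sketch, assuming every combination of label values actually occurs, so each result is an upper bound. The <code>series_count</code> helper and the dictionary keys are illustrative names, not part of any observability tool.</p><pre><code>from math import prod


def series_count(dimensions):
    """Worst-case series count: the product of unique values per dimension."""
    return prod(dimensions.values())


# Illustrative dimension sizes for the example metric network_latency_distribution
base = {"instance": 100, "bucket": 10, "response_code": 10, "path": 10}

# Add one extra dimension at a time, mirroring the calculations above
scenarios = {
    "base": base,
    "with region": {**base, "region": 6},
    "with pod": {**base, "pod": 1_000},
    "with user_id": {**base, "user_id": 10_000},
}

for name, dims in scenarios.items():
    print(f"{name}: {series_count(dims):,} series")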
</code></pre><p>We&#8217;ll talk about excessive cardinality in more detail later on.</p><p>&#128680; <strong>PSA:</strong> The more types of data you collect, the more fragmented your view of the whole becomes. And a fragmented view reduces your ability to TAKE ACTION on the data. Remember that <strong>Quality of data &gt; Quantity of data</strong>.</p><h2>High cardinality data is rising because of system trends</h2><h3>Shift from monolith to microservices architecture</h3><p>Where you had one humongous service emitting metrics, you now have 10, 20, 50, 100+ microservices, each emitting its own metrics. &#8217;Nuff said.</p><h3>Shift from VM-based to container-based infrastructure</h3><p>Who would&#8217;ve thought that life was easier when we depended on VM infrastructure rather than containers? &lt;/sarcasm&gt;</p><p>Containers have their benefits but generate a whole bunch of time-series data.</p><p>For every 1 VM in an older system, there are tens of containers to match workloads.</p><p>The key culprit behind this is their <em><strong>high ephemerality</strong></em>.</p><p>The ephemeral nature of containers refers to how they start, shut down, and are then replaced by new containers.</p><p>All this stop-start emits a lot of data for metrics collectors to ingest and push to storage.</p><h3>Serverless functions as a large part of the system</h3><p>Every time you invoke a Lambda function, that produces a metric with a time series. 
Depending on the particular service you are running on serverless, this time series number can be HUGE.</p><p>A whole bunch of factors come into play like:</p><ol><li><p>how often you're invoking the lambda function</p></li><li><p>the number of serverless functions in your system</p></li><li><p>concurrency of handling requests and scale</p></li></ol><p>Your time series data incurs higher and higher cardinality as these 3 factors rise in frequency or occurrence (depending on the factor).</p><h2>When to deal with high cardinality</h2><p>Remember how observability data flows. There are 4 distinct stages:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PUeG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PUeG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 424w, https://substackcdn.com/image/fetch/$s_!PUeG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 848w, https://substackcdn.com/image/fetch/$s_!PUeG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 1272w, https://substackcdn.com/image/fetch/$s_!PUeG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!PUeG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!PUeG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 424w, https://substackcdn.com/image/fetch/$s_!PUeG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 848w, https://substackcdn.com/image/fetch/$s_!PUeG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 1272w, https://substackcdn.com/image/fetch/$s_!PUeG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53f01e36-55fa-4fae-ad38-5e8caac71fa0_1024x288.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Dealing with high cardinality at the <em>Instrumentation</em> stage is not ideal. 
You will not confidently know what dimensions to filter out at this stage.</p><p>It&#8217;s also not ideal at the <em>Usage</em> stage. It&#8217;s too late to deal with cardinality issues because you&#8217;ve already incurred high ingestion and storage costs, so&#8230;</p><p>Deal with high cardinality at the <em>Ingestion &amp; Storage</em> stages. This is when you can put in practices like in-flight aggregation, cardinality isolation, and cardinality limiters.</p><h2>When to ditch high cardinality data</h2><p>High cardinality data is a necessary evil in some situations. As I mentioned earlier, it helps with granularity, segmentation, and deeper performance evaluation.</p><p>If it does not add value to your querying, intelligence, or alerting, ditch it.</p><p>It&#8217;s just an ornament to give you pretty data.</p><p>What I mean by this is it <em>looks</em> like it <em>could</em> be important.</p><p>But in reality:</p><ul><li><p>it gives you limited analytical value</p></li><li><p>it rarely contributes to usable insight <strong>and</strong></p></li><li><p>on top of either of those, it is resource intensive.</p></li></ul><p>If resource intensiveness did not cost money or time, we wouldn&#8217;t care about it.</p><p>We&#8217;d just let high cardinality data sit there and do its thing.</p><p>But it does cost money in terms of storage and processing power.</p><p>And it costs time in terms of how long you have to wait before you can start working with the data.</p><p>And it&#8217;s definitely something you need to look into if it&#8217;s slowing things down to the point where you&#8217;re pushing past your MTTR target.</p><p>(MTTR = your mean time to recovery, repair, resolution, whatever you want to call it).</p><p>Now, you might be saying, &#8220;Thank you, Captain Obvious.&#8221;</p><p>But think about how often this is still a real problem in software systems.</p><p>You need to pose it as a challenge for people to think about and solve.</p><h2>The problem with 
<em>excessive</em> cardinality</h2><p>The working group behind the open-source Prometheus monitoring tool has warned about this for a while.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PQcc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PQcc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 424w, https://substackcdn.com/image/fetch/$s_!PQcc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 848w, https://substackcdn.com/image/fetch/$s_!PQcc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 1272w, https://substackcdn.com/image/fetch/$s_!PQcc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PQcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!PQcc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 424w, https://substackcdn.com/image/fetch/$s_!PQcc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 848w, https://substackcdn.com/image/fetch/$s_!PQcc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 1272w, https://substackcdn.com/image/fetch/$s_!PQcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b851b84-dee2-4011-8c7c-9218741d74f7_1024x173.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Excerpt from <a href="http://Prometheus.io">Prometheus.io</a> documentation on <a href="https://prometheus.io/docs/practices/naming/">Metric and label naming</a></p><p>A quick TLDR of their alert message on key-value pairs:</p><blockquote><p>&#8220;Every unique combination of key-value pairs represents a new time-series. 
This significantly increases the data stored.&#8221;</p></blockquote><p>An ever-increasing number of possible key-value pairs or dimensions does something sinister within time series databases (TSDBs).</p><p>The number of series for a single metric will explode &#128165;&nbsp;to the point that your querying eventually slows down to a crash.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0EGq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0EGq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 424w, https://substackcdn.com/image/fetch/$s_!0EGq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 848w, https://substackcdn.com/image/fetch/$s_!0EGq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 1272w, https://substackcdn.com/image/fetch/$s_!0EGq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0EGq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0EGq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 424w, https://substackcdn.com/image/fetch/$s_!0EGq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 848w, https://substackcdn.com/image/fetch/$s_!0EGq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 1272w, https://substackcdn.com/image/fetch/$s_!0EGq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0698506b-f6d3-4d8c-9be5-fcdee4925179_636x615.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Query performance data by Chris Marchbanks (ex-Splunk, Grafana Labs) says it all.</p><p>This is how long it took to query quantities of time series:</p><ul><li><p>100,000 series took 1.5 seconds (acceptable)</p></li><li><p>200,000 series took 5 seconds (slow-ish)</p></li><li><p>10,000,000 series took 15 minutes (!)</p></li></ul><p>Keep this in mind: high or 
(better put) <strong>excessive cardinality is a data problem at its core</strong>.</p><p>Not only does it slow querying to a crawl, but it can also cause trouble like:</p><ul><li><p>data being wasted on dimensions that are not needed for system improvements</p></li><li><p>the need for engineers to constantly maintain the performance of the observability system</p></li><li><p>higher storage and processing needs, which mean a higher cost to run</p></li></ul><p>We will think through a few solution starters toward the end to prevent or at least reduce this risk.</p><h2>What&#8217;s contributing to excessive cardinality in a software system?</h2><h3>Bad dimension selection</h3><p>Remember my calculation example above?</p><p>Putting in user or request IDs can skyrocket your observability metrics&#8217; cardinality.</p><p>Each unique identifier creates a new time series, increasing the overall cardinality of the metric.</p><h3>Improper sampling practices</h3><p>Observability systems generate a ton of data all the time.</p><p>It&#8217;s not an easy task to query all of that data all the time.</p><p>This is where you want to bring in sampling practices to select a portion of the data to analyze.</p><p>But if you don&#8217;t use the right sampling techniques, you will deal with high cardinality data and more series than your dashboard or query tool can handle.</p><h3>Unbounded event types</h3><p>Metrics are not the sole culprits for pushing out excessive cardinality data.</p><p>Putting weak boundaries around your event data can do the same to logs and traces.</p><p>A good system will have few event types while a system suffering from excessive cardinality will have numerous possible event types.</p><p>The more event types there are, the more logs and spans you&#8217;ll have to push to ingestion.</p><p>This is not an exhaustive list of ways excessive cardinality can happen. 
I want to illustrate the idea that there are several ways you can end up with it.</p><h2>How to solve <em>excessive</em> cardinality</h2><h3>Be selective about dimensions</h3><p>I feel like I&#8217;ve mentioned it several times already, but never, ever use dimensions like email address, user ID, transaction ID, or anything with overly unique data in your metrics.</p><h3>Split the metric into smaller metrics</h3><p>Ask yourself these two questions:</p><blockquote><p>Do I need to have this single metric with all these dimensions?</p></blockquote><blockquote><p>Can I split it into two separate metrics that can still help me answer the questions I will pose at querying, and still give me the necessary alerting?</p></blockquote><p>An example might better highlight why you&#8217;d want to do this:</p><p>Let&#8217;s return to our metric called <em>network_latency_distribution</em>, which covers 100 instances with 10 buckets, 10 possible response codes, and 10 network paths.</p><pre><code>Calculating the series would look like this:
= 100 instances * 10 buckets * 10 response codes * 10 paths
= 100,000 series
</code></pre><p>Now what would happen if we were to split this metric into 2 individual metrics, one without paths and one without buckets?</p><pre><code>Here's the first metric without paths:
= 100 instances * 10 buckets * 10 response codes
= 10,000 series
</code></pre><pre><code>Here's the second metric without buckets:
= 100 instances * 10 response codes * 10 paths
= 10,000 series
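</code></pre><p>As a rough check, the split can be sketched in Python. This is a minimal sketch that assumes worst-case series counts, where every label combination occurs. The dictionary keys and the <code>series_count</code> helper are illustrative names, not part of any tool.</p><pre><code>from math import prod


def series_count(dimensions):
    """Worst-case series count: the product of unique values per dimension."""
    return prod(dimensions.values())


# Illustrative dimension sizes for network_latency_distribution
original = {"instance": 100, "bucket": 10, "response_code": 10, "path": 10}

# Split into two metrics: one drops "path", the other drops "bucket"
without_paths = {k: v for k, v in original.items() if k != "path"}
without_buckets = {k: v for k, v in original.items() if k != "bucket"}

before = series_count(original)
after = series_count(without_paths) + series_count(without_buckets)

print(f"before: {before:,} series")
print(f"after: {after:,} series")
print(f"saved: {before - after:,} series")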
</code></pre><p>This gives us a grand total of 20,000 series and <strong>a whopping 80,000 series reduction</strong>!</p><p>This split works perfectly if we don&#8217;t need to correlate paths with buckets to solve system issues.</p><p>How much faster would it be to query 20,000 vs 100,000 series?</p><p>Answer: enough to feel instantaneous vs. time spent hearing, &#8220;Please wait while we process your query.&#8221;</p><h3>Allow high(er) cardinality for high-value metrics only</h3><p>It can still make sense to generate a whole bunch of series for metrics that add business value.</p><p>You may have some metrics that need 100,000 series rather than being split into multiple mini-metrics and losing their strength.</p><p>How do we define a high-value metric? A metric is high-value if it supports:</p><ul><li><p>critical decision-making processes</p></li><li><p>developing actionable insights or</p></li><li><p>enhancement of overall system performance</p></li></ul><p>What meets one of these criteria depends on your software architecture, industry context, and the problems you&#8217;re looking to solve.</p><p>Here are a few considerations to make:</p><ol><li><p>These kinds of metrics might not be suitable for time-critical needs like alerting</p></li><li><p>You can try to reserve the high-value designation for metrics used in ad-hoc queries, low-usage dashboards for specialist or special interest groups, and periodic reports</p></li><li><p>You still need to weigh the impact vs querying &amp; visualization time</p></li><li><p>Even a high-value metric can start looking too expensive with the way observability is priced these days, so work out the costs vs the value you attain and discuss with management</p></li></ol><p>We must still remember to control cardinality levels to meet our cost and time-to-productivity constraints.</p><h2>Wrapping up</h2><p>If you can only remember one thing from this guide, I want it to be this &#128071;&#127996;</p><p><strong>Cardinality is valuable, but 
excessive cardinality is expensive &#8212; in terms of time to query, cost to store, and resource consumption to process &amp; analyze. Keep it in check.</strong></p>]]></content:encoded></item><item><title><![CDATA[#35 Boosting your Observability Data's Usability ]]></title><description><![CDATA[Starting this newsletter off with some changes to the newsletter then onto the writeup about this podcast episode.]]></description><link>https://read.srepath.com/p/boost-observability-usability</link><guid isPermaLink="false">https://read.srepath.com/p/boost-observability-usability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 02 Apr 2024 12:15:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/be121f9c-db6c-4bc8-8169-d4d6a58ffdc5_691x650.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey SRE friend!</p><p>I&#8217;ve decided to focus this newsletter on publishing written-only content in the future. The podcast writeups I used to do will not continue after Episode #35.</p><p>That episode&#8217;s titled <em>Boosting Your Observability Data&#8217;s Usability</em>. 
Check it out if you are interested in observability and maximizing its value.</p><p>There&#8217;s a write-up on the episode in the second half of this newsletter.</p><p>To stay on top of the podcast, follow us on your favorite player (or <a href="https://www.srepath.com/podcast/">bookmark this link to the web player</a> on the SREpath website).</p><p>Here are a few of the top podcast players that you&#8217;ll find the SREpath podcast on:</p><ul><li><p><a href="https://open.spotify.com/show/6NQJio0Lyu0aa9vt98O0ab?si=6f46e24f52f44125">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/au/podcast/s-r-e-path-podcast/id1683437925">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com.au/podcasts/d7eb5ead-33a4-4c64-9c43-f540bec6c2ce/s-r-e-path-podcast">Amazon Music</a></p></li><li><p><a href="https://podcasts.google.com/feed/aHR0cHM6Ly9hbmNob3IuZm0vcy9kZjgxNzVlYy9wb2RjYXN0L3Jzcw">Google Podcasts</a></p></li></ul><p><strong>Episode 35 [SREpath Podcast]</strong></p><h2>About this episode</h2><p>The observability (o11y) revolution is underway with many organizations already instrumenting services. 
But are we getting the most from the data that is being collected?</p><p>Richard Benwell thinks we have room for improvement in this area, especially at the usage stage where we query and visualize the o11y data.</p><p>He is the founder and CEO of SquaredUp, a dashboard software company based out of Maidenhead, UK with over 10 years of experience in the monitoring space.</p><p>Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on collecting data to leveraging it for actionable insights.</p><p>You can <a href="https://www.linkedin.com/in/richard-benwell-ab887b11/">connect with Richard via LinkedIn</a></p><h2>Key concepts we explored</h2><p>Our conversation touched on fascinating observability concepts like:</p><ol><li><p><strong>Overemphasis on Collecting Observability Data</strong>. The key theme of our conversation was that observability shouldn't solely focus on collecting vast amounts of data. The data should be <em>used judiciously</em> for insights and actionable intelligence rather than just being stored.</p></li><li><p><strong>A Human Approach to Observability.</strong> Richard shared the need for human intuition and engagement in observability. Engineers should prioritize understanding how to use data effectively for engaging human stakeholders rather than solely focusing on the technical aspects.</p></li><li><p><strong>Socio-Technical Perspective of Observability.</strong> Observability is not just a technical challenge but also a socio-technical one. It involves understanding human behavior, collaboration, and organizational dynamics alongside technical infrastructure.</p></li><li><p><strong>Visualization Increases Engagement with O11y Data.</strong> It can make observability data more meaningful and engaging for stakeholders. 
The core premise of this is to simplify complex data into <em>simple visuals</em> that provide clear and rapid insights as well as give context.</p></li><li><p><strong>Observability in Big Tech vs. Other Industries</strong>. We need to recognize that most of the ideas being shared are by BigTech for BigTech. While big tech focuses heavily on real-time data and rapid innovation, other industries need to prioritize stability, reliability, and business continuity.</p></li><li><p><strong>Dashboards Should Be Interactive.</strong> While dashboards are valuable for sharing information, they are not very robust as static views. Aim for interactive dashboards that allow for exploration and drilling down into observability data.</p></li><li><p><strong>Keep your Stakeholders Engaged</strong>. Lack of engagement in observability leads to a vicious cycle of declining data quality and relevance. Engaging users and stakeholders ensures continuous feedback and improvement in observability outcomes.</p></li><li><p><strong>Make Observability Data Relevant to Business Context. </strong>Tailor your querying and visualizations to the specific needs and context of the business. You could use approaches like representing data in the form of a customer journey, to make data more relatable and understandable for stakeholders.</p></li><li><p><strong>Don't Sleep on Continuous Improvement in o11y</strong>. Just like anything in technology, observability is an ongoing process. 
It requires continuous improvement and adaptation to changing business needs, technological advancements, and user feedback.</p></li></ol><div><hr></div><p>You will not want to miss Richard's insights that can improve the usability and ROI of your observability data.</p><p>In&nbsp;<a href="#">Episode #35 of the SREpath podcast, Richard Benwell gives ideas on making the most of your observability data</a>&nbsp;[Spotify link]</p>]]></content:encoded></item><item><title><![CDATA[#30 Clearing Delusions in Observability (with David Caudill)]]></title><description><![CDATA[Episode 30 [SREpath Podcast]]]></description><link>https://read.srepath.com/p/clearing-observability-delusions</link><guid isPermaLink="false">https://read.srepath.com/p/clearing-observability-delusions</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 05 Mar 2024 11:59:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7920!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Episode 30 [SREpath Podcast]</strong></p><p>Keep scrolling for a full write-up on this topic below &#128071;&#127996;</p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;subtitle&quot;:&quot;&quot;,&quot;description&quot;:&quot;&quot;,&quot;url&quot;:&quot;https://open.spotify.com/embed/episode/49z2l9MDQumamJpYr6zyb0?utm_source=generator&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:true}" src="https://open.spotify.com/embed/episode/49z2l9MDQumamJpYr6zyb0" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM" scrolling="no"></iframe><h2>Show notes</h2><p>How critical is observability (o11y) to SRE work?</p><p>To me,&nbsp;observability is&nbsp;<em>the</em>&nbsp;core foundation 
practice for&nbsp;<em>all&nbsp;</em>other SRE practice areas.&nbsp;Without it, you're flying blind.&nbsp;</p><p>David Caudill&nbsp;is not afraid of voicing controversial viewpoints about this area. But he doesn't do it for glory. To him, it's about&nbsp;driving better practices.&nbsp;</p><p>After all, it's not in his interest to promote (sell) shiny objects.&nbsp;He runs engineering teams at Capital&nbsp;One, one of America's largest banks, so he wants a functional offering more than overdone hype.&nbsp;</p><p>He believes that delusions are getting in the way of our success in observability. We explore some of them in this episode of the SREpath podcast.</p><p>You can <a href="https://www.linkedin.com/in/davidcaudill/">connect with David via LinkedIn</a>.</p><h2>More about our conversation</h2><p><strong>David's stance is simple:</strong>&nbsp;observability itself is&nbsp;not bad; it's just that we are often not doing it with the right mindset! We are seeking elegant technical solutions to complicated real-world problems.&nbsp;Notice the disconnect?</p><p>I had a chat with him a few weeks back about all of this.</p><p>Our conversation touched on ideas you likely&nbsp;won't have heard anywhere else, including:</p><p>&#10145;&#65039; David's analogy for&nbsp;<strong>observability of&nbsp;software architectures</strong>&nbsp;"from monoliths to rocks,&nbsp;pebbles and gaseous clouds"</p><p>&#10145;&#65039; The&nbsp;<strong>need to handle cognitive load&nbsp;</strong>effectively when it comes to your observability system by simplifying measures (inspired by Google)</p><p>&#10145;&#65039;&nbsp;<strong>Moving toward event-based SLOs&nbsp;</strong>rather than leaning too heavily&nbsp;on time-based metrics for your observability&nbsp;</p><p>Let's unpack each of these:</p><h3><strong>Observability of software architectures "from monoliths to rocks, pebbles, and gaseous clouds"</strong></h3><p>Did I mention that David likes to use fun and peculiar analogies to highlight his 
ideas? I love it, and you might too.&nbsp;</p><p>But what does he mean by this particular analogy (above)?&nbsp;&#129300;</p><p>David put it like this (quote truncated for brevity):</p><blockquote><p><em>"They start with a monolith, and they decide, 'we could break this up into a few service oriented architecture' blobs [rocks]...&nbsp;then go to microservices [pebbles], and then into lambdas [gaseous clouds]. Before long, you're down to really, really intense atomicity in this architecture."</em></p></blockquote><p>This is a common enough&nbsp;pattern in more than a few organizations whose inner workings I know. It makes sense, as microservices is the shift that cloud pundits have been pushing for a decade now.&nbsp;</p><p>And serverless is no longer the quiet one in the corner.</p><p>So all seems well and good.&nbsp;How much can atomicity hurt?</p><p>Well, it can be painful if your observability system can't handle it.</p><p>David added that&nbsp;this kind of atomicity&nbsp;adds cardinality to observability.&nbsp;It adds overhead that itself needs to be observed. 
This overhead can start to feel as&nbsp;large as the application itself.</p><p>I've heard people echo David's sentiment that this overhead can become&nbsp;a completely invisible cloud of noise around your application that has nothing to do with your application.</p><p><br>A cautionary tale indeed for carefully reworking your software architecture.&nbsp;Be sure to keep your o11y capabilities in mind when doing so!</p><h3>Handling cognitive load when it comes to observability systems&nbsp;</h3><p>Our discussion of architectural woes in o11y&nbsp;drove me to bring up&nbsp;an area that is so critical to successful engineering work but is often neglected:&nbsp;<em>cognitive load</em>.&nbsp;</p><p>Observability systems are overloading engineers with too much data and things to see.&nbsp;</p><p>David's suggestion to my assertion around cognitive load made me chuckle and grimace at the same time:</p><blockquote><p>"<em>I'm very much of the opinion that you want to preserve simplicity as long as you can. There's a lot of anticipatory architecture changes that happen really on the optimism like, 'Oh, it's going to blow up!' I have worked in environments where it didn't blow up and it was still really complex because we were prepared for a legion of people that never showed up."</em></p></blockquote><p>He added that engineers need to "avoid complexity like the plague" and keep things as simple as they possibly can.&nbsp;</p><h3><strong>Moving toward event-based SLOs rather than solely time-based metrics</strong></h3><p>Drawing from experience, David recalls his attempts at making time-based SLOs work. Over time and several failures, he learned that event-based SLOs are more appropriate in many situations.&nbsp;</p><p>He stresses the rationale for not using time-based SLOs through this&nbsp;quote:</p><blockquote><p><em>"Because not every minute of time is the same as every other minute. And if your service goes down when no one is using it, who cares? That's not a problem. 
And, you know, you're not, in most cases, contractually obligated to provide this many minutes a month."</em></p></blockquote><p>Chasing time-based SLOs can become "a really confusing side quest" that's difficult to connect to reality.&nbsp;David emphasizes that SLOs are a sociotechnical construct and because of this, they need people to buy into what you're trying to achieve.&nbsp;</p><p>His experience has shown that it's a lot easier to get people behind simpler SLOs like error rate rather than time-based SLOs. For example, you could set SLOs for the&nbsp;number of 50<em>x</em>&nbsp;errors occurring in your services.</p><p>Your service might normally return a 0.1% error rate, but if that number doubles, it can start a conversation. The key behind doing any of this is knowing that you can't do bottom-up with SLOs. You need senior leadership support to make it happen.<br><br>Without that buy-in, things can get really tough.&nbsp;This could explain the sheer volume of failed SLOs in the industry today.&nbsp;&nbsp;</p><p>You need to invest your time in SLOs, but keep it simple. David recommends not&nbsp;jumping into the hype surrounding SLOs in the market.</p><div><hr></div><h2>Here are 10 more takeaways from the show:</h2><ol><li><p><strong>Understand your&nbsp;observability billing model:</strong>&nbsp;Gain a clear understanding of your service provider's billing model to avoid unexpected expenses. This helps in planning and optimizing costs associated with high-cardinality data and log retention.</p></li><li><p><strong>Employ cost optimization tactics:</strong>&nbsp;Tactics&nbsp;like sampling and selective logging can help manage costs without sacrificing the visibility needed for effective monitoring. 
This is particularly valuable in high-traffic scenarios where full-resolution logs are not financially justifiable.</p></li><li><p><strong>Enhance your log retention policies:</strong>&nbsp;Use automated tools to manage log retention, ensuring that you're only keeping what's necessary and utilizing cost-effective storage solutions like Amazon S3 for long-term storage.</p></li><li><p><strong>Make querying and data access cost-effective:</strong>&nbsp;Look into using&nbsp;tools like Amazon Athena for querying large datasets at a low cost, which enables you to access your logs as needed without incurring high storage and processing fees.</p></li><li><p><strong>Take the time to prioritize your&nbsp;data:</strong>&nbsp;Focus on logging and monitoring the most critical aspects of your system to avoid information overload and reduce costs. This involves understanding what data is truly valuable for your specific context.</p></li><li><p><strong>Learn to differentiate status vs. diagnostic information:</strong>&nbsp;Status indicators and diagnostic data often get mixed up as solving the same problem.&nbsp;They don't.&nbsp;Simplify your monitoring by providing clear status indicators (red, yellow, green) for quick assessments and&nbsp;reserve detailed diagnostic data for deeper analysis.</p></li><li><p><strong>Effectively leverage the power of metrics:</strong>&nbsp;Develop work metrics that reflect the&nbsp;<em>actual</em>&nbsp;performance and health of your platform. 
This approach helps in quickly identifying issues without sifting through irrelevant data.</p></li><li><p><strong>Simplify your incident management:</strong>&nbsp;Aim for a "single pane of glass" dashboard that offers a straightforward view of system health to reduce time spent in identifying and debating the presence and scope of issues.</p></li><li><p><strong>Educate your stakeholders&nbsp;and manage their expectations:</strong>&nbsp;Help your various stakeholders understand the cost-benefit analysis behind observability practices including&nbsp;data logging and retention strategies that foster support for cost-effective practices.</p></li><li><p><strong>Go beyond the metrics mindset:</strong>&nbsp;Recognize the value of logs and stack traces as critical diagnostic tools that can support your monitoring and alerting. Ensure that your observability strategy integrates these elements effectively for pinpointing issues.</p></li></ol><div><hr></div><p>You will not want to miss his insights (and his charismatic humor).</p><p>In&nbsp;Episode #30 of the SREpath podcast, David Caudill gives me his take on clearing delusions in observability.</p><p>I&#8217;m sure your team will appreciate you clearing any accidental misgivings they may have developed about observability&nbsp;&#128513;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7920!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7920!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 424w, 
https://substackcdn.com/image/fetch/$s_!7920!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 848w, https://substackcdn.com/image/fetch/$s_!7920!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!7920!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7920!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png" width="1456" height="1367" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1367,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3131885,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7920!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 424w, 
https://substackcdn.com/image/fetch/$s_!7920!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 848w, https://substackcdn.com/image/fetch/$s_!7920!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!7920!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ef79c9-3d67-4ecf-8c96-f3fa84d7d931_1562x1466.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[#13 Making Sense of OpenTelemetry and Observability (with Adriana Villela)]]></title><description><![CDATA[Episode 13 [SREpath Podcast]]]></description><link>https://read.srepath.com/p/making-sense-opentelemetry-observability-adriana-villela</link><guid isPermaLink="false">https://read.srepath.com/p/making-sense-opentelemetry-observability-adriana-villela</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 31 Oct 2023 12:41:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hjhf!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ee1dc2-77bf-4ffa-b056-f66dac8ad0d0_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Episode 13 [SREpath Podcast]</strong></p><iframe class="spotify-wrap" data-attrs="{&quot;image&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;subtitle&quot;:&quot;&quot;,&quot;description&quot;:&quot;&quot;,&quot;url&quot;:&quot;https://podcasters.spotify.com/pod/show/srepath/embed/episodes/Making-Sense-of-OpenTelemetry-and-Observability-Landscape-with-Adriana-Villela-e2aknqg/a-aag07rp&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:true}" src="https://podcasters.spotify.com//pod/show/srepath/embed/episodes/Making-Sense-of-OpenTelemetry-and-Observability-Landscape-with-Adriana-Villela-e2aknqg" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM" scrolling="no"></iframe><p>Ash Patel interviews Adriana Villela who is a CNCF ambassador, OpenTelemetry contributor, and senior developer advocate at Lightstep.</p><p>Adriana talks about her experiences discovering observability, life as a team leader, and the promise of OpenTelemetry.</p><p>She sheds light on the importance of 
observability practices and the role of OpenTelemetry in standardizing instrumentation collection, enabling organizations to achieve comprehensive observability in complex systems.</p><h2>Episode Transcript</h2><p><strong>Ash Patel:</strong>&nbsp;Adriana, great to have you. I'm looking forward to learning what you have to teach us about all things SRE, or at least a few interesting things you've been working on recently.</p><p>And I know you've been talking about OpenTelemetry quite a bit. So, looking forward to talking about that. But let's do a quick introduction of who you are and what you do.</p><p>What do you do SRE-wise in terms of you and your company?</p><p><strong>Adriana Villela:</strong>&nbsp;I guess I'm in the SRE-adjacent space if you will. I guess it is part of SRE.</p><p>So I'm a senior developer advocate at Lightstep, which now is part of ServiceNow. So it's actually called ServiceNow Cloud Observability. And Lightstep, or ServiceNow Cloud Observability (I'm still getting used to the new name), is an observability tool. And observability I would say is one of the important aspects of SRE.</p><p>What I do as part of my job is to educate folks on observability practices with a focus on OpenTelemetry, which is a great open-source CNCF project that aims to standardize how we collect instrumentation from our systems so that we can achieve those beautiful observability goals.</p><p><strong>Ash Patel:</strong>&nbsp;OpenTelemetry has got to be one of the hottest projects out there on CNCF right now.</p><p><strong>Adriana Villela:</strong>&nbsp;Mm hmm. I think it's got the second-highest contributions behind Kubernetes, which is pretty wild when you think about it.</p><p><strong>Ash Patel:</strong>&nbsp;I was gonna say. I know Kubernetes is number one for obvious reasons.</p><p>Actually, not obvious reasons, but you know what I mean. Yeah. 
Yeah, OpenTelemetry is such an important area.</p><p>I say to a lot of people that observability is the foundation practice that any cloud operation needs to have to be effective.</p><p><strong>Adriana Villela:</strong>&nbsp;Yes, yes, absolutely.</p><p>It is such a foundational practice. I think one of my pet peeves, and I had to edit myself when I was introducing myself and talking about SRE because I think a lot of the industry tends to treat observability as a separate thing when in fact it's part of the overall practice.</p><p>When you consider the fact that we are in this microservices world, right? I think a lot of organizations have moved away from monoliths and have embraced the microservices model, which, you know, has its ups and downs.</p><p>But one of its downs is you have so many different services. And depending on the size of your organization, you can have like a ton of microservices interacting with each other in ways that you would never even predict. Right? And so observability helps to untangle those interactions.</p><p>All of a sudden, you're like, oh, I see how data flows from point A to point B of my application.</p><p><strong>Ash Patel:</strong>&nbsp;You mentioned monoliths. What are your thoughts on a lot of companies talking about how they're switching back to monoliths, but I don't know why they would do that. Is it to save money or is it to cut down on their spending on doing things like observability? 
What are your thoughts on monoliths?</p><p><strong>Adriana Villela:</strong>&nbsp;I'm kind of happy to hear that because I do feel like a lot of organizations went on this microservices craze because it was a fad.</p><p>And so I feel like some C suite executive attended some seminar or whatever where they were talking about microservices and then all of a sudden it's like, Oh, we must do this at, our organization, it's going to solve all of our problems and feel like a lot of companies rushed onto the microservice bandwagon without really thinking about does this actually do what I need it to do?</p><p>So I'm kind of happy that some companies are taking a step back to the monolith.</p><p>I think what people conflate with monoliths and microservices is that there are still efficiencies to be made when it comes to monoliths where I think if you structure your project properly, you don't have to have some of those huge long build times that you would normally see with your monoliths in the past. And I think that's one of the things that I think folks try to alleviate with the microservices model. Cause it's like, you just change one little thing. You don't have to build the entire frigging project.</p><p>But I think if you structure your monolithic app properly, it doesn't have to be painful either.</p><p><strong>Ash Patel:</strong>&nbsp;Exactly. You don't actually have to do a monolith as a waterfall project either. Mm. Mm. Which I think people construe a few things together. They say, hey, microservices is agile, monoliths are waterfall.</p><p>And I'm saying, no. But anyway, we could talk about architectural decisions all night long.</p><p>How did you get to working with people in observability in the first place? 
How did you get into the space in terms of your work history, your personal history with it?</p><p><strong>Adriana Villela:</strong>&nbsp;So I'll say I got into the observability space.</p><p>I caught wind of Charity Majors' tweets when I was doing some research for something else. I had like a little side hustle with, with a friend of mine and I don't know, we were just doing some late-night research and then I happened upon one of her tweets, which I don't think even had anything to do with observability.</p><p>And then my friend, he started obsessing and reading all her tweets. And then he's like, Hey, you should check out this observability thing. It's really cool. And I'm like, okay, but I don't get what it is. And he tried to explain it to me a bunch and I'm like, I sorta get it, but I don't really get it.</p><p>And then, like, the company where I was at the time, they were exploring some observability products. They had some microservices and they were like looking into getting into Kubernetes. They, they were like hung up on this one product.</p><p>I'm like, you know, we should check out these other products around observability. Me saying this with like a vague notion of what observability was about. And then a couple of months later I found myself working at Tucows where I was managing two teams. It was a platform engineering team where, instead of using Kubernetes, they were using Nomad.</p><p>So I had to learn that. But I was also managing an observability team, which was kind of funny because I had a vague notion of observability from my, you know, like, my chats with, with my friend during our side hustle. But also, I was trying to wrap my head around it, and when I got this job I started thinking, I'm like, I'm like leading a team of people.</p><p>I better know what observability is because otherwise how am I supposed to like lead an entire organization on this journey, right? 
So I set out to do what I do best, which is learn the thing that I don't know so I can do my job. And as part of that, I started blogging about it.</p><p>And I already had a blog at the time on Medium which was focused on, like, more the Kubernetes, Argo CD side of things. And so I started exploring observability. Blogging about it.</p><p>As I learned more, then I was able to start setting a direction for my team. So this team that I inherited, it was an observability team.</p><p>I'll say this in air quotes. And really what the team did was manage a set of, of open source tools in house. But at the same time, the organization was sending its observability data to a SaaS vendor. So I'm sitting here thinking, okay, we're already paying somebody to receive our data. Why the hell is this team managing a set of tools in house?</p><p>At this point, I had a better idea of what observability was all about, and I said, let's pivot this team completely. I want to focus on observability practices and not on observability tooling. Leave the tooling to the experts. So then I sought to educate my team on like, what are proper observability practices?</p><p>And that's around the time that I got wind of OpenTelemetry. Started looking at it a little bit closer. So the company where I was before, I think that was the first time I'd heard of OpenTelemetry, but it seemed like, I don't know, it, it, it seemed like a very small thing. And then when I, when I went to my job where I was managing this observability team, it was something that I started looking at more closely as I started reading blog posts about it and whatnot and educating myself and I'm like, huh, I feel like this is the future of observability.</p><p>So I started pushing really hard for the organization to adopt OpenTelemetry, which was very difficult because I found myself in a position, this was 2021, where OpenTelemetry, like, the tracing specification was not yet in general availability. 
And this was like summer, fall of 2021. And I'm sitting there thinking, no, no, this is going to be the next big thing.</p><p>I assure you it's going to be the next big thing. But how can I convince these folks that OpenTelemetry is the next big thing when tracing isn't even in general availability?</p><p>So I thought, I need to calm these folks' concerns. So what I did is what I do best, I think, and something I've been able to leverage in my capacity as a developer advocate now, which is make connections with people.</p><p>And so I had made some friends in the OpenTelemetry space.</p><p>I'd talked to a couple of folks from two different vendors that we were also considering at the same time, because, you know, we had the SaaS vendor, but I'm like, well, is this the right fit for us? We should explore other vendors.</p><p>I reached out to these folks who worked for other vendors, but they put on their OpenTelemetry hats for this, right? I said, please come speak with us on a unified front under the OpenTelemetry umbrella to answer questions and concerns for folks in the organization who are hesitant about adopting OpenTelemetry.</p><p>And to my surprise, they went along with my idea. I honestly thought, this is a total long shot, this is not happening. But they went with it and I remember telling my team, we need some seed questions in case this thing is going to be crickets, otherwise I'm going to look like a total ass.</p><p>So we came up with some seed questions, but guess what?</p><p>For the whole hour, the folks from the dev teams, they had one question after another, and we didn't even have to ask any of our seed questions, which I was very relieved about. To me, it showed that there was an interest in what was being done.</p><p>There was a genuine interest in OpenTelemetry, and I think at the end of the day, mission accomplished. We managed to quell those fears as a result. 
And so the company was a lot more comfortable in terms of adopting OpenTelemetry. So that was my original foray into it. And then I ended up getting hired by one of the observability vendors as a developer advocate because of my blog posts around observability and OpenTelemetry, my explorations.</p><p>Which for me, honestly, is, is something that I've dreamed of.</p><p>I love educating folks through my blogging and I love digging into technical topics. So I feel like it was the best use of my, of my skills. And it's been great because also like this job has been my first opportunity to contribute to open source, which I'd never done before.</p><p>And it's scary if you've never contributed to open source cause you're being vulnerable. You're putting yourself out there. It is terrifying. Like what if someone says your thing is crap, but I've been lucky cause in the OpenTelemetry community, no one goes and says your PR is crap.</p><p>They always have very thoughtful suggestions always useful suggestions. So, even though I'm scared every single time, my fears are always put at ease because the folks are just genuinely nice about it.</p><p>That's been my journey into OpenTelemetry and observability.</p><p><strong>Ash Patel:</strong>&nbsp;That's the great thing about open source communities. 
It's psychologically safe in that everyone's trying to learn and trying to improve what they're working on and they'll share ideas, but in a constructive way.</p><p><strong>Adriana Villela:</strong>&nbsp;Yes, absolutely.</p><p><strong>Ash Patel:</strong>&nbsp;And I said that because I'm calling out to people to try and do the same as Adriana and contribute to open source if you can, and if you have the time to do it.</p><p><strong>Adriana Villela:</strong>&nbsp;I do want to mention one thing actually in that same vein. When I was at Tucows, I myself wasn't practicing what I was preaching, but I did encourage my team as a whole, like my observability team, when they were running into issues like, Oh, I'm waiting on this feature for OpenTelemetry and blah, blah, blah.</p><p>I'm like, dude, just try to work on it if you can. If you feel like you can put in a fix, do it. It took me a little bit longer and, you know, switching roles to muster the courage to do it myself. But I totally agree with you.</p><p>I think it's so important for us to contribute to open source.</p><p>And for our organizations to encourage contributing to open source, especially if we are consumers of that open source software.</p><p><strong>Ash Patel:</strong>&nbsp;I have a question about your thoughts on how organizations that are non tech native feel about open source. We'll get to that in a second, but I want to actually go back one step and talk about how OpenTelemetry fits into the whole observability area of software.</p><p><strong>Adriana Villela:</strong>&nbsp;I think the important distinction to make is that OpenTelemetry is not equal to observability. OpenTelemetry is one of those tools that enables observability. And honestly, like, you know, before OpenTelemetry, there were things like OpenTracing and OpenCensus. 
OpenTelemetry is at the forefront now because it is the one that all of the big vendors have rallied behind and, and for a little bit of history, open source projects like OpenCensus and OpenTracing, they actually merged to form OpenTelemetry.</p><p>But I wouldn't be surprised if, like, somebody for funsies decided to, like, create their own tracing library or whatever. But at the end of the day, like, observability is about, you have that information. What do you do with the information about your system so that you can follow the breadcrumbs to answer the question, why is this happening?</p><p>OpenTelemetry enables that to happen, right? It's the thing that supplies that information.</p><p>I would encourage folks to use OpenTelemetry as the standard for that sort of thing because if we get to the point where, like, we're all using different tools, we end up losing out on the standardization and the work that has gone into OpenTelemetry, if that makes any sense.</p><p>But a tool like this is only as good as the user base. And if we don't have that standardization, if we don't rally as a community to standardize around how we get our observability how we emit the observability data then we end up with all these different factions.</p><p>I don't want to be super prescriptive about it, but I think it ends up losing its worth if, if we go all like everyone on their own, do whatever they want for these sorts of things. Right?</p><p><strong>Ash Patel:</strong>&nbsp;Yeah, I would be totally against it myself.</p><p>If I were managing a team or multiple teams, which like I have in the past, I always used to tell my teams, don't reinvent the wheel. Look at what's available out there. And try and use that first. And then if we have to make a solution from scratch, then we'll look at it then. But start learning from what everyone else has done, and then we can work with that.</p><p>And this is how OpenTelemetry is doing. 
It's giving you all this capability from the get go.</p><p><strong>Adriana Villela:</strong>&nbsp;Yeah, totally. And the other thing that I want to emphasize is that because, you know, all the major observability vendors are behind it, it means that these vendors are not competing on the data that is ingested. They're just competing on how they're rendering the data.</p><p>We all speak the same language. The data that we send to whatever vendor, vendor X and vendor Y, it's the same data. Vendor X will probably render it differently from vendor Y, and so as a consumer, it's up to you to decide which one is more useful for your troubleshooting purposes.</p><p><strong>Ash Patel:</strong>&nbsp;You actually have a bit of experience with running teams. So, I'd be interested to know in terms of your observability team, how you managed them, how you hired them, how you managed the performance.</p><p><strong>Adriana Villela:</strong>&nbsp;When I joined Tucows, I was lucky enough that there were some folks on both of my teams already, and, like I said, I was managing two teams, and they were similar but different.</p><p>And so there were already folks on those teams, but I also needed to hire additional headcount. At that point, I already knew what I wanted out of my teams. And in particular, I'll talk about the observability team.</p><p>I wanted to hire somebody who appreciated what observability had to offer, either as an expert in observability or as somebody who maybe didn't come from that observability background, maybe came from a more traditional monitoring background, but saw the importance of observability and was willing to make a paradigm shift.</p><p>It was a weird role because it was a technical role that wasn't terribly hands on. 
So my team our mandate was to be able to educate folks at the organization on observability practices.</p><p>We taught them the overarching best practices, but also we assisted them as far as unpacking some of the gnarly bits of OpenTelemetry so that they could instrument their code.</p><p>Now that was very different from actually instrumenting the code themselves, because the slippery slope that you can get into by having an observability team, and it wasn't just my team you can end up getting asked to instrument your developer's code. Which is a very bizarre ask, because I know nothing about your code.</p><p>I don't know what's important to you for debugging purposes. I don't know what the code does in general. I might have a vague idea from a high level. But realistically speaking, it is not my code. I don't have that kind of vested interest. I will help you in terms of, like, Hey, these might be the things that you might want to instrument.</p><p>And I can provide some assistance on how to instrument in whatever language. But as far as instrumenting your own code, that was something that was on purpose outside of the realm of my team, because we wanted to encourage developers to instrument their own code.</p><p>So hiring for the team was... was tricky in that sense, because I had a lot of candidates applying for the position who came from like an infrastructure as code background and were super gung ho on managing tools and I'm like that's not exactly what we do.</p><p>I wanted somebody who had an appreciation from an infrastructure as code background, somebody who had an appreciation from like an SRE type of background, you know, someone who felt the pain so that they could then go on and advocate for those observability practices.</p><p>In a way, I guess I was hiring for an internal team of advocates in the organization. 
So it makes it a very tricky type of role to hire for because it's a more squishy role that's not completely hands on, but I need you to be an engineer. I need you to understand the pain around this sort of thing.</p><p>That was on the hiring side, but then for the folks who I had inherited, it was a little bit tricky too, because all of a sudden I come in, I'm saying we're pivoting the team's mandate. That went well for some folks and for other folks they're like, but I want to just do infrastructure as code.</p><p>That is my thing. I enjoy it. So I had to have some hard conversations with some of my team members saying, look, I understand that this is your preference and that I have gone ahead and shifted what our team's priorities are.</p><p>And I am okay if we don't align. If you want to go look elsewhere in the organization, I will support you in that search, to ensure that you find something that's more in line with your career goals in the organization.</p><p>It's kind of a tough conversation to have because it means that person admitting, yeah, I don't really want to do this and me saying I'm okay letting you go because I think at the end of the day, when you're managing a team you kind of have to all toe the party line together.</p><p>You have to be aligned as to what the overarching team goals are. It's okay to question certain things. I think it's always important to question direction. Hey, maybe you should be doing X instead of Y. That's totally fine, but to have folks who are constantly questioning what you're doing as if you're butting heads because you don't see eye to eye on the philosophies and where the team is going, I think it ends up being very unhealthy for the team, and so you're better off parting ways.</p><p>All it takes is one person to start sowing dissent on the team to ruin the team cohesion. And then that becomes a very toxic work environment for everybody, right? 
Because then you start creating factions and people aren't working together, they're working against each other, and that's not something that you can have.</p><p>So, I mean, at the end of the day, being aligned on what our mandate was was very important, whether it was somebody already existing on the team or somebody being hired into the team.</p><p><strong>Ash Patel:</strong>&nbsp;That was very insightful about a lot of things that you can apply beyond what your team issue was in terms of there's so many teams that are facing this exact thing. There are people already there, there are people that you need, and you need to all align on whatever mandate you as a manager have to create to align with goals.</p><p>Coming back to now, what you're doing right now, you'd be having some interesting conversations with a lot of people every day about what their challenges are. So what kind of person typically reaches out to you or you end up having conversations with around observability and OpenTelemetry?</p><p><strong>Adriana Villela:</strong>&nbsp;So we talked to, I'd say, a lot of people who are new to OpenTelemetry, who are just trying to figure this thing out. I had a conversation recently with somebody who was just starting to wrap their head around it. They knew that observability is important. They knew that OpenTelemetry is important, but they weren't sure how to get started.</p><p>So one of the things that I like to do is, I'll give them a little overview and I'm happy to, anyone who wants to talk about observability, OpenTelemetry, like reach out to me. Let's have that conversation. You know, I work for a vendor, but I'm not here to sell you a vendor product.</p><p>I'm here to sell you on observability first and foremost. 
I'm here to sell you on observability, OpenTelemetry, the power of that.</p><p>And I think that's the most important thing. We need to do right by our systems and make sure that we have the tools necessary to be able to troubleshoot in production.</p><p>So having those conversations with folks to understand what their needs are, what their difficulties are. I can come at it with the experience of having been on the other side, looking at a couple of different vendors and really just focusing on observability practices.</p><p>I can have that conversation with folks around what do you need to do in order to be successful in terms of bringing observability and OpenTelemetry to your organization? Just, like, wrapping your mind around observability. I think with a lot of things in tech it's very easy to get into various anti patterns.</p><p>We saw that with the DevOps movement. I think one of my biggest pet peeves is still the idea of, like, DevOps as a job rather than as a practice because I think we ended up moving away from what DevOps was trying to promise.</p><p>And I feel like, you know, it can be a similar slippery slope with observability where I have used OpenTelemetry, therefore my system is observable, and I've encountered situations where you can instrument till you're blue in the face, but if you're instrumenting the wrong thing that's useless.</p><p>I'm one of the co-leads of the OpenTelemetry End User Working Group, and recently we had one person come in and present to us, Hazel Weakly, and she told a story about one of the organizations she was in.</p><p>They instrumented their code. They over instrumented their code to the point where like, great, we've got OpenTelemetry. It is useless. We have no idea what is going on here. It's not giving us any meaningful information. 
So she actually had to spend some time unraveling that big mess and trying to get it to a state where the instrumentation was more meaningful.</p><p>We need to be aware of the initial lack of knowledge around both observability and OpenTelemetry, from "I don't know where to get started" to "Oh, I've come in guns blazing and I've over instrumented and it's still not helping me."</p><p>We really need to take a step back and really, really focus on the practices.</p><p><strong>Ash Patel:</strong>&nbsp;Okay Adriana. Where do you see the category of observability heading in the next two, three, five years?</p><p><strong>Adriana Villela:</strong>&nbsp;This is my hope, and I hope that reality follows it. So, in terms of observability, I hope to see it more integrated with the SRE practice and not so much of this separate adjacent practice. We need to start talking about it from a more unified standpoint. The other thing that I wanted to mention is that I really want us to look at observability as a team sport.</p><p>And not as just the thing that is the burden of the SRE. Because when you think about it, yes, the SRE is probably going to be your first line of defense when there's an issue happening in production. But, let's think back to who instrumented that code in the first place, the developer.</p><p>So the developer, you would expect, would want to have a vested interest in making sure that they instrument the code properly so that our poor friend, the SRE, has an easier time of troubleshooting issues in production. 
But I think we can take it a step further and say, the developers should take advantage of these observability platforms to even help troubleshoot their own code while they're debugging, while they're unit testing.</p><p>But let's take it a step further than that, which is before it goes into production, we send it over for testing, right, to our friends, the QAs. Why don't we empower QAs to be able to use observability data to make sense of what is happening in the application when they're testing the code, so that when there is an error, the QA can go to the developer and say, "Hey, there's a bug and I can tell you where it is"?</p><p>Or if there's a bug and the QA can't tell the developer where it is, that's fine too because they can say to the developer, "Hey, your system's not emitting enough information. I don't know what the problem is. I think you're missing some instrumentation. Go back." And then the other piece of that puzzle is leveraging those traces that you already have in place and creating trace based tests.</p><p>Which is basically, as the name implies, you use traces to create automated integration tests, and there are some tools out there that already do that. The most talked about one in the space is Tracetest. It's an open source tool that facilitates trace based testing.</p><p>There's another one called Malabi: it's more of a JavaScript based trace based testing tool, and I don't think it's been maintained for the last little while. And then Helios, I believe they were looking at trace based testing definitely last year. I don't know where things are at now, but these are three different types of tools that you might want to look at if you want to explore trace based testing.</p><p>I think it's a really, really cool concept and it's something I want to see more of. 
I want to see us talking more about trace based testing in the future because I think trace based testing can give us extra superpowers in terms of achieving more reliable systems, especially because now you're not having to worry about wrangling different testing frameworks to be able to do end to end tests of your systems, especially if, you know, you're dealing with a system where the microservices are written in different languages.</p><p>Now trace based testing is like your single unified language for writing your integration tests, which is awesome because the unified language itself is the trace.</p><p>So again, we need to make sure that we have good quality traces so that we can leverage those traces to really make them work for us.</p><p><strong>Ash Patel:</strong>&nbsp;It's interesting you're talking about tracing because I was crossing University Avenue in Toronto and I overheard two developers talking about tracing and testing. It was a little haphazard.</p><p>They were talking about trace testing, but they were very confused about what they had just heard in a meeting prior. It was surprising to me, so I stopped them. I said, Hey, what do you guys know about observability? And they said, not much. And I asked them, do you need to learn more?</p><p>They said, yes, please. Well, they didn't say it like two boys in a Charles Dickens novel, but you get the idea that they were pretty keen on figuring out what is this thing that's now suddenly become part of our responsibility in this increasingly "you build it, you run it" model.</p><p>It's interesting that these kinds of things are being brought out. Trace testing was just mentioned. I'm gonna say one thing, and I'm not sure whether you're gonna agree with me. 
It's that all this needs to be brought together cohesively, so that when it's shared with people, they can see this is where this fits into the observability practice.</p><p>Where do you see SRE heading in the same way in the future? Where do you see it heading in the next few years?</p><p><strong>Adriana Villela:</strong>&nbsp;I think for SRE, I'd like to see more conversations happening around things like self service provisioning.</p><p>Actually Anna and I have a talk coming up at KubeCon where that is our subject matter. The idea is that there's the rise of platform engineering, and this may or may not be a popular opinion, but in the definition that we're using, we see platform engineering as an extension of SRE.</p><p>Where you're not only concerned about the external customer, but the internal customer, i.e. the developer. And this idea that in platform engineering, as a developer, it's so annoying when you're trying to get your job done, and you need certain tools to get your job done, and you gotta wait on this platform team to provision stuff for you.</p><p>And so you're sitting there waiting, waiting, waiting, thinking, friggin', man, why can't I just use, like, my own Kubernetes cluster? I don't want to use this stupid Kubernetes cluster that's, you know, in the cloud and blah, blah, blah. And, like, this is taking forever. I need to get my job done. And by having self service tooling available, it means that developers can then provision the resources that they need to get their jobs done on demand.</p><p>These resources are provided by the platform team. However, the platform team automates the provisioning of those resources, right? They're not having to, you know, execute on these requests manually. But on top of that, there's this whole idea of the resources being packaged so that they're compliant within the organization.</p><p>So security concerns are addressed. It's using standard configurations. 
So you're not deviating from what the standards within the organization are. So the thing that's delivered to you that you're able to use as part of your tooling is wrapped in a nice little neat bow.</p><p>It's got all the configurations you need and you don't have to wait days and days and days to get the thing that you need to be able to do your job. So having the self service provisioning.</p><p>And then the other thing that is important that I think follows in with that is whenever we go from one organization to another, I can't tell you how many times you figure out this automation, then you go to the next organization.</p><p>Oh crap. I got to do this exact same automation. Like, come on. We keep reinventing the same automations over and over and over again. Wouldn't it be nice if we just had this marketplace of standardized automations where we can just pick and choose, right?</p><p>And then make some customizations here and there, but we're not having to like reinvent these friggin automations every single time for our SRE practice.</p><p>The last one is policy as a service where I feel like we're always at odds with InfoSec. InfoSec has a very tough job because they're protecting us from all these attacks in our systems making sure that everything's safe.</p><p>And I feel like there's... And they can't even talk about their triumphs, right, because if you talk about, like, oh, I, I blocked this attack from happening, I blocked that attack from happening, that already, like, that's not secure by nature. So, you, you can't even, like, you can't even gloat about your accomplishments but to be able to work together with Infosec so that we have certain policies in place that are codified, and again, so we don't have to go through these long drawn out processes of, Oh, I need access to XYZ and have like all these approvals.</p><p>Well, yes, we've already had the pre approvals because we already have all of these policies, like pre packaged, pre approved. 
Everything's compliant. And so we're not mad at our InfoSec friends. They're just trying to do their jobs and we're getting these compliant things. These things that are properly secured. Our Kubernetes clusters that are properly secured and our cloud resources that have the right permissions, without having to wait forever and ever and do this song and dance with InfoSec.</p><p>Those are the areas that I would like to see SRE going towards.</p><p><strong>Ash Patel:</strong>&nbsp;We can only wait and see.</p><p><strong>Adriana Villela:</strong>&nbsp;I know, right? Fingers crossed.</p><p><strong>Ash Patel:</strong>&nbsp;What advice would you give to SREs who are dealing with a lot of challenges right now to help them in their working lives?</p><p><strong>Adriana Villela:</strong>&nbsp;I would say never stop being curious. Whether or not you're an SRE, I think that will take you very far in your career. Being curious, always looking to do things a better way, helps us grow in our practice. I think it allows us to innovate. I think we risk stagnation when we stop caring about innovating in our technologies, in our jobs.</p><p>So let's continue to be curious.</p><p><strong>Ash Patel:</strong>&nbsp;There we have it. Never stop being curious. 
Thank you so much for joining me, Adriana.</p><p><strong>Adriana Villela:</strong>&nbsp;Thanks for having me, Ash.</p>]]></content:encoded></item><item><title><![CDATA[How Jaeger tracing fits into software observability]]></title><description><![CDATA[In this article, I will share how tracing and more specifically Jaeger tracing can fit into your wider software observability strategy.]]></description><link>https://read.srepath.com/p/jaeger-tracing-software-observability</link><guid isPermaLink="false">https://read.srepath.com/p/jaeger-tracing-software-observability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Wed, 15 Jun 2022 17:55:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/69d881e0-22d5-4555-b262-03fcf17234ac_480x270.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article, I will share how tracing and more specifically Jaeger tracing can fit into your wider software observability strategy.</p><p>Before we get into tracing, let's define observability.</p><h2>What is observability?</h2><p>Observability is a comprehensive means of gaining data on how software services perform in production.</p><p>This data gives you <strong>a picture of the health and performance of individual services</strong>, as well as the cloud infrastructure that supports them.</p><p>It can be broken down into 3 actions: logging, tracing, and monitoring. 
Our focus in this article will be on tracing.</p><h2>What is tracing?</h2><p>Tracing is an action that <strong>tracks a request from initiation to completion</strong> within a microservices architecture.</p><p>It usually starts when a user or service initiates a request, which then moves along the chain of interconnected services needed to fulfill it.</p><p>With tracing enabled, software engineers and SREs can pinpoint any issues within the chain of requests among the various involved services.</p><h2>Where Jaeger fits into the tracing paradigm</h2><h3>What is Jaeger tracing?</h3><p>Jaeger is an open-source tracing tool that allows engineers to <strong>track request performance and issues among 10s, 100s, and even 1000s of services</strong> and their dependencies. It collects tracing data, which can then be explored in Jaeger's own UI or visualized in Grafana dashboards.</p><p>The key benefit of this is that it highlights downtime/load-time risks and errors. This makes it an essential component of a strong observability practice.</p><h3>Jaeger's origin story</h3><p>Jaeger was created in 2015 by an engineer at Uber, Yuri Shkuro, who wanted to help engineers work out&nbsp;<em>where</em>&nbsp;issues were popping up. 
This emerged as a critical need at Uber over time.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lkw6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lkw6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 424w, https://substackcdn.com/image/fetch/$s_!lkw6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 848w, https://substackcdn.com/image/fetch/$s_!lkw6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 1272w, https://substackcdn.com/image/fetch/$s_!lkw6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lkw6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Glimpse of microservices 
that drive the Uber app. A large number of these services get triggered every time you request an Uber ride.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Glimpse of microservices that drive the Uber app. A large number of these services get triggered every time you request an Uber ride." title="Glimpse of microservices that drive the Uber app. A large number of these services get triggered every time you request an Uber ride." srcset="https://substackcdn.com/image/fetch/$s_!lkw6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 424w, https://substackcdn.com/image/fetch/$s_!lkw6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 848w, https://substackcdn.com/image/fetch/$s_!lkw6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 1272w, https://substackcdn.com/image/fetch/$s_!lkw6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c339cd-c62b-4e82-af84-d3a3aaba79ae_480x270.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Above: a glimpse of services that support the Uber app. 
Many of these services get triggered every time you request an Uber ride.&nbsp;<em>(Source: YouTube,&nbsp;<a href="https://youtu.be/UNqilb9_zwY?t=185">Jaeger Intro &#8211; Yuri Shkuro</a>)</em></figcaption></figure></div><p>The Uber app may seem simple to its end users, but behind the facade runs a complex network of microservices. Many of these services depend on other services and their sub-services.</p><p>A weakness anywhere in the service chain can cause the whole user request to fall apart, i.e. no ride.</p><p>In business terms, Uber risks losing ride fares at a large scale if one or more component services fail or slow down.</p><blockquote><p><em>&#8220;In deep distributed systems, finding&nbsp;</em><strong>what</strong><em>&nbsp;is broken and&nbsp;</em><strong>where</strong><em>&nbsp;is often more difficult than&nbsp;</em><strong>why</strong><em>&#8221;</em></p><p>&#8212; Yuri Shkuro, Founder &amp; Maintainer, CNCF Jaeger</p></blockquote><p>Jaeger tracing helps engineers find out which services are experiencing issues and where. That way, they can fix small issues before they snowball into serious problems or crises.</p><h3>Do your observability needs justify using Jaeger?</h3><p>You might be wondering whether you even need Jaeger. After all, your use case might not be as complex as Uber&#8217;s. Jaeger was designed to <strong>make sense of a complex web of services and up to millions of daily requests</strong>.</p><p>Tracing is not an absolute must-have for simpler software architectures. However, it is useful for finding bottlenecks if you have more than a handful of services. More than 10 services is a fair threshold for needing it.</p><p>Would the following situation ever pose a problem for your software? Your application has more than 10 services and suddenly gets a traffic spike.
A large volume of requests fails to complete.</p><p>How will you find the culprit fast enough to fix the issue?</p><p>If this makes the case for tracing, let's explore how Jaeger tracing works from a high-level view:</p><h3>How Jaeger tracing works</h3><h4><strong>Step 1</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FSz6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FSz6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 424w, https://substackcdn.com/image/fetch/$s_!FSz6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 848w, https://substackcdn.com/image/fetch/$s_!FSz6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 1272w, https://substackcdn.com/image/fetch/$s_!FSz6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FSz6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png" width="75" height="75"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:75,&quot;width&quot;:75,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FSz6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 424w, https://substackcdn.com/image/fetch/$s_!FSz6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 848w, https://substackcdn.com/image/fetch/$s_!FSz6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 1272w, https://substackcdn.com/image/fetch/$s_!FSz6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe029c729-3b99-4bc3-8cf1-f5cd13550f30_150x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Jaeger Agent</strong>&nbsp;gathers &#8220;span data&#8221; that instrumented microservices send to it over UDP; because of sampling, only a portion of requests is traced</p><h4><strong>Step 2</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank"
href="https://substackcdn.com/image/fetch/$s_!m7C8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m7C8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 424w, https://substackcdn.com/image/fetch/$s_!m7C8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 848w, https://substackcdn.com/image/fetch/$s_!m7C8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 1272w, https://substackcdn.com/image/fetch/$s_!m7C8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m7C8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png" width="64" height="64" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:64,&quot;width&quot;:64,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!m7C8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 424w, https://substackcdn.com/image/fetch/$s_!m7C8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 848w, https://substackcdn.com/image/fetch/$s_!m7C8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 1272w, https://substackcdn.com/image/fetch/$s_!m7C8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43eab006-c12c-40a7-a35f-43a08410fcc7_512x512.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Data (service name, start time, duration) gets sent on to the&nbsp;<strong>Collector</strong></p><h4><strong>Step 3</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!FwV0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FwV0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 424w, https://substackcdn.com/image/fetch/$s_!FwV0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 848w, https://substackcdn.com/image/fetch/$s_!FwV0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 1272w, https://substackcdn.com/image/fetch/$s_!FwV0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FwV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png" width="64" height="64" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:64,&quot;width&quot;:64,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FwV0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 424w, https://substackcdn.com/image/fetch/$s_!FwV0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 848w, https://substackcdn.com/image/fetch/$s_!FwV0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 1272w, https://substackcdn.com/image/fetch/$s_!FwV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e61572-c2e9-4e27-b3a3-853e77c33636_512x512.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Collector sends data to 2 places:&nbsp;<strong>Analytics</strong>&nbsp;and&nbsp;<strong>Visual Dashboard</strong></p><p><em>Et voil&#224;!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!0B_w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0B_w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 424w, https://substackcdn.com/image/fetch/$s_!0B_w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 848w, https://substackcdn.com/image/fetch/$s_!0B_w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 1272w, https://substackcdn.com/image/fetch/$s_!0B_w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0B_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2799a2e0-39d3-4699-b301-55958742efb7_480x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0B_w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 424w, https://substackcdn.com/image/fetch/$s_!0B_w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 848w, https://substackcdn.com/image/fetch/$s_!0B_w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 1272w, https://substackcdn.com/image/fetch/$s_!0B_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2799a2e0-39d3-4699-b301-55958742efb7_480x270.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Above: this is what tracing data looks like in the Jaeger UI <em>(Source: YouTube,&nbsp;<a href="https://youtu.be/UNqilb9_zwY?t=185">Jaeger Intro &#8211; Yuri Shkuro</a>)</em></figcaption></figure></div><p>Now let's explore how to install Jaeger on a Kubernetes cluster.</p><h2>How to set up Jaeger</h2><h3>2 ways to install Jaeger on
Kubernetes</h3><p>I will assume that you know how Kubernetes clusters are structured in terms of containers, nodes, pods, sidecars, etc.</p><p>Jaeger Agent can run on a Kubernetes cluster in two distinct ways: as a DaemonSet or as a sidecar. Let&#8217;s compare both of them.</p><h4><strong>Set up Jaeger as a DaemonSet</strong></h4><p><strong>Mechanism:&nbsp;</strong>Jaeger Agent runs as a pod on each node and collects data from all other pods on that node</p><p><strong>Useful for:&nbsp;</strong>single-tenant or non-production clusters</p><p><strong>Benefits:</strong>&nbsp;lower memory overhead, more straightforward setup</p><p><strong>Risk:&nbsp;</strong>security risk if deployed on a multi-tenant cluster</p><p><a href="https://www.digitalocean.com/community/tutorials/how-to-implement-distributed-tracing-with-jaeger-on-kubernetes">LEARN BY DOING: simple Jaeger setup tutorial</a> via DigitalOcean</p><h4><strong>Set up Jaeger as a sidecar</strong></h4><p><strong>Mechanism:&nbsp;</strong>Jaeger Agent runs as a container alongside the service container within every pod</p><p><strong>Useful for:&nbsp;</strong>multi-tenant clusters, public cloud clusters</p><p><strong>Benefits:</strong>&nbsp;granular control, higher security potential</p><p><strong>Risk:&nbsp;</strong>more DevOps supervision required</p><p><a href="https://github.com/jaegertracing/jaeger-kubernetes#deploying-the-agent-as-sidecar">LEARN BY DOING: deploy Jaeger as a sidecar</a> via Jaeger's GitHub</p><p>Remember from earlier that Jaeger works from a sample of the span data that services send over UDP?</p><p>There are 2 methods for sampling traces: head-based sampling and tail-based sampling. Each has its benefits and downsides.
Let&#8217;s explore:</p><h4><strong>Head-based sampling</strong></h4><p><strong>Also known as</strong>&nbsp;upfront sampling</p><p><strong>Mechanism:&nbsp;</strong>sampling decision is made at the start of the request, before it completes</p><p><strong>Useful for:&nbsp;</strong>high-throughput use cases, looking at aggregated data</p><p><strong>Benefits:</strong>&nbsp;cheaper sampling method &#8211; lower network and storage overhead</p><p><strong>Risk:&nbsp;</strong>potential to miss outlier requests due to less than 100% sampling</p><p><strong>Work required:</strong>&nbsp;easy setup, supported by Jaeger SDKs</p><p><strong>Configuration notes:&nbsp;</strong>sampling by coin flip (probabilistic) or capped at a set rate (rate limiting)</p><h4><strong>Tail-based sampling</strong></h4><p><strong>Also known as&nbsp;</strong>response sampling</p><p><strong>Mechanism:&nbsp;</strong>sampling decision is made after the request has been completed</p><p><strong>Useful for:&nbsp;</strong>catching anomalies in latency, failed requests</p><p><strong>Benefits:</strong>&nbsp;more intelligent approach to looking at request data</p><p><strong>Risk:&nbsp;</strong>all traces need temporary storage &#8211; more infra overhead, and on a single node only</p><p><strong>Work required:</strong>&nbsp;extra work &#8211; connect to a tool that supports tail-based sampling&nbsp;<a href="https://web.archive.org/web/20210421143122/https://lightstep.com/jaeger/">like Lightstep</a></p><p><strong>Configuration notes:&nbsp;</strong>sampling based on latency criteria and tags</p><p>Now that you've picked your sampling method, you will also need to consider that Jaeger's collector has a finite data capacity.</p><h3>Prevent Jaeger's collector from getting clogged</h3><p>Jaeger&#8217;s collector holds data temporarily before it writes it to a database. The visual dashboard then queries this database.
But the collector can get clogged if the database can&#8217;t keep up with writes during high-traffic situations.</p><h4><strong>Problem:</strong></h4><ul><li><p>Collector&#8217;s temporary storage model becomes problematic during traffic spikes</p></li><li><p>Some data gets dropped so the collector can stay afloat amid the flood of incoming request data</p></li><li><p>Your tracing may look patchy in areas because of the gaps in sampling data</p></li><li><p><strong>Risk of missing failed or problematic requests</strong>&nbsp;if they were among the samples that get dropped</p></li></ul><h4><strong>Solution:</strong></h4><ul><li><p>Consider&nbsp;an asynchronous span ingestion technique to solve this problem</p></li><li><p>This means adding a few components between your collector and database:</p><ol><li><p>Apache Kafka &#8211;&nbsp;<em>real-time</em>&nbsp;data streaming at scale</p></li><li><p>Apache Flink &#8211; processes Kafka data <em>asynchronously</em></p></li><li><p>2 Jaeger components &#8211; jaeger-ingester and jaeger-indexer &#8211; push Flink output to storage</p></li></ol></li></ul><p>Once these components are in place, the collector will be less likely to get overloaded and drop data.</p><h4><strong>Implementation reading:</strong></h4><p>These links, accessed in order, might help you get started with your implementation:</p><p><a href="https://web.archive.org/web/20210421143122/https://youtu.be/UNqilb9_zwY?t=1721">YouTube &#8211; Jaeger straight-to-DB vs async write method</a></p><p><a href="https://web.archive.org/web/20210421143122/https://www.kubenuts.com/jaeger-tracing-kubernetes/">YouTube &#8211; Apache Kafka videos by Confluent</a></p><p><a href="https://web.archive.org/web/20210421143122/https://www.kubenuts.com/jaeger-tracing-kubernetes/">Practical overview (with example) of Apache Flink</a></p><h2>Wrapping up</h2><p>This concludes our article on Jaeger and the promise it holds for distributed tracing of microservices, as well as the
wider observability apparatus.</p>]]></content:encoded></item></channel></rss>