The Practical Developer

A constructive and inclusive social network for software developers.

How I Built a Zero-Dependency Token Compressor for AI Coding Agents (During My High School Exams)

2026-06-16 04:37:04

As developers, we are spending more and more time working alongside AI coding agents like Cursor, Claude Code, GitHub Copilot, Windsurf, or Cline.

But as your session grows, you quickly run into two major problems:

  1. Context Window Inflation: Long-running loops, verbose model reasoning, and unfiltered terminal log dumps clog the context window, causing the LLM to get "lost in the middle" and start hallucinating.
  2. Financial Overhead: Large context windows mean higher token usage, which translates directly to higher API costs.

To solve this, I built TITAN (Token Intelligence Through Agent Narrowing): a universal, zero-dependency CLI framework designed to compress AI agent token consumption by 70% to 85% without degrading reasoning quality.

And to make things interesting, I wrote and shipped it this week entirely on my own, right in the middle of my high school final exams (la maturità here in Italy).

Here is how it works under the hood.

The Core Philosophy: Multi-Layer Compression

TITAN approaches token optimization not as a single post-processing step, but as three orthogonal, multiplicative layers:

Total Savings = 1 - ( (1 - L1_Savings) * (1 - L2_Savings) * (1 - L3_Savings) )
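
As a worked example of the formula above, here is a small Python sketch. The per-layer figures (40%, 30%, 45%) are made-up inputs for illustration, not measurements from TITAN, and the function name is hypothetical.

```python
def total_savings(*layer_savings: float) -> float:
    """Combine per-layer savings fractions multiplicatively.

    Each layer leaves (1 - s) of the tokens it receives, so the
    fractions that remain multiply together.
    """
    remaining = 1.0
    for s in layer_savings:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"savings must be a fraction in [0, 1], got {s}")
        remaining *= 1.0 - s
    return 1.0 - remaining


if __name__ == "__main__":
    # Hypothetical layer savings: L1 = 40%, L2 = 30%, L3 = 45%
    print(f"{total_savings(0.40, 0.30, 0.45):.1%}")  # prints 76.9%
```

Because the layers multiply, three moderate savings compound: 0.60 × 0.70 × 0.55 leaves 23.1% of the original tokens.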

Layer 1: Linguistic Compression (Caveman Engine)

Instead of letting the LLM output standard verbose English prose (pleasantries, hedging, filler words, technical narrations), the Caveman Engine instructs the model to use a dense, telegraphese grammar:

  • Strips filler/hedging: basically, actually, likely, probably $\to$ removed.
  • Strips articles: the, a, an $\to$ removed (when safe).
  • Fragments OK: subject/auxiliary drops $\to$ e.g., "Component re-renders" instead of "The component is re-rendering".
  • Preserves Sacred Tokens: Code blocks, URLs, file paths, and exact technical names are protected and left untouched.
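
TITAN applies these rules by instructing the model, not by rewriting its output. Purely to illustrate what the rules do to a sentence, here is a hypothetical Python post-filter. The word lists, the regex for protected spans, and the function name are all assumptions, and the sketch covers only the filler, article, and sacred-token rules.

```python
import re

FILLER = {"basically", "actually", "likely", "probably"}
ARTICLES = {"the", "a", "an"}
DROPPED = FILLER | ARTICLES

# Spans that must survive untouched: inline code and URLs.
PROTECTED = re.compile(r"(`[^`]*`|https?://\S+)")


def caveman(text: str) -> str:
    """Drop filler words and articles outside protected spans."""
    parts = PROTECTED.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd indices are the captured protected spans: keep verbatim.
            out.append(part)
            continue
        words = [w for w in part.split(" ")
                 if w.strip(",.").lower() not in DROPPED]
        out.append(" ".join(words))
    # Collapse any double spaces left behind by removed words.
    return re.sub(r" {2,}", " ", "".join(out)).strip()


if __name__ == "__main__":
    print(caveman("The component is basically re-rendering because of `the state`"))
```

The article inside the backticks is left alone, which is the point of the sacred-token rule.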

Layer 2: Structural Code Compression (Ponytail Lazy Ladder)

Before the agent writes a single line of code, it must traverse a 6-rung logical ladder to guarantee the laziest, most minimal solution:

  1. YAGNI: Does this feature actually need to exist right now? If not, skip.
  2. Stdlib: Can Node.js/JS native stdlib do it? If yes, use it.
  3. Native: Is there a platform native API? Use it.
  4. Existing: Is there an already installed package? Don't add a new npm dependency.
  5. One Line: Can it be written as a single line? Inline it.
  6. Minimum: Only then, write the absolute minimum working code.

Every deliberate simplification is documented inline: // ponytail: <ceiling>, <upgrade path> (e.g. // ponytail: local memory cache, use Redis if multi-node setup is required).
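
A tool that scans for these comments could parse them roughly as follows. This is a sketch in Python, not TITAN's own code: the regex, the `Ponytail` type, and `parse_ponytail` are hypothetical names, and it assumes the ceiling never contains a comma.

```python
import re
from typing import NamedTuple, Optional


class Ponytail(NamedTuple):
    ceiling: str        # the known limit of the simplified solution
    upgrade_path: str   # what to do once that limit is hit


# Matches: // ponytail: <ceiling>, <upgrade path>
PONYTAIL_RE = re.compile(
    r"//\s*ponytail:\s*(?P<ceiling>[^,]+),\s*(?P<upgrade>.+)$"
)


def parse_ponytail(line: str) -> Optional[Ponytail]:
    """Return the parsed comment, or None if the line has no ponytail note."""
    m = PONYTAIL_RE.search(line)
    if m is None:
        return None
    return Ponytail(m.group("ceiling").strip(), m.group("upgrade").strip())


if __name__ == "__main__":
    src = ("const cache = new Map(); "
           "// ponytail: local memory cache, use Redis if multi-node setup is required")
    print(parse_ponytail(src))
```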

Layer 3: Contextual Compression (CLI Utilities)

  • Memory Files: Static documentation files (like CLAUDE.md) are compressed post-hoc to strip prose while keeping code conventions exact, saving up to 45% input tokens on every turn.
  • Terminal Stream Filtering: Pipe build/test logs through titan filter to strip Vite/Webpack startup noise and husky banners, and to contract large stack traces down to the error header plus the first relevant application frame:

npm run build 2>&1 | titan filter
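
To make the stack-trace contraction concrete, here is a minimal Python sketch of that kind of filter. It is an illustration under assumptions, not TITAN's implementation: the noise patterns are guesses, and "relevant application frame" is taken to mean the first Node-style `at ...` frame that does not come from node_modules.

```python
import re
from typing import Iterable, List

# Hypothetical noise patterns; a real filter would need a longer list.
NOISE = re.compile(r"(vite v|webpack compiled|husky)", re.IGNORECASE)
# Node.js stack frames are indented lines starting with "at ".
FRAME = re.compile(r"^\s+at ")


def filter_log(lines: Iterable[str]) -> List[str]:
    """Drop noise lines and keep one application frame per stack trace."""
    out: List[str] = []
    kept_app_frame = False
    for line in lines:
        if NOISE.search(line):
            continue
        if FRAME.match(line):
            # Keep only the first frame that is not from node_modules.
            if not kept_app_frame and "node_modules" not in line:
                out.append(line)
                kept_app_frame = True
            continue
        # Any non-frame line (such as an error header) starts a new context.
        kept_app_frame = False
        out.append(line)
    return out


if __name__ == "__main__":
    log = [
        "vite v5.0.0 building for production...",
        "TypeError: x is not a function",
        "    at foo (/app/node_modules/lib/index.js:1:1)",
        "    at bar (/app/src/main.js:10:5)",
        "    at baz (/app/src/other.js:3:2)",
        "done",
    ]
    print("\n".join(filter_log(log)))
```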

The Zero-Dependency Rule

Following the structural (L2) rule of using the standard library, TITAN has zero external npm dependencies.

It uses Node.js native features (fs, path, readline, child_process, https) for everything:

  • The YAML frontmatter parser is implemented as an indentation-aware state machine that handles quoted strings, list arrays, and multiline block scalars (| and >).
  • The test runner uses Node's native node:test and node:assert modules.
  • System commands execute via native subprocess spawns.

Measuring Usable Intelligence Density (UID)

To verify that compressing prompts doesn't degrade the AI's coding and reasoning capabilities, I built an evaluation harness into TITAN to measure Usable Intelligence Density (UID):

$$\text{UID} = \frac{\text{Avg Accuracy \%}}{\text{Avg Total Tokens}} \times 1000$$

Here is how the variants perform under mock and empirical LLM runs over a 5-task suite (Coding, Debugging, Logic, Refactoring, and Code Review):

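| Variant | Avg Accuracy | Avg In Tok | Avg Out Tok | Avg Tot Tok | UID (Density) | Status |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 100% | 50 | 198 | 248 | 403.2 | Reliable |
| Caveman | 100% | 120 | 78 | 198 | 505.1 | Reliable |
| Ponytail | 86% | 115 | 67 | 182 | 472.5 | Reliable |
| TITAN Balanced | 100% | 1500 | 80 | 1580 | 63.3 | Reliable |
| TITAN Lite | 100% | 425 | 91 | 516 | 193.8 | Reliable |
| TITAN Aggressive | 79% | 400 | 50 | 450 | 175.7 | ⚠ Degraded |

  • Lite / Balanced: Both hold a flat 100% accuracy; Lite does so at roughly three times the density of Balanced (193.8 vs 63.3).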
  • Aggressive: Telegraphic mode. Maximizes token efficiency, but logical reasoning begins to degrade slightly on highly abstract deduction tasks.
  • Note: The large input token count for the full TITAN prompt reflects the cost of loading the full master ruleset. The titan_lite variant balances prompt size and output compression beautifully.
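
The UID figures can be reproduced directly from the accuracy and total-token columns. Here is a short Python sketch using three rows from the table; the function name `uid` is just a label chosen for this example.

```python
def uid(avg_accuracy_pct: float, avg_total_tokens: float) -> float:
    """Usable Intelligence Density: accuracy % per 1000 total tokens."""
    if avg_total_tokens <= 0:
        raise ValueError("avg_total_tokens must be positive")
    return avg_accuracy_pct / avg_total_tokens * 1000


if __name__ == "__main__":
    # (accuracy %, avg total tokens) taken from the table above
    rows = {
        "Baseline": (100, 248),    # 403.2
        "Caveman": (100, 198),     # 505.1
        "TITAN Lite": (100, 516),  # 193.8
    }
    for name, (acc, tokens) in rows.items():
        print(f"{name}: {uid(acc, tokens):.1f}")
```

Since accuracy is capped at 100%, the metric is dominated by total tokens, which is why a large input prompt pulls the density down even when output shrinks.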

Getting Started

You can install TITAN globally from npm:

npm install -g titan-agent-cli

Then initialize the ruleset for your editor. For instance, to generate Cursor rules (.cursor/rules/titan.mdc):

# Standard balanced configuration
titan init --agent=cursor

# Or a lightweight prompt ruleset (~620 tokens)
titan init --agent=cursor --lite

To run the native unit tests locally:

titan test

And to scan your codebase for active technical debt ponytail comments:

titan debt

Open Source & Contributions

TITAN is fully open source. I’d love to get your thoughts, contributions, or a star on GitHub!

If you have any feedback on the standard library YAML parser or ideas on expanding adapters for new IDEs, let me know in the comments below!

End of the Year idea for DEV?

2026-06-16 04:24:36

Hey all!

I have this thought in mind that we should all come together as a community to celebrate the end of the year.

I am planning on doing this yearly under the DEVenger org, but I don't have any ideas as of now. We can do something like "the DEV community built this" or something.

If you have any ideas on what the dev.to community should do, let me know!

What is DevOps? A Plain English Guide

2026-06-16 04:21:19

Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI

Ever Wondered How Netflix Never Seems to Go Down?

Think about this for a second. Netflix has over 260 million subscribers worldwide. People are watching shows in Tokyo, London, Lagos, and New York — all at the same time. And yet, when was the last time Netflix crashed on you?

Now think about your favourite food delivery app. You open it, order food, track your driver in real time, and get a notification the moment your burger arrives. All of that happens in seconds.

Behind all of this is a way of working called DevOps. And by the end of this article, you'll understand exactly what it is — no jargon, no complicated diagrams, just plain English.

The Old Way (And Why It Was a Nightmare)

To understand DevOps, we first need to understand the problem it solved.

Imagine a software company in the early 2000s. They had two completely separate teams:

The Developers — the people who wrote the code and built new features

The Operations team — the people who managed the servers and kept everything running

These two teams barely talked to each other. Developers would spend months building new features, then hand over a massive pile of code to the operations team and say "here you go, make it work."

The operations team would panic. They hadn't been involved in building it, had no idea what it did, and now they had to deploy it to millions of users without breaking anything.

The result? Deployments took weeks. Bugs slipped through. Systems crashed. Customers complained. And the two teams blamed each other.

Sound stressful? It was.

So What is DevOps?

DevOps is simply the practice of bringing developers and operations teams together to build, test, and release software faster and more reliably.

The name itself is a combination of Dev (Development) and Ops (Operations). Instead of two teams working in silos, they work as one team with shared goals, shared tools, and shared responsibility.

Think of it like a restaurant kitchen.

In a badly run kitchen, the chefs cook the food and just slide it through a hatch to the waiters. The waiters don't know what's in the dish, the chefs don't know what the customers are saying, and when something goes wrong, everyone points fingers.

In a well run kitchen — like the ones you see at a great restaurant — the chefs and waiters communicate constantly. They know the menu inside out, they get feedback from customers quickly, and they work as one team to give people a great experience.

DevOps is that well run kitchen, but for software.

A Real World Example: Amazon

Amazon deploys new code to its website thousands of times per day.

That means engineers are constantly making small improvements — fixing a bug here, improving the checkout experience there, tweaking a recommendation — and those changes go live almost instantly.

How? Because Amazon uses DevOps practices. Small changes are automatically tested, automatically checked for problems, and automatically deployed without anyone having to manually press a button.

In the old way of working, those same changes might have taken weeks to go live, gone through five teams, and required a late night deployment session that everyone dreaded.

The Three Big Ideas Behind DevOps

You don't need to memorise these, but it helps to know the thinking behind DevOps.

1. Work in Small Steps

Instead of building for six months and releasing everything at once (terrifying), DevOps teams release small changes frequently. If something breaks, it's easy to find and fix because the change was tiny.

Uber does this constantly. Every few weeks, the Uber app gets tiny updates — a new button here, a faster map there. You barely notice, but the team is constantly improving without disrupting your experience.

2. Automate the Boring Stuff

Testing code manually, deploying to servers manually, checking for errors manually — all of this is slow and humans make mistakes. DevOps teams automate these tasks so they happen instantly and consistently every single time.

Think of it like a car factory. Cars aren't built by hand anymore — robots do the repetitive work faster and with fewer errors. DevOps applies the same thinking to software.

3. Get Feedback Fast

When something breaks, DevOps teams know about it within seconds, not days. Monitoring tools watch the system constantly and send alerts the moment something looks wrong.

Netflix actually has a famous practice where they intentionally break parts of their own system during working hours to make sure their team can fix things quickly. They call it Chaos Engineering. It sounds mad, but it means they're never caught off guard.

What Does a DevOps Engineer Actually Do?

A DevOps engineer is the person who builds and maintains the systems that help developers work faster and more safely. They work on things like:

  • Setting up automated testing so bugs are caught before they reach users
  • Building pipelines that automatically deploy code (we'll cover this in a future article)
  • Managing cloud infrastructure on platforms like AWS or Azure
  • Monitoring systems and making sure everything is running smoothly
  • Writing scripts to automate repetitive tasks

It's one of the most in-demand roles in tech right now, and the skills involved are exactly what this blog is here to help you build.

Why Should You Care About DevOps?

Whether you're a developer, a system admin, a project manager, or someone just getting into tech — DevOps matters because it is how modern software is built.

Every major tech company in the world uses DevOps practices. Banks use it to deploy new banking features. Airlines use it to update booking systems. Hospitals use it to improve patient management software. It's not just for Silicon Valley startups — it's everywhere.

Learning DevOps opens doors. And the best part is, you don't need to know everything at once. We'll take it one byte at a time.

Quick Recap

Here's everything we covered today in plain English:

  • DevOps = Developers and Operations working together instead of in separate silos
  • It solves the old problem of slow, painful, risky software releases
  • The core ideas are: small changes, automation, and fast feedback
  • Companies like Amazon, Netflix, and Uber rely on DevOps practices to ship small changes often; Amazon alone deploys thousands of times a day
  • A DevOps engineer builds the tools and systems that make all of this possible

What's Next?

In the next article we're going to look at Linux — The Operating System That Runs the Internet — the OS that powers most of the internet and why every DevOps engineer needs to know the basics.

It's going to be short, practical, and you'll be typing your first Linux commands before the end of the article. See you there.

Found this helpful? Share it with someone who is just getting started in tech. And follow along for a new article every week.

Retry Logic and Tiered Alerting in GitHub Actions

2026-06-16 04:21:14

🛠️ Pipelines in the Wild #2

Byte Size Summary

Most pipeline failures are transient — a registry returning a 503, a smoke test catching a slow cold start, a network blip during an image push. Retrying them automatically, with exponential backoff, means engineers never see them. The failures that reach a human should be the ones that actually need one. This article builds a retry wrapper and a three-tier alerting system (transient → silent, degraded → Slack warning, critical → PagerDuty page) on top of a GitHub Actions blue/green deploy workflow. The demo application is Waybill — a FastAPI shipment tracking API backed by PostgreSQL, where the health endpoint checks real database connectivity rather than returning a static 200. That distinction matters: a smoke test that only checks HTTP status is a smoke test that passes while your database is unreachable. By the end you will have a working repo you can run locally with Docker Compose and test today.

The Story

There is a specific kind of 11pm message that every engineer eventually receives.

Pipeline failed.

You open the logs. You trace the error. A Docker registry returned a 503. One HTTP request timed out during a smoke test. The deploy itself was fine — the old version is still running, nothing is broken, no user was affected. But the pipeline did not know that. It knew something returned a non-zero exit code, and it stopped.

You have just spent 25 minutes investigating a problem that lasted 3 seconds.

This is alarm fatigue. It is more dangerous than most engineers realise.

In supply chain operations, we had a name for it too. When every minor EDI (Electronic Data Interchange) hiccup generated a ticket, and every ticket required someone to manually verify whether a shipment was actually at risk, teams eventually started triaging alerts by instinct rather than data. The volume trained people to assume most alerts were noise. Which is exactly the environment in which a real failure goes unnoticed long enough to cost something.

A waybill is the document that travels with a consignment — the source of truth for what is in transit, where it is going, and whether it arrived. In logistics operations you learn quickly that not every exception needs a human. A delay at a sorting hub during peak hours is expected and self-correcting. A consignment held at customs with no reason code is not. The same distinction applies to pipelines: when everything pages, nothing gets treated as urgent, and the one failure that actually matters gets the same response time as a transient registry timeout.

The fix is not monitoring harder. It is building pipelines that distinguish between what needs a human and what they can handle themselves.

The Problem

Two categories of failure. One response. That is the root cause of most pipeline alert fatigue.

Transient failures — a network blip, a rate limit, a downstream service briefly unavailable — resolve on their own within seconds. Retrying them automatically almost always succeeds. A human should never see these.

Real failures — a broken deploy, a failed health check that does not recover, a rollback that did not complete — need attention. The right person should know immediately.

Most pipelines treat both identically: fail, stop, alert. Every transient error generates the same response as a production incident. Engineers learn to ignore it — until the wolf is real.

The pattern here separates these two categories at the pipeline level. Transient failures get retried silently. Real failures get classified by severity and routed to the right channel. The engineer who wakes up at 3am wakes up for something that genuinely requires them.

Why Existing Approaches Fall Short

Static retry in CI tools — Most CI platforms offer a basic retry mechanism, but they retry unconditionally. Three failed attempts at a genuinely broken deploy create three noisy alerts instead of one, and there is no backoff between attempts, which can worsen pressure on an already struggling downstream service.

Catch-all failure webhooks — A single if: failure() step that posts to Slack for every error is the most common pattern. It does not distinguish between a registry timeout and a failed deploy. After a week of false positives, engineers mute the channel.

No retry budget awareness — None of the standard patterns track how often a step is retrying over time. If image pushes are retrying on 40% of runs, that is not a transient problem — it is a reliability issue with the registry that needs fixing, not masking. Without tracking, the retries hide signal.
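
One way to make that signal visible is to record, per step, how many runs needed a retry and compare the rate against a budget. The Python sketch below is an assumption-laden illustration: `StepStats`, `over_budget`, and the 20% budget are hypothetical, and where the counts are persisted between runs is left open.

```python
from dataclasses import dataclass


@dataclass
class StepStats:
    runs: int = 0
    runs_with_retry: int = 0

    def record(self, attempts_used: int) -> None:
        """Record one pipeline run; more than one attempt counts as a retry."""
        self.runs += 1
        if attempts_used > 1:
            self.runs_with_retry += 1

    @property
    def retry_rate(self) -> float:
        return self.runs_with_retry / self.runs if self.runs else 0.0


def over_budget(stats: StepStats, budget: float = 0.2) -> bool:
    """True when retries are frequent enough to be a reliability issue."""
    return stats.retry_rate > budget


if __name__ == "__main__":
    push = StepStats()
    # Attempts used by the image push over ten hypothetical runs
    for attempts in [1, 2, 1, 3, 1, 2, 1, 2, 1, 1]:
        push.record(attempts)
    print(f"retry rate {push.retry_rate:.0%}, over budget: {over_budget(push)}")
```

With four of ten runs retrying, the step sits at the 40% mark described above and would be flagged for investigation rather than silently retried.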

The Architecture

Architecture Diagram

The diagram makes two design decisions visible. First, the retry loop sits entirely within the GitHub Actions runner boundary — the untrusted execution environment. Retries are handled before any external system (Slack, PagerDuty) is ever contacted. Second, the classifier is the trust boundary between the runner and the alerting layer: it decides what crosses that boundary, and the default is always to alert rather than to silently discard.

This workflow builds directly on the blue/green slot pattern from Article 01 — Zero-Downtime Deployments on a Single Server. If the slot file and nginx swap are new concepts, read that one first.

The three-tier split:

| Tier | Trigger | Response | Examples |
| --- | --- | --- | --- |
| TRANSIENT | Known flaky patterns | Silent (no notification) | Registry 503, rate limit, connection timeout |
| DEGRADED | Recoverable failure | Slack warning | Smoke test failed, health check degraded |
| CRITICAL | Deploy or rollback failed | Slack + PagerDuty page | Deploy failed, rollback required |

Unknown error patterns always default to DEGRADED. Silence is never the default.

Implementation

Prerequisites

The demo application is Waybill — a FastAPI shipment tracking API backed by PostgreSQL. It exposes endpoints to create shipments, append tracking events as a consignment moves through the network, and query status by waybill number. The /health endpoint returns the deployment slot (blue or green), the app version, and the live database connection state. A 503 response means the database is unreachable — which is a real failure worth alerting on, not a transient network blip to retry silently. That distinction is what makes the smoke tests in this pipeline meaningful rather than cosmetic.

To run it locally before connecting a real server:

cp .env.example .env           # set POSTGRES_PASSWORD
IMAGE_NAME=waybill BLUE_TAG=local GREEN_TAG=local \
  docker compose up --build

curl http://localhost:7070/health   # blue slot
curl http://localhost:9091/health   # green slot
open http://localhost:7070/docs     # OpenAPI explorer

Ports 7070 and 9091 are used deliberately — 8080 and 8081 conflict with common local tooling on Mac dev setups. Both are configurable via BLUE_PORT and GREEN_PORT environment variables if needed.

For the full pipeline deployment you also need:

  • A deploy server (Linux, Docker, Docker Compose v2, nginx)
  • A deploy user on the server with SSH key authentication and restricted sudo for nginx reload and the slot file write — see scripts/bootstrap-server.sh in the repo
  • GitHub secrets: SERVER_IP, SSH_PRIVATE_KEY, POSTGRES_PASSWORD, SLACK_WEBHOOK_URL, PAGERDUTY_ROUTING_KEY
  • PagerDuty routing key scoped to this pipeline only — rotate on any suspected exposure

All commands below are validated against GitHub Actions ubuntu-latest (ubuntu-24.04), Docker Compose v2, and nginx 1.24.

Step 1 — The retry wrapper

scripts/retry.sh defines a bash function, retry, that runs any command up to N times with exponential backoff and jitter. Source it in any step or composite action.

#!/usr/bin/env bash
# scripts/retry.sh
# Usage: source scripts/retry.sh
#        retry <max_attempts> <initial_delay_seconds> <command...>

retry() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local cmd=("$@")
  local attempt=1

  while [ $attempt -le $max_attempts ]; do
    echo "[retry] Attempt $attempt/$max_attempts: ${cmd[*]}"

    if "${cmd[@]}"; then
      echo "[retry] ✅ Succeeded on attempt $attempt"
      return 0
    fi

    if [ $attempt -lt $max_attempts ]; then
      # Exponential backoff with roughly ±10% jitter (a window about 20% of the delay wide), floor 1s, cap 60s
      local raw_jitter=$(( RANDOM % (delay / 5 + 2) - delay / 10 ))
      local wait=$(( delay + raw_jitter ))
      wait=$(( wait < 1 ? 1 : wait ))
      wait=$(( wait > 60 ? 60 : wait ))
      echo "[retry] ⏳ Waiting ${wait}s before retry (attempt $((attempt+1)))..."
      sleep "$wait"
      delay=$(( delay * 2 > 60 ? 60 : delay * 2 ))
    fi

    attempt=$(( attempt + 1 ))
  done

  echo "[retry] ❌ All $max_attempts attempts failed: ${cmd[*]}"
  return 1
}

The jitter prevents thundering herd: if multiple pipeline runs fail simultaneously and retry at exactly the same interval, they can hammer a struggling downstream service together. Random jitter distributes the load across the retry window.
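
The same backoff logic, restated in Python so the schedule can be inspected without sleeping. This is a sketch rather than a port of retry.sh: it uses a symmetric ±10% jitter, and `backoff_schedule`, `retry`, and their parameters are names chosen for this example.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_schedule(initial: float, attempts: int, cap: float = 60.0,
                     jitter: float = 0.1,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Waits between attempts: doubling delay, +/- jitter, floor 1s, capped."""
    rng = rng or random.Random()
    waits: List[float] = []
    delay = initial
    for _ in range(attempts - 1):  # no wait after the final attempt
        wait = delay * (1 + rng.uniform(-jitter, jitter))
        waits.append(max(1.0, min(cap, wait)))
        delay = min(cap, delay * 2)
    return waits


def retry(fn: Callable[[], bool], attempts: int, initial: float,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call fn until it returns True or the attempts run out."""
    waits = backoff_schedule(initial, attempts)
    for i in range(attempts):
        if fn():
            return True
        if i < attempts - 1:
            sleep(waits[i])
    return False


if __name__ == "__main__":
    # Four attempts starting at 10s give three waits near 10s, 20s, 40s.
    print([round(w, 1) for w in backoff_schedule(10, 4)])
```

Passing `sleep` as a parameter keeps the function testable: a test can substitute a recorder for `time.sleep` and check the waits without delay.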

Step 2 — Composite retry action

Wrap the retry call as a GitHub Actions composite action so any workflow can use it with two lines, without copy-pasting the source path.

# .github/actions/retry-step/action.yml
name: Retry Step
description: Run a shell command with exponential backoff retry

inputs:
  command:
    description: Shell command to execute (passed to bash -c)
    required: true
  max_attempts:
    description: Maximum number of attempts including the first try
    default: "3"
  initial_delay:
    description: Initial wait between retries in seconds
    default: "5"

runs:
  using: composite
  steps:
    - name: Run with retry
      shell: bash
      run: |
        source "$GITHUB_WORKSPACE/scripts/retry.sh"
        retry "${{ inputs.max_attempts }}" \
              "${{ inputs.initial_delay }}" \
              bash -c "${{ inputs.command }}"

$GITHUB_WORKSPACE resolves to the repo root regardless of where the action file lives in the directory tree. A relative path like ../../scripts/retry.sh breaks silently if the action is ever moved.

The retry function is sourced and called in the step's own bash shell, so it is available without being exported; only the wrapped command runs in a child process, through bash -c. The shell: bash declaration on the step ensures bash-specific features like local arrays and arithmetic expansion work correctly. Do not change this to sh.

Using it in a workflow:

- name: Push image (with retry)
  uses: ./.github/actions/retry-step
  with:
    command: docker push $IMAGE_NAME:${{ github.sha }}
    max_attempts: "4"
    initial_delay: "10"

- name: Smoke tests (with retry)
  uses: ./.github/actions/retry-step
  with:
    command: bash scripts/smoke-test.sh ${{ secrets.SERVER_IP }} ${{ steps.slot.outputs.target }}
    max_attempts: "3"
    initial_delay: "8"

Image pushes and smoke tests are the two steps most affected by transient failures — registry availability and network latency respectively. Retrying them is not masking a problem. It is acknowledging the reality of distributed systems.

The smoke test is meaningful here because the Waybill /health endpoint does real work: it checks live PostgreSQL connectivity and returns the active slot name. A 503 means the database is unreachable. A wrong slot name means traffic is pointing at the wrong container. A smoke test that only checks for HTTP 200 would pass in both of those failure states.
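
The checks a meaningful smoke test performs can be sketched as a pure function over the health response. The repo's smoke-test.sh is not shown here, so this Python version is an assumption: the JSON keys `slot` and `database`, the value "connected", and the function name `check_health` are all hypothetical.

```python
from typing import Any, Mapping, Tuple


def check_health(status_code: int, body: Mapping[str, Any],
                 expected_slot: str) -> Tuple[bool, str]:
    """Validate a /health response beyond the bare HTTP status."""
    if status_code != 200:
        return False, f"smoke test failed: HTTP {status_code}"
    # Assumed field: live database connection state
    if body.get("database") != "connected":
        return False, "smoke test failed: database not connected"
    # Assumed field: the slot that answered the request
    if body.get("slot") != expected_slot:
        return False, (f"smoke test failed: slot is {body.get('slot')!r}, "
                       f"expected {expected_slot!r}")
    return True, "ok"


if __name__ == "__main__":
    # A 200 from the wrong slot still fails the check.
    print(check_health(200, {"slot": "blue", "database": "connected"}, "green"))
```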

Step 3 — Tiered alerting

scripts/alert.py classifies the error and routes it. It uses only Python stdlib — no pip install in the failure path. Installing a dependency at the moment you need to report a failure is fragile: if PyPI is unreachable (which can happen during exactly the kind of network incidents that also cause pipeline failures), the alert step silently fails.

#!/usr/bin/env python3
"""
alert.py — tiered pipeline alerting

Severity tiers:
  TRANSIENT → silent discard (no notification)
  DEGRADED  → Slack warning (Block Kit)
  CRITICAL  → Slack + PagerDuty page

Required environment variables (set as GitHub Actions secrets):
  SLACK_WEBHOOK_URL      — Slack incoming webhook URL
  PAGERDUTY_ROUTING_KEY  — Events API v2 key, scoped to this service only

Usage:
  python3 scripts/alert.py "error message string"
"""

import os
import sys
import json
import urllib.request
import urllib.error
from enum import Enum
from datetime import datetime, timezone


class Severity(Enum):
    TRANSIENT = "transient"
    DEGRADED  = "degraded"
    CRITICAL  = "critical"


# Keep TRANSIENT patterns as specific as possible.
# Broad patterns risk silencing a real failure whose error message
# happens to contain a transient-sounding substring.
ERROR_PATTERNS: dict[Severity, list[str]] = {
    Severity.TRANSIENT: [
        "registry connection timeout",
        "registry unavailable",
        "registry rate limit",
        "registry 503",
        "registry 502",
        "i/o timeout",
        "connection refused to registry",
        "429 too many requests",
    ],
    Severity.DEGRADED: [
        "smoke test failed",
        "slow response",
        "health check degraded",
        "non-zero exit code",
    ],
    Severity.CRITICAL: [
        "deploy failed",
        "rollback required",
        "production down",
        "slot swap failed",
        "health check failed",
        "container crashed",
    ],
}


def classify(error_msg: str) -> Severity:
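    msg = error_msg.lower()
    # Tiers are checked in dict insertion order (TRANSIENT, DEGRADED, CRITICAL),
    # so the first tier with a matching pattern wins.
    for severity, patterns in ERROR_PATTERNS.items():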
        if any(p in msg for p in patterns):
            return severity
    # Unknown patterns default to DEGRADED — never silenced.
    return Severity.DEGRADED


def _post(url: str, payload: dict, timeout: int = 10) -> None:
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            if resp.status not in (200, 201, 202):
                print(f"[alert] Unexpected HTTP {resp.status}", file=sys.stderr)
    except OSError as exc:  # covers URLError (incl. HTTPError) and socket timeouts
        print(f"[alert] POST failed ({url}): {exc}", file=sys.stderr)


def send_slack(message: str, severity: Severity) -> None:
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook:
        print("[alert] SLACK_WEBHOOK_URL not set — skipping Slack", file=sys.stderr)
        return

    repo   = os.getenv("GITHUB_REPOSITORY", "unknown/repo")
    branch = os.getenv("GITHUB_REF_NAME",   "unknown")
    run_id = os.getenv("GITHUB_RUN_ID",      "0")
    ts     = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")

    icons   = {Severity.DEGRADED: "🟡", Severity.CRITICAL: "🔴"}
    icon    = icons.get(severity, "")
    run_url = f"https://github.com/{repo}/actions/runs/{run_id}"

    payload = {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"{icon} [{severity.value.upper()}] Pipeline Alert",
                },
            },
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"*{message}*"},
                "fields": [
                    {"type": "mrkdwn", "text": f"*Branch*\n{branch}"},
                    {"type": "mrkdwn", "text": f"*Run*\n<{run_url}|{run_id}>"},
                    {"type": "mrkdwn", "text": f"*Repo*\n{repo}"},
                    {"type": "mrkdwn", "text": f"*Time*\n{ts}"},
                ],
            },
            {"type": "divider"},
        ]
    }
    _post(webhook, payload)


def send_pagerduty(message: str) -> None:
    key = os.environ.get("PAGERDUTY_ROUTING_KEY")
    if not key:
        print("[alert] PAGERDUTY_ROUTING_KEY not set — skipping PagerDuty", file=sys.stderr)
        return

    repo   = os.getenv("GITHUB_REPOSITORY", "unknown/repo")
    run_id = os.getenv("GITHUB_RUN_ID", "0")
    # dedup_key groups all alerts from the same run into one incident.
    # Without it, a flapping pipeline opens a new incident on every failure.
    dedup_key = f"{repo}/run/{run_id}"

    payload = {
        "routing_key":  key,
        "event_action": "trigger",
        "dedup_key":    dedup_key,
        "payload": {
            "summary":  message,
            "severity": "critical",
            "source":   "github-actions",
            "custom_details": {
                "repository": repo,
                "run_id":     run_id,
                "sha":        os.getenv("GITHUB_SHA"),
            },
        },
    }
    _post("https://events.pagerduty.com/v2/enqueue", payload)


def alert(error_msg: str) -> None:
    severity = classify(error_msg)

    if severity == Severity.TRANSIENT:
        print("[alert] Transient pattern matched — no notification sent")
        return

    send_slack(error_msg, severity)

    if severity == Severity.CRITICAL:
        send_pagerduty(error_msg)
        print("[alert] 🚨 Critical — Slack + PagerDuty triggered")
    else:
        print("[alert] ⚠️  Degraded — Slack warning sent")


if __name__ == "__main__":
    msg = sys.argv[1] if len(sys.argv) > 1 else "Unknown pipeline failure"
    alert(msg)

The Slack payload uses Block Kit (Slack's component-based message format, built with the blocks array) rather than the legacy Attachments API. The PagerDuty payload includes a dedup_key composed of the repository name and run ID. Without it, a flapping pipeline opens a new incident on every failure. With it, all alerts from the same run are grouped into one incident, and a later resolve event carrying the same dedup_key closes that incident. Note that alert.py as shown sends only trigger events.
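
For reference, a resolve event in the PagerDuty Events API v2 reuses the dedup_key of the trigger it closes. The Python sketch below only builds the request body. The function name `resolve_event` is hypothetical, and sending it (for example from a step that runs after a successful retry) is not part of the script shown above.

```python
import json


def resolve_event(routing_key: str, repo: str, run_id: str) -> dict:
    """Build an Events API v2 body that resolves this run's incident."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        # Must match the dedup_key used by the trigger event.
        "dedup_key": f"{repo}/run/{run_id}",
    }


if __name__ == "__main__":
    # Placeholder values for illustration only
    body = resolve_event("ROUTING_KEY_PLACEHOLDER", "acme/waybill", "12345")
    print(json.dumps(body, indent=2))
```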

Step 4 — The full workflow

Below is the complete deploy.yml, with retry wrappers on the flaky steps, a slot guard on the rollback, and a check of container state before the rollback is declared complete.

# .github/workflows/deploy.yml
name: Self-Healing Deploy

on:
  push:
    branches: [main]

# Required for GHCR (GitHub Container Registry) push. Organisations with
# restrictive default token permissions must grant these explicitly;
# without them the image push returns 403 even with a valid GITHUB_TOKEN.
permissions:
  contents: read
  packages: write

env:
  IMAGE_NAME: ghcr.io/${{ github.repository }}

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build image
        run: docker build -t $IMAGE_NAME:${{ github.sha }} .

      # Registry pushes are the most common transient failure source
      - name: Push image (with retry)
        uses: ./.github/actions/retry-step
        with:
          command: docker push $IMAGE_NAME:${{ github.sha }}
          max_attempts: "4"
          initial_delay: "10"

      - name: Detect active slot
        id: slot
        run: |
          ACTIVE=$(ssh deploy@${{ secrets.SERVER_IP }} \
            "cat /etc/deploy/active-slot 2>/dev/null || echo blue")
          echo "active=$ACTIVE" >> $GITHUB_OUTPUT
          if [ "$ACTIVE" = "blue" ]; then
            echo "target=green" >> $GITHUB_OUTPUT
          else
            echo "target=blue" >> $GITHUB_OUTPUT
          fi

      - name: Deploy to inactive slot
        run: |
          TARGET=${{ steps.slot.outputs.target }}
          ssh deploy@${{ secrets.SERVER_IP }} << EOF
            export IMAGE_NAME=$IMAGE_NAME
            export ${TARGET^^}_TAG=${{ github.sha }}
            docker compose pull waybill-$TARGET
            docker compose up -d --no-deps waybill-$TARGET
          EOF

      # Smoke tests run over a network — give them room for cold starts
      - name: Smoke tests (with retry)
        uses: ./.github/actions/retry-step
        with:
          command: >
            bash scripts/smoke-test.sh
            ${{ secrets.SERVER_IP }}
            ${{ steps.slot.outputs.target }}
          max_attempts: "3"
          initial_delay: "8"

      - name: Swap traffic to new slot
        run: |
          bash scripts/swap-traffic.sh \
            ${{ secrets.SERVER_IP }} \
            ${{ steps.slot.outputs.target }}

      # ── Failure path ──────────────────────────────────────────────────────────
      # Alert first — on-call needs context before rollback begins
      - name: Classify and alert on failure
        if: failure()
        run: |
          python3 scripts/alert.py \
            "deploy failed on ${{ github.ref_name }} — run ${{ github.run_id }}"
        env:
          SLACK_WEBHOOK_URL:     ${{ secrets.SLACK_WEBHOOK_URL }}
          PAGERDUTY_ROUTING_KEY: ${{ secrets.PAGERDUTY_ROUTING_KEY }}

      - name: Rollback on failure
        if: failure()
        run: |
          TARGET="${{ steps.slot.outputs.target }}"
          # Guard: if slot detection failed earlier, TARGET is empty
          if [ -z "$TARGET" ]; then
            echo "::error::Slot detection failed — manual rollback required"
            exit 1
          fi
          ssh deploy@${{ secrets.SERVER_IP }} bash << EOF
            set -euo pipefail
            docker compose stop --timeout 30 waybill-$TARGET
            # Verify the container actually stopped.
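            # docker compose ps --format json emits a JSON array in older Compose v2
            # releases and one JSON object per line (JSONL) since v2.21. Handle both.
            # TARGET is expanded on the runner (unescaped); the remote shell does
            # not define it, so an escaped reference would abort under set -u.
            # The Python lines sit at the block's base indentation so that YAML
            # keeps them inside the run script and Python sees them at column 0.
            STATUS=\$(docker compose ps waybill-$TARGET --format json \
              | python3 -c "
          import sys, json
          raw = sys.stdin.read().strip()
          try:
              d = json.loads(raw)
              obj = d[0] if isinstance(d, list) else d
              print(obj.get('State', 'unknown'))
          except Exception:
              print('unknown')
          " 2>/dev/null || echo "unknown")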
            echo "Container state after stop: \$STATUS"
            if [ "\$STATUS" = "running" ]; then
              echo "::error::Container did not stop — manual intervention required"
              exit 1
            fi
          EOF
          echo "Active slot unchanged. Rollback complete."

The alert step runs before the rollback step. The person who responds to a PagerDuty page needs to know what failed before they start diagnosing whether the rollback worked. Order matters here.

The empty-slot guard protects against a specific failure mode: if the "Detect active slot" step never ran (because the build or push failed first), steps.slot.outputs.target is an empty string. Without the guard, the rollback would run docker compose stop waybill- with no slot suffix, which either silently fails or stops the wrong container.

Security Considerations

SSH key scope. The deploy user's SSH key has access to the server. Restrict it to specific commands via authorized_keys command= restrictions, or scope what the deploy user can run via sudoers. The bootstrap-server.sh script in the repo sets this up: the deploy user can write the slot file and reload nginx, nothing else. A compromised runner should not have broad filesystem access to the deploy server.

PagerDuty routing key. This key can trigger incidents against any service configured under it. Use a key scoped to this pipeline only. Rotate it on any suspected exposure. Treat it with the same care as a production database password — it is a denial-of-sleep vector if leaked.

Secrets in environment variables. SLACK_WEBHOOK_URL and PAGERDUTY_ROUTING_KEY are passed as environment variables to the alert step. GitHub Actions masks known secret values in logs, but partial matches or URL-encoded variants may not be caught. Never echo or log these values inside alert.py or any script the failure step calls.

Alert classification is a moving target. The ERROR_PATTERNS dict is not a security control — it is operational configuration. Its default behaviour (unknown errors → DEGRADED, never TRANSIENT) means an attacker who can influence error messages cannot silently suppress alerts. Verify this holds if you extend the TRANSIENT patterns significantly.
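The fail-loud default can be sketched as follows. The patterns here are hypothetical stand-ins chosen to match the local test commands in this article; the real ERROR_PATTERNS dict lives in alert.py and may be shaped differently. Checking CRITICAL before TRANSIENT is also an assumption of this sketch, made so that a transient-looking substring cannot downgrade a critical message.

```python
import re
from enum import Enum


class Severity(Enum):
    TRANSIENT = "transient"
    DEGRADED = "degraded"
    CRITICAL = "critical"


# Hypothetical patterns, for illustration only.
ERROR_PATTERNS = {
    Severity.TRANSIENT: [r"connection timeout", r"\b50[234]\b"],
    Severity.CRITICAL: [r"deploy failed", r"rollback failed"],
}


def classify(error_msg: str) -> Severity:
    """Return the first matching severity, checking CRITICAL before TRANSIENT."""
    for severity in (Severity.CRITICAL, Severity.TRANSIENT):
        for pattern in ERROR_PATTERNS[severity]:
            if re.search(pattern, error_msg, re.IGNORECASE):
                return severity
    # No match: unknown errors are never treated as TRANSIENT.
    return Severity.DEGRADED


if __name__ == "__main__":
    for msg in ("registry connection timeout on push",
                "smoke test failed on main",
                "deploy failed on main"):
        print(f"{msg!r} -> {classify(msg).name}")
```

A quick way to verify the property after extending the TRANSIENT list is to assert that a few arbitrary, never-before-seen strings still classify as DEGRADED.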

GITHUB_TOKEN permissions. The workflow sets permissions: contents: read, packages: write explicitly. Organisations with restrictive default token permissions should audit this before deploying — granting packages: write at the workflow level is appropriate here, but teams using more granular job-level permission scoping should move the block to the deploy job instead.

Tradeoffs

What you gain / what you give up

Retry logic reduces alert noise at the cost of masking underlying reliability issues. If your registry is returning 503s on 30% of pushes, retry with backoff means your pipeline succeeds and nobody investigates the registry. You need to monitor retry rates, not just retry outcomes. The scaffold repo includes a commented section in README.md on how to surface this via GitHub Actions workflow telemetry.

Three-tier alerting requires ongoing maintenance. The ERROR_PATTERNS dictionary reflects your pipeline's failure modes at the time you wrote it. New integrations, new infrastructure, and new failure modes will produce strings that do not match any pattern and land in DEGRADED. Review the patterns monthly for the first three months. After that, review any time a new step is added to the pipeline.

The stdlib-only approach in alert.py avoids the fragile pip install in the failure path, but it means the HTTP layer is less configurable. The urllib implementation has no connection pooling, no automatic retry, and no response decoding beyond status code. For a notification script in a CI failure step, that is the right tradeoff. For anything more complex, use a dedicated alerting service the pipeline calls externally.
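For reference, a stdlib-only JSON POST can be as small as the sketch below. The helper names `build_json_post` and `post_json` are hypothetical and not taken from alert.py; the sketch only illustrates how little the urllib layer gives you beyond a status code.

```python
import json
import urllib.request


def build_json_post(url: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: dict, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so the
    caller decides whether a failed notification should fail the step.
    """
    request = build_json_post(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the payload logic testable without a network connection.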

Blue/green with slot files is simple and observable — you can cat /etc/deploy/active-slot on the server at any time. It is also manual. If the server is unreachable, the slot file is stale, and your pipeline's rollback logic does not know the real state. For environments where the deploy server could itself be a failure point, consider moving slot state to a registry or a distributed key-value store.

What I'd Do Differently

Tune the alert patterns from day one. I treated ERROR_PATTERNS as infrastructure, something you define once and leave. It is not. It is a codebase. The patterns that matter are the ones your specific pipeline produces under your specific failure conditions. Starting with a broad TRANSIENT list and narrowing it based on observation is better than starting narrow and widening it reactively.

Add retry rate tracking early. The retry wrapper succeeds silently. That is by design. But if you are not tracking how often each step retries, you lose the signal that distinguishes a genuinely transient failure from a degrading dependency. A simple counter written to a metrics endpoint or even a structured log line is enough to surface this.
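One way to get that signal is a structured log line per step, sketched here in Python. The article's retry wrapper is a composite action, so `run_with_retry`, the `retry_summary` event name, and the doubling backoff are assumptions for illustration; the defaults mirror the max_attempts and initial_delay inputs used in the workflow.

```python
import json
import sys
import time


def _log(step, attempts, outcome, error=None):
    # One JSON object per line, so a log pipeline can count retries per step.
    record = {"event": "retry_summary", "step": step, "attempts": attempts,
              "retried": attempts > 1, "outcome": outcome, "error": error}
    print(json.dumps(record), file=sys.stderr)


def run_with_retry(step, fn, max_attempts=4, initial_delay=10.0, sleep=time.sleep):
    """Call fn() with exponential backoff and log how many attempts it took."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
        except Exception as exc:
            if attempt == max_attempts:
                _log(step, attempt, "failed", str(exc))
                raise
            sleep(delay)
            delay *= 2  # double the wait before the next attempt
        else:
            _log(step, attempt, "succeeded")
            return result
```

A step that succeeds on its third attempt still exits cleanly, but the line with "retried": true is what lets you chart retry rates over time.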

Test the rollback path before the first production deploy. The rollback step in the workflow is only as reliable as you have tested it. Break a deploy deliberately in a staging environment, verify the rollback fires, verify the correct container stops, verify the slot file is unchanged. The one time you need it is not the time to discover it has a bug.

GitHub Repo

pipelineandprompts-labs/pipelines-in-the-wild/02-retry-logic-tiered-alerting

The repo contains the Waybill API — a FastAPI shipment tracking application backed by PostgreSQL. Shipments are created with a waybill number, and tracking events are appended as the consignment moves through the network. The /health endpoint checks live database connectivity and reports the active deployment slot, which makes it a real integration test rather than a TCP ping. Both blue and green slots run on separate ports (7070 and 9091) sharing a single Postgres instance — the same topology the pipeline manages.

The repo also includes a scaffold script that prints the exact gh secret set commands for your environment and a quick-start guide for local dev and alerting tests:

./scaffold-self-healing-pipeline.sh waybill 10.0.0.42

To test the alerting locally before connecting real secrets:

# TRANSIENT — silent
python3 scripts/alert.py "registry connection timeout on push"

# DEGRADED — Slack warning (set SLACK_WEBHOOK_URL first)
SLACK_WEBHOOK_URL=https://hooks.slack.com/... \
  python3 scripts/alert.py "smoke test failed on main"

# CRITICAL — Slack + PagerDuty
SLACK_WEBHOOK_URL=https://hooks.slack.com/... \
PAGERDUTY_ROUTING_KEY=your-key \
  python3 scripts/alert.py "deploy failed on main"

What's Next

Article 03 covers secrets management across multi-cloud environments — storing, rotating, and injecting credentials into GitHub Actions without hardcoding them and without creating a single point of failure in how your pipeline authenticates.

More from the series: Pipelines in the Wild

Written by Pipeline & Prompts | pipelineandprompts.dev

All working code: github.com/pipelineandprompts-labs/pipelines-in-the-wild/02-retry-logic-tiered-alerting

How I Learned to See the Matrix: Boosting My Problem‑Solving Speed Under Pressure

2026-06-16 04:16:52

The Quest Begins (The "Why")

Picture this: I’m sitting in a cramped interview room, the whiteboard glaring back at me like the Eye of Sauron. The interviewer drops the classic “Longest Substring Without Repeating Characters” problem and gives me five minutes to solve it. My heart starts doing the Imperial March drum‑roll. I fumble through a brute‑force solution—nested loops, checking every possible substring, resetting a set each time. It works on the tiny examples, but the moment the input grows past a dozen characters I can feel the time ticking away like the countdown in Mission: Impossible. I finish with a solution that’s O(n²) and watch the interviewer’s eyebrows raise just enough to signal “nice try, but…”.

That moment stuck with me. I realized I wasn’t lacking knowledge; I was missing a mental framework that lets me spot the hidden pattern instantly—like Neo seeing the green code rain in The Matrix. If I could train my brain to jump straight to that insight, I’d shave minutes off every problem, not just in interviews but in everyday debugging marathons. So I went on a quest to find that framework, and what I discovered felt like uncovering a lightsaber in a junkyard.

The Revelation (The Insight)

The breakthrough came when I stopped thinking about “checking every substring” and started asking: What do I need to know right now to decide whether I can extend the current window?

In the longest‑substring‑without‑repeating‑characters problem, the only thing that matters is the most recent index of each character we’ve seen. If we encounter a character that’s already inside our current window, we don’t need to scrap everything and start over; we just slide the left side of the window just past its previous occurrence.

That’s the aha! moment: maintain a sliding window bounded by two pointers (left and right) and a hash map (lastIndex in the code below) that stores the last index where each character appeared. As the right pointer moves forward, we can instantly decide whether the window is still valid, and if not, we jump the left pointer to max(left, lastIndex[char] + 1). No rescanning, no resetting sets, just constant-time updates.

It felt like discovering the Force: a simple rule that lets you sense disturbances in the string and react instantly. Once I internalized that pattern, the problem stopped being a beast and became a dance.

Wielding the Power (Code & Examples)

The Struggle (Before)

Here’s what my first attempt looked like—naïve, O(n²), and painful to watch under pressure:

function lengthOfLongestSubstring(s) {
  let maxLen = 0;
  for (let i = 0; i < s.length; i++) {
    const seen = new Set();
    for (let j = i; j < s.length; j++) {
      if (seen.has(s[j])) break; // duplicate found, stop inner loop
      seen.add(s[j]);
      maxLen = Math.max(maxLen, j - i + 1);
    }
  }
  return maxLen;
}

What’s wrong?

  • The inner loop restarts the Set for every i, wasting work we already did.
  • In the worst case (“abcdefghijklmnopqrstuvwxyz…”) we still do ~n²/2 operations.
  • Under timed pressure, that extra work is the difference between “I got it” and “I ran out of time”.

The Victory (After)

Now the same problem, armed with the sliding‑window insight:

function lengthOfLongestSubstring(s) {
  const lastIndex = new Map(); // char -> last position
  let maxLen = 0;
  let left = 0; // start of the current window

  for (let right = 0; right < s.length; right++) {
    const ch = s[right];

    // If ch was seen inside the current window, move left just after its previous spot
    if (lastIndex.has(ch) && lastIndex.get(ch) >= left) {
      left = lastIndex.get(ch) + 1;
    }

    // Update the last seen position for ch
    lastIndex.set(ch, right);

    // Window size is right - left + 1
    maxLen = Math.max(maxLen, right - left + 1);
  }

  return maxLen;
}

Why this feels like a spell:

  • Only one pass (O(n) time).
  • Constant-time map look-ups (O(1) on average).
  • No extra nested loops, no resetting structures—just two pointers dancing across the string.

Common Traps (The “Bosses” to Avoid)

  1. Forgetting the >= left check – If you reset left on every repeat, a character last seen before the current window drags left backward, and the window grows to include a duplicate. Example: "abba"; at the final a the map still holds index 0, so without the check left jumps from 2 back to 1 and the code reports "bba" (length 3) instead of the correct answer, 2.
  2. Not updating lastIndex after moving left – The map must always hold the most recent index; otherwise future repeats will think the character is still at an old position and cause incorrect window shifts.

Avoid those, and the algorithm flows like a lightsaber through butter.
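To see the first trap in action, here is a small sketch, written in Python for brevity, that runs the same algorithm with and without the check. The function name `longest_unique` and its `guard` flag are illustrative only; the logic mirrors the JavaScript solution above.

```python
def longest_unique(s: str, guard: bool = True) -> int:
    """Length of the longest substring without repeating characters.

    guard=False drops the `>= left` check to reproduce the trap.
    """
    last_index = {}  # char -> most recent position
    left = 0         # start of the current window
    best = 0
    for right, ch in enumerate(s):
        seen = ch in last_index
        if seen and (not guard or last_index[ch] >= left):
            left = last_index[ch] + 1
        last_index[ch] = right
        best = max(best, right - left + 1)
    return best


if __name__ == "__main__":
    print(longest_unique("abba"))               # correct version
    print(longest_unique("abba", guard=False))  # left moves backward
```

With the check the answer for "abba" is 2; without it the final a pulls left back to index 1 and the function returns 3.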

Why This New Power Matters

Mastering this sliding‑window mindset does more than solve one LeetCode problem. It trains you to ask the right question: “What minimal state do I need to keep to decide if I can extend my current solution?” That question pops up in:

  • Maximum subarray sum (Kadane’s algorithm) – keep the best sum ending at the current position.
  • Minimum window substring – maintain counts of needed characters while expanding/contracting a window.
  • Streaming data problems – you often can’t store everything; you need a summary that lets you make decisions on the fly.
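The first item on that list shows how small the "minimal state" can be. A sketch of Kadane's algorithm, in Python for brevity, keeps only the best sum ending at the current position and the best sum seen so far; the function name `max_subarray_sum` is illustrative.

```python
def max_subarray_sum(nums):
    """Largest sum of any non-empty contiguous subarray (Kadane's algorithm)."""
    if not nums:
        raise ValueError("nums must not be empty")
    current = best = nums[0]
    for x in nums[1:]:
        # Either extend the run ending at the previous element or start fresh at x.
        current = max(x, current + x)
        best = max(best, current)
    return best


if __name__ == "__main__":
    print(max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))
```

The decision at each step is the same one the sliding window makes: extend the current candidate or discard it, using nothing but a constant amount of remembered state.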

When you internalize the framework, you stop staring at a blank screen hoping for inspiration and start constructing the solution piece by piece, much like building a Lego set with the instruction manual in hand. The pressure still exists, but now you have a reliable toolset that turns panic into pattern recognition.

I’ve seen my interview times drop from “barely finishing” to “finished with a minute to spare”. I’ve seen production bugs get tackled faster because I could isolate the offending segment with a sliding‑window mindset instead of tearing through logs line by line. It’s a superpower that compounds every time you code.

Your Turn – The Challenge

Grab a timer, pick a problem that usually makes you sweat (e.g., “Minimum Size Subarray Sum”, “Longest Repeating Character Replacement”, or even “Find All Anagrams in a String”), and give yourself five minutes. Before you dive into code, spend sixty seconds asking: What tiny piece of information do I need to keep track of to know if I can keep going? Write that down, then implement the sliding‑window solution.

When the timer dings, compare your solution to the brute‑force version you’d normally write. Notice the difference in speed, clarity, and confidence.

Now go forth—may the sliding window be with you, and may your next bug feel like a boss you’ve already defeated! 🚀