2026-06-16 04:37:04
As developers, we are spending more and more time working alongside AI coding agents like Cursor, Claude Code, GitHub Copilot, Windsurf, or Cline.
But as your session grows, you quickly run into two major problems:
To solve this, I built TITAN (Token Intelligence Through Agent Narrowing): a universal, zero-dependency CLI framework designed to compress AI agent token consumption by 70% to 85% without degrading reasoning quality.
And to make things interesting, I wrote and shipped it this week entirely on my own, right in the middle of my high school final exams (la maturità here in Italy).
Here is how it works under the hood.
TITAN approaches token optimization not as a single post-processing step, but as three orthogonal, multiplicative layers:
$$\text{Total Savings} = 1 - (1 - \text{L1\_Savings}) \times (1 - \text{L2\_Savings}) \times (1 - \text{L3\_Savings})$$
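To make the multiplicative formula concrete, here is a minimal sketch in Python. The per-layer figures are hypothetical illustrations, not measurements from TITAN, and `total_savings` is a name chosen for this example.

```python
def total_savings(*layer_savings: float) -> float:
    """Combine independent per-layer savings, each a fraction in [0, 1]."""
    remaining = 1.0
    for saving in layer_savings:
        if not 0.0 <= saving <= 1.0:
            raise ValueError(f"saving must be between 0 and 1, got {saving}")
        # Each layer only acts on the tokens the previous layers left behind.
        remaining *= 1.0 - saving
    return 1.0 - remaining


# Hypothetical layer figures, chosen only to illustrate the arithmetic.
print(f"{total_savings(0.40, 0.30, 0.45):.3f}")  # 0.769
```

Because each layer works on what is left over, three layers at 40%, 30%, and 45% combine to about 77%, not to their sum.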
Instead of letting the LLM output standard verbose English prose (pleasantries, hedging, filler words, technical narrations), the Caveman Engine instructs the model to use a dense, telegraphese grammar:
basically, actually, likely, probably $\to$ removed.
the, a, an $\to$ removed (when safe).
"Component re-renders" instead of "The component is re-rendering".
Before the agent writes a single line of code, it must traverse a 6-rung logical ladder to guarantee the laziest, most minimal solution:
Every deliberate simplification is documented inline: // ponytail: <ceiling>, <upgrade path> (e.g. // ponytail: local memory cache, use Redis if multi-node setup is required).
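As a rough illustration of how comments in this format could be collected, here is a small Python sketch. It is not how `titan debt` is implemented (TITAN itself is written for Node.js); the regex and the `find_ponytails` helper are assumptions made for this example, based only on the `// ponytail: <ceiling>, <upgrade path>` shape described above.

```python
import re

# Assumed shape: // ponytail: <ceiling>, <upgrade path>
PONYTAIL = re.compile(r"//\s*ponytail:\s*(?P<ceiling>[^,]+),\s*(?P<upgrade>.+)")


def find_ponytails(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, ceiling, upgrade path) for each ponytail comment."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = PONYTAIL.search(line)
        if match:
            hits.append(
                (lineno, match["ceiling"].strip(), match["upgrade"].strip())
            )
    return hits


sample = (
    "const cache = new Map(); "
    "// ponytail: local memory cache, use Redis if multi-node setup is required"
)
for lineno, ceiling, upgrade in find_ponytails(sample):
    print(f"line {lineno}: {ceiling} -> {upgrade}")
```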
CLAUDE.md) are compressed post-hoc to strip prose while keeping code conventions exact, saving up to 45% input tokens on every turn.
npm run build 2>&1 | titan filter
Following the structural (L2) rule of using the standard library, TITAN has zero external npm dependencies.
It uses Node.js native features (fs, path, readline, child_process, https) for everything:
| and >).
node:test and node:assert modules.
To verify that compressing prompts doesn't degrade the AI's coding and reasoning capabilities, I built an evaluation harness into TITAN to measure Usable Intelligence Density (UID):
$$\text{UID} = \frac{\text{Avg Accuracy \%}}{\text{Avg Total Tokens}} \times 1000$$
Here is how the variants perform under mock and empirical LLM runs over a 5-task suite (Coding, Debugging, Logic, Refactoring, and Code Review):
| Variant | Avg Accuracy | Avg In Tok | Avg Out Tok | Avg Tot Tok | UID (Density) | Status |
|---|---|---|---|---|---|---|
| Baseline | 100% | 50 | 198 | 248 | 403.2 | Reliable |
| Caveman | 100% | 120 | 78 | 198 | 505.1 | Reliable |
| Ponytail | 86% | 115 | 67 | 182 | 472.5 | Reliable |
| TITAN Balanced | 100% | 1500 | 80 | 1580 | 63.3 | Reliable |
| TITAN Lite | 100% | 425 | 91 | 516 | 193.8 | Reliable |
| TITAN Aggressive | 79% | 400 | 50 | 450 | 175.7 | ⚠ Degraded |
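The UID column can be recomputed from the accuracy and total-token columns. The sketch below does that in Python for three rows of the table; `uid` is a helper name chosen for this example. The table shows rounded averages, so a recomputed value can differ from the published one in the last digit.

```python
def uid(avg_accuracy_pct: float, avg_total_tokens: float) -> float:
    """Usable Intelligence Density: accuracy (%) per 1000 total tokens."""
    return avg_accuracy_pct / avg_total_tokens * 1000


# (Avg Accuracy %, Avg Tot Tok) copied from the table above.
rows = {
    "Baseline": (100, 248),
    "Caveman": (100, 198),
    "TITAN Lite": (100, 516),
}

for name, (accuracy, tokens) in rows.items():
    print(f"{name}: {uid(accuracy, tokens):.1f}")
```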
TITAN prompt reflects the cost of loading the full master ruleset. The titan_lite variant balances prompt size and output compression beautifully.
You can install TITAN globally from npm:
npm install -g titan-agent-cli
Then initialize the ruleset for your editor. For instance, to generate Cursor rules (.cursor/rules/titan.mdc):
# Standard balanced configuration
titan init --agent=cursor
# Or a lightweight prompt ruleset (~620 tokens)
titan init --agent=cursor --lite
To run the native unit tests locally:
titan test
And to scan your codebase for active technical debt ponytail comments:
titan debt
TITAN is fully open source. I’d love to get your thoughts, contributions, or a star on GitHub!
If you have any feedback on the standard library YAML parser or ideas on expanding adapters for new IDEs, let me know in the comments below!
2026-06-16 04:24:36
Hey all!
I've been thinking that we should all come together as a community to celebrate the end of the year.
I am planning on doing this yearly under the DEVenger org, but I don't have any ideas as of now. We can do something like "the DEV community built this" or something.
If you have any ideas on what the dev.to community should do, let me know!
2026-06-16 04:21:19
Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI
Think about this for a second. Netflix has over 260 million subscribers worldwide. People are watching shows in Tokyo, London, Lagos, and New York — all at the same time. And yet, when was the last time Netflix crashed on you?
Now think about your favourite food delivery app. You open it, order food, track your driver in real time, and get a notification the moment your burger arrives. All of that happens in seconds.
Behind all of this is a way of working called DevOps. And by the end of this article, you'll understand exactly what it is — no jargon, no complicated diagrams, just plain English.
To understand DevOps, we first need to understand the problem it solved.
Imagine a software company in the early 2000s. They had two completely separate teams:
The Developers — the people who wrote the code and built new features
The Operations team — the people who managed the servers and kept everything running
These two teams barely talked to each other. Developers would spend months building new features, then hand over a massive pile of code to the operations team and say "here you go, make it work."
The operations team would panic. They hadn't been involved in building it, had no idea what it did, and now they had to deploy it to millions of users without breaking anything.
The result? Deployments took weeks. Bugs slipped through. Systems crashed. Customers complained. And the two teams blamed each other.
Sound stressful? It was.
DevOps is simply the practice of bringing developers and operations teams together to build, test, and release software faster and more reliably.
The name itself is a combination of Dev (Development) and Ops (Operations). Instead of two teams working in silos, they work as one team with shared goals, shared tools, and shared responsibility.
Think of it like a restaurant kitchen.
In a badly run kitchen, the chefs cook the food and just slide it through a hatch to the waiters. The waiters don't know what's in the dish, the chefs don't know what the customers are saying, and when something goes wrong, everyone points fingers.
In a well-run kitchen, like the ones you see at a great restaurant, the chefs and waiters communicate constantly. They know the menu inside out, they get feedback from customers quickly, and they work as one team to give people a great experience.
DevOps is that well-run kitchen, but for software.
Amazon deploys new code to its website thousands of times per day.
That means engineers are constantly making small improvements — fixing a bug here, improving the checkout experience there, tweaking a recommendation — and those changes go live almost instantly.
How? Because Amazon uses DevOps practices. Small changes are automatically tested, automatically checked for problems, and automatically deployed without anyone having to manually press a button.
In the old way of working, those same changes might have taken weeks to go live, gone through five teams, and required a late night deployment session that everyone dreaded.
You don't need to memorise these, but it helps to know the thinking behind DevOps.
Instead of building for six months and releasing everything at once (terrifying), DevOps teams release small changes frequently. If something breaks, it's easy to find and fix because the change was tiny.
Uber does this constantly. Every few weeks, the Uber app gets tiny updates — a new button here, a faster map there. You barely notice, but the team is constantly improving without disrupting your experience.
Testing code manually, deploying to servers manually, checking for errors manually — all of this is slow and humans make mistakes. DevOps teams automate these tasks so they happen instantly and consistently every single time.
Think of it like a car factory. Cars aren't built by hand anymore — robots do the repetitive work faster and with fewer errors. DevOps applies the same thinking to software.
When something breaks, DevOps teams know about it within seconds, not days. Monitoring tools watch the system constantly and send alerts the moment something looks wrong.
Netflix actually has a famous practice where they intentionally break parts of their own system during working hours to make sure their team can fix things quickly. They call it Chaos Engineering. It sounds mad, but it means they're never caught off guard.
A DevOps engineer is the person who builds and maintains the systems that help developers work faster and more safely. They work on things like:
It's one of the most in-demand roles in tech right now, and the skills involved are exactly what this blog is here to help you build.
Whether you're a developer, a system admin, a project manager, or someone just getting into tech — DevOps matters because it is how modern software is built.
Every major tech company in the world uses DevOps practices. Banks use it to deploy new banking features. Airlines use it to update booking systems. Hospitals use it to improve patient management software. It's not just for Silicon Valley startups — it's everywhere.
Learning DevOps opens doors. And the best part is, you don't need to know everything at once. We'll take it one byte at a time.
Here's everything we covered today in plain English:
In the next article we're going to look at Linux, the operating system that powers most of the internet, and why every DevOps engineer needs to know the basics.
It's going to be short, practical, and you'll be typing your first Linux commands before the end of the article. See you there.
Found this helpful? Share it with someone who is just getting started in tech. And follow along for a new article every week.
2026-06-16 04:21:14
🛠️ Pipelines in the Wild #2
Most pipeline failures are transient — a registry returning a 503, a smoke test catching a slow cold start, a network blip during an image push. Retrying them automatically, with exponential backoff, means engineers never see them. The failures that reach a human should be the ones that actually need one. This article builds a retry wrapper and a three-tier alerting system (transient → silent, degraded → Slack warning, critical → PagerDuty page) on top of a GitHub Actions blue/green deploy workflow. The demo application is Waybill — a FastAPI shipment tracking API backed by PostgreSQL, where the health endpoint checks real database connectivity rather than returning a static 200. That distinction matters: a smoke test that only checks HTTP status is a smoke test that passes while your database is unreachable. By the end you will have a working repo you can run locally with Docker Compose and test today.
There is a specific kind of 11pm message that every engineer eventually receives.
Pipeline failed.
You open the logs. You trace the error. A Docker registry returned a 503. One HTTP request timed out during a smoke test. The deploy itself was fine — the old version is still running, nothing is broken, no user was affected. But the pipeline did not know that. It knew something returned a non-zero exit code, and it stopped.
You have just spent 25 minutes investigating a problem that lasted 3 seconds.
This is alarm fatigue. It is more dangerous than most engineers realise.
In supply chain operations, we had a name for it too. When every minor EDI (Electronic Data Interchange) hiccup generated a ticket, and every ticket required someone to manually verify whether a shipment was actually at risk, teams eventually started triaging alerts by instinct rather than data. The volume trained people to assume most alerts were noise. Which is exactly the environment in which a real failure goes unnoticed long enough to cost something.
A waybill is the document that travels with a consignment — the source of truth for what is in transit, where it is going, and whether it arrived. In logistics operations you learn quickly that not every exception needs a human. A delay at a sorting hub during peak hours is expected and self-correcting. A consignment held at customs with no reason code is not. The same distinction applies to pipelines: when everything pages, nothing gets treated as urgent, and the one failure that actually matters gets the same response time as a transient registry timeout.
The fix is not monitoring harder. It is building pipelines that distinguish between what needs a human and what they can handle themselves.
Two categories of failure. One response. That is the root cause of most pipeline alert fatigue.
Transient failures — a network blip, a rate limit, a downstream service briefly unavailable — resolve on their own within seconds. Retrying them automatically almost always succeeds. A human should never see these.
Real failures — a broken deploy, a failed health check that does not recover, a rollback that did not complete — need attention. The right person should know immediately.
Most pipelines treat both identically: fail, stop, alert. Every transient error generates the same response as a production incident. Engineers learn to ignore it — until the wolf is real.
The pattern here separates these two categories at the pipeline level. Transient failures get retried silently. Real failures get classified by severity and routed to the right channel. The engineer who wakes up at 3am wakes up for something that genuinely requires them.
Static retry in CI tools — Most CI platforms offer a basic retry mechanism, but they retry unconditionally. Three failed attempts at a genuinely broken deploy create three noisy alerts instead of one, and there is no backoff between attempts, which can worsen pressure on an already struggling downstream service.
Catch-all failure webhooks — A single if: failure() step that posts to Slack for every error is the most common pattern. It does not distinguish between a registry timeout and a failed deploy. After a week of false positives, engineers mute the channel.
No retry budget awareness — None of the standard patterns track how often a step is retrying over time. If image pushes are retrying on 40% of runs, that is not a transient problem — it is a reliability issue with the registry that needs fixing, not masking. Without tracking, the retries hide signal.
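One way to get at this signal is to count how often a step only succeeded after retrying. The Python sketch below assumes the log lines follow the `[retry] ... Succeeded on attempt N` format printed by scripts/retry.sh later in this article; `retry_rate` and the sample log are hypothetical, and how you collect the logs is left open.

```python
import re

# Matches the success line printed by scripts/retry.sh.
SUCCESS = re.compile(r"\[retry\].*Succeeded on attempt (\d+)")


def retry_rate(log_lines) -> float:
    """Fraction of successful retry-wrapped steps that needed more than one attempt."""
    total = 0
    retried = 0
    for line in log_lines:
        match = SUCCESS.search(line)
        if not match:
            continue
        total += 1
        if int(match.group(1)) > 1:
            retried += 1
    return retried / total if total else 0.0


# Hypothetical log excerpt gathered from several runs.
sample_log = [
    "[retry] ✅ Succeeded on attempt 1",
    "[retry] ✅ Succeeded on attempt 3",
    "[retry] ✅ Succeeded on attempt 1",
    "[retry] ✅ Succeeded on attempt 2",
]
print(f"retry rate: {retry_rate(sample_log):.0%}")  # retry rate: 50%
```

A rate that stays high across many runs points at a dependency that needs fixing, not at a transient blip.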
The diagram makes two design decisions visible. First, the retry loop sits entirely within the GitHub Actions runner boundary — the untrusted execution environment. Retries are handled before any external system (Slack, PagerDuty) is ever contacted. Second, the classifier is the trust boundary between the runner and the alerting layer: it decides what crosses that boundary, and the default is always to alert rather than to silently discard.
This workflow builds directly on the blue/green slot pattern from Article 01 — Zero-Downtime Deployments on a Single Server. If the slot file and nginx swap are new concepts, read that one first.
The three-tier split:
| Tier | Trigger | Response | Examples |
|---|---|---|---|
| TRANSIENT | Known flaky patterns | Silent — no notification | Registry 503, rate limit, connection timeout |
| DEGRADED | Recoverable failure | Slack warning | Smoke test failed, health check degraded |
| CRITICAL | Deploy or rollback failed | Slack + PagerDuty page | Deploy failed, rollback required |
Unknown error patterns always default to DEGRADED. Silence is never the default.
The demo application is Waybill — a FastAPI shipment tracking API backed by PostgreSQL. It exposes endpoints to create shipments, append tracking events as a consignment moves through the network, and query status by waybill number. The /health endpoint returns the deployment slot (blue or green), the app version, and the live database connection state. A 503 response means the database is unreachable — which is a real failure worth alerting on, not a transient network blip to retry silently. That distinction is what makes the smoke tests in this pipeline meaningful rather than cosmetic.
To run it locally before connecting a real server:
cp .env.example .env # set POSTGRES_PASSWORD
IMAGE_NAME=waybill BLUE_TAG=local GREEN_TAG=local \
docker compose up --build
curl http://localhost:7070/health # blue slot
curl http://localhost:9091/health # green slot
open http://localhost:7070/docs # OpenAPI explorer
Ports 7070 and 9091 are used deliberately — 8080 and 8081 conflict with common local tooling on Mac dev setups. Both are configurable via BLUE_PORT and GREEN_PORT environment variables if needed.
For the full pipeline deployment you also need:
A deploy user on the server with SSH key authentication and restricted sudo for nginx reload and the slot file write (see scripts/bootstrap-server.sh in the repo)
GitHub Actions secrets: SERVER_IP, SSH_PRIVATE_KEY, POSTGRES_PASSWORD, SLACK_WEBHOOK_URL, PAGERDUTY_ROUTING_KEY
All commands below are validated against GitHub Actions ubuntu-latest (ubuntu-24.04), Docker Compose v2, and nginx 1.24.
scripts/retry.sh is a bash function that runs any command up to N times with exponential backoff and jitter. Source it in any step or composite action.
#!/usr/bin/env bash
# scripts/retry.sh
# Usage: source scripts/retry.sh
# retry <max_attempts> <initial_delay_seconds> <command...>
retry() {
local max_attempts=$1
local delay=$2
shift 2
local cmd=("$@")
local attempt=1
while [ $attempt -le $max_attempts ]; do
echo "[retry] Attempt $attempt/$max_attempts: ${cmd[*]}"
if "${cmd[@]}"; then
echo "[retry] ✅ Succeeded on attempt $attempt"
return 0
fi
if [ $attempt -lt $max_attempts ]; then
# Exponential backoff with small random jitter (about -10% of delay up to +10% plus 1s), floor 1s, cap 60s
local raw_jitter=$(( RANDOM % (delay / 5 + 2) - delay / 10 ))
local wait=$(( delay + raw_jitter ))
wait=$(( wait < 1 ? 1 : wait ))
wait=$(( wait > 60 ? 60 : wait ))
echo "[retry] ⏳ Waiting ${wait}s before retry (attempt $((attempt+1)))..."
sleep "$wait"
delay=$(( delay * 2 > 60 ? 60 : delay * 2 ))
fi
attempt=$(( attempt + 1 ))
done
echo "[retry] ❌ All $max_attempts attempts failed: ${cmd[*]}"
return 1
}
The jitter prevents thundering herd: if multiple pipeline runs fail simultaneously and retry at exactly the same interval, they can hammer a struggling downstream service together. Random jitter distributes the load across the retry window.
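To see the wait schedule this kind of backoff produces, here is a minimal Python sketch of the same idea. It is not a port of retry.sh: it uses floating-point arithmetic and a symmetric jitter fraction, and `backoff_schedule` with its parameters is an assumption made for this example.

```python
import random


def backoff_schedule(initial_delay, max_attempts, cap=60,
                     jitter_fraction=0.2, rng=None):
    """Return the waits (seconds) between attempts: doubling, jittered, capped."""
    rng = rng or random.Random()
    delay = initial_delay
    waits = []
    # There is no wait after the final attempt, hence max_attempts - 1.
    for _ in range(max_attempts - 1):
        jitter = rng.uniform(-jitter_fraction, jitter_fraction) * delay
        waits.append(min(cap, max(1.0, delay + jitter)))
        delay = min(cap, delay * 2)
    return waits


# Without jitter the schedule is deterministic: 10s, 20s, 40s.
print(backoff_schedule(10, 4, jitter_fraction=0))

# With jitter, two runs that fail together no longer retry in lockstep.
print(backoff_schedule(10, 4, rng=random.Random(1)))
print(backoff_schedule(10, 4, rng=random.Random(2)))
```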
Wrap the retry call as a GitHub Actions composite action so any workflow can use it with two lines, without copy-pasting the source path.
# .github/actions/retry-step/action.yml
name: Retry Step
description: Run a shell command with exponential backoff retry
inputs:
command:
description: Shell command to execute (passed to bash -c)
required: true
max_attempts:
description: Maximum number of attempts including the first try
default: "3"
initial_delay:
description: Initial wait between retries in seconds
default: "5"
runs:
using: composite
steps:
- name: Run with retry
shell: bash
run: |
source "$GITHUB_WORKSPACE/scripts/retry.sh"
retry "${{ inputs.max_attempts }}" \
"${{ inputs.initial_delay }}" \
bash -c "${{ inputs.command }}"
$GITHUB_WORKSPACE resolves to the repo root regardless of where the action file lives in the directory tree. A relative path like ../../scripts/retry.sh breaks silently if the action is ever moved.
The retry function is sourced and called in the same bash shell, so the function itself crosses no subprocess boundary; only the wrapped command runs in a child bash -c process. The shell: bash declaration on the step ensures the bash-specific features retry.sh relies on (local, arrays, and $RANDOM) work correctly. Do not change this to sh.
Using it in a workflow:
- name: Push image (with retry)
uses: ./.github/actions/retry-step
with:
command: docker push $IMAGE_NAME:${{ github.sha }}
max_attempts: "4"
initial_delay: "10"
- name: Smoke tests (with retry)
uses: ./.github/actions/retry-step
with:
command: bash scripts/smoke-test.sh ${{ secrets.SERVER_IP }} ${{ steps.slot.outputs.target }}
max_attempts: "3"
initial_delay: "8"
Image pushes and smoke tests are the two steps most affected by transient failures — registry availability and network latency respectively. Retrying them is not masking a problem. It is acknowledging the reality of distributed systems.
The smoke test is meaningful here because the Waybill /health endpoint does real work: it checks live PostgreSQL connectivity and returns the active slot name. A 503 means the database is unreachable. A wrong slot name means traffic is pointing at the wrong container. A smoke test that only checks for HTTP 200 would pass in both of those failure states.
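The difference between a status-only check and a meaningful one can be sketched as a small Python function. scripts/smoke-test.sh is not shown in this article, so this is not its implementation; the `slot` field name in the JSON body and the `evaluate_health` helper are assumptions made for illustration.

```python
import json


def evaluate_health(status_code: int, body: str, expected_slot: str) -> tuple[bool, str]:
    """Decide whether a /health response proves the right slot is healthy."""
    # A 503 from Waybill means the database is unreachable.
    if status_code != 200:
        return False, f"HTTP {status_code}"
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False, "response is not JSON"
    if not isinstance(data, dict):
        return False, "response is not a JSON object"
    # Assumed field name: the article says /health returns the slot,
    # but does not show the exact key.
    slot = data.get("slot")
    if slot != expected_slot:
        return False, f"wrong slot: expected {expected_slot}, got {slot}"
    return True, "ok"


print(evaluate_health(200, '{"slot": "green"}', "green"))
print(evaluate_health(200, '{"slot": "blue"}', "green"))
print(evaluate_health(503, "", "green"))
```

A check that stops at the status code would accept the second case, where traffic would be pointed at the wrong container.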
scripts/alert.py classifies the error and routes it. It uses only Python stdlib — no pip install in the failure path. Installing a dependency at the moment you need to report a failure is fragile: if PyPI is unreachable (which can happen during exactly the kind of network incidents that also cause pipeline failures), the alert step silently fails.
#!/usr/bin/env python3
"""
alert.py — tiered pipeline alerting
Severity tiers:
TRANSIENT → silent discard (no notification)
DEGRADED → Slack warning (Block Kit)
CRITICAL → Slack + PagerDuty page
Required environment variables (set as GitHub Actions secrets):
SLACK_WEBHOOK_URL — Slack incoming webhook URL
PAGERDUTY_ROUTING_KEY — Events API v2 key, scoped to this service only
Usage:
python3 scripts/alert.py "error message string"
"""
import os
import sys
import json
import urllib.request
import urllib.error
from enum import Enum
from datetime import datetime, timezone
class Severity(Enum):
TRANSIENT = "transient"
DEGRADED = "degraded"
CRITICAL = "critical"
# Keep TRANSIENT patterns as specific as possible.
# Broad patterns risk silencing a real failure whose error message
# happens to contain a transient-sounding substring.
ERROR_PATTERNS: dict[Severity, list[str]] = {
Severity.TRANSIENT: [
"registry connection timeout",
"registry unavailable",
"registry rate limit",
"registry 503",
"registry 502",
"i/o timeout",
"connection refused to registry",
"429 too many requests",
],
Severity.DEGRADED: [
"smoke test failed",
"slow response",
"health check degraded",
"non-zero exit code",
],
Severity.CRITICAL: [
"deploy failed",
"rollback required",
"production down",
"slot swap failed",
"health check failed",
"container crashed",
],
}
def classify(error_msg: str) -> Severity:
msg = error_msg.lower()
for severity, patterns in ERROR_PATTERNS.items():
if any(p in msg for p in patterns):
return severity
# Unknown patterns default to DEGRADED — never silenced.
return Severity.DEGRADED
def _post(url: str, payload: dict, timeout: int = 10) -> None:
data = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
url, data=data,
headers={"Content-Type": "application/json"},
method="POST",
)
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
if resp.status not in (200, 201, 202):
print(f"[alert] Unexpected HTTP {resp.status}", file=sys.stderr)
except urllib.error.URLError as exc:
print(f"[alert] POST failed: {exc}", file=sys.stderr)  # never log the URL: the Slack webhook URL is a secret
def send_slack(message: str, severity: Severity) -> None:
webhook = os.environ.get("SLACK_WEBHOOK_URL")
if not webhook:
print("[alert] SLACK_WEBHOOK_URL not set — skipping Slack", file=sys.stderr)
return
repo = os.getenv("GITHUB_REPOSITORY", "unknown/repo")
branch = os.getenv("GITHUB_REF_NAME", "unknown")
run_id = os.getenv("GITHUB_RUN_ID", "0")
ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
icons = {Severity.DEGRADED: "🟡", Severity.CRITICAL: "🔴"}
icon = icons.get(severity, "⚪")
run_url = f"https://github.com/{repo}/actions/runs/{run_id}"
payload = {
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": f"{icon} [{severity.value.upper()}] Pipeline Alert",
},
},
{
"type": "section",
"text": {"type": "mrkdwn", "text": f"*{message}*"},
"fields": [
{"type": "mrkdwn", "text": f"*Branch*\n{branch}"},
{"type": "mrkdwn", "text": f"*Run*\n<{run_url}|{run_id}>"},
{"type": "mrkdwn", "text": f"*Repo*\n{repo}"},
{"type": "mrkdwn", "text": f"*Time*\n{ts}"},
],
},
{"type": "divider"},
]
}
_post(webhook, payload)
def send_pagerduty(message: str) -> None:
key = os.environ.get("PAGERDUTY_ROUTING_KEY")
if not key:
print("[alert] PAGERDUTY_ROUTING_KEY not set — skipping PagerDuty", file=sys.stderr)
return
repo = os.getenv("GITHUB_REPOSITORY", "unknown/repo")
run_id = os.getenv("GITHUB_RUN_ID", "0")
# dedup_key groups all alerts from the same run into one incident.
# Without it, a flapping pipeline opens a new incident on every failure.
dedup_key = f"{repo}/run/{run_id}"
payload = {
"routing_key": key,
"event_action": "trigger",
"dedup_key": dedup_key,
"payload": {
"summary": message,
"severity": "critical",
"source": "github-actions",
"custom_details": {
"repository": repo,
"run_id": run_id,
"sha": os.getenv("GITHUB_SHA"),
},
},
}
_post("https://events.pagerduty.com/v2/enqueue", payload)
def alert(error_msg: str) -> None:
severity = classify(error_msg)
if severity == Severity.TRANSIENT:
print("[alert] Transient pattern matched — no notification sent")
return
send_slack(error_msg, severity)
if severity == Severity.CRITICAL:
send_pagerduty(error_msg)
print("[alert] 🚨 Critical — Slack + PagerDuty triggered")
else:
print("[alert] ⚠️ Degraded — Slack warning sent")
if __name__ == "__main__":
msg = sys.argv[1] if len(sys.argv) > 1 else "Unknown pipeline failure"
alert(msg)
The Slack payload uses Block Kit (Slack's component-based message format, built with the blocks array) rather than the legacy Attachments API. The PagerDuty payload includes a dedup_key composed of the repository name and run ID. Without it, a flapping pipeline opens a new incident on every failure. With it, all alerts from the same run are grouped into one incident, and a later resolve event carrying the same dedup_key closes that incident. The script as shown only sends trigger events, so an incident stays open until someone resolves it or a resolve event is sent.
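alert.py only sends trigger events, so the resolve event mentioned above would need its own payload. Here is a minimal Python sketch of what that payload might look like for the PagerDuty Events API v2, reusing the same dedup_key format as send_pagerduty. The `build_resolve_event` helper and the sample values are hypothetical, and sending it (for example through the `_post` helper) is left out.

```python
def build_resolve_event(routing_key: str, repo: str, run_id: str) -> dict:
    """Build an Events API v2 resolve payload for the incident opened by a run."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        # Must match the dedup_key used when the incident was triggered.
        "dedup_key": f"{repo}/run/{run_id}",
    }


# Placeholder values for illustration only.
event = build_resolve_event("example-routing-key", "octo-org/waybill", "12345")
print(event["event_action"], event["dedup_key"])
```

Because the key includes the run ID, the resolve event has to come from the same run, or from something that knows which run opened the incident.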
The complete deploy.yml, with retry wrappers on the flaky steps, a slot guard on the rollback, and verified container state before declaring rollback complete.
# .github/workflows/deploy.yml
name: Self-Healing Deploy
on:
push:
branches: [main]
# Required for GHCR (GitHub Container Registry) push. Organisations with
# restrictive default token permissions must grant these explicitly;
# without them the image push returns 403 even with a valid GITHUB_TOKEN.
permissions:
contents: read
packages: write
env:
IMAGE_NAME: ghcr.io/${{ github.repository }}
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Log in to GHCR
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build image
run: docker build -t $IMAGE_NAME:${{ github.sha }} .
# Registry pushes are the most common transient failure source
- name: Push image (with retry)
uses: ./.github/actions/retry-step
with:
command: docker push $IMAGE_NAME:${{ github.sha }}
max_attempts: "4"
initial_delay: "10"
- name: Detect active slot
id: slot
run: |
ACTIVE=$(ssh deploy@${{ secrets.SERVER_IP }} \
"cat /etc/deploy/active-slot 2>/dev/null || echo blue")
echo "active=$ACTIVE" >> $GITHUB_OUTPUT
if [ "$ACTIVE" = "blue" ]; then
echo "target=green" >> $GITHUB_OUTPUT
else
echo "target=blue" >> $GITHUB_OUTPUT
fi
- name: Deploy to inactive slot
run: |
TARGET=${{ steps.slot.outputs.target }}
ssh deploy@${{ secrets.SERVER_IP }} << EOF
export IMAGE_NAME=$IMAGE_NAME
export ${TARGET^^}_TAG=${{ github.sha }}
docker compose pull waybill-$TARGET
docker compose up -d --no-deps waybill-$TARGET
EOF
# Smoke tests run over a network — give them room for cold starts
- name: Smoke tests (with retry)
uses: ./.github/actions/retry-step
with:
command: >
bash scripts/smoke-test.sh
${{ secrets.SERVER_IP }}
${{ steps.slot.outputs.target }}
max_attempts: "3"
initial_delay: "8"
- name: Swap traffic to new slot
run: |
bash scripts/swap-traffic.sh \
${{ secrets.SERVER_IP }} \
${{ steps.slot.outputs.target }}
# ── Failure path ──────────────────────────────────────────────────────────
# Alert first — on-call needs context before rollback begins
- name: Classify and alert on failure
if: failure()
run: |
python3 scripts/alert.py \
"deploy failed on ${{ github.ref_name }} — run ${{ github.run_id }}"
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
PAGERDUTY_ROUTING_KEY: ${{ secrets.PAGERDUTY_ROUTING_KEY }}
- name: Rollback on failure
if: failure()
run: |
TARGET="${{ steps.slot.outputs.target }}"
# Guard: if slot detection failed earlier, TARGET is empty
if [ -z "$TARGET" ]; then
echo "::error::Slot detection failed — manual rollback required"
exit 1
fi
ssh deploy@${{ secrets.SERVER_IP }} bash << EOF
set -euo pipefail
docker compose stop --timeout 30 waybill-$TARGET
# Verify the container actually stopped.
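# docker compose ps --format json outputs a JSON array in older Compose v2
# releases and newline-delimited JSON (one object per line) from v2.21. Parse both safely.
# $TARGET is left unescaped so it expands locally; it is not set on the server and set -u would abort.
STATUS=\$(docker compose ps waybill-$TARGET --format json \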
| python3 -c "
import sys, json
raw = sys.stdin.read().strip()
try:
d = json.loads(raw)
obj = d[0] if isinstance(d, list) else d
print(obj.get('State', 'unknown'))
except Exception:
print('unknown')
" 2>/dev/null || echo "unknown")
echo "Container state after stop: \$STATUS"
if [ "\$STATUS" = "running" ]; then
echo "::error::Container did not stop — manual intervention required"
exit 1
fi
EOF
echo "Active slot unchanged. Rollback complete."
The alert step runs before the rollback step. The person who responds to a PagerDuty page needs to know what failed before they start diagnosing whether the rollback worked. Order matters here.
The empty-slot guard protects against a specific failure mode: if the "Detect active slot" step never ran (because the build or push failed first), steps.slot.outputs.target is an empty string. Without the guard, docker compose stop waybill- either silently fails or stops the wrong container.
SSH key scope. The deploy user's SSH key has access to the server. Restrict it to specific commands via authorized_keys command= restrictions, or scope what the deploy user can run via sudoers. The bootstrap-server.sh script in the repo sets this up: the deploy user can write the slot file and reload nginx, nothing else. A compromised runner should not have broad filesystem access to the deploy server.
PagerDuty routing key. This key can trigger incidents against any service configured under it. Use a key scoped to this pipeline only. Rotate it on any suspected exposure. Treat it with the same care as a production database password — it is a denial-of-sleep vector if leaked.
Secrets in environment variables. SLACK_WEBHOOK_URL and PAGERDUTY_ROUTING_KEY are passed as environment variables to the alert step. GitHub Actions masks known secret values in logs, but partial matches or URL-encoded variants may not be caught. Never echo or log these values inside alert.py or any script the failure step calls.
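Alert classification is a moving target. The ERROR_PATTERNS dict is not a security control; it is operational configuration. Its default behaviour (unknown errors → DEGRADED, never TRANSIENT) means an unrecognised error message is never silently suppressed. That guarantee covers unknown strings only: classify() checks the TRANSIENT patterns first, so a message containing a TRANSIENT substring is discarded even when it also contains a CRITICAL one. Keep the TRANSIENT list narrow, and re-verify this behaviour whenever you extend it.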
GITHUB_TOKEN permissions. The workflow sets permissions: contents: read, packages: write explicitly. Organisations with restrictive default token permissions should audit this before deploying — granting packages: write at the workflow level is appropriate here, but teams using more granular job-level permission scoping should move the block to the deploy job instead.
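On the classification point above: classify() in alert.py walks ERROR_PATTERNS in insertion order, which puts TRANSIENT first. One possible hardening, sketched here in Python, is to check severities from most to least severe so that a critical pattern always wins. This is a variant for illustration, not the repo's implementation; `PRECEDENCE`, `classify_highest`, and the trimmed pattern lists are assumptions.

```python
from enum import Enum


class Severity(Enum):
    TRANSIENT = "transient"
    DEGRADED = "degraded"
    CRITICAL = "critical"


# Trimmed pattern lists, copied from alert.py for illustration.
PATTERNS: dict[Severity, list[str]] = {
    Severity.TRANSIENT: ["registry 503", "i/o timeout"],
    Severity.DEGRADED: ["smoke test failed"],
    Severity.CRITICAL: ["deploy failed", "rollback required"],
}

# Most severe first, so a message matching several tiers escalates.
PRECEDENCE = [Severity.CRITICAL, Severity.DEGRADED, Severity.TRANSIENT]


def classify_highest(error_msg: str) -> Severity:
    msg = error_msg.lower()
    for severity in PRECEDENCE:
        if any(pattern in msg for pattern in PATTERNS[severity]):
            return severity
    # Unknown messages still default to DEGRADED, never silence.
    return Severity.DEGRADED


print(classify_highest("deploy failed: registry 503"))  # Severity.CRITICAL
print(classify_highest("registry 503"))                 # Severity.TRANSIENT
```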
What you gain / what you give up
Retry logic reduces alert noise at the cost of masking underlying reliability issues. If your registry is returning 503s on 30% of pushes, retry with backoff means your pipeline succeeds and nobody investigates the registry. You need to monitor retry rates, not just retry outcomes. The scaffold repo includes a commented section in README.md on how to surface this via GitHub Actions workflow telemetry.
Three-tier alerting requires ongoing maintenance. The ERROR_PATTERNS dictionary reflects your pipeline's failure modes at the time you wrote it. New integrations, new infrastructure, and new failure modes will produce strings that do not match any pattern and land in DEGRADED. Review the patterns monthly for the first three months. After that, review any time a new step is added to the pipeline.
The stdlib-only approach in alert.py avoids the fragile pip install in the failure path, but it means the HTTP layer is less configurable. The urllib implementation has no connection pooling, no automatic retry, and no response decoding beyond status code. For a notification script in a CI failure step, that is the right tradeoff. For anything more complex, use a dedicated alerting service the pipeline calls externally.
Blue/green with slot files is simple and observable — you can cat /etc/deploy/active-slot on the server at any time. It is also manual. If the server is unreachable, the slot file is stale, and your pipeline's rollback logic does not know the real state. For environments where the deploy server could itself be a failure point, consider moving slot state to a registry or a distributed key-value store.
Tune the alert patterns from day one. I have treated ERROR_PATTERNS as infrastructure — something you define once and leave. It is not. It is a codebase. The patterns that matter are the ones your specific pipeline produces under your specific failure conditions. Starting with a broad TRANSIENT list and narrowing it based on observation is better than starting narrow and widening it reactively.
Add retry rate tracking early. The retry wrapper succeeds silently. That is by design. But if you are not tracking how often each step retries, you lose the signal that distinguishes a genuinely transient failure from a degrading dependency. A simple counter written to a metrics endpoint or even a structured log line is enough to surface this.
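As a rough illustration of the structured-log-line option, here is a minimal sketch. It is written in JavaScript purely for illustration; the pipeline's own retry wrapper is not shown in this section, and the function names (`retryRecord`, `retryRates`) and log fields (`event`, `step`, `retries`, `succeeded`) are hypothetical, not taken from the scaffold repo. The same shape should carry over to a shell or Python wrapper.

```javascript
// Hypothetical helpers: names and log fields are illustrative assumptions.

// Build one structured log line per step run, recording how many retries it needed.
function retryRecord(step, attemptsUsed, succeeded) {
  return JSON.stringify({
    event: "step_retry_summary",
    step,
    retries: Math.max(0, attemptsUsed - 1), // the first attempt is not a retry
    succeeded,
  });
}

// Aggregate log lines into a per-step retry rate:
// the fraction of runs that needed at least one retry.
function retryRates(lines) {
  const stats = new Map(); // step -> { runs, retried }
  for (const line of lines) {
    let rec;
    try {
      rec = JSON.parse(line);
    } catch {
      continue; // skip anything that is not a JSON log line
    }
    if (!rec || rec.event !== "step_retry_summary") continue;
    const s = stats.get(rec.step) ?? { runs: 0, retried: 0 };
    s.runs += 1;
    if (rec.retries > 0) s.retried += 1;
    stats.set(rec.step, s);
  }
  const rates = {};
  for (const [step, s] of stats) rates[step] = s.retried / s.runs;
  return rates;
}
```

A step whose rate creeps upward over several weeks is the "degrading dependency" signal, even though every individual run still ends green.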
Test the rollback path before the first production deploy. The rollback step in the workflow is only as reliable as you have tested it. Break a deploy deliberately in a staging environment, verify the rollback fires, verify the correct container stops, verify the slot file is unchanged. The one time you need it is not the time to discover it has a bug.
pipelineandprompts-labs/pipelines-in-the-wild/02-retry-logic-tiered-alerting
The repo contains the Waybill API, a FastAPI shipment tracking application backed by PostgreSQL. Shipments are created with a waybill number, and tracking events are appended as the consignment moves through the network. The /health endpoint checks live database connectivity and reports the active deployment slot, which makes it a real integration test rather than a TCP ping. The blue and green slots run on separate ports (7070 and 9091) and share a single Postgres instance, the same topology the pipeline manages.
The repo also includes a scaffold script that prints the exact gh secret set commands for your environment and a quick-start guide for local dev and alerting tests:
./scaffold-self-healing-pipeline.sh waybill 10.0.0.42
To test the alerting locally before connecting real secrets:
# TRANSIENT — silent
python3 scripts/alert.py "registry connection timeout on push"
# DEGRADED — Slack warning (set SLACK_WEBHOOK_URL first)
SLACK_WEBHOOK_URL=https://hooks.slack.com/... \
python3 scripts/alert.py "smoke test failed on main"
# CRITICAL — Slack + PagerDuty
SLACK_WEBHOOK_URL=https://hooks.slack.com/... \
PAGERDUTY_ROUTING_KEY=your-key \
python3 scripts/alert.py "deploy failed on main"
Article 03 covers secrets management across multi-cloud environments — storing, rotating, and injecting credentials into GitHub Actions without hardcoding them and without creating a single point of failure in how your pipeline authenticates.
More from the series: Pipelines in the Wild
Written by Pipeline & Prompts | pipelineandprompts.dev
All working code: github.com/pipelineandprompts-labs/pipelines-in-the-wild/02-retry-logic-tiered-alerting
2026-06-16 04:16:52
Picture this: I’m sitting in a cramped interview room, the whiteboard glaring back at me like the Eye of Sauron. The interviewer drops the classic “Longest Substring Without Repeating Characters” problem and gives me five minutes to solve it. My heart starts doing the Imperial March drum‑roll. I fumble through a brute‑force solution—nested loops, checking every possible substring, resetting a set each time. It works on the tiny examples, but the moment the input grows past a dozen characters I can feel the time ticking away like the countdown in Mission: Impossible. I finish with a solution that’s O(n²) and watch the interviewer’s eyebrows raise just enough to signal “nice try, but…”.
That moment stuck with me. I realized I wasn’t lacking knowledge; I was missing a mental framework that lets me spot the hidden pattern instantly—like Neo seeing the green code rain in The Matrix. If I could train my brain to jump straight to that insight, I’d shave minutes off every problem, not just in interviews but in everyday debugging marathons. So I went on a quest to find that framework, and what I discovered felt like uncovering a lightsaber in a junkyard.
The breakthrough came when I stopped thinking about “checking every substring” and started asking: What do I need to know right now to decide whether I can extend the current window?
In the longest‑substring‑without‑repeating‑characters problem, the only thing that matters is the most recent index of each character we’ve seen. If we encounter a character that’s already inside our current window, we don’t need to scrap everything and start over; we slide the left side of the window to just past its previous occurrence.
That’s the aha! moment: maintain a sliding window bounded by two pointers (left and right) and a hash map that stores the last index where each character appeared. As the right pointer moves forward, we can instantly decide whether the window is still valid, and if not, we jump the left pointer to max(left, lastSeen[char] + 1). No rescanning, no resetting sets—just constant‑time updates.
It felt like discovering the Force: a simple rule that lets you sense disturbances in the string and react instantly. Once I internalized that pattern, the problem stopped being a beast and became a dance.
Here’s what my first attempt looked like—naïve, O(n²), and painful to watch under pressure:
function lengthOfLongestSubstring(s) {
  let maxLen = 0;
  for (let i = 0; i < s.length; i++) {
    const seen = new Set();
    for (let j = i; j < s.length; j++) {
      if (seen.has(s[j])) break; // duplicate found, stop inner loop
      seen.add(s[j]);
      maxLen = Math.max(maxLen, j - i + 1);
    }
  }
  return maxLen;
}
What’s wrong?
It rebuilds the Set for every i, wasting work we already did.
Now the same problem, armed with the sliding‑window insight:
function lengthOfLongestSubstring(s) {
  const lastIndex = new Map(); // char -> last position
  let maxLen = 0;
  let left = 0; // start of the current window
  for (let right = 0; right < s.length; right++) {
    const ch = s[right];
    // If ch was seen inside the current window, move left just after its previous spot
    if (lastIndex.has(ch) && lastIndex.get(ch) >= left) {
      left = lastIndex.get(ch) + 1;
    }
    // Update the last seen position for ch
    lastIndex.set(ch, right);
    // Window size is right - left + 1
    maxLen = Math.max(maxLen, right - left + 1);
  }
  return maxLen;
}
Why this feels like a spell:
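Each character is visited once by the right pointer (O(n) time).
Each Map lookup and update takes constant time (O(1) amortized).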
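If you want more than my word that the two versions above agree, a quick cross-check helps. The sketch below copies both implementations under new names (`bruteForce` and `slidingWindow`, my own labels) and compares them on repeatable pseudo-random strings. The generator, the seed, the four-letter alphabet, and the trial count are all arbitrary choices for illustration.

```javascript
// Quadratic reference implementation (same logic as the first attempt).
function bruteForce(s) {
  let maxLen = 0;
  for (let i = 0; i < s.length; i++) {
    const seen = new Set();
    for (let j = i; j < s.length && !seen.has(s[j]); j++) {
      seen.add(s[j]);
      maxLen = Math.max(maxLen, j - i + 1);
    }
  }
  return maxLen;
}

// Linear sliding-window implementation (same logic as the second version).
function slidingWindow(s) {
  const lastIndex = new Map(); // char -> last position
  let maxLen = 0;
  let left = 0;
  for (let right = 0; right < s.length; right++) {
    const ch = s[right];
    if (lastIndex.has(ch) && lastIndex.get(ch) >= left) {
      left = lastIndex.get(ch) + 1;
    }
    lastIndex.set(ch, right);
    maxLen = Math.max(maxLen, right - left + 1);
  }
  return maxLen;
}

// Small linear congruential generator so every run uses the same inputs.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 2 ** 32; // value in [0, 1)
  };
}

// Build a string from a tiny alphabet so repeats are frequent.
function randomString(rng, length, alphabet = "abcd") {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += alphabet[Math.floor(rng() * alphabet.length)];
  }
  return out;
}

// Returns the first input where the two versions disagree, or null if none is found.
function crossCheck(trials = 500) {
  const rng = makeRng(42);
  for (let t = 0; t < trials; t++) {
    const s = randomString(rng, Math.floor(rng() * 20));
    if (bruteForce(s) !== slidingWindow(s)) return s;
  }
  return null;
}

console.log(crossCheck()); // null means no disagreement was found
```

A small alphabet is deliberate: with only four letters, nearly every string contains repeats both inside and outside the window, which is exactly where a sliding-window bug would hide.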
Skipping the >= left check – If you move left every time you see a repeat, a character whose last occurrence lies before the current window can drag left backward and re-admit a duplicate. Example: "abba". Without the check, the final a (last seen at index 0) resets left to 1, so the window becomes "bba" and the function reports 3 instead of the correct 2.
Not updating lastIndex after moving left – The map must always hold the most recent index; otherwise a later repeat is compared against a stale position and left lands in the wrong place.
Avoid those, and the algorithm flows like a lightsaber through butter.
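To see the first pitfall in action, here is a small sketch with a deliberately broken variant next to a fixed one. The names `buggyLongestSubstring` and `fixedLongestSubstring` are mine, and the fixed version uses the equivalent `Math.max(left, ...)` form rather than the explicit `>= left` check.

```javascript
// Deliberately buggy: moves left on every repeat, even when the
// previous occurrence is already outside the current window.
function buggyLongestSubstring(s) {
  const lastIndex = new Map();
  let maxLen = 0;
  let left = 0;
  for (let right = 0; right < s.length; right++) {
    const ch = s[right];
    if (lastIndex.has(ch)) {
      left = lastIndex.get(ch) + 1; // BUG: left can move backward here
    }
    lastIndex.set(ch, right);
    maxLen = Math.max(maxLen, right - left + 1);
  }
  return maxLen;
}

// Fixed: left is only ever allowed to move forward.
function fixedLongestSubstring(s) {
  const lastIndex = new Map();
  let maxLen = 0;
  let left = 0;
  for (let right = 0; right < s.length; right++) {
    const ch = s[right];
    if (lastIndex.has(ch)) {
      left = Math.max(left, lastIndex.get(ch) + 1);
    }
    lastIndex.set(ch, right);
    maxLen = Math.max(maxLen, right - left + 1);
  }
  return maxLen;
}

console.log(buggyLongestSubstring("abba")); // 3 (wrong: "bba" contains a repeat)
console.log(fixedLongestSubstring("abba")); // 2 ("ab" or "ba")
```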
Mastering this sliding‑window mindset does more than solve one LeetCode problem. It trains you to ask the right question: “What minimal state do I need to keep to decide if I can extend my current solution?” That question keeps coming back in other problems.
When you internalize the framework, you stop staring at a blank screen hoping for inspiration and start constructing the solution piece by piece, much like building a Lego set with the instruction manual in hand. The pressure still exists, but now you have a reliable toolset that turns panic into pattern recognition.
I’ve seen my interview times drop from “barely finishing” to “finished with a minute to spare”. I’ve seen production bugs get tackled faster because I could isolate the offending segment with a sliding‑window mindset instead of tearing through logs line by line. It’s a superpower that compounds every time you code.
Grab a timer, pick a problem that usually makes you sweat (e.g., “Minimum Size Subarray Sum”, “Longest Repeating Character Replacement”, or even “Find All Anagrams in a String”), and give yourself five minutes. Before you dive into code, spend sixty seconds asking: What tiny piece of information do I need to keep track of to know if I can keep going? Write that down, then implement the sliding‑window solution.
When the timer dings, compare your solution to the brute‑force version you’d normally write. Notice the difference in speed, clarity, and confidence.
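If you want something to compare against afterwards, here is one possible sketch for “Minimum Size Subarray Sum”, assuming the usual statement: positive integers, and you return the length of the shortest contiguous subarray whose sum is at least target, or 0 if none exists. The tiny piece of state here is just the running sum of the current window. The function name `minSubArrayLen` is my own choice.

```javascript
// Shortest contiguous subarray with sum >= target, or 0 if there is none.
// Assumes every number in nums is positive, so shrinking the window
// always lowers the sum and growing it always raises the sum.
function minSubArrayLen(target, nums) {
  let left = 0;
  let windowSum = 0; // the only state we need: sum of nums[left..right]
  let best = Infinity;
  for (let right = 0; right < nums.length; right++) {
    windowSum += nums[right];
    // While the window is valid, record it and try to shrink from the left.
    while (windowSum >= target) {
      best = Math.min(best, right - left + 1);
      windowSum -= nums[left];
      left++;
    }
  }
  return best === Infinity ? 0 : best;
}

console.log(minSubArrayLen(7, [2, 3, 1, 2, 4, 3])); // 2 (the subarray [4, 3])
```

Each index enters the window once and leaves it at most once, so the nested while loop still adds up to linear work overall.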
Now go forth—may the sliding window be with you, and may your next bug feel like a boss you’ve already defeated! 🚀