
🗄️ Data & Infrastructure · Intermediate · Week 6

GitHub Pull Request & CI/CD Pipeline

Git internals, check suites, and the webhook fanout that powers DevOps

GitHub · GitLab · Bitbucket

Key Insight

Git is a content-addressable filesystem: every object is identified by its SHA-1 hash, which makes deduplication and integrity verification trivial.

How It Works

① Developer opens PR

② GitHub fires webhook, Actions workflow YAML evaluated

③ Available runner picks up job, code checked out

④ Build and tests run in container

⑤ Status check posted back to PR, required checks pass

⑥ PR enters merge queue, squash-merged to main, deploy workflow triggered

⚠The Problem

When a developer pushes a commit to GitHub, dozens of downstream systems must react within seconds: CI services must start builds, status checks must update, review tools must analyze the diff, and merge queues must re-evaluate readiness. GitHub hosts 200M+ repositories — the webhook fan-out from a single push event can trigger hundreds of HTTP callbacks to external services. Git's content-addressable storage model must efficiently handle repositories ranging from tiny config repos to monorepos with millions of files and decades of history.

✓The Solution

GitHub's architecture separates the Git storage layer (content-addressable SHA-1 blobs/trees/commits with efficient delta compression in packfiles) from the application layer (webhook fan-out, check suites API, merge queues). When a push event occurs, GitHub's event bus triggers parallel webhook delivery to all registered CI/CD applications. Check Suites aggregate test results from multiple CI providers into a unified status. Merge queues serialize tested merges to prevent the 'broken main' problem caused by concurrent merges of independently-tested PRs.

📊Scale at a Glance

200M+ repositories

100M+ developers

Billions of Git operations per day

Billions of webhook deliveries per day

🔬Deep Dive

Git Internals — Content-Addressable Storage

Git stores every file as a blob, every directory as a tree, and every snapshot as a commit — all addressed by their SHA-1 hash. Two files with identical content share a single blob object regardless of filename or location. Trees are lists of (name, mode, SHA-1) entries pointing to blobs and sub-trees. A commit points to a tree (the full snapshot) plus parent commits (history). This content-addressable model provides automatic deduplication, trivial integrity verification (hash the object and compare), and efficient diff computation (unchanged subtrees share the same SHA-1).
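The addressing scheme described above can be sketched in a few lines. This is a minimal illustration, not Git's implementation: it relies on the fact that Git hashes a short header ("<type> <size>" followed by a NUL byte) in front of the raw content. The function names are made up for this example, and the tree helper assumes a flat directory of files (real Git applies a slightly different sort rule when sub-trees are present).

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type."""
    # Git hashes "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def blob_id(content: bytes) -> str:
    """Object ID of a file's content; the filename is not part of the hash."""
    return git_object_id("blob", content)


def tree_id(entries: dict[str, tuple[str, str]]) -> str:
    """Object ID of a flat directory. entries maps name -> (mode, hex object ID)."""
    body = b""
    for name in sorted(entries):  # simplification: files only, plain name order
        mode, hex_id = entries[name]
        body += f"{mode} {name}\0".encode("utf-8") + bytes.fromhex(hex_id)
    return git_object_id("tree", body)


# Two files with identical bytes share one blob, whatever their path.
readme = blob_id(b"hello\n")
copy_elsewhere = blob_id(b"hello\n")
assert readme == copy_elsewhere

# Two directories with identical entries share one tree, so a diff can skip them.
docs_v1 = tree_id({"README": ("100644", readme)})
docs_v2 = tree_id({"README": ("100644", readme)})
assert docs_v1 == docs_v2
```

If Git is installed, the blob ID for the same bytes can be cross-checked with `git hash-object`. Integrity verification is the same computation run in reverse: re-hash the stored object and compare the result with the name it is stored under.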

Packfiles and Delta Compression

Storing every version of every file as a full blob would be prohibitively expensive. Git's packfile format compresses objects using delta encoding: it stores only the binary difference between similar objects. The packing algorithm searches a window of similar objects for a good delta base for each object (often another version of the same file). A single packfile for a large repository might contain millions of objects compressed to a fraction of their uncompressed size. GitHub's storage layer runs aggressive repacking in the background, and packfile indices, which are sorted by object ID, enable O(log N) object lookup without decompressing the entire pack.
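To make the idea of delta encoding concrete, here is a toy sketch. It is not Git's binary delta format or its base-selection heuristic; it only mirrors the general shape, where a delta is a sequence of "copy this range from the base" and "insert these literal bytes" instructions. It uses Python's `difflib.SequenceMatcher` to find the shared ranges, which is an assumption made for brevity and would be far too slow for real repositories.

```python
from difflib import SequenceMatcher

# A delta is a list of instructions:
#   ("copy", offset, length)  reuse bytes from the base object
#   ("insert", data)          carry new literal bytes
Delta = list[tuple]


def make_delta(base: bytes, target: bytes) -> Delta:
    """Describe target as copies from base plus literal inserts."""
    delta: Delta = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are just never copied.
    return delta


def apply_delta(base: bytes, delta: Delta) -> bytes:
    """Rebuild the target object from the base and the delta instructions."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# Two versions of a file that differ in one small spot.
v1 = b"line one\nline two\nline three\n" * 20
v2 = v1.replace(b"line two", b"line 2", 1)

delta = make_delta(v1, v2)
assert apply_delta(v1, delta) == v2

literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(f"target is {len(v2)} bytes; the delta carries {literal_bytes} literal bytes")
```

The saving comes from the copy instructions: each one costs a few bytes no matter how long the range it reuses. The price is paid at read time, because the base object has to be reconstructed before the delta can be applied.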

Webhook Fan-Out — Event-Driven CI/CD

When a developer pushes commits, GitHub's event system generates a push event and fans it out as HTTP POST webhooks to every registered receiver: CI services, deployment tools, chat bots, and monitoring systems. A popular organization might have dozens of webhook endpoints per repository. GitHub's webhook delivery system uses a queue-based architecture and keeps a webhook delivery log for debugging. A general-purpose design in this style aims for at-least-once delivery by retrying failed deliveries with exponential backoff. On GitHub specifically, the documentation states that failed deliveries are not redelivered automatically; they can be redelivered manually or through the REST API. Either way, a receiver can see the same event more than once, so idempotency is the receiver's responsibility: CI systems must handle duplicate webhook events gracefully.
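A receiver that tolerates duplicates might look roughly like the sketch below. The two header names are the ones GitHub documents (`X-Hub-Signature-256` carries "sha256=" plus an HMAC-SHA256 of the body, and `X-GitHub-Delivery` carries a unique delivery ID). Everything else is hypothetical: the class name, the in-memory set, the example secret, and the `builds_started` list, which stands in for enqueuing a CI build.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header against an HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


class WebhookReceiver:
    """Hypothetical CI-side receiver that starts at most one build per delivery."""

    def __init__(self, secret: bytes):
        self.secret = secret
        # In-memory for the sketch; a real service would persist seen IDs.
        self.seen_deliveries: set[str] = set()
        self.builds_started: list[str] = []

    def handle(self, headers: dict[str, str], body: bytes) -> int:
        """Process one delivery and return an HTTP-style status code."""
        signature = headers.get("X-Hub-Signature-256", "")
        if not signature_is_valid(self.secret, body, signature):
            return 401
        delivery_id = headers.get("X-GitHub-Delivery", "")
        if not delivery_id:
            return 400
        if delivery_id in self.seen_deliveries:
            return 200  # duplicate: acknowledge it, but do not start another build
        self.seen_deliveries.add(delivery_id)
        self.builds_started.append(delivery_id)  # stand-in for enqueuing a build
        return 202


# The same delivery arriving twice triggers only one build.
secret = b"example-secret"  # hypothetical value
body = b'{"ref": "refs/heads/main"}'
headers = {
    "X-GitHub-Delivery": "delivery-1",  # hypothetical ID
    "X-Hub-Signature-256": "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest(),
}
receiver = WebhookReceiver(secret)
first = receiver.handle(headers, body)
second = receiver.handle(headers, body)
```

Keying the deduplication on the delivery ID rather than on the payload means two genuinely distinct pushes with identical content are still processed separately.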

Check Suites — Aggregated CI Status

The Check Suites API (part of GitHub's Checks API) allows multiple CI providers (GitHub Actions, CircleCI, Jenkins) to report test results for a single commit. Each provider creates Check Runs within a Check Suite, reporting a status (queued, in_progress, completed), a conclusion once the run has completed (for example success, failure, or neutral), and rich output (annotations on specific lines of code). Branch protection rules can require specific checks to pass before a PR is mergeable. This abstraction decouples GitHub from any specific CI provider: teams can use multiple CI systems simultaneously and see unified status on the pull request page.
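The aggregation step can be sketched as a small pure function. This is an illustration of the idea, not GitHub's logic: the `CheckRun` shape is reduced to three fields, the check names are invented, and the set of conclusions treated as passing is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckRun:
    name: str
    status: str                       # "queued", "in_progress", or "completed"
    conclusion: Optional[str] = None  # only meaningful once status is "completed"


# Assumption for this sketch: these conclusions do not block a merge.
PASSING = {"success", "neutral", "skipped"}


def merge_state(runs: list[CheckRun], required: set[str]) -> str:
    """Return "blocked", "pending", or "mergeable" for one commit."""
    by_name = {run.name: run for run in runs}
    state = "mergeable"
    for name in required:
        run = by_name.get(name)
        if run is None or run.status != "completed":
            state = "pending"          # still waiting on this required check
        elif run.conclusion not in PASSING:
            return "blocked"           # a completed failure always wins
    return state


# Results reported by three hypothetical providers for the same commit.
runs = [
    CheckRun("unit-tests", "completed", "success"),
    CheckRun("lint", "completed", "neutral"),
    CheckRun("integration", "in_progress"),
]
required = {"unit-tests", "integration"}
state_now = merge_state(runs, required)

runs[2] = CheckRun("integration", "completed", "failure")
state_after_failure = merge_state(runs, required)
```

Only the checks named in `required` influence the result, which is what lets a team add an experimental CI provider without it gating merges.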

Merge Queues — Preventing Broken Main

Without a merge queue, two PRs can each pass CI independently but break when merged together (semantic conflicts that don't produce Git merge conflicts). GitHub's merge queue serializes merges: when a PR is enqueued, its changes are combined with the latest target branch plus every PR ahead of it in the queue, and CI runs against that combined result. If CI passes, the PR is merged; if it fails, it's removed from the queue. This guarantees that every commit on main has passed CI in the exact context it will land in. The trade-off is increased merge latency: PRs wait for preceding queue entries to complete.
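The core loop of a merge queue is small enough to simulate. The sketch below is deliberately simplified: it tests entries strictly one at a time and ignores batching and speculative parallel runs. Branch contents are modeled as a list of change names, and `ci_passes` is a hypothetical stand-in for a full CI run.

```python
from typing import Callable


def process_queue(
    main: list[str],
    queue: list[str],
    ci_passes: Callable[[list[str]], bool],
) -> tuple[list[str], list[str]]:
    """Test each PR on top of everything already accepted.

    Returns (new state of main, PRs removed from the queue).
    """
    main = list(main)
    rejected: list[str] = []
    for pr in queue:
        candidate = main + [pr]  # the exact context this PR would land in
        if ci_passes(candidate):
            main = candidate
        else:
            rejected.append(pr)
    return main, rejected


# Hypothetical semantic conflict: one PR renames a function, another adds a
# call to the old name. Each passes alone; together the build breaks.
def ci_passes(changes: list[str]) -> bool:
    return not {"rename-foo", "call-foo"} <= set(changes)


assert ci_passes(["base", "rename-foo"])
assert ci_passes(["base", "call-foo"])

new_main, rejected = process_queue(
    ["base"], ["rename-foo", "call-foo", "fix-typo"], ci_passes
)
```

Because `call-foo` is tested against a candidate that already contains `rename-foo`, the conflict is caught before it reaches main, and `fix-typo` behind it still merges. The latency cost is visible in the loop: each entry waits for every CI run ahead of it.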

⬡Architecture Diagram

GitHub Pull Request & CI/CD Pipeline — simplified architecture overview

✦Core Concepts

⚙️ Git DAG

⚙️ Webhook Fan-out

⚙️ Check Suites API

📨 Merge Queues

⚙️ Pack Files

⚙️ GitHub Actions

⚖Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

✓ Strengths

✓ Content-addressable storage provides automatic deduplication and integrity verification
✓ Webhook fan-out enables a rich ecosystem of third-party CI/CD integrations
✓ Check Suites API unifies status reporting from multiple CI providers into one view
✓ Merge queues guarantee every main branch commit has passed CI in its actual merge context

✗ Weaknesses

✗ Webhook at-least-once delivery means CI receivers must handle duplicate events idempotently
✗ Delta compression in packfiles trades CPU for storage: repacking large repos is computationally expensive
✗ Merge queues add latency to the merge process as PRs wait for preceding entries
✗ SHA-1 has known collision vulnerabilities; Git is migrating to SHA-256, but the transition is complex

🎯FAANG Interview Questions

Interview Prep

💡 These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.

These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.

Q1. Explain Git's content-addressable storage model. How does Git know if two files in different directories are identical?
Q2. Design a webhook delivery system with at-least-once guarantees. How would you handle receivers that are temporarily down?
Q3. Two PRs each pass CI independently but break when merged together. Design a merge queue that prevents this.
Q4. How does Git's packfile delta compression work? What makes it more efficient than storing full snapshots?
Q5. Design a CI status aggregation system that collects results from multiple providers and enforces branch protection rules.
