GitHub Actions CI/CD: From Zero to Production Pipelines
How I built CI/CD workflows that actually work, including the secrets management mistakes and caching tricks that took me months to figure out.

My first CI/CD pipeline was a bash script on a cron job. Every five minutes, it pulled from main, ran tests, and if they passed, restarted the server. It worked until it didn't, which was a Friday evening when a broken merge passed the tests (the test suite had a silent failure mode I hadn't caught) and took down the staging environment for the weekend. Nobody noticed until Monday because the cron job kept pulling the broken code, failing, and pulling again in an infinite loop of sadness.
That Monday I set up GitHub Actions properly. Took about four hours to get the first workflow running. Took another three months of iteration before the pipelines were genuinely reliable. Going to walk through what I learned, including the parts that aren't in the quickstart docs.
Why Not Jenkins, CircleCI, or Something Else
Short answer: GitHub Actions lives where the code lives. No separate service to manage, no webhook configuration, no extra login. The workflow files sit in .github/workflows/ inside the repository. Push a commit, the pipeline runs. That's it.
Jenkins is more powerful and more configurable. It's also another server to maintain, another thing to patch, another thing that can go down at 2 AM. For teams already running Jenkins with years of custom plugins, switching doesn't make sense. If you're starting fresh, the operational overhead isn't worth it unless you need something GitHub Actions genuinely can't do.
CircleCI and GitLab CI are solid too. Used CircleCI at a previous gig and it was fine. The choice between them is less important than having any CI/CD at all. The number of teams pushing directly to production without automated tests is still higher than it should be.
The First Workflow
A workflow is a YAML file in .github/workflows/. Here's the one that handles most of what a typical Node.js project needs:
name: CI
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18, 20, 22]
    steps:
      - uses: actions/checkout@v4
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
      - name: Install dependencies
        run: npm ci
      - name: Run linter
        run: npm run lint
      - name: Run tests
        run: npm test
      - name: Build
        run: npm run build
Triggers on pushes to main and develop, and on pull requests targeting main. The matrix strategy runs the job three times in parallel, once for each Node version. npm ci instead of npm install because ci does a clean install from the lockfile, which is deterministic and faster. If someone's package-lock.json is out of sync with their package.json, npm ci will catch it. npm install won't.
One thing I got wrong early: triggering on every push to every branch. On an active team, that eats through Actions minutes fast. Limiting triggers to main, develop, and PRs targeting main is usually enough. Feature branches get tested when the PR is opened.
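Another lever for cutting wasted runs, beyond limiting branches, is skipping CI for changes that can't affect the build. This is a sketch (not from my workflow above) using the paths-ignore filter:

```yaml
on:
  push:
    branches: [main, develop]
    # Skip CI for docs-only changes -- these paths are examples
    paths-ignore:
      - '**.md'
      - 'docs/**'
```

Note that if a required status check is skipped this way, a PR touching only docs can get stuck waiting for it, so pair paths-ignore with care on protected branches.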
Caching Dependencies
Without caching, every workflow run installs dependencies from scratch. For a Node.js project with a large node_modules, that's 30-90 seconds of downloading packages. Multiply by 3 matrix variants, multiply by 20 pushes a day, and you're wasting a lot of time.
- name: Cache node modules
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ matrix.node-version }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-${{ matrix.node-version }}-
This caches the npm cache directory (not node_modules directly โ caching node_modules can cause issues with native modules across OS versions). The cache key includes the OS, Node version, and a hash of the lockfile. If the lockfile hasn't changed, the cached packages get restored and npm ci finishes in seconds instead of a minute.
The restore-keys fallback means that even if the lockfile changed (new dependency added), the closest matching cache gets restored, so only the new packages need downloading. Partial hit is better than no hit.
I also cache for other ecosystems. Python with pip:
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
Go modules:
- uses: actions/cache@v4
  with:
    path: ~/go/pkg/mod
    key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
Same pattern everywhere. Cache the package manager's cache directory, key it on the lockfile hash. Simple and effective. Cut my average workflow time from 4 minutes to under 90 seconds by adding caching.
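Worth knowing: the setup actions now have this caching built in, so for the common case you can skip the explicit actions/cache step entirely. A minimal sketch with setup-node:

```yaml
- uses: actions/setup-node@v4
  with:
    node-version: 20
    # Caches the npm cache directory, keyed on package-lock.json
    cache: 'npm'
```

setup-python and setup-go have equivalent cache inputs. The explicit cache step is still useful when you need a custom key or a non-standard path.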
Secrets Management
First mistake: hardcoding an API key in a workflow file. It was a test API key for a service that didn't matter much. Still got flagged by GitHub's secret scanning within minutes. Embarrassing but educational.
Secrets go in the repository settings under Settings > Secrets and variables > Actions. Reference them in workflows as ${{ secrets.SECRET_NAME }}. They're masked in logs; if the value accidentally gets printed, GitHub replaces it with ***.
- name: Deploy to staging
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    DATABASE_URL: ${{ secrets.STAGING_DATABASE_URL }}
  run: |
    ./deploy.sh staging
Things I learned the hard way about secrets:
Secrets aren't available in pull requests from forks. This is a security feature: a random person forking your repo and submitting a PR shouldn't get access to your production credentials. But it means workflows that need secrets will fail on fork PRs. Handle this with a conditional:
- name: Integration tests
  if: github.event.pull_request.head.repo.full_name == github.repository
  env:
    API_KEY: ${{ secrets.TEST_API_KEY }}
  run: npm run test:integration
Environment-level secrets are worth using for anything beyond a simple project. You create environments (staging, production) in repository settings, each with their own secrets and optional protection rules. The production environment can require manual approval before a deployment job runs:
deploy-production:
  needs: [test, deploy-staging]
  runs-on: ubuntu-latest
  environment:
    name: production
    url: https://myapp.example.com
  steps:
    - name: Deploy
      env:
        DATABASE_URL: ${{ secrets.DATABASE_URL }}
      run: ./deploy.sh production
With a protection rule on the production environment, this job pauses and waits for someone to click "Approve" in the GitHub UI before running. Prevents accidental production deploys from a merged PR.
A Real Deployment Workflow
The test workflow above is half the story. Here's a more complete pipeline that builds a Docker image and deploys it:
name: Deploy
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test
  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ghcr.io/${{ github.repository }}:latest
            ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to server
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: ${{ secrets.SERVER_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            docker pull ghcr.io/${{ github.repository }}:${{ github.sha }}
            docker stop myapp || true
            docker rm myapp || true
            docker run -d --name myapp \
              -p 3000:3000 \
              --env-file /opt/myapp/.env \
              ghcr.io/${{ github.repository }}:${{ github.sha }}
Three jobs, running in sequence because each depends on the previous one. Tests pass, Docker image gets built and pushed to GitHub Container Registry, then the deploy job SSHs into the server and pulls the new image.
The ${{ github.sha }} tag means every deploy is tied to a specific commit. If something breaks, you can roll back by deploying the previous commit's image. The latest tag is also pushed for convenience, but latest is a moving target โ the SHA tag is what matters for reproducibility.
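One way to make that rollback a one-click operation is a manually triggered workflow. This is a sketch with hypothetical names (the workflow name and the myapp container name are assumptions), reusing the same SSH deploy pattern and secrets from the pipeline above:

```yaml
# Hypothetical rollback workflow -- triggered by hand from the Actions tab
name: Rollback
on:
  workflow_dispatch:
    inputs:
      sha:
        description: 'Commit SHA of the image to roll back to'
        required: true
        type: string
jobs:
  rollback:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Redeploy previous image
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: ${{ secrets.SERVER_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            docker pull ghcr.io/${{ github.repository }}:${{ inputs.sha }}
            docker stop myapp || true
            docker rm myapp || true
            docker run -d --name myapp -p 3000:3000 \
              --env-file /opt/myapp/.env \
              ghcr.io/${{ github.repository }}:${{ inputs.sha }}
```

Because it targets the production environment, the same approval rule applies to rollbacks as to deploys.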
cache-from: type=gha and cache-to: type=gha,mode=max use GitHub Actions' built-in cache for Docker layer caching. Without this, every build rebuilds every layer from scratch. With it, unchanged layers get reused. My Docker build went from 6 minutes to under 2 minutes with this one addition. I covered Docker layer caching and build optimization in my Docker post if you want the details on why layer ordering matters.
Reusable Workflows and Composite Actions
Once you have more than a couple of repositories, copy-pasting workflow files becomes a maintenance problem. Change the Node version across 12 repos? That's 12 PRs.
Reusable workflows let you define a workflow once and call it from other workflows:
# .github/workflows/reusable-test.yml
name: Reusable Test Workflow
on:
  workflow_call:
    inputs:
      node-version:
        required: false
        type: string
        default: '20'
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci
      - run: npm test
Call it from another workflow:
jobs:
  test:
    uses: my-org/shared-workflows/.github/workflows/reusable-test.yml@main
    with:
      node-version: '20'
Composite actions are the other option: bundle multiple steps into a single action. Better for reusing a sequence of steps within a job. Reusable workflows are better for reusing entire jobs.
I have a shared repository with reusable workflows for testing, building Docker images, and deploying. All my project repos call into it. Update the shared repo once, every downstream project picks up the change on their next run.
Handling Flaky Tests
Nothing erodes trust in CI faster than flaky tests. A test that passes 95% of the time and fails 5% of the time makes people start ignoring CI failures entirely. "Oh, it's probably that flaky integration test again." Then a real failure slips through because nobody checked.
Two approaches that helped.
First, retry on failure for known flaky tests (while you fix the underlying issue):
- name: Run integration tests
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 10
    max_attempts: 3
    command: npm run test:integration
Second, separate fast unit tests from slow integration tests into different jobs. Unit tests should never be flaky; if they are, the test is broken. Integration tests that depend on external services will occasionally fail due to network issues or service hiccups. Running them separately means a flaky integration test doesn't block the unit test results:
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:unit
  integration-tests:
    runs-on: ubuntu-latest
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:integration
The continue-on-error: true on the job means a failing integration job won't fail the overall workflow run. Controversial choice. I only use it temporarily while fixing the flaky test, not as a permanent solution. A green check that ignores failures is worse than no CI at all.
Matrix Strategies Beyond Node Versions
The matrix feature is more flexible than just testing across runtime versions. Testing across operating systems:
strategy:
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
    node-version: [18, 20]
runs-on: ${{ matrix.os }}
Six combinations, running in parallel. Catches platform-specific bugs before they reach users. Found a file path handling bug on Windows this way: forward slashes vs. backslashes in a path join.
You can also exclude specific combinations:
strategy:
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
    node-version: [18, 20]
    exclude:
      - os: windows-latest
        node-version: 18
Skips Node 18 on Windows if you know that combination isn't relevant. Saves minutes and money on hosted runners.
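The inverse exists too: include adds combinations (or extra variables) on top of the base matrix. A sketch, assuming you mostly test on Linux but want one macOS sanity check:

```yaml
strategy:
  matrix:
    os: [ubuntu-latest]
    node-version: [18, 20]
    include:
      # One extra combination beyond the base matrix
      - os: macos-latest
        node-version: 20
```

Three jobs instead of four, with the expensive macOS runner used exactly once.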
Scheduled Workflows
Not everything triggers on push. Some tasks need to run on a schedule. Dependency audits, for instance:
name: Security Audit
on:
  schedule:
    - cron: '0 9 * * 1' # Every Monday at 9 AM UTC
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm audit --production
      - name: Check for outdated dependencies
        run: npx npm-check-updates
The cron syntax is the same as standard cron (I covered how cron works in my Linux CLI post). This runs every Monday morning. If npm audit finds vulnerabilities, the workflow fails and I get a notification. Probably better than remembering to run it manually, which I definitely won't.
One gotcha: scheduled workflows only run on the default branch. If you create the workflow on a feature branch, it won't trigger until it's merged to main.
Notifications and Status Checks
GitHub's built-in notifications for workflow failures are adequate but easy to miss in a crowded inbox. For critical pipelines, I send notifications to Slack:
- name: Notify on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": "CI failed on ${{ github.repository }} (${{ github.ref_name }})",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*CI Failed* on `${{ github.repository }}`\nBranch: `${{ github.ref_name }}`\nCommit: ${{ github.sha }}\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>"
            }
          }
        ]
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
The if: failure() condition means this step only runs when a previous step failed. No spam on success. Only alerts when something needs attention.
Branch protection rules are the other piece. In repository settings, require status checks to pass before merging. The test job becomes a required check: PRs can't merge until it passes. This is the single most impactful thing you can do for code quality. Not the tests themselves, but making the tests a gate. Can't skip them, can't merge around them, can't say "I'll fix the tests later."
Cost and Minutes
GitHub Actions gives free accounts 2,000 minutes per month on Linux runners. That sounds like a lot until you have a matrix of 6 combinations running on every push to an active repo. Each run takes 3 minutes, that's 18 minutes per push. Ten pushes a day is 180 minutes. Across a month with 20 working days, that's 3,600 minutes, well over the free limit.
Ways to reduce usage: limit triggers (not every push needs CI), cache aggressively, cancel redundant runs when new commits are pushed to the same branch:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
This cancels any in-progress run for the same workflow and branch when a new run starts. If I push three commits in quick succession, only the last one gets a full CI run. The first two are cancelled. Saves minutes and gives faster feedback on the latest code.
macOS runners cost 10x Linux runners. Windows runners cost 2x. If you're testing cross-platform, those multiply your usage fast. Consider whether you actually need macOS and Windows CI, or if Linux-only catches 99% of the bugs for your use case.
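One more cost guard worth adding: job timeouts. The default timeout is 360 minutes, so a single hung test run can quietly burn a sixth of the free monthly allowance. A sketch:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    # Kill the job well before the 360-minute default; pick a ceiling
    # a little above your slowest legitimate run
    timeout-minutes: 15
```

timeout-minutes also works per-step if one specific step (a flaky integration suite, say) is the usual culprit.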
What I'd Set Up Differently Today
If starting a new project today, the workflow would be: lint and type-check as the fastest job (catches obvious issues in seconds), unit tests in parallel, integration tests in a separate job, Docker build and push on main only, deploy to staging automatically, deploy to production with manual approval.
I'd also use GitHub's newer OpenID Connect (OIDC) support instead of storing AWS credentials as secrets. OIDC lets the workflow request short-lived credentials from AWS directly, without any stored secrets. Less to rotate, smaller blast radius if something gets compromised. The setup is more involved but worth it for production deployments.
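For AWS specifically, the workflow side of OIDC is small; the bulk of the setup is creating the IAM identity provider and role in AWS. A sketch of the workflow half, where the role ARN is a placeholder for whatever role you configure:

```yaml
permissions:
  id-token: write   # required to request the OIDC token
  contents: read
steps:
  - uses: actions/checkout@v4
  - name: Configure AWS credentials via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      # Placeholder ARN -- point this at a role whose trust policy
      # allows your repo's OIDC claims
      role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
      aws-region: us-east-1
```

After that step, subsequent steps have short-lived AWS credentials in the environment with nothing stored in repository secrets.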
Probably the biggest lesson from three years of CI/CD: the pipeline is a product, not a one-time setup. It needs maintenance, monitoring, and iteration just like the application code. Treat workflow files with the same care you'd treat production code โ review changes, test them, and don't let them rot. A neglected pipeline will hurt you exactly when you can least afford it.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.