Developer Tutorials
Instructor: Puneet Arora
Full Tutorial · Python CLI Engineering

Build a GitHub Project Finder CLI

A complete guide to architecting, building, and deploying a production-grade command-line tool that searches GitHub with advanced filters, runs as a background daemon, and writes structured logs and reports.

Python 3.11+ · Typer · httpx · APScheduler · ~60 min read · Beginner → Advanced

00 Overview & Goals

By the end of this tutorial you will have a fully working CLI tool named github-finder that lets you:

  • Search GitHub repositories with compound filters — language, star count, topics, license, push date, and more.
  • Run as a background daemon that polls on a schedule and only surfaces new results.
  • Write structured JSON logs with automatic rotation (5 MB per file, 3 backups).
  • Generate reports in Markdown, JSON, or HTML after every run.
  • Follow professional CLI standards: exit codes, config hierarchy, shell autocompletion, and a clean --help.
Prerequisites
Python 3.11+, a GitHub personal access token (classic or fine-grained), and familiarity with pip and virtual environments. No prior CLI framework experience is required.

01 System Architecture

The tool is split into four horizontal layers, each with a single responsibility. The diagram below shows how a user command flows all the way to a report file and how the daemon sits as a wrapper around that same pipeline.

Full System Architecture
[Diagram: four layers, top to bottom, plus the daemon wrapper]
ENTRY    CLI entry point: github-finder search | watch | report | daemon
CMDS     search cmd (filters, output fmt) · watch cmd (periodic polling loop) · daemon cmd (start / stop / status)
CORE     filter engine (lang · stars · topics) · GitHub API client (rate-limit aware · async) · scheduler (APScheduler / cron)
OUTPUT   report writer (JSON / MD / HTML) · logger (structured · rotating) · cache / state (SQLite / JSON store)
Daemon process (background): PID file (~/.local / /var/run) · log rotation (RotatingFileHandler) · systemd / launchd (service unit file)

Data always flows top → bottom. The daemon is not a separate code path — it is simply a wrapper that calls the same pipeline on a schedule. This keeps the codebase clean and testable.
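
The "one pipeline, two callers" idea can be sketched as follows. This is only an illustration under stated assumptions: the names run_pipeline, run_once, and run_scheduled are hypothetical and do not come from the tutorial's codebase, and the fetch and report callables stand in for the API client and the report writer.

```python
from typing import Callable, Iterable

Repo = dict  # one repository record, as produced by the search step


def run_pipeline(
    fetch: Callable[[], Iterable[Repo]],
    seen_ids: set[int],
    report: Callable[[list[Repo]], None],
) -> list[Repo]:
    """Fetch repos, keep only unseen ones, report them, and remember their IDs."""
    fresh = [r for r in fetch() if r["id"] not in seen_ids]
    seen_ids.update(r["id"] for r in fresh)
    report(fresh)
    return fresh


def run_once(fetch, report) -> list[Repo]:
    """What a one-shot search would do: no memory carried between runs."""
    return run_pipeline(fetch, set(), report)


def run_scheduled(fetch, report, seen_ids: set[int], ticks: int) -> list[list[Repo]]:
    """What the daemon would do: call the same pipeline once per scheduler tick."""
    return [run_pipeline(fetch, seen_ids, report) for _ in range(ticks)]
```

Because both entry points go through run_pipeline, a unit test of that one function covers the interactive and the background path alike.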

02 Project Structure

Keep the package flat enough to navigate in one glance, deep enough to prevent any single file from growing unmanageable.

github-finder/
├── src/
│   └── github_finder/
│       ├── __init__.py
│       ├── cli.py               # Typer app + sub-app registration
│       ├── commands/
│       │   ├── search.py        # search sub-command
│       │   ├── watch.py         # watch sub-command
│       │   └── daemon.py        # daemon start/stop/status
│       ├── core/
│       │   ├── api.py           # GitHub API client (async, httpx)
│       │   ├── filters.py       # SearchFilters dataclass + builder
│       │   └── scheduler.py     # APScheduler job definitions
│       ├── output/
│       │   ├── reporter.py      # JSON / Markdown / HTML report writer
│       │   └── logger.py        # Structured JSON rotating logger
│       └── state/
│           └── store.py         # SQLite cache for seen repo IDs
├── tests/
│   ├── test_filters.py
│   ├── test_api.py              # uses respx to mock httpx
│   └── test_reporter.py
├── templates/
│   └── report.html.j2           # Jinja2 HTML report template
├── pyproject.toml               # PEP 517/518 build metadata
└── github-finder.service        # systemd unit file
Tip — src layout
Placing the package under src/ ensures that tests import the installed package rather than accidentally picking up the local source directory. Always use pip install -e . for development.

03 CLI Standards & Typer Setup

Use Typer (built on Click) as the CLI framework. It derives argument types, help text, and validation directly from Python type hints, so each command needs only a single @app.command() decorator rather than one decorator per option. Pair it with Rich for coloured output and tables.

Entry point — cli.py

Python · cli.py
import typer
from .commands import search, watch, daemon

app = typer.Typer(
    name="github-finder",
    help="Find GitHub repos with advanced filters.",
    add_completion=True,        # enables --install-completion
    rich_markup_mode="rich",    # coloured help text
    no_args_is_help=True,
)

app.add_typer(search.app, name="search")
app.add_typer(watch.app,  name="watch")
app.add_typer(daemon.app, name="daemon")

@app.callback(invoke_without_command=True)
def main(version: bool = typer.Option(False, "--version", "-V")):
    if version:
        typer.echo("github-finder v1.0.0")
        raise typer.Exit()

if __name__ == "__main__":
    app()

Conventions to follow

Convention         Rule
Exit codes         0 = success · 1 = user error · 2 = runtime/API error
Output format      --output table|json|markdown|html (default: table)
Config hierarchy   CLI flags → env vars → ~/.config/github-finder/config.toml → defaults
Token security     Read from GITHUB_TOKEN env var or OS keyring; never hardcode
Verbosity          -v / --verbose bumps log level to DEBUG
Dry run            --dry-run shows the API query without hitting GitHub
Errors             Never print raw tracebacks to stdout; log the trace, show a clean message

04 The Filter Engine

All filters are encapsulated in a single SearchFilters dataclass. The to_github_query() method translates them into the GitHub search query syntax.

Filter Engine — data flow
[Diagram: CLI flags (--lang, --stars…) → SearchFilters (@dataclass) → to_github_query() → query string (language:python …) → API]
Python · core/filters.py
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchFilters:
    query:        str
    language:     Optional[str]   = None
    min_stars:    int             = 0
    max_stars:    Optional[int]   = None
    topics:       list[str]       = field(default_factory=list)
    license:      Optional[str]   = None  # mit, apache-2.0, gpl-3.0 …
    pushed_after: Optional[str]   = None  # ISO date "2024-01-01"
    archived:     bool            = False
    has_issues:   Optional[bool]  = None
    sort_by:      str             = "stars"   # stars|forks|updated
    order:        str             = "desc"

    def to_github_query(self) -> str:
        parts = [self.query]
        if self.language:
            parts.append(f"language:{self.language}")
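        # Build the stars qualifier; a maximum without a minimum must still apply.
        if self.min_stars and self.max_stars is not None:
            parts.append(f"stars:{self.min_stars}..{self.max_stars}")
        elif self.max_stars is not None:
            parts.append(f"stars:<={self.max_stars}")
        elif self.min_stars:
            parts.append(f"stars:>={self.min_stars}")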
        for t in self.topics:
            parts.append(f"topic:{t}")
        if self.license:
            parts.append(f"license:{self.license}")
        if self.pushed_after:
            parts.append(f"pushed:>{self.pushed_after}")
        if not self.archived:
            parts.append("archived:false")
        return " ".join(parts)
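
To see what a --dry-run might print, the query string can be turned into the full request URL without sending anything. The sketch below is an assumption about how such a flag could be implemented; build_search_url is a hypothetical helper, and only the endpoint and parameter names are taken from the tutorial's API client.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query: str, sort: str = "stars", order: str = "desc",
                     per_page: int = 30, page: int = 1) -> str:
    """Return the full request URL for a query string, without sending it."""
    params = {"q": query, "sort": sort, "order": order,
              "per_page": per_page, "page": page}
    # urlencode percent-encodes qualifiers such as ':' and '>' and turns spaces into '+'.
    return f"{SEARCH_URL}?{urlencode(params)}"


# This query string is the kind of value to_github_query() produces for
# query="cli", language="python", min_stars=100.
print(build_search_url("cli language:python stars:>=100 archived:false"))
```

Printing the encoded URL makes it easy to paste the request into a browser or curl when a filter combination returns unexpected results.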

05 GitHub API Client

Use httpx for async HTTP and tenacity for retry logic with exponential back-off. Besides the primary limit, the GitHub Search API enforces secondary rate limits; when a 403 or 429 response carries a Retry-After header, wait at least that many seconds before retrying.

Rate Limits
The Search API allows 30 requests/minute when authenticated. Always check X-RateLimit-Remaining headers and sleep when approaching zero. Cache results for at least 60 seconds to avoid redundant calls.
Python · core/api.py
import httpx, asyncio
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

class GitHubClient:
    BASE = "https://api.github.com"

    def __init__(self, token: str):
        self.client = httpx.AsyncClient(
            headers={
                "Authorization":        f"Bearer {token}",
                "Accept":               "application/vnd.github+json",
                "X-GitHub-Api-Version": "2022-11-28",
            },
            timeout=15.0,
        )

    @retry(
        wait=wait_exponential(multiplier=1, min=4, max=60),
        stop=stop_after_attempt(5),
        retry=retry_if_exception_type(httpx.HTTPStatusError),
    )
    async def search_repos(self, filters, page=1, per_page=30):
        resp = await self.client.get(
            f"{self.BASE}/search/repositories",
            params={
                "q":        filters.to_github_query(),
                "sort":     filters.sort_by,
                "order":    filters.order,
                "per_page": per_page,
                "page":     page,
            }
        )
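        if resp.status_code == 429:
            # Rate limited: wait as long as GitHub asks before giving up on this attempt.
            retry_after = int(resp.headers.get("Retry-After", 60))
            await asyncio.sleep(retry_after)
        # Raises httpx.HTTPStatusError on any 4xx/5xx, which triggers the tenacity
        # retry above. Note that this also retries non-transient errors (401, 422).
        resp.raise_for_status()
        return resp.json()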

06 Daemon Mode

Two strategies — choose based on your deployment target. For servers, prefer systemd. For development / cross-platform, use the Python PID-file approach.

Daemon lifecycle state machine
[Diagram: stopped → (start) → starting → (fork OK) → running → (SIGTERM) → stopping → stopped; restart via systemd Restart=on-failure]

Option A — systemd service (recommended on Linux)

INI · github-finder.service
[Unit]
Description=GitHub Finder background watcher
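After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# %i only expands in template units (name@.service); set a real account here
User=your_username
ExecStart=/usr/local/bin/github-finder watch --interval 3600
Restart=on-failure
RestartSec=30
StandardOutput=journal
StandardError=journal
# Placeholder only; for real deployments load the token via EnvironmentFile=
# from a file readable only by the service user instead of inlining it
Environment=GITHUB_TOKEN=your_token_here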

[Install]
WantedBy=multi-user.target
Bash — install & manage
# Install and enable
sudo cp github-finder.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now github-finder

# Follow logs live
journalctl -fu github-finder

# One-shot status
systemctl status github-finder

Option B — Python PID file (cross-platform)

Python · commands/daemon.py
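import os, sys, signal, typer
from pathlib import Path

# Assumption: the blocking scheduler loop is defined in commands/watch.py
from .watch import run_watcher_loop

PID_FILE = Path.home() / ".local/share/github-finder/daemon.pid"
LOG_DIR  = Path.home() / ".local/share/github-finder/logs"

def _pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (Unix only)."""
    try:
        os.kill(pid, 0)          # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:      # process exists but belongs to another user
        return True
    return True

def start_daemon():
    if PID_FILE.exists():
        old_pid = int(PID_FILE.read_text().strip())
        if _pid_alive(old_pid):
            typer.echo(f"Already running (PID {old_pid})", err=True)
            raise typer.Exit(1)
        PID_FILE.unlink()        # stale PID file left behind by a crashed run

    # Create both directories before forking so the child can open its log file
    PID_FILE.parent.mkdir(parents=True, exist_ok=True)
    LOG_DIR.mkdir(parents=True, exist_ok=True)

    pid = os.fork()   # Unix only; use subprocess on Windows
    if pid > 0:         # parent: write PID and exit
        PID_FILE.write_text(str(pid))
        typer.echo(f"Daemon started (PID {pid})")
        sys.exit(0)

    # Child: detach from terminal
    os.setsid()
    sys.stdin  = open(os.devnull)
    sys.stdout = open(LOG_DIR / "daemon.log", "a")
    sys.stderr = sys.stdout
    run_watcher_loop()   # blocking APScheduler loop

def stop_daemon():
    if not PID_FILE.exists():
        typer.echo("Daemon is not running.", err=True)
        raise typer.Exit(1)
    pid = int(PID_FILE.read_text().strip())
    try:
        os.kill(pid, signal.SIGTERM)
        typer.echo(f"Daemon stopped (PID {pid})")
    except ProcessLookupError:
        typer.echo(f"No process with PID {pid}; removing stale PID file.", err=True)
    finally:
        PID_FILE.unlink()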
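
Since stop_daemon() sends SIGTERM, the watcher loop needs a way to notice it and clean up. The sketch below swaps the APScheduler loop for a plain polling loop so it stays self-contained; StopFlag, install_signal_handlers, and watcher_loop are hypothetical names, not part of the tutorial's code.

```python
import signal
import time


class StopFlag:
    """Holds the shutdown request; the handler only flips a boolean."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        self.requested = True


def install_signal_handlers(flag: StopFlag) -> None:
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGTERM, flag.handle)
    signal.signal(signal.SIGINT, flag.handle)


def watcher_loop(run_once, interval: float, flag: StopFlag, on_shutdown,
                 poll: float = 1.0) -> int:
    """Call run_once() every `interval` seconds until a stop is requested."""
    runs = 0
    while not flag.requested:
        run_once()
        runs += 1
        # Sleep in short slices so a stop request is noticed within `poll` seconds.
        deadline = time.monotonic() + interval
        while not flag.requested and time.monotonic() < deadline:
            time.sleep(min(poll, max(deadline - time.monotonic(), 0.0)))
    on_shutdown()  # e.g. flush logs, write a partial report
    return runs
```

Keeping the handler down to a single assignment means the run in progress always finishes, and all cleanup happens in normal code rather than inside the signal handler.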

07 Structured Logging

Every daemon run must leave a structured, machine-readable trace. Use Python's built-in RotatingFileHandler with a custom JSON formatter so logs are both human-readable and parseable by tools like jq.

Python · output/logger.py
import logging, json, datetime
from logging.handlers import RotatingFileHandler
from pathlib import Path

LOG_DIR = Path.home() / ".local/share/github-finder/logs"
LOG_DIR.mkdir(parents=True, exist_ok=True)

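# Attribute names every LogRecord carries; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}

class JSONFormatter(logging.Formatter):
    def format(self, record) -> str:
        # logging copies the keys of `extra=` onto the record itself; there is no
        # record.extra attribute, so collect the non-standard attributes instead.
        extra = {k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS}
        return json.dumps({
            "ts":      datetime.datetime.fromtimestamp(record.created, datetime.timezone.utc).isoformat(),
            "level":   record.levelname,
            "module":  record.module,
            "message": record.getMessage(),
            "extra":   extra,
        }, default=str)   # default=str keeps non-JSON values (e.g. Path) from raising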

def setup_logger(verbose: bool = False) -> logging.Logger:
    log = logging.getLogger("github_finder")
    log.setLevel(logging.DEBUG if verbose else logging.INFO)
    if log.handlers:   # already configured; adding handlers again would duplicate every line
        return log

    # Rotating file: 5 MB per file, keep 3 backups
    fh = RotatingFileHandler(
        LOG_DIR / "app.log",
        maxBytes=5 * 1024 * 1024,
        backupCount=3,
    )
    fh.setFormatter(JSONFormatter())

    # Pretty console output via Rich (warnings and above only)
    from rich.logging import RichHandler
    ch = RichHandler(rich_tracebacks=True)
    ch.setLevel(logging.WARNING)

    log.addHandler(fh)
    log.addHandler(ch)
    return log

# Usage: log.info("Run complete", extra={"results": 12, "new": 4})
Pro tip — query logs with jq
Because every log line is valid JSON you can filter your daemon logs like a database:
jq 'select(.level=="ERROR")' app.log

08 Report Generation

After each scheduled run the daemon writes a timestamped report. Support three formats so users can choose what integrates with their workflow.

Python · output/reporter.py
import json, datetime
from pathlib import Path
from jinja2 import Environment, PackageLoader

REPORT_DIR = Path.home() / ".local/share/github-finder/reports"

def write_report(results: list[dict], fmt: str = "markdown") -> Path:
    ts   = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    stem = REPORT_DIR / f"report_{ts}"
    REPORT_DIR.mkdir(parents=True, exist_ok=True)

    if fmt == "json":
        path = stem.with_suffix(".json")
        path.write_text(json.dumps(results, indent=2))

    elif fmt == "markdown":
        path = stem.with_suffix(".md")
        lines = [
            f"# GitHub Finder Report — {ts}\n",
            f"**{len(results)} repositories found**\n",
        ]
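        for r in results:
            # The API returns null for a missing language or description, so
            # fall back with `or`; a .get() default would not replace None.
            language    = r.get("language") or "N/A"
            description = r.get("description") or ""
            lines += [
                f"## [{r['full_name']}]({r['html_url']})",
                f"- ⭐ {r['stargazers_count']}  |  🍴 {r['forks_count']}",
                f"- `{language}`  |  Updated: {r['updated_at'][:10]}",
                f"\n{description}\n",
            ]
        path.write_text("\n".join(lines), encoding="utf-8")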

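    elif fmt == "html":
        # PackageLoader looks for a templates/ directory inside the github_finder
        # package, so report.html.j2 must ship within the package itself.
        # autoescape=True keeps repo names and descriptions from injecting HTML.
        env  = Environment(loader=PackageLoader("github_finder"), autoescape=True)
        tmpl = env.get_template("report.html.j2")
        path = stem.with_suffix(".html")
        path.write_text(tmpl.render(results=results, ts=ts), encoding="utf-8")

    else:
        # Without this branch an unknown format would fail with UnboundLocalError.
        raise ValueError(f"Unsupported report format: {fmt!r}")

    return path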

09 Library Reference

Purpose                 Library                  Install
CLI framework           typer + rich             pip install "typer[all]"
Async HTTP client       httpx                    pip install httpx
Retry / back-off        tenacity                 pip install tenacity
Background scheduling   APScheduler              pip install apscheduler
Config files (TOML)     tomllib (stdlib 3.11+)   built-in
HTML reports            jinja2                   pip install jinja2
Local state / cache     sqlite3                  built-in
Secure token storage    keyring                  pip install keyring
Python daemon           python-daemon            pip install python-daemon
Testing                 pytest + respx           pip install pytest respx
Packaging               hatch or poetry          pip install hatch

10 Best Practices

  • Never hardcode tokens. Read from GITHUB_TOKEN env var or OS keyring. Add the config file to .gitignore.
  • Idempotent daemon runs. Store seen repo IDs in SQLite so each run only emits new results, not the same list every hour.
  • Graceful shutdown. Catch SIGTERM / SIGINT. Flush the log and write a partial report before exiting.
  • Config hierarchy. CLI flags → env vars → TOML config → hardcoded defaults. Never skip levels.
  • Rate limit budgeting. Before each batch, check X-RateLimit-Remaining. If below 5, sleep until X-RateLimit-Reset.
  • Dry-run flag. --dry-run prints the constructed query string without hitting the API. Essential for CI and debugging.
  • Structured logs everywhere. Each run should log: start time, filters used, results count, new results count, API calls made, and any errors.
  • Test with mocks. Use respx to mock httpx responses in unit tests. Never hit the real API in CI.
  • Pin dependencies. Use pip-compile or Poetry lock files. A floating specifier such as httpx==0.* will bite you.
  • Semantic versioning. Tag releases as v1.0.0. Expose the version via --version and in log headers.