GitHub Project Finder CLI

00 Overview & Goals

By the end of this tutorial you will have a fully working CLI tool named github-finder that lets you:

Search GitHub repositories with compound filters — language, star count, topics, license, push date, and more.
Run as a background daemon that polls on a schedule and only surfaces new results.
Write structured JSON logs with automatic rotation (5 MB per file, 3 backups).
Generate reports in Markdown, JSON, or HTML after every run.
Follow professional CLI standards: exit codes, config hierarchy, shell autocompletion, and a clean --help.

Prerequisites

Python 3.11+, a GitHub personal access token (classic or fine-grained), and familiarity with pip and virtual environments. No prior CLI framework experience is required.

01 System Architecture

The tool is split into four horizontal layers, each with a single responsibility. The diagram below shows how a user command flows all the way to a report file and how the daemon sits as a wrapper around that same pipeline.

Full System Architecture

Data always flows top → bottom. The daemon is not a separate code path — it is simply a wrapper that calls the same pipeline on a schedule. This keeps the codebase clean and testable.
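
The "one pipeline, two callers" idea can be sketched as follows. This is only an illustration under stated assumptions: `run_pipeline`, `watch`, and the injected `search` and `report` callables are hypothetical names rather than the tutorial's actual modules, and the scheduler is reduced to a plain loop.

```python
import time
from typing import Callable, Iterable

Repo = dict  # a repository record as returned by the search layer


def run_pipeline(
    search: Callable[[], Iterable[Repo]],
    seen_ids: set[int],
    report: Callable[[list[Repo]], None],
) -> list[Repo]:
    """One pass: search, keep only repos not seen before, report them."""
    new = [r for r in search() if r["id"] not in seen_ids]
    seen_ids.update(r["id"] for r in new)
    report(new)
    return new


def watch(interval: float, runs: int, **pipeline_args) -> None:
    """Daemon mode: call the very same pipeline repeatedly on a fixed interval."""
    for i in range(runs):
        run_pipeline(**pipeline_args)
        if i < runs - 1:
            time.sleep(interval)
```

Because `watch` adds no logic of its own, testing `run_pipeline` once covers both the one-shot and the scheduled path.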

02 Project Structure

Keep the package flat enough to navigate in one glance, deep enough to prevent any single file from growing unmanageable.

github-finder/
├── src/
│   └── github_finder/
│       ├── __init__.py
│       ├── cli.py              # Typer app + sub-app registration
│       ├── commands/
│       │   ├── search.py       # search sub-command
│       │   ├── watch.py        # watch sub-command
│       │   └── daemon.py       # daemon start/stop/status
│       ├── core/
│       │   ├── api.py          # GitHub API client (async, httpx)
│       │   ├── filters.py      # SearchFilters dataclass + builder
│       │   └── scheduler.py    # APScheduler job definitions
│       ├── output/
│       │   ├── reporter.py     # JSON / Markdown / HTML report writer
│       │   └── logger.py       # Structured JSON rotating logger
│       └── state/
│           └── store.py        # SQLite cache for seen repo IDs
├── tests/
│   ├── test_filters.py
│   ├── test_api.py             # uses respx to mock httpx
│   └── test_reporter.py
├── templates/
│   └── report.html.j2          # Jinja2 HTML report template
├── pyproject.toml              # PEP 517/518 build metadata
└── github-finder.service       # systemd unit file

Tip — src layout

Placing the package under src/ stops tests from accidentally importing the local source tree instead of the installed package. Always use pip install -e . for development.

03 CLI Standards & Typer Setup

Use Typer (built on Click) as the CLI framework. It derives argument types, help text, and validation directly from Python type hints, so a command needs little boilerplate beyond a single decorator. Pair it with Rich for coloured output and tables.

Entry point — cli.py

Python · cli.py

import typer
from .commands import search, watch, daemon

app = typer.Typer(
    name="github-finder",
    help="Find GitHub repos with advanced filters.",
    add_completion=True,        # enables --install-completion
    rich_markup_mode="rich",    # coloured help text
    no_args_is_help=True,
)

app.add_typer(search.app, name="search")
app.add_typer(watch.app,  name="watch")
app.add_typer(daemon.app, name="daemon")

@app.callback(invoke_without_command=True)
def main(version: bool = typer.Option(False, "--version", "-V")):
    if version:
        typer.echo("github-finder v1.0.0")
        raise typer.Exit()

if __name__ == "__main__":
    app()

Conventions to follow

Convention	Rule
Exit codes	`0` = success · `1` = user error · `2` = runtime/API error
Output format	`--output table|json|markdown|html`, default `table`
Config hierarchy	CLI flags → env vars → `~/.config/github-finder/config.toml` → defaults
Token security	Read from `GITHUB_TOKEN` env var or OS keyring; never hardcode
Verbosity	`-v / --verbose` bumps log level to `DEBUG`
Dry run	`--dry-run` shows the API query without hitting GitHub
Errors	Never print raw tracebacks to stdout; log the trace, show a clean message

04 The Filter Engine

All filters are encapsulated in a single SearchFilters dataclass. The to_github_query() method translates them into the GitHub search query syntax.

Filter Engine — data flow

Python · core/filters.py

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchFilters:
    query:        str
    language:     Optional[str]   = None
    min_stars:    int             = 0
    max_stars:    Optional[int]   = None
    topics:       list[str]       = field(default_factory=list)
    license:      Optional[str]   = None  # mit, apache-2.0, gpl-3.0 …
    pushed_after: Optional[str]   = None  # ISO date "2024-01-01"
    archived:     bool            = False
    has_issues:   Optional[bool]  = None
    sort_by:      str             = "stars"   # stars|forks|updated
    order:        str             = "desc"

    def to_github_query(self) -> str:
        parts = [self.query]
        if self.language:
            parts.append(f"language:{self.language}")
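        # Range syntax: stars:N..M when an upper bound is set, stars:>=N otherwise.
        # Checking max_stars first keeps it from being ignored when min_stars is 0.
        if self.max_stars is not None:
            parts.append(f"stars:{self.min_stars}..{self.max_stars}")
        elif self.min_stars > 0:
            parts.append(f"stars:>={self.min_stars}")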
        for t in self.topics:
            parts.append(f"topic:{t}")
        if self.license:
            parts.append(f"license:{self.license}")
        if self.pushed_after:
            parts.append(f"pushed:>{self.pushed_after}")
        if not self.archived:
            parts.append("archived:false")
        return " ".join(parts)
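
GitHub's search syntax accepts numeric ranges in three forms: `>=N`, `N..M`, and `<=M`. The standalone helper below is a sketch of how each combination of bounds could map to a `stars:` qualifier; `stars_qualifier` is a hypothetical name and is not part of the SearchFilters class above.

```python
from typing import Optional


def stars_qualifier(min_stars: int = 0, max_stars: Optional[int] = None) -> Optional[str]:
    """Build a stars: qualifier, or return None when no bound is requested."""
    if max_stars is not None and max_stars < min_stars:
        raise ValueError("max_stars must be greater than or equal to min_stars")
    if max_stars is not None and min_stars > 0:
        return f"stars:{min_stars}..{max_stars}"   # closed range
    if max_stars is not None:
        return f"stars:<={max_stars}"              # upper bound only
    if min_stars > 0:
        return f"stars:>={min_stars}"              # lower bound only
    return None
```

Rejecting an inverted range early turns a confusing empty result set into a clear user error, which fits the exit-code convention for invalid input.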

05 GitHub API Client

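Use httpx for async HTTP and tenacity for retry logic with exponential back-off. The GitHub Search API enforces secondary rate limits on top of its per-minute quota, and GitHub may signal either one with a 403 or a 429 response. When such a response carries a Retry-After header, you must wait that many seconds before retrying.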

Rate Limits

The Search API allows 30 requests/minute when authenticated. Always check X-RateLimit-Remaining headers and sleep when approaching zero. Cache results for at least 60 seconds to avoid redundant calls.
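
The "check the remaining budget, then sleep" rule can be sketched as a pure function over the response headers. The function name and the threshold of 5 are assumptions chosen for illustration; `X-RateLimit-Reset` is the time, in UTC epoch seconds, at which the quota resets.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(
    headers: Mapping[str, str],
    threshold: int = 5,
    now: Optional[float] = None,
) -> float:
    """Return how long to sleep before the next request (0.0 if budget remains)."""
    now = time.time() if now is None else now
    # Missing header: assume budget is available rather than stalling
    remaining = int(headers.get("X-RateLimit-Remaining", threshold))
    if remaining >= threshold:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)  # never negative if the reset is already past
```

With httpx, `resp.headers` can be passed in directly since its lookups are case-insensitive; a plain dict needs the exact header spelling used here.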

Python · core/api.py

import httpx, asyncio
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

class GitHubClient:
    BASE = "https://api.github.com"

    def __init__(self, token: str):
        self.client = httpx.AsyncClient(
            headers={
                "Authorization":        f"Bearer {token}",
                "Accept":               "application/vnd.github+json",
                "X-GitHub-Api-Version": "2022-11-28",
            },
            timeout=15.0,
        )

    @retry(
        wait=wait_exponential(multiplier=1, min=4, max=60),
        stop=stop_after_attempt(5),
        retry=retry_if_exception_type(httpx.HTTPStatusError),
    )
    async def search_repos(self, filters, page=1, per_page=30):
        resp = await self.client.get(
            f"{self.BASE}/search/repositories",
            params={
                "q":        filters.to_github_query(),
                "sort":     filters.sort_by,
                "order":    filters.order,
                "per_page": per_page,
                "page":     page,
            }
        )
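        # GitHub signals rate limiting with 403 or 429. Wait out Retry-After when
        # it is present; raise_for_status() then hands control to the tenacity retry.
        if resp.status_code in (403, 429):
            retry_after = resp.headers.get("Retry-After")
            if retry_after is not None:
                await asyncio.sleep(int(retry_after))
        resp.raise_for_status()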
        return resp.json()

06 Daemon Mode

Two strategies: choose based on your deployment target. For servers, prefer systemd. For development or hosts without systemd, use the Python PID-file approach; as written it relies on os.fork(), so it runs on Unix-like systems only.

Daemon lifecycle state machine

Option A — systemd service (recommended on Linux)

INI · github-finder.service

[Unit]
Description=GitHub Finder background watcher
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
# %i only expands in template units (name@.service); set a concrete account here
User=your_user_here
ExecStart=/usr/local/bin/github-finder watch --interval 3600
Restart=on-failure
RestartSec=30
StandardOutput=journal
StandardError=journal
# Placeholder only. Prefer EnvironmentFile= pointing at a file readable only by the service user.
Environment=GITHUB_TOKEN=your_token_here

[Install]
WantedBy=multi-user.target

Bash — install & manage

# Install and enable
sudo cp github-finder.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now github-finder

# Follow logs live
journalctl -fu github-finder

# One-shot status
systemctl status github-finder

Option B — Python PID file (cross-platform)

Python · commands/daemon.py

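import os
import signal
import sys
from pathlib import Path

import typer

PID_FILE = Path.home() / ".local/share/github-finder/daemon.pid"
LOG_DIR  = Path.home() / ".local/share/github-finder/logs"

def _is_running(pid: int) -> bool:
    """Probe the PID with signal 0, which checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True          # process exists but belongs to another user
    return True

def start_daemon():
    if PID_FILE.exists():
        old_pid = int(PID_FILE.read_text())
        if _is_running(old_pid):
            typer.echo(f"Already running (PID {old_pid})", err=True)
            raise typer.Exit(1)
        PID_FILE.unlink()    # stale file left behind by a crashed daemon

    # Create both directories before forking so the child can open its log
    PID_FILE.parent.mkdir(parents=True, exist_ok=True)
    LOG_DIR.mkdir(parents=True, exist_ok=True)

    pid = os.fork()          # Unix only; use subprocess on Windows
    if pid > 0:              # parent: write PID and exit
        PID_FILE.write_text(str(pid))
        typer.echo(f"Daemon started (PID {pid})")
        sys.exit(0)

    # Child: detach from terminal
    os.setsid()
    sys.stdin  = open(os.devnull)
    sys.stdout = open(LOG_DIR / "daemon.log", "a")
    sys.stderr = sys.stdout
    run_watcher_loop()       # blocking APScheduler loop, defined elsewhere in the package

def stop_daemon():
    if not PID_FILE.exists():
        typer.echo("Daemon is not running.", err=True)
        raise typer.Exit(1)
    pid = int(PID_FILE.read_text())
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        typer.echo("Daemon was not running; removed stale PID file.", err=True)
    else:
        typer.echo(f"Daemon stopped (PID {pid})")
    finally:
        PID_FILE.unlink()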
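
Since stop_daemon() delivers SIGTERM, the watcher needs a handler that lets it clean up before exiting. The following is one possible sketch, assuming a plain sleep loop instead of APScheduler; `ShutdownRequested`, `run_until_signalled`, and the `run_once` and `on_exit` callbacks are hypothetical names introduced for illustration.

```python
import signal
import time
from typing import Callable


class ShutdownRequested(Exception):
    """Raised from the signal handler to unwind the watcher loop."""


def _raise_shutdown(signum, frame):
    raise ShutdownRequested(signal.Signals(signum).name)


def run_until_signalled(
    run_once: Callable[[], None],
    interval: float,
    on_exit: Callable[[], None],
) -> str:
    """Run the job on an interval until SIGTERM or SIGINT arrives.

    Returns the name of the signal that ended the loop. Signal handlers can
    only be installed from the main thread.
    """
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, _raise_shutdown)
    try:
        while True:
            run_once()
            time.sleep(interval)  # interrupted at once when the handler raises
    except ShutdownRequested as exc:
        reason = str(exc)
    finally:
        on_exit()  # flush logs, write the partial report, remove the PID file
    return reason
```

Raising from the handler interrupts a long sleep immediately, whereas a handler that only sets a flag would be noticed only when the current sleep ends.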

07 Structured Logging

Every daemon run must leave a structured, machine-readable trace. Use Python's built-in RotatingFileHandler with a custom JSON formatter so logs are both human-readable and parseable by tools like jq.

Python · output/logger.py

import datetime
import json
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

from rich.logging import RichHandler

LOG_DIR = Path.home() / ".local/share/github-finder/logs"
LOG_DIR.mkdir(parents=True, exist_ok=True)

# Attribute names every LogRecord carries; anything else arrived via extra={...}
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}

class JSONFormatter(logging.Formatter):
    def format(self, record) -> str:
        return json.dumps({
            # Timezone-aware UTC timestamp taken from the record itself
            "ts":      datetime.datetime.fromtimestamp(record.created, datetime.timezone.utc).isoformat(),
            "level":   record.levelname,
            "module":  record.module,
            "message": record.getMessage(),
            # logging copies extra={...} keys onto the record, so collect them here
            "extra":   {k: v for k, v in vars(record).items() if k not in _STANDARD_ATTRS},
        }, default=str)

def setup_logger(verbose: bool = False) -> logging.Logger:
    log = logging.getLogger("github_finder")
    log.setLevel(logging.DEBUG if verbose else logging.INFO)
    if log.handlers:            # already configured; avoid duplicate output
        return log

    # Rotating file: 5 MB per file, keep 3 backups
    fh = RotatingFileHandler(
        LOG_DIR / "app.log",
        maxBytes=5 * 1024 * 1024,
        backupCount=3,
        encoding="utf-8",
    )
    fh.setFormatter(JSONFormatter())

    # Pretty console output via Rich
    ch = RichHandler(rich_tracebacks=True)
    ch.setLevel(logging.WARNING)

    log.addHandler(fh)
    log.addHandler(ch)
    return log

# Usage: log.info("Run complete", extra={"results": 12, "new": 4})

Pro tip — query logs with jq

Because every log line is valid JSON you can filter your daemon logs like a database:
jq 'select(.level=="ERROR")' app.log
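
The same filter can be written in Python when jq is not available. This is a small sketch that assumes one JSON object per log line, as the formatter above produces; `select_level` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def select_level(lines: Iterable[str], level: str) -> Iterator[dict]:
    """Yield parsed log entries whose "level" field equals the given level."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("level") == level:
            yield entry
```

An open file object can be passed as `lines`, so the log is streamed line by line rather than loaded into memory.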

08 Report Generation

After each scheduled run the daemon writes a timestamped report. Support three formats so users can choose what integrates with their workflow.

Python · output/reporter.py

import datetime
import json
from pathlib import Path

from jinja2 import Environment, PackageLoader

REPORT_DIR = Path.home() / ".local/share/github-finder/reports"

def write_report(results: list[dict], fmt: str = "markdown") -> Path:
    ts   = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    stem = REPORT_DIR / f"report_{ts}"
    REPORT_DIR.mkdir(parents=True, exist_ok=True)

    if fmt == "json":
        path = stem.with_suffix(".json")
        path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    elif fmt == "markdown":
        path = stem.with_suffix(".md")
        lines = [
            f"# GitHub Finder Report - {ts}\n",
            f"**{len(results)} repositories found**\n",
        ]
        for r in results:
            lines += [
                f"## [{r['full_name']}]({r['html_url']})",
                f"- ⭐ {r['stargazers_count']}  |  🍴 {r['forks_count']}",
                # GitHub sends null for a missing language or description,
                # so fall back with `or` rather than a .get() default
                f"- `{r.get('language') or 'N/A'}`  |  Updated: {r['updated_at'][:10]}",
                f"\n{r.get('description') or ''}\n",
            ]
        # Explicit UTF-8 so the emoji also write correctly on Windows
        path.write_text("\n".join(lines), encoding="utf-8")

    elif fmt == "html":
        # PackageLoader reads from a templates/ directory inside the github_finder
        # package; autoescape keeps repo descriptions from injecting markup.
        env  = Environment(loader=PackageLoader("github_finder"), autoescape=True)
        tmpl = env.get_template("report.html.j2")
        path = stem.with_suffix(".html")
        path.write_text(tmpl.render(results=results, ts=ts), encoding="utf-8")

    else:
        raise ValueError(f"Unsupported report format: {fmt!r}")

    return path

09 Library Reference

Purpose	Library	Install
CLI framework	`typer` + `rich`	`pip install "typer[all]"`
Async HTTP client	`httpx`	`pip install httpx`
Retry / back-off	`tenacity`	`pip install tenacity`
Background scheduling	`APScheduler`	`pip install apscheduler`
Config files (TOML)	`tomllib` (stdlib 3.11+)	built-in
HTML reports	`jinja2`	`pip install jinja2`
Local state / cache	`sqlite3`	built-in
Secure token storage	`keyring`	`pip install keyring`
Python daemon	`python-daemon`	`pip install python-daemon`
Testing	`pytest` + `respx`	`pip install pytest respx`
Packaging	`hatch` or `poetry`	`pip install hatch`

10 Best Practices

Never hardcode tokens. Read from GITHUB_TOKEN env var or OS keyring. Add the config file to .gitignore.
Idempotent daemon runs. Store seen repo IDs in SQLite so each run only emits new results, not the same list every hour.
Graceful shutdown. Catch SIGTERM / SIGINT. Flush the log and write a partial report before exiting.
Config hierarchy. CLI flags → env vars → TOML config → hardcoded defaults. Never skip levels.
Rate limit budgeting. Before each batch, check X-RateLimit-Remaining. If below 5, sleep until X-RateLimit-Reset.
Dry-run flag. --dry-run prints the constructed query string without hitting the API. Essential for CI and debugging.
Structured logs everywhere. Each run should log: start time, filters used, results count, new results count, API calls made, and any errors.
Test with mocks. Use respx to mock httpx responses in unit tests. Never hit the real API in CI.
Pin dependencies. Use pip-compile or Poetry lock files. A floating requests==2.* will bite you.
Semantic versioning. Tag releases as v1.0.0. Expose the version via --version and in log headers.
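
The "idempotent daemon runs" practice above can be sketched with the standard-library sqlite3 module. This is a minimal sketch, not the tutorial's state/store.py: the `SeenStore` class, its table name, and the in-memory default path are assumptions for illustration.

```python
import sqlite3


class SeenStore:
    """Remember repository IDs so each run reports only new repositories."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen (repo_id INTEGER PRIMARY KEY)"
        )

    def filter_new(self, repos: list[dict]) -> list[dict]:
        """Return the repos not seen before and record them as seen."""
        new = []
        with self.db:  # one transaction: commit on success, roll back on error
            for repo in repos:
                cur = self.db.execute(
                    "INSERT OR IGNORE INTO seen (repo_id) VALUES (?)",
                    (repo["id"],),
                )
                if cur.rowcount == 1:  # 1 = inserted, 0 = ID already present
                    new.append(repo)
        return new
```

Letting the primary key reject duplicates keeps the check and the write in a single statement, so two overlapping runs cannot both report the same repository.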
