00 Overview & Goals
By the end of this tutorial you will have a fully working CLI tool named github-finder that lets you:
- Search GitHub repositories with compound filters — language, star count, topics, license, push date, and more.
- Run as a background daemon that polls on a schedule and only surfaces new results.
- Write structured JSON logs with automatic rotation (5 MB per file, 3 backups).
- Generate reports in Markdown, JSON, or HTML after every run.
- Follow professional CLI standards: exit codes, config hierarchy, shell autocompletion, and a clean --help.
Prerequisites: pip and virtual environments. No prior CLI framework experience is required.
01 System Architecture
The tool is split into four horizontal layers, each with a single responsibility. The diagram below shows how a user command flows all the way to a report file and how the daemon sits as a wrapper around that same pipeline.
Data always flows top → bottom. The daemon is not a separate code path — it is simply a wrapper that calls the same pipeline on a schedule. This keeps the codebase clean and testable.
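The "daemon as a wrapper" idea can be sketched as follows. This is a minimal illustration, not the tutorial's actual code: run_pipeline, watch, and the fetch/report callables are hypothetical names, and max_runs exists only to keep the example finite.

```python
import time
from typing import Callable, Iterable


def run_pipeline(
    fetch: Callable[[], Iterable[dict]],
    report: Callable[[list[dict]], None],
) -> int:
    """Run one search-to-report pass and return the number of results."""
    results = list(fetch())
    report(results)
    return len(results)


def watch(
    fetch: Callable[[], Iterable[dict]],
    report: Callable[[list[dict]], None],
    interval: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> list[int]:
    """Call the same pipeline repeatedly; there is no second code path."""
    counts = []
    for run in range(max_runs):
        counts.append(run_pipeline(fetch, report))
        if run < max_runs - 1:
            sleep(interval)  # injectable so tests do not actually wait
    return counts
```

Because the one-shot command and the scheduled loop both go through run_pipeline, a unit test of that single function covers both modes.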
02 Project Structure
Keep the package flat enough to navigate in one glance, deep enough to prevent any single file from growing unmanageable.
The src/ layout ensures that tests import the installed package rather than accidentally picking up the local source tree. Always use pip install -e . for development.
03 CLI Standards & Typer Setup
Use Typer (built on Click) as the CLI framework. It derives argument types, help text, and validation directly from Python type hints — no boilerplate decorators needed. Pair it with Rich for coloured output and tables.
Entry point — cli.py
import typer

from .commands import search, watch, daemon

app = typer.Typer(
    name="github-finder",
    help="Find GitHub repos with advanced filters.",
    add_completion=True,        # enables --install-completion
    rich_markup_mode="rich",    # coloured help text
    no_args_is_help=True,
)

# Each command module exposes its own Typer app, mounted as a sub-command.
app.add_typer(search.app, name="search")
app.add_typer(watch.app, name="watch")
app.add_typer(daemon.app, name="daemon")


@app.callback(invoke_without_command=True)
def main(version: bool = typer.Option(False, "--version", "-V")):
    if version:
        typer.echo("github-finder v1.0.0")
        raise typer.Exit()


if __name__ == "__main__":
    app()
Conventions to follow
| Convention | Rule |
|---|---|
| Exit codes | 0 = success · 1 = user error · 2 = runtime/API error |
| Output format | --output table\|json\|markdown\|html (default: table) |
| Config hierarchy | CLI flags → env vars → ~/.config/github-finder/config.toml → defaults |
| Token security | Read from GITHUB_TOKEN env var or OS keyring; never hardcode |
| Verbosity | -v / --verbose bumps log level to DEBUG |
| Dry run | --dry-run shows the API query without hitting GitHub |
| Errors | Never print raw tracebacks to stdout; log the trace, show a clean message |
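The config hierarchy row above can be sketched as a single lookup function. This is a minimal sketch under stated assumptions: resolve_setting, the GITHUB_FINDER_ environment prefix, and the DEFAULTS values are all hypothetical, chosen only to illustrate the precedence order.

```python
import os
from typing import Any, Mapping, Optional

# Illustrative defaults only; the real tool may define different ones.
DEFAULTS = {"output": "table", "interval": 3600}


def resolve_setting(
    name: str,
    cli_value: Optional[Any],
    file_config: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
    env_prefix: str = "GITHUB_FINDER_",  # hypothetical prefix
) -> Any:
    """Return the first value found: CLI flag, env var, config file, default."""
    if cli_value is not None:
        return cli_value
    env_key = env_prefix + name.upper()
    if env_key in env:
        return env[env_key]  # note: environment values are always strings
    if name in file_config:
        return file_config[name]
    return DEFAULTS[name]
```

Here file_config would typically be the dictionary returned by tomllib.load() for the config.toml file, or an empty dict when the file does not exist.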
04 The Filter Engine
All filters are encapsulated in a single SearchFilters dataclass. The to_github_query() method translates them into the GitHub search query syntax.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchFilters:
    query: str
    language: Optional[str] = None
    min_stars: int = 0
    max_stars: Optional[int] = None
    topics: list[str] = field(default_factory=list)
    license: Optional[str] = None       # mit, apache-2.0, gpl-3.0 …
    pushed_after: Optional[str] = None  # ISO date "2024-01-01"
    archived: bool = False
    has_issues: Optional[bool] = None   # not used by to_github_query()
    sort_by: str = "stars"              # stars|forks|updated
    order: str = "desc"

    def to_github_query(self) -> str:
        parts = [self.query]
        if self.language:
            parts.append(f"language:{self.language}")
        # Apply max_stars even when min_stars is left at its default of 0.
        if self.max_stars is not None:
            parts.append(f"stars:{self.min_stars}..{self.max_stars}")
        elif self.min_stars:
            parts.append(f"stars:>={self.min_stars}")
        for t in self.topics:
            parts.append(f"topic:{t}")
        if self.license:
            parts.append(f"license:{self.license}")
        if self.pushed_after:
            parts.append(f"pushed:>{self.pushed_after}")
        if not self.archived:
            parts.append("archived:false")
        return " ".join(parts)
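To show what a query string looks like once it is sent, and what a --dry-run could print, here is a small sketch that URL-encodes it with the standard library. build_search_url is a hypothetical helper, and the example query is written out by hand rather than produced by SearchFilters.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query: str, sort: str = "stars", order: str = "desc",
                     per_page: int = 30, page: int = 1) -> str:
    """Return the full request URL for a query string without sending it."""
    params = {"q": query, "sort": sort, "order": order,
              "per_page": per_page, "page": page}
    # urlencode turns spaces into "+" and percent-encodes ":", ">" and "=".
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url("cli language:python stars:>=100 archived:false")
print(url)
```

Printing the encoded URL makes it easy to paste the request into a browser or curl while debugging a filter combination.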
05 GitHub API Client
Use httpx for async HTTP and tenacity for retry logic with exponential back-off. The GitHub Search API enforces a strict secondary rate limit, so you must honour the Retry-After header on 429 responses.
Also watch the X-RateLimit-Remaining header and sleep when it approaches zero. Cache results for at least 60 seconds to avoid redundant calls.
import asyncio

import httpx
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type


class GitHubClient:
    BASE = "https://api.github.com"

    def __init__(self, token: str):
        self.client = httpx.AsyncClient(
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/vnd.github+json",
                "X-GitHub-Api-Version": "2022-11-28",
            },
            timeout=15.0,
        )

    @retry(
        wait=wait_exponential(multiplier=1, min=4, max=60),
        stop=stop_after_attempt(5),
        retry=retry_if_exception_type(httpx.HTTPStatusError),
    )
    async def search_repos(self, filters, page=1, per_page=30):
        resp = await self.client.get(
            f"{self.BASE}/search/repositories",
            params={
                "q": filters.to_github_query(),
                "sort": filters.sort_by,
                "order": filters.order,
                "per_page": per_page,
                "page": page,
            },
        )
        if resp.status_code == 429:
            # Honour the server's back-off hint before tenacity retries.
            retry_after = int(resp.headers.get("Retry-After", 60))
            await asyncio.sleep(retry_after)
        # Raises httpx.HTTPStatusError on any 4xx/5xx, which triggers the retry.
        resp.raise_for_status()
        return resp.json()
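The rate-limit advice above can be reduced to one pure function that decides how long to pause. This is a sketch under assumptions: seconds_to_wait is a hypothetical helper, and the threshold of 5 remaining requests is borrowed from the best-practices list later in this tutorial. It takes a plain mapping, so header names must match the capitalisation shown (httpx's own headers object is case-insensitive).

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None,
                    min_remaining: int = 5) -> float:
    """Return how long to pause before the next request (0.0 if no pause)."""
    now = time.time() if now is None else now
    # An explicit Retry-After value (in seconds) takes priority.
    if "Retry-After" in headers:
        return float(headers["Retry-After"])
    remaining = int(headers.get("X-RateLimit-Remaining", min_remaining))
    if remaining >= min_remaining:
        return 0.0
    # X-RateLimit-Reset is a UTC epoch timestamp in seconds.
    reset = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now)
```

Keeping the decision separate from the sleep call means it can be unit-tested with fixed timestamps and no network access.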
06 Daemon Mode
Two strategies — choose based on your deployment target. For servers, prefer systemd. For development / cross-platform, use the Python PID-file approach.
Option A — systemd service (recommended on Linux)
[Unit]
Description=GitHub Finder background watcher
After=network-online.target
[Service]
Type=simple
# Placeholder account name. %i only expands in template units (name@.service).
User=github-finder
ExecStart=/usr/local/bin/github-finder watch --interval 3600
Restart=on-failure
RestartSec=30
StandardOutput=journal
StandardError=journal
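# Placeholder value. For real deployments, prefer EnvironmentFile= pointing at a root-readable file.
Environment=GITHUB_TOKEN=your_token_here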
[Install]
WantedBy=multi-user.target
# Install and enable
sudo cp github-finder.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now github-finder
# Follow logs live
journalctl -fu github-finder
# One-shot status
systemctl status github-finder
Option B — Python PID file (cross-platform)
import os
import signal
import sys
from pathlib import Path

import typer

PID_FILE = Path.home() / ".local/share/github-finder/daemon.pid"
LOG_DIR = Path.home() / ".local/share/github-finder/logs"


def start_daemon():
    if PID_FILE.exists():
        typer.echo(f"Already running (PID {PID_FILE.read_text()})", err=True)
        raise typer.Exit(1)
    pid = os.fork()  # Unix only; use subprocess on Windows
    if pid > 0:      # parent: write PID and exit
        PID_FILE.parent.mkdir(parents=True, exist_ok=True)
        PID_FILE.write_text(str(pid))
        typer.echo(f"Daemon started (PID {pid})")
        sys.exit(0)
    # Child: detach from terminal
    os.setsid()
    LOG_DIR.mkdir(parents=True, exist_ok=True)  # the log directory may not exist yet
    sys.stdin = open(os.devnull)
    sys.stdout = open(LOG_DIR / "daemon.log", "a")
    sys.stderr = sys.stdout
    run_watcher_loop()  # blocking APScheduler loop (not shown here)


def stop_daemon():
    if not PID_FILE.exists():
        typer.echo("Daemon is not running.")
        raise typer.Exit(1)
    pid = int(PID_FILE.read_text())
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # stale PID file: the process is already gone
    PID_FILE.unlink()
    typer.echo(f"Daemon stopped (PID {pid})")
07 Structured Logging
Every daemon run must leave a structured, machine-readable trace. Use Python's built-in RotatingFileHandler with a custom JSON formatter so logs are both human-readable and parseable by tools like jq.
import datetime
import json
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

from rich.logging import RichHandler

LOG_DIR = Path.home() / ".local/share/github-finder/logs"
LOG_DIR.mkdir(parents=True, exist_ok=True)

# Attribute names every LogRecord carries; anything else arrived via extra={...}.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}


class JSONFormatter(logging.Formatter):
    def format(self, record) -> str:
        # logging copies the keys of extra={...} onto the record itself,
        # so collect them by excluding the standard attributes.
        extra = {k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS}
        return json.dumps({
            "ts": datetime.datetime.fromtimestamp(record.created, datetime.timezone.utc).isoformat(),
            "level": record.levelname,
            "module": record.module,
            "message": record.getMessage(),
            "extra": extra,
        }, default=str)


def setup_logger(verbose: bool = False) -> logging.Logger:
    log = logging.getLogger("github_finder")
    log.setLevel(logging.DEBUG if verbose else logging.INFO)
    if log.handlers:  # already configured; avoid attaching duplicate handlers
        return log
    # Rotating file: 5 MB per file, keep 3 backups
    fh = RotatingFileHandler(
        LOG_DIR / "app.log",
        maxBytes=5 * 1024 * 1024,
        backupCount=3,
    )
    fh.setFormatter(JSONFormatter())
    # Pretty console output via Rich
    ch = RichHandler(rich_tracebacks=True)
    ch.setLevel(logging.WARNING)
    log.addHandler(fh)
    log.addHandler(ch)
    return log

# Usage: log.info("Run complete", extra={"results": 12, "new": 4})
cat app.log | jq 'select(.level=="ERROR")'
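The same filter can be written in Python when jq is not available. This is a minimal sketch: filter_log_lines is a hypothetical helper, and the sample records are made up for illustration, following the field names used by the JSON formatter above.

```python
import json
from typing import Iterable, Iterator


def filter_log_lines(lines: Iterable[str], level: str) -> Iterator[dict]:
    """Yield parsed log records whose "level" field equals the given level."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and record.get("level") == level:
            yield record


# Made-up sample lines standing in for the contents of app.log.
sample = [
    '{"ts": "2024-01-01T00:00:00", "level": "INFO", "message": "Run complete"}',
    '{"ts": "2024-01-01T01:00:00", "level": "ERROR", "message": "API timeout"}',
    "not json",
]
errors = list(filter_log_lines(sample, "ERROR"))
```

Passing an open file object as lines works the same way, since iterating over a text file yields one line at a time.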
08 Report Generation
After each scheduled run the daemon writes a timestamped report. Support three formats so users can choose what integrates with their workflow.
import datetime
import json
from pathlib import Path

from jinja2 import Environment, PackageLoader

REPORT_DIR = Path.home() / ".local/share/github-finder/reports"


def write_report(results: list[dict], fmt: str = "markdown") -> Path:
    ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    stem = REPORT_DIR / f"report_{ts}"
    REPORT_DIR.mkdir(parents=True, exist_ok=True)
    if fmt == "json":
        path = stem.with_suffix(".json")
        path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    elif fmt == "markdown":
        path = stem.with_suffix(".md")
        lines = [
            f"# GitHub Finder Report — {ts}\n",
            f"**{len(results)} repositories found**\n",
        ]
        for r in results:
            lines += [
                f"## [{r['full_name']}]({r['html_url']})",
                f"- ⭐ {r['stargazers_count']} | 🍴 {r['forks_count']}",
                # The API returns null for a missing language or description,
                # so fall back with `or` rather than a .get() default.
                f"- `{r.get('language') or 'N/A'}` | Updated: {r['updated_at'][:10]}",
                f"\n{r.get('description') or ''}\n",
            ]
        path.write_text("\n".join(lines), encoding="utf-8")
    elif fmt == "html":
        # autoescape=True so repository descriptions cannot inject markup
        env = Environment(loader=PackageLoader("github_finder"), autoescape=True)
        tmpl = env.get_template("report.html.j2")
        path = stem.with_suffix(".html")
        path.write_text(tmpl.render(results=results, ts=ts), encoding="utf-8")
    else:
        raise ValueError(f"Unsupported report format: {fmt!r}")
    return path
09 Library Reference
| Purpose | Library | Install |
|---|---|---|
| CLI framework | typer + rich | pip install "typer[all]" |
| Async HTTP client | httpx | pip install httpx |
| Retry / back-off | tenacity | pip install tenacity |
| Background scheduling | APScheduler | pip install apscheduler |
| Config files (TOML) | tomllib (stdlib 3.11+) | built-in |
| HTML reports | jinja2 | pip install jinja2 |
| Local state / cache | sqlite3 | built-in |
| Secure token storage | keyring | pip install keyring |
| Python daemon | python-daemon | pip install python-daemon |
| Testing | pytest + respx | pip install pytest respx |
| Packaging | hatch or poetry | pip install hatch |
10 Best Practices
- Never hardcode tokens. Read from the GITHUB_TOKEN env var or OS keyring. Add the config file to .gitignore.
- Idempotent daemon runs. Store seen repo IDs in SQLite so each run only emits new results, not the same list every hour.
- Graceful shutdown. Catch SIGTERM/SIGINT. Flush the log and write a partial report before exiting.
- Config hierarchy. CLI flags → env vars → TOML config → hardcoded defaults. Never skip levels.
- Rate limit budgeting. Before each batch, check X-RateLimit-Remaining. If below 5, sleep until X-RateLimit-Reset.
- Dry-run flag. --dry-run prints the constructed query string without hitting the API. Essential for CI and debugging.
- Structured logs everywhere. Each run should log: start time, filters used, results count, new results count, API calls made, and any errors.
- Test with mocks. Use respx to mock httpx responses in unit tests. Never hit the real API in CI.
- Pin dependencies. Use pip-compile or Poetry lock files. A floating requests==2.* will bite you.
- Semantic versioning. Tag releases as v1.0.0. Expose the version via --version and in log headers.
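The "idempotent daemon runs" practice above can be sketched with the built-in sqlite3 module. This is a minimal sketch under assumptions: SeenStore and its seen table are hypothetical names, and each result is assumed to carry the numeric "id" field that GitHub search results include.

```python
import sqlite3


class SeenStore:
    """Remember repository IDs across runs so only new ones are reported."""

    def __init__(self, db_path: str = ":memory:"):
        # Pass a file path instead of ":memory:" to persist between runs.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (repo_id INTEGER PRIMARY KEY)"
        )

    def filter_new(self, repos: list[dict]) -> list[dict]:
        """Return repos not seen before and record them as seen."""
        new = []
        for repo in repos:
            cur = self.conn.execute(
                "INSERT OR IGNORE INTO seen (repo_id) VALUES (?)", (repo["id"],)
            )
            if cur.rowcount == 1:  # a row was inserted, so the ID is new
                new.append(repo)
        self.conn.commit()
        return new
```

Letting the primary key reject duplicates keeps the check and the write in one statement, so a repository is never reported twice even if it appears twice in the same batch.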