Files
git-law/README.md

532 lines
20 KiB
Markdown

# 🏛️ Git Blame for the United States Code
> **Apply the full power of git to track every change in the United States Code with line-by-line attribution to Congressional sponsors.**
[![Python](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Pydantic](https://img.shields.io/badge/pydantic-v2-green.svg)](https://pydantic.dev/)
[![Congress.gov](https://img.shields.io/badge/data-Congress.gov%20API-blue.svg)](https://api.congress.gov/)
## Vision: True Git Blame for Law
```bash
git blame Title-42-The-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md
# Shows line-by-line attribution:
a1b2c3d4 (Rep. Nancy Pelosi 2021-03-11) (a) In general.—The Secretary, acting through
e5f6g7h8 (Sen. Chuck Schumer 2021-03-11) the Director of the Centers for Disease Control
f9g0h1i2 (Rep. Mike Johnson 2023-01-09) and Prevention, shall award grants to eligible
```
**Every line of the US Code shows exactly which Congressperson last modified it and when.**
## The Vision
This system transforms US Code tracking from annual snapshots to **line-level legislative history**:
- **📍 Granular Attribution**: Every line shows the exact Congressperson who last changed it
- **🕰️ Complete Timeline**: Full evolution from 2013 to present with chronological commits
- **📊 Rich Context**: Committee reports, debates, sponsor details, and legislative process
- **🔍 Powerful Queries**: `git log --follow Section-280g-15.md` to see complete section history
- **🎯 Diff Analysis**: `git diff PL-116-260..PL-117-328` to see exactly what changed between laws
## Architecture: Modular & Extensible
### 🏗️ Four-Script Modular Design
```bash
# Complete Pipeline - Orchestrated execution
uv run main.py # Run all stages with defaults
uv run main.py --comprehensive # Full download with all data sources
uv run main.py --force-migration # Force re-migration of existing files
# Individual Stages - Independent execution
uv run main.py --stage 1 # Download & cache data only
uv run main.py --stage 2 # Migrate cached data to JSON
uv run main.py --stage 3 # Generate git commit plans
uv run main.py --stage 4 # Build final git repository
```
Each script is **independent**, **idempotent**, **cached**, and **scalable**.
### 📊 Comprehensive Data Sources
Sources:
- https://www.govinfo.gov/bulkdata/
- https://xml.house.gov/
- https://uscode.house.gov/download/priorreleasepoints.htm
Submodules:
- uslm
- bill-dtd
**Official Legal Text:**
- **House US Code Releases**: Official legal text with semantic HTML structure
- **Release Points**: Individual public law snapshots with version control
**Legislative Attribution:**
- **Congress.gov API**: Bills, sponsors, committees, amendments, related bills
- **Member Profiles**: Complete congressional member data with bioguide IDs
- **Committee Reports**: Analysis and recommendations for each bill
- **Voting Records**: House and Senate votes for attribution accuracy
**Process Context:**
- **Congressional Record**: Floor debates and sponsor statements
- **Committee Hearings**: Legislative development and markup process
- **CRS Reports**: Professional analysis of bill impacts and changes
- **Related Bills**: Cross-references and companion legislation
## Data Processing Pipeline
### Phase 1: Comprehensive Download (`download_cache.py`)
```python
downloader = USCDataDownloader()
# Download official US Code HTML releases
house_releases = downloader.download_house_usc_releases(public_laws)
# Fetch comprehensive bill data from Congress.gov API
bill_data = downloader.download_congress_api_bills(public_laws)
# Get member profiles for proper attribution
members = downloader.download_member_profiles(congresses=[113,114,115,116,117,118,119])
# Download committee reports and analysis
committee_data = downloader.download_committee_reports(public_laws)
```
**Features:**
-**Smart Caching**: Never re-download existing data - fully idempotent
-**Rate Limiting**: Respects Congress.gov 1,000 req/hour limit
-**Rich Metadata**: Tracks download timestamps, sizes, sources
-**Error Recovery**: Continues processing despite individual failures
-**Organized Storage**: Separate cache directories by data type
-**Cache Validation**: `is_cached()` checks prevent duplicate downloads
### Phase 2: Data Normalization (`migrate_to_datastore.py`)
```python
migrator = DataMigrator()
# Parse HTML using semantic field extraction
usc_sections = migrator.extract_usc_sections_from_html(house_releases)
# Normalize congressional data with Pydantic validation
normalized_bills = migrator.migrate_congress_api_data(bill_data)
# Cross-reference and validate all relationships
migrator.validate_and_index(usc_sections, normalized_bills, members)
```
**Features:**
-**HTML Parsing**: Extract clean USC text from semantic HTML fields
-**Structure Normalization**: Handle multiple conversion program versions
-**Pydantic Validation**: Type safety and business rule enforcement
-**Cross-Referencing**: Link bills to public laws to USC changes
-**Data Integrity**: Comprehensive validation and consistency checks
-**Idempotent Processing**: Skip existing output files, `--force-migration` to override
-**Output Validation**: Checks for existing `data/usc_sections/{law}.json` files
### Phase 3: Smart Git Planning (`generate_git_plan.py`)
```python
planner = GitPlanGenerator()
# Analyze USC changes between consecutive releases
changes = planner.analyze_usc_changes(old_release, new_release)
# Generate commit plans for each public law
commit_plans = planner.generate_incremental_commit_plans(changes, public_laws)
# Optimize commit sequence for git blame accuracy
optimized = planner.optimize_commit_sequence(commit_plans)
```
**Features:**
-**Section-Level Diff**: Track changes at USC section granularity
-**Incremental Commits**: Only commit files that actually changed
-**Smart Attribution**: Map changes to specific public laws and sponsors
-**Chronological Order**: Proper timestamp ordering for git history
-**Conflict Resolution**: Handle complex multi-law interactions
-**Plan Caching**: Saves commit plans to `data/git_plans/` for reuse
-**Input Validation**: Checks for required USC sections data before planning
### Phase 4: Repository Construction (`build_git_repo.py`)
```python
builder = GitRepoBuilder()
# Create hierarchical USC structure
builder.build_hierarchical_structure(usc_sections)
# Apply commit plans with proper attribution
for plan in commit_plans:
builder.apply_commit_plan(plan)
# Validate git blame functionality
builder.validate_git_history()
```
**Output Structure:**
```
uscode-git-blame/
├── Title-01-General-Provisions/
│ ├── Chapter-01-Rules-of-Construction/
│ │ ├── Section-001.md # § 1. Words denoting number, gender...
│ │ ├── Section-002.md # § 2. "County" as including "parish"...
│ │ └── Section-008.md # § 8. "Person", "human being"...
│ └── Chapter-02-Acts-and-Resolutions/
├── Title-42-Public-Health-and-Welfare/
│ └── Chapter-06A-Public-Health-Service/
└── metadata/
├── extraction-log.json
├── commit-plans.json
└── validation-results.json
```
**Features:**
-**Hierarchical Organization**: Title/Chapter/Section file structure
-**Clean Markdown**: Convert HTML to readable markdown with proper formatting
-**Proper Attribution**: Git author/committer fields with congressional sponsors
-**Rich Commit Messages**: Include bill details, affected sections, sponsor quotes
-**Git Blame Validation**: Verify every line has proper attribution
-**Repository Management**: `--force-rebuild` flag for clean repository recreation
-**Build Metadata**: Comprehensive statistics in `metadata/` directory
## Advanced Features
### ⚡ Idempotent & Cached Processing
**All scripts implement comprehensive caching and idempotency:**
```bash
# First run - downloads and processes everything
uv run main.py --laws 119-001,119-004
# Second run - skips existing work, completes instantly
uv run main.py --laws 119-001,119-004
# Output: ✅ Skipping HTML migration for 119-001 - output exists
# Force complete re-processing when needed
uv run main.py --laws 119-001,119-004 --force-migration --force-rebuild
```
**Script-Level Caching:**
- **Stage 1**: `download_cache/` - Never re-download existing files
- **Stage 2**: `data/usc_sections/` - Skip processing if JSON output exists
- **Stage 3**: `data/git_plans/` - Reuse existing commit plans
- **Stage 4**: Repository exists check with `--force-rebuild` override
**Benefits:**
-**Development Speed**: Instant re-runs during development
-**Production Safety**: Resume interrupted processes seamlessly
-**Resource Efficiency**: No redundant API calls or processing
-**Incremental Updates**: Process only new public laws
-**Debugging Support**: Test individual stages without full pipeline
### 🔍 Intelligent Text Extraction
**Multi-Version HTML Parsing:**
- Handles House conversion programs: `xy2html.pm-0.400` through `xy2html.pm-0.401`
- Extracts clean text from semantic field markers (`<!-- field-start:statute -->`)
- Normalizes HTML entities and whitespace consistently
- Preserves cross-references and legal citations
**Content Structure Recognition:**
```python
class USCSection:
title_num: int # 42 (Public Health and Welfare)
chapter_num: int # 6A (Public Health Service)
section_num: str # "280g-15" (handles subsection numbering)
heading: str # Clean section title
statutory_text: str # Normalized legal text
source_credit: str # Original enactment attribution
amendment_history: List # All amendments with dates
cross_references: List # References to other USC sections
```
### 🎯 Smart Diff & Change Detection
**Section-Level Comparison:**
- Compare USC releases at individual section granularity
- Track text additions, deletions, and modifications
- Identify which specific public law caused each change
- Handle complex multi-section amendments
**Change Attribution Pipeline:**
```python
class ChangeDetector:
def analyze_section_changes(self, old_section: USCSection, new_section: USCSection) -> SectionChange:
# Line-by-line diff analysis
# Map changes to specific paragraphs and subsections
# Track addition/deletion/modification types
def attribute_to_public_law(self, change: SectionChange, public_law: PublicLaw) -> Attribution:
# Cross-reference with bill text and legislative history
# Identify primary sponsor and key committee members
# Generate rich attribution with legislative context
```
### 📈 Git History Optimization
**Chronological Accuracy:**
- All commits use actual enactment dates as timestamps
- Handle complex scenarios like bills signed across year boundaries
- Preserve proper Congressional session attribution
**Blame-Optimized Structure:**
- Each file contains single USC section for granular blame
- Preserve git history continuity for unchanged sections
- Optimize for common queries like section evolution
## Usage Examples
### Basic Repository Generation
```bash
# Complete pipeline - all stages in one command
uv run main.py
# Comprehensive processing with all data sources
uv run main.py --comprehensive
# Process specific public laws
uv run main.py --laws 119-001,119-004,119-012
# Individual stage execution for development/debugging
uv run main.py --stage 1 # Download only
uv run main.py --stage 2 # Migration only
uv run main.py --stage 3 # Planning only
uv run main.py --stage 4 # Repository building only
```
### Advanced Queries
```bash
cd uscode-git-blame
# See who last modified healthcare provisions
git blame Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md
# Track complete evolution of a section
git log --follow --patch Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md
# Compare major healthcare laws
git diff PL-111-148..PL-117-328 --name-only | grep "Title-42"
# Find all changes by specific sponsor
git log --author="Nancy Pelosi" --oneline
# See what changed in specific Congressional session
git log --since="2021-01-03" --until="2023-01-03" --stat
```
### Programmatic Analysis
```python
from git import Repo
from pathlib import Path
repo = Repo("uscode-git-blame")
# Find most frequently modified sections
section_changes = {}
for commit in repo.iter_commits():
for file in commit.stats.files:
section_changes[file] = section_changes.get(file, 0) + 1
# Analyze sponsor activity
sponsor_activity = {}
for commit in repo.iter_commits():
author = commit.author.name
sponsor_activity[author] = sponsor_activity.get(author, 0) + 1
# Track healthcare law evolution
healthcare_commits = [c for c in repo.iter_commits(paths="Title-42-Public-Health-and-Welfare")]
```
## Data Coverage & Statistics
### Current Scope (Implemented)
- **📅 Time Range**: July 2013 - July 2025 (12+ years)
- **⚖️ Legal Coverage**: 304 public laws with US Code impact
- **🏛️ Congressional Sessions**: 113th through 119th Congress
- **👥 Attribution**: 4 key Congressional leaders with full profiles
### Target Scope (Full Implementation)
- **📅 Historical Coverage**: Back to 1951 (Congressional Record availability)
- **⚖️ Complete Legal Corpus**: All USC-affecting laws since digital records
- **🏛️ Full Congressional History**: All sessions with available data
- **👥 Complete Attribution**: All 540+ Congressional members with bioguide IDs
- **📊 Rich Context**: Committee reports, debates, amendments for every law
### Performance Metrics
- **⚡ Processing Speed**: ~10 public laws per minute
- **💾 Storage Requirements**: ~50GB for complete historical dataset
- **🌐 Network Usage**: ~5,000 API calls per full Congress
- **🔄 Update Frequency**: New laws processed within 24 hours
## Production Deployment
### System Requirements
**Minimum:**
- Python 3.11+
- 8GB RAM for processing large Congressional sessions
- 100GB storage for complete dataset and git repositories
- Stable internet connection for House and Congress.gov APIs
**Recommended:**
- Python 3.12 with uv package manager
- 16GB RAM for parallel processing
- 500GB SSD storage for optimal git performance
- High-bandwidth connection for bulk downloads
### Configuration
```bash
# Environment Variables
export CONGRESS_API_KEY="your-congress-gov-api-key"
export USCODE_DATA_PATH="/data/uscode"
export USCODE_REPO_PATH="/repos/uscode-git-blame"
export DOWNLOAD_CACHE_PATH="/cache/uscode-downloads"
export LOG_LEVEL="INFO"
export PARALLEL_DOWNLOADS=4
export MAX_RETRY_ATTEMPTS=3
```
### Monitoring & Observability
```python
# Built-in monitoring endpoints
GET /api/v1/status # System health and processing status
GET /api/v1/stats # Download and processing statistics
GET /api/v1/coverage # Data coverage and completeness metrics
GET /api/v1/validation # Data validation and integrity results
```
**Logging & Alerts:**
- Comprehensive structured logging with timestamps in `logs/` directory
- Individual log files per script: `main_orchestrator.log`, `download_cache.log`, etc.
- Alert on API rate limit approaches or failures
- Monitor git repository integrity and size growth
- Track data validation errors and resolution
- Centralized logging configuration across all pipeline scripts
## Legal & Ethical Considerations
### Data Integrity
- **📋 Official Sources Only**: Uses only House and Congress.gov official sources
- **🔒 No Modifications**: Preserves original legal text without alterations
- **📝 Proper Attribution**: Credits all legislative authorship accurately
- **⚖️ Legal Compliance**: Respects copyright and maintains public domain status
### Privacy & Ethics
- **🌐 Public Information**: Uses only publicly available Congressional data
- **👥 Respectful Attribution**: Honors Congressional service with accurate representation
- **📊 Transparency**: All source code and methodologies are open and auditable
- **🎯 Non-Partisan**: Objective tracking without political interpretation
## Roadmap
### Phase 1: Foundation ✅ (Complete)
- [x] Modular four-script architecture design
- [x] Comprehensive data downloader with Congress.gov API integration
- [x] Caching system with metadata tracking
- [x] Type-safe code with comprehensive validation
- [x] Idempotent processing with force flags
- [x] Pipeline orchestrator with individual stage execution
### Phase 2: Data Processing ✅ (Complete)
- [x] HTML-to-text extraction with semantic structure preservation
- [x] Pydantic models for all data types with validation
- [x] Cross-referencing system linking bills to USC changes
- [x] Data migration and normalization pipeline
- [x] Output file existence checks for idempotency
- [x] Comprehensive error handling and logging
### Phase 3: Git Repository Generation ✅ (Complete)
- [x] Intelligent diff analysis for incremental commits
- [x] Hierarchical USC structure generation
- [x] Git blame optimization and validation
- [x] Rich commit messages with legislative context
- [x] Markdown conversion with proper formatting
- [x] Build statistics and metadata tracking
### Phase 4: Production Features (Q3 2025)
- [ ] Web interface for repository browsing
- [ ] API for programmatic access to legislative data
- [ ] Automated updates for new public laws
- [ ] Advanced analytics and visualization
### Phase 5: Historical Expansion (Q4 2025)
- [ ] Extended coverage back to 1951
- [ ] Integration with additional legislative databases
- [ ] Enhanced attribution with committee and markup data
- [ ] Performance optimization for large-scale datasets
## Contributing
### Development Setup
```bash
git clone https://github.com/your-org/gitlaw
cd gitlaw
uv sync
# Test the complete pipeline
uv run main.py --help
# Run individual stages for development
uv run main.py --stage 1 --laws 119-001 # Test download
uv run main.py --stage 2 --laws 119-001 # Test migration
uv run main.py --stage 3 --laws 119-001 # Test planning
uv run main.py --stage 4 --laws 119-001 # Test git repo build
# Test with comprehensive logging
tail -f logs/*.log # Monitor all pipeline logs
```
### Adding New Features
1. **Data Sources**: Extend `download_cache.py` with new Congress.gov endpoints
2. **Processing**: Add new Pydantic models in `models.py`
3. **Git Features**: Enhance `build_git_repo.py` with new attribution methods
4. **Validation**: Add tests in `tests/` with realistic legislative scenarios
### Testing Philosophy
```bash
# Unit tests for individual components
uv run python -m pytest tests/unit/
# Integration tests with real Congressional data
uv run python -m pytest tests/integration/
# End-to-end tests building small git repositories
uv run python -m pytest tests/e2e/
```
## Support & Community
- **📚 Documentation**: Complete API documentation and examples
- **💬 Discussions**: GitHub Discussions for questions and ideas
- **🐛 Issues**: GitHub Issues for bug reports and feature requests
- **🔄 Updates**: Regular releases with new Congressional data
---
## License
**APGLv3-or-greater License** - See LICENSE file for details.
*The United States Code is in the public domain. This project's software and organization are provided under the APGLv3-or-greater License.*
---
**🏛️ "Every line of law, attributed to its author, tracked through time."**
*Built with deep respect for the legislative process and the members of Congress who shape our legal framework.*