# 🏛️ Git Blame for the United States Code

> **Apply the full power of git to track every change in the United States Code with line-by-line attribution to Congressional sponsors.**

[Python](https://www.python.org/downloads/)
[Pydantic](https://pydantic.dev/)
[Congress.gov API](https://api.congress.gov/)

## Vision: True Git Blame for Law

```bash
git blame Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md

# Illustrative output with line-by-line attribution:
a1b2c3d4 (Rep. Nancy Pelosi 2021-03-11) (a) In general.—The Secretary, acting through
e5f6a7b8 (Sen. Chuck Schumer 2021-03-11) the Director of the Centers for Disease Control
f9a0b1c2 (Rep. Mike Johnson 2023-01-09) and Prevention, shall award grants to eligible
```

**Every line of the US Code shows which member of Congress last modified it and when.**

## The Vision

This system transforms US Code tracking from annual snapshots to **line-level legislative history**:

- **📍 Granular Attribution**: Every line shows the member of Congress who last changed it
- **🕰️ Complete Timeline**: Full evolution from 2013 to present with chronological commits
- **📊 Rich Context**: Committee reports, debates, sponsor details, and legislative process
- **🔍 Powerful Queries**: `git log --follow Section-280g-15.md` to see complete section history
- **🎯 Diff Analysis**: `git diff PL-116-260..PL-117-328` to see exactly what changed between laws

## Architecture: Modular & Extensible

### 🏗️ Four-Script Modular Design

```bash
# Complete Pipeline - Orchestrated execution
uv run main.py                    # Run all stages with defaults
uv run main.py --comprehensive    # Full download with all data sources
uv run main.py --force-migration  # Force re-migration of existing files

# Individual Stages - Independent execution
uv run main.py --stage 1          # Download & cache data only
uv run main.py --stage 2          # Migrate cached data to JSON
uv run main.py --stage 3          # Generate git commit plans
uv run main.py --stage 4          # Build final git repository
```

Each script is **independent**, **idempotent**, **cached**, and **scalable**.

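The flags shown above could be wired together with `argparse` roughly as follows. This is a minimal sketch, not the project's actual `main.py`: the names `build_parser` and `stages_to_run`, the help strings, and the decision to parse `--laws` as a comma-separated list are assumptions; only the flag names themselves come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--stage", type=int, choices=[1, 2, 3, 4],
                        help="run a single stage instead of the full pipeline")
    parser.add_argument("--laws", type=lambda s: s.split(","), default=[],
                        help="comma-separated public laws, e.g. 119-001,119-004")
    parser.add_argument("--comprehensive", action="store_true")
    parser.add_argument("--force-migration", action="store_true")
    parser.add_argument("--force-rebuild", action="store_true")
    return parser


def stages_to_run(args: argparse.Namespace) -> list[int]:
    """Return the selected stage, or all four stages when none is given."""
    return [args.stage] if args.stage is not None else [1, 2, 3, 4]


args = build_parser().parse_args(["--stage", "2", "--laws", "119-001,119-004"])
print(stages_to_run(args), args.laws)
```

Keeping stage selection in one small function makes it easy to run a single stage during debugging while the default path still executes all four in order.
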
### 📊 Comprehensive Data Sources

Sources:

- https://www.govinfo.gov/bulkdata/
- https://xml.house.gov/
- https://uscode.house.gov/download/priorreleasepoints.htm

Submodules:

- uslm
- bill-dtd

**Official Legal Text:**
- **House US Code Releases**: Official legal text with semantic HTML structure
- **Release Points**: Individual public law snapshots with version control

**Legislative Attribution:**
- **Congress.gov API**: Bills, sponsors, committees, amendments, related bills
- **Member Profiles**: Complete congressional member data with bioguide IDs
- **Committee Reports**: Analysis and recommendations for each bill
- **Voting Records**: House and Senate votes for attribution accuracy

**Process Context:**
- **Congressional Record**: Floor debates and sponsor statements
- **Committee Hearings**: Legislative development and markup process
- **CRS Reports**: Professional analysis of bill impacts and changes
- **Related Bills**: Cross-references and companion legislation

## Data Processing Pipeline

### Phase 1: Comprehensive Download (`download_cache.py`)

```python
downloader = USCDataDownloader()

# Download official US Code HTML releases
house_releases = downloader.download_house_usc_releases(public_laws)

# Fetch comprehensive bill data from Congress.gov API
bill_data = downloader.download_congress_api_bills(public_laws)

# Get member profiles for proper attribution
members = downloader.download_member_profiles(congresses=[113, 114, 115, 116, 117, 118, 119])

# Download committee reports and analysis
committee_data = downloader.download_committee_reports(public_laws)
```

**Features:**
- ✅ **Smart Caching**: Never re-downloads existing data, so the stage is fully idempotent
- ✅ **Rate Limiting**: Respects Congress.gov 1,000 req/hour limit
- ✅ **Rich Metadata**: Tracks download timestamps, sizes, sources
- ✅ **Error Recovery**: Continues processing despite individual failures
- ✅ **Organized Storage**: Separate cache directories by data type
- ✅ **Cache Validation**: `is_cached()` checks prevent duplicate downloads

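One way to honor an hourly request budget like the one listed above is a sliding-window limiter. The sketch below is illustrative only: `SlidingWindowLimiter` is a hypothetical helper, not a class from `download_cache.py`, and the injectable `clock` and `sleep` parameters exist purely so the logic can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_seconds` span (hypothetical helper)."""

    def __init__(self, max_calls: int = 1000, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._calls: deque = deque()  # timestamps of recent calls, oldest first

    def acquire(self) -> float:
        """Block until a request slot is free; return the seconds spent waiting."""
        waited = 0.0
        now = self._clock()
        self._evict(now)
        while len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            delay = max(self._calls[0] + self.window_seconds - now, 0.0)
            self._sleep(delay)
            waited += delay
            now = self._clock()
            self._evict(now)
        self._calls.append(now)
        return waited

    def _evict(self, now: float) -> None:
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()


limiter = SlidingWindowLimiter(max_calls=1000, window_seconds=3600.0)
limiter.acquire()  # call once before each API request
```

A sliding window avoids the burst at the top of each hour that a fixed hourly counter would allow.
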
### Phase 2: Data Normalization (`migrate_to_datastore.py`)

```python
migrator = DataMigrator()

# Parse HTML using semantic field extraction
usc_sections = migrator.extract_usc_sections_from_html(house_releases)

# Normalize congressional data with Pydantic validation
normalized_bills = migrator.migrate_congress_api_data(bill_data)

# Cross-reference and validate all relationships
migrator.validate_and_index(usc_sections, normalized_bills, members)
```

**Features:**
- ✅ **HTML Parsing**: Extract clean USC text from semantic HTML fields
- ✅ **Structure Normalization**: Handle multiple conversion program versions
- ✅ **Pydantic Validation**: Type safety and business rule enforcement
- ✅ **Cross-Referencing**: Link bills to public laws to USC changes
- ✅ **Data Integrity**: Comprehensive validation and consistency checks
- ✅ **Idempotent Processing**: Skip existing output files, `--force-migration` to override
- ✅ **Output Validation**: Checks for existing `data/usc_sections/{law}.json` files

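The skip-unless-forced behavior described in the list above can be sketched in a few lines. This is an assumption-laden illustration rather than the code in `migrate_to_datastore.py`: `migrate_law` and its `extract` callback are hypothetical names, and the write-to-temp-then-rename step is an extra safeguard this README does not claim the project uses.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def migrate_law(law: str, output_dir: Path, extract: Callable[[str], dict],
                force: bool = False) -> bool:
    """Write `<output_dir>/<law>.json` unless it already exists.

    Returns True when work was done and False when the law was skipped.
    `extract` stands in for the real HTML-to-sections step (hypothetical).
    """
    target = output_dir / f"{law}.json"
    if target.exists() and not force:
        return False
    output_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so an interrupted run never leaves partial JSON.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(extract(law), indent=2), encoding="utf-8")
    tmp.replace(target)
    return True


with tempfile.TemporaryDirectory() as root:
    out = Path(root) / "data" / "usc_sections"
    fake_extract = lambda law: {"law": law, "sections": []}
    print(migrate_law("119-001", out, fake_extract))              # first run does the work
    print(migrate_law("119-001", out, fake_extract))              # second run is skipped
    print(migrate_law("119-001", out, fake_extract, force=True))  # forced re-run
```
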
### Phase 3: Smart Git Planning (`generate_git_plan.py`)

```python
planner = GitPlanGenerator()

# Analyze USC changes between consecutive releases
changes = planner.analyze_usc_changes(old_release, new_release)

# Generate commit plans for each public law
commit_plans = planner.generate_incremental_commit_plans(changes, public_laws)

# Optimize commit sequence for git blame accuracy
optimized = planner.optimize_commit_sequence(commit_plans)
```

**Features:**
- ✅ **Section-Level Diff**: Track changes at USC section granularity
- ✅ **Incremental Commits**: Only commit files that actually changed
- ✅ **Smart Attribution**: Map changes to specific public laws and sponsors
- ✅ **Chronological Order**: Proper timestamp ordering for git history
- ✅ **Conflict Resolution**: Handle complex multi-law interactions
- ✅ **Plan Caching**: Saves commit plans to `data/git_plans/` for reuse
- ✅ **Input Validation**: Checks for required USC sections data before planning

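Section-level diffing between two consecutive releases reduces to comparing two mappings from section identifier to text. The sketch below shows that idea under stated assumptions: `ReleaseDiff`, `diff_releases`, and the `"title/section"` key format are hypothetical and do not describe the internals of `generate_git_plan.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseDiff:
    added: list[str]
    removed: list[str]
    modified: list[str]

    def changed_files(self) -> list[str]:
        """Sections that an incremental commit would need to touch."""
        return sorted(self.added + self.removed + self.modified)


def diff_releases(old: dict[str, str], new: dict[str, str]) -> ReleaseDiff:
    """Compare two releases keyed by section id (e.g. "42/280g-15")."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return ReleaseDiff(added, removed, modified)


old_release = {"42/280g-15": "original text", "1/1": "unchanged", "1/2": "repealed later"}
new_release = {"42/280g-15": "amended text", "1/1": "unchanged", "1/8": "newly added"}
print(diff_releases(old_release, new_release))
```

Only the sections returned by `changed_files()` would be written in a commit, which is what keeps unchanged sections attributed to their earlier commits.
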
### Phase 4: Repository Construction (`build_git_repo.py`)

```python
builder = GitRepoBuilder()

# Create hierarchical USC structure
builder.build_hierarchical_structure(usc_sections)

# Apply commit plans with proper attribution
for plan in commit_plans:
    builder.apply_commit_plan(plan)

# Validate git blame functionality
builder.validate_git_history()
```

**Output Structure:**
```
uscode-git-blame/
├── Title-01-General-Provisions/
│   ├── Chapter-01-Rules-of-Construction/
│   │   ├── Section-001.md    # § 1. Words denoting number, gender...
│   │   ├── Section-002.md    # § 2. "County" as including "parish"...
│   │   └── Section-008.md    # § 8. "Person", "human being"...
│   └── Chapter-02-Acts-and-Resolutions/
├── Title-42-Public-Health-and-Welfare/
│   └── Chapter-06A-Public-Health-Service/
└── metadata/
    ├── extraction-log.json
    ├── commit-plans.json
    └── validation-results.json
```

**Features:**
- ✅ **Hierarchical Organization**: Title/Chapter/Section file structure
- ✅ **Clean Markdown**: Convert HTML to readable markdown with proper formatting
- ✅ **Proper Attribution**: Git author/committer fields with congressional sponsors
- ✅ **Rich Commit Messages**: Include bill details, affected sections, sponsor quotes
- ✅ **Git Blame Validation**: Verify every line has proper attribution
- ✅ **Repository Management**: `--force-rebuild` flag for clean repository recreation
- ✅ **Build Metadata**: Comprehensive statistics in `metadata/` directory

## Advanced Features

### ⚡ Idempotent & Cached Processing

**All scripts implement comprehensive caching and idempotency:**

```bash
# First run - downloads and processes everything
uv run main.py --laws 119-001,119-004

# Second run - skips existing work, completes instantly
uv run main.py --laws 119-001,119-004
# Output: ✅ Skipping HTML migration for 119-001 - output exists

# Force complete re-processing when needed
uv run main.py --laws 119-001,119-004 --force-migration --force-rebuild
```

**Script-Level Caching:**
- **Stage 1**: `download_cache/` - Never re-download existing files
- **Stage 2**: `data/usc_sections/` - Skip processing if JSON output exists
- **Stage 3**: `data/git_plans/` - Reuse existing commit plans
- **Stage 4**: Repository exists check with `--force-rebuild` override

**Benefits:**
- ✅ **Development Speed**: Instant re-runs during development
- ✅ **Production Safety**: Resume interrupted processes seamlessly
- ✅ **Resource Efficiency**: No redundant API calls or processing
- ✅ **Incremental Updates**: Process only new public laws
- ✅ **Debugging Support**: Test individual stages without full pipeline

### 🔍 Intelligent Text Extraction

**Multi-Version HTML Parsing:**
- Handles House conversion programs: `xy2html.pm-0.400` through `xy2html.pm-0.401`
- Extracts clean text from semantic field markers (`<!-- field-start:statute -->`)
- Normalizes HTML entities and whitespace consistently
- Preserves cross-references and legal citations

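Field-marker extraction of the kind listed above can be approximated with a regular expression and the standard library's entity decoder. The following is a rough sketch, not the project's parser: it assumes each `field-start:NAME` comment is closed by a matching `field-end:NAME` comment, the helper name `extract_fields` is hypothetical, and replacing every tag with a space is a simplification that a real implementation would likely refine.

```python
import html
import re

# Assumption: fields are delimited by paired start/end comments carrying the same name.
FIELD_RE = re.compile(
    r"<!--\s*field-start:(?P<name>[\w-]+)\s*-->(?P<body>.*?)<!--\s*field-end:(?P=name)\s*-->",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_fields(page: str) -> dict[str, str]:
    """Return clean text for each marked field; repeated fields are concatenated."""
    fields: dict[str, str] = {}
    for match in FIELD_RE.finditer(page):
        text = TAG_RE.sub(" ", match.group("body"))   # strip markup
        text = html.unescape(text)                    # decode entities such as &nbsp;
        text = re.sub(r"\s+", " ", text).strip()      # collapse all whitespace runs
        name = match.group("name")
        fields[name] = f"{fields[name]} {text}" if name in fields else text
    return fields


sample = """
<!-- field-start:head --><h3 class="section-head">&sect; 1. Example heading</h3><!-- field-end:head -->
<!-- field-start:statute --><p>(a) In general. The&nbsp;Secretary shall
   award grants.</p><!-- field-end:statute -->
"""
print(extract_fields(sample))
```
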
**Content Structure Recognition:**
```python
from typing import List

from pydantic import BaseModel


class USCSection(BaseModel):
    title_num: int           # 42 (Public Health and Welfare)
    chapter_num: str         # "6A" (Public Health Service); a string because chapters can carry letter suffixes
    section_num: str         # "280g-15" (section numbers can include letters and hyphens)
    heading: str             # Clean section title
    statutory_text: str      # Normalized legal text
    source_credit: str       # Original enactment attribution
    amendment_history: List  # All amendments with dates
    cross_references: List   # References to other USC sections
```

### 🎯 Smart Diff & Change Detection

**Section-Level Comparison:**
- Compare USC releases at individual section granularity
- Track text additions, deletions, and modifications
- Identify which specific public law caused each change
- Handle complex multi-section amendments

**Change Attribution Pipeline:**
```python
from __future__ import annotations  # type names below are resolved lazily


class ChangeDetector:
    def analyze_section_changes(self, old_section: USCSection, new_section: USCSection) -> SectionChange:
        # Line-by-line diff analysis
        # Map changes to specific paragraphs and subsections
        # Track addition/deletion/modification types
        ...

    def attribute_to_public_law(self, change: SectionChange, public_law: PublicLaw) -> Attribution:
        # Cross-reference with bill text and legislative history
        # Identify primary sponsor and key committee members
        # Generate rich attribution with legislative context
        ...
```

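The line-by-line step in `analyze_section_changes` could be built on the standard library's `difflib`. The sketch below is one possible approach under that assumption, not the project's implementation; `classify_line_changes` is a hypothetical name, and the three labels simply mirror the addition/deletion/modification types mentioned above.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags onto the change types named in this README.
LABELS = {"insert": "addition", "delete": "deletion", "replace": "modification"}


def classify_line_changes(old_text: str, new_text: str) -> list[tuple[str, list[str], list[str]]]:
    """Return (change_type, old_lines, new_lines) for every changed run of lines."""
    old_lines, new_lines = old_text.splitlines(), new_text.splitlines()
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((LABELS[tag], old_lines[i1:i2], new_lines[j1:j2]))
    return changes


old = "(a) In general\n(b) Eligibility\n(c) Funding"
new = "(a) In general\n(b) Eligible entities\n(c) Funding\n(d) Reporting"
for change in classify_line_changes(old, new):
    print(change)
```

Because git blame works on whole lines, classifying changes at line granularity matches what the resulting history will actually show.
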
### 📈 Git History Optimization

**Chronological Accuracy:**
- All commits use actual enactment dates as timestamps
- Handle complex scenarios like bills signed across year boundaries
- Preserve proper Congressional session attribution

**Blame-Optimized Structure:**
- Each file contains a single USC section for granular blame
- Preserve git history continuity for unchanged sections
- Optimize for common queries like section evolution

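Git reads author and committer identity and dates from the `GIT_AUTHOR_*` and `GIT_COMMITTER_*` environment variables, which is one way to stamp commits with enactment dates. The sketch below builds that environment and orders planned commits chronologically. `PlannedCommit`, the placeholder e-mail addresses, and the choice of noon UTC as the time of day are assumptions made for illustration; the README does not specify how the project sets these values.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timezone


@dataclass(frozen=True)
class PlannedCommit:
    public_law: str      # e.g. "117-002" (congress-number)
    sponsor_name: str
    sponsor_email: str   # placeholder address, hypothetical
    enacted: date


def commit_environment(commit: PlannedCommit) -> dict[str, str]:
    """Environment variables that make `git commit` record the sponsor and enactment date."""
    moment = datetime.combine(commit.enacted, time(12, 0), tzinfo=timezone.utc)
    stamp = f"{int(moment.timestamp())} +0000"  # git's "<unix seconds> <utc offset>" date format
    return {
        "GIT_AUTHOR_NAME": commit.sponsor_name,
        "GIT_AUTHOR_EMAIL": commit.sponsor_email,
        "GIT_AUTHOR_DATE": stamp,
        "GIT_COMMITTER_NAME": commit.sponsor_name,
        "GIT_COMMITTER_EMAIL": commit.sponsor_email,
        "GIT_COMMITTER_DATE": stamp,
    }


def chronological(commits: list[PlannedCommit]) -> list[PlannedCommit]:
    """Order by enactment date, breaking same-day ties by public law number."""
    def key(c: PlannedCommit):
        return (c.enacted, tuple(int(part) for part in c.public_law.split("-")))
    return sorted(commits, key=key)


example = PlannedCommit("117-002", "Rep. Example Sponsor", "sponsor@example.invalid", date(2021, 3, 11))
print(commit_environment(example)["GIT_AUTHOR_DATE"])
```

The returned mapping can be merged into `os.environ` and passed as `env=` to `subprocess.run(["git", "commit", ...])`. Setting the committer date as well as the author date keeps `git log` ordering consistent with enactment order.
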
## Usage Examples

### Basic Repository Generation

```bash
# Complete pipeline - all stages in one command
uv run main.py

# Comprehensive processing with all data sources
uv run main.py --comprehensive

# Process specific public laws
uv run main.py --laws 119-001,119-004,119-012

# Individual stage execution for development/debugging
uv run main.py --stage 1  # Download only
uv run main.py --stage 2  # Migration only
uv run main.py --stage 3  # Planning only
uv run main.py --stage 4  # Repository building only
```

### Advanced Queries

```bash
cd uscode-git-blame

# See who last modified healthcare provisions
git blame Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md

# Track complete evolution of a section
git log --follow --patch Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md

# List Title 42 files that changed between two public laws (both must fall within the covered range)
git diff PL-116-260..PL-117-328 --name-only | grep "Title-42"

# Find all changes by specific sponsor
git log --author="Nancy Pelosi" --oneline

# See what changed during the 117th Congress
git log --since="2021-01-03" --until="2023-01-03" --stat
```

### Programmatic Analysis

```python
from git import Repo  # GitPython

repo = Repo("uscode-git-blame")

# Find most frequently modified sections
section_changes = {}
for commit in repo.iter_commits():
    for file in commit.stats.files:
        section_changes[file] = section_changes.get(file, 0) + 1

# Analyze sponsor activity
sponsor_activity = {}
for commit in repo.iter_commits():
    author = commit.author.name
    sponsor_activity[author] = sponsor_activity.get(author, 0) + 1

# Track healthcare law evolution
healthcare_commits = list(repo.iter_commits(paths="Title-42-Public-Health-and-Welfare"))
```

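Commit counts measure activity, but blame measures which lines currently survive. As a complement to the analysis above, the sketch below counts blamed lines per author by parsing `git blame --line-porcelain` output, where every source line is preceded by a full header that includes an `author` line and the content line itself starts with a tab. The function name and the trimmed sample text are illustrative assumptions; real porcelain output carries additional header lines that this parser simply ignores.

```python
from collections import Counter


def lines_per_author(porcelain: str) -> Counter:
    """Count blamed lines per author from `git blame --line-porcelain` output."""
    counts: Counter = Counter()
    for line in porcelain.splitlines():
        # Content lines start with a tab, so statute text can never be mistaken for a header.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


# Trimmed, made-up sample in the porcelain layout (hash, header lines, tab-prefixed content).
sample = (
    "a1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6a7b8c9d0 1 1 1\n"
    "author Rep. Example One\n"
    "author-mail <one@example.invalid>\n"
    "\t(a) In general.\n"
    "b1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6a7b8c9d0 2 2 1\n"
    "author Sen. Example Two\n"
    "author-mail <two@example.invalid>\n"
    "\tauthor of the report shall submit\n"
    "a1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6a7b8c9d0 3 3 1\n"
    "author Rep. Example One\n"
    "author-mail <one@example.invalid>\n"
    "\t(b) Eligibility.\n"
)
print(lines_per_author(sample).most_common())
```

The input can come from `subprocess.run(["git", "blame", "--line-porcelain", path], capture_output=True, text=True, check=True).stdout`, run inside the generated repository.
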
## Data Coverage & Statistics

### Current Scope (Implemented)
- **📅 Time Range**: July 2013 - July 2025 (12 years)
- **⚖️ Legal Coverage**: 304 public laws with US Code impact
- **🏛️ Congresses**: 113th through 119th Congress
- **👥 Attribution**: 4 key Congressional leaders with full profiles

### Target Scope (Full Implementation)
- **📅 Historical Coverage**: Back to 1951 (Congressional Record availability)
- **⚖️ Complete Legal Corpus**: All USC-affecting laws since digital records
- **🏛️ Full Congressional History**: All sessions with available data
- **👥 Complete Attribution**: All 540+ Congressional members with bioguide IDs
- **📊 Rich Context**: Committee reports, debates, amendments for every law

### Performance Metrics
- **⚡ Processing Speed**: ~10 public laws per minute
- **💾 Storage Requirements**: ~50GB for complete historical dataset
- **🌐 Network Usage**: ~5,000 API calls per full Congress
- **🔄 Update Frequency**: New laws processed within 24 hours

## Production Deployment

### System Requirements

**Minimum:**
- Python 3.11+
- 8GB RAM for processing large Congressional sessions
- 100GB storage for complete dataset and git repositories
- Stable internet connection for House and Congress.gov APIs

**Recommended:**
- Python 3.12 with uv package manager
- 16GB RAM for parallel processing
- 500GB SSD storage for optimal git performance
- High-bandwidth connection for bulk downloads

### Configuration

```bash
# Environment Variables
export CONGRESS_API_KEY="your-congress-gov-api-key"
export USCODE_DATA_PATH="/data/uscode"
export USCODE_REPO_PATH="/repos/uscode-git-blame"
export DOWNLOAD_CACHE_PATH="/cache/uscode-downloads"
export LOG_LEVEL="INFO"
export PARALLEL_DOWNLOADS=4
export MAX_RETRY_ATTEMPTS=3
```

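Reading these variables into one typed object at startup keeps configuration errors out of the middle of a long run. The sketch below is hypothetical: `Settings` and `load_settings` are illustrative names, and the defaults simply reuse the example values shown above rather than documented project defaults.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    congress_api_key: str
    data_path: Path
    repo_path: Path
    cache_path: Path
    log_level: str
    parallel_downloads: int
    max_retry_attempts: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on a missing API key."""
    api_key = env.get("CONGRESS_API_KEY", "")
    if not api_key:
        raise RuntimeError("CONGRESS_API_KEY is required for Congress.gov API calls")
    return Settings(
        congress_api_key=api_key,
        # Defaults below are assumptions copied from the README's example values.
        data_path=Path(env.get("USCODE_DATA_PATH", "/data/uscode")),
        repo_path=Path(env.get("USCODE_REPO_PATH", "/repos/uscode-git-blame")),
        cache_path=Path(env.get("DOWNLOAD_CACHE_PATH", "/cache/uscode-downloads")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        parallel_downloads=int(env.get("PARALLEL_DOWNLOADS", "4")),
        max_retry_attempts=int(env.get("MAX_RETRY_ATTEMPTS", "3")),
    )


settings = load_settings({"CONGRESS_API_KEY": "your-congress-gov-api-key", "PARALLEL_DOWNLOADS": "8"})
print(settings.parallel_downloads, settings.log_level)
```

Accepting any mapping instead of reading `os.environ` directly makes the loader easy to exercise in tests.
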
### Monitoring & Observability

```text
# Built-in monitoring endpoints
GET /api/v1/status      # System health and processing status
GET /api/v1/stats       # Download and processing statistics
GET /api/v1/coverage    # Data coverage and completeness metrics
GET /api/v1/validation  # Data validation and integrity results
```

**Logging & Alerts:**
- Comprehensive structured logging with timestamps in `logs/` directory
- Individual log files per script: `main_orchestrator.log`, `download_cache.log`, etc.
- Alert when API usage approaches the rate limit or requests fail
- Monitor git repository integrity and size growth
- Track data validation errors and their resolution
- Centralized logging configuration across all pipeline scripts

## Legal & Ethical Considerations

### Data Integrity
- **📋 Official Sources Only**: Uses only House and Congress.gov official sources
- **🔒 No Modifications**: Preserves original legal text without alterations
- **📝 Proper Attribution**: Credits all legislative authorship accurately
- **⚖️ Legal Compliance**: Respects copyright and maintains public domain status

### Privacy & Ethics
- **🌐 Public Information**: Uses only publicly available Congressional data
- **👥 Respectful Attribution**: Honors Congressional service with accurate representation
- **📊 Transparency**: All source code and methodologies are open and auditable
- **🎯 Non-Partisan**: Objective tracking without political interpretation

## Roadmap

### Phase 1: Foundation ✅ (Complete)
- [x] Modular four-script architecture design
- [x] Comprehensive data downloader with Congress.gov API integration
- [x] Caching system with metadata tracking
- [x] Type-safe code with comprehensive validation
- [x] Idempotent processing with force flags
- [x] Pipeline orchestrator with individual stage execution

### Phase 2: Data Processing ✅ (Complete)
- [x] HTML-to-text extraction with semantic structure preservation
- [x] Pydantic models for all data types with validation
- [x] Cross-referencing system linking bills to USC changes
- [x] Data migration and normalization pipeline
- [x] Output file existence checks for idempotency
- [x] Comprehensive error handling and logging

### Phase 3: Git Repository Generation ✅ (Complete)
- [x] Intelligent diff analysis for incremental commits
- [x] Hierarchical USC structure generation
- [x] Git blame optimization and validation
- [x] Rich commit messages with legislative context
- [x] Markdown conversion with proper formatting
- [x] Build statistics and metadata tracking

### Phase 4: Production Features (Q3 2025)
- [ ] Web interface for repository browsing
- [ ] API for programmatic access to legislative data
- [ ] Automated updates for new public laws
- [ ] Advanced analytics and visualization

### Phase 5: Historical Expansion (Q4 2025)
- [ ] Extended coverage back to 1951
- [ ] Integration with additional legislative databases
- [ ] Enhanced attribution with committee and markup data
- [ ] Performance optimization for large-scale datasets

## Contributing

### Development Setup

```bash
git clone https://github.com/your-org/gitlaw
cd gitlaw
uv sync

# Confirm the CLI runs and list the available options
uv run main.py --help

# Run individual stages for development
uv run main.py --stage 1 --laws 119-001  # Test download
uv run main.py --stage 2 --laws 119-001  # Test migration
uv run main.py --stage 3 --laws 119-001  # Test planning
uv run main.py --stage 4 --laws 119-001  # Test git repo build

# Follow all pipeline logs while a stage runs
tail -f logs/*.log
```

### Adding New Features

1. **Data Sources**: Extend `download_cache.py` with new Congress.gov endpoints
2. **Processing**: Add new Pydantic models in `models.py`
3. **Git Features**: Enhance `build_git_repo.py` with new attribution methods
4. **Validation**: Add tests in `tests/` with realistic legislative scenarios

### Testing Philosophy

```bash
# Unit tests for individual components
uv run python -m pytest tests/unit/

# Integration tests with real Congressional data
uv run python -m pytest tests/integration/

# End-to-end tests building small git repositories
uv run python -m pytest tests/e2e/
```

## Support & Community

- **📚 Documentation**: Complete API documentation and examples
- **💬 Discussions**: GitHub Discussions for questions and ideas
- **🐛 Issues**: GitHub Issues for bug reports and feature requests
- **🔄 Updates**: Regular releases with new Congressional data

---

## License

**AGPLv3-or-later License** - See LICENSE file for details.

*The United States Code is in the public domain. This project's software and organization are provided under the AGPLv3-or-later License.*

---

**🏛️ "Every line of law, attributed to its author, tracked through time."**

*Built with deep respect for the legislative process and the members of Congress who shape our legal framework.*