# 🏛️ Git Blame for the United States Code

> **Apply the full power of git to track every change in the United States Code with line-by-line attribution to Congressional sponsors.**

[![Python](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Pydantic](https://img.shields.io/badge/pydantic-v2-green.svg)](https://pydantic.dev/)
[![Congress.gov](https://img.shields.io/badge/data-Congress.gov%20API-blue.svg)](https://api.congress.gov/)

## Vision: True Git Blame for Law

```bash
git blame Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md

# Shows line-by-line attribution:
a1b2c3d4 (Rep. Nancy Pelosi  2021-03-11) (a) In general.—The Secretary, acting through
e5f6a7b8 (Sen. Chuck Schumer 2021-03-11) the Director of the Centers for Disease Control
c9d0e1f2 (Rep. Mike Johnson  2023-01-09) and Prevention, shall award grants to eligible
```

**Every line of the US Code shows exactly which Congressperson last modified it and when.**

## The Vision

This system transforms US Code tracking from annual snapshots to **line-level legislative history**:

- **📍 Granular Attribution**: Every line shows the exact Congressperson who last changed it
- **🕰️ Complete Timeline**: Full evolution from 2013 to present with chronological commits
- **📊 Rich Context**: Committee reports, debates, sponsor details, and legislative process
- **🔍 Powerful Queries**: `git log --follow Section-280g-15.md` to see complete section history
- **🎯 Diff Analysis**: `git diff PL-116-260..PL-117-328` to see exactly what changed between laws

## Architecture: Modular & Extensible

### 🏗️ Four-Script Modular Design

```bash
# Complete Pipeline - Orchestrated execution
uv run main.py                    # Run all stages with defaults
uv run main.py --comprehensive    # Full download with all data sources
uv run main.py --force-migration  # Force re-migration of existing files

# Individual Stages - Independent execution
uv run main.py --stage 1          # Download & cache data only
uv run main.py --stage 2          # Migrate cached data to JSON
uv run main.py --stage 3          # Generate git commit plans
uv run main.py --stage 4          # Build final git repository
```

Each script is **independent**, **idempotent**, **cached**, and **scalable**.

### 📊 Comprehensive Data Sources

Sources:
- https://www.govinfo.gov/bulkdata/
- https://xml.house.gov/
- https://uscode.house.gov/download/priorreleasepoints.htm

Submodules:
- uslm
- bill-dtd

**Official Legal Text:**
- **House US Code Releases**: Official legal text with semantic HTML structure
- **Release Points**: Individual public law snapshots with version control

**Legislative Attribution:**
- **Congress.gov API**: Bills, sponsors, committees, amendments, related bills
- **Member Profiles**: Complete congressional member data with bioguide IDs
- **Committee Reports**: Analysis and recommendations for each bill
- **Voting Records**: House and Senate votes for attribution accuracy

**Process Context:**
- **Congressional Record**: Floor debates and sponsor statements
- **Committee Hearings**: Legislative development and markup process
- **CRS Reports**: Professional analysis of bill impacts and changes
- **Related Bills**: Cross-references and companion legislation

## Data Processing Pipeline

### Phase 1: Comprehensive Download (`download_cache.py`)

```python
downloader = USCDataDownloader()

# Download official US Code HTML releases
house_releases = downloader.download_house_usc_releases(public_laws)

# Fetch comprehensive bill data from Congress.gov API
bill_data = downloader.download_congress_api_bills(public_laws)

# Get member profiles for proper attribution
members = downloader.download_member_profiles(congresses=[113, 114, 115, 116, 117, 118, 119])

# Download committee reports and analysis
committee_data = downloader.download_committee_reports(public_laws)
```

**Features:**
- ✅ **Smart Caching**: Never re-download existing data - fully idempotent
- ✅ **Rate Limiting**: Respects
Congress.gov 1,000 req/hour limit
- ✅ **Rich Metadata**: Tracks download timestamps, sizes, sources
- ✅ **Error Recovery**: Continues processing despite individual failures
- ✅ **Organized Storage**: Separate cache directories by data type
- ✅ **Cache Validation**: `is_cached()` checks prevent duplicate downloads

### Phase 2: Data Normalization (`migrate_to_datastore.py`)

```python
migrator = DataMigrator()

# Parse HTML using semantic field extraction
usc_sections = migrator.extract_usc_sections_from_html(house_releases)

# Normalize congressional data with Pydantic validation
normalized_bills = migrator.migrate_congress_api_data(bill_data)

# Cross-reference and validate all relationships
migrator.validate_and_index(usc_sections, normalized_bills, members)
```

**Features:**
- ✅ **HTML Parsing**: Extract clean USC text from semantic HTML fields
- ✅ **Structure Normalization**: Handle multiple conversion program versions
- ✅ **Pydantic Validation**: Type safety and business rule enforcement
- ✅ **Cross-Referencing**: Link bills to public laws to USC changes
- ✅ **Data Integrity**: Comprehensive validation and consistency checks
- ✅ **Idempotent Processing**: Skip existing output files, `--force-migration` to override
- ✅ **Output Validation**: Checks for existing `data/usc_sections/{law}.json` files

### Phase 3: Smart Git Planning (`generate_git_plan.py`)

```python
planner = GitPlanGenerator()

# Analyze USC changes between consecutive releases
changes = planner.analyze_usc_changes(old_release, new_release)

# Generate commit plans for each public law
commit_plans = planner.generate_incremental_commit_plans(changes, public_laws)

# Optimize commit sequence for git blame accuracy
optimized = planner.optimize_commit_sequence(commit_plans)
```

**Features:**
- ✅ **Section-Level Diff**: Track changes at USC section granularity
- ✅ **Incremental Commits**: Only commit files that actually changed
- ✅ **Smart Attribution**: Map changes to specific
public laws and sponsors
- ✅ **Chronological Order**: Proper timestamp ordering for git history
- ✅ **Conflict Resolution**: Handle complex multi-law interactions
- ✅ **Plan Caching**: Saves commit plans to `data/git_plans/` for reuse
- ✅ **Input Validation**: Checks for required USC sections data before planning

### Phase 4: Repository Construction (`build_git_repo.py`)

```python
builder = GitRepoBuilder()

# Create hierarchical USC structure
builder.build_hierarchical_structure(usc_sections)

# Apply commit plans with proper attribution
for plan in commit_plans:
    builder.apply_commit_plan(plan)

# Validate git blame functionality
builder.validate_git_history()
```

**Output Structure:**
```
uscode-git-blame/
├── Title-01-General-Provisions/
│   ├── Chapter-01-Rules-of-Construction/
│   │   ├── Section-001.md    # § 1. Words denoting number, gender...
│   │   ├── Section-002.md    # § 2. "County" as including "parish"...
│   │   └── Section-008.md    # § 8. "Person", "human being"...
│   └── Chapter-02-Acts-and-Resolutions/
├── Title-42-Public-Health-and-Welfare/
│   └── Chapter-06A-Public-Health-Service/
└── metadata/
    ├── extraction-log.json
    ├── commit-plans.json
    └── validation-results.json
```

**Features:**
- ✅ **Hierarchical Organization**: Title/Chapter/Section file structure
- ✅ **Clean Markdown**: Convert HTML to readable markdown with proper formatting
- ✅ **Proper Attribution**: Git author/committer fields with congressional sponsors
- ✅ **Rich Commit Messages**: Include bill details, affected sections, sponsor quotes
- ✅ **Git Blame Validation**: Verify every line has proper attribution
- ✅ **Repository Management**: `--force-rebuild` flag for clean repository recreation
- ✅ **Build Metadata**: Comprehensive statistics in `metadata/` directory

## Advanced Features

### ⚡ Idempotent & Cached Processing

**All scripts implement comprehensive caching and idempotency:**

```bash
# First run - downloads and processes everything
uv run main.py --laws 119-001,119-004

# Second run - skips existing work, completes instantly
uv run main.py --laws 119-001,119-004
# Output: ✅ Skipping HTML migration for 119-001 - output exists

# Force complete re-processing when needed
uv run main.py --laws 119-001,119-004 --force-migration --force-rebuild
```

**Script-Level Caching:**
- **Stage 1**: `download_cache/` - Never re-download existing files
- **Stage 2**: `data/usc_sections/` - Skip processing if JSON output exists
- **Stage 3**: `data/git_plans/` - Reuse existing commit plans
- **Stage 4**: Repository exists check with `--force-rebuild` override

**Benefits:**
- ✅ **Development Speed**: Instant re-runs during development
- ✅ **Production Safety**: Resume interrupted processes seamlessly
- ✅ **Resource Efficiency**: No redundant API calls or processing
- ✅ **Incremental Updates**: Process only new public laws
- ✅ **Debugging Support**: Test individual stages without full pipeline

### 🔍 Intelligent Text Extraction

**Multi-Version HTML Parsing:**
- Handles House conversion programs: `xy2html.pm-0.400` through `xy2html.pm-0.401`
- Extracts clean text from semantic field markers (``)
- Normalizes HTML entities and whitespace consistently
- Preserves cross-references and legal citations

**Content Structure Recognition:**
```python
from typing import List

class USCSection:
    title_num: int           # 42 (Public Health and Welfare)
    chapter_num: str         # "6A" (Public Health Service); a string, since chapters can carry letters
    section_num: str         # "280g-15" (handles subsection numbering)
    heading: str             # Clean section title
    statutory_text: str      # Normalized legal text
    source_credit: str       # Original enactment attribution
    amendment_history: List  # All amendments with dates
    cross_references: List   # References to other USC sections
```

### 🎯 Smart Diff & Change Detection

**Section-Level Comparison:**
- Compare USC releases at individual section granularity
- Track text additions, deletions, and modifications
- Identify which specific public law caused each change
- Handle complex multi-section amendments

**Change Attribution Pipeline:**
```python
from __future__ import annotations  # keeps the annotations below lazy until the types are defined

class ChangeDetector:
    def analyze_section_changes(self, old_section: USCSection, new_section: USCSection) -> SectionChange:
        # Line-by-line diff analysis
        # Map changes to specific paragraphs and subsections
        # Track addition/deletion/modification types
        ...

    def attribute_to_public_law(self, change: SectionChange, public_law: PublicLaw) -> Attribution:
        # Cross-reference with bill text and legislative history
        # Identify primary sponsor and key committee members
        # Generate rich attribution with legislative context
        ...
```

### 📈 Git History Optimization

**Chronological Accuracy:**
- All commits use actual enactment dates as timestamps
- Handle complex scenarios like bills signed across year boundaries
- Preserve proper Congressional session attribution

**Blame-Optimized Structure:**
- Each file contains a single USC section for granular blame
- Preserve git history continuity for unchanged sections
- Optimize for common queries like
section evolution

## Usage Examples

### Basic Repository Generation

```bash
# Complete pipeline - all stages in one command
uv run main.py

# Comprehensive processing with all data sources
uv run main.py --comprehensive

# Process specific public laws
uv run main.py --laws 119-001,119-004,119-012

# Individual stage execution for development/debugging
uv run main.py --stage 1  # Download only
uv run main.py --stage 2  # Migration only
uv run main.py --stage 3  # Planning only
uv run main.py --stage 4  # Repository building only
```

### Advanced Queries

```bash
cd uscode-git-blame

# See who last modified healthcare provisions
git blame Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md

# Track complete evolution of a section
git log --follow --patch Title-42-Public-Health-and-Welfare/Chapter-06A-Public-Health-Service/Section-280g-15.md

# Compare major healthcare laws
git diff PL-111-148..PL-117-328 --name-only | grep "Title-42"

# Find all changes by a specific sponsor
git log --author="Nancy Pelosi" --oneline

# See what changed during a specific Congress
git log --since="2021-01-03" --until="2023-01-03" --stat
```

### Programmatic Analysis

```python
from git import Repo
from pathlib import Path

repo = Repo("uscode-git-blame")

# Find most frequently modified sections
section_changes = {}
for commit in repo.iter_commits():
    for file in commit.stats.files:
        section_changes[file] = section_changes.get(file, 0) + 1

# Analyze sponsor activity
sponsor_activity = {}
for commit in repo.iter_commits():
    author = commit.author.name
    sponsor_activity[author] = sponsor_activity.get(author, 0) + 1

# Track healthcare law evolution
healthcare_commits = [c for c in repo.iter_commits(paths="Title-42-Public-Health-and-Welfare")]
```

## Data Coverage & Statistics

### Current Scope (Implemented)
- **📅 Time Range**: July 2013 - July 2025 (12+ years)
- **⚖️ Legal Coverage**: 304 public laws with US Code impact
- **🏛️ Congressional Sessions**:
113th through 119th Congress
- **👥 Attribution**: 4 key Congressional leaders with full profiles

### Target Scope (Full Implementation)
- **📅 Historical Coverage**: Back to 1951 (Congressional Record availability)
- **⚖️ Complete Legal Corpus**: All USC-affecting laws since digital records
- **🏛️ Full Congressional History**: All sessions with available data
- **👥 Complete Attribution**: All 540+ Congressional members with bioguide IDs
- **📊 Rich Context**: Committee reports, debates, amendments for every law

### Performance Metrics
- **⚡ Processing Speed**: ~10 public laws per minute
- **💾 Storage Requirements**: ~50GB for complete historical dataset
- **🌐 Network Usage**: ~5,000 API calls per full Congress
- **🔄 Update Frequency**: New laws processed within 24 hours

## Production Deployment

### System Requirements

**Minimum:**
- Python 3.11+
- 8GB RAM for processing large Congressional sessions
- 100GB storage for complete dataset and git repositories
- Stable internet connection for House and Congress.gov APIs

**Recommended:**
- Python 3.12 with uv package manager
- 16GB RAM for parallel processing
- 500GB SSD storage for optimal git performance
- High-bandwidth connection for bulk downloads

### Configuration

```bash
# Environment Variables
export CONGRESS_API_KEY="your-congress-gov-api-key"
export USCODE_DATA_PATH="/data/uscode"
export USCODE_REPO_PATH="/repos/uscode-git-blame"
export DOWNLOAD_CACHE_PATH="/cache/uscode-downloads"
export LOG_LEVEL="INFO"
export PARALLEL_DOWNLOADS=4
export MAX_RETRY_ATTEMPTS=3
```

### Monitoring & Observability

```text
# Built-in monitoring endpoints
GET /api/v1/status      # System health and processing status
GET /api/v1/stats       # Download and processing statistics
GET /api/v1/coverage    # Data coverage and completeness metrics
GET /api/v1/validation  # Data validation and integrity results
```

**Logging & Alerts:**
- Comprehensive structured logging with timestamps in `logs/` directory
- Individual log files
per script: `main_orchestrator.log`, `download_cache.log`, etc.
- Alert on API rate limit approaches or failures
- Monitor git repository integrity and size growth
- Track data validation errors and resolution
- Centralized logging configuration across all pipeline scripts

## Legal & Ethical Considerations

### Data Integrity
- **📋 Official Sources Only**: Uses only House and Congress.gov official sources
- **🔒 No Modifications**: Preserves original legal text without alterations
- **📝 Proper Attribution**: Credits all legislative authorship accurately
- **⚖️ Legal Compliance**: Respects copyright and maintains public domain status

### Privacy & Ethics
- **🌐 Public Information**: Uses only publicly available Congressional data
- **👥 Respectful Attribution**: Honors Congressional service with accurate representation
- **📊 Transparency**: All source code and methodologies are open and auditable
- **🎯 Non-Partisan**: Objective tracking without political interpretation

## Roadmap

### Phase 1: Foundation ✅ (Complete)
- [x] Modular four-script architecture design
- [x] Comprehensive data downloader with Congress.gov API integration
- [x] Caching system with metadata tracking
- [x] Type-safe code with comprehensive validation
- [x] Idempotent processing with force flags
- [x] Pipeline orchestrator with individual stage execution

### Phase 2: Data Processing ✅ (Complete)
- [x] HTML-to-text extraction with semantic structure preservation
- [x] Pydantic models for all data types with validation
- [x] Cross-referencing system linking bills to USC changes
- [x] Data migration and normalization pipeline
- [x] Output file existence checks for idempotency
- [x] Comprehensive error handling and logging

### Phase 3: Git Repository Generation ✅ (Complete)
- [x] Intelligent diff analysis for incremental commits
- [x] Hierarchical USC structure generation
- [x] Git blame optimization and validation
- [x] Rich commit messages with legislative context
- [x] Markdown conversion with proper formatting
- [x] Build statistics and metadata tracking

### Phase 4: Production Features (Q3 2025)
- [ ] Web interface for repository browsing
- [ ] API for programmatic access to legislative data
- [ ] Automated updates for new public laws
- [ ] Advanced analytics and visualization

### Phase 5: Historical Expansion (Q4 2025)
- [ ] Extended coverage back to 1951
- [ ] Integration with additional legislative databases
- [ ] Enhanced attribution with committee and markup data
- [ ] Performance optimization for large-scale datasets

## Contributing

### Development Setup

```bash
git clone https://github.com/your-org/gitlaw
cd gitlaw
uv sync

# Test the complete pipeline
uv run main.py --help

# Run individual stages for development
uv run main.py --stage 1 --laws 119-001  # Test download
uv run main.py --stage 2 --laws 119-001  # Test migration
uv run main.py --stage 3 --laws 119-001  # Test planning
uv run main.py --stage 4 --laws 119-001  # Test git repo build

# Test with comprehensive logging
tail -f logs/*.log  # Monitor all pipeline logs
```

### Adding New Features

1. **Data Sources**: Extend `download_cache.py` with new Congress.gov endpoints
2. **Processing**: Add new Pydantic models in `models.py`
3. **Git Features**: Enhance `build_git_repo.py` with new attribution methods
4. **Validation**: Add tests in `tests/` with realistic legislative scenarios

### Testing Philosophy

```bash
# Unit tests for individual components
uv run python -m pytest tests/unit/

# Integration tests with real Congressional data
uv run python -m pytest tests/integration/

# End-to-end tests building small git repositories
uv run python -m pytest tests/e2e/
```

## Support & Community

- **📚 Documentation**: Complete API documentation and examples
- **💬 Discussions**: GitHub Discussions for questions and ideas
- **🐛 Issues**: GitHub Issues for bug reports and feature requests
- **🔄 Updates**: Regular releases with new Congressional data

---

## License

**AGPLv3-or-later License** - See LICENSE file for details.

*The United States Code is in the public domain. This project's software and organization are provided under the AGPLv3-or-later License.*

---

**🏛️ "Every line of law, attributed to its author, tracked through time."**

*Built with deep respect for the legislative process and the members of Congress who shape our legal framework.*
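
---

## Appendix: Attribution Sketch

The incremental-commit and attribution ideas described for Phases 3 and 4 can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's actual code: `changed_sections`, `commit_env`, the section IDs, and the sponsor name and email are all hypothetical. The only externally defined names are the `GIT_AUTHOR_*` and `GIT_COMMITTER_*` environment variables, which git reads when it creates a commit.

```python
from datetime import date, datetime, time, timezone


def changed_sections(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify section IDs as added, removed, or modified between two releases.

    Both arguments map a (hypothetical) section ID to its full text.
    """
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def commit_env(sponsor: str, email: str, enacted: date) -> dict[str, str]:
    """Build git environment variables attributing a commit to a sponsor.

    Noon UTC is an arbitrary choice (an assumption of this sketch) so the
    enactment date does not shift when viewed in other time zones.
    """
    stamp = datetime.combine(enacted, time(12, 0), tzinfo=timezone.utc)
    git_date = stamp.strftime("%Y-%m-%dT%H:%M:%S%z")
    return {
        "GIT_AUTHOR_NAME": sponsor,
        "GIT_AUTHOR_EMAIL": email,
        "GIT_AUTHOR_DATE": git_date,
        "GIT_COMMITTER_NAME": sponsor,
        "GIT_COMMITTER_EMAIL": email,
        "GIT_COMMITTER_DATE": git_date,
    }


if __name__ == "__main__":
    # Hypothetical release snapshots: only files listed here would be committed.
    old_release = {"42/280g-15": "old text", "1/1": "unchanged"}
    new_release = {"42/280g-15": "new text", "1/1": "unchanged", "1/8": "added"}
    print(changed_sections(old_release, new_release))
    print(commit_env("Rep. Example Sponsor", "sponsor@example.invalid", date(2021, 3, 11)))
```

Merging the returned mapping into the environment of a `git commit` subprocess would stamp that commit with the sponsor and enactment date, which is what `git blame` later reports for each line.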