An automated system that monitors and archives regulatory compliance data from the Irish Environmental Protection Agency (EPA). This tool provides daily updates on environmental licence compliance, monitoring reports, incidents, and regulatory activities across Ireland.
This system automatically:
- Scrapes licence profile data from the EPA Ireland LEAP API
- Tracks compliance records, monitoring returns, site visits, and incidents
- Generates daily CSV files with new compliance documents
- Creates RSS feeds for easy monitoring and integration
- Archives historical data in a SQLite database
- Provides structured access to EPA regulatory information
The system monitors the EPA Ireland LEAP portal which contains:
- Environmental Licences: Industrial emissions, waste, water discharge permits
- Compliance Records: Monitoring reports, audit results, enforcement actions
- Documents: Annual reports, site visit reports, incident notifications, complaints
- Licence Holders: Companies and organizations with EPA permits
output/csv/daily/YYYY/MM/YYYY-MM-DD.csv
Each CSV contains new compliance documents with columns:
- licence_profile_name: Company/facility name
- document_type: Type of compliance document
- title: Document title (sanitized for CSV)
- leap_url: Direct link to EPA portal
- document_date: When the document was created
- compliance_status: Current status (Open/Closed)
- document_url: API endpoint for document data
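As a minimal sketch of consuming these files, the following Python builds the path of a given day's CSV and groups its rows by document_type. The helper names daily_csv_path and group_by_document_type are hypothetical and not part of the repository; only the path layout and the column names are taken from the description above.

```python
import csv
from collections import defaultdict
from datetime import date
from pathlib import Path


def daily_csv_path(day: date, root: str = "output/csv/daily") -> Path:
    """Build the daily CSV path for `day` (layout: YYYY/MM/YYYY-MM-DD.csv)."""
    return Path(root) / f"{day:%Y}" / f"{day:%m}" / f"{day:%Y-%m-%d}.csv"


def group_by_document_type(csv_lines) -> dict:
    """Group CSV rows by their document_type column.

    `csv_lines` is any iterable of CSV lines, such as an open text file.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        groups[row["document_type"]].append(row)
    return dict(groups)


if __name__ == "__main__":
    path = daily_csv_path(date(2025, 1, 15))
    if path.exists():  # only read the file when it is present in a local checkout
        with path.open(newline="", encoding="utf-8") as handle:
            for doc_type, rows in group_by_document_type(handle).items():
                print(doc_type, len(rows))
```

Opening the file with newline="" lets the csv module handle quoted fields consistently, which matters if a title ever contains embedded line breaks.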
- output/rsstwitter.xml: Recent CSV files (last 30 days)
- output/daily.xml: Recent compliance documents
- epa_ireland.db: SQLite database with complete historical data
- Python 3.8+
- pip package manager
- Git (for automated updates)
- Clone the repository
  git clone https://github.com/EPA-Ireland-Updates-Unofficial/epa_ireland_scraper.git
  cd epa_ireland_scraper
- Set up virtual environment
  python -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
  pip install -r requirements.txt
Full scrape and update:
python scraper.py
Generate CSV for specific date:
python export_to_csv.py 2025-01-15
Generate RSS feeds:
python rss_generator.py --csv-days 30
Regenerate historical CSV files:
python regenerate_csvs.py
The system includes a cron script (cron_scraper.sh) that:
- Runs the scraper to update the database
- Exports today's CSV file
- Updates RSS feeds
- Commits changes to Git
- Pushes updates to GitHub
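The five steps above can be sketched in Python for illustration. The real cron_scraper.sh is a shell script whose contents are not shown here, so this is only an approximation: the function names, the git add target, and the commit message are assumptions, while the three python commands mirror the usage examples in this README.

```python
import subprocess
from datetime import date


def build_pipeline(day: date) -> list:
    """Return the commands for one daily run, in execution order."""
    stamp = day.isoformat()
    return [
        ["python", "scraper.py"],                            # update the database
        ["python", "export_to_csv.py", stamp],               # export the day's CSV
        ["python", "rss_generator.py", "--csv-days", "30"],  # refresh the RSS feeds
        ["git", "add", "output"],                            # assumed: generated files live under output/
        ["git", "commit", "-m", f"Daily update {stamp}"],    # assumed commit message
        ["git", "push"],
    ]


def run_pipeline(day: date) -> None:
    """Run each step in order, raising CalledProcessError on the first failure."""
    for command in build_pipeline(day):
        subprocess.run(command, check=True)
```

Note that git commit exits with a non-zero status when there is nothing to commit, so with check=True a day without new documents would stop this sketch before the push step.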
Set up daily automation:
# Edit crontab
crontab -e
# Add line to run daily at 4:30 AM
30 4 * * * /path/to/epa_ireland_scraper/cron_scraper.sh
File | Purpose |
---|---|
scraper.py | Main scraper that fetches data from EPA API |
export_to_csv.py | Generates daily CSV files with deduplication |
rss_generator.py | Creates RSS feeds from CSV files and database |
cron_scraper.sh | Automated daily execution script |
regenerate_csvs.py | One-off script to rebuild historical CSVs |
requirements.txt | Python package dependencies |
- Default: epa_ireland.db (SQLite)
- Automatically created on first run
- Contains three main tables: licence_profiles, compliance_records, compliance_documents
- Default lookback: 4 days for new documents
- Deduplication: Avoids re-exporting previously exported documents
- Text sanitization: Removes line breaks and formatting issues
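To make the three export settings concrete, here is a rough sketch of how a lookback window, deduplication, and text sanitization could fit together. It is not the code in export_to_csv.py: the function names are hypothetical, and tracking exported documents as an in-memory set of document_id values is an assumption made for the example.

```python
import re
from datetime import date, timedelta


def sanitize(text: str) -> str:
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def select_new_documents(documents, exported_ids, today, lookback_days=4):
    """Pick documents inside the lookback window that were not exported before.

    `documents` is an iterable of dicts with document_id, document_date
    (a datetime.date) and title; `exported_ids` is a set of document_id
    values that earlier exports already covered (assumed bookkeeping).
    """
    cutoff = today - timedelta(days=lookback_days)
    selected = []
    for doc in documents:
        if doc["document_date"] < cutoff:
            continue  # older than the lookback window
        if doc["document_id"] in exported_ids:
            continue  # already written to an earlier CSV
        selected.append({**doc, "title": sanitize(doc["title"])})
    return selected
```

A lookback of several days rather than one gives documents that appear on the portal late a chance to be picked up, and the deduplication step keeps them from being exported twice.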
- CSV feed: Last 30 days of CSV files
- Document feed: Most recent compliance documents
- Update frequency: After each scraper run
Subscribe to the RSS feeds to monitor updates:
- For developers: Use rsstwitter.xml to track when new CSV files are available
- For end users: Use daily.xml to see latest compliance documents
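For programmatic monitoring, a feed can be read with Python's standard library. The sketch below assumes the feeds are plain RSS 2.0 with un-namespaced item, title, and link elements, which this README does not confirm; read_feed_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_feed_items(xml_text: str) -> list:
    """Return (title, link) pairs for every <item> element in an RSS document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


if __name__ == "__main__":
    feed = Path("output/daily.xml")
    if feed.exists():  # only parse the feed when a local checkout provides it
        for title, link in read_feed_items(feed.read_text(encoding="utf-8")):
            print(title, link)
```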
- New CSV files and RSS updates are automatically committed to GitHub
- Watch this repository for notifications when new data is available
- View historical data through GitHub's file browser
The system is modular:
- Modify scraper.py to change data collection
- Update export_to_csv.py to change CSV format
- Extend rss_generator.py for new RSS formats
-- Licence profiles (companies/facilities)
licence_profiles: licenceprofileid, profilenumber, name, status, etc.
-- Compliance records (regulatory activities)
compliance_records: compliancerecord_id, licenceprofileid, type, status, date
-- Documents (reports, monitoring data, incidents)
compliance_documents: document_id, compliance_id, title, document_type, document_date, document_url
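As an illustration of querying this schema, the following sketch joins the three tables to list recent documents with their licence holder. The column names come from the outline above, but two points are assumptions: that compliance_documents.compliance_id refers to compliance_records.compliancerecord_id, and that document_date is stored as ISO 8601 text so that string comparison orders dates correctly.

```python
import sqlite3

# Join documents to their compliance record and licence profile.
# The join keys are assumed from the column names in the schema outline.
QUERY = """
SELECT lp.name, cr.type, cd.title, cd.document_date
FROM compliance_documents AS cd
JOIN compliance_records AS cr ON cr.compliancerecord_id = cd.compliance_id
JOIN licence_profiles AS lp ON lp.licenceprofileid = cr.licenceprofileid
WHERE cd.document_date >= ?
ORDER BY cd.document_date DESC
"""


def recent_documents(conn: sqlite3.Connection, since: str) -> list:
    """Return (licence name, record type, title, date) rows dated on or after `since`."""
    return conn.execute(QUERY, (since,)).fetchall()
```

With a local copy of the database, recent_documents(sqlite3.connect("epa_ireland.db"), "2025-01-01") would run the query. Be aware that sqlite3.connect creates an empty file when the path does not exist, so check for the file first.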
Apache 2.0
This project is unofficial and for educational/research purposes. All EPA data remains subject to EPA terms of use.
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This is an unofficial tool for accessing publicly available EPA Ireland data. Users should:
- Respect EPA's terms of service
- Use data responsibly and ethically
- Verify important information with official EPA sources
- Be mindful of API rate limits
For issues, questions, or feature requests:
- Open a GitHub issue
- Check existing issues for similar problems
- Provide detailed information about your setup and the issue
Last updated: Daily via automated scraper
Data source: EPA Ireland LEAP Portal
Maintained by: Conor O'Neill