initial commit
# Flat Scraper

Automated web scraper for apartment listings on NHG.at, with notifications when new listings appear.

## Features

- 🏠 **Automated scraping** of NHG.at apartment listings
- 📍 **Postal-code-based search** for 1120, 1140, 1150, 1160
- 📊 **CSV storage** for tracking results
- 🔔 **Notifications** for new apartments (console + email)
- 🐳 **Docker support** for Raspberry Pi (ARM64)
- ⏰ **Automated runs** every 6 hours
- 🔐 **Environment variables** for sensitive data (.env)
- 📧 **Email security** (SSL/TLS/STARTTLS support)

## Project Structure

```
flat_scraper/
├── src/
│   ├── scrapers/              # Scraper modules
│   │   ├── base_scraper.py    # Base class
│   │   └── nhg_scraper.py     # NHG.at-specific
│   ├── storage/               # Data storage
│   │   └── csv_storage.py     # CSV-based
│   ├── notifier/              # Notifications
│   │   └── email_notifier.py
│   ├── config/                # Configuration
│   │   └── sites.yaml
│   ├── config_loader.py       # Configuration loader with .env support
│   └── main.py                # Main application
├── data/                      # CSV results
├── .env.example               # Environment template
├── .env                       # Your sensitive data (not in VCS)
├── .gitignore                 # Git ignore for .env and data/
├── requirements.txt           # Python dependencies
├── Dockerfile                 # ARM64-optimized
├── docker-compose.yml         # Automation
└── README.md
```

## Quick Start

### 1. Set Up the Environment

```bash
# Copy the environment template
cp .env.example .env

# Fill in your data
vim .env
```

**Important .env variables:**
```bash
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USERNAME=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
EMAIL_FROM=your-email@gmail.com
EMAIL_TO=recipient@example.com
# Options: none, ssl, tls, starttls
EMAIL_SECURITY=starttls
```

### 2. Docker on Raspberry Pi

```bash
# Build and start
docker-compose up -d

# View logs
docker-compose logs -f flat-scraper

# Start the scheduler (runs automatically every 6 hours)
docker-compose up -d scheduler
```

### 3. Manual Testing

```bash
# One-off run
docker-compose run --rm flat-scraper

# With an environment file
docker run --rm -v "$(pwd)":/app --env-file .env flat-scraper-test python src/main.py
```

## Configuration

### Configuring Sites (`src/config/sites.yaml`)

```yaml
sites:
  nhg:
    name: "Neue Heimat Gewog"
    url: "https://nhg.at/immobilienangebot/wohnungsangebot/"
    scraper_class: "nhg_scraper.NHGScraper"
    enabled: true
    search_params:
      plz_list:
        - "1120 Wien"
        - "1140 Wien"
        - "1150 Wien"
        - "1160 Wien"
    schedule:
      cron: "0 */6 * * *"  # Every 6 hours
      timezone: "Europe/Vienna"
```

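The `cron` value uses the usual five-field syntax, where `*/6` in the hour field selects every sixth hour starting at 0. The following is a minimal sketch of how such a field expands, using only the standard library. `expand_cron_field` is a hypothetical helper for illustration and is not part of the project.

```python
def expand_cron_field(field: str, low: int, high: int) -> list[int]:
    """Expand one cron field (e.g. "*/6", "1-5", "0,30") into sorted values."""
    values = set()
    for part in field.split(","):
        # Split off an optional step, e.g. "*/6" -> ("*", "6")
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = int(base)
            # A single value with a step ("5/10") runs up to the field maximum
            end = high if step_text else start
        values.update(range(start, end + 1, step))
    return sorted(values)


hours = expand_cron_field("*/6", 0, 23)
print(hours)  # [0, 6, 12, 18]
```

Combined with minute `0`, the expression above would correspond to runs at 00:00, 06:00, 12:00 and 18:00, assuming the scheduler interprets the times in the configured `timezone`.
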
### Email Notifications

```yaml
notification:
  email:
    enabled: true
    smtp_server: "${SMTP_SERVER}"
    smtp_port: "${SMTP_PORT}"
    username: "${EMAIL_USERNAME}"
    password: "${EMAIL_PASSWORD}"
    from_email: "${EMAIL_FROM}"
    to_emails:
      - "${EMAIL_TO}"
    security: "${EMAIL_SECURITY:starttls}"  # Options: none, ssl, tls, starttls

  console:
    enabled: true  # Always on for debugging
```

## Data Format

Results are stored as CSV:

```csv
scrape_time,plz,address,link,hash,scraper
2024-01-15T10:30:00,1120,"1120 Wien, Flurschützstraße 5 / 2 / 10",https://...,abc123,nhg
```

**Hash-based comparison** prevents duplicates between runs: a listing whose `hash` is already stored is not reported again.

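The comparison step can be sketched as follows. This is a minimal illustration under assumptions, not the project's `csv_storage.py`: the helper names `load_known_hashes` and `filter_new` are hypothetical, and the `example.com` links are placeholders.

```python
import csv
import io


def load_known_hashes(csv_file) -> set[str]:
    """Collect the `hash` column from previously stored results."""
    return {row["hash"] for row in csv.DictReader(csv_file)}


def filter_new(results: list[dict], known_hashes: set[str]) -> list[dict]:
    """Keep only results whose hash has not been stored before."""
    return [r for r in results if r["hash"] not in known_hashes]


# Stand-in for a file under data/; real code would use open(path, newline="")
stored = io.StringIO(
    "scrape_time,plz,address,link,hash,scraper\n"
    '2024-01-15T10:30:00,1120,"1120 Wien, Flurschützstraße 5 / 2 / 10",'
    "https://example.com/a,abc123,nhg\n"
)
known = load_known_hashes(stored)

scraped = [
    {"plz": "1120", "address": "1120 Wien, Flurschützstraße 5 / 2 / 10",
     "link": "https://example.com/a", "hash": "abc123"},
    {"plz": "1140", "address": "1140 Wien, Example Street 1",
     "link": "https://example.com/b", "hash": "def456"},
]
new_results = filter_new(scraped, known)
print([r["hash"] for r in new_results])  # ['def456']
```

Only the entries returned by `filter_new` would need to be appended to the CSV and passed on to the notifier.
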
## Extensibility

### Adding a New Website

1. **New scraper class** in `src/scrapers/`:
   ```python
   from .base_scraper import BaseScraper

   class NewSiteScraper(BaseScraper):
       async def scrape(self, search_params):
           # Implementation
           pass
   ```

2. **Extend the configuration**:
   ```yaml
   sites:
     new_site:
       name: "New Site"
       url: "https://example.com"
       scraper_class: "new_site_scraper.NewSiteScraper"
       enabled: true
       search_params:
         # Site-specific parameters
   ```

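The `scraper_class` value has the form `module.ClassName`. One plausible way to turn such a string into a class is `importlib`, sketched below. `load_class` and its `package_prefix` parameter are hypothetical, and how the application actually resolves the name is not shown in this README.

```python
import importlib


def load_class(dotted_name: str, package_prefix: str = ""):
    """Resolve "module.ClassName" to a class object.

    `package_prefix` (for example "src.scrapers") is prepended to the module path.
    """
    module_name, _, class_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_name!r}")
    if package_prefix:
        module_name = f"{package_prefix}.{module_name}"
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with standard-library classes so the snippet runs anywhere.
# In the project, the call might look like:
#   load_class("nhg_scraper.NHGScraper", "src.scrapers")
ordered_dict_cls = load_class("collections.OrderedDict")
decoder_cls = load_class("decoder.JSONDecoder", "json")
print(ordered_dict_cls.__name__, decoder_cls.__name__)  # OrderedDict JSONDecoder
```

With a loader like this, adding a site only requires a new module and a config entry, with no changes to the main application.
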
### Environment Variables

The ConfigLoader supports **automatic substitution** of environment variables:
```yaml
# In YAML
smtp_server: "${SMTP_SERVER}"
username: "${EMAIL_USERNAME:default@example.com}"  # With default
```

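A minimal sketch of how `${NAME}` and `${NAME:default}` placeholders could be resolved with a regular expression. `substitute_env` is a hypothetical helper, and raising `KeyError` for a missing variable without a default is an assumption; the actual ConfigLoader may behave differently.

```python
import re

# Matches ${NAME} or ${NAME:default}
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute_env(value: str, environ: dict) -> str:
    """Replace ${NAME} / ${NAME:default} placeholders with values from environ."""

    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _PLACEHOLDER.sub(replace, value)


env = {"SMTP_SERVER": "smtp.gmail.com"}
print(substitute_env("${SMTP_SERVER}", env))              # smtp.gmail.com
print(substitute_env("${EMAIL_SECURITY:starttls}", env))  # starttls
```

In real use, `os.environ` would be passed as `environ` after the `.env` file has been loaded.
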
## Deployment on Raspberry Pi

### ARM64 Support

The Dockerfile is optimized for ARM64:

```dockerfile
FROM python:3.11-slim-bullseye
# ARM64-optimized browser installation
RUN apt-get update && apt-get install -y chromium
```

### Performance Tips

- `--no-sandbox` for Chromium (already handled in the Dockerfile)
- Shared browser path: `PLAYWRIGHT_BROWSERS_PATH=/ms-playwright`
- Memory-optimized settings
- Environment variables instead of hardcoding

### Docker Compose Features

- **Volume mounting**: `./data:/app/data` for persistent CSVs
- **Environment support**: `--env-file .env` for sensitive data
- **Scheduler service**: automatic run every 6 hours
- **Restart policy**: `unless-stopped` for reliability

## Troubleshooting

### Common Problems

1. **Browser does not start**: run `playwright install-deps chromium`
2. **No results**: the postal code has no listings, or the website has changed
3. **Email does not work**: check the SMTP settings and the security mode
4. **Environment not loaded**: check the `.env` file and its permissions

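For the last item, a quick check of which variables actually reached the process can help. The sketch below is a hypothetical helper, not part of the project. The list of required names is an assumption taken from the `.env` example, and `EMAIL_SECURITY` is left out because the YAML gives it a default.

```python
import os

# Assumed to be required, based on the .env example
REQUIRED_VARS = [
    "SMTP_SERVER",
    "SMTP_PORT",
    "EMAIL_USERNAME",
    "EMAIL_PASSWORD",
    "EMAIL_FROM",
    "EMAIL_TO",
]


def missing_env_vars(required, environ=None) -> list:
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this inside the container shows whether the `.env` values were passed through.
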
### Debugging

```bash
# View logs
docker-compose logs -f flat-scraper

# Manual test
docker-compose run --rm flat-scraper python src/main.py

# Email test
docker run --rm -v "$(pwd)":/app --env-file .env flat-scraper-test python -c "
from src.notifier.email_notifier import EmailNotifier
from src.config_loader import ConfigLoader
config = ConfigLoader()
notifier = EmailNotifier(config.get_notification_config()['email'])
test_results = [{'plz': '1120', 'address': 'Test', 'link': '#', 'hash': 'test'}]
notifier.send_notification('test', test_results)
"
```

## Development

### Testing

```bash
# Test a single scraper
python -c "
import asyncio
from src.scrapers.nhg_scraper import NHGScraper
scraper = NHGScraper({'url': 'https://nhg.at/immobilienangebot/wohnungsangebot/', 'search_params': {'plz_list': ['1120 Wien']}})
results = asyncio.run(scraper.scrape())
print(results)
"
```

### Logging

Logs are written automatically:
- Level: `INFO` (can be changed in `sites.yaml`)
- Format: `timestamp - module - level - message`
- Output: console + Docker logs

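The format described above corresponds to a standard `logging` format string. A minimal sketch, assuming the level arrives as a string from the configuration; `configure_logging` is a hypothetical helper, not the project's actual setup code.

```python
import logging

# timestamp - module - level - message
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def configure_logging(level_name: str = "INFO") -> None:
    """Configure root logging to the console with the documented format."""
    # Unknown level names fall back to INFO
    level = getattr(logging, level_name.upper(), logging.INFO)
    # force=True replaces handlers installed by an earlier basicConfig call
    logging.basicConfig(level=level, format=LOG_FORMAT, force=True)


configure_logging("INFO")
logging.getLogger("nhg_scraper").info("Scrape started")
```

Because the handler writes to the console, the same lines appear in `docker-compose logs`.
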
## Security

### Environment Variables

- **`.env`** is not checked into Git (see `.gitignore`)
- **`.env.example`** serves as a template for the team
- **No passwords** in code or YAML
- **Docker secrets** optional for production

### Email Security

Supported security modes:
- **`none`** - no encryption
- **`ssl`** - SMTP_SSL (port 465)
- **`tls`** - explicit TLS (port 587 + STARTTLS)
- **`starttls`** - STARTTLS (standard for Gmail)
- **`ssl/tls`** - compatibility mode

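A minimal sketch of how these modes could map onto Python's `smtplib`. `open_smtp` is a hypothetical function, not the project's `EmailNotifier`. The `ssl/tls` compatibility mode is left out because this README does not describe its behavior.

```python
import smtplib
import ssl


def open_smtp(server: str, port: int, security: str = "starttls",
              timeout: float = 30.0) -> smtplib.SMTP:
    """Open an SMTP connection according to the configured security mode."""
    mode = security.lower()
    context = ssl.create_default_context()
    if mode == "ssl":
        # Implicit TLS from the first byte, typically port 465
        return smtplib.SMTP_SSL(server, port, timeout=timeout, context=context)
    if mode in ("tls", "starttls"):
        # Plain connection upgraded via STARTTLS, typically port 587
        connection = smtplib.SMTP(server, port, timeout=timeout)
        connection.starttls(context=context)
        return connection
    if mode == "none":
        # Unencrypted; only sensible on a trusted local network
        return smtplib.SMTP(server, port, timeout=timeout)
    raise ValueError(f"unsupported security mode: {security!r}")


# Usage (not executed here, since it needs a reachable server and credentials):
#   with open_smtp("smtp.gmail.com", 587, "starttls") as smtp:
#       smtp.login(username, password)
#       smtp.send_message(message)
```

Using `ssl.create_default_context()` keeps certificate verification enabled for both encrypted modes.
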
## Architecture

### Hybrid Approach

- **BaseScraper**: shared functionality (hashing, metadata)
- **Site-specific scrapers**: individual implementations
- **Config-driven**: YAML configuration with environment support
- **Modular**: storage and notifier are interchangeable

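The split between a shared base class and site-specific subclasses might look roughly like the sketch below. This is an illustration under assumptions, not the project's `base_scraper.py`. The method names `make_hash` and `with_metadata`, the fields used for the hash, and `DemoScraper` are all hypothetical.

```python
import asyncio
import hashlib
from abc import ABC, abstractmethod
from datetime import datetime


class BaseScraper(ABC):
    """Shared behaviour: hashing and metadata. Subclasses implement scrape()."""

    name = "base"

    def __init__(self, config: dict):
        self.config = config

    @staticmethod
    def make_hash(address: str, link: str) -> str:
        # Stable fingerprint of a listing; the fields used are an assumption
        digest = hashlib.sha256(f"{address}|{link}".encode("utf-8")).hexdigest()
        return digest[:12]

    def with_metadata(self, item: dict) -> dict:
        # Add the bookkeeping columns that appear in the CSV format
        return {
            **item,
            "hash": self.make_hash(item["address"], item["link"]),
            "scraper": self.name,
            "scrape_time": datetime.now().isoformat(timespec="seconds"),
        }

    @abstractmethod
    async def scrape(self, search_params: dict) -> list[dict]:
        ...


class DemoScraper(BaseScraper):
    """Hypothetical scraper returning a fixed listing instead of fetching a page."""

    name = "demo"

    async def scrape(self, search_params: dict) -> list[dict]:
        raw = [{"plz": "1120", "address": "1120 Wien, Example Street 1",
                "link": "https://example.com/1"}]
        return [self.with_metadata(item) for item in raw]


results = asyncio.run(DemoScraper({}).scrape({"plz_list": ["1120 Wien"]}))
print(results[0]["scraper"], results[0]["hash"])
```

Keeping hashing in the base class means every site produces comparable hashes, so storage and notification code does not need to know which scraper produced a result.
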
### Data Flow

```
Config       →  Scraper     →  Results  →  Storage  →  Comparison        →  Notifier
  ↓               ↓                          ↓           ↓                    ↓
Environment     Playwright                 CSV         Hash comparison      Email/Console
```

## License

MIT License