initial commit
This commit is contained in:
14
.env.example
Normal file
14
.env.example
Normal file
@@ -0,0 +1,14 @@
|
|||||||
|
# Email Configuration
|
||||||
|
SMTP_SERVER=smtp.gmail.com
|
||||||
|
SMTP_PORT=587
|
||||||
|
EMAIL_USERNAME=your-email@gmail.com
|
||||||
|
EMAIL_PASSWORD=your-app-password
|
||||||
|
EMAIL_FROM=your-email@gmail.com
|
||||||
|
EMAIL_TO=recipient@example.com
|
||||||
|
|
||||||
|
# Email Security (optional)
|
||||||
|
# Options: none, ssl, tls, starttls
|
||||||
|
EMAIL_SECURITY=starttls
|
||||||
|
|
||||||
|
# Optional: Multiple recipients (comma-separated)
|
||||||
|
# EMAIL_TO=recipient1@example.com,recipient2@example.com
|
||||||
42
.gitignore
vendored
Normal file
42
.gitignore
vendored
Normal file
@@ -0,0 +1,42 @@
|
|||||||
|
# Python
|
||||||
|
__pycache__
|
||||||
|
*.pyc
|
||||||
|
*.pyo
|
||||||
|
*.pyd
|
||||||
|
.Python
|
||||||
|
env
|
||||||
|
pip-log.txt
|
||||||
|
pip-delete-this-directory.txt
|
||||||
|
.tox
|
||||||
|
.coverage
|
||||||
|
.coverage.*
|
||||||
|
.cache
|
||||||
|
nosetests.xml
|
||||||
|
coverage.xml
|
||||||
|
*.cover
|
||||||
|
*.log
|
||||||
|
.git
|
||||||
|
.mypy_cache
|
||||||
|
.pytest_cache
|
||||||
|
.hypothesis
|
||||||
|
|
||||||
|
# IDE
|
||||||
|
.vscode
|
||||||
|
.idea
|
||||||
|
*.swp
|
||||||
|
*.swo
|
||||||
|
|
||||||
|
# OS
|
||||||
|
.DS_Store
|
||||||
|
Thumbs.db
|
||||||
|
|
||||||
|
# Local data
|
||||||
|
data/*.csv
|
||||||
|
|
||||||
|
# Environment variables
|
||||||
|
.env
|
||||||
|
.env.local
|
||||||
|
.env.production
|
||||||
|
|
||||||
|
# Docker
|
||||||
|
.dockerignore
|
||||||
38
Dockerfile
Normal file
38
Dockerfile
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
# Use Python 3.11 slim image for ARM64 compatibility
|
||||||
|
FROM python:3.11-slim-bullseye
|
||||||
|
|
||||||
|
# Set environment variables
|
||||||
|
ENV PYTHONUNBUFFERED=1
|
||||||
|
ENV PYTHONDONTWRITEBYTECODE=1
|
||||||
|
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
|
||||||
|
|
||||||
|
# Install system dependencies
|
||||||
|
RUN apt-get update && apt-get install -y \
|
||||||
|
chromium \
|
||||||
|
wget \
|
||||||
|
gnupg \
|
||||||
|
&& rm -rf /var/lib/apt/lists/*
|
||||||
|
|
||||||
|
# Set working directory
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
# Copy requirements and install Python dependencies
|
||||||
|
COPY requirements.txt .
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
|
# Install Playwright browsers
|
||||||
|
RUN playwright install chromium
|
||||||
|
RUN playwright install-deps chromium
|
||||||
|
|
||||||
|
# Copy application code
|
||||||
|
COPY src/ ./src/
|
||||||
|
|
||||||
|
# Create data directory
|
||||||
|
RUN mkdir -p data
|
||||||
|
|
||||||
|
# Create non-root user
|
||||||
|
RUN useradd -m -u 1000 scraper && chown -R scraper:scraper /app
|
||||||
|
USER scraper
|
||||||
|
|
||||||
|
# Default command
|
||||||
|
CMD ["python", "src/main.py"]
|
||||||
289
README.md
Normal file
289
README.md
Normal file
@@ -0,0 +1,289 @@
|
|||||||
|
# Flat Scraper
|
||||||
|
|
||||||
|
Automatischer Web Scraper für Wohnungsangebote auf NHG.at mit Benachrichtigungen bei neuen Ergebnissen.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- 🏠 **Automatisches Scraping** von NHG.at Wohnungsangeboten
|
||||||
|
- 📍 **PLZ-basierte Suche** für 1120, 1140, 1150, 1160
|
||||||
|
- 📊 **CSV Storage** für Ergebnisverfolgung
|
||||||
|
- 🔔 **Benachrichtigungen** bei neuen Wohnungen (Console + Email)
|
||||||
|
- 🐳 **Docker Support** für Raspberry Pi (ARM64)
|
||||||
|
- ⏰ **Automatisierte Ausführung** alle 6 Stunden
|
||||||
|
- 🔐 **Environment Variables** für sensitive Daten (.env)
|
||||||
|
- 📧 **Email Security** (SSL/TLS/STARTTLS Support)
|
||||||
|
|
||||||
|
## Projektstruktur
|
||||||
|
|
||||||
|
```
|
||||||
|
flat_scraper/
|
||||||
|
├── src/
|
||||||
|
│ ├── scrapers/ # Scraper Module
|
||||||
|
│ │ ├── base_scraper.py # Basis-Klasse
|
||||||
|
│ │ └── nhg_scraper.py # NHG.at spezifisch
|
||||||
|
│ ├── storage/ # Daten-Speicher
|
||||||
|
│ │ └── csv_storage.py # CSV-basiert
|
||||||
|
│ ├── notifier/ # Benachrichtigungen
|
||||||
|
│ │ └── email_notifier.py
|
||||||
|
│ ├── config/ # Konfiguration
|
||||||
|
│ │ └── sites.yaml
|
||||||
|
│ ├── config_loader.py # Konfigurations-Loader mit .env Support
|
||||||
|
│ └── main.py # Hauptanwendung
|
||||||
|
├── data/ # CSV Ergebnisse
|
||||||
|
├── .env.example # Environment Vorlage
|
||||||
|
├── .env # Deine sensitiven Daten (nicht in VCS)
|
||||||
|
├── .gitignore # Git ignore für .env und data/
|
||||||
|
├── requirements.txt # Python Dependencies
|
||||||
|
├── Dockerfile # ARM64 optimiert
|
||||||
|
├── docker-compose.yml # Automatisierung
|
||||||
|
└── README.md
|
||||||
|
```
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
### 1. Environment einrichten
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Environment Vorlage kopieren
|
||||||
|
cp .env.example .env
|
||||||
|
|
||||||
|
# Deine Daten eintragen
|
||||||
|
vim .env
|
||||||
|
```
|
||||||
|
|
||||||
|
**Wichtige .env Variablen:**
|
||||||
|
```bash
|
||||||
|
SMTP_SERVER=smtp.gmail.com
|
||||||
|
SMTP_PORT=587
|
||||||
|
EMAIL_USERNAME=deine-email@gmail.com
|
||||||
|
EMAIL_PASSWORD=dein-app-password
|
||||||
|
EMAIL_FROM=deine-email@gmail.com
|
||||||
|
EMAIL_TO=empfänger@example.com
|
||||||
|
EMAIL_SECURITY=starttls # Options: none, ssl, tls, starttls
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Docker auf Raspberry Pi
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Build und Start
|
||||||
|
docker-compose up -d
|
||||||
|
|
||||||
|
# Logs ansehen
|
||||||
|
docker-compose logs -f flat-scraper
|
||||||
|
|
||||||
|
# Scheduler starten (automatisch alle 6 Stunden)
|
||||||
|
docker-compose up -d scheduler
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Manuelles Testen
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Einmaliger Lauf
|
||||||
|
docker-compose run --rm flat-scraper
|
||||||
|
|
||||||
|
# Mit Environment File
|
||||||
|
docker run --rm -v $(pwd):/app --env-file .env flat-scraper-test python src/main.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Konfiguration
|
||||||
|
|
||||||
|
### Sites konfigurieren (`src/config/sites.yaml`)
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
sites:
|
||||||
|
nhg:
|
||||||
|
name: "Neue Heimat Gewog"
|
||||||
|
url: "https://nhg.at/immobilienangebot/wohnungsangebot/"
|
||||||
|
scraper_class: "nhg_scraper.NHGScraper"
|
||||||
|
enabled: true
|
||||||
|
search_params:
|
||||||
|
plz_list:
|
||||||
|
- "1120 Wien"
|
||||||
|
- "1140 Wien"
|
||||||
|
- "1150 Wien"
|
||||||
|
- "1160 Wien"
|
||||||
|
schedule:
|
||||||
|
cron: "0 */6 * * *" # Alle 6 Stunden
|
||||||
|
timezone: "Europe/Vienna"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Email-Benachrichtigungen
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
notification:
|
||||||
|
email:
|
||||||
|
enabled: true
|
||||||
|
smtp_server: "${SMTP_SERVER}"
|
||||||
|
smtp_port: "${SMTP_PORT}"
|
||||||
|
username: "${EMAIL_USERNAME}"
|
||||||
|
password: "${EMAIL_PASSWORD}"
|
||||||
|
from_email: "${EMAIL_FROM}"
|
||||||
|
to_emails:
|
||||||
|
- "${EMAIL_TO}"
|
||||||
|
security: "${EMAIL_SECURITY:starttls}" # Options: none, ssl, tls, starttls
|
||||||
|
|
||||||
|
console:
|
||||||
|
enabled: true # Immer für Debugging
|
||||||
|
```
|
||||||
|
|
||||||
|
## Datenformat
|
||||||
|
|
||||||
|
Ergebnisse werden als CSV gespeichert:
|
||||||
|
|
||||||
|
```csv
|
||||||
|
scrape_time,plz,address,link,hash,scraper
|
||||||
|
2024-01-15T10:30:00,1120,"1120 Wien, Flurschützstraße 5 / 2 / 10",https://...,abc123,nhg
|
||||||
|
```
|
||||||
|
|
||||||
|
**Hash-basierter Vergleich** vermeidet Duplikate zwischen Läufen.
|
||||||
|
|
||||||
|
## Erweiterbarkeit
|
||||||
|
|
||||||
|
### Neue Webseite hinzufügen
|
||||||
|
|
||||||
|
1. **Neue Scraper-Klasse** in `src/scrapers/`:
|
||||||
|
```python
|
||||||
|
from .base_scraper import BaseScraper
|
||||||
|
|
||||||
|
class NewSiteScraper(BaseScraper):
|
||||||
|
async def scrape(self, search_params):
|
||||||
|
# Implementierung
|
||||||
|
pass
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Konfiguration erweitern**:
|
||||||
|
```yaml
|
||||||
|
sites:
|
||||||
|
new_site:
|
||||||
|
name: "New Site"
|
||||||
|
url: "https://example.com"
|
||||||
|
scraper_class: "new_site_scraper.NewSiteScraper"
|
||||||
|
enabled: true
|
||||||
|
search_params:
|
||||||
|
# Site-spezifische Parameter
|
||||||
|
```
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
|
||||||
|
Der ConfigLoader unterstützt **automatische Substitution**:
|
||||||
|
```yaml
|
||||||
|
# In YAML
|
||||||
|
smtp_server: "${SMTP_SERVER}"
|
||||||
|
username: "${EMAIL_USERNAME:default@example.com}" # Mit Default
|
||||||
|
```
|
||||||
|
|
||||||
|
## Deployment auf Raspberry Pi
|
||||||
|
|
||||||
|
### ARM64 Support
|
||||||
|
|
||||||
|
Der Dockerfile ist für ARM64 optimiert:
|
||||||
|
|
||||||
|
```dockerfile
|
||||||
|
FROM python:3.11-slim-bullseye
|
||||||
|
# ARM64 optimierte Browser Installation
|
||||||
|
RUN apt-get update && apt-get install -y chromium
|
||||||
|
```
|
||||||
|
|
||||||
|
### Performance-Tipps
|
||||||
|
|
||||||
|
- `--no-sandbox` für Chromium (im Dockerfile berücksichtigt)
|
||||||
|
- Shared Browser Path: `PLAYWRIGHT_BROWSERS_PATH=/ms-playwright`
|
||||||
|
- Memory-optimierte Settings
|
||||||
|
- Environment Variables statt Hardcoding
|
||||||
|
|
||||||
|
### Docker Compose Features
|
||||||
|
|
||||||
|
- **Volume Mounting**: `./data:/app/data` für persistente CSVs
|
||||||
|
- **Environment Support**: `--env-file .env` für sensitive Daten
|
||||||
|
- **Scheduler Service**: Automatische Ausführung alle 6 Stunden
|
||||||
|
- **Restart Policy**: `unless-stopped` für Zuverlässigkeit
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Häufige Probleme
|
||||||
|
|
||||||
|
1. **Browser startet nicht**: `playwright install-deps chromium`
|
||||||
|
2. **Keine Ergebnisse**: PLZ nicht verfügbar oder Website geändert
|
||||||
|
3. **Email funktioniert nicht**: SMTP-Einstellungen und Security prüfen
|
||||||
|
4. **Environment nicht geladen**: `.env` Datei prüfen und Rechte
|
||||||
|
|
||||||
|
### Debugging
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Logs ansehen
|
||||||
|
docker-compose logs -f flat-scraper
|
||||||
|
|
||||||
|
# Manuell testen
|
||||||
|
docker-compose run --rm flat-scraper python src/main.py
|
||||||
|
|
||||||
|
# Email Test
|
||||||
|
docker run --rm -v $(pwd):/app --env-file .env flat-scraper-test python -c "
|
||||||
|
from src.notifier.email_notifier import EmailNotifier
|
||||||
|
from src.config_loader import ConfigLoader
|
||||||
|
config = ConfigLoader()
|
||||||
|
notifier = EmailNotifier(config.get_notification_config()['email'])
|
||||||
|
test_results = [{'plz': '1120', 'address': 'Test', 'link': '#', 'hash': 'test'}]
|
||||||
|
notifier.send_notification('test', test_results)
|
||||||
|
"
|
||||||
|
```
|
||||||
|
|
||||||
|
## Entwicklung
|
||||||
|
|
||||||
|
### Testing
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Einzelnen Scraper testen
|
||||||
|
python -c "
|
||||||
|
import asyncio
|
||||||
|
from src.scrapers.nhg_scraper import NHGScraper
|
||||||
|
scraper = NHGScraper({'url': 'https://nhg.at/immobilienangebot/wohnungsangebot/', 'search_params': {'plz_list': ['1120 Wien']}})
|
||||||
|
results = asyncio.run(scraper.scrape())
|
||||||
|
print(results)
|
||||||
|
"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Logging
|
||||||
|
|
||||||
|
Logs werden automatisch geschrieben:
|
||||||
|
- Level: `INFO` (kann in `sites.yaml` angepasst werden)
|
||||||
|
- Format: `Zeitstempel - Modul - Level - Nachricht`
|
||||||
|
- Output: Console + Docker Logs
|
||||||
|
|
||||||
|
## Sicherheit
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
|
||||||
|
- **`.env`** wird nicht in Git eingecheckt (siehe `.gitignore`)
|
||||||
|
- **`.env.example`** als Vorlage für das Team
|
||||||
|
- **Keine Passwörter** im Code oder in YAML
|
||||||
|
- **Docker Secrets** optional für Production
|
||||||
|
|
||||||
|
### Email Security
|
||||||
|
|
||||||
|
Unterstützte Security Modi:
|
||||||
|
- **`none`** - Keine Verschlüsselung
|
||||||
|
- **`ssl`** - SMTP_SSL (Port 465)
|
||||||
|
- **`tls`** - Explicit TLS (Port 587 + STARTTLS)
|
||||||
|
- **`starttls`** - STARTTLS (Standard für Gmail)
|
||||||
|
- **`ssl/tls`** - Kompatibilitätsmodus
|
||||||
|
|
||||||
|
## Architektur
|
||||||
|
|
||||||
|
### Hybrid-Ansatz
|
||||||
|
|
||||||
|
- **BaseScraper**: Gemeinsame Funktionalität (Hashing, Metadata)
|
||||||
|
- **Site-spezifische Scraper**: Individuelle Implementierungen
|
||||||
|
- **Config-Driven**: YAML Konfiguration mit Environment Support
|
||||||
|
- **Modular**: Storage und Notifier austauschbar
|
||||||
|
|
||||||
|
### Datenfluss
|
||||||
|
|
||||||
|
```
|
||||||
|
Config → Scraper → Results → Storage → Comparison → Notifier
|
||||||
|
↓ ↓ ↓ ↓ ↓
|
||||||
|
Environment Playwright CSV Hash-Vergleich Email/Console
|
||||||
|
```
|
||||||
|
|
||||||
|
## Lizenz
|
||||||
|
|
||||||
|
MIT License
|
||||||
48
docker-compose.yml
Normal file
48
docker-compose.yml
Normal file
@@ -0,0 +1,48 @@
|
|||||||
|
version: '3.8'
|
||||||
|
|
||||||
|
services:
|
||||||
|
flat-scraper:
|
||||||
|
build: .
|
||||||
|
restart: unless-stopped
|
||||||
|
volumes:
|
||||||
|
- ./data:/app/data
|
||||||
|
- ./src/config:/app/src/config
|
||||||
|
- ./.env:/app/.env:ro
|
||||||
|
environment:
|
||||||
|
- TZ=Europe/Vienna
|
||||||
|
# For ARM64 (Raspberry Pi) - uncomment if needed
|
||||||
|
# platform: linux/arm64
|
||||||
|
|
||||||
|
# Optional: Add a scheduler service for automated runs
|
||||||
|
scheduler:
|
||||||
|
build: .
|
||||||
|
restart: unless-stopped
|
||||||
|
volumes:
|
||||||
|
- ./data:/app/data
|
||||||
|
- ./src/config:/app/src/config
|
||||||
|
- ./.env:/app/.env:ro
|
||||||
|
environment:
|
||||||
|
- TZ=Europe/Vienna
|
||||||
|
command: >
|
||||||
|
python -c "
|
||||||
|
import schedule
|
||||||
|
import time
|
||||||
|
from src.main import FlatScraper
|
||||||
|
|
||||||
|
def run_scraper():
|
||||||
|
scraper = FlatScraper()
|
||||||
|
scraper.run_once()
|
||||||
|
|
||||||
|
# Schedule every 6 hours
|
||||||
|
schedule.every(6).hours.do(run_scraper)
|
||||||
|
|
||||||
|
print('Scheduler started. Running every 6 hours.')
|
||||||
|
run_scraper() # Run immediately
|
||||||
|
|
||||||
|
while True:
|
||||||
|
schedule.run_pending()
|
||||||
|
time.sleep(60)
|
||||||
|
"
|
||||||
|
# platform: linux/arm64
|
||||||
|
depends_on:
|
||||||
|
- flat-scraper
|
||||||
5
requirements.txt
Normal file
5
requirements.txt
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
playwright==1.40.0
|
||||||
|
pandas==2.1.4
|
||||||
|
pydantic==2.5.2
|
||||||
|
pyyaml==6.0.1
|
||||||
|
apscheduler==3.10.4
|
||||||
0
src/config/__init__.py
Normal file
0
src/config/__init__.py
Normal file
38
src/config/sites.yaml
Normal file
38
src/config/sites.yaml
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
sites:
|
||||||
|
nhg:
|
||||||
|
name: "Neue Heimat Gewog"
|
||||||
|
url: "https://nhg.at/immobilienangebot/wohnungsangebot/"
|
||||||
|
scraper_class: "nhg_scraper.NHGScraper"
|
||||||
|
enabled: true
|
||||||
|
search_params:
|
||||||
|
plz_list:
|
||||||
|
- "1120 Wien"
|
||||||
|
- "1140 Wien"
|
||||||
|
- "1150 Wien"
|
||||||
|
- "1160 Wien"
|
||||||
|
schedule:
|
||||||
|
cron: "0 */6 * * *" # Alle 6 Stunden
|
||||||
|
timezone: "Europe/Vienna"
|
||||||
|
|
||||||
|
# Email notification settings
|
||||||
|
notification:
|
||||||
|
email:
|
||||||
|
enabled: true # Set to true to enable email notifications
|
||||||
|
smtp_server: "${SMTP_SERVER}"
|
||||||
|
smtp_port: "${SMTP_PORT}"
|
||||||
|
username: "${EMAIL_USERNAME}"
|
||||||
|
password: "${EMAIL_PASSWORD}"
|
||||||
|
from_email: "${EMAIL_FROM}"
|
||||||
|
to_emails:
|
||||||
|
- "${EMAIL_TO}"
|
||||||
|
security: "${EMAIL_SECURITY:starttls}" # Options: none, ssl, tls, starttls
|
||||||
|
|
||||||
|
console:
|
||||||
|
enabled: true # Always enabled for debugging
|
||||||
|
|
||||||
|
# General settings
|
||||||
|
general:
|
||||||
|
data_dir: "data"
|
||||||
|
log_level: "INFO"
|
||||||
|
max_retries: 3
|
||||||
|
retry_delay: 5 # seconds
|
||||||
69
src/config_loader.py
Normal file
69
src/config_loader.py
Normal file
@@ -0,0 +1,69 @@
|
|||||||
|
import yaml
|
||||||
|
import os
|
||||||
|
from typing import Dict, Any
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
class ConfigLoader:
    """Load the YAML site configuration and resolve ``${VAR}`` placeholders.

    Environment-variable substitution supports an optional default via the
    ``${VAR:default}`` syntax.  Substitution applies only to strings that
    consist entirely of a single placeholder.
    """

    def __init__(self, config_path: str = "src/config/sites.yaml"):
        # Path to the YAML configuration file.
        self.config_path = Path(config_path)
        # Fully loaded configuration with env vars already substituted.
        self.config = self._load_config()

    def _load_config(self) -> Dict[str, Any]:
        """Load configuration from the YAML file.

        Raises:
            FileNotFoundError: if the configuration file does not exist.
            ValueError: if the file cannot be read or parsed; the original
                error is preserved as ``__cause__``.
        """
        if not self.config_path.exists():
            raise FileNotFoundError(f"Konfigurationsdatei nicht gefunden: {self.config_path}")

        try:
            with open(self.config_path, 'r', encoding='utf-8') as file:
                config = yaml.safe_load(file)
            return self._substitute_env_vars(config)
        except Exception as e:
            # Chain the original exception so the root cause stays visible
            # (the previous code dropped it).
            raise ValueError(f"Fehler beim Laden der Konfiguration: {e}") from e

    def _substitute_env_vars(self, config: Any) -> Any:
        """Recursively substitute environment variables in the configuration.

        Strings of the exact form ``${VAR}`` or ``${VAR:default}`` are
        replaced by ``os.getenv(VAR, default)``; containers are processed
        recursively; every other value is returned unchanged.
        """
        if isinstance(config, dict):
            return {key: self._substitute_env_vars(value) for key, value in config.items()}
        elif isinstance(config, list):
            return [self._substitute_env_vars(item) for item in config]
        elif isinstance(config, str) and config.startswith('${') and config.endswith('}'):
            # Extract environment variable name (and optional default).
            env_var = config[2:-1]
            default_value = None

            # Handle default values (e.g., ${VAR:default}); split only on
            # the first ':' so the default itself may contain colons.
            if ':' in env_var:
                env_var, default_value = env_var.split(':', 1)

            return os.getenv(env_var, default_value)
        else:
            return config

    def get_sites(self) -> Dict[str, Any]:
        """Get all site configurations (empty dict when none are defined)."""
        return self.config.get('sites', {})

    def get_site_config(self, site_name: str) -> Dict[str, Any]:
        """Get configuration for a specific site.

        Raises:
            ValueError: if the site is not present in the configuration.
        """
        sites = self.get_sites()
        if site_name not in sites:
            raise ValueError(f"Site '{site_name}' nicht in Konfiguration gefunden")
        return sites[site_name]

    def get_notification_config(self) -> Dict[str, Any]:
        """Get the notification configuration section."""
        return self.config.get('notification', {})

    def get_general_config(self) -> Dict[str, Any]:
        """Get the general configuration section."""
        return self.config.get('general', {})

    def is_site_enabled(self, site_name: str) -> bool:
        """Check whether a site exists and is enabled (defaults to True)."""
        try:
            return self.get_site_config(site_name).get('enabled', True)
        except ValueError:
            # Unknown sites are reported as disabled rather than raising.
            return False
|
||||||
130
src/main.py
Normal file
130
src/main.py
Normal file
@@ -0,0 +1,130 @@
|
|||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Dict, List, Any
|
||||||
|
import importlib
|
||||||
|
|
||||||
|
from config_loader import ConfigLoader
|
||||||
|
from storage.csv_storage import CSVStorage
|
||||||
|
from notifier.email_notifier import EmailNotifier, ConsoleNotifier
|
||||||
|
|
||||||
|
# Setup logging
|
||||||
|
# Configure root logging once at import time so every module in the
# application shares the same timestamped output format.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Module-level logger named after this module.
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class FlatScraper:
    """Main application class.

    Orchestrates the full pipeline: load configuration, run each enabled
    site's scraper, persist results to CSV storage, diff against the
    previous run, and dispatch notifications for anything new.
    """

    def __init__(self, config_path: str = "src/config/sites.yaml"):
        self.config_loader = ConfigLoader(config_path)
        # CSV files are written under the configured data directory.
        self.storage = CSVStorage(self.config_loader.get_general_config().get('data_dir', 'data'))
        self.notifiers = self._setup_notifiers()
        # Lazily-created scraper instances, keyed by site name.
        self.scrapers = {}

    def _setup_notifiers(self) -> List:
        """Setup notification systems.

        Returns a list of notifier objects: the console notifier is on by
        default; the email notifier only when explicitly enabled in config.
        """
        notifiers = []
        notification_config = self.config_loader.get_notification_config()

        # Console notifier (always enabled for debugging)
        console_config = notification_config.get('console', {})
        if console_config.get('enabled', True):
            notifiers.append(ConsoleNotifier())

        # Email notifier (opt-in: disabled unless config enables it)
        email_config = notification_config.get('email', {})
        if email_config.get('enabled', False):
            notifiers.append(EmailNotifier(email_config))

        return notifiers

    def _get_scraper_class(self, scraper_class_path: str):
        """Dynamically import a scraper class from a 'module.ClassName' path.

        The module is resolved relative to the local ``scrapers`` package.
        """
        module_name, class_name = scraper_class_path.rsplit('.', 1)
        module = importlib.import_module(f'scrapers.{module_name}')
        return getattr(module, class_name)

    def _get_scraper(self, site_name: str):
        """Get or create (and cache) the scraper instance for a site."""
        if site_name not in self.scrapers:
            site_config = self.config_loader.get_site_config(site_name)
            scraper_class_path = site_config.get('scraper_class')
            scraper_class = self._get_scraper_class(scraper_class_path)
            self.scrapers[site_name] = scraper_class(site_config)

        return self.scrapers[site_name]

    async def scrape_site(self, site_name: str) -> Dict[str, Any]:
        """Scrape a single site and return a summary dict.

        On success the dict holds 'site', 'total_results', 'new_results'
        and 'success': True; on failure 'site', 'error' and
        'success': False.  All exceptions are caught here so one failing
        site cannot abort the whole run.
        """
        try:
            logger.info(f"Start scraping {site_name}")
            scraper = self._get_scraper(site_name)
            # NOTE(review): called without search_params although
            # BaseScraper.scrape declares one — presumably the concrete
            # scrapers read search params from their config; confirm the
            # signatures actually match.
            results = await scraper.scrape()

            logger.info(f"Found {len(results)} results for {site_name}")

            # Compare with previous results (hash-based diff in storage)
            new_results, removed_results = self.storage.compare_results(site_name, results)

            # Save results
            self.storage.save_results(site_name, results)

            # Send notifications for new results; a failing notifier is
            # logged but does not stop the remaining notifiers.
            if new_results:
                logger.info(f"Found {len(new_results)} new results for {site_name}")
                for notifier in self.notifiers:
                    try:
                        notifier.send_notification(site_name, new_results)
                    except Exception as e:
                        logger.error(f"Error sending notification: {e}")
            else:
                logger.info(f"No new results for {site_name}")

            return {
                'site': site_name,
                'total_results': len(results),
                'new_results': len(new_results),
                'success': True
            }

        except Exception as e:
            logger.error(f"Error scraping {site_name}: {e}")
            return {
                'site': site_name,
                'error': str(e),
                'success': False
            }

    async def scrape_all_sites(self) -> List[Dict[str, Any]]:
        """Scrape all enabled sites sequentially and collect their summaries."""
        results = []
        sites = self.config_loader.get_sites()

        for site_name in sites.keys():
            if self.config_loader.is_site_enabled(site_name):
                result = await self.scrape_site(site_name)
                results.append(result)

        return results

    def run_once(self) -> None:
        """Run one complete scraping pass and log a summary line."""
        logger.info("Starting flat scraper run")
        results = asyncio.run(self.scrape_all_sites())

        # Summary
        successful = sum(1 for r in results if r['success'])
        total_new = sum(r.get('new_results', 0) for r in results)

        logger.info(f"Scraping completed: {successful}/{len(results)} sites successful, {total_new} new results")
|
||||||
|
|
||||||
|
def main():
    """Main entry point: build the application and run one scrape cycle."""
    FlatScraper().run_once()


if __name__ == "__main__":
    main()
|
||||||
0
src/notifier/__init__.py
Normal file
0
src/notifier/__init__.py
Normal file
135
src/notifier/email_notifier.py
Normal file
135
src/notifier/email_notifier.py
Normal file
@@ -0,0 +1,135 @@
|
|||||||
|
import smtplib
|
||||||
|
from email.mime.text import MIMEText
|
||||||
|
from email.mime.multipart import MIMEMultipart
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
import os
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
class EmailNotifier:
    """Email notification system.

    Sends an HTML summary of newly found flats via SMTP.  The connection
    security mode is configurable: 'none', 'ssl', 'tls', 'starttls' or
    'ssl/tls'.
    """

    def __init__(self, config: Dict):
        self.smtp_server = config.get('smtp_server', 'localhost')
        # Fall back to 587 when the port is missing, empty or None (an
        # unset environment variable substitutes to None and would have
        # crashed int()).
        self.smtp_port = int(config.get('smtp_port') or 587)
        self.username = config.get('username', '')
        self.password = config.get('password', '')
        self.from_email = config.get('from_email', self.username)
        self.security = config.get('security', 'starttls')

        # Handle to_emails - can be a single string or a list of strings.
        to_emails = config.get('to_emails', [])
        if isinstance(to_emails, str):
            self.to_emails = [to_emails]
        else:
            self.to_emails = to_emails

    def send_notification(self, scraper_name: str, new_results: List[Dict]) -> bool:
        """Send an email notification for new results.

        Returns True on success (or when there is nothing to send),
        False when no recipients are configured or sending fails.
        """
        if not new_results:
            return True

        if not self.to_emails:
            print("Keine Empfänger-Emails konfiguriert")
            return False

        server = None
        try:
            # Create message
            msg = MIMEMultipart()
            msg['From'] = self.from_email
            msg['To'] = ', '.join(self.to_emails)
            msg['Subject'] = f"Neue Wohnungen gefunden: {len(new_results)} neue Ergebnisse für {scraper_name}"

            # Create email body
            body = self._create_email_body(scraper_name, new_results)
            msg.attach(MIMEText(body, 'html'))

            # Open the connection with the configured security mode.
            if self.security in ['ssl', 'ssl/tls']:
                server = smtplib.SMTP_SSL(self.smtp_server, self.smtp_port)
            else:
                server = smtplib.SMTP(self.smtp_server, self.smtp_port)

            # Upgrade plaintext connections where requested.
            if self.security in ['tls', 'starttls']:
                server.starttls()

            # Login if credentials provided
            if self.username and self.password:
                server.login(self.username, self.password)

            server.send_message(msg)

            print(f"Email-Benachrichtigung gesendet an {len(self.to_emails)} Empfänger")
            return True

        except Exception as e:
            print(f"Fehler beim Senden der Email: {e}")
            return False

        finally:
            # Always close the SMTP connection, even when sending failed
            # (the previous code leaked the socket on any exception).
            if server is not None:
                try:
                    server.quit()
                except Exception:
                    pass

    def _create_email_body(self, scraper_name: str, new_results: List[Dict]) -> str:
        """Create the HTML email body listing all new results in a table."""
        html = f"""
        <html>
        <body>
            <h2>🏠 Neue Wohnungen gefunden - {scraper_name}</h2>
            <p><strong>Zeitpunkt:</strong> {datetime.now().strftime('%d.%m.%Y %H:%M')}</p>
            <p><strong>Anzahl neuer Ergebnisse:</strong> {len(new_results)}</p>

            <h3>Neue Wohnungen:</h3>
            <table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse;">
                <tr style="background-color: #f0f0f0;">
                    <th>PLZ</th>
                    <th>Adresse</th>
                    <th>Link</th>
                </tr>
        """

        for result in new_results:
            plz = result.get('plz', 'N/A')
            address = result.get('address', 'N/A')
            link = result.get('link', '#')

            html += f"""
                <tr>
                    <td>{plz}</td>
                    <td>{address}</td>
                    <td><a href="{link}">Details</a></td>
                </tr>
            """

        html += """
            </table>
            <br>
            <p><small>Diese Nachricht wurde automatisch vom Flat Scraper gesendet.</small></p>
        </body>
        </html>
        """

        return html
|
||||||
|
|
||||||
|
class ConsoleNotifier:
    """Write new-result notifications to stdout (useful for debugging)."""

    def send_notification(self, scraper_name: str, new_results: List[Dict]) -> bool:
        """Print a formatted summary of the new results to the console."""
        if not new_results:
            # Nothing new: treat as a successful no-op notification.
            return True

        print(f"\n{'='*50}")
        print(f"🏠 NEUE WOHNUNGEN GEFUNDEN: {scraper_name}")
        print(f"Zeitpunkt: {datetime.now().strftime('%d.%m.%Y %H:%M')}")
        print(f"Anzahl: {len(new_results)}")
        print(f"{'='*50}")

        for entry in new_results:
            zip_code = entry.get('plz', 'N/A')
            street = entry.get('address', 'N/A')
            url = entry.get('link', '#')

            print(f"📍 PLZ {zip_code}: {street}")
            if url != '#':
                print(f"   🔗 {url}")
            print()

        return True
|
||||||
0
src/scrapers/__init__.py
Normal file
0
src/scrapers/__init__.py
Normal file
35
src/scrapers/base_scraper.py
Normal file
35
src/scrapers/base_scraper.py
Normal file
@@ -0,0 +1,35 @@
|
|||||||
|
from abc import ABC, abstractmethod
|
||||||
|
from typing import List, Dict, Any
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
class BaseScraper(ABC):
    """Common behaviour shared by all site scrapers.

    Concrete scrapers implement :meth:`scrape`; hashing and metadata
    handling are provided here.
    """

    def __init__(self, config: Dict[str, Any]):
        # Raw site configuration; subclasses may read extra keys from it.
        self.config = config
        self.name = config.get('name', 'unknown')
        self.base_url = config.get('url', '')

    @abstractmethod
    async def scrape(self, search_params: Dict[str, Any]) -> List[Dict]:
        """Scrape data from the website (implemented by subclasses)."""
        pass

    def generate_hash(self, data: Dict) -> str:
        """Return a deterministic MD5 hex digest of the result for dedup."""
        import hashlib
        import json

        # Serialise with sorted keys so logically-equal dicts hash alike.
        return hashlib.md5(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def add_metadata(self, results: List[Dict]) -> List[Dict]:
        """Attach scraper name, timestamp and content hash to each result.

        The hash is computed from the result *before* the metadata is
        merged in, so it stays stable across runs (the scrape timestamp is
        deliberately excluded from the hash).
        """
        for entry in results:
            # Hash first: metadata (especially the timestamp) must not
            # influence the digest used for duplicate detection.
            digest = self.generate_hash(entry)
            entry['scraper'] = self.name
            entry['scrape_time'] = datetime.now().isoformat()
            entry['hash'] = digest
        return results
|
||||||
94
src/scrapers/nhg_scraper.py
Normal file
94
src/scrapers/nhg_scraper.py
Normal file
@@ -0,0 +1,94 @@
|
|||||||
|
import asyncio
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
from typing import List, Dict, Any
|
||||||
|
import time
|
||||||
|
from .base_scraper import BaseScraper
|
||||||
|
|
||||||
|
class NHGScraper(BaseScraper):
    """Scraper for apartment listings on nhg.at.

    Loads the Wohnungsangebot page, filters by postal code (PLZ) through the
    ``#Filter_City`` select box and extracts addresses plus a detail link
    from the ``#UnitsList`` container.
    """

    def __init__(self, config: Dict[str, Any]):
        """Initialize from *config*, falling back to NHG defaults.

        Args:
            config: ``search_params.plz_list`` holds the dropdown option
                labels to select (e.g. ``"1120 Wien"``); ``url`` overrides
                the listing page.
        """
        super().__init__(config)
        self.plz_list = config.get('search_params', {}).get('plz_list', ["1120", "1140", "1150", "1160"])
        self.base_url = config.get('url', 'https://nhg.at/immobilienangebot/wohnungsangebot/')

    async def scrape_plz(self, page, plz: str) -> List[Dict]:
        """Scrape all apartments for a single PLZ.

        Args:
            page: An open Playwright page object.
            plz: The dropdown option label, e.g. ``"1120 Wien"``.

        Returns:
            A list of dicts with ``plz``, ``address`` and ``link`` keys;
            empty on any failure (errors are printed, never raised).
        """
        import re

        results = []
        try:
            # Load the listing page and wait until network traffic settles.
            await page.goto(self.base_url)
            await page.wait_for_load_state('networkidle')

            # Skip PLZs the site does not offer in its filter dropdown.
            options = await page.locator('#Filter_City option').all_text_contents()
            if plz not in options:
                print(f"PLZ {plz} nicht verfügbar")
                return results

            await page.select_option('#Filter_City', plz)
            # The result list refreshes asynchronously; give it time to load.
            await page.wait_for_timeout(3000)

            units_list = await page.query_selector('#UnitsList')
            if not units_list:
                print(f"Keine UnitsList gefunden für PLZ {plz}")
                return results

            content = await units_list.text_content()

            # Addresses look like "1120 Wien, Somestraße 5 / 2 / 10".
            address_pattern = r'(\d{4}\s+Wien,\s*[^,\n]+)'
            addresses = re.findall(address_pattern, content)

            for address in addresses:
                address = address.strip()
                if not address:
                    continue

                # NOTE(review): this finds the FIRST "Details" link in the
                # whole list and attaches it to every address — it is not
                # matched to the individual unit. Confirm against the page
                # DOM before relying on the link.
                details_link = None
                try:
                    details_elements = await page.locator('#UnitsList a').all()
                    for element in details_elements:
                        link_text = await element.text_content()
                        if 'Details' in link_text:
                            details_link = await element.get_attribute('href')
                            break
                except Exception:
                    # BUG FIX: was a bare `except:` (also swallows
                    # KeyboardInterrupt/SystemExit). A missing link must not
                    # drop the address, so we still continue best-effort.
                    pass

                results.append({
                    'plz': plz.split()[0],  # "1120 Wien" -> "1120"
                    'address': address,
                    'link': details_link,
                })

        except Exception as e:
            print(f"Fehler beim Scraping von PLZ {plz}: {e}")

        return results

    async def scrape(self, search_params: Dict[str, Any] = None) -> List[Dict]:
        """Scrape every configured PLZ and return results with metadata."""
        all_results = []

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page()
                for plz in self.plz_list:
                    print(f"Scraping PLZ {plz}...")
                    results = await self.scrape_plz(page, plz)
                    all_results.extend(results)
                    # BUG FIX: time.sleep() blocks the event loop inside an
                    # async function; asyncio.sleep() yields control while
                    # rate-limiting between PLZs.
                    await asyncio.sleep(1)
            finally:
                # Always release the browser, even if a PLZ run raises.
                await browser.close()

        return self.add_metadata(all_results)
|
||||||
0
src/storage/__init__.py
Normal file
0
src/storage/__init__.py
Normal file
78
src/storage/csv_storage.py
Normal file
78
src/storage/csv_storage.py
Normal file
@@ -0,0 +1,78 @@
|
|||||||
|
import pandas as pd
|
||||||
|
import os
|
||||||
|
from typing import List, Dict, Set, Tuple
|
||||||
|
from datetime import datetime
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
class CSVStorage:
    """CSV-based storage for scraping results.

    One CSV file per scraper, named ``<scraper>_results.csv`` inside
    ``data_dir``. Results are appended over time; duplicate detection is
    done via the ``hash`` column that the scrapers add to every row.
    """

    def __init__(self, data_dir: str = "data"):
        """Create the storage root directory if it does not exist yet."""
        self.data_dir = Path(data_dir)
        self.data_dir.mkdir(exist_ok=True)

    def get_filename(self, scraper_name: str) -> Path:
        """Return the CSV path used for *scraper_name*."""
        return self.data_dir / f"{scraper_name}_results.csv"

    def load_previous_results(self, scraper_name: str) -> Set[str]:
        """Load the set of result hashes already stored for *scraper_name*.

        Returns an empty set when no file exists or it cannot be read
        (read errors are printed, never raised).
        """
        filename = self.get_filename(scraper_name)
        if not filename.exists():
            return set()

        try:
            df = pd.read_csv(filename)
            return set(df['hash'].dropna().unique())
        except Exception as e:
            print(f"Fehler beim Laden vorheriger Ergebnisse: {e}")
            return set()

    def save_results(self, scraper_name: str, results: List[Dict]) -> None:
        """Append *results* to the scraper's CSV, creating it if needed.

        Args:
            scraper_name: Determines the target file.
            results: One dict per row; an empty list is a no-op.
        """
        if not results:
            print(f"Keine Ergebnisse für {scraper_name}")
            return

        filename = self.get_filename(scraper_name)

        # Convert to DataFrame (column order follows dict key order).
        df = pd.DataFrame(results)

        if filename.exists():
            # BUG FIX: blindly appending with header=False corrupts the file
            # when dict key order differs between runs (values land under the
            # wrong columns). Align new rows to the existing header first.
            existing_columns = pd.read_csv(filename, nrows=0).columns
            df = df.reindex(columns=existing_columns)
            df.to_csv(filename, mode='a', header=False, index=False)
        else:
            df.to_csv(filename, index=False)

        print(f"{len(results)} Ergebnisse für {scraper_name} gespeichert")

    def compare_results(self, scraper_name: str, new_results: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
        """Compare *new_results* against previously stored hashes.

        Returns:
            ``(new_items, removed_items)``. ``removed_items`` is always empty
            because only hashes (not full rows) are kept for old results.
        """
        previous_hashes = self.load_previous_results(scraper_name)
        new_hashes = {result['hash'] for result in new_results}

        # Rows whose hash has never been stored before.
        new_items = [result for result in new_results if result['hash'] not in previous_hashes]

        # Hashes that disappeared since the last run (reporting only).
        removed_hashes = previous_hashes - new_hashes
        removed_items = []  # We don't have the full data for removed items

        return new_items, removed_items

    def get_latest_results(self, scraper_name: str, limit: int = 50) -> pd.DataFrame:
        """Return up to *limit* most recent rows for *scraper_name*.

        Rows are sorted by ``scrape_time`` descending when that column
        exists; an empty DataFrame is returned on missing file or read error.
        """
        filename = self.get_filename(scraper_name)
        if not filename.exists():
            return pd.DataFrame()

        try:
            df = pd.read_csv(filename)
            if 'scrape_time' in df.columns:
                df = df.sort_values('scrape_time', ascending=False)
            return df.head(limit)
        except Exception as e:
            print(f"Fehler beim Lesen der Ergebnisse: {e}")
            return pd.DataFrame()
|
||||||
2
test_data/nhg_test_results.csv
Normal file
2
test_data/nhg_test_results.csv
Normal file
@@ -0,0 +1,2 @@
|
|||||||
|
plz,address,link,scraper,scrape_time,hash
|
||||||
|
1120,"1120 Wien, Flurschützstraße 5 / 2 / 10",#,NHG Test,2026-02-15T08:13:09.072841,75b2a4c4eb48f8f22047d252320d56f6
|
||||||
|
61
test_scraper.py
Normal file
61
test_scraper.py
Normal file
@@ -0,0 +1,61 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test script for the NHG scraper
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
|
||||||
|
# Add src to path
|
||||||
|
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'src'))
|
||||||
|
|
||||||
|
from scrapers.nhg_scraper import NHGScraper
|
||||||
|
from storage.csv_storage import CSVStorage
|
||||||
|
|
||||||
|
async def test_nhg_scraper():
    """Smoke-test the NHG scraper end to end with a single PLZ.

    Builds a minimal config, runs a real scrape, prints every result and
    exercises the CSV storage round-trip. Returns True on success.
    """
    print("Testing NHG Scraper...")

    config = {
        'name': 'NHG Test',
        'url': 'https://nhg.at/immobilienangebot/wohnungsangebot/',
        'search_params': {
            'plz_list': ['1120 Wien']  # Test with full PLZ name
        }
    }
    scraper = NHGScraper(config)

    try:
        results = await scraper.scrape()
        print(f"Found {len(results)} results:")

        for entry in results:
            print(f" PLZ: {entry.get('plz')}")
            print(f" Address: {entry.get('address')}")
            print(f" Link: {entry.get('link')}")
            print(f" Hash: {entry.get('hash')}")
            print("-" * 40)

        # Exercise the storage layer: diff against previous runs, then persist.
        storage = CSVStorage('test_data')
        new_results, removed_results = storage.compare_results('nhg_test', results)
        print(f"New results: {len(new_results)}")
        print(f"Removed results: {len(removed_results)}")
        storage.save_results('nhg_test', results)

        return True
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
        return False
|
||||||
|
|
||||||
|
# Entry point: run the async smoke test once and exit non-zero on failure,
# so the script can be used as a CI check.
if __name__ == "__main__":
    success = asyncio.run(test_nhg_scraper())
    sys.exit(0 if success else 1)
|
||||||
Reference in New Issue
Block a user