# Deribit Normalizer - Implementation Guide

**Broker ID:** deribit
**Format:** JSON API (crypto perpetuals + European-style options)
**Current Hash Match Rate:** 74.27% → **Target:** 95%+

This document organizes the implementation of critical validations into prioritized phases.

---

## Current State

### Metrics
- **Hash Match Rate:** 74.27% (1,146/1,543 matches) ⚠️
- **Data Integrity:** 100% ✅ (trading data is correct)
- **Test Coverage:** 616 lines (29 tests) ✅
- **Scope:** JSON API only (crypto perpetuals + options)

### Critical Issues
- **1 CRITICAL:** Expired options hash mismatch (397 records with the "_EXP_A" suffix)
- **5 HIGH/MEDIUM:** Data validation issues (symbol, price, quantity, timestamp, side)

### Strengths
- ✅ Strong test coverage (616 lines vs. 0 tests in legacy)
- ✅ Clean architecture (interpreter pattern + Polars lazy evaluation)
- ✅ 100% data integrity (price, quantity, fees, and timestamps are correct)
- ✅ 9 validations already implemented correctly

---

## PHASE 1: Critical Hash Fix (1-2 days) ⚠️ URGENT

**Goal:** Raise the hash match rate from 74.27% to 95%+

**Priority:** HIGH (URGENT)
**Complexity:** MEDIUM
**Risk:** MEDIUM

### Issue: Expired Options Hash Mismatch

**Problem:**
- 397 expired options (25.73% of records) need the "_EXP_A" suffix in the hash computation
- Risk: expired options can be duplicated on re-import
- Current: `MD5(json.dumps(id))`
- Expected: `MD5(json.dumps(id + "_EXP_A"))` for expired options
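
To make the two formulas concrete, the following minimal sketch computes both variants for a sample ID. The helper name `legacy_hash` is hypothetical; only the `MD5(json.dumps(...))` formula and the `_EXP_A` suffix come from this guide.

```python
import hashlib
import json


def legacy_hash(value: str) -> str:
    """MD5 hex digest of the JSON-encoded value (hypothetical helper name)."""
    return hashlib.md5(json.dumps(value).encode("utf-8")).hexdigest()


order_id = "596504147"
current_hash = legacy_hash(order_id)              # formula the normalizer uses today
expected_hash = legacy_hash(order_id + "_EXP_A")  # formula expected for expired options

# The digests differ, so an expired option cannot match until the suffix is added.
print(current_hash != expected_hash)
```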

### Implementation

#### Step 1.1: Add Hash Computation Method (0.5 days)

**File:** `brokers/deribit/deribit.py`
**Location:** After line 127 (after the `_is_option` method)

```python
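# Requires at module level: hashlib, json, a configured `logger`,
# and `from datetime import datetime`
@classmethod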
def _compute_file_row_hash(cls, order_id: str, instrument_name: str) -> str:
    """
    Compute file_row hash with expired options handling.

    For expired options, append "_EXP_A" suffix to match legacy.

    Args:
        order_id: Order ID from Deribit API (trade.get("id"))
        instrument_name: Instrument name (e.g., "BTC-26JUL24-56000-P")

    Returns:
        MD5 hash string for deduplication

    Examples:
        >>> # Expired option
        >>> _compute_file_row_hash("596504147", "BTC-26JUL24-56000-P")
        'abc123...'  # hash of "596504147_EXP_A"

        >>> # Active option
        >>> _compute_file_row_hash("596504147", "BTC-26DEC26-80000-C")
        'def456...'  # hash of "596504147" (no suffix)

        >>> # Perpetual
        >>> _compute_file_row_hash("3877357", "ETH_USDT")
        'ghi789...'  # hash of "3877357" (no suffix)
    """
    # Check if instrument is expired option
    # Pattern: BTC-DDMMMYY-STRIKE-C/P
    if cls._is_option(instrument_name):
        parts = instrument_name.split('-')
        if len(parts) == 4:
            expiry_str = parts[1]  # e.g., "26JUL24"
            try:
                # Parse expiry date in DDMMMYY format
                expiry_date = datetime.strptime(expiry_str, '%d%b%y')

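                # Check if expired (compare with current local time).
                # NOTE: the result is time-dependent: the same trade hashes
                # without the suffix before expiry and with it afterwards.
                if expiry_date < datetime.now():
                    order_id = f"{order_id}_EXP_A"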
                    logger.debug(
                        f"[DERIBIT] Added EXP_A suffix for expired option: "
                        f"{instrument_name} (expired: {expiry_date.date()})"
                    )
            except ValueError as e:
                logger.warning(
                    f"[DERIBIT] Failed to parse expiry date '{expiry_str}' "
                    f"for instrument '{instrument_name}': {e}"
                )
                # Not a valid date format - skip suffix

    return hashlib.md5(json.dumps(order_id).encode('utf-8')).hexdigest()
```
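
Because the method above reads `datetime.now()`, its result depends on when it runs, which also makes any test that relies on a "future" expiry fail once that date passes. One possible variant, sketched below as a standalone function, accepts the reference time as a parameter. The `now` parameter, the function name, and the regex used in place of `_is_option` are assumptions for illustration, not part of the existing code.

```python
import hashlib
import json
import re
from datetime import datetime
from typing import Optional

# Assumed option pattern: UNDERLYING-DDMMMYY-STRIKE-C/P (stand-in for _is_option)
OPTION_RE = re.compile(r"^[A-Z]+-(\d{1,2}[A-Z]{3}\d{2})-\d+-[CP]$")


def compute_file_row_hash(order_id: str, instrument_name: str,
                          now: Optional[datetime] = None) -> str:
    """Hash the order ID, appending _EXP_A when the option expired before `now`."""
    now = now or datetime.now()
    match = OPTION_RE.match(instrument_name)
    if match:
        try:
            expiry = datetime.strptime(match.group(1), "%d%b%y")
        except ValueError:
            expiry = None  # unparseable expiry: treat as not expired
        if expiry is not None and expiry < now:
            order_id = f"{order_id}_EXP_A"
    return hashlib.md5(json.dumps(order_id).encode("utf-8")).hexdigest()


# The same trade hashes differently before and after its expiry date.
before = compute_file_row_hash("596504147", "BTC-26JUL24-56000-P", now=datetime(2024, 7, 1))
after = compute_file_row_hash("596504147", "BTC-26JUL24-56000-P", now=datetime(2024, 8, 1))
print(before != after)
```

With an explicit `now`, tests can pin the clock instead of depending on the calendar date on which they run.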

#### Step 1.2: Use the New Method in `parse_json_content` (0.25 days)

**File:** `brokers/deribit/deribit.py`
**Location:** Line ~202 (replace the hash computation)

**BEFORE:**
```python
# Compute file_row hash using legacy formula:
# MD5(json.dumps(id)) - no EXP_A suffix for Deribit
trade_id_val = trade.get("id", "")
file_row_hash = hashlib.md5(json.dumps(trade_id_val).encode('utf-8')).hexdigest()
```

**AFTER:**
```python
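# Compute file_row hash using legacy formula with expired options handling:
# - Active options/perpetuals: MD5(json.dumps(id))
# - Expired options: MD5(json.dumps(id + "_EXP_A"))
# NOTE: str() makes json.dumps emit a quoted string. If the API returns numeric
# IDs, this changes the hash of non-expired trades relative to the BEFORE code;
# confirm the ID type against legacy data first.
trade_id_val = trade.get("id", "")
file_row_hash = cls._compute_file_row_hash(str(trade_id_val), instrument)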
```
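
One detail worth checking before merging: `json.dumps` encodes a number and a string differently, so wrapping the ID in `str()` changes the hashed text whenever the API delivers `id` as a JSON number (as in the sample trades in this guide). The snippet below only illustrates that difference; whether the legacy system hashed numeric or string IDs is an assumption to verify against real data.

```python
import hashlib
import json


def md5_of_json(value) -> str:
    """MD5 hex digest of json.dumps(value)."""
    return hashlib.md5(json.dumps(value).encode("utf-8")).hexdigest()


numeric_id = 596504147

print(json.dumps(numeric_id))       # 596504147
print(json.dumps(str(numeric_id)))  # "596504147" (quoted)

# Different input text means different digests for the same trade.
print(md5_of_json(numeric_id) == md5_of_json(str(numeric_id)))  # False
```

If real IDs turn out to be numeric, one option is to pass the raw value through and stringify only when the suffix is appended, which would leave non-expired hashes unchanged.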

#### Step 1.3: Add Tests (0.5 days)

**File:** `tests/brokers/test_deribit.py`
**Location:** Add a new `TestHashComputation` class

```python
class TestHashComputation:
    """Tests for expired options hash computation."""

    def test_expired_option_hash_suffix(self):
        """Verify that expired options use the _EXP_A suffix."""
        # Arrange: Expired option (26JUL24 in the past)
        order_id = "596504147"
        instrument = "BTC-26JUL24-56000-P"  # Expired option

        # Act
        hash_result = DeribitInterpreter._compute_file_row_hash(order_id, instrument)

        # Assert: Should match legacy formula with suffix
        expected = hashlib.md5(json.dumps("596504147_EXP_A").encode('utf-8')).hexdigest()
        assert hash_result == expected

    def test_active_option_hash_no_suffix(self):
        """Verify that active options do NOT use the suffix."""
        # Arrange: Active option (26DEC26 in the future)
        # NOTE: this fixture expires on 26 Dec 2026; update it or freeze the clock after that
        order_id = "596504147"
        instrument = "BTC-26DEC26-80000-C"  # Future expiry

        # Act
        hash_result = DeribitInterpreter._compute_file_row_hash(order_id, instrument)

        # Assert: Should NOT have suffix
        expected = hashlib.md5(json.dumps("596504147").encode('utf-8')).hexdigest()
        assert hash_result == expected

    def test_perpetual_hash_no_suffix(self):
        """Verify that perpetuals do NOT use the suffix."""
        # Arrange: Perpetual (not an option)
        order_id = "3877357"
        instrument = "ETH_USDT"  # Perpetual

        # Act
        hash_result = DeribitInterpreter._compute_file_row_hash(order_id, instrument)

        # Assert: Should NOT have suffix
        expected = hashlib.md5(json.dumps("3877357").encode('utf-8')).hexdigest()
        assert hash_result == expected

    def test_expired_option_integration(self):
        """Integration test: expired option is processed correctly."""
        # Arrange: Sample expired option trade
        trade = {
            "id": 596504147,
            "type": "trade",
            "instrument_name": "ETH-30DEC22-3600-P",  # Expired in 2022
            "side": "liquidation buy",
            "price": 0.0007,
            "amount": 2.0,
            "contracts": 2.0,
            "timestamp": 1671469172919,
            "commission": 0.00035,
            "currency": "ETH",
            "index_price": 3872.94,
        }

        # Act: collect first (assuming a LazyFrame is returned, which has no .height)
        lazy_df = DeribitInterpreter.parse_json_content(json.dumps([trade]))
        result = lazy_df.collect()

        # Assert
        assert result.height == 1

        # file_row should have EXP_A suffix
        expected_hash = hashlib.md5(json.dumps("596504147_EXP_A").encode('utf-8')).hexdigest()
        assert result["file_row"][0] == expected_hash
```

#### Step 1.4: Validation with Real Data (0.5 days)

1. **Run tests:**
   ```bash
   pytest tests/brokers/test_deribit.py::TestHashComputation -v
   ```

2. **Verify the hash match rate with real data:**
   ```bash
   python scripts/verify_hash_match_rate.py \
       --broker deribit \
       --user-id 49186 \
       --expected-rate 0.95
   ```

3. **Review logs:**
   ```bash
   grep "EXP_A" logs/deribit.log | head -20
   ```
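
The internals of `scripts/verify_hash_match_rate.py` are not shown in this guide. As a rough sketch of the metric it presumably reports, the match rate can be read as the share of legacy hashes that the new normalizer reproduces. The function name and the use of plain sets are assumptions.

```python
def hash_match_rate(legacy_hashes: set, new_hashes: set) -> float:
    """Fraction of legacy hashes that also appear in the new output."""
    if not legacy_hashes:
        return 0.0
    return len(legacy_hashes & new_hashes) / len(legacy_hashes)


# Figures from this guide: 1,146 of 1,543 records currently match.
legacy = {f"h{i}" for i in range(1543)}
new = {f"h{i}" for i in range(1146)}
rate = hash_match_rate(legacy, new)
print(f"{rate:.2%}")  # 74.27%
```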

### Success Metrics

- [ ] Hash match rate >= 95% (vs. current 74.27%)
- [ ] Expired options correctly deduplicated (397 records)
- [ ] Active options unchanged (1,146 records)
- [ ] Perpetuals unchanged
- [ ] Data integrity maintained (100%)
- [ ] Tests passing: 4 new tests
- [ ] No regressions in existing tests

### Estimated Time: 1-2 days

---

## PHASE 2: Data Validation (1-1.5 days)

**Goal:** Prevent invalid data

**Priority:** HIGH
**Complexity:** LOW
**Risk:** LOW

### 2.1 Symbol Validation (0.25 days)

**File:** `brokers/deribit/deribit.py`
**Location:** Line ~164 (after the instrument assignment)

```python
instrument = str(trade.get("instrument_name", ""))

# Validate instrument name is not empty
if not instrument or not instrument.strip():
    logger.warning(f"[DERIBIT] Skipping trade {trade.get('id', 'unknown')}: empty instrument_name")
    continue

is_option = cls._is_option(instrument)
```

**Test:**
```python
def test_symbol_empty_validation():
    trade = {
        "id": 12345,
        "type": "trade",
        "instrument_name": "",  # Empty symbol
        "side": "open buy",
        "price": 2500.0,
        "amount": 10.0,
        "timestamp": 1751852373399,
    }
    df = DeribitInterpreter.parse_json_content(json.dumps([trade]))
    assert df.height == 0  # Should be rejected
```

### 2.2 Price Validation (0.25 days)

**File:** `brokers/deribit/deribit.py`
**Location:** Line ~177 (after the price assignment)

```python
price = float(trade.get("price", 0) or 0)

# Validate price is positive
if price <= 0:
    logger.warning(f"[DERIBIT] Skipping trade {trade.get('id', 'unknown')}: "
                   f"zero or negative price {price}")
    continue
```

**Tests:**
```python
def test_price_zero_validation():
    """TODO: zero price rejection."""

def test_price_negative_validation():
    """TODO: negative price rejection."""
```
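
The two stubs above could be filled in along these lines. To keep the example runnable on its own, the check is extracted into a hypothetical `is_valid_price` helper that mirrors the price guard; in the real test suite the assertions would go through `DeribitInterpreter.parse_json_content` instead.

```python
def is_valid_price(trade: dict) -> bool:
    """Mirror of the price guard: missing, zero, or negative prices are rejected."""
    price = float(trade.get("price", 0) or 0)
    return price > 0


def test_price_zero_validation():
    assert not is_valid_price({"id": 1, "price": 0})


def test_price_negative_validation():
    assert not is_valid_price({"id": 2, "price": -2500.0})


def test_price_positive_accepted():
    assert is_valid_price({"id": 3, "price": 0.0007})
```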

### 2.3 Quantity Validation (0.25 days)

**File:** `brokers/deribit/deribit.py`
**Location:** Line ~183 (after the quantity calculation)

```python
if contracts is not None:
    quantity = float(contracts)
else:
    quantity = amount

# Validate quantity is positive
if quantity <= 0:
    logger.warning(f"[DERIBIT] Skipping trade {trade.get('id', 'unknown')}: "
                   f"zero or negative quantity (amount={amount}, contracts={contracts})")
    continue
```

**Test:**
```python
def test_quantity_zero_validation():
    """TODO: zero quantity rejection."""
```
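
As a standalone illustration of the quantity rule, the sketch below resolves the quantity with `contracts` taking priority over `amount` and rejects non-positive results. The helper name `resolve_quantity` and the way `amount` is read from the trade are assumptions; the guide does not show how `amount` is derived.

```python
from typing import Optional


def resolve_quantity(trade: dict) -> Optional[float]:
    """Prefer `contracts` over `amount`; return None when the result is not positive."""
    contracts = trade.get("contracts")
    amount = float(trade.get("amount", 0) or 0)
    quantity = float(contracts) if contracts is not None else amount
    return quantity if quantity > 0 else None


print(resolve_quantity({"amount": 10.0}))                    # 10.0
print(resolve_quantity({"amount": 10.0, "contracts": 2.0}))  # 2.0
print(resolve_quantity({"amount": 0}))                       # None
```

Note that a present but zero `contracts` value still wins over a positive `amount`, which matches the `contracts is not None` check in the snippet above.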

### 2.4 Timestamp Validation (0.25 days)

**File:** `brokers/deribit/deribit.py`
**Location:** Line ~210 (before the timestamp conversion)

```python
timestamp_val = trade.get("timestamp")

# Validate timestamp exists and is not zero
if not timestamp_val or timestamp_val == 0:
    logger.warning(f"[DERIBIT] Skipping trade {trade.get('id', 'unknown')}: "
                   f"missing timestamp")
    continue

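try:
    timestamp_ms = int(timestamp_val)
    # utcfromtimestamp() is deprecated since Python 3.12; it returns a naive UTC datetime
    utc_dt = datetime.utcfromtimestamp(timestamp_ms / 1000)
except (TypeError, ValueError, OverflowError, OSError) as e:
    logger.warning(f"[DERIBIT] Skipping trade {trade.get('id', 'unknown')}: "
                   f"invalid timestamp {timestamp_val!r}: {e}")
    continue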
```

**Tests:**
```python
def test_timestamp_missing_validation():
    """TODO: missing timestamp rejection."""

def test_timestamp_zero_validation():
    """TODO: zero timestamp rejection."""
```
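
A self-contained sketch of the timestamp check follows. It uses a timezone-aware conversion rather than `datetime.utcfromtimestamp`, which is deprecated since Python 3.12; the helper name `parse_timestamp_ms` is hypothetical, and code that expects naive datetimes would need `.replace(tzinfo=None)` on the result.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp_ms(value) -> Optional[datetime]:
    """Convert a millisecond epoch to an aware UTC datetime; None if missing or invalid."""
    if not value:  # covers None, 0, and ""
        return None
    try:
        return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return None


def test_timestamp_missing_validation():
    assert parse_timestamp_ms(None) is None


def test_timestamp_zero_validation():
    assert parse_timestamp_ms(0) is None
```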

### 2.5 Side Mapping Improvement (0.25 days)

**File:** `brokers/deribit/deribit.py`
**Location:** Lines 238-240 (improve documentation)

```python
# Map compound side values to BUY/SELL
# Deribit uses 6 compound side values:
# - "open buy", "close short", "liquidation buy" → BUY
# - "open sell", "close long", "liquidation sell" → SELL
side_raw = str(trade.get("side", "")).lower()
action = "BUY" if any(x in side_raw for x in ["buy", "short"]) else \
         "SELL" if any(x in side_raw for x in ["sell", "long"]) else ""

# Validate side mapping was successful
if not action:
    logger.warning(f"[DERIBIT] Skipping trade {trade.get('id', 'unknown')}: "
                   f"unknown side '{side_raw}'")
    continue
```

**Test:**
```python
def test_side_mapping_all_combinations():
    """TODO: test all 6 side combinations."""

def test_side_mapping_unknown():
    """TODO: unknown side rejection."""
```
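
The side-mapping tests could look roughly like this. The keyword rule is copied from the snippet above into a hypothetical standalone `map_side` function so the example runs without the interpreter class; the six expected values come from the comment in that snippet.

```python
SIDE_CASES = {
    "open buy": "BUY",
    "close short": "BUY",
    "liquidation buy": "BUY",
    "open sell": "SELL",
    "close long": "SELL",
    "liquidation sell": "SELL",
}


def map_side(side_raw: str) -> str:
    """Same keyword rule as the guide's snippet; returns "" for unknown sides."""
    side = str(side_raw).lower()
    if any(k in side for k in ("buy", "short")):
        return "BUY"
    if any(k in side for k in ("sell", "long")):
        return "SELL"
    return ""


def test_side_mapping_all_combinations():
    for raw, expected in SIDE_CASES.items():
        assert map_side(raw) == expected


def test_side_mapping_unknown():
    assert map_side("transfer") == ""
```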

### Success Metrics

- [ ] No empty symbols in output
- [ ] No zero or negative prices in output
- [ ] No zero quantities in output
- [ ] No missing timestamps in output
- [ ] Unknown sides rejected
- [ ] Rejection rate < 0.1%
- [ ] Tests passing: 8 new tests
- [ ] No regressions in existing tests

### Estimated Time: 1-1.5 days

---

## PHASE 3: CSV Support (5-8 days - CONDITIONAL) 🎯

**Goal:** Implement support for 2 legacy CSV formats

**Priority:** CONDITIONAL (requires SQL query)
**Complexity:** HIGH
**Risk:** MEDIUM

### BLOCKER: SQL Query Required

Before implementing, run the following SQL query to determine whether CSV support is needed:

```sql
SELECT
    source_type,
    COUNT(*) as count,
    COUNT(DISTINCT user_id) as user_count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage,
    MAX(created_at) as last_used
FROM data_sources
WHERE broker_id = 'deribit'
  AND created_at > NOW() - INTERVAL '12 months'
GROUP BY source_type
ORDER BY count DESC;
```

**Criterion:**
- If CSV percentage > 5%: **IMPLEMENT** (continue with Phase 3)
- If CSV percentage < 5%: **SKIP** (skip Phase 3)
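
The decision can be sketched in a few lines once the query results are available. The `source_type` values and the counts below are made up purely for illustration, and `csv_share` is a hypothetical helper. The criterion above does not cover a share of exactly 5%; this sketch treats that case as SKIP.

```python
def csv_share(counts_by_source: dict) -> float:
    """Percentage of rows whose source_type starts with "csv" (assumed naming)."""
    total = sum(counts_by_source.values())
    if total == 0:
        return 0.0
    csv_rows = sum(n for src, n in counts_by_source.items()
                   if src.lower().startswith("csv"))
    return 100.0 * csv_rows / total


# Made-up counts; real numbers come from the SQL query above.
sample = {"json_api": 940, "csv_v1": 45, "csv_v2": 15}
share = csv_share(sample)
decision = "IMPLEMENT" if share > 5 else "SKIP"
print(f"CSV share: {share:.1f}% -> {decision}")
```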

### Scope (Only if Implemented)

#### 3.1 CSV Format v1 Parser (2 days)

**Legacy:** `brokers_deribit.py` lines 20-180 (getcsv)

**Features:**
- Format detection (detect "DATE" header)
- Side mapping (buy/sell, open buy, close buy, open sell, close sell)
- Symbol normalization (BTC-PERPETUAL → BTC, remove dashes)
- Options parsing (extract strike, expiry, type)
- Fee extraction
- Unix timestamp vs string date parsing
- Quantity source (quantity → amount priority)
- Price decimal calculation
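
The legacy `getcsv` code is not reproduced in this guide, so the following is only a sketch of the options-parsing step, based on the `UNDERLYING-DDMMMYY-STRIKE-C/P` instrument pattern used in Phase 1. The `OptionParts` dataclass and `parse_option_name` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class OptionParts:
    underlying: str
    expiry: date
    strike: float
    option_type: str  # "C" (call) or "P" (put)


def parse_option_name(instrument_name: str) -> Optional[OptionParts]:
    """Split UNDERLYING-DDMMMYY-STRIKE-C/P; return None for non-option names."""
    parts = instrument_name.split("-")
    if len(parts) != 4 or parts[3] not in ("C", "P"):
        return None
    try:
        expiry = datetime.strptime(parts[1], "%d%b%y").date()
        strike = float(parts[2])
    except ValueError:
        return None
    return OptionParts(parts[0], expiry, strike, parts[3])


print(parse_option_name("BTC-26JUL24-56000-P"))
print(parse_option_name("BTC-PERPETUAL"))  # None
```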

#### 3.2 CSV Format v2 Parser (2 days)

**Legacy:** `brokers_deribit.py` lines 182-326 (getcsv2)

**Features:**
- Format detection (detect "TRADE ID" header)
- Different column mappings
- Future/Option type detection
- Similar processing as v1

#### 3.3 Format Detection Update (0.5 days)

**File:** `brokers/deribit/detector.py`

Add detection for 3 formats:
1. JSON API (current)
2. CSV v1
3. CSV v2
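
A minimal sketch of three-way detection is shown below. It relies on the "DATE" and "TRADE ID" header markers named in this guide; the comma delimiter, the return labels, and checking v2 before v1 (in case a v2 file also carries a date column) are assumptions, and `detect_format` is not the existing detector's API.

```python
def detect_format(content: str) -> str:
    """Return "json", "csv_v1", "csv_v2", or "unknown" based on the first line."""
    stripped = content.lstrip()
    if stripped.startswith(("[", "{")):
        return "json"
    first_line = stripped.splitlines()[0] if stripped else ""
    headers = [h.strip().strip('"').upper() for h in first_line.split(",")]
    if "TRADE ID" in headers:
        return "csv_v2"
    if "DATE" in headers:
        return "csv_v1"
    return "unknown"


print(detect_format('[{"id": 1}]'))              # json
print(detect_format("Date,Instrument,Side\n"))   # csv_v1
print(detect_format("Trade ID,Date,Side\n"))     # csv_v2
```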

#### 3.4 Comprehensive Tests (1.5 days)

- CSV v1 parsing tests
- CSV v2 parsing tests
- Format detection tests
- Hash match validation
- Edge cases

#### 3.5 Documentation (0.5 days)

- Update README
- CSV format specifications
- Migration guide

### Success Metrics (If Implemented)

- [ ] CSV v1 format parsed correctly
- [ ] CSV v2 format parsed correctly
- [ ] Hash match rate >= 95% for CSVs
- [ ] Tests passing for both formats
- [ ] Documentation updated
- [ ] No regressions in JSON API

### Estimated Time: 5-8 days (only if needed)

---

## Implementation Checklist

### Pre-Implementation
- [ ] Review the complete plan
- [ ] Set up the development environment
- [ ] Create feature branch: `feature/deribit-validations`
- [ ] Back up test data

### Phase 1 (Critical Hash Fix)
- [ ] Implement the `_compute_file_row_hash()` method
- [ ] Update the hash computation in `parse_json_content`
- [ ] Add 4 test cases
- [ ] Run tests: `pytest tests/brokers/test_deribit.py -v`
- [ ] Verify the hash match rate with real data
- [ ] Code review
- [ ] Merge to main

### Phase 2 (Data Validation)
- [ ] Implement symbol validation
- [ ] Implement price validation
- [ ] Implement quantity validation
- [ ] Implement timestamp validation
- [ ] Improve side mapping documentation
- [ ] Add 8 test cases
- [ ] Run tests: `pytest tests/brokers/test_deribit.py -v`
- [ ] Verify rejection rate < 0.1%
- [ ] Code review
- [ ] Merge to main

### Phase 3 (CSV Support - Conditional)
- [ ] Run the SQL query
- [ ] If CSV > 5%: implement the CSV parsers
- [ ] If CSV < 5%: skip this phase
- [ ] (If implemented) Comprehensive tests
- [ ] (If implemented) Documentation
- [ ] (If implemented) Code review
- [ ] (If implemented) Merge to main

### Post-Implementation
- [ ] Integration tests with real data
- [ ] Performance testing
- [ ] Monitoring setup
- [ ] Deploy to staging
- [ ] Smoke tests in staging
- [ ] Deploy to production
- [ ] Monitor production metrics

---

## Useful Commands

### Development
```bash
# Run all tests
pytest tests/brokers/test_deribit.py -v

# Run specific test class
pytest tests/brokers/test_deribit.py::TestHashComputation -v

# Run with coverage
pytest tests/brokers/test_deribit.py --cov=pipeline.p01_normalize.brokers.deribit

# Watch mode (auto-run on file change; requires the pytest-watch package)
ptw tests/brokers/test_deribit.py -- -v
```

### Validation
```bash
# Verify hash match rate
python scripts/verify_hash_match_rate.py \
    --broker deribit \
    --user-id 49186 \
    --expected-rate 0.95

# Check validation warnings
grep "Skipping trade" logs/deribit.log | wc -l

# Verify rejection rate
python scripts/calculate_rejection_rate.py \
    --broker deribit \
    --threshold 0.001
```

### Git Workflow
```bash
# Create feature branch
git checkout -b feature/deribit-validations

# Commit Fase 1
git add .
git commit -m "feat(deribit): implement expired options hash fix"

# Commit Fase 2
git add .
git commit -m "feat(deribit): add data validations (symbol, price, quantity, timestamp, side)"

# Push and create PR
git push origin feature/deribit-validations
```

---

## Troubleshooting

### Issue: Hash Match Rate Does Not Improve

**Symptoms:**
- Hash match rate stays at ~74%
- Tests pass but validation with real data fails

**Solution:**
1. Verify that `_compute_file_row_hash()` is being called:
   ```python
   # Add logging to verify
   logger.info(f"Computing hash for order {order_id}, instrument {instrument_name}")
   ```

2. Verify expiry date parsing:
   ```python
   # Test date parsing manually
   from datetime import datetime
   expiry_str = "26JUL24"
   expiry_date = datetime.strptime(expiry_str, '%d%b%y')
   print(f"Parsed: {expiry_date}, Is expired: {expiry_date < datetime.now()}")
   ```

3. Verify that all expired options have the suffix:
   ```bash
   grep "Added EXP_A suffix" logs/deribit.log | wc -l
   # Should be ~397 records
   ```

### Issue: High Rejection Rate

**Symptoms:**
- Rejection rate > 1%
- Many warnings in logs

**Solution:**
1. Analyze rejection types:
   ```bash
   grep "Skipping trade" logs/deribit.log | \
       sed -E 's/.*Skipping trade [^:]+: //' | sort | uniq -c | sort -rn
   ```

2. Investigate the specific data being rejected

3. Consider adjusting the validations if the rejection rate is justified
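
For a breakdown that groups similar rejections regardless of the concrete values logged, a small script along these lines may help. It assumes the `Skipping trade <id>: <reason>` message shape used by the warnings in this guide; the log prefix in the sample lines and the `count_rejections` helper are hypothetical.

```python
import re
from collections import Counter

# Matches the warning shape used in this guide: "Skipping trade <id>: <reason>"
SKIP_RE = re.compile(r"Skipping trade [^:]+: (.+)$")
# Quoted strings and numbers are masked so similar reasons group together
VALUE_RE = re.compile(r"'[^']*'|[-+]?\d+(?:\.\d+)?")


def count_rejections(lines) -> Counter:
    """Count rejection reasons with concrete values masked as <v>."""
    reasons = Counter()
    for line in lines:
        match = SKIP_RE.search(line.rstrip())
        if match:
            reasons[VALUE_RE.sub("<v>", match.group(1))] += 1
    return reasons


# Hypothetical log lines; the real prefix format may differ.
sample = [
    "2026-01-14 10:00:00 WARNING [DERIBIT] Skipping trade 1: zero or negative price 0.0",
    "2026-01-14 10:00:01 WARNING [DERIBIT] Skipping trade 2: zero or negative price -5.0",
    "2026-01-14 10:00:02 WARNING [DERIBIT] Skipping trade 3: unknown side 'transfer'",
    "2026-01-14 10:00:03 INFO [DERIBIT] Parsed 10 trades",
]
for reason, count in count_rejections(sample).most_common():
    print(count, reason)
```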

### Issue: Tests Failing

**Symptoms:**
- Existing tests fail after the changes

**Solution:**
1. Verify that existing implementations were not modified
2. Run tests with verbose output:
   ```bash
   pytest tests/brokers/test_deribit.py -vv --tb=long
   ```
3. Review test fixtures that may need updating

---

## References

### Related Documents
- `README.md` - Executive summary
- `PLAN_ANALISIS_VALIDACIONES_DERIBIT.md` - Complete plan with technical details
- `CAMBIOS_IMPLEMENTADOS.md` - Implementation tracking
- `EJEMPLOS_CAMBIOS_CODIGO.md` - Before/after examples with tests

### Reference Code
- `deribit.py.original` - Current implementation (516 lines)
- `detector.py` - Format detection (50 lines)
- `test_deribit.py.original` - Current tests (616 lines)

### Legacy Code
- `old_code_from_legacy/brokers_deribit.py` - CSV parsers (326 lines)
- `old_code_from_legacy/deribit_export.py` - API sync service (549 lines)

---

**Created:** 2026-01-14
**Last Updated:** 2026-01-14
**Version:** 1.0
**Owner:** Development Team
