Attention fine-tuning Transformer
Attention-based HuggingFace implementation for a ReLU ensemble.
- Input
  - 1893-dim embedding
- Encoder
  - 54 x Transformer with 32 heads
- Output
  - recall projection
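
The layer sizes listed above can be written down as a small config object, which also makes one detail easy to check: common multi-head attention implementations split the embedding evenly across heads, so they expect the embedding width to be divisible by the head count. The sketch below is only an illustration under that assumption. The names `ModelConfig` and `padded_width` are hypothetical and do not come from these notes, and padding to the next divisible width is just one possible workaround, not necessarily what this model does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values copied from the architecture list in these notes.
    embed_dim: int = 1893   # input embedding width
    num_layers: int = 54    # Transformer blocks in the encoder
    num_heads: int = 32     # attention heads per block


def padded_width(cfg: ModelConfig) -> int:
    """Smallest width >= embed_dim that splits evenly across the heads.

    Hypothetical helper: assumes the attention layer needs
    embed_dim % num_heads == 0.
    """
    remainder = cfg.embed_dim % cfg.num_heads
    if remainder == 0:
        return cfg.embed_dim
    return cfg.embed_dim + (cfg.num_heads - remainder)


cfg = ModelConfig()
print(cfg.embed_dim % cfg.num_heads)  # remainder when splitting across heads
print(padded_width(cfg))              # next width divisible by num_heads
```

With the listed values, 1893 leaves a remainder of 5 when divided by 32, so an even split would need a projection to a nearby width such as 1920 (60 dimensions per head) before the encoder.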
Training config
optimizer=AdamW, lr=0.624, scheduler=cyclic, warmup=1685
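
To make the interaction between `warmup` and the `cyclic` scheduler concrete, here is a minimal sketch of one way the learning rate could evolve per step: a linear warmup to the configured `lr`, followed by a triangular cycle. The notes do not say how the cycle is shaped, so `base_lr` and `cycle_steps` are assumed values chosen for illustration, and `TrainConfig` and `lr_at` are hypothetical names. The optimizer, learning rate, and warmup length are taken from the config line as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the training config line.
    optimizer: str = "AdamW"
    scheduler: str = "cyclic"
    peak_lr: float = 0.624
    warmup_steps: int = 1685
    # Assumptions, not stated in the notes.
    base_lr: float = 0.0624    # lowest LR reached inside a cycle
    cycle_steps: int = 2000    # length of one full triangular cycle


def lr_at(step: int, cfg: TrainConfig = TrainConfig()) -> float:
    """Learning rate at a given step: linear warmup, then triangular cycles."""
    if step < cfg.warmup_steps:
        # Ramp linearly from 0 up to peak_lr.
        return cfg.peak_lr * step / cfg.warmup_steps
    # Position inside the current cycle; each cycle starts at peak_lr,
    # descends to base_lr at the midpoint, then climbs back to peak_lr.
    pos = (step - cfg.warmup_steps) % cfg.cycle_steps
    half = cfg.cycle_steps / 2
    frac = abs(pos - half) / half  # 1.0 at cycle edges, 0.0 at midpoint
    return cfg.base_lr + (cfg.peak_lr - cfg.base_lr) * frac


for step in (0, 1685, 2685, 3685):
    print(step, lr_at(step))
```

Starting each cycle at the peak keeps the schedule continuous at the end of warmup. In a real training loop the same shape would normally come from the framework's own scheduler classes rather than a hand-written function.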