nsa, hace 11 meses en Averaging model weights seems to help across textual domains as well, see Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models and Scaling Expert Language Models with Unsupervised Domain Discovery. I wonder if the two types of averaging (across hyperparameters and across domains) can be combined to produce even better models.
Averaging model weights seems to help across textual domains as well, see Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models and Scaling Expert Language Models with Unsupervised Domain Discovery. I wonder if the two types of averaging (across hyperparameters and across domains) can be combined to produce even better models.