Filtered not Mixed: Stochastic Filtering-Based Online Gating for Mixture of Large Language Models
ArXiv ID: 2406.02969
Authors: Unknown
Abstract
We propose MoE-F, a formalized mechanism for combining $N$ pre-trained Large Language Models (LLMs) for online time-series prediction by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert's running performance to forecast the best combination of LLMs for predicting the time series at the next step. Diverging from static (learned) Mixture of Experts (MoE) methods, our approach employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wonham-Shiryaev filter. Our approach first constructs $N$ parallel filters, one for each of the $N$ individual LLMs. Each filter proposes its best combination of LLMs, given the information it has access to. Subsequently, the $N$ filter outputs are optimally aggregated to maximize their robust predictive power, and this update is computed efficiently via a closed-form expression, generating our ensemble predictor. Our contributions are: (I) the MoE-F plug-and-play filtering harness algorithm, (II) theoretical optimality guarantees of the proposed filtering-based gating algorithm (via optimality guarantees for its parallel Bayesian filtering and its robust aggregation steps), and (III) empirical evaluation and ablative results using state-of-the-art foundational and MoE LLMs on a real-world Financial Market Movement task, where MoE-F attains a remarkable 17% absolute and 48.5% relative F1 improvement over the next-best-performing individual LLM expert predicting short-horizon market movement from streaming news. Further, we provide empirical evidence of substantial performance gains from applying MoE-F over specialized models in the long-horizon time-series forecasting domain.
Keywords: Mixture of Experts, Hidden Markov Model, Time-series Forecasting, Bayesian Filtering, Ensemble Methods, Equities (Financial Market Movement)
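As a concrete, hedged illustration of the gating loop the abstract describes, the sketch below discretizes the filtering step and collapses the paper's $N$ parallel filters and closed-form robust aggregation into a single discrete-time HMM filter. All names (`transition_matrix`, `filter_step`, `gated_forecast`) are hypothetical and not taken from the paper's code.

```python
import numpy as np

# Minimal sketch (hypothetical names, not the paper's code) of filtering-based
# gating over N experts. The latent state is "which expert is currently best",
# modeled as a Markov chain; a discrete-time Bayes filter (a crude stand-in for
# the continuous-time Wonham-Shiryaev construction and for the paper's N
# parallel filters plus robust aggregation) turns each expert's running loss
# into a posterior over that state, which is then used as the gating weights.

def transition_matrix(n_experts: int, stay_prob: float = 0.95) -> np.ndarray:
    """Regime-switching kernel: the best expert persists with prob `stay_prob`."""
    off_diag = (1.0 - stay_prob) / (n_experts - 1)
    P = np.full((n_experts, n_experts), off_diag)
    np.fill_diagonal(P, stay_prob)
    return P

def filter_step(pi: np.ndarray, losses: np.ndarray, P: np.ndarray) -> np.ndarray:
    """One predict/update step of the discrete-time HMM filter."""
    prior = P.T @ pi              # propagate belief over the "best expert" forward
    likelihood = np.exp(-losses)  # lower recent loss -> higher likelihood
    posterior = prior * likelihood
    return posterior / posterior.sum()

def gated_forecast(pi: np.ndarray, expert_preds: np.ndarray) -> float:
    """Ensemble forecast: filtered posterior used as mixture weights."""
    return float(pi @ expert_preds)

# Usage sketch with 3 experts and one synthetic time step.
P = transition_matrix(3)
pi = np.ones(3) / 3                      # uniform initial belief
losses_t = np.array([0.2, 0.9, 0.7])     # last-step losses per expert
preds_t1 = np.array([1.0, -1.0, 0.5])    # experts' next-step forecasts
pi = filter_step(pi, losses_t, P)
y_hat = gated_forecast(pi, preds_t1)
```

In this toy version, the posterior over the "currently best expert" plays the role of the gate: each new round of per-expert losses re-weights the mixture before the next forecast is formed.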
Complexity vs Empirical Score
- Math Complexity: 8.5/10
- Empirical Rigor: 7.0/10
- Quadrant: Holy Grail
- Why: The paper introduces advanced stochastic filtering theory (e.g., continuous-time HMMs, the Wonham-Shiryaev filter) with heavy theoretical derivations, indicating high math complexity. It is empirically rigorous, with a real-world financial dataset, multiple LLM baselines, and detailed performance metrics (F1 improvements), though fully public code and raw backtest code are not provided.
flowchart TD
%% Research Goal
goal["Research Goal<br>Combine N LLMs for<br>Time-Series Prediction"]
%% Inputs
inputs["Input Data<br>Streaming News +<br>Time-Series"]
%% Methodology
methodology["Methodology: MoE-F<br>Stochastic Filtering-Based Gating"]
subgraph hmm_proc ["Hidden Markov Model Processing"]
hmm["Finite State-Space HMM<br>Expert Performance Forecasting"]
filters["Parallel Filters<br>Wohman-Shiryaev Filter<br>1 Filter per LLM"]
end
subgraph aggregation ["Aggregation Step"]
agg["Robust Aggregation<br>Closed-Form Update"]
end
%% Outcomes
outcomes["Key Findings<br>17% Abs / 48.5% Rel F1 Gain<br>State-of-the-Art Performance"]
%% Flow
goal --> inputs
inputs --> methodology
methodology --> hmm
hmm --> filters
filters --> agg
agg --> outcomes
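For orientation on the "Parallel Filters" node, a standard textbook form of the Wonham-Shiryaev filter for a finite-state chain observed in additive Gaussian noise is given below. The notation (generator $Q$, observation function $h$, unit noise intensity) is generic and not taken from the paper.

```latex
% Generic Wonham-Shiryaev filter (not the paper's notation): a Markov chain
% X_t on states {e_1, ..., e_N} with generator Q is observed through
%   dY_t = h(X_t) dt + dW_t   (unit noise intensity).
% The posterior pi_t, with pi_t^i = P(X_t = e_i | F_t^Y), solves the SDE
\mathrm{d}\pi_t
  = Q^{\top}\pi_t\,\mathrm{d}t
  + \bigl(\operatorname{diag}(h) - (h^{\top}\pi_t)\,I\bigr)\,\pi_t\,
    \bigl(\mathrm{d}Y_t - (h^{\top}\pi_t)\,\mathrm{d}t\bigr).
```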