Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation

ArXiv ID: 2404.01338

Authors: Unknown

Abstract

Financial news items are unstructured sources of information that can be mined to extract knowledge for market screening applications. Manual extraction of relevant information from the continuous stream of finance-related news is cumbersome and beyond the skills of many investors, who, at most, can follow a few sources and authors. Accordingly, we focus on the analysis of financial news to identify relevant text and, within that text, forecasts and predictions. We propose a novel Natural Language Processing (NLP) system to assist investors in the detection of relevant financial events in unstructured textual sources by considering both relevance and temporality at the discursive level. Firstly, we segment the text to group together closely related text. Secondly, we apply co-reference resolution to discover internal dependencies within segments. Finally, we perform relevant topic modelling with Latent Dirichlet Allocation (LDA) to separate relevant from less relevant text and then analyse the relevant text using a Machine Learning-oriented temporal approach to identify predictions and speculative statements. We created an experimental data set composed of 2,158 financial news items that were manually labelled by NLP researchers to evaluate our solution. The ROUGE-L values for the identification of relevant text and predictions/forecasts were 0.662 and 0.982, respectively. To our knowledge, this is the first work to jointly consider relevance and temporality at the discursive level. It contributes to the transfer of human associative discourse capabilities to expert systems through the combination of multi-paragraph topic segmentation and co-reference resolution to separate author expression patterns, topic modelling with LDA to detect relevant text, and discursive temporality analysis to identify forecasts and predictions within this text.

Keywords: Natural Language Processing, Latent Dirichlet Allocation, co-reference resolution, topic segmentation, financial forecasting, General (Market Screening)
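
To make the LDA step in the abstract more concrete, the following is a minimal sketch of topic inference with collapsed Gibbs sampling, written in plain Python. It is not the authors' implementation: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the number of topics, and the toy segments are all assumptions chosen for illustration. The abstract does not say how a topic is mapped to "relevant" or "less relevant", so the sketch stops at the per-segment topic mixtures.

```python
import random


def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents.

    Returns (doc_topic, topic_word, vocab). All parameter names and
    defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    n_dk = [[0] * n_topics for _ in docs]            # document-topic counts
    n_kw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                             # tokens assigned to each topic
    z = []                                           # topic assignment of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Remove the current token from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | all other assignments), unnormalised.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Smoothed point estimates of the two LDA distributions.
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[t][v] + beta) / (n_k[t] + n_words * beta) for v in range(n_words)]
        for t in range(n_topics)
    ]
    return doc_topic, topic_word, vocab


# Toy, made-up segments (already tokenised); not taken from the paper's data set.
segments = [
    "shares profit forecast rise analysts expect".split(),
    "weather rain weekend city traffic".split(),
    "profit shares analysts forecast growth".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(segments, n_topics=2)
for mix in doc_topic:
    print([round(p, 3) for p in mix])
```

Each printed row is the estimated topic mixture of one segment. A downstream rule (for example, thresholding the weight of a topic judged to be finance-relevant) would be needed to turn these mixtures into a relevance decision, and that rule is left open here.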

Complexity vs Empirical Score

  • Math Complexity: 3.5/10
  • Empirical Rigor: 7.0/10
  • Quadrant: Street Traders
  • Why: The paper builds an NLP pipeline from established components (LDA topic modelling and machine-learning classification). Empirical rigor is fairly high because the experimental setup is described in detail, with a manually labelled data set of 2,158 news items and ROUGE-L as the evaluation metric. Math complexity is low: the work relies on standard probabilistic topic modelling and classification techniques and introduces no novel mathematical derivations.
  flowchart TD
    A["Research Goal<br>Automatic detection of relevant info,<br>predictions, and forecasts in financial news"] --> B["Data Input<br>2,158 financial news items<br>(manually labeled)"]

    B --> C{"Methodology Pipeline"}

    C --> D["Step 1: Topic Segmentation<br>Group closely related text"]
    D --> E["Step 2: Co-reference Resolution<br>Discover internal text dependencies"]
    E --> F["Step 3: LDA Topic Modelling<br>Detect relevant vs. less relevant text"]
    F --> G["Step 4: Temporal Analysis<br>Identify predictions & forecasts<br>within relevant text"]

    G --> H["Key Outcomes<br>ROUGE-L Score: 0.662 (Relevance)<br>ROUGE-L Score: 0.982 (Predictions)<br>First work to jointly consider relevance & temporality"]
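
The ROUGE-L scores reported above are based on the longest common subsequence (LCS) between a system output and a reference text. The sketch below shows one common formulation, an F-measure over LCS-based precision and recall. The whitespace tokenisation, the lowercasing, and the default `beta=1.0` are assumptions; the summary does not state which ROUGE-L variant or tooling the authors used, so this will not necessarily reproduce their numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between a reference and a candidate string.

    Tokenisation (lowercase + whitespace split) and beta=1.0 are
    illustrative assumptions, not settings taken from the paper.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


# Made-up example: reference annotation vs. text selected by a system.
reference = "analysts expect shares to rise next quarter"
candidate = "analysts expect shares to fall next quarter"
print(round(rouge_l(reference, candidate), 3))
```

In an evaluation like the one summarised here, `reference` would be the manually labelled relevant (or predictive) text and `candidate` the text selected by the system, with scores averaged over the news items.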