A compendium of data sources for data science, machine learning, and artificial intelligence

ArXiv ID: 2309.05682

Authors: Unknown

Abstract

Recent advances in data science, machine learning, and artificial intelligence, such as the emergence of large language models, are leading to an increasing demand for data that can be processed by such models. While data sources are application-specific, and it is impossible to produce an exhaustive list of them, a comprehensive, rather than complete, list would still benefit data scientists and machine learning experts of all levels of seniority. The goal of this publication is to provide just such an (inevitably incomplete) list, or compendium, of data sources across multiple application areas, including finance and economics, legal (laws and regulations), life sciences (medicine and drug discovery), news sentiment and social media, retail and ecommerce, satellite imagery, shipping and logistics, and sports.

Keywords: data science, large language models, data sources, compendium, finance and economics

Complexity vs Empirical Score

  • Math Complexity: 0.5/10
  • Empirical Rigor: 2.0/10
  • Quadrant: Philosophers
  • Why: The paper is a resource compendium with almost no advanced mathematical notation or derivations, and its empirical focus is on listing data sources rather than presenting backtested strategies or statistical analysis.
  flowchart TD
    A["Research Goal: Identify and compile data sources<br>for DS, ML, and AI across diverse application areas"] --> B["Methodology: Review & Synthesis"]
    B --> C{"Inputs: Existing literature,<br>domain-specific datasets & APIs"}
    C --> D["Computational Process: Categorization &<br>Analysis of source applicability"]
    D --> E["Key Findings/Outcomes:<br>Comprehensive list of sources across 8 domains<br>Finance, Legal, Life Sciences, Social Media,<br>Retail, Satellite, Logistics, Sports"]
    E --> F["Value: Addresses data demand<br>driven by Large Language Models"]