Sentiment Analysis in Python
The quantitative extraction of bullish or bearish signals from unstructured text sources — news, earnings calls, social media — to generate alpha in systematic trading strategies.
Definition
Sentiment Analysis in quantitative finance applies natural language processing (NLP) techniques to unstructured text data — financial news, SEC filings, earnings call transcripts, analyst reports, and social media — to extract a quantitative sentiment signal that can be incorporated into a trading model. Modern approaches range from lexicon-based methods (assigning pre-defined sentiment scores to financial vocabulary using dictionaries like Loughran-McDonald) to transformer-based large language models fine-tuned on financial text (FinBERT, BloombergGPT). The core hypothesis is that textual sentiment contains information about future returns that is not fully captured by price and volume data alone.
Quantitative Formula
$$S_t = \frac{N^{+}_t - N^{-}_t}{N^{+}_t + N^{-}_t + N^{0}_t}$$
Where $N^{+}_t$, $N^{-}_t$, and $N^{0}_t$ are the counts of positive, negative, and neutral tokens (words or sentences) in a document at time $t$, classified according to a sentiment lexicon or ML model. The resulting $S_t$ is the net sentiment score, bounded between -1 and 1. For cross-sectional strategies, sentiment scores are typically standardized (z-scored) within a universe and combined with price momentum and fundamental signals in a multi-factor model.
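As a rough illustration of the scoring and standardization step described above, the sketch below computes a net sentiment score from token counts and then z-scores it across tickers within each date. It assumes the score is normalized by the total of positive, negative, and neutral tokens. The helper names, column names, tickers, and counts are all hypothetical and chosen only for the example.

```python
import pandas as pd

def net_sentiment(n_pos: int, n_neg: int, n_neu: int) -> float:
    """(pos - neg) / (pos + neg + neu); returns 0.0 for a document with no tokens."""
    total = n_pos + n_neg + n_neu
    return (n_pos - n_neg) / total if total > 0 else 0.0

def cross_sectional_zscore(scores: pd.DataFrame) -> pd.Series:
    """Z-score the 'sentiment' column across tickers within each date."""
    grouped = scores.groupby("date")["sentiment"]
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample standard deviation (ddof=1)
    # Dates with a single ticker or identical scores have no dispersion: leave NaN
    return (scores["sentiment"] - mean) / std.where(std > 0)

# Hypothetical token counts for three tickers on one date
counts = pd.DataFrame({
    "date": ["2024-01-02"] * 3,
    "ticker": ["AAA", "BBB", "CCC"],
    "n_pos": [6, 2, 4],
    "n_neg": [2, 6, 4],
    "n_neu": [12, 12, 12],
})
counts["sentiment"] = [
    net_sentiment(p, n, u)
    for p, n, u in zip(counts["n_pos"], counts["n_neg"], counts["n_neu"])
]
counts["sentiment_z"] = cross_sectional_zscore(counts)
print(counts[["ticker", "sentiment", "sentiment_z"]])
```

Standardizing within each date keeps the signal comparable across days with different overall news tone, which is one reason z-scoring is commonly applied before combining sentiment with other factors.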
Why It Matters in Backtesting
Sentiment-based backtests carry a specific lookahead-bias risk: timestamps supplied by commercial news vendors can be off by hours or even days, so a backtest may treat articles published after the market close as if they had been available before the open. Guarding against this means checking each publication timestamp against the timestamp of the trade it is supposed to inform, ideally at tick resolution. Lexicon choice matters as well: a finance-specific lexicon such as Loughran-McDonald should be preferred over general-purpose lexicons like VADER or AFINN, because words like 'liability', 'risk', and 'derivative' are largely neutral in financial text but can be scored as negative by generic sentiment models, which biases the signal systematically.
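One conservative way to handle the timestamp problem with daily bars is to map each article to the first session on which it could have been traded at the open. The sketch below does this under stated assumptions: timestamps are timezone-naive and already in exchange-local time, the open is 09:30, and anything published at or after the open is pushed to the next session. The function name, the open time, and the sample dates are hypothetical.

```python
import pandas as pd

MARKET_OPEN = pd.Timedelta(hours=9, minutes=30)  # assumed open, exchange-local time

def first_tradable_session(published, sessions: pd.DatetimeIndex):
    """Return the first session whose open is after the publication time.

    `sessions` must be a sorted DatetimeIndex of midnight-normalized trading dates.
    Returns pd.NaT when no later session exists in `sessions`.
    """
    published = pd.Timestamp(published)
    day = published.normalize()
    before_open = (published - day) < MARKET_OPEN
    # side="left" keeps the publication day eligible; side="right" forces a later session
    pos = sessions.searchsorted(day, side="left" if before_open else "right")
    return sessions[pos] if pos < len(sessions) else pd.NaT

# Hypothetical trading calendar: business days from Tue 2024-01-02 to Tue 2024-01-09
sessions = pd.bdate_range("2024-01-02", "2024-01-09")

articles = pd.DataFrame({
    "published": pd.to_datetime([
        "2024-01-03 08:00",  # pre-open: tradable the same day
        "2024-01-03 17:30",  # after the close: next session
        "2024-01-06 10:00",  # Saturday: following Monday
    ]),
})
articles["session"] = [first_tradable_session(ts, sessions) for ts in articles["published"]]
print(articles)
```

This does not repair an inaccurate vendor timestamp; it only ensures that, given a timestamp, the signal is never assigned to a session that opened before the article existed.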
Python Implementation
import re

import numpy as np
import pandas as pd

# Simplified Loughran-McDonald inspired word lists
LM_POSITIVE = {"strong", "exceeded", "growth", "record", "outperform", "raised", "beat", "robust"}
LM_NEGATIVE = {"weak", "missed", "decline", "loss", "impairment", "restructuring", "headwind", "default"}


def calculate_financial_sentiment(articles_df: pd.DataFrame,
                                  price_df: pd.DataFrame,
                                  forward_window: int = 5) -> pd.DataFrame:
    """
    Computes document-level financial sentiment and validates signal vs forward returns.
    articles_df: DataFrame with ['timestamp', 'ticker', 'text'] columns.
    price_df: DataFrame with a sorted DatetimeIndex and ticker columns (daily returns).
    """
    results = []
    for _, row in articles_df.iterrows():
        # Keep alphabetic tokens only, so words followed by punctuation ("growth,") still match
        tokens = re.findall(r"[a-z]+", str(row["text"]).lower())
        n_pos = sum(1 for t in tokens if t in LM_POSITIVE)
        n_neg = sum(1 for t in tokens if t in LM_NEGATIVE)
        # Neutral tokens are not counted here, so the score is normalized by matched tokens only
        total = n_pos + n_neg
        net_sentiment = (n_pos - n_neg) / total if total > 0 else 0.0

        # Forward return starting AFTER the article date (no lookahead):
        # iloc[1:] skips the first available row, i.e. the publication day's own return
        pub_date = pd.to_datetime(row["timestamp"]).date()
        ticker = row["ticker"]
        fwd_return = np.nan
        if ticker in price_df.columns:
            future_returns = price_df[ticker].loc[str(pub_date):]
            if len(future_returns) > forward_window:
                fwd_return = (1 + future_returns.iloc[1:forward_window + 1]).prod() - 1

        results.append({"timestamp": row["timestamp"], "ticker": ticker,
                        "sentiment": net_sentiment, "n_positive": n_pos,
                        "n_negative": n_neg, "forward_return": fwd_return})

    results_df = pd.DataFrame(results)
    # Pearson correlation between sentiment and forward return, used as the information coefficient
    signal_ic = results_df["sentiment"].corr(results_df["forward_return"])
    results_df.attrs["information_coefficient"] = signal_ic
    # A NaN correlation (too few valid pairs) compares as False here
    results_df.attrs["signal_valid"] = bool(abs(signal_ic) > 0.05)
    return results_df