Machine Learning · Financial Analytics

US Stocks Indicators

Predictive Modeling and Financial Segmentation

Machine Learning pipeline to classify US stocks by price variation using 173 financial indicators (2014–2018). Compares Logistic Regression, Random Forest, SVM RBF, and XGBoost, complemented by unsupervised segmentation with PCA, t-SNE, UMAP, and K-Means.

Python · Scikit-learn · XGBoost · PCA · t-SNE · UMAP · K-Means
11,566 Records
173 Indicators
2014–2018 Period
11 Sectors

Problem & Objective

Business Question

Can a company's financial indicators help anticipate whether its stock will experience a positive price variation, while simultaneously revealing groups of companies with similar financial behaviors?

Business Value

This project can function as an initial financial screening tool. It does not aim to predict exact prices, but to classify companies based on financial signals that could be associated with positive or negative price variations. It supports investment analysis processes, company prioritization, and exploration of financial profiles by sector.

STAR Framework

Situation

Database of 11,566 records with 173 financial variables of US stocks (2014–2018) across 11 sectors. Need to evaluate whether these indicators can classify price variations.

Task

Build an analytical pipeline to classify stocks with positive/negative variation using supervised models and complement with unsupervised segmentation.

Action

Data preparation, scaling, One-Hot encoding of Sector, and a chronological 70/15/15 train/validation/test split. Comparison of Logistic Regression, Random Forest, SVM RBF, and XGBoost. Segmentation with PCA, t-SNE, UMAP, and K-Means.

Result

Random Forest achieved the highest AUC (~0.62). XGBoost offered the better balance on the positive class (F1 ~0.22). t-SNE/UMAP clusters revealed consistent groups of financial profiles.

Data Overview

Dataset Information

  • Records: 11,566
  • Variables: 173 financial indicators
  • Period: 2014–2018
  • Sectors: 11 categories
  • Target: Class (positive/negative variation)
  • Reference: Price Var (continuous)

Class Distribution

  • Class 1 (Positive): ~56.3%
  • Class 0 (Negative): ~43.7%
  • Relatively balanced distribution

Note: Price Var behavior changes significantly by year. The 2017 average is negative, while the 2015, 2016, and 2018 averages are positive.
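The two checks above (class shares and average Price Var per year) can be reproduced with a few lines of pandas. The following is a minimal sketch on toy data, not the project's notebook: the `Year` column name and the rule that derives `Class` from the sign of `Price Var` are assumptions made for illustration, and the helper names are hypothetical.

```python
import pandas as pd


def class_share(df: pd.DataFrame, target: str = "Class") -> pd.Series:
    """Share of each class, in percent, rounded to one decimal."""
    return (df[target].value_counts(normalize=True) * 100).round(1)


def mean_price_var_by_year(
    df: pd.DataFrame, year_col: str = "Year", value_col: str = "Price Var"
) -> pd.Series:
    """Average price variation per year."""
    return df.groupby(year_col)[value_col].mean()


# Toy data only; the real dataset has 11,566 rows.
toy = pd.DataFrame(
    {
        "Year": [2015, 2015, 2016, 2016, 2017, 2017],
        "Price Var": [10.0, -2.0, 5.0, 3.0, -8.0, 2.0],
    }
)
# Assumption: Class is 1 when Price Var is positive, 0 otherwise.
toy["Class"] = (toy["Price Var"] > 0).astype(int)

shares = class_share(toy)
yearly = mean_price_var_by_year(toy)
print(shares)
print(yearly)
```

A year whose average sits on the other side of zero from its neighbours, as 2017 does in the real data, is a signal that a random split would mix market regimes, which is one motivation for the chronological split used later.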

Methodology

1. Data Preparation: Separation of predictors and target. Exclusion of Price Var and Class from the predictors to avoid data leakage. Scaling of numerical variables. One-Hot encoding for Sector.
2. Temporal Validation: Split into 70% training, 15% validation, and 15% test, respecting chronological order to avoid mixing future information with past data.
3. Supervised Models: Logistic Regression (L1 regularization), Random Forest, SVM RBF, and XGBoost, evaluated via Accuracy, F1-score, and AUC.
4. Unsupervised Segmentation: Dimensionality reduction with PCA, t-SNE, and UMAP. Clustering with K-Means. Evaluation via Silhouette Score and Inertia. Cluster analysis against Price Var deciles.
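The supervised part of the methodology (steps 1 to 3) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's actual code: the data is synthetic, the column names `Year`, `indicator_a`, and `indicator_b` are hypothetical, and `temporal_split` and `evaluate` are helper names introduced here. Only the L1 Logistic Regression is fitted; the other models could be swapped into the `clf` step of the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def temporal_split(df, time_col, train_frac=0.70, val_frac=0.15):
    """Sort by time and slice into train/validation/test without shuffling."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]


def evaluate(model, X, y):
    """Accuracy and F1 at the default threshold, plus threshold-free AUC."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]
    return {
        "accuracy": accuracy_score(y, pred),
        "f1": f1_score(y, pred),
        "auc": roc_auc_score(y, proba),
    }


# Synthetic stand-in for the real table (assumed column names).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Year": np.repeat([2014, 2015, 2016, 2017, 2018], n // 5),
        "Sector": rng.choice(["Tech", "Energy", "Health"], size=n),
        "indicator_a": rng.normal(size=n),
        "indicator_b": rng.normal(size=n),
    }
)
toy["Price Var"] = 5 * toy["indicator_a"] + rng.normal(scale=5, size=n)
toy["Class"] = (toy["Price Var"] > 0).astype(int)

# Price Var and Class are deliberately left out of the predictors.
numeric_cols = ["indicator_a", "indicator_b"]
categorical_cols = ["Sector"]
feature_cols = numeric_cols + categorical_cols

train, val, test = temporal_split(toy, "Year")

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)
model = Pipeline(
    [
        ("prep", preprocess),
        ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
    ]
)

# Scaler and encoder statistics are learned from the training slice only.
model.fit(train[feature_cols], train["Class"])
val_metrics = evaluate(model, val[feature_cols], val["Class"])
test_metrics = evaluate(model, test[feature_cols], test["Class"])
print(val_metrics, test_metrics)
```

Wrapping the scaler and encoder in a `Pipeline` keeps the preprocessing fitted on the training slice alone, so validation and test rows never influence the scaling statistics.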

Key Insights

Balanced distribution: The dataset is relatively balanced: ~56.3% class 1 (positive) vs ~43.7% class 0 (negative).
Temporal variation in Price Var: 2017 shows a negative average Price Var, while 2015, 2016, and 2018 present positive averages, indicating strong temporal dynamics.
Random Forest, best AUC: Random Forest achieved the highest AUC (~0.62), showing the best global discrimination, though with signs of moderate overfitting.
XGBoost, best stability: XGBoost offered better balance for the positive class (F1 ~0.22) with reasonable stability between validation and test sets.
Segmentation patterns: PCA obtained the highest Silhouette Score, while t-SNE and UMAP revealed more interpretable clusters: moderate variation companies, positive/negative extremes, and higher drawdown exposure.
Sector insights: The analysis across 11 sectors reveals differentiated financial profiles that can support investment filtering and company prioritization.
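The segmentation workflow behind these insights (reduce, cluster, score, then compare clusters with Price Var deciles) might look roughly like the sketch below. It runs on synthetic data with two planted groups and uses only PCA for the reduction step; `sklearn.manifold.TSNE` or the `umap-learn` package could take its place. The range of `k` values tried and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic indicators: two well-separated groups of 150 companies each.
group_a = rng.normal(loc=0.0, scale=1.0, size=(150, 6))
group_b = rng.normal(loc=4.0, scale=1.0, size=(150, 6))
X = np.vstack([group_a, group_b])
# Synthetic price variation, shifted between the two groups.
price_var = np.concatenate([rng.normal(-5, 10, 150), rng.normal(8, 10, 150)])

# Scale first, then project to two dimensions.
X_scaled = StandardScaler().fit_transform(X)
embedding = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Score several cluster counts with Silhouette and Inertia.
results = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embedding)
    results[k] = {
        "silhouette": silhouette_score(embedding, km.labels_),
        "inertia": km.inertia_,
    }

best_k = max(results, key=lambda k: results[k]["silhouette"])
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(embedding)

# Cross-tabulate clusters against Price Var deciles (1 = lowest, 10 = highest).
deciles = pd.qcut(price_var, q=10, labels=False) + 1
table = pd.crosstab(
    pd.Series(labels, name="cluster"),
    pd.Series(deciles, name="price_var_decile"),
)
print(results)
print(table)
```

The resulting `table` is the kind of input a clusters-versus-deciles heatmap would be drawn from: a cluster concentrated in the top or bottom deciles corresponds to the "extremes" profile described above.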

Recommended Portfolio Visuals

KPI cards: 11,566 records · 173 indicators · 2014–2018 · 11 sectors
Class variable distribution
Average Price Var per year
Model comparison: Accuracy, F1, AUC
2D cluster map with t-SNE and UMAP
Heatmap: clusters vs Price Var deciles
Feature importance (best model)
Confusion matrix (selected model)

Google Slides Presentation

A detailed walkthrough of the analysis, methodology, and key findings using the STAR framework.


Limitations

  • The stock market is influenced by external factors not always reflected in financial statements.
  • An AUC of ~0.62 is only modestly above the 0.5 chance level: the model picks up some signal but has limited predictive capacity.
  • Sector appears to be numerically encoded; the final version should map the codes to real sector names.
  • SHAP values or a feature-importance analysis are recommended to improve model interpretability.
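As one dependency-light way to approach the interpretability point, the sketch below uses scikit-learn's `permutation_importance`. This is a different technique from SHAP and is shown only as a stand-in; the data is synthetic, with the first feature deliberately made the only informative one, and the model settings are assumptions rather than the project's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: only feature 0 carries signal about the label.
X = rng.normal(size=(400, 4))
noise = rng.normal(scale=0.5, size=400)
y = (X[:, 0] + noise > 0).astype(int)

# Hold out the last 100 rows so importance is measured on unseen data.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time and record the drop in AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

Measuring the drop on held-out rows rather than training rows matters here, since the Random Forest in this project already showed signs of moderate overfitting.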

Next Steps

  • Map real sector names.
  • Add confusion matrix and ROC curve.
  • Use SHAP for XGBoost explainability.
  • Build interactive dashboard with Tableau or Power BI.
  • Publish clean notebook on GitHub with README, dataset, and visuals.
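For the confusion matrix and ROC curve step, a minimal sketch with scikit-learn is given below. The labels and scores are made-up values for illustration, and the 0.5 decision threshold is an assumption; in practice `y_score` would come from a fitted model's `predict_proba` on the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for eight companies.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.45, 0.6])

# Hard predictions at an assumed 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# ROC points and the area under the curve use the raw scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(cm)
print(auc)
```

The `fpr` and `tpr` arrays can be passed directly to a plotting library to draw the curve. Because F1 depends on the chosen threshold while AUC does not, showing both views side by side helps explain a combination such as AUC ~0.62 with F1 ~0.22.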