US Stocks Indicators – Camilo Sierra

Key Metrics

Records: 11,566
Indicators: 173
Period: 2014–2018
Sectors: 11

Business Context

Problem & Objective

Business Question

Can a company's financial indicators help anticipate whether its stock will experience a positive price variation, while simultaneously revealing groups of companies with similar financial behaviors?

Business Value

This project can function as an initial financial screening tool. It does not aim to predict exact prices, but to classify companies based on financial signals that could be associated with positive or negative price variations. It supports investment analysis processes, company prioritization, and exploration of financial profiles by sector.

Methodology

STAR Framework

Situation: A database of 11,566 records with 173 financial variables for US stocks (2014–2018) across 11 sectors. The open question is whether these indicators can classify price variations.

Task: Build an analytical pipeline that classifies stocks by positive or negative price variation with supervised models, complemented by unsupervised segmentation.

Action: Data preparation, scaling, One-Hot encoding of Sector, and a temporal 70/15/15 split. Comparison of Logistic Regression, Random Forest, SVM RBF, and XGBoost. Segmentation with PCA, t-SNE, UMAP, and K-Means.

Result: Random Forest achieved the highest AUC (~0.62). XGBoost offered the better balance for the positive class (F1 ~0.22). The t-SNE and UMAP clusters revealed consistent groups of financial profiles.

Dataset

Data Overview

Dataset Information

Records: 11,566
Variables: 173 financial indicators
Period: 2014–2018
Sectors: 11 categories
Target: Class (positive/negative variation)
Reference: Price Var (continuous)

Class Distribution

Class 1 (Positive): ~56.3%
Class 0 (Negative): ~43.7%
Relatively balanced distribution

Note: Price Var behavior changes significantly by year: 2017 shows a negative average, while 2015, 2016, and 2018 show positive averages.
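The class shares and per-year averages above can be checked with a simple group-by. The sketch below uses a tiny made-up table, not the project data; the column names `Year`, `Price Var`, and `Class`, and the rule that Class is 1 when Price Var is positive, are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real table; values and column names are assumptions.
df = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018],
    "Price Var": [12.0, -4.0, 20.0, 6.0, -15.0, 5.0, 9.0, 3.0],
})

# Assumed target definition: 1 for a positive price variation, 0 otherwise.
df["Class"] = (df["Price Var"] > 0).astype(int)

# Share of each class, indexed by class label (0, 1).
class_share = df["Class"].value_counts(normalize=True).sort_index()

# Average Price Var per year, to spot years that behave differently.
yearly_mean = df.groupby("Year")["Price Var"].mean()

print(class_share)
print(yearly_mean)
```

A year whose average has the opposite sign from the others is a hint that a purely random train/test split could hide regime changes, which is one reason to prefer a chronological split.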

Process

Methodology

1. Data Preparation
Separation of predictors and target. Exclusion of Price Var and Class to avoid data leakage. Scaling of numerical variables. One-Hot encoding for Sector.

2. Temporal Validation
Split 70% training, 15% validation, 15% test, respecting chronological order to avoid mixing future information with past data.

3. Supervised Models
Logistic Regression (L1 regularization), Random Forest, SVM RBF, XGBoost, evaluated via Accuracy, F1-score, and AUC.

4. Unsupervised Segmentation
Dimensionality reduction with PCA, t-SNE, and UMAP. Clustering with K-Means. Evaluation via Silhouette Score and Inertia. Cluster analysis against Price Var deciles.
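Steps 1 to 3 could be sketched roughly as follows with scikit-learn. This is an illustration on synthetic data, not the project's actual notebook: the indicator columns `ratio_a` and `ratio_b`, the `Year` ordering column, and the choice to fit the scaler on the training rows only are assumptions, and only the L1 Logistic Regression is shown out of the four models.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data, already sorted by year; all names are hypothetical.
df = pd.DataFrame({
    "Year": np.sort(rng.integers(2014, 2019, size=n)),
    "Sector": rng.integers(0, 11, size=n),
    "ratio_a": rng.normal(size=n),
    "ratio_b": rng.normal(size=n),
})
df["Price Var"] = 5 * df["ratio_a"] + rng.normal(scale=10, size=n)
df["Class"] = (df["Price Var"] > 0).astype(int)

# Step 1: predictors exclude Price Var and Class to avoid leakage.
# Year is also dropped here (an assumption); it is used only for ordering.
X = df.drop(columns=["Price Var", "Class", "Year"])
y = df["Class"]

# Step 2: chronological 70/15/15 split on the year-sorted rows.
train_end, val_end = n * 70 // 100, n * 85 // 100
X_train, y_train = X.iloc[:train_end], y.iloc[:train_end]
X_val, y_val = X.iloc[train_end:val_end], y.iloc[train_end:val_end]
X_test, y_test = X.iloc[val_end:], y.iloc[val_end:]

# Scaling for numeric columns, One-Hot encoding for Sector.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["ratio_a", "ratio_b"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sector"]),
])

# Step 3: one of the compared models, Logistic Regression with L1 penalty.
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])
model.fit(X_train, y_train)  # preprocessing statistics come from train only


def evaluate(X_part, y_part):
    """Return Accuracy, F1, and AUC for one data partition."""
    pred = model.predict(X_part)
    proba = model.predict_proba(X_part)[:, 1]
    return {
        "accuracy": accuracy_score(y_part, pred),
        "f1": f1_score(y_part, pred),
        "auc": roc_auc_score(y_part, proba),
    }


print("validation:", evaluate(X_val, y_val))
print("test:", evaluate(X_test, y_test))
```

Wrapping the preprocessing and the classifier in one pipeline keeps the validation and test rows out of the scaler's statistics, which matches the goal of not mixing future information with past data.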

Findings

Key Insights

Balanced distribution
The dataset is relatively balanced: ~56.3% class 1 (positive) vs ~43.7% class 0 (negative).

Temporal variation in Price Var
2017 shows a negative average Price Var, while 2015, 2016, and 2018 present positive averages, indicating strong temporal dynamics.

Random Forest: Best AUC
Random Forest achieved the highest AUC (~0.62), showing the best global discrimination, though with signs of moderate overfitting.

XGBoost: Best stability
XGBoost offered better balance for the positive class (F1 ~0.22) with reasonable stability between validation and test sets.

Segmentation patterns
PCA obtained the highest Silhouette Score, while t-SNE and UMAP revealed more interpretable clusters: moderate variation companies, positive/negative extremes, and higher drawdown exposure.

Sector insights
The analysis across 11 sectors reveals differentiated financial profiles that can support investment filtering and company prioritization.
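The segmentation findings rest on a reduce, cluster, and compare workflow. A minimal sketch of it is shown below on synthetic data, using PCA only (t-SNE and UMAP would slot into the same place as the reduction step); the number of clusters, the feature matrix, and the synthetic `price_var` values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic indicators and a synthetic Price Var (not the project data).
features = rng.normal(size=(n, 6))
price_var = 10 * features[:, 0] + rng.normal(scale=5, size=n)

# Scale, then reduce to 2 dimensions before clustering.
scaled = StandardScaler().fit_transform(features)
embedding = PCA(n_components=2, random_state=0).fit_transform(scaled)

# K-Means on the reduced space; 3 clusters is an arbitrary choice here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embedding)
labels = kmeans.labels_

# The two evaluation measures named in the methodology.
sil = silhouette_score(embedding, labels)
inertia = kmeans.inertia_

# Price Var deciles (1 = lowest, 10 = highest) cross-tabulated with clusters.
deciles = pd.qcut(price_var, q=10, labels=False) + 1
table = pd.crosstab(
    pd.Series(labels, name="cluster"),
    pd.Series(deciles, name="decile"),
)

print(f"silhouette={sil:.3f} inertia={inertia:.1f}")
print(table)
```

The cross-tabulation is the data behind a clusters vs Price Var deciles heatmap: a cluster concentrated in the lowest or highest deciles would correspond to the "extremes" profiles mentioned above.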

Visualizations

Recommended Portfolio Visuals

KPI cards: 11,566 records · 173 indicators · 2014–2018 · 11 sectors

Class variable distribution

Average Price Var per year

Model comparison: Accuracy, F1, AUC

2D cluster map with t-SNE and UMAP

Heatmap: clusters vs Price Var deciles

Feature importance (best model)

Confusion matrix (selected model)

Project Presentation

Google Slides Presentation

A detailed walkthrough of the analysis, methodology, and key findings using the STAR framework.


Limitations

The stock market is influenced by external factors that financial statements do not always reflect.
An AUC of ~0.62 is only modestly above the 0.5 expected from random ranking: the model identifies some signal, but its predictive capacity is moderate at best.
Sector appears numerically encoded; the final version should map the codes to real sector names.
SHAP values or feature importances are recommended to improve model interpretability.
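To put the AUC figure in context: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The small pure-Python sketch below computes it by direct pair counting on made-up labels and scores; the function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 3 positives, 2 negatives, so 6 pairs in total.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.3, 0.5, 0.2]
auc = pairwise_auc(y_true, scores)
print(round(auc, 4))  # 5 of 6 pairs are ranked correctly -> 0.8333
```

Read this way, an AUC of about 0.62 means the model ranks a positive stock above a negative one in roughly 62 of 100 such pairs.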

Next Steps

Map real sector names.
Add confusion matrix and ROC curve.
Use SHAP for XGBoost explainability.
Build interactive dashboard with Tableau or Power BI.
Publish clean notebook on GitHub with README, dataset, and visuals.
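For the confusion matrix and ROC curve step, scikit-learn already provides the building blocks. The sketch below uses made-up labels and scores and a 0.5 decision threshold, which is an assumption; the actual model outputs would replace `y_true` and `y_score`.

```python
from sklearn.metrics import confusion_matrix, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.2, 0.6, 0.7, 0.4, 0.9, 0.1, 0.8, 0.3]

# Hard predictions at an assumed threshold of 0.5.
y_pred = [int(s >= 0.5) for s in y_score]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# ROC points: false positive rate and true positive rate per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(cm)
print(list(zip(fpr.round(2), tpr.round(2))))
```

Plotting `fpr` against `tpr` gives the ROC curve, and the four confusion matrix counts make it visible whether errors fall mostly on missed positives or on false alarms.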