Yumbox
Yadgiri Machine Toolbox: a very yummy project! (Yadgiri is Persian for "learning".)
What is Yumbox?
Yumbox is a reusable Machine Learning toolbox that helps bootstrap ML projects as quickly as possible. It follows the best practices I have accumulated over the years: ensuring reproducibility, tracking experiments, following DRY (Don't Repeat Yourself), and keeping both code and data under version control (VC). The project focuses especially on Semantic Similarity, Classification, Information Retrieval, and the intersection of the three: Entity Resolution (and Entity Matching).
Why did Yumbox come to be?
I found myself reusing the same classes and functions across projects over the years, from my university days (the pre-LLM chatbot era) to today. I created this project to gather these proven pieces, give them standard interfaces, and make them freely available so that anyone can prototype ML projects faster.
How Yumbox came to be!
I gradually refactored classes and functions from the projects I worked on in industry and at university and added them to Yumbox. I still do so whenever I find a highly valuable piece of functionality that can be reused and shared across projects. Because the modules cover vastly different scopes, I follow a dynamic import paradigm, so you only have to install the dependencies required for the functionality you need. I plan to turn these dynamic imports into lazy imports later, as popular libraries such as Hugging Face's transformers do.
Modules Overview
Below you'll find detailed guides for every module in Yumbox:
| Module | Purpose |
|---|---|
| Cache | Caching decorators & storage backends (pickle, LMDB, Redis, FAISS) |
| Config | Global config (BFG) + production-ready logger setup |
| Data | Flexible Datasets, Samplers & training utilities for PyTorch |
| Factory | FAISS index builders, PCA, clustering, similarity computation |
| Metrics | Classification, retrieval metrics + MLflow-ready plotting |
| MLflow | Experiment tracking, checkpoint management, multi-run analysis |
| NLP | Multilingual (EN/FA) text preprocessing, tokenization, analysis |
| Parse | Runtime dependency analysis & import tracing |
| Scraper | HTML parsing, text extraction, parallel image downloading |
| Scripts | CLI tools for MLflow analysis & checkpoint cleanup |
| Vectors | Vector search, feature fusion, array utilities |
Quick Start Examples
Cache results with one line
from yumbox.cache import cache

@cache  # Auto-saves to BFG["cache_dir"]/{func_name}.pkl
def expensive_feature_extraction(data):
    return heavy_computation(data)

# First call: runs the computation and saves the result
# Subsequent calls: load the result from the cache instantly
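If you are curious what a decorator like this does under the hood, here is a rough sketch of a pickle-backed cache keyed by function name. It is my simplified illustration, not Yumbox's actual implementation: `pickle_cache` and `CONFIG` are hypothetical names, with `CONFIG` standing in for `BFG`.

```python
import functools
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for BFG["cache_dir"]; a temp dir keeps the demo tidy.
CONFIG = {"cache_dir": tempfile.mkdtemp()}


def pickle_cache(func):
    """Cache the return value of `func` in <cache_dir>/<func_name>.pkl."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        path = Path(CONFIG["cache_dir"]) / f"{func.__name__}.pkl"
        if path.exists():
            # Cache hit: skip the computation entirely.
            with path.open("rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with path.open("wb") as f:
            pickle.dump(result, f)
        return result

    return wrapper


calls = []  # records how often the real function body runs


@pickle_cache
def slow_square(x):
    calls.append(x)
    return x * x


first = slow_square(4)   # computed, then written to disk
second = slow_square(4)  # read back from disk
```

Note that this sketch keys the cache on the function name only, so calling it with different arguments returns the stored result. Delete the `.pkl` file, or extend the key, when the inputs change.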
Set up logging + config in a few lines
from yumbox.config import BFG, setup_logger

BFG["cache_dir"] = "~/.cache/yumbox"  # Auto-creates directory

logger = setup_logger(
    name="my_project",
    path="./logs",               # Save logs to file
    capture_libs=["torch"],      # Also log PyTorch messages
    suppress_libs=["httpcore"],  # Silence noisy dependencies
)
logger.info("Ready to yum!")
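For intuition, capturing and suppressing library loggers can be approximated with the standard `logging` module alone. The sketch below is an assumption about the general technique, not the code behind `setup_logger`; `configure_logging` is a hypothetical name, and it logs to the console rather than to a file.

```python
import logging


def configure_logging(name, capture_libs=(), suppress_libs=()):
    """Return a logger for `name`; route or quiet third-party loggers."""
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger

    for lib in capture_libs:
        lib_logger = logging.getLogger(lib)
        lib_logger.setLevel(logging.INFO)
        lib_logger.addHandler(handler)  # library messages share our handler

    for lib in suppress_libs:
        # Only warnings and errors from noisy libraries get through.
        logging.getLogger(lib).setLevel(logging.WARNING)

    return logger


demo_logger = configure_logging(
    "my_project", capture_libs=["torch"], suppress_libs=["httpcore"]
)
demo_logger.info("logging configured")
```

Calling this helper twice for the same name would attach a second handler, so a real setup function should guard against repeated configuration.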
Build a FAISS index for cosine search
import numpy as np
from yumbox import factory
from yumbox.factory import build_index, pca_faiss
from yumbox.vectors import normalize_vector

# Your embeddings
embeddings = np.random.randn(10000, 768).astype(np.float32)

# Option 1: Quick flat index (exact search)
index = build_index(normalize_vector(embeddings))  # Inner product = cosine for unit vectors

# Option 2: PCA reduction + HNSW for speed
embeddings_reduced = pca_faiss(embeddings, n_components=128)
index = factory.FaissIndexBuilder().build_hnsw_index(
    normalize_vector(embeddings_reduced),
    M=32, efConstruction=200
)

# Search: query_vectors is a placeholder for your own float32 query array, preprocessed
# like the indexed vectors (same dimensionality, same normalization)
distances, indices = index.search(query_vectors, k=10)
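The comment "inner product = cosine for unit vectors" is worth seeing in numbers. Here is a tiny dependency-free sketch; the helper names are mine, not part of Yumbox or FAISS. It checks the identity and runs the brute-force top-k scoring that an exact (flat) index performs conceptually.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, vectors, k):
    """Exact search: score every vector, keep the k best (score, index) pairs."""
    scored = [(dot(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, reverse=True)[:k]


raw = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
unit = [l2_normalize(v) for v in raw]
query = l2_normalize([3.0, 4.0])

ip_of_unit = dot(unit[0], unit[1])   # inner product of normalized vectors
cos_of_raw = cosine(raw[0], raw[1])  # cosine similarity of the raw vectors
hits = top_k(query, unit, k=2)       # nearest neighbours by inner product
```

Because normalization divides each vector by its length, the inner product of the normalized vectors is exactly the cosine formula. That is why the vectors above are normalized before being added to an inner-product index. Zero vectors would need special handling, which this sketch skips.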