Creating an API-Free Machine Learning Workflow with MLE-Agent and Ollama
In this tutorial, we’ll show how to integrate MLE-Agent with Ollama to create a fully local machine learning workflow that operates without any external APIs. Using Google Colab, we’ll set up a reproducible environment, generate a synthetic dataset, and prompt MLE-Agent to draft a training script. Along the way, we’ll sanitize the generated code, ensure correct imports, and keep a robust fallback script in reserve, so the workflow stays automated without sacrificing reliability.
Setting Up Our Environment
To get started, we first establish our working environment in Google Colab. The initial imports and a small helper function let us run shell commands from Python and watch their output as each step executes.
```python
import os, re, time, textwrap, subprocess, sys
from pathlib import Path

def sh(cmd, check=True, env=None, cwd=None):
    """Run a shell command, print its combined output, and fail loudly on error."""
    print(f"$ {cmd}")
    p = subprocess.run(cmd, shell=True,
                       env={**os.environ, **(env or {})},  # merge extra vars into the inherited environment
                       cwd=cwd, stdout=subprocess.PIPE,
                       stderr=subprocess.STDOUT, text=True)
    print(p.stdout)
    if check and p.returncode != 0:
        raise RuntimeError(p.stdout)
    return p.stdout
```
This `sh()` helper executes shell commands and prints their combined output, raising an error whenever a command fails so problems surface immediately.
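For example, a quick sanity check confirms the helper behaves as expected before we rely on it for the rest of the workflow:

```python
out = sh("python --version")  # prints the command, its output, and returns the output as a string
assert "Python" in out
```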
Defining Our Workspace
Next, we define our workspace directories and file paths. This setup includes paths for our dataset, model files, and training scripts. Additionally, we will install the necessary Python packages.
```python
WORK = Path("/content/mle_colab_demo")
WORK.mkdir(parents=True, exist_ok=True)
PROJ = WORK / "proj"
PROJ.mkdir(exist_ok=True)
DATA = WORK / "data.csv"
MODEL = WORK / "model.joblib"
PREDS = WORK / "preds.csv"
SAFE = WORK / "train_safe.py"
RAW = WORK / "agent_train_raw.py"
FINAL = WORK / "train.py"
MODEL_NAME = os.environ.get("OLLAMA_MODEL", "llama3.2:1b")

sh("pip -q install --upgrade pip")
sh("pip -q install mle-agent==0.4.* scikit-learn pandas numpy joblib")
sh("curl -fsSL https://ollama.com/install.sh | sh")
sv = subprocess.Popen("ollama serve", shell=True)  # start the Ollama server in the background
time.sleep(4)  # give the server a moment to bind to its port
sh(f"ollama pull {MODEL_NAME}")
```
This section of our script ensures that we have all the necessary Python dependencies installed and prepares our local Ollama environment. We initiate the server to allow local model processing without needing external API keys.
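The fixed `time.sleep(4)` usually suffices in Colab, but a readiness poll is more robust. Below is a minimal sketch (the `wait_for_ollama` helper is our own addition, not part of MLE-Agent) that polls Ollama's `/api/tags` endpoint until the server answers:

```python
import json, urllib.request

def wait_for_ollama(url="http://127.0.0.1:11434/api/tags", timeout=60):
    """Poll the local Ollama HTTP endpoint until it responds or we time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as r:
                json.load(r)   # server is up and returning JSON
                return
        except Exception:
            time.sleep(1)      # not ready yet; retry
    raise RuntimeError("Ollama server did not become ready in time")

wait_for_ollama()
```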
Generating the Synthetic Dataset
To train our model effectively, we’ll first need to create a synthetic dataset:
```python
import numpy as np, pandas as pd

np.random.seed(0)
n = 500
X = np.random.rand(n, 4)             # four random features in [0, 1)
w = np.array([0.4, -0.2, 0.1, 0.5])  # true linear weights
y = (X @ w + 0.15 * np.random.randn(n) > 0.55).astype(int)  # noisy linear score, thresholded to binary labels
pd.DataFrame(np.c_[X, y], columns=["f1", "f2", "f3", "f4", "target"]).to_csv(DATA, index=False)
```
Here, we generate 500 samples with four features and a binary target derived from a noisy linear combination of the features, thresholded at 0.55, giving us a dataset ready for training.
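Before handing the CSV to the agent, it is worth a quick look at the shape and class balance (a simple sanity check, not part of the original pipeline):

```python
df_check = pd.read_csv(DATA)
print(df_check.shape)                                   # (500, 5): four features plus the target
print(df_check["target"].value_counts(normalize=True))  # rough class proportions
```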
Configuring MLE-Agent with Ollama
Next, we set environment variables and construct a strict prompt instructing MLE-Agent to generate the `train.py` script.
```python
env = {
    "OPENAI_API_KEY": "",
    "ANTHROPIC_API_KEY": "",
    "GEMINI_API_KEY": "",
    "OLLAMA_HOST": "http://127.0.0.1:11434",
    "MLE_LLM_ENGINE": "ollama",
    "MLE_MODEL": MODEL_NAME,
}

prompt = f"""
Return ONE fenced python code block only. Write train.py that reads {DATA}; 80/20 split (random_state=42, stratify);
Pipeline: SimpleImputer + StandardScaler + LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42);
Print ROC-AUC & F1; print sorted coefficient magnitudes; save model to {MODEL} and preds to {PREDS};
Use only sklearn, pandas, numpy, joblib; no extra text.
"""

def extract(txt: str) -> str | None:
    """Pull a Python source block out of an LLM reply."""
    txt = re.sub(r"\x1B\[[0-?]*[ -/]*[@-~]", "", txt)                # strip ANSI escape codes
    m = re.search(r"`{3}(?:python)?\s*([\s\S]*?)`{3}", txt, re.I)    # fenced code block
    if m:
        return m.group(1).strip()
    if txt.strip().lower().startswith("python"):
        return txt.strip()[6:].strip()
    # last resort: take everything from the first import/from line onward
    m = re.search(r"(?:^|\n)(from\s+[^\n]+|import\s+[^\n]+)([\s\S]*)", txt)
    return (m.group(1) + m.group(2)).strip() if m else None

out = sh(f'printf %s "{prompt}" | mle chat', check=False, cwd=str(PROJ), env=env)
code = extract(out)
if not code:  # fall back to querying Ollama directly
    out = sh(f'printf %s "{prompt}" | ollama run {MODEL_NAME}', check=False, env=env)
    code = extract(out) or ""
RAW.write_text(code, encoding="utf-8")
```
This segment drives the code-generation step, first through MLE-Agent and then, if no usable code block comes back, through Ollama directly. Whatever the model returns is saved to disk for sanitization.
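Since everything downstream depends on `extract()`, a tiny self-test on a hand-written reply (the sample string below is hypothetical) helps catch regressions in the regexes:

```python
fence = "`" * 3  # build the markdown fence without embedding literal backtick runs
sample = f"Here you go:\n{fence}python\nimport pandas as pd\nprint('ok')\n{fence}\nDone."
assert extract(sample).startswith("import pandas")  # the fenced body is recovered
```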
Sanitizing the Generated Script
Next up, we sanitize the generated script to ensure it adheres to coding standards and avoids common pitfalls:
```python
def sanitize(src: str) -> str:
    """Auto-fix common import mistakes in generated sklearn code."""
    if not src:
        return ""
    s = src
    fixes = {
        r"from\s+sklearn\.pipeline\s+import\s+SimpleImputer": "from sklearn.impute import SimpleImputer",
        r"from\s+sklearn\.preprocessing\s+import\s+SimpleImputer": "from sklearn.impute import SimpleImputer",
        # Add other common import fixes here
    }
    for pat, rep in fixes.items():
        s = re.sub(pat, rep, s)
    if "SimpleImputer" in s and "from sklearn.impute import SimpleImputer" not in s:
        s = "from sklearn.impute import SimpleImputer\n" + s
    return s

san = sanitize(code)

safe = textwrap.dedent(f"""
import pandas as pd
import joblib
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

DATA = Path("{DATA}")
MODEL = Path("{MODEL}")
PREDS = Path("{PREDS}")

df = pd.read_csv(DATA)
X = df.drop(columns=["target"])
y = df["target"].astype(int)

# Deterministic pipeline matching the prompt spec
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)),
])
pipe.fit(X_tr, y_tr)
proba = pipe.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)
print("ROC-AUC:", roc_auc_score(y_te, proba))
print("F1:", f1_score(y_te, pred))
coefs = pd.Series(pipe.named_steps["clf"].coef_[0], index=X.columns)
print(coefs.abs().sort_values(ascending=False))
joblib.dump(pipe, MODEL)
pd.DataFrame({{"y_true": y_te.values, "y_pred": pred, "proba": proba}}).to_csv(PREDS, index=False)
""").strip()
```
The `sanitize()` function scans for and auto-fixes common import mistakes in the generated script so it can run cleanly. We also prepare a complete, deterministic fallback script that implements exactly the pipeline the prompt asked for, in case the generated code is unusable.
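To see the sanitizer in action, we can feed it a snippet with the classic wrong import (the snippet itself is made up for illustration):

```python
bad = "from sklearn.pipeline import SimpleImputer\nprint('hello')"
print(sanitize(bad).splitlines()[0])  # -> from sklearn.impute import SimpleImputer
```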
Finalizing and Running the Training Script
Finally, we decide whether to run the sanitized or safe script based on the quality of the generated code:
```python
chosen = san if ("import " in san and "sklearn" in san and "read_csv" in san) else safe
SAFE.write_text(safe, encoding="utf-8")
FINAL.write_text(chosen, encoding="utf-8")
print("Using train.py (first 800 chars):\n", chosen[:800])
sh(f"python {FINAL}")
print("Artifacts:", [str(p) for p in WORK.glob("*")])
```
By executing this code, we train and evaluate the model, print ROC-AUC, F1, and the sorted coefficient magnitudes, and save the fitted model and predictions to disk.
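As a final check, we can reload the persisted pipeline and score a few fresh rows (a quick smoke test, assuming training completed and wrote the model file):

```python
import joblib
pipe = joblib.load(MODEL)
new_rows = pd.DataFrame(np.random.rand(3, 4), columns=["f1", "f2", "f3", "f4"])
print(pipe.predict(new_rows))        # predicted class labels for the new samples
print(pipe.predict_proba(new_rows))  # class probabilities
```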
Through this structured approach, we see how pairing an agent framework like MLE-Agent with a local LLM served by Ollama, backed by a deterministic fallback, gives us reliability and full control over the entire machine learning process while eliminating external API calls.