Amazon SageMaker now offers fully managed support for MLflow 3.0. This release streamlines generative AI experimentation from conceptualization to production, elevating managed MLflow from merely tracking experiments to providing comprehensive end-to-end observability, which is crucial for reducing the time-to-market of generative AI projects.
As businesses across various sectors ramp up their generative AI efforts, there’s a growing need for advanced capabilities to track experiments, monitor behavior, and assess the performance of their models and applications. Data scientists and developers often find themselves entangled in the complexity of analyzing their models from experimentation to production. This challenge can make it difficult to trace the root causes of issues, leading to teams allocating more time to tool integration rather than enhancing their models or generative AI applications.
The introduction of fully managed MLflow 3.0 on Amazon SageMaker AI changes the narrative. It simplifies tracking experiments and observing the behavior of models and applications through a single, robust tool. The tracing features of MLflow 3.0 allow customers to record the inputs, outputs, and metadata at every stage of their generative AI applications. This means that developers can swiftly pinpoint the origins of bugs or unexpected behaviors without sifting through layers of complexity. By maintaining a detailed record of each model and application version, MLflow 3.0 provides essential traceability, connecting AI responses to their corresponding source components. Consequently, developers can easily identify issues tied to specific code, data, or parameters. For customers utilizing Amazon SageMaker HyperPod to train and deploy foundation models (FMs), managed MLflow now enables comprehensive experiment tracking, deeper insights into model behavior, and effective management of the ML lifecycle at scale, helping teams innovate faster.
This article dives into the foundational concepts of fully managed MLflow 3.0 on SageMaker, providing practical guidance on harnessing its new capabilities to streamline your upcoming generative AI application development.
Getting Started
To get started with fully managed MLflow 3.0 on Amazon SageMaker, you can track experiments, manage models, and refine your generative AI/ML lifecycle through the AWS Management Console, AWS Command Line Interface (AWS CLI), or the API.
Prerequisites
Before diving in, ensure you have the following:
- An AWS account with access to Amazon SageMaker AI
- A SageMaker Studio domain and user profile
- An AWS Identity and Access Management (IAM) role with permissions to create and manage MLflow tracking servers
Configuring Your Environment for SageMaker Managed MLflow Tracking Server
To set up your environment, follow these steps:
- In the SageMaker Studio UI, navigate to the Applications pane, select MLflow, and then choose Create.
- Give your tracking server a unique name and specify an Amazon Simple Storage Service (Amazon S3) URI where your experiment artifacts will reside, then choose Create. By default, SageMaker selects version 3.0 for the MLflow tracking server.
- Optionally, select Update to modify settings such as server size, tags, or the AWS Identity and Access Management (IAM) role.
The server will begin provisioning automatically, generally taking around 25 minutes. Once it’s ready, you can access the MLflow UI through SageMaker Studio to commence tracking your ML and generative AI experiments. For comprehensive details on configuring the tracking server, refer to the Machine learning experiments using Amazon SageMaker AI with MLflow guide.
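You can also create the tracking server programmatically. The following is a minimal boto3 sketch; the server name, artifact S3 URI, and IAM role ARN are placeholders you would replace with your own values:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Placeholder values: replace with your own server name, artifact bucket, and IAM role
sagemaker.create_mlflow_tracking_server(
    TrackingServerName="my-tracking-server",
    ArtifactStoreUri="s3://my-mlflow-artifacts-bucket/prefix",
    TrackingServerSize="Small",
    MlflowVersion="3.0",
    RoleArn="arn:aws:iam::111122223333:role/MyMlflowTrackingServerRole",
)
```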
Next, to start tracking experiments using your newly created SageMaker-managed MLflow tracking server, install both MLflow and the AWS SageMaker MLflow Python packages in your environment. Whether you’re using SageMaker Studio managed Jupyter Lab, SageMaker Studio Code Editor, a local IDE, or another supported environment, you’ll be able to track your work with SageMaker’s managed MLflow tracking server.
To install the required Python packages, run:
pip install mlflow==3.0 sagemaker-mlflow==0.1.0
To connect and begin logging your AI experiments, parameters, and models to managed MLflow on SageMaker, set the tracking URI to the Amazon Resource Name (ARN) of your SageMaker MLflow tracking server:
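A minimal sketch follows, with a placeholder ARN; the sagemaker-mlflow plugin lets mlflow.set_tracking_uri() accept the tracking server ARN directly:

```python
import mlflow

# Placeholder: substitute the ARN of your SageMaker managed MLflow tracking server
tracking_server_arn = (
    "arn:aws:sagemaker:us-east-1:111122223333:mlflow-tracking-server/my-tracking-server"
)
mlflow.set_tracking_uri(tracking_server_arn)
```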
Your environment is now set up and ready to track experiments with the SageMaker Managed MLflow tracking server.
Implementing Generative AI Application Tracing and Version Tracking
Generative AI applications consist of various elements, including code, configurations, and data, which can pose challenges without systematic versioning. A LoggedModel entity in managed MLflow 3.0 represents your AI model, agent, or generative AI application within an experiment. This feature provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle. Traces form a record of inputs, outputs, and intermediate steps from a single application execution, delivering insights into application performance and execution flow.
To implement version tracking and tracing with managed MLflow 3.0 on SageMaker, you can establish a versioned model identity using a Git commit hash, linking it as the active model context. This ensures all further traces connect to this specific version. You can enable automatic logging for Amazon Bedrock interactions and then make an API call to Anthropic’s Claude 3.5 Sonnet, which will comprehensively log inputs, outputs, and metadata within the defined model context. Managed MLflow 3.0 tracing already integrates with multiple generative AI libraries, offering a one-line automatic tracing experience for these libraries. For a list of supported libraries, see Supported Integrations in the MLflow documentation.
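The following sketch illustrates the pattern; the agent name is hypothetical, and the model ID shown is one example of an Anthropic Claude 3.5 Sonnet identifier:

```python
import subprocess

import boto3
import mlflow

# Derive a version identity from the current Git commit (hypothetical agent name)
git_hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()
mlflow.set_active_model(name=f"support-agent-{git_hash}")

# One-line automatic tracing for Amazon Bedrock interactions
mlflow.bedrock.autolog()

# Invoke Anthropic's Claude 3.5 Sonnet; inputs, outputs, and metadata are logged
# as traces linked to the active model version
bedrock = boto3.client("bedrock-runtime")
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": "What is my account balance?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```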
After logging this information, you can review these generative AI experiments and the logged model for the agent in the managed MLflow 3.0 tracking server UI.
Beyond one-line auto tracing, MLflow provides a Python SDK for manually instrumenting your code and customizing traces. The sample notebook sagemaker_mlflow_strands.ipynb in the aws-samples GitHub repository uses MLflow manual instrumentation to trace Strands Agents. The tracing capabilities in fully managed MLflow 3.0 allow you to document the inputs, outputs, and metadata associated with each step of a request, giving you the tools to identify bugs and unexpected behaviors promptly.
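As a minimal sketch of manual instrumentation (the functions and span name here are hypothetical), you can decorate functions with @mlflow.trace or open spans explicitly with mlflow.start_span():

```python
import mlflow

@mlflow.trace  # Captures this function's inputs, outputs, and latency as a span
def retrieve_documents(query: str) -> list[str]:
    # Hypothetical retrieval step
    return [f"document matching '{query}'"]

@mlflow.trace
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    # Explicitly record an intermediate step as its own span
    with mlflow.start_span(name="compose-answer") as span:
        span.set_inputs({"question": question, "docs": docs})
        answer = f"Based on {len(docs)} document(s): ..."
        span.set_outputs({"answer": answer})
    return answer

answer_question("What were Q3 revenues?")
```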
These capabilities enhance the observability of your AI workloads by providing detailed insights into the execution of your workload services, nodes, and tools, visible under the Traces tab.
Inspect an individual trace by selecting its request ID on the Traces tab.
A new feature in fully managed MLflow 3.0 is trace tagging. Tags serve as mutable key-value pairs that add context and valuable metadata to traces. This feature simplifies the organization, searchability, and filtering of traces based on criteria such as user sessions, environments, model versions, or performance characteristics. Tags can be updated or removed at any stage—during trace execution using mlflow.update_current_trace()
or after a trace is logged using the MLflow APIs or UI. Managed MLflow 3.0 improves the accessibility of tracing analysis, enabling teams to swiftly pinpoint issues and optimize performance.
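A brief sketch, with illustrative tag keys and values, tags the in-flight trace from inside instrumented code and then tags an already-logged trace through the client API:

```python
import mlflow
from mlflow import MlflowClient

@mlflow.trace
def handle_request(user_id: str, prompt: str) -> str:
    # Tag the in-flight trace with context for later filtering
    mlflow.update_current_trace(tags={"environment": "Production", "user": user_id})
    return f"response to {prompt}"

handle_request("user-123", "Hello")

# Tag a completed trace after the fact
client = MlflowClient()
trace_id = mlflow.get_last_active_trace_id()
client.set_trace_tag(trace_id, "reviewed", "true")
```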
The tracing UI and Python API empower users with powerful filtering options to drill into traces based on attributes such as status, tags, user, environment, or execution time. For example, you can quickly identify all traces with errors, filter them by production environment, or search for specific request traces. This functionality is critical for debugging, cost analysis, and the continual improvement of generative AI applications.
The screenshot below shows the traces filtered using the tag ‘Production’.
Here’s a snippet that demonstrates how to search for all traces in production with a successful status:
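A minimal sketch, assuming traces carry an environment tag as in the earlier tagging example:

```python
import mlflow

# Returns matching traces as a pandas DataFrame
traces = mlflow.search_traces(
    filter_string="tags.environment = 'Production' AND status = 'OK'",
)
print(f"Found {len(traces)} matching traces")
```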
Generative AI Use Case Walkthrough with MLflow Tracing
Creating and deploying generative AI agents, whether chat-based assistants, code generators, or customer support systems, requires an in-depth understanding of how these agents interact with large language models (LLMs) and external tools. In a typical workflow, the agent iterates through reasoning steps, invoking LLMs and using tools or subsystems such as search APIs or Model Context Protocol (MCP) servers until it fulfills the user's request. These intricate, multi-step interactions make debugging, optimization, and cost analysis particularly challenging.
Traditional observability tools often fall short in generative AI, as agent decisions, tool calls, and LLM responses are dynamic and context-dependent. Managed MLflow 3.0 tracing offers comprehensive observability by logging every LLM call, tool invocation, and decision point within your agent’s workflow. This end-to-end tracing data allows you to:
- Debug agent behavior: Identify points where an agent’s reasoning diverges or produces unexpected outputs.
- Monitor tool usage: Analyze when and how external tools are invoked and their effects on quality and cost.
- Track performance and costs: Measure latency, token usage, and API expenses at each phase of the agentic loop.
- Audit and govern: Maintain detailed logs crucial for compliance and analysis.
For instance, consider a hypothetical scenario with a generative AI customer support agent designed to fetch financial data from a database. In the initial trace displayed in the following screenshot, the agent responds to a user query without invoking any tools. This trace captures the prompt, agent response, and decision points, clearly revealing that the agent did not leverage external resources, enabling a quick identification of gaps in its reasoning chain.
The second trace, shown in the next screenshot, illustrates the same agent deciding to call the product database tool. This trace captures the tool invocation, the returned product data, and how the agent integrates this information into its ultimate response. Here, you can assess improvements in answer quality, slight increases in latency, and additional API costs due to higher token usage.
By juxtaposing these traces, you can delve into why the agent occasionally opts out of tool usage, optimize tool interactions, and balance quality against latency and costs. MLflow’s tracing UI grants you transparent and actionable insights into these agentic loops, making it easier to analyze at scale. The sample agent discussed in this article—and all requisite code—can be found in the aws-samples GitHub repository, which you can use as a foundation for your projects.
Cleanup
Once created, a SageMaker managed MLflow tracking server incurs costs until you stop or delete it. Billing is based on how long the server has been active, the size selected, and the amount of data logged to it. To save costs, stop tracking servers when they are not in use, or delete them through the API or the SageMaker Studio UI, as sketched below. For pricing details, refer to Amazon SageMaker pricing.
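A minimal boto3 sketch for stopping or deleting a tracking server; the server name is a placeholder:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Pause a tracking server to stop compute billing (placeholder name)
sagemaker.stop_mlflow_tracking_server(TrackingServerName="my-tracking-server")

# Or delete it entirely when it is no longer needed
# sagemaker.delete_mlflow_tracking_server(TrackingServerName="my-tracking-server")
```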
Conclusion
Fully managed MLflow 3.0 on Amazon SageMaker AI is now available. Get started with the sample code in the aws-samples GitHub repository. We invite you to explore this capability and see how it can elevate your ML projects. Learn more at Machine Learning Experiments using Amazon SageMaker with MLflow.
For further details, consult the SageMaker Developer Guide and share your feedback with us on AWS re:Post for SageMaker or via your usual AWS Support channels.
About the Authors
Ram Vittal is a Principal ML Solutions Architect at AWS, bringing over 30 years of experience in architecting and constructing distributed, hybrid, and cloud applications. He is passionate about creating secure, scalable, and reliable AI/ML and big data solutions to aid enterprise customers in optimizing their cloud journeys. Outside work, he enjoys motorcycling and walking with his three-year-old sheepadoodle!
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS, guiding customers through their AIOps journey encompassing model training, Retrieval-Augmented-Generation (RAG), GenAI Agents, and scaling generative AI use cases. He also focuses on Go-To-Market strategies, assisting AWS in aligning products to address challenges in the Generative AI sphere. Connect with Sandeep on LinkedIn.
Amit Modi is the product leader for SageMaker AIOps and Governance, focusing on Responsible AI at AWS. With over a decade of experience in B2B environments, he builds scalable products and teams that deliver innovative solutions and value to customers globally.
Rahul Easwar is a Senior Product Manager at AWS, steering managed MLflow and Partner AI Apps within the SageMaker AIOps team. Accumulating over 15 years of experience ranging from startups to large enterprises, he leverages his entrepreneurial background and an MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations across the globe. Connect with Rahul on LinkedIn and learn more about his work in ML platforms and enterprise AI solutions.