Managing AI Networking at Scale
AI infrastructure introduces specific requirements that differentiate it from typical enterprise networks. What was once a single network in the data center is now often two or three distinct networks – a front-end network for user access, a back-end or GPU network specifically for connecting GPUs, and often a separate dedicated storage network. And AI networks typically span hundreds or even thousands of devices. Managing network devices at scale, ensuring consistency, and handling rapid growth or changes manually in such complex environments is incredibly difficult and error-prone.
Intent-based Management
Intent-based management offers a solution to these challenges by shifting the focus from configuring individual devices to defining the desired state or intent for the entire network fabric. Instead of issuing commands switch by switch, administrators define what they want the network to do, and the intent-based system translates that into the necessary configurations and deploys them across the infrastructure. This approach automates complex deployments and ensures consistency across potentially disparate hardware. This is particularly valuable for the well-defined, repetitive topologies found in AI networks.
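To make the idea concrete, the following minimal sketch shows how one fabric-wide intent can be expanded into per-device settings. It is purely illustrative: the `FabricIntent` fields, the ASN numbering scheme, and the function names are assumptions made for this example, not the schema or API of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricIntent:
    """Hypothetical fabric-wide intent (illustrative fields only)."""
    asn_base: int            # first BGP ASN; each leaf gets asn_base + its index
    mtu: int                 # MTU applied uniformly across the fabric
    leaves: tuple[str, ...]  # leaf switch hostnames, in a stable order


def render_device_config(intent: FabricIntent, hostname: str) -> dict:
    """Derive one device's settings from the fabric-wide intent."""
    index = intent.leaves.index(hostname)
    return {
        "hostname": hostname,
        "bgp_asn": intent.asn_base + index,
        "mtu": intent.mtu,
    }


def render_fabric(intent: FabricIntent) -> dict[str, dict]:
    """Expand the intent into a config for every device it names."""
    return {name: render_device_config(intent, name) for name in intent.leaves}


intent = FabricIntent(asn_base=65000, mtu=9100, leaves=("leaf1", "leaf2", "leaf3"))
configs = render_fabric(intent)
```

The operator edits only the intent; every per-device value is derived from it, which is what keeps a repetitive topology consistent as it grows.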
Aviz ONES: Open Networking Enterprise Suite
Aviz, a pure software networking company that emerged to address the evolving landscape of open networking, has developed ONES (Open Networking Enterprise Suite). ONES is an automation suite designed to facilitate the design, deployment, and monitoring of open networking platforms like SONiC and Cumulus Linux.
At its core, ONES uses an intent-based YAML file to define the desired state of the network fabric. This file specifies the inventory of devices, how they are physically connected, and the configuration parameters for the entire fabric. For specific architectures like the NVIDIA Spectrum-X Reference Architecture, ONES can even construct the YAML intent internally based on high-level inputs like the number of GPUs and desired IP subnets.
A critical part of the ONES process is validation. As the configuration is pushed to each switch, ONES validates its syntax and structure, and stops orchestration for that node if validation fails. After configuration, it also runs operational checks that verify control plane status and data plane connectivity. For example, ONES can detect and report operational issues, such as cabling faults, that prevent BGP from establishing a session.
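The per-node behavior described above can be sketched as follows. This covers only a structural pre-push check, not the operational checks, and everything in it is an assumption for illustration: the required keys, the MTU bounds (which vary by platform), and the `push` callback are hypothetical and do not represent ONES internals.

```python
from typing import Callable

# Hypothetical required fields for a node config (assumption for this sketch).
REQUIRED_KEYS = {"hostname", "bgp_asn", "mtu"}


def validate_node(config: dict) -> list[str]:
    """Return a list of structural errors; an empty list means valid."""
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - config.keys())]
    asn = config.get("bgp_asn")
    # 4-byte ASN range; 0 is reserved.
    if asn is not None and not (isinstance(asn, int) and 1 <= asn <= 4294967295):
        errors.append(f"bgp_asn invalid: {asn!r}")
    mtu = config.get("mtu")
    # Illustrative bounds; the real maximum depends on the platform.
    if mtu is not None and not (isinstance(mtu, int) and 68 <= mtu <= 9216):
        errors.append(f"mtu invalid: {mtu!r}")
    return errors


def orchestrate(configs: dict[str, dict], push: Callable[[str, dict], None]) -> dict[str, list[str]]:
    """Validate every node; push only the valid ones and report errors per node."""
    report = {}
    for name, config in configs.items():
        errors = validate_node(config)
        report[name] = errors
        if not errors:  # a failed node is skipped, the rest of the fabric proceeds
            push(name, config)
    return report


pushed = []
report = orchestrate(
    {
        "leaf1": {"hostname": "leaf1", "bgp_asn": 65000, "mtu": 9100},
        "leaf2": {"hostname": "leaf2", "mtu": 90000},
    },
    push=lambda name, config: pushed.append(name),
)
```

Here `leaf2` is rejected for a missing ASN and an out-of-range MTU, while `leaf1` is still pushed.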
ONES provides robust capabilities for day-two operations. A key feature is configuration comparison, allowing operators to compare the deployed intent against the running configuration on a switch to identify any drift or unauthorized changes. Comparisons can also be made between the running config and a baseline backup, or between the running configurations of different switches. This provides clear visibility into configuration drift.
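The comparison itself is conceptually a text diff. As a rough sketch, assuming both configurations are available as plain text, Python's standard `difflib` module can show the kind of output such a comparison produces. The sample configuration lines are invented for the example.

```python
import difflib


def config_drift(intended: str, running: str) -> list[str]:
    """Return unified-diff lines between intended and running config text."""
    return list(
        difflib.unified_diff(
            intended.splitlines(),
            running.splitlines(),
            fromfile="intent",
            tofile="running",
            lineterm="",
        )
    )


intended = "hostname leaf1\nmtu 9100"
running = "hostname leaf1\nmtu 1500"   # someone changed the MTU by hand

for line in config_drift(intended, running):
    print(line)
```

An empty result means no drift. The same function works for the other comparisons mentioned above (running config against a baseline backup, or one switch against another) by changing the two inputs.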
ONES supports a wide range of telemetry and end-to-end monitoring covering switches, switch ASICs, servers, NICs, GPUs, links, and a variety of network protocols. An anomaly detection system is included in which users create rules based on various metrics, defining warning and critical thresholds. When thresholds are exceeded, ONES can trigger notifications, and controls are available to manage notification frequency and prevent alert fatigue.
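A threshold rule with notification rate limiting can be sketched in a few lines. This is a generic illustration of the pattern, not the ONES rule engine: the class name, the metric name, and the cooldown mechanism are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule: warning/critical thresholds plus a notification cooldown."""
    metric: str
    warning: float
    critical: float
    cooldown_s: float = 300.0  # minimum seconds between repeats of the same alert
    _last_sent: dict = field(default_factory=dict)

    def severity(self, value: float) -> Optional[str]:
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return None

    def evaluate(self, device: str, value: float, now: float) -> Optional[str]:
        """Return a notification string, or None if nothing should be sent."""
        level = self.severity(value)
        if level is None:
            return None
        key = (device, level)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return None  # suppressed to limit alert fatigue
        self._last_sent[key] = now
        return f"{level.upper()}: {device} {self.metric}={value}"


rule = ThresholdRule(metric="link_util_pct", warning=80, critical=95)
first = rule.evaluate("leaf1", 85, now=0)       # warning is sent
repeat = rule.evaluate("leaf1", 86, now=100)    # same level within cooldown: suppressed
escalated = rule.evaluate("leaf1", 97, now=120) # critical is tracked separately, so it is sent
```

Tracking the cooldown per device and severity means an escalation from warning to critical is never suppressed by an earlier warning.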
Room for Improvement
While powerful, the current implementation of Aviz ONES presents some points to consider. The system relies heavily on YAML, and thus is susceptible to YAML’s well-known issues, including complexity, lack of built-in schema enforcement, and, most importantly, sensitivity to spaces, tabs, indentation, and formatting.
A significant security concern is the practice of storing credentials within the YAML file. Although Aviz supports alternative methods like LDAP on a per-customer basis, the standard practice of having sensitive information like API keys, usernames and passwords, and other secrets stored directly in the primary configuration file is a security risk that deviates from best practices.
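One common way to keep secrets out of a configuration file is to store only a reference in the file and resolve it at run time. The sketch below illustrates that general pattern with environment variables. The `env:NAME` reference syntax and the function name are invented for this example and are not a feature of ONES.

```python
import os
from typing import Mapping, Optional


def resolve_secret(ref: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a reference like 'env:SWITCH_PASSWORD' to the secret's value.

    The config file then holds only the reference, never the secret itself.
    """
    env = os.environ if env is None else env
    scheme, _, name = ref.partition(":")
    if scheme != "env" or not name:
        raise ValueError(f"unsupported secret reference: {ref!r}")
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]


# The intent file would carry "env:SWITCH_PASSWORD" instead of the password.
password = resolve_secret("env:SWITCH_PASSWORD", env={"SWITCH_PASSWORD": "example-only"})
```

The same indirection extends naturally to a dedicated secrets manager by adding further reference schemes.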
Furthermore, while ONES excels at detecting configuration drift by comparing the intended configuration with the actual running configuration, this capability is primarily designed to provide visibility into the differences. In most intent-based management systems, the automation and orchestration process automatically and continuously reconciles the running configuration with the defined intent.
Instead of automatically reverting unauthorized changes or re-applying the intended configuration, ONES focuses on highlighting the discrepancy, which an operator would then need to address manually or by initiating a restore or re-application process using the provided tools. While this detection is valuable, the lack of automated enforcement of the intended state diminishes the value of automated intent-based management.
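For contrast, automated enforcement is usually built as a reconciliation step that runs on a schedule or on change events. The sketch below shows the general shape of such a step. It is a hypothetical illustration, not ONES behavior, and the `read_running` and `apply_changes` callbacks stand in for whatever device access a real system would use.

```python
from typing import Callable


def reconcile(intended: dict, running: dict) -> dict:
    """Return the settings that must be re-applied to bring running back to intent."""
    return {key: want for key, want in intended.items() if running.get(key) != want}


def enforce(
    intended: dict,
    read_running: Callable[[], dict],
    apply_changes: Callable[[dict], None],
) -> dict:
    """One enforcement pass: detect drift and, if any, re-apply the intended values."""
    changes = reconcile(intended, read_running())
    if changes:
        apply_changes(changes)
    return changes


# In-memory stand-in for a switch whose MTU was changed by hand.
device = {"hostname": "leaf1", "mtu": 1500, "description": "rack 12"}
intended = {"hostname": "leaf1", "mtu": 9100}

applied = enforce(intended, read_running=lambda: dict(device), apply_changes=device.update)
```

After the pass, the MTU is back at its intended value, and settings the intent does not mention (here, `description`) are left alone. A detection-only system stops after the `reconcile` step and leaves the rest to the operator.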
Why This Matters
Managing complex open networks at scale, especially the specialized and demanding AI fabrics, is fraught with challenges related to complexity, consistency, manual effort, and detecting unauthorized changes. Aviz ONES tackles these problems by introducing intent-based management and comprehensive monitoring. By defining the network’s desired state in a centralized YAML file, ONES automates initial deployment, validates configurations and operational status, and provides visibility into the health and performance of the fabric. Its day-two operation features, particularly the configuration comparison and anomaly detection systems, are crucial for maintaining the intended state over time, alerting operators to deviations or performance issues.
While points like YAML credential storage and the nuance of config enforcement vs. detection warrant consideration, the core value of ONES lies in bringing a structured, automated approach to managing open networking environments that traditionally require deep technical expertise and manual intervention. For organizations looking to build and operate complex AI networking environments using flexible open networking platforms like SONiC and Cumulus, investigating whether Aviz ONES’ capabilities align with their operational needs and security requirements is a worthwhile step.
