Is the Software Defined Network (SDN) Another Detour to a Datacenter Dead-end?

Introduction

Frustrated by the inability to experiment with Internet routing in the real world, Stanford computer scientist Nick McKeown and colleagues developed a standard called OpenFlow that essentially opens up the Internet to researchers, allowing them to define data flows using software, a sort of “software-defined networking.” Installing a small piece of OpenFlow firmware (software embedded in hardware) gives engineers access to flow tables, the rules that tell switches and routers how to direct network traffic, yet it protects the proprietary routing instructions that differentiate one company’s hardware from another’s. SDN is nothing more than the separation of network data traffic processing from the logic and rules controlling the flow, inspection, and modification of that data. Traditional network hardware, i.e., switches and routers, implements both functions in proprietary firmware, partitioned into what are known as the data plane and the control plane, respectively. This is a fine research project. However, now that the major vendors are taking it seriously and attempting to introduce it into real-world datacenters, one must ask whether it will add to or reduce complexity in the already complex datacenter. There, a host of piecemeal solutions is offered by mega-corporations that seek to continually increase their revenues and have no incentive to reduce complexity, since reducing the number of hardware and software components deployed would cut into their product sales.
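To make the separation of data plane and control plane concrete, here is a minimal sketch in Python. It is only an illustration of the idea described above, not the OpenFlow protocol or any vendor's implementation: the class names, the header field names (src_ip, dst_ip) and the action strings are all hypothetical. The switch does nothing but look up packets in its flow table, while the controller alone decides which rules the table contains.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FlowRule:
    """One flow-table entry (hypothetical format, not the OpenFlow wire format)."""
    match: Dict[str, str]  # header field -> required value; missing fields act as wildcards
    action: str            # illustrative action string, e.g. "output:2" or "drop"
    priority: int = 0


class Switch:
    """Data plane: forwards packets purely by flow-table lookup."""

    def __init__(self) -> None:
        self.flow_table: List[FlowRule] = []

    def install(self, rule: FlowRule) -> None:
        # Keep the table ordered so the highest-priority rule is checked first.
        self.flow_table.append(rule)
        self.flow_table.sort(key=lambda r: r.priority, reverse=True)

    def forward(self, packet: Dict[str, str]) -> str:
        for rule in self.flow_table:
            if all(packet.get(k) == v for k, v in rule.match.items()):
                return rule.action
        # No rule matched: defer the decision to the control plane.
        return "send_to_controller"


class Controller:
    """Control plane: holds the policy and pushes rules into switches."""

    def program(self, switch: Switch) -> None:
        switch.install(FlowRule({"dst_ip": "10.0.0.2"}, "output:2", priority=10))
        switch.install(FlowRule({"src_ip": "10.0.0.66"}, "drop", priority=20))
```

After Controller().program(switch), a packet addressed to 10.0.0.2 is sent out on port 2 unless it comes from 10.0.0.66, in which case the higher-priority drop rule wins. The forwarding policy can be changed entirely from the controller without touching the switch's lookup logic, which is the point of the separation.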

Systems theory tells us that as the number of components in a system increases, the cost of complexity can outweigh the benefits unless architectural reorganization provides a way out. We argue that the management complexity in current IT infrastructure, which is based on the serial von Neumann stored program control implementation of the universal Turing machine, is not a software design issue but a more fundamental architectural issue related to the lack of resiliency of the computing model. Cockshott et al. (2012) conclude their book “Computation and its Limits” with the paragraph: “The key property of general-purpose computers is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Current generation distributed systems are implemented using a network of Turing machines in which the service and its management are intermixed, as shown in Figure 1. The resources utilized by the nodes in a network are often controlled by a plethora of management systems that are outside the purview of the service workflow utilizing those resources. Thus the end-to-end service transaction response is controlled by these management systems, which introduce a layer of complexity in coordination and contention resolution and make managing the service far more complex than the service itself.

Figure 1: Serial von Neumann implementation of Turing Machines

The limitations of the stored program control (SPC) computing architecture were clearly on von Neumann’s mind when he gave his lecture at the Hixon Symposium in 1948 in Pasadena, California (von Neumann, 1987, p. 408): “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing computing machines with living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say, “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond” (von Neumann, 1987, p. 408). It is clear that von Neumann recognized a problem in the way we design computing systems.

“Normally, a literary description of what an automaton is supposed to do is simpler than the complete diagram of the automaton. It is not true a priori that this always will be so. There is a good deal in formal logic which indicates that when an automaton is not very complicated the description of the function of the automaton is simpler than the description of the automaton itself, as long as the automaton is not very complicated, but when you get to high complications, the actual object is much simpler than the literary description.” (von Neumann, 1987, pp. 454-457). He remarked, “It is a theorem of Gödel that the description of an object is one class type higher than the object and is therefore asymptotically infinitely longer to describe.” (von Neumann, 1987, pp. 454-457). Von Neumann’s conjecture implies that “one cannot construct an automaton which will predict the behavior of any arbitrary automaton” (von Neumann, 1987, p. 456). This is the case with the Turing machine as implemented by the SPC model.

In simpler terms, the management complexity is related to the classical Russell’s paradox, which can be paraphrased as “Who manages the managers?” Gödel’s prohibition of self-reflection in a Turing machine mandates a hierarchy of Turing machines acting as managers that manage other Turing machines, each implementing computations described as a sequence of instructions compiled into a sequence of 1s and 0s. The universal Turing machine (or general-purpose computer) implements these Turing machines in a synchronous workflow, which prohibits changes to a computation at run-time while it is in progress in a given machine: the behavior of the compiled code cannot be changed until its execution is interrupted.

Current generation server, networking, and storage equipment and their management systems have evolved from server-centric and bandwidth-limited network architectures to today’s Cloud computing architecture with virtual servers and broadband networks. During the last six decades, many layers of computing abstractions have been introduced to map the execution of complex computational workflows to a sequence of 1s and 0s that eventually gets stored in memory and operated upon by the CPU to achieve the desired result. These include process definition languages, programming languages, file systems, databases, and operating systems. While this has helped in automating many business processes, the exponential growth in services in the consumer market has also put severe strain on current IT infrastructure. To respond rapidly to the changing workloads, business priorities, and latency constraints that drive demand for distributed computing resources, new layers of resource management have been added with the introduction of hypervisors, virtual machines (VMs), and their management. While these layers have made application and service management more agile, they have introduced a new layer of issues related to their own management. For example, virtual machine-level clustering, intrusion detection, and performance management are being introduced in addition to the clustering, intrusion detection, and performance management systems that already exist at the infrastructure, operating system, and distributed resource management layers.

However, this approach is completely unsuited to exploiting the new generation of many-core servers and high-bandwidth networks now available. With the advent of many-core servers that have tens or even hundreds of computing cores and high-bandwidth communication among them, the current generation of server, networking, and storage equipment and their management systems, which evolved from server-centric and bandwidth-limited architectures, cannot be used efficiently in the next generation computing infrastructure. It is hard to imagine replicating current TCP/IP-based socket communication, “isolate and fix” diagnostic procedures, and multiple operating systems (which do not have end-to-end visibility or control of business transactions that span multiple cores, multiple chips, multiple servers, and multiple geographies) inside the next generation many-core servers without addressing their shortcomings. The many-core servers and processors constitute a network in which each node is itself a sub-network with different bandwidths and protocols: socket-based low-bandwidth communication between servers; InfiniBand or PCI Express bus-based communication across processors in the same server; and shared-memory-based low-latency communication across the cores inside a processor.

Figure 2 shows the many-core server network supporting multiple bandwidths.
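The hierarchy of channels described above can be sketched as a simple selection rule. The following Python fragment is an illustrative assumption only: the CoreAddress type, the select_channel function and the channel labels are hypothetical names introduced here, and a real system would also weigh load, topology and available hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreAddress:
    """Hypothetical address of a core in the network of networks."""
    server: str
    processor: int
    core: int


def select_channel(a: CoreAddress, b: CoreAddress) -> str:
    """Pick the communication tier implied by where two cores sit."""
    if a.server != b.server:
        # Different servers: socket-based, low-bandwidth communication.
        return "socket"
    if a.processor != b.processor:
        # Same server, different processors: InfiniBand or PCI Express.
        return "infiniband_or_pci_express"
    # Same processor: shared-memory, low-latency communication.
    return "shared_memory"
```

Even this toy rule shows why a single socket-centric communication and management model fits poorly: the appropriate mechanism changes with each level of the hierarchy.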

In order to cope with the scaling issues and utilize the hierarchical many-core network of networks effectively, next generation service architecture has to emulate the architectural resiliency of cellular organisms that tolerate faults and implement command and control structures which enable execution of self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing (in short self-*) business processes. This requires new computing models that break the Turing machine barrier to computation by allowing the computer and the computed to be treated in the same model.

Papers Solicited to Address Next Generation Datacenter Infrastructure and Technologies:

The conference on “Convergence of Distributed Clouds, Grids and their Management,” held under the aegis of WETICE 2013, is devoted to next generation computing models that support real-time resource reconfiguration of distributed business workflow execution based on latency constraints, changing workloads, and business priorities. It also addresses the assurance of reliability, availability, performance, account management, and security of distributed business process execution with appropriate visibility and control.

The objective of the conference was first stated at WETICE 2009: “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short-term profit driven motives of a particular corporate entity.” We are glad to report that the discussions started in 2009 have directly resulted in an alternative approach to self-managing distributed computing systems that is totally different from the current industry trend and shows a way to eliminate the complexity of virtual machines and hypervisors. If this approach proves to be theoretically sound (as a paper at WETICE 2012 investigated) and its usefulness, whose feasibility was demonstrated in two proofs of concept at the last conference, extends to mission-critical environments, the DIME network architecture may yet prove to be an important contribution to computer science.

Following this tradition, the aim of WETICE 2013 is to transform current complex, redundant, costly, and knowledge-intensive IT management into self-configuring, self-monitoring, self-healing, and self-optimizing distributed workflow implementations with service management limited only by the speed of light. We also identify the emerging area of software defined networks (SDN) as a candidate for further investigation, free of the bias that often accompanies commercial profit motives, to see whether SDNs will reduce the overall complexity of the datacenter or are yet another layer of complexity.

Papers are solicited to advance next generation distributed computing and its management infrastructure in ways that leverage new hardware innovations. The goals of the conference include (but are not limited to):
  1. Discovering new application scenarios and proposing new operating systems, programming abstractions, and tools,
  2. Identifying the challenging problems that still need to be solved, such as parallel programming and the scaling and management of distributed computing elements, and
  3. Reporting results and experiences gained by researchers in building dynamic Grid-based middleware, computing clouds (distributed or otherwise), and workflow management systems.

Submission of papers: March 10, 2013
Notification to authors: April 1, 2013
Final papers to IEEE-CS: April 25, 2013
Author registration deadline: May 10, 2013
WETICE-2013 Conference: June 17-20, 2013

References:

P. Cockshott, L. M. MacKenzie, and G. Michaelson, “Computation and its Limits,” Oxford University Press, Oxford, 2012.

J. von Neumann, “Probabilistic logics and the synthesis of reliable organisms from unreliable components,” in “Automata Studies,” edited by C. E. Shannon and J. McCarthy, Princeton University Press, 1956, pp. 43-98.

W. Aspray and A. Burks, “Papers of John von Neumann on Computing and Computer Theory,” MIT Press, Cambridge, MA, 1987.
