From Being to Becoming: Function, Structure and Fluctuations – Incremental versus Leap-frog Innovation in Datacenters
February 16, 2015

“We grow in direct proportion to the amount of chaos we can sustain and dissipate” ― Ilya Prigogine, Order out of Chaos: Man’s New Dialogue with Nature

Abstract

According to Gartner “Alpha organizations aggressively focus on disruptive innovation to achieve competitive advantage. Characterized by unknowns, disruptive innovation requires business and IT leaders to go beyond traditional management techniques and implement new ground rules to enable success.”

While there is a lot of buzz about “game changing” technologies and “disruptive innovation”, real game changers and disruptive innovators are few and far between. Leap-frog innovation resembles a “phase transition” in physics. A system is composed of individual elements with well-defined functions that interact with each other and with the external world through a well-defined structure. The system usually exhibits predictable, near-equilibrium behavior; when fluctuations are small, incremental innovation allows it to adjust and maintain the equilibrium predictably. Only when external forces inflict large or wildly unexpected fluctuations is the equilibrium threatened, and the system then exhibits emergent behavior in which the unstable equilibrium introduces unpredictability into its evolution dynamics. A phase transition occurs: the structure of the system is reconfigured through an architectural transformation, resulting in order out of chaos.

The difference between “Kaizen” (incremental improvement) and “disruptive innovation” is the difference between dealing with a stable equilibrium subject to small fluctuations and dealing with a meta-stable equilibrium subject to large-scale fluctuations. The current datacenter is in a similar transition from “being” to “becoming”, driven by both its hyper-scale structure and the fluctuations that the hardware and software systems delivering business processes experience, caused by rapidly changing business priorities on a global scale, workload variations and latency constraints. Is the current von Neumann stored program control implementation of the Turing machine reaching its limit? Is the datacenter poised for a phase transition from current ad-hoc distributed computing practices to a new theory-driven self-* architecture? In this blog we discuss a non-von Neumann managed Turing oracle machine network with a control architecture as an alternative.

“From Being to Becoming” – What Does It Mean?

The representation of the dynamics of a physical system as a linear, reversible (hence deterministic) temporal order of states requires that, in a deep sense, physical systems never change their identities through time; hence they can never become anything radically new (they can at most merely rearrange their parts, parts whose being is fixed). However, as elements interact with each other and their environment, the system dynamics can change dramatically when large fluctuations in the interactions induce a structural transformation, leading to chaos and the eventual emergence of a new order out of chaos. This is denoted as “becoming”. In short, the dynamics of near-equilibrium states with small-scale fluctuations represent the “being” of a system, while large deviations from equilibrium, the emergence of an unstable equilibrium and the final restoration of order in a new equilibrium state represent its “becoming”. According to Plato, “being” is absolute, independent and transcendent. It never changes and yet causes the essential nature of the things we perceive in the world of “becoming”. The world of becoming is the physical world we perceive through our senses; it is always in movement, always changing. The two aspects, the static structures and the dynamics of their evolution, are two sides of a coin. Dynamics (becoming) represents time, and the static configuration at any particular instant represents “being”. Prigogine applied this concept to understand the chemistry of matter, phase transitions and the like. Individual elements represent function, and the groups constituting a system represent structure with dynamics. Fluctuations caused by interactions within the system and between the system and its environment drive the dynamics that induce transitions from being to becoming. Thus function, structure and fluctuations determine the system and its dynamics, defining complexity, chaos and order.

Why is it Relevant to Datacenters?

Datacenters are dynamic systems where software working with hardware delivers information processing services that allow modeling, interaction, reasoning, analysis and control of the environment external to them. Figure 1 shows the hardware and software, their interactions among themselves and with the external world. There are two distinct systems interacting with each other to deliver the intent of the datacenter, which is to execute specific computational workflows that model, monitor and control the external-world processes using the computing resources:

  1. Service workflows modeling the process dynamics of the system that depicts the external world and its interactions. Usually this consists of the functional requirements of the system under consideration, such as business logic and sensor/actuator monitoring and control (the computed). The model consists of various functions captured in a structure (e.g., a directed acyclic graph, DAG) and its evolution in time. This model does not include the computing resources required to execute the process dynamics; it is assumed that the resources (CPU, memory, time etc.) will be available for the computation.
  2. The non-functional requirements that address the resources required to execute those functions as a function of time and of fluctuations, both in the interactions with the external world and in the computing resources available to accomplish the intent defined in the functional requirements. The computation, as implemented in the von Neumann stored program control model of the Turing machine, requires time (impacted by CPU speed, network latency and bandwidth, storage IOPS, throughput and capacity) and memory. The computing model assumes unbounded resources, including the time for completing the computation. Today these resources are provided by a cluster of servers and other devices containing multi-core CPUs and memory, networked with different types of storage. The computations are executed in a server or device by allocating resources through an operating system, which is itself software that mediates the resources among the various computations.

On the right-hand side of Figure 1, we depict the computing resources required to execute the functions in a given structure, whether distributed or not. In the middle, we represent the application workflows composed of various components constituting an application area network (AAN) that is executed on a distributed computing cluster (DCC) made up of hardware resources with specified service levels (CPU, memory, network bandwidth, cluster latency, storage IOPS, throughput and capacity). The left-hand side shows the desired end-to-end process configuration, evolution monitoring and control mechanism. When all is said and done, the process workflows need to execute various functions using the computing resources made available in the form of a distributed cluster providing the required CPU, memory, network bandwidth, latency, storage IOPS, throughput and capacity. The structure is determined by the non-functional requirements such as resource availability, performance, security and cost. Fluctuations evolve the process dynamics and require adjusting the resources to meet the needs of applications coping with those fluctuations.
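To make the two descriptions concrete, here is a minimal Python sketch (the class names and SLA fields are hypothetical, not taken from any cited implementation) that captures a service workflow as a DAG of functions, with each node carrying the resource service levels it needs:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ResourceSLA:
    """Non-functional requirements for one computing node (assumed units)."""
    cpu_cores: int
    memory_gb: float
    network_mbps: float
    storage_iops: int
    max_latency_ms: float

@dataclass
class ServiceNode:
    """A function in the workflow (the 'computed') plus the resources it needs."""
    name: str
    business_function: str          # e.g. "validate-order", "charge-card"
    sla: ResourceSLA

@dataclass
class ServiceWorkflow:
    """Functional structure: a DAG of service nodes and their dependencies."""
    nodes: Dict[str, ServiceNode] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (upstream, downstream)

    def add_node(self, node: ServiceNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, upstream: str, downstream: str) -> None:
        self.edges.append((upstream, downstream))
```

The point of the split is that the DAG of `ServiceNode`s describes what must be computed, while the `ResourceSLA` attached to each node describes what the infrastructure must supply for the computation to stay within its intent.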

Figure 1: Decoupling service orchestration and infrastructure orchestration to deliver function, structure and dynamic process flow to address the fluctuations both in resource availability and service demand


There are two ways to match the available resources to the computing nodes, connected by links, that execute the business process dynamics. The first approach is the current state of the art; the second is an alternative based on extensions to the current von Neumann stored program implementation of the Turing machine.

Current State of the Art

The infrastructure is infused with intelligence about various applications and their evolving needs and adjusts the resources (the time of computation, affected by CPU, network bandwidth, latency, storage capacity, throughput and IOPS, and the memory required for the computation). Current IT has evolved from a model where resources are provisioned in anticipation of peak workloads and the structure of the application network is optimized to cope with deviations from equilibrium. Conventional computing models using physical servers (often referred to as bare metal) cannot cope with wild fluctuations when new-server provisioning times are much longer than the time it takes for the fluctuations to set in, or when their magnitude cannot be predicted well enough to pre-plan the provisioning of additional resources. Virtualization of the servers and on-demand provisioning of virtual machines reduce provisioning times substantially, enabling auto-scaling, auto-failover and live migration across distributed resources using virtual machine image mobility. However, this comes with a price:

    1. The virtual machine image is still tied to the infrastructure (the network, storage and computing resources supporting the VM), and moving a VM involves manipulating a multitude of distributed resources, often owned or operated by different parties, and touching many infrastructure management systems, which increases the complexity and cost of management.
    2. If the distributed infrastructure is homogeneous and supports VM mobility, the solution is simpler, but it forces vendor lock-in and does not allow taking advantage of commodity infrastructure offered by multiple suppliers.
    3. If the distributed infrastructure is heterogeneous, VM mobility must depend on myriad management systems, and most often these management systems themselves need other management systems to manage their resources.
    4. VM mobility and management also increase bandwidth and storage requirements and lead to a proliferation of point solutions and tools for moving across heterogeneous distributed infrastructure, which increases operational complexity and adds cost.

The current state of the art, based on the mobility of VMs and infrastructure orchestration, is summarized in Figure 2.

Anti-Thesis: Current State of the Art

Figure 2: The infrastructure orchestration based on second guessing the application quality of service requirements and its dynamic behavior

It clearly shows the futility of orchestrating service availability, performance, compliance, cost and security in a highly distributed and heterogeneous environment where scale and fluctuations dominate. The cost and complexity of navigating multiple infrastructure service offerings often outweigh the benefits of commodity computing. It is one reason enterprises complain that 70% of their budget is often spent just keeping the service lights on.
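To illustrate why infrastructure-level orchestration amounts to second-guessing the application, here is a deliberately simplified sketch of a reactive VM auto-scaler; the `cluster` object and its methods are hypothetical stand-ins for whatever a given orchestrator exposes, not any vendor's API:

```python
import time

# Hypothetical thresholds; real orchestrators expose similar knobs under other names.
CPU_HIGH, CPU_LOW = 0.80, 0.30
CHECK_INTERVAL_S = 60

def reactive_vm_autoscaler(cluster):
    """Infrastructure-level loop: scale VMs from resource metrics alone,
    with no visibility into the service transaction running on them."""
    while True:
        utilization = cluster.average_cpu()      # resource metric only
        if utilization > CPU_HIGH:
            image = cluster.vm_image()           # a whole VM image must be moved/booted
            cluster.provision_vm(image)          # minutes, plus image transfer cost
        elif utilization < CPU_LOW and cluster.size() > cluster.min_size():
            cluster.retire_vm()
        time.sleep(CHECK_INTERVAL_S)
```

The loop reacts to CPU alone because that is all the infrastructure can see; the intent, latency constraints and transaction state of the service remain invisible to it, which is the gap the next section addresses.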

Alternative Approach: A Clean Separation of Business Logic Implementation and the Operational Realization of Non-functional Requirements

Another approach is to decouple application and business process workflow management from distributed infrastructure mobility: place the applications on infrastructure that has the right resources, monitor the evolution of the applications, and proactively manage the infrastructure to add or remove resources with predictability based on history. Based on the RPO and RTO, the application structure is adjusted to create active/passive or active/active nodes to manage application QoS and workflow/business process QoS. This approach requires a top-down method of business process implementation: the specification of the business process intent, followed by a hierarchical and temporal specification of the process dynamics with the context, constraints, communication and control of the group and its constituents, and the initial conditions for the equilibrium quality of service (QoS). The details include the following (a sketch of the policy logic follows the list):

  1. Non-functional requirements that specify availability, performance, security, compliance and cost constraints, with the policies specified as hierarchical and temporal process flows. The intent at the higher level is translated into the downstream intent of the computing nodes contributing to the workflow.
  2. A structure, distributed or otherwise, of a network of networks providing the computing nodes with specified SLAs for resources (CPU, memory, network bandwidth, latency, storage IOPS, throughput and capacity).
  3. A method to implement autonomic behavior with visibility and control of application components so that they can be managed with the defined policies. When scale and fluctuations demand a change in structure to transition to a new equilibrium state, the policy implementation processes proactively add or subtract computing nodes, or find existing nodes, to replicate, repair, recombine or reconfigure the application components. The structural change implements the transition from being to becoming.
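The policy logic referred to above can be sketched as follows; the thresholds and the mapping from recovery objectives to topology are illustrative assumptions, not a prescription from any cited implementation:

```python
from dataclasses import dataclass

@dataclass
class RecoveryObjectives:
    rpo_seconds: int   # how much committed work may be lost
    rto_seconds: int   # how quickly the service must be restored

def plan_topology(objectives: RecoveryObjectives) -> dict:
    """Derive an application component's structure from its recovery objectives.
    Hypothetical policy: the tighter the objectives, the more active the replicas."""
    if objectives.rpo_seconds == 0 and objectives.rto_seconds < 30:
        return {"mode": "active/active", "replicas": 2}
    if objectives.rto_seconds < 300:
        return {"mode": "active/passive", "replicas": 2}
    return {"mode": "single", "replicas": 1}

# Example: a payment component that tolerates no data loss and 10 s of downtime.
print(plan_topology(RecoveryObjectives(rpo_seconds=0, rto_seconds=10)))
```

The key design choice is that the topology decision is made at the application level from RPO/RTO intent, not inferred after the fact by the infrastructure from resource metrics.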

A New Architecture to Accommodate Scale and Fluctuations: Toward the Oneness of the Computer and the Computed

There is a fundamental reason why the current Turing, von Neumann stored program computing model cannot address large-scale distributed computing with fluctuations both in resources and in computation workloads without increasing complexity and cost (Mikkilineni et al. 2012). As von Neumann put it, “It is a theorem of Gödel that the description of an object is one class type higher than the object.” An important implication of Gödel’s incompleteness theorems is that it is not possible to have a finite description that contains itself as a proper part. In other words, it is not possible to read yourself or process yourself as a process. In short, Gödel’s theorems prohibit “self-reflection” in Turing machines. According to Alan Turing, “Gödel’s theorems show that every system of logic is in a certain sense incomplete, but at the same time it indicates means whereby from a system L of logic a more complete system L′ may be obtained. By repeating the process we get a sequence L, L1 = L′, L2 = L1′, … each more complete than the preceding. A logic Lω may then be constructed in which the provable theorems are the totality of theorems provable with the help of the logics L, L1, L2, … Proceeding in this way we can associate a system of logic with any constructive ordinal. It may be asked whether such a sequence of logics of this kind is complete in the sense that to any problem A there corresponds an ordinal α such that A is solvable by means of the logic Lα.”
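The construction in Turing's remark can be restated compactly; the following is only a notational summary of the quoted passage, where each prime denotes the system obtained by adjoining what its predecessor cannot prove about itself:

```latex
% Turing's ordinal logics: each system adjoins what the previous one cannot prove.
\[
  L_0 = L, \qquad L_{n+1} = L_n', \qquad L_\omega = \bigcup_{n<\omega} L_n,
  \qquad \text{and, more generally, } L_\alpha \text{ for any constructive ordinal } \alpha.
\]
```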

This observation, along with his introduction of the oracle machine, influenced many theoretical advances, including the development of generalized recursion theory, which extended the concept of an algorithm. “An o-machine is like a Turing machine (TM) except that the machine is endowed with an additional basic operation of a type that no Turing machine can simulate.” Turing called the new operation the ‘oracle’ and said that it works by ‘some unspecified means’. When the Turing machine is in a certain internal state, it can query the oracle for an answer to a specific question and act accordingly depending on the answer. The o-machine provides a generalization of the Turing machine that can be used to explore ways of addressing the impact of Gödel’s incompleteness theorems and problems that are not explicitly computable but are limit computable using relative reducibility and relative computability.
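A minimal sketch may help fix the idea of the o-machine; everything here (the `step` and `oracle` callables, the tape representation) is a hypothetical illustration of the extra primitive, not Turing's formalism:

```python
from typing import Callable, List, Optional, Tuple

State = str
StepResult = Tuple[State, List, Optional[str]]   # (next state, tape, query or None)

def o_machine(tape: List,
              step: Callable[[State, List], StepResult],
              oracle: Callable[[str], bool],
              max_steps: int = 10_000) -> List:
    """A Turing-machine-like loop extended with one extra primitive: when the
    ordinary transition function raises a question it cannot decide, the
    machine defers to the oracle ('by some unspecified means') and continues."""
    state: State = "start"
    for _ in range(max_steps):
        state, tape, query = step(state, tape)   # ordinary TM transition
        if query is not None:                    # undecidable for the base machine
            tape.append(oracle(query))           # the o-machine's extra operation
        if state == "halt":
            break
    return tape
```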

According to Mark Burgin, an Information processing system (IPS) “has two structures—static and dynamic. The static structure reflects the mechanisms and devices that realize information processing, while the dynamic structure shows how this processing goes on and how these mechanisms and devices function and interact.”

The software contains the algorithms (à la the Turing machine) that specify the information processing tasks, while the hardware provides the resources required to execute them. The static structure is defined by the association of software and hardware devices, and the dynamic structure is defined by the execution of the algorithms. The meta-knowledge of the intent of an algorithm, the association of a specific algorithm execution with a specific device, and the temporal evolution of information processing and exception handling when the computation deviates from the intent (whether because of software behavior, hardware behavior or their interaction with the environment) lies outside the software and hardware design and is expressed in non-functional requirements. Mark Burgin calls this Infware, which contains the description and specification of the meta-knowledge and can itself be implemented using hardware and software to enforce the intent with appropriate actions.

The implementation of Infware using Turing machines runs into the same dichotomy mentioned by Turing, in the form of the manager-of-managers conundrum. This is consistent with the observation of Cockshott et al. (2012): “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

The goals of the distributed system determine the resource requirements and the computational process definition of individual service components based on their priorities, workload characteristics and latency constraints. The overall system resiliency, efficiency and scalability depend on the individual service components' workloads and the latency characteristics of their interconnections, which in turn depend on the placement of these components (configuration) and the available resources. Resiliency (fault, configuration, accounting, performance and security, often denoted FCAPS) is measured with respect to a service’s tolerance to faults, fluctuations in contention for resources, performance fluctuations, security threats and changing system-wide priorities. Efficiency denotes optimal resource utilization. Scaling addresses end-to-end resource provisioning and management as the number of computing elements required to meet service needs increases.

A possible solution for addressing resiliency with respect to scale and fluctuations is an application network architecture, based on increasing the intelligence of computing nodes, which was presented at the Turing Centenary Conference (2012) for improving the resiliency, efficiency and scaling of information processing systems. In essence, the distributed intelligent managed element (DIME) network architecture extends the conventional computational model of information processing networks, allowing improvement of the efficiency and resiliency of computational processes. The approach is based on organizing the process dynamics under the supervision of intelligent agents. The DIME network architecture utilizes the DIME computing model with a non-von Neumann parallel implementation of a managed Turing machine with a signaling network overlay, and adds cognitive elements to evolve super-recursive information processing. It introduces three key functional constructs to enable process design, execution and management that improve both the resiliency and the efficiency of application area networks delivering distributed service transactions using both software and hardware (Burgin and Mikkilineni); a minimal sketch of the managed instruction cycle follows the list:

  1. Machines with an oracle: Executing an algorithm, the DIME basic processor P performs the {read -> compute -> write} instruction cycle or its modified version, the {interact with a network agent -> read -> compute -> interact with a network agent -> write} instruction cycle. This allows different network agents to influence the further evolution of a computation while the computation is still in progress. We consider three types of network agents: (a) a DIME agent, (b) a human agent and (c) an external computing agent. It is assumed that a DIME agent knows the goal and intent of the algorithm (along with its context, constraints, communications and control) that the DIME basic processor is executing, and has visibility of the available resources and of the needs of the basic processor as it executes its tasks. In addition, the DIME agent has knowledge of the alternate courses of action available to facilitate the evolution of the computation toward its goal and intent. Thus, every algorithm is associated with a blueprint (analogous to a genetic specification in biology), which provides the knowledge required by the DIME agent to manage the process evolution. An external computing agent is any computing node in the network with which the DIME unit interacts.
  2. Blueprint- or policy-managed fault, configuration, accounting, performance and security monitoring and control (FCAPS): The DIME agent, which uses the blueprint to configure, instantiate and manage the DIME basic processor executing the algorithm, uses concurrent DIME basic processors, with their own blueprints specifying their evolution, to monitor the vital signs of the managed DIME basic processor, and implements various policies to assure non-functional requirements such as availability, performance, security and cost management while the managed DIME basic processor is executing its intent. This approach integrates the execution of an algorithm with concurrent management of the available resources to assure the progress of the computation.
  3. DIME network management control overlay over the managed Turing oracle machines: In addition to the read/write communication of the DIME basic processors (the data channel), the DIMEs communicate with each other using a parallel signaling channel. This allows external DIME agents to influence the computation of any managed DIME basic processor in progress, based on the context and constraints. The external DIME agents are DIMEs themselves. As a result, changes in one computing element can influence the evolution of another computing element at run time without halting the Turing machine executing its algorithm. The signaling channel and the network of DIME agents can be programmed to execute a process whose intent is specified in a blueprint. Each DIME basic processor can have its own oracle managing its intent, and groups of managed DIME basic processors can have their own domain managers implementing the domain’s intent to execute a process. The management DIME agents specify, configure and manage the sub-network of DIME units by monitoring and executing policies to optimize the resources while delivering the intent.
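The managed instruction cycle of item 1, together with the blueprint and signaling channel of items 2 and 3, can be sketched as follows; the object names and methods are hypothetical, intended only to show where the DIME agent intervenes relative to the ordinary read-compute-write cycle, not a published DNA API:

```python
def managed_cycle(processor, dime_agent, signaling_channel):
    """One pass of the managed cycle:
    {interact with a network agent -> read -> compute ->
     interact with a network agent -> write}."""
    # 1. Consult the DIME agent before reading: it knows the blueprint
    #    (intent, context, constraints) and the current FCAPS state.
    directive = dime_agent.pre_check(signaling_channel.poll())
    if directive == "reconfigure":
        processor.apply(dime_agent.blueprint.new_configuration())

    data = processor.read()              # ordinary data channel
    result = processor.compute(data)

    # 2. Consult the agent again before committing the result, so that faults,
    #    SLA violations or a changed intent can redirect the step in flight.
    if dime_agent.post_check(result, processor.vital_signs()) == "ok":
        processor.write(result)
    else:
        dime_agent.remediate(processor)  # replicate / repair / migrate per policy
```

The signaling interactions bracket the data-channel work, which is what allows the network of agents to steer a computation without halting the Turing machine that executes it.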

The result is a new computing model, a management model and a programming model that infuse self-awareness, through an intelligent Infware, into a group of software components deployed on a distributed cluster of hardware devices, while enabling the monitoring and control of the dynamics of the computation to conform to the intent of the computational process. The DNA-based control architecture configures the software and hardware components appropriately to execute the intent. As the computation evolves, the control agents monitor the evolution and make appropriate adjustments to maintain an equilibrium conforming to the intent. When fluctuations create the conditions for an unstable equilibrium, the control agents reconfigure the structure to create a new equilibrium state that conforms to the intent, based on policies.

Figure 3 shows the Infware, hardware and software executing a web service using DNA.


Figure 3: Hardware and software networks with a process control Infware orchestrating the life-cycle evolution of a web service deployed on a Distributed Computing Cluster

The hardware components are managed dynamically to configure an elastic distributed computing cluster (DCC) that provides the resources required to execute the computations. The software components are organized as managed Turing oracle machines with a control architecture to create AANs that can be monitored and controlled to execute the intent, using the network management abstractions of replication, repair, recombination and reconfiguration. With DNA, datacenters are able to evolve from being to becoming.
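A rough sketch of how a domain manager might apply the four abstractions follows; the `blueprint` and `dcc` interfaces are assumptions introduced for illustration, not the published DNA implementation:

```python
class AANManager:
    """Hypothetical domain manager for an application area network (AAN):
    it keeps the application's intent by applying the four abstractions
    to its component nodes on the distributed computing cluster (DCC)."""

    def __init__(self, blueprint, dcc):
        self.blueprint = blueprint   # intent, policies, desired topology
        self.dcc = dcc               # resource-side interface (assumed)

    def enforce_intent(self, observed_state: dict) -> None:
        for node, status in observed_state.items():
            if status == "failed":
                self.dcc.replicate(node, self.blueprint.spec(node))    # replication
            elif status == "degraded":
                self.dcc.repair(node)                                  # repair
            elif status == "overloaded":
                self.dcc.recombine(node, self.blueprint.peers(node))   # recombination
        if self.dcc.topology() != self.blueprint.topology():
            self.dcc.reconfigure(self.blueprint.topology())            # reconfiguration
```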

It is important to note that DNA has been implemented (Mikkilineni et al. 2012, 2014) to demonstrate a couple of functions that cannot be accomplished today with the current state of the art:

  1. Migrating a workflow executing on a physical server (a web service transaction spanning a web server, an application server and a database) to another physical server without a reboot or lost transactions, so as to maintain recovery time and recovery point objectives. No virtual machines are required, although they can be used just as if they were bare-metal servers.
  2. Providing workflow auto-scaling, auto-failover and live migration with retention of application state on distributed computing clusters with heterogeneous infrastructure (bare-metal servers, private and public clouds etc.), without infrastructure orchestration to accomplish them (e.g., without moving virtual machine images or LXC container-based images).

The approach using DNA allows the implementation of the above functions without requiring changes to existing applications, operating systems or current infrastructure, because the architecture non-intrusively extends the current Turing computing model to a managed Turing oracle machine network with a control network overlay. It is not a coincidence that similar abstractions are present in how cellular organisms, human organizations and telecommunication networks self-govern and deliver the intent of the system (Mikkilineni 2012).

Only time will tell if the DNA implementation of Infware is an incremental or leap-frog innovation.

Acknowledgements

This work originated from discussions started at IEEE WETICE 2009 to address the complexity, security and compliance issues in cloud computing. The work of Dr. Giovanni Morana and the C3DNA team, and the theoretical insights of Professor Eugene Eberbach, Professor Mark Burgin and Pankaj Goyal, are behind the current implementation of DNA.


Convergence of Distributed Clouds Grids and their Management
June 23, 2013

Here is an excerpt from WETICE 2013 Track #3 – Convergence of Distributed Clouds, Grids and Their Management.

 

Virtualization, Cloud Computing and the Emerging Datacenter Complexity Cliff

Convergence of Distributed Clouds, Grids and their Management – CDCGM2013

WETICE2013 – Hammamet, June 17 – 20, 2013

Track Chair’s Report

Dr. Rao Mikkilineni, IEEE Member, and Dr. Giovanni Morana

Abstract

The Convergence of Distributed Clouds, Grids and their Management conference track focuses on virtualization and cloud computing as they enjoy wider acceptance. A recent IDC report predicts that by 2016, $1 of every $5 will be spent on cloud-based software and infrastructure. Three papers address key issues in cloud computing such as resource optimization and scaling to address changing workloads and energy management. In addition, the DIME network architecture proposed in WETICE 2010 is discussed in two papers in this conference, both showing its usefulness in addressing fault, configuration, accounting, performance and security of service transactions within a service-oriented architecture implementation, including transactions spanning multiple clouds.

While virtualization has brought resource elasticity and application agility to services infrastructure management, the resulting layers of orchestration and the lack of end-to-end service visibility and control spanning multiple service provider infrastructures have added an alarming degree of complexity. Hopefully, reducing the complexity of next generation datacenters will be a major research topic in this conference.

Introduction

While virtualization and cloud computing have brought elasticity to computing resources and agility to applications in a distributed environment, they have also increased the complexity of managing the various distributed applications that contribute to a distributed service transaction, by adding layers of orchestration and management systems. There are three major factors contributing to the complexity:

  1. Current IT datacenters have evolved from their server-centric, low-bandwidth origins to distributed, high-bandwidth environments where resources can be dynamically allocated to applications using computing, network and storage resource virtualization. While virtual machines improve resiliency and provide live migration to reduce recovery time objectives in case of service failures, the increased complexity of hypervisors, their orchestration, and virtual machine images and their movement and management adds a further burden in the datacenter. A recent global survey commissioned by Symantec Corporation, involving 2,453 IT professionals at organizations in 32 countries, concludes [1] that the complexity introduced by virtualization, cloud computing and the proliferation of mobile devices is a major problem. The survey asked respondents to rate the level of complexity in each of five areas on a scale of 0 to 10, and the results show that datacenter complexity affects all aspects of computing, including security and infrastructure, disaster recovery, storage and compliance. For example, respondents on average rated all the areas 6.56 or higher on the complexity scale, with security topping the list at 7.06. The average level of complexity across all areas for companies around the world was 6.69. The survey shows that organizations in the Americas on average rated complexity highest, at 7.81, and those in Asia-Pacific/Japan lowest, at 6.15.
  2. As the complexity increases, the response is to introduce more automation of resource administration and operational controls. However, the increased complexity of managing services may be more a fundamental architectural issue, related to Gödel’s prohibition of self-reflection in Turing machines [2], than a software design or operational execution issue. Cockshott et al. [3] conclude their book “Computation and its Limits” with the paragraph: “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Automation of dynamic resource administration at run time makes the computer itself a part of the model and also a part of the problem.
  3. As services increasingly span multiple datacenters, often owned and operated by different service providers and operators, it is unrealistic to expect that more software coordinating the myriad resource management systems belonging to different owners is the answer to reducing complexity. A new approach that decouples service management from the underlying distributed resource management systems, which are often non-communicative and cumbersome, is in order.

The current course becomes even more untenable with the advent of many-core servers with tens and even hundreds of computing cores and high-bandwidth communication among them. It is hard to imagine replicating current TCP/IP-based socket communication, “isolate and fix” diagnostic procedures, and the multiple operating systems (which do not have end-to-end visibility or control of business transactions that span multiple cores, multiple chips, multiple servers and multiple geographies) inside the next generation many-core servers without addressing their shortcomings. The many-core servers and processors constitute a network where each node is itself a sub-network with different bandwidths and protocols (socket-based low-bandwidth communication between servers, InfiniBand or PCI Express bus based communication across processors in the same server, and shared-memory based low-latency communication across the cores inside a processor).
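The network-of-networks hierarchy can be made concrete with a small model; the bandwidth and latency figures below are rough orders of magnitude, assumed for illustration only:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Link:
    """One tier of the network of networks."""
    kind: str              # e.g. "shared memory", "PCIe/InfiniBand", "TCP/IP socket"
    bandwidth_gbps: float  # rough order of magnitude only
    latency_us: float      # rough order of magnitude only

# Cores inside a processor, processors inside a server, servers across the datacenter.
hierarchy: List[Link] = [
    Link("shared memory between cores",        100.0,   0.1),
    Link("PCIe/InfiniBand between processors",  50.0,   1.0),
    Link("TCP/IP sockets between servers",      10.0, 100.0),
]

for link in hierarchy:
    print(f"{link.kind}: ~{link.bandwidth_gbps} Gbps, ~{link.latency_us} µs")
```

Each tier differs from its neighbors by one to two orders of magnitude in latency, which is why a single flat management model strains to serve all three.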

The tradition that started in WETICE 2009, “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity”, has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012 [3, 4]. Two papers in this conference continue the investigation of its usefulness. Hopefully, this tradition will result in other novel and different approaches to the datacenter complexity issue, while incremental improvements continue, as is evident from another three papers.

http://wetice.org

 

Is the Software Defined Network (SDN) Another Detour to a Datacenter Dead-end?
August 6, 2012

Introduction

Frustrated by the inability to fiddle with Internet routing in the real world, Stanford computer scientist Nick McKeown and colleagues developed a standard called OpenFlow that essentially opens up the Internet to researchers, allowing them to define data flows using software: a sort of “software-defined networking.” Installing a small piece of OpenFlow firmware (software embedded in hardware) gives engineers access to flow tables, the rules that tell switches and routers how to direct network traffic, while protecting the proprietary routing instructions that differentiate one company’s hardware from another. SDN is nothing more than the separation of network data traffic processing from the logic and rules controlling the flow, inspection and modification of that data. Traditional network hardware, i.e. switches and routers, implements these functions in proprietary firmware partitioned respectively into what are known as the data and control planes. While this is a fine research project, as the major vendors start to take it seriously and attempt to introduce it into real-world datacenters, one must ask whether it will add or reduce complexity in the already complex datacenter, where a host of piecemeal solutions are offered by mega-corporations seeking to continually increase their revenues, with no incentive to reduce complexity by eliminating hardware and software components whose removal would cut into their product sales.
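For readers unfamiliar with flow tables, the following toy sketch shows the match-action idea in schematic form; it is not the OpenFlow wire protocol or any vendor's API:

```python
# A toy flow table in the spirit of OpenFlow: exact-match fields -> action.
# Schematic only; real OpenFlow uses wildcards, priorities and a wire protocol.
flow_table = [
    {"match": {"dst_ip": "10.0.0.5", "tcp_port": 80}, "action": "forward:port2"},
    {"match": {"dst_ip": "10.0.1.7"},                 "action": "forward:port3"},
]

def handle_packet(packet: dict) -> str:
    """Data plane: apply rules installed by a separate control plane."""
    for rule in flow_table:
        if all(packet.get(field) == value for field, value in rule["match"].items()):
            return rule["action"]
    return "send-to-controller"   # unmatched traffic is punted to the controller

print(handle_packet({"dst_ip": "10.0.0.5", "tcp_port": 80}))   # forward:port2
```

The separation is visible in the sketch: the switch only evaluates the table, while a controller elsewhere decides what the table contains.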

Systems theory tells us that as the number of components in a system increases, the cost of complexity can outweigh the benefits unless architectural reorganization provides a way out. We argue that the management complexity in current IT infrastructure design, based on the serial von Neumann stored program control implementation of the universal Turing machine, is a more fundamental architectural issue, related to the lack of resiliency of the computing model, than a software design issue. Cockshott et al. (2012) conclude their book “Computation and its Limits” with the paragraph: “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Current generation distributed systems are implemented using a network of Turing machines in which the service and its management are intermixed, as shown in Figure 1. The resources utilized by the nodes in a network are often controlled by a plethora of management systems that are outside the purview of the service workflow using the resources. Thus the end-to-end service transaction response is controlled by these management systems, which introduce a layer of complexity in coordination and contention resolution that makes the service much simpler than its management.

Figure 1: Serial von Neumann implementation of Turing Machines

The limitations of the SPC computing architecture were clearly on von Neumann's mind when he gave his lecture at the Hixon Symposium in 1948 in Pasadena, California (von Neumann, 1987, p. 408): “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing computing machines and living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say, “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond” (von Neumann, 1987, p. 408). It is clear that von Neumann recognized a problem in the way we design computing systems.

“Normally, a literary description of what an automaton is supposed to do is simpler than the complete diagram of the automaton. It is not true a priori that this always will be so. There is a good deal in formal logic which indicates that when an automaton is not very complicated the description of the function of the automaton is simpler than the description of the automaton itself, as long as the automaton is not very complicated, but when you get to high complications, the actual object is much simpler than the literary description.” (von Neumann, 1987, pp. 454-457). He remarked, “It is a theorem of Gödel that the description of an object is one class type higher than the object and is therefore asymptotically infinitely longer to describe.” (von Neumann, 1987, pp. 454-457). The conjecture of von Neumann leads to the fact that “one cannot construct an automaton which will predict the behavior of any arbitrary automaton” (von Neumann, 1987, p. 456). This is so with the Turing machine implemented by the SPC model.

In simpler terms, the management complexity is related to the classical Russell paradox, which can be paraphrased as: “Who manages the managers?” Gödel’s prohibition of self-reflection in a Turing machine mandates a hierarchy of Turing machines acting as managers managing other Turing machines, which implement the computations described as a sequence of instructions compiled into a sequence of 1s and 0s. The universal Turing machine (or the general purpose computer) implements these TMs in a synchronous workflow, thus prohibiting changes to the computation in any Turing machine at run time while that computation is in progress (i.e., you cannot change the behavior of that computation (compiled code) until its execution is interrupted).
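The regress can be made concrete with a toy sketch (hypothetical code, not any product's design): each manager is itself a program that may misbehave, so assuring it requires yet another manager one level up:

```python
class Worker:
    def restart_if_needed(self):
        pass  # placeholder for the managed computation

class Manager:
    """Each manager is itself a program that can fail, so assuring it
    needs another manager one class type higher, and so on."""
    def __init__(self, managed, level: int):
        self.managed = managed
        self.level = level

    def restart_if_needed(self):
        pass  # a manager is also a managed object for the level above it

    def supervise(self):
        self.managed.restart_if_needed()   # it can correct what it manages,
        # but it cannot inspect or correct itself while it is running --
        # the self-reflection that Gödel's result rules out.

worker = Worker()
m1 = Manager(worker, level=1)   # manages the computation
m2 = Manager(m1, level=2)       # who manages the manager?
m3 = Manager(m2, level=3)       # the hierarchy never closes on itself
```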

Current generation server, networking and storage equipment and their management systems have evolved from server-centric and bandwidth-limited network architectures to today’s cloud computing architecture with virtual servers and broadband networks. During the last six decades, many layers of computing abstractions have been introduced to map the execution of complex computational workflows to a sequence of 1s and 0s that eventually get stored in memory and operated upon by the CPU to achieve the desired result. These include process definition languages, programming languages, file systems, databases, operating systems etc. While this has helped in automating many business processes, the exponential growth of services in the consumer market has also put severe strains on current IT infrastructure. In order to respond rapidly and manage the distributed computing resources demanded by changing workloads, business priorities and latency constraints, new layers of resource management have been added with the introduction of hypervisors, virtual machines (VMs) and their management. While these layers have made application and service management more agile, they have introduced a new layer of issues related to their own management. For example, new layers of virtual-machine-level clustering, intrusion detection and performance management are being introduced in addition to the already existing clustering, intrusion detection and performance management systems at the infrastructure, operating system and distributed resource management layers.

However, this approach is completely unsuited to exploiting the new generation of many-core servers and high-bandwidth networks now available. The advent of many-core servers with tens and even hundreds of computing cores and high-bandwidth communication among them makes the current generation of server, networking and storage equipment and their management systems, which have evolved from server-centric and bandwidth-limited architectures, ill suited to using the next generation computing infrastructure efficiently. It is hard to imagine replicating current TCP/IP-based socket communication, “isolate and fix” diagnostic procedures, and the multiple operating systems (which do not have end-to-end visibility or control of business transactions that span multiple cores, multiple chips, multiple servers and multiple geographies) inside the next generation many-core servers without addressing their shortcomings. The many-core servers and processors constitute a network where each node is itself a sub-network with different bandwidths and protocols (socket-based low-bandwidth communication between servers, InfiniBand or PCI Express bus based communication across processors in the same server, and shared-memory based low-latency communication across the cores inside a processor).

Figure 2 shows the many-core server network supporting multiple bandwidths.

In order to cope with the scaling issues and utilize the hierarchical many-core network of networks effectively, next generation service architecture has to emulate the architectural resiliency of cellular organisms that tolerate faults and implement command and control structures which enable execution of self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing (in short self-*) business processes. This requires new computing models that break the Turing machine barrier to computation by allowing the computer and the computed to be treated in the same model.

Papers Solicited to Address Next Generation Datacenter Infrastructure and Technologies:

The conference track on “Convergence of Distributed Clouds, Grids and their Management”, sponsored under the aegis of WETICE 2013, is devoted to next generation computing models that support real-time resource reconfiguration of distributed business workflow execution based on latency constraints, changing workloads and business priorities. It addresses the assurance of reliability, availability, performance, account management and security of distributed business process execution with appropriate visibility and control.

The objective of the conference was first stated in WETICE 2009: “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short-term profit driven motives of a particular corporate entity.” We are glad to report that the discussions started in 2009 have directly resulted in an approach to self-managing distributed computing systems totally different from the current industry trend, showing a way to eliminate the complexity of virtual machines and hypervisors. If this approach proves theoretically sound (as a paper in WETICE 2012 investigated) and extends its usefulness (demonstrated through two proofs of concept in the last conference) to mission critical environments, the DIME network architecture may yet prove to be an important contribution to computer science.

Following this tradition, the target of WETICE 2013 is to transform current complex, redundant, costly and knowledge-intensive IT management into self-configuring, self-monitoring, self-healing and self-optimizing distributed workflow implementations with service management limited only by the speed of light. We identify another emerging area, software defined networks (SDN), as a candidate for further investigation, without the bias that often surrounds commercial profit motives, to see whether the overall complexity of the datacenter will be reduced or SDNs are yet another layer of complexity.

Papers are solicited to advance next generation distributed computing and its management infrastructure that leverages the new hardware innovations. The goals of the conference include (but are not limited to):
  1. Discovering new application scenarios, proposing new operating systems, programming abstractions and tools
  2. Identifying the challenging problems that still need to be solved, such as parallel programming, scaling and management of distributed computing elements, and
  3. Reporting results and experiences gained by researchers in building dynamic Grid-based middleware, computing clouds (distributed or otherwise) and workflow management systems.
Submission of papers: March 10, 2013
Notification to authors: April 1, 2013
Final papers to IEEE-CS: April 25, 2013
Paper authors’ registration deadline: May 10, 2013
WETICE-2013 Conference: June 17-20, 2013

References:

P. Cockshott, L. M. MacKenzie and G. Michaelson, Computation and its Limits, Oxford University Press, Oxford, 2012.

J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” in Automata Studies, C. E. Shannon and J. McCarthy (eds.), Princeton University Press, 1956, pp. 43-98.

W. Aspray and A. Burks (eds.), Papers of John von Neumann on Computing and Computer Theory, Cambridge, MA: MIT Press, 1987.