Implementing Fault Resilient Strategies in Cloud Computing via Federated Learning Approach


Department of Computer Science and IT, Faculty of Computer Applications & Information Technology and Sciences, AKS University, Satna, Madhya Pradesh, 485001, India

Abstract

Faults are inevitable in very large-scale distributed computing systems such as cloud computing. The size of distributed computing systems is growing drastically with the advent of the Internet of Things (IoT). Faults occur frequently at any working node and cause partial or complete failure of cloud applications. Implementing fault-resilient systems and securing cloud systems have become key challenges in recent years. A novel model based on federated learning (FL) is analyzed and proposed to deal with these challenges. Federated learning, a special kind of distributed deep learning, works in collaboration with distributed computing machines. A federated learning model can be deployed on multiple clusters of computing nodes. One feature of distributed computing is that it grows drastically in both horizontal and vertical directions; the federated learning model is therefore deployed for both horizontal and vertical scaling. FL deployed with distributed deep learning can identify, recognize, and resolve faults to a great extent.

Keywords

Distributed Computing, Fault Resilience, Federated Learning, IoT, Cloud Computing

Introduction

To reduce the adverse effects of faults, machine learning (ML), especially the federated learning (FL) approach, is deployed. Federated learning is a distributed and decentralized learning paradigm. The federated learning approach is well suited to a distributed system because a set of worker machines (or nodes) can train the local models. Different chunks of a dataset are distributed among the worker nodes or third parties, and these sections of the dataset are not shared among the working computational nodes. Thus, federated learning is also a significant model for achieving data privacy and data security in addition to fault tolerance. Existing FL approaches focus on optimizing only one dimension of the target space.

The proposed methods can reduce communication costs and improve the efficiency of distributed computing. The federated deep learning (FDL) method minimizes the adverse effects of faults while improving the convergence rate. This approach utilizes weighted aggregation to improve accuracy. FDL is capable of detecting and diagnosing the faults that occur frequently on end-user devices as well as at the edge. FDL is a novel communication-efficient FL approach that incorporates both synchronous and asynchronous arrangements.

Federated learning (FL) is a multi-modal machine learning setting that trains an algorithm across various distributed and decentralized edge devices that hold local datasets. Intelligent devices such as PDAs, smartphones, desktops, and tablets have been scaling rapidly in recent years. Most of these devices are equipped with multiple sensors that allow them to produce and consume a huge amount of information. The distributed computing hierarchy consists of the cloud, the edge, and end-user devices. End-user devices train the local models and use local datasets.

The behavioral heterogeneity of end devices and clients becomes a key cause of fault inclusion in cloud systems. The cloud system plays a major role in scaling big data.

Preliminaries

Federated learning models include hundreds of thousands of remotely distributed end devices. These devices use their device-generated data or sections of datasets provided by the parameter server. All the participating devices connect to a central server to obtain updated model parameters. In general, the principal objective of federated learning is to solve the following:

$$\min_{w} f(w) = \sum_{k=1}^{m} p_k F_k(w)$$

where m is the total number of participating end devices, p_k ≥ 0 is the weight of the k-th device, and F_k(w) is its local objective. The end devices' objective is to calculate the gradients of the local models; stochastic gradient descent (SGD) is the most commonly used algorithm on local devices. After aggregating the local gradients, federated learning generates the global model parameters for obtaining the final inference.
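A minimal NumPy sketch of this objective is given below: each device takes a few local SGD steps on its own loss F_k, and the server forms the weighted aggregate with weights p_k. The toy quadratic losses, equal weights, and function names are illustrative assumptions, not the paper's implementation.

   import numpy as np

   def local_sgd_step(w, grad_fn, lr=0.01):
       """One SGD step on a local device: w <- w - lr * grad F_k(w)."""
       return w - lr * grad_fn(w)

   def aggregate(local_weights, p):
       """Weighted aggregation at the parameter server: w = sum_k p_k * w_k."""
       return sum(p_k * w_k for p_k, w_k in zip(p, local_weights))

   # Illustrative usage on toy quadratic losses F_k(w) = ||w - c_k||^2.
   centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0])]
   m = len(centers)
   p = np.ones(m) / m                    # equal weights p_k = 1/m
   w_global = np.zeros(2)                # initial global model

   for round_ in range(50):              # communication rounds
       local = []
       for c_k in centers:               # each end device trains locally
           grad_fn = lambda w, c=c_k: 2.0 * (w - c)
           w_k = w_global
           for _ in range(5):            # a few local SGD steps
               w_k = local_sgd_step(w_k, grad_fn, lr=0.1)
           local.append(w_k)
       w_global = aggregate(local, p)    # server updates the global model

   print("final global model:", w_global)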

Methodologies

To implement a federated learning strategy, a deep learning algorithm is initially deployed on each participating end device to estimate the local gradients of the loss function. FL models are deployed on clusters of end devices. FL collaborates with and coordinates each end device, or cluster of end devices, with the help of parameter servers. The simple distributed algorithm given below establishes the correspondence between the server and the client processes.

The Amazon SageMaker framework is used to implement the proposed FDL model. The following steps are carried out to accomplish the task; a schematic sketch of this workflow with the SageMaker Python SDK follows the list.

• Creating a notebook instance.

• Preparing and preprocessing the data.

• Training the proposed model with appropriate datasets.

• Deploying the model on the designated cloud.

• Evaluating the proposed model for measuring its performance.

• Monitoring the model’s performance and accuracy.
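The following is a schematic sketch of these steps using the SageMaker Python SDK; it is not the paper's exact code. The IAM role ARN, S3 paths, the training script fdl_train.py, framework versions, and instance types are placeholders assumed for illustration.

   import sagemaker
   from sagemaker.tensorflow import TensorFlow

   session = sagemaker.Session()
   role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder

   # Train: fdl_train.py is a hypothetical script implementing the FDL model.
   estimator = TensorFlow(
       entry_point="fdl_train.py",
       role=role,
       instance_count=2,                 # multiple instances for distributed training
       instance_type="ml.m5.xlarge",
       framework_version="2.11",
       py_version="py39",
       sagemaker_session=session,
   )
   estimator.fit({"training": "s3://my-bucket/fdl/train"})          # placeholder S3 path

   # Deploy the trained model behind an endpoint.
   predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

   # Evaluate/monitor: send sample records and inspect the predictions.
   print(predictor.predict([[0.1, 0.2, 0.3]]))

   predictor.delete_endpoint()           # clean up the endpoint when finished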

Simple Distributed Algorithm

process p1                                  // runs on the end device
   var u : IN init 0                        // counts completed handshakes
       ACK : Boolean init true
begin
   ~ACK ∧ REC(s) → ACK := true; u := u + 1  // message s received: count it
   ACK → send(t); ACK := false              // acknowledge with t, await the next s
end

process p2                                  // runs at the cloud / parameter server
   var wait : Boolean init true
begin
   ~wait → send(s); wait := true            // send s and wait for the acknowledgement
   wait ∧ REC(t) → wait := false            // acknowledgement t received
end

Here p1 is the process running on the end device and p2 is the process running outside the device, i.e., in the cloud where the central servers are deployed.

The parameter (or central) server is deployed at the cloud layer to orchestrate and coordinate the local machines at the end devices. The parameter server aggregates the local updates and upgrades the global model after receiving the updated local models. The edge works as an intermediate layer between the cloud and the end devices. The edge acts much like the cloud: it takes the output of end devices as input, applies aggregation and classification where necessary, and finally transfers its intermediate output to the cloud system for further processing if required. Each participating end device trains on its local dataset with its local model, which reduces the occurrence of faults to a great extent.
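A minimal sketch of this two-level aggregation is shown below: each edge node pre-aggregates the updates of its attached end devices, and the cloud combines the intermediate edge outputs into the global model. The dataset-size weighting, device names, and function names are assumptions made for illustration.

   import numpy as np

   def weighted_average(updates, weights):
       """Weighted average of model-parameter vectors."""
       weights = np.asarray(weights, dtype=float)
       weights = weights / weights.sum()
       return sum(w * u for w, u in zip(weights, updates))

   def edge_aggregate(client_updates, client_sizes):
       """Edge layer: combine the updates of its attached end devices,
       weighting each device by the size of its local dataset."""
       return weighted_average(client_updates, client_sizes), sum(client_sizes)

   def cloud_aggregate(edge_results):
       """Cloud layer: combine the intermediate edge outputs into the global model."""
       updates, sizes = zip(*edge_results)
       return weighted_average(updates, sizes)

   # Illustrative usage: two edge nodes, each serving a few end devices.
   edge_a = edge_aggregate([np.array([1.0, 2.0]), np.array([3.0, 4.0])], [100, 300])
   edge_b = edge_aggregate([np.array([0.0, 1.0])], [200])
   global_model = cloud_aggregate([edge_a, edge_b])
   print("global model after one round:", global_model)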

Parallel / Distributed Processing Mechanism

In a cloud system, multiple processes run simultaneously on servers distributed or scattered across the globe. In parallel computing, a task or program is divided into multiple processes, and each process is executed by a processor of a single-processor or multi-processor system. Whenever multiple processes are executed simultaneously by a multi-processor system, it is known as parallel processing.

   Distributed computing is an extension of parallel computing in the sense that parallel processing is performed on processing units that are distributed geographically. When parallel processing is carried out in a cloud computing environment, the processing devices are distributed across different locations, most commonly on servers in data centers. Certain protocols must be followed in parallel processing. For example, in a UNIX/LINUX operating system environment, a system routine known as fork is called to create a new instance of a process.

   #include <stdio.h>
   #include <unistd.h>
   #include <sys/types.h>

   void child(void)
   {
      /* work executed by the child process, e.g. on the end device */
   }

   int main(void)
   {
      pid_t RetVal = fork();       /* create a new instance of the process */

      if (RetVal == 0)
      {
         child();                  /* child process starts running on the end device */
      }
      else if (RetVal == -1)
      {
         printf("child process creation failed\n");
      }
      else
      {
         /* parent/master process starts running at this point */
      }
      return 0;
   }

Synchronization

A distributed system in a cloud environment suffers from the practical problem of synchronization among processes and heterogeneous resources. A common lock-and-unlock mechanism is used in the proposed system to deal with the synchronization problem. The following is the code for the lock and unlock procedures:

   Lock (L)
   {
      while (L == 1)
      {
         ;           /* busy-wait (no operation) until the lock is released */
      }
      L = 1;         /* entry closed: the lock is acquired */
   }

   Unlock (L)
   {
      L = 0;         /* entry open: the lock is released */
   }

In this algorithm, L is the locking variable that works as an entry point. If L = 0, the entry is open, and when L = 1, the entry is closed. When a process on a participating end device needs to access the shared data, it invokes the Lock(L) procedure. When the value of L becomes 0, the Lock(L) procedure sets it to 1. The Unlock(L) procedure resets the value of L to 0.
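As an illustrative Python analogue of this protocol, the sketch below uses threading.Lock to serialize access to shared global parameters while client threads apply their updates. The shared-state layout and thread workload are assumptions made for illustration, not the paper's implementation.

   import threading

   # Shared state: the global model parameters and a lock guarding them.
   global_params = [0.0, 0.0]
   params_lock = threading.Lock()

   def apply_client_update(update):
       """Each client thread must hold the lock before touching the shared model,
       mirroring the Lock(L)/Unlock(L) protocol described above."""
       with params_lock:                     # Lock(L): wait until the entry is open
           for i, delta in enumerate(update):
               global_params[i] += delta     # critical section: shared data access
       # leaving the 'with' block releases the lock, i.e. Unlock(L)

   threads = [threading.Thread(target=apply_client_update, args=([0.1 * k, 0.2 * k],))
              for k in range(1, 5)]
   for t in threads:
       t.start()
   for t in threads:
       t.join()

   print("aggregated global parameters:", global_params)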

System Topology

The system topology for distributed federated learning consists of a graph G = (V, E), where V is the set of working nodes (or sequential processes) and E is the set of edges (bidirectional or unidirectional communication channels or links). Figure 1 shows the graph of the proposed model. In the graph, the nodes represent heterogeneous edge or end devices and the edges represent the communication channels. The links between the nodes may be guided or unguided. In the context of cloud computing, nodes may also represent groups or clusters of end devices.

The proposed graph topology for the distributed federated learning approach deploys multiple edge (or end) devices such as mobile phones, smart sensors, etc. Each end device executes a process, and communication among the processes is performed through message passing. Each edge device has its own model with a local dataset. The end device calculates the local gradients of the loss function based on its local dataset. At the cloud end, these gradients are aggregated and updated to optimize the model.
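A small sketch that encodes such a topology G = (V, E) with the networkx library is given below; the device names, layer labels, and links are invented for illustration and do not correspond to Figure 1.

   import networkx as nx

   # Build G = (V, E): nodes are cloud/edge/end devices, edges are channels.
   G = nx.Graph()
   G.add_node("cloud", layer="cloud")
   G.add_nodes_from(["edge-1", "edge-2"], layer="edge")
   G.add_nodes_from(["phone-1", "phone-2", "sensor-1"], layer="end-device")

   # Communication channels (bidirectional links in this sketch).
   G.add_edges_from([
       ("cloud", "edge-1"), ("cloud", "edge-2"),
       ("edge-1", "phone-1"), ("edge-1", "phone-2"),
       ("edge-2", "sensor-1"),
   ])

   # End devices whose local gradients edge-1 would aggregate in one round.
   print(sorted(n for n in G.neighbors("edge-1")
                if G.nodes[n]["layer"] == "end-device"))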

Experimental Setups

The Python programming language is primarily used for building the model, and the R language is deployed for statistical analysis. The open-source TensorFlow framework and the PyCharm environment have been used for building the ML and DL models. AWS platforms, Amazon SageMaker, and the AWS Deep Learning AMIs are used to build, train, and deploy the proposed model. Amazon SageMaker Studio has been used for the development, training, deployment, and monitoring of the proposed FDL approach. An open simulation platform, FLASH, is also applied for designing and implementing the proposed FL model.

Figure 1: A Graph of Edge Devices with Distributed Federated Learning

Conclusions and Future Work

In the proposed model, each participating edge device trains its model with a local dataset. These local datasets are not shared among the edge devices; hence the system preserves privacy-sensitive personal data. Federated learning thus enables collaborative machine learning without centralized training of the data.

Federated learning poses some key problems that have to be resolved: one is the communication cost, and the other is the unreliability of end devices, which need not necessarily participate in the FL process. The proposed line of work opens options for further research in the direction of data security and privacy of personal data.

Existing FL systems do not adequately address data heterogeneity and device heterogeneity. Heterogeneity reduces the convergence rate of FL, and handling it remains one of the core challenges in designing the FL model.