Big data analytics is now central to business operations in competitive markets. It gives organizations a powerful tool for analyzing large volumes of unstructured data to support decision-making. Result quality, time, and price are critical to the success of big data analytics, and selecting appropriate cloud infrastructure at both the coarse-grained and fine-grained levels yields better results. This paper proposes a global architecture for QoS-based scheduling of big data applications across distributed cloud datacenters at two levels: coarse-grained and fine-grained. At the coarse-grained level, an adaptive k-nearest neighbor algorithm selects an appropriate local datacenter based on the network distance between user and datacenter, network throughput, and total available resources. At the fine-grained level, a naive Bayes algorithm predicts a probability triplet (C, I, M), which gives the probability that a new application falls into the compute-intensive (C), input/output-intensive (I), and memory-intensive (M) categories. Using self-organizing maps, each datacenter is transformed into a pool of virtual clusters, each capable of executing a specific category of jobs with specific (C, I, M) requirements. The novelty of the study lies in representing all datacenter resources in a predefined topological ordering and executing incoming jobs in their respective predefined virtual clusters according to their QoS requirements. The proposed architecture is tested on three different Amazon EMR datacenters for resource utilization, waiting time, availability, response time, and estimated time to complete. The results indicate better QoS achievement and a 33.15% cost gain for the proposed architecture over traditional Amazon methods.
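To make the fine-grained step more concrete, the following is a minimal sketch of how a naive Bayes classifier can yield a probability triplet (C, I, M) for a new job. It is an illustration under stated assumptions, not the paper's implementation: the three features (CPU utilization, I/O rate, memory utilization), the toy training data, the Gaussian likelihood, and the names `fit` and `predict_triplet` are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical training data (not from the paper): each past job is described by
# (cpu_util, io_rate, mem_util) and labeled C, I, or M.
TRAINING = [
    ((0.90, 0.10, 0.20), "C"),
    ((0.85, 0.15, 0.30), "C"),
    ((0.80, 0.20, 0.25), "C"),
    ((0.20, 0.90, 0.30), "I"),
    ((0.25, 0.80, 0.20), "I"),
    ((0.15, 0.85, 0.35), "I"),
    ((0.30, 0.20, 0.90), "M"),
    ((0.20, 0.25, 0.85), "M"),
    ((0.25, 0.15, 0.80), "M"),
]


def fit(samples):
    """Estimate a class prior and per-feature (mean, variance) for each label."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        prior = len(rows) / len(samples)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # A small variance floor (an assumption) avoids division by zero.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-3
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict_triplet(model, features):
    """Return normalized posterior probabilities in the order (C, I, M)."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(features, stats):
            # Log of the Gaussian density; features are treated as independent.
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        log_scores[label] = score
    # Subtract the largest log score before exponentiating for numerical stability.
    peak = max(log_scores.values())
    weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
    total = sum(weights.values())
    return tuple(weights.get(label, 0.0) / total for label in ("C", "I", "M"))


model = fit(TRAINING)
triplet = predict_triplet(model, (0.88, 0.12, 0.22))
print(triplet)
```

In this sketch, the largest component of the triplet marks the dominant category of the job. A scheduler following the abstract's description could then route the job to a virtual cluster whose (C, I, M) profile matches.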
Full paper: Big Data QoS, Cluster Computing