Introduction to Data Science - Unit 1 - Topic 5: BIG DATA ECOSYSTEM AND DATA SCIENCE

 

BIG DATA ECOSYSTEM AND DATA SCIENCE

Many big data tools and frameworks exist today, and it's easy to get lost because new technologies appear rapidly. It becomes much easier once you realize that the big data ecosystem can be grouped into classes of technologies with similar goals and functionality, which we'll discuss in this section. Data scientists use many different technologies, but not all of them; we'll dedicate a separate chapter to the most important data science technology classes.

1.     Distributed file systems

A distributed file system is similar to a normal file system, except that it runs on multiple servers at once. Because it's a file system, you can do almost all the same things you'd do on a normal file system. Actions such as storing, reading, and deleting files and adding security to files are at the core of every file system, including distributed ones. Distributed file systems have significant advantages (a short usage sketch follows the list):

•  They can store files larger than any one computer disk.

•  Files are automatically replicated across multiple servers for redundancy or parallel operations, while the complexity of doing so is hidden from the user.

•  The system scales easily: you're no longer bound by the memory or storage restrictions of a single server.
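For concreteness, here is a minimal sketch of everyday file operations against HDFS using the third-party Python hdfs (WebHDFS) client; the namenode address, user, and paths are placeholders, and the hdfs dfs command-line tool offers the same operations.

```python
# A minimal sketch of basic file operations on HDFS via the third-party
# "hdfs" WebHDFS client. The namenode address, user, and paths are
# placeholders for whatever your cluster actually uses.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="datalab")

# Store a local file on the distributed file system.
client.upload("/data/raw/events.csv", "events.csv")

# Read it back, just as you would from a local file.
with client.read("/data/raw/events.csv") as reader:
    first_bytes = reader.read(1024)

# List and delete files -- the usual file system actions.
print(client.list("/data/raw"))
client.delete("/data/raw/events.csv")
```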

2.     Distributed programming framework

Once you have the data stored on the distributed file system, you want to exploit it. One important aspect of working on a distributed disk is that you won't move your data to your program; rather, you'll move your program to the data. A distributed programming framework takes care of splitting a job into tasks and running each task on the node that holds the relevant piece of data, as the word-count sketch below illustrates.
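The following PySpark sketch sends a small program to the cluster and lets it run where the data blocks live; the HDFS paths and application name are placeholders.

```python
# A minimal word-count sketch in PySpark: the job is shipped to the nodes
# that hold the data blocks, so the data itself never leaves the cluster.
# The HDFS paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/raw/books/*.txt")
counts = (lines.flatMap(lambda line: line.split())   # runs where the data lives
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # results shuffled across nodes
counts.saveAsTextFile("hdfs:///data/out/wordcount")

spark.stop()
```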

3.     Data integration framework

Once you have a distributed file system in place, you need to add data. You need to move data from one source to another, and this is where data integration frameworks such as Apache Sqoop and Apache Flume excel. The process is similar to an extract, transform, and load (ETL) process in a traditional data warehouse.
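Sqoop and Flume are command-line tools, so the sketch below only illustrates the same extract-transform-load idea in plain Python with the standard library; the database file, table, and column names are made up for illustration.

```python
# A plain-Python sketch of the extract-transform-load idea behind tools like
# Sqoop: pull rows out of a relational source, reshape them, and land them as
# flat files ready for the distributed file system. All names are illustrative.
import csv
import sqlite3

# Extract: read rows from the source database.
conn = sqlite3.connect("webshop.db")
rows = conn.execute("SELECT customer_id, country, total_spent FROM orders")

# Transform: normalize the country field and convert cents to euros.
cleaned = ((cid, country.strip().upper(), cents / 100.0)
           for cid, country, cents in rows)

# Load: write the result to a file that can then be copied onto HDFS.
with open("orders_clean.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "country", "total_spent_eur"])
    writer.writerows(cleaned)

conn.close()
```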



4.     Machine learning frameworks

When you have the data in place, it’s time to extract the coveted insights. This is where you rely on the fields of machine learning, statistics, and applied mathematics.
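As a small illustration of what a machine learning framework gives you, here is a minimal scikit-learn sketch using its bundled iris dataset; the choice of model is arbitrary.

```python
# A minimal scikit-learn sketch: once the data is in place, training a model
# follows the same fit/predict pattern regardless of the algorithm chosen.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                      # learn from the training split
print("accuracy:", model.score(X_test, y_test))  # evaluate on held-out data
```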

5.     NoSQL databases

If you need to store huge amounts of data, you require software that's specialized in managing and querying this data. Traditionally this has been the playing field of relational databases such as Oracle SQL, MySQL, Sybase IQ, and others. NoSQL databases give up some of the strict guarantees of the relational model in exchange for easier horizontal scaling and more flexible data models, which makes them a better fit for many big data workloads.
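As a small illustration, the sketch below stores and queries JSON-like documents with pymongo, assuming a MongoDB instance is running locally; the database, collection, and fields are placeholders.

```python
# A minimal sketch of storing and querying documents in a NoSQL database,
# using pymongo against an assumed local MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["webshop"]

# Documents are schemaless JSON-like objects, not rows in a fixed table.
db.customers.insert_one({
    "customer_id": 42,
    "name": "Ada",
    "orders": [{"sku": "A-100", "qty": 3}, {"sku": "B-200", "qty": 1}],
})

# Query by field, including fields nested inside the document.
print(db.customers.find_one({"orders.sku": "A-100"}))
```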

 

 

6.     Scheduling tools

Scheduling tools help you automate repetitive tasks and trigger jobs based on events, such as adding a new file to a folder. They're similar to tools such as cron on Linux but are developed specifically for big data. You can use them, for instance, to start a MapReduce task whenever a new dataset is available in a directory.
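The bare-bones sketch below only illustrates that trigger-on-new-file idea in plain Python; dedicated big data schedulers such as Apache Oozie express the same thing declaratively and far more robustly. The watched directory and the submitted command are placeholders.

```python
# A bare-bones sketch of "trigger a job when a new file arrives". The watched
# directory and the spark-submit command are placeholders.
import subprocess
import time
from pathlib import Path

watched = Path("/data/incoming")
seen = {p.name for p in watched.iterdir()}

while True:
    current = {p.name for p in watched.iterdir()}
    for new_file in sorted(current - seen):
        # Kick off the processing job for the newly arrived dataset.
        subprocess.run(["spark-submit", "wordcount.py", str(watched / new_file)])
    seen = current
    time.sleep(60)  # poll once a minute
```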

7.     Benchmarking tools

This class of tools was developed to optimize your big data installation by providing standardized profiling suites. A profiling suite is built from a representative set of big data jobs, so you can measure how your cluster performs and compare different configurations.
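The toy sketch below shows the core idea behind such suites, namely timing repeated runs of a representative job so that different configurations can be compared; the job itself is a stand-in function.

```python
# A toy sketch of what a benchmarking suite does at its core: run a
# representative job several times and record how long it takes.
import statistics
import time

def representative_job():
    # Stand-in for a real workload such as a sort or word-count job.
    sum(i * i for i in range(1_000_000))

timings = []
for _ in range(5):
    start = time.perf_counter()
    representative_job()
    timings.append(time.perf_counter() - start)

print(f"median runtime: {statistics.median(timings):.3f}s over {len(timings)} runs")
```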

8. System deployment

Setting up a big data infrastructure isn't an easy task. Assisting engineers in deploying new applications into the big data cluster is where system deployment tools shine: they largely automate the installation and configuration of big data components.
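The toy sketch below hints at what that automation looks like: the same installation and configuration steps applied to every node. The hostnames, the package name, and passwordless SSH access are all assumptions.

```python
# A toy sketch of the automation that deployment tools provide: install and
# configure the same component on every node of the cluster. Hostnames,
# package name, and passwordless SSH access are assumptions.
import subprocess

nodes = ["node01.example.com", "node02.example.com", "node03.example.com"]

for node in nodes:
    # Install the component and push a common configuration file to each node.
    subprocess.run(["ssh", node, "sudo apt-get install -y some-bigdata-component"], check=True)
    subprocess.run(["scp", "cluster.conf", f"{node}:/etc/some-bigdata-component/"], check=True)
```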

9. Service programming

Suppose that you've made a world-class soccer prediction application on Hadoop and you want to allow others to use its predictions. Service programming tools let you expose your application to other applications as a service, for example through a REST interface, without the consumers needing to know anything about the underlying big data infrastructure.
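One common approach is to wrap the predictions behind a small REST service so that other applications never touch the cluster directly. The sketch below assumes Flask, and predict_winner() is a stand-in for the real model.

```python
# A minimal sketch of exposing predictions as a service: other applications
# call a small REST endpoint instead of the Hadoop cluster itself.
# Flask is assumed; predict_winner() stands in for the real model.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_winner(home: str, away: str) -> str:
    # Placeholder for the prediction produced by the big data pipeline.
    return home if len(home) >= len(away) else away

@app.route("/predict")
def predict():
    home = request.args.get("home", "")
    away = request.args.get("away", "")
    return jsonify({"home": home, "away": away,
                    "predicted_winner": predict_winner(home, away)})

if __name__ == "__main__":
    app.run(port=5000)
```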

10. Security

Big data security tools allow you to have central and fine-grained control over access to the data. Big data security has become a topic in its own right, and data scientists are usually only confronted with it as data consumers.
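The toy sketch below illustrates only the fine-grained access control idea, checking every read against a central policy; the roles and dataset names are made up.

```python
# A toy sketch of fine-grained access control: every read is checked against
# a central policy before data is returned. Roles and datasets are made up.
policies = {
    "sales_analyst": {"orders", "customers"},
    "marketing_analyst": {"campaigns"},
}

def can_read(role: str, dataset: str) -> bool:
    return dataset in policies.get(role, set())

def read_dataset(role: str, dataset: str) -> str:
    if not can_read(role, dataset):
        raise PermissionError(f"role '{role}' may not read '{dataset}'")
    return f"...contents of {dataset}..."

print(read_dataset("sales_analyst", "orders"))        # allowed
try:
    read_dataset("marketing_analyst", "orders")       # denied by policy
except PermissionError as err:
    print("denied:", err)
```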
