Introduction to Data Science - Unit 1 - Topic 5: BIG DATA ECOSYSTEM AND DATA SCIENCE
Currently many big data tools and frameworks exist, and it’s easy to get lost because new technologies appear rapidly. It’s much easier once you realize that the big data ecosystem can be grouped into technologies that have similar goals and functionalities, which we’ll discuss in this section. Data scientists use many different technologies, but not all of them; we’ll dedicate a separate chapter to the most important data science technology classes.
1. Distributed file systems
A distributed file system is similar to a normal file system, except that it runs on multiple servers at once. Because it’s a file system, you can do almost all the same things you’d do on a normal file system. Actions such as storing, reading, and deleting files and adding security to files are at the core of every file system, including the distributed one (a short sketch of these actions follows the list below). Distributed file systems have significant advantages:
• They can store files larger than any one computer disk.
• Files get automatically replicated across multiple servers for redundancy or parallel operations, while hiding the complexity of doing so from the user.
• The system scales easily: you’re no longer bound by the memory or storage restrictions of a single server.
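To make this concrete, here is a minimal sketch of those basic actions (store, read, delete) against HDFS, a widely used distributed file system, through the pyarrow library. The namenode host, port, and file path are hypothetical placeholders, not values from this text.

    # Minimal sketch: basic file operations on HDFS via pyarrow.
    # "namenode", port 8020, and the path are hypothetical placeholders.
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

    # Store a file, just as you would on a local file system.
    with hdfs.open_output_stream("/data/example.txt") as f:
        f.write(b"hello distributed world\n")

    # Read it back; replication across servers stays hidden from us.
    with hdfs.open_input_stream("/data/example.txt") as f:
        print(f.read())

    # Delete the file.
    hdfs.delete_file("/data/example.txt")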
2. Distributed programming framework
• Once you have the data stored on the distributed file system, you want to exploit it.
• One important aspect of working on a distributed hard disk is that you won’t move your data to your program, but rather you’ll move your program to the data (see the sketch after this list).
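The sketch below illustrates this with PySpark, one common distributed programming framework (the HDFS path is a hypothetical placeholder): the lambda functions are shipped to the nodes that hold the data blocks, and only a small result is pulled back.

    # Minimal PySpark sketch: the program travels to the data.
    # The HDFS path is a hypothetical placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///data/logs/access.log")

    # These functions execute on the cluster nodes where the blocks live.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))  # bring only a small sample back to the driver
    spark.stop()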
3. Data integration framework
Once you have a distributed file system in place, you need to add data. You need to move data from one source to another, and this is where data integration frameworks such as Apache Sqoop and Apache Flume excel. The process is similar to an extract, transform, and load process in a traditional data warehouse.
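For instance, a single Sqoop import can copy a relational table onto the distributed file system. The sketch below wraps that command in Python; it assumes sqoop is installed, and the connection string, table name, and target directory are hypothetical placeholders.

    # Minimal sketch: load a relational table into HDFS with Sqoop.
    # Connection string, table, and target directory are hypothetical.
    import subprocess

    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/sales",
            "--table", "orders",
            "--target-dir", "/data/orders",
        ],
        check=True,  # raise an error if the import job fails
    )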
4. Machine learning frameworks
When you have the data in place, it’s time to extract the coveted insights. This is where you rely on the fields of machine learning, statistics, and applied mathematics.
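As a small illustration, the sketch below trains and evaluates a classifier with scikit-learn, one popular machine learning library; the library choice and the toy dataset are assumptions for the example, not part of this text.

    # Minimal sketch: fit and score a model with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    print("test accuracy:", model.score(X_test, y_test))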
5. NoSQL databases
If you need to store huge amounts of data, you require software that’s specialized in managing and querying this data. Traditionally this has been the playing field of relational databases such as Oracle SQL, MySQL, Sybase IQ, and others. The volume and often unstructured nature of big data pushed the development of NoSQL databases, which relax the strict relational model in exchange for easier scaling across many machines and more flexible data models.
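The sketch below stores and retrieves a schemaless document with MongoDB, one well-known NoSQL database, via the pymongo driver; the connection URL, database, and collection names are hypothetical placeholders.

    # Minimal sketch: store and query a document in MongoDB.
    # Connection URL, database, and collection names are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["demo"]["events"]

    # No fixed schema: each document can carry its own fields.
    events.insert_one({"user": "alice", "action": "login", "device": "mobile"})

    print(events.find_one({"user": "alice"}))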
6. Scheduling tools
Scheduling tools help you automate repetitive tasks and trigger jobs based on events such as adding a new file to a folder. They are similar to tools such as cron on Linux but are developed specifically for big data. You can use them, for instance, to start a MapReduce task whenever a new dataset is available in a directory.
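A dedicated big data scheduler such as Apache Oozie does far more, but the core idea fits in a few lines of standard-library Python: watch a directory and trigger a job when a new file arrives. The watched directory, polling interval, and launched command are hypothetical placeholders.

    # Minimal sketch: trigger a job whenever a new file lands in a folder.
    # The watched directory and the launched command are hypothetical.
    import os
    import subprocess
    import time

    WATCH_DIR = "/data/incoming"
    seen = set(os.listdir(WATCH_DIR))

    while True:
        current = set(os.listdir(WATCH_DIR))
        for new_file in sorted(current - seen):
            path = os.path.join(WATCH_DIR, new_file)
            # Kick off the processing job for the newly arrived dataset.
            subprocess.run(["python", "process_dataset.py", path], check=True)
        seen = current
        time.sleep(30)  # poll every 30 seconds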
7. Benchmarking tools
This class of tools was developed to optimize your big data installation by providing standardized profiling suites. A profiling suite consists of a representative set of big data jobs.
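In spirit, a benchmark runs the same representative jobs against an installation and records timings, so that configurations can be compared. The toy harness below shows that shape; the job names and commands are hypothetical placeholders, and a real suite would be far more thorough.

    # Toy sketch of a profiling suite: time a set of representative jobs.
    # The job commands are hypothetical placeholders.
    import subprocess
    import time

    JOBS = {
        "sort":      ["python", "jobs/sort_job.py"],
        "wordcount": ["python", "jobs/wordcount_job.py"],
    }

    for name, cmd in JOBS.items():
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        elapsed = time.perf_counter() - start
        print(f"{name}: {elapsed:.1f}s")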
8. System deployment
Setting up a big data infrastructure isn’t an easy task, and assisting engineers in deploying new applications into the big data cluster is where system deployment tools shine. They largely automate the installation and configuration of big data components.
9. Service programming
Suppose that you’ve made a world-class soccer prediction application on Hadoop, and you want to allow others to use its predictions. However, you have no idea about the architecture or technology of everyone keen on using your predictions. Service programming tools solve this by exposing your big data application to other applications as a service, for example through a REST interface.
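The sketch below shows what such a service interface could look like, using Flask (an assumption for the example; any web framework would do) and a stand-in predict function in place of the real Hadoop-backed model.

    # Minimal sketch: expose predictions as an HTTP service with Flask.
    # The predict() logic and the route are hypothetical placeholders.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def predict(home_team, away_team):
        # Stand-in for the real model trained on the cluster.
        return {"home_team": home_team, "away_team": away_team,
                "home_win_probability": 0.5}

    @app.route("/predict", methods=["POST"])
    def predict_endpoint():
        body = request.get_json()
        return jsonify(predict(body["home_team"], body["away_team"]))

    if __name__ == "__main__":
        app.run(port=5000)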
10. Security
Big data security tools allow you to have central and fine-grained control over access to the data. Big data security has become a topic in its own right, and data scientists are usually only confronted with it as data consumers.