You have completed Introduction to Big Data!
How do you get insight from all this data?
Learn More
- Apache Hadoop homepage
- Wikipedia has a great history of the Hadoop project and much of the Big Data ecosystem that formed on top of Hadoop
- Apache Spark homepage
- Apache Spark Documentation
- Apache Solr Quickstart
- Apache Lucene homepage
- Elasticsearch in 5 minutes
- Getting started with TensorFlow
- Getting started with Scikit-Learn
- Machine Learning on Spark with MLlib
- More ML on Spark Tutorial
Storing data is only part of the process.
0:00
Often, you'll need to get
insight out of that data or
0:03
process it at lightning
speeds in order to keep up.
0:05
Let's discuss three of the main
computational use cases for big data.
0:08
When we talk about
generalized data processing,
0:12
we're talking about being fed data and
running a computation over it.
0:15
It's typically fed through a stream,
and the processing could be as simple
0:19
as counting, taking the variance of data,
or some other custom algorithm.
0:22
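To make that concrete, here is a toy sketch in plain Python (not an actual Hadoop or Spark job; real engines run this kind of logic in parallel across a cluster) of a single-pass count-and-variance computation over a stream of numbers, using Welford's online algorithm:

```python
# A toy sketch of generalized data processing in plain Python.
# Real big data engines (Hadoop, Spark) distribute this kind of
# computation across many machines; the math is the same.

def count_and_variance(stream):
    """Single-pass count, mean, and population variance (Welford's algorithm)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return n, mean, variance

n, mean, var = count_and_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(n, mean, var)  # n=8, mean ~5.0, population variance ~4.0
```

Because the stream is consumed one value at a time, nothing needs to fit in memory at once, which is the same property that lets big data engines process data far larger than any single machine.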
Generalized data processing
forms the foundation
0:27
of most of what you'll encounter if you
work with big data systems and problems.
0:30
These systems provide APIs, or
application programming interfaces,
0:35
which are specific documented functions
that are available for you to use.
0:39
These functions are robust
enough to solve nearly
0:43
any problem with data
in nearly any format.
0:46
Apache Hadoop is the most popular
generalized data processing engine
0:49
in the big data world today.
0:54
Hadoop is based on a system originally
developed at Google to rank webpages of
0:56
the entire Internet.
1:00
Apache Spark is a newer player in
the game, and it's built on top of Hadoop.
1:01
Spark brings lightning-fast
speed not available in Hadoop
1:06
through largely in-memory processing and
a simpler API.
1:09
Spark also has the ability to handle
streaming data, work with graphs, and
1:14
query structured data with SQL.
1:18
It also has a machine learning component.
1:21
Nearly every company that deals
with big data uses Hadoop or
1:25
Spark at some place in their stack.
1:28
Hadoop and Spark are both backed by HDFS.
1:31
Remember, that stands for
Hadoop Distributed File System.
1:34
And therefore, it can scale to tens of
thousands of machines in a single cluster.
1:37
Okay, so let's move on to our next
computational use case, search.
1:42
Often, you'll need to find some piece of
data within all the data that you have, so
1:47
that you can display it to a user.
1:51
Now to find that relevant data, all of
your internal data has to be stored in
1:54
a way that can be quickly retrieved, and
surfaced to the application asking for it.
1:58
This turns out to be such
a difficult problem at scale
2:03
that there are major tools built just for
this.
2:06
So which tools handle search at scale?
2:10
Popular tools here include Solr and
Lucene, both of which are Apache projects.
2:15
You've probably noticed a bunch of these
big data projects are part of Apache.
2:21
Check the teacher's notes for more.
2:25
Lucene is the full-text search tool
that Solr uses to provide more advanced
2:27
searching features.
2:32
Full text searching involves breaking
your content up into search terms so
2:34
that it can ignore differences in tense,
or other differences,
2:37
like the singular book
versus the plural books.
2:41
Users of Solr and Lucene include Netflix,
DuckDuckGo, Instagram, AOL, and Twitter.
2:45
Elasticsearch is another popular
open source search tool and
2:53
it is an alternative choice to Lucene and
Solr.
2:56
These systems take data from different
storage layers, like HDFS, then
3:00
index the data on the disk, and finally
provide APIs for front-end clients that
3:04
hook into the search engine and perform
full-text searches on that indexed data.
3:10
The next computational use case that we
are going to look at is machine learning.
3:14
You can think of this as training
computers to recognize patterns.
3:19
Now this is done through statistical
analysis and more complex algorithms.
3:23
Check the teacher's notes for more.
3:27
TensorFlow and
3:29
scikit-learn are two of the most popular
machine learning frameworks available.
3:30
Machine learning can be used for
a plethora of applications.
3:35
Some examples where it's used are in
recommending products or services.
3:38
It can also be used to detect and
prevent financial or ad fraud.
3:42
The magic behind self-driving cars
operating on crowded roads is greatly
3:46
powered by machine learning.
3:50
TensorFlow is a Google open source
project that allows users to build
3:52
complex data flow graphs that can perform
a wide variety of machine learning tasks.
3:56
Since TensorFlow can be rather complex,
4:01
beginners are usually advised to
start exploring with scikit-learn.
4:04
Scikit-learn is a Python-based
framework that is very approachable.
4:08
It offers a wide-ranging set of features for
all kinds of machine learning.
4:11
So that just about wraps up
the domain of computations.
4:16
Let's take a deeper look at our
final domain, infrastructure,
4:18
right after this quick break.
4:22