Well done!
      You have completed Scraping Data From the Web!
  Let's take a brief look at how an HTML page is structured so we can better understand how to navigate a page for web scraping.
- Horse Land web site
- Horse Land site source code
                      Before we jump into Python and
start wrangling data from a web page,
                      0:00
                    
                    
                      I think it will be helpful to revisit
what a web page looks like in code.
                      0:03
                    
                    
                      How is a web page structured,
or more specifically,
                      0:07
                    
                    
                      how should a web page be structured?
                      0:10
                    
                    
                      In your journey with web scraping,
you'll likely come across a site or
                      0:13
                    
                    
                      two where you ask, hold your horses,
why aren't any of the tags closed?
                      0:17
                    
                    
                      Or, seriously,
there are five h1 tags on this page?
                      0:21
                    
                    
                      If we look at how an HTML page
should be structured, it starts and
                      0:26
                    
                    
                      ends with an opening and closing html tag.
                      0:30
                    
                    
                      Inside the html tag,
we have a head section which has tags for
                      0:33
                    
                    
                      metadata about the page and other
essential information for the document.
                      0:38
                    
                    
                      The page title will be
found in here as well.
                      0:41
                    
                    
                      Next, we have the body section where
the content of the page is found.
                      0:44
                    
                    
                      Inside here is where we'll do
the majority of our scraping.
                      0:48
                    
                    
                      Things like heading tags,
div, paragraph, anchor, and
                      0:51
                    
                    
                      form elements will reside inside here.
                      0:56
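The structure just described can be sketched in code. This is only an illustration: the sample page below is hypothetical (it is not the Horse Land markup), and it uses Python's standard-library `html.parser` rather than Beautiful Soup so that it runs without installing anything. The `OutlineParser` name and the `VOID_ELEMENTS` set are assumptions made up for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical minimal page; the real Horse Land markup is not reproduced here.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Sample Page</title>
  </head>
  <body>
    <h1>A heading</h1>
    <div>
      <p>A paragraph with <a href="#top">an anchor</a>.</p>
    </div>
  </body>
</html>"""

# Void elements never take a closing tag, so they must not change the depth.
VOID_ELEMENTS = {"meta", "link", "img", "br", "hr", "input"}


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_PAGE)

# Print an indented outline: html contains head and body, and so on.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The outline shows `head` and `body` nested directly inside `html`, with the title under `head` and the heading, div, paragraph, and anchor under `body`. The depth tracking relies on every non-void tag being closed, which holds for this sample but not for every page.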
                    
                    
                      I mentioned that structure
is how a page should look,
                      0:59
                    
                    
                      but sometimes reality is different.
                      1:02
                    
                    
                      Let's take a look at how
leniently HTML can be written and
                      1:04
                    
                    
                      still look good in the browser.
                      1:07
                    
                    
                      This will point out some
of the challenges and
                      1:09
                    
                    
                      benefits that we can run into
when attempting to scrape a site.
                      1:11
                    
                    
                      Let's take a look at a sample
website that the amazing design team
                      1:16
                    
                    
                      here at Treehouse put together.
                      1:19
                    
                    
                      It's hosted on GitHub Pages,
which is great
                      1:21
                    
                    
                      because it allows us to view the site and
easily see the HTML code.
                      1:24
                    
                    
                      Check the teacher's notes for the link.
                      1:29
                    
                    
                      I'm using the Chrome browser, and
                      1:32
                    
                    
                      if we open up the developer tools
with Option+Cmd+I on a Mac, or
                      1:33
                    
                    
                      Ctrl+Shift+I on Windows,
we can examine the structure of our page.
                      1:37
                    
                    
                      Here at the top,
we see the head section, and
                      1:43
                    
                    
                      can expand that to see that
it contains a few things.
                      1:45
                    
                    
                      There's some metadata, there are links
to our style sheet and fonts, and
                      1:48
                    
                    
                      there it is, our page title.
                      1:52
                    
                    
                      We'll see how to scrape that
information in code here shortly.
                      1:54
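As a rough preview of pulling the page title out of the head section, here is a minimal sketch. It assumes a hypothetical head section (the title text and stylesheet path are invented), and it uses the standard-library `html.parser` as a stand-in, not the Beautiful Soup approach the course goes on to use. `TitleParser` is a name made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical head section; the title text and file paths are invented.
SAMPLE_HEAD = """<html>
<head>
  <meta charset="utf-8">
  <link rel="stylesheet" href="css/style.css">
  <title>Example Horse Page</title>
</head>
<body></body>
</html>"""


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside the title element.
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(SAMPLE_HEAD)
print(parser.title.strip())
```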
                    
                    
                      The body section is where, as I mentioned,
                      1:59
                    
                    
                      we'll find most of the interesting
items we'll want to scrape.
                      2:01
                    
                    
                      We see that we have a few different
div elements that separate the page
                      2:04
                    
                    
                      into different logical components.
                      2:08
                    
                    
                      Such as the graphical header,
there's our featured image,
                      2:11
                    
                    
                      and then down here, there are
the links at the bottom of the page.
                      2:15
                    
                    
                      The main portion of this particular
webpage is the list of horses with
                      2:17
                    
                    
                      the images.
                      2:22
                    
                    
                      We see here in the HTML that they all
reside here in this unordered list
                      2:23
                    
                    
                      section, with the imageGallery ID and
card-wrap class.
                      2:28
                    
                    
                      If we expand this section,
we see a bunch of list items.
                      2:33
                    
                    
                      These look like potential
scraping targets, and
                      2:37
                    
                    
                      we'll explore them more specifically
later in the course.
                      2:39
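To make the idea of list items as scraping targets concrete, here is a small sketch that collects the text of each list item inside the unordered list with the imageGallery ID. Only the `imageGallery` ID and `card-wrap` class come from the video; the list items, image paths, and the second list are invented. It again uses the standard-library `html.parser` as a stand-in for Beautiful Soup, and `GalleryParser` is a hypothetical name.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the imageGallery ID and card-wrap class come from
# the video; the list items and their contents are invented.
SAMPLE_GALLERY = """
<ul id="imageGallery" class="card-wrap">
  <li><img src="img/one.jpg" alt="First horse"><p>First horse</p></li>
  <li><img src="img/two.jpg" alt="Second horse"><p>Second horse</p></li>
</ul>
<ul class="footer-links">
  <li>Not a horse</li>
</ul>
"""


class GalleryParser(HTMLParser):
    """Collect the text of each <li> inside the ul with id="imageGallery"."""

    def __init__(self):
        super().__init__()
        self.in_gallery = False
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and attrs.get("id") == "imageGallery":
            self.in_gallery = True
        elif tag == "li" and self.in_gallery:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False
        elif tag == "ul":
            # The sample has no nested lists, so any </ul> ends the gallery.
            self.in_gallery = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


parser = GalleryParser()
parser.feed(SAMPLE_GALLERY)
print(parser.items)
```

Matching on the ID keeps the list item from the second, unrelated list out of the results, which is the reason to look for identifying attributes before scraping.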
                    
                    
                      One thing I do want to mention here is
that modern web browsers can hide a lot of
                      2:43
                    
                    
                      HTML errors for us.
                      2:48
                    
                    
                      Inline elements such as span, and some
block level elements such as paragraph
                      2:49
                    
                    
                      tags may not be closed in the actual HTML,
but the browser closes them for us.
                      2:55
                    
                    
                      If we take a look here,
we see this paragraph here at the bottom.
                      3:01
                    
                    
                      We see that it has a class of credits,
and there's an opening and closing p tag.
                      3:07
                    
                    
                      However, if we look at the source code for
this file on GitHub,
                      3:11
                    
                    
                      that's down here under index.html.
                      3:15
                    
                    
                      So in here,
we scroll down to the bottom of the page.
                      3:19
                    
                    
                      We see the opening p tag on line 43, but
                      3:23
                    
                    
                      there isn't a closing tag when
this paragraph ends on line 46.
                      3:25
                    
                    
                      In this case, the browser helps us out.
For web scraping tasks,
                      3:30
                    
                    
                      fortunately, HTML doesn't
have to be perfect.
                      3:34
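The unclosed paragraph can be reproduced in a small sketch. A scraper that downloads the raw source sees the markup as written, without the closing tag the browser adds, so the parser has to cope with it. The fragment below is hypothetical (only the credits class comes from the video; the text is invented), and it uses the standard-library `html.parser` as a stand-in for Beautiful Soup. Treating the enclosing closing div tag as the end of the paragraph is a simplifying assumption of this sketch, not a general rule.

```python
from html.parser import HTMLParser

# Hypothetical fragment modeled on the video: a paragraph with a class of
# "credits" that is opened but never closed. The text is invented.
SAMPLE_FOOTER = """<div class="footer">
  <p class="credits">
    Photos from an example source.
    Built for illustration.
</div>"""


class TagLogParser(HTMLParser):
    """Log opening and closing tags and keep the credits paragraph text."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []
        self.credits_text = []
        self.in_credits = False

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)
        if tag == "p" and dict(attrs).get("class") == "credits":
            self.in_credits = True

    def handle_endtag(self, tag):
        self.closed.append(tag)
        # No </p> ever arrives here, so the enclosing </div> ends the paragraph.
        if tag in ("p", "div"):
            self.in_credits = False

    def handle_data(self, data):
        if self.in_credits and data.strip():
            # Collapse line breaks and indentation into single spaces.
            self.credits_text.append(" ".join(data.split()))


parser = TagLogParser()
parser.feed(SAMPLE_FOOTER)
print("opened:", parser.opened)
print("closed:", parser.closed)
print("credits:", " ".join(parser.credits_text))
```

The log shows a `p` that was opened and never closed, yet the paragraph text is still recovered.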
                    
                    
                      With some web page anatomy under our
belts, let's take a quick pit stop before
                      3:37
                    
                    
                      we get started with some scraping tasks
with the Python package, Beautiful Soup.
                      3:42
                    
              