Anqi Shao, MS
Department of Life Sciences Communication
UW-Madison
March 2023
How do you get data for your research?
Are they ...
Q1: What if we want to download a large amount of contents from a website?
To scrape a website, which is usually written in html, we need to first understand the structure of it. Here's an example.
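To see what "structure" means in practice, here is a toy page parsed with BeautifulSoup. The tag names and classes below mirror the ones used later in this demo, but the page itself is invented for illustration.

```python
from bs4 import BeautifulSoup

# A toy HTML page with the same kinds of tags/classes we scrape for below
html = """
<html>
  <body>
    <article>
      <h1 class="entry-title">CRISPR and seeds</h1>
      <a class="td-image-wrap" href="https://example.com/story-1">Read more</a>
    </article>
  </body>
</html>
"""

page = BeautifulSoup(html, "html.parser")
title = page.find("h1", class_="entry-title").text   # text inside the <h1> tag
link = page.find("a", class_="td-image-wrap")["href"]  # href attribute of the <a> tag
print(title)  # CRISPR and seeds
print(link)   # https://example.com/story-1
```

Once you can name the tag and class that wrap the content you want, the rest of scraping is looping over those matches.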
Different approaches to parse html data
#necessary libraries
import requests
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
url = "https://seedworld.com/?s=crispr" #what page are we looking at?
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'}) # make a request to the URL with a specified User-Agent header
webpage = urlopen(req).read() # read the webpage content
soupy = soup(webpage, "html.parser") # parse the HTML content using BeautifulSoup
links = soupy.find_all("a", class_="td-image-wrap") # find all anchor tags with class "td-image-wrap"
link_list = []
#some loops to get all the article links here
for link in links:
    try:
        sub_content_url = link["href"]
    except KeyError:  # if there is no href attribute, set sub_content_url to "NA"
        sub_content_url = "NA"
    link_list.append(sub_content_url)
print("I have found " + str(len(link_list)) + " links in the page you provided.")
I have found 44 links in the page you provided.
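One thing to watch for: sites sometimes emit relative hrefs (e.g. `/some-article/`) rather than full URLs. A standard-library sketch of normalizing them, with illustrative href values:

```python
from urllib.parse import urljoin

base_url = "https://seedworld.com/?s=crispr"
# hrefs as they might appear in a page: one relative, one absolute (illustrative values)
hrefs = [
    "/breaking-myths-surrounding-the-seed-sector/",
    "https://seedworld.com/another-article/",
]

# urljoin resolves relative paths against the base URL and leaves absolute URLs alone
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute[0])  # https://seedworld.com/breaking-myths-surrounding-the-seed-sector/
```

The seedworld.com results happen to use absolute URLs already, so the demo below skips this step.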
link_list_sub = link_list[0:5]  # keep only the first five links for the demo
link  # the last link found by the loop above
'https://seedworld.com/breaking-myths-surrounding-the-seed-sector/'
rows = []  # collect one dict per article, then build the DataFrame in one go
for link in link_list_sub:
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    soupy = soup(webpage, "html.parser")
    header = soupy.find("h1", class_="entry-title")
    header_text = header.text if header else "NA"
    heady = soupy.find('header', class_='td-post-title')
    timer = heady.find("time", class_="entry-date updated td-module-date") if heady else None
    timer_text = timer.text if timer else "NA"
    content = soupy.find("div", class_="td-ss-main-content")
    content_text = content.text if content else "NA"
    rows.append({"header": header_text, "time": timer_text, "content": content_text})

# DataFrame.append was removed in pandas 2.0; building from a list of dicts is the idiomatic replacement
df = pd.DataFrame(rows, columns=["header", "time", "content"])
df.head()
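Scraping is slow and fragile, so it pays to persist the results once you have them. A minimal sketch, using a temporary file path so it runs anywhere (in practice you would pick a real filename):

```python
import os
import tempfile

import pandas as pd

# A one-row stand-in for the scraped DataFrame built above
df = pd.DataFrame({"header": ["Example headline"],
                   "time": ["March 1, 2023"],
                   "content": ["Body text..."]})

# Write to CSV so you scrape once and analyze many times
path = os.path.join(tempfile.gettempdir(), "scraped_articles.csv")
df.to_csv(path, index=False)

print(pd.read_csv(path).shape)  # (1, 3)
```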
Not going to demo Selenium today, for two reasons:
APIs became popular
Anti-scraping challenges
What is an API?
-- Application Programming Interface
"An interface that is ready-to-use and can retrieve pre-packaged information"
Available endpoints
Copyright
Privacy
Strict rate limits
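"Pre-packaged information" usually means JSON: you request an endpoint and get back structured records instead of raw HTML. A sketch of handling such a response; the payload shape here is hypothetical, standing in for what a search endpoint might return.

```python
import json

# A hypothetical JSON payload, shaped like a typical search-API response
sample_response = json.dumps({
    "data": [
        {"id": "1", "text": "CRISPR breakthrough announced"},
        {"id": "2", "text": "New gene-edited wheat variety"},
    ],
    "meta": {"result_count": 2},
})

# Parse the JSON string into Python dicts and lists
payload = json.loads(sample_response)
texts = [item["text"] for item in payload["data"]]
print(payload["meta"]["result_count"])  # 2
```

Compare this with the HTML parsing above: no tag hunting, no class names, just keys the provider documents, which is exactly why APIs became the default data source, until access tightened.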
"You look at some of the conferences we attended, you know, 50% of the social computing papers would be about Twitter and sometimes even more, because that was the data that we had access to."
--- Dr. Kate Starbird, University of Washington
"You can't buy a picture book with 1 million pictures of giraffes."
-- Dr. Stuart Russell, OBE, UC Berkeley
anqi.shao@wisc.edu
anqishao.com