Implementation and Ethics of Automatic Online Data Collection¶

Anqi Shao, MS

Department of Life Sciences Communication

UW-Madison

March 2023

Introduction¶

Three basic methodologies of quantitative communication research¶

  • Survey
  • Experiment
  • Content Analysis

How do you get data for your research?

Are they ...

  • Available?
  • Accessible?
  • Representative?
  • Valid?

Outline¶

  • Our objectives today
  • Web scrapers based on HTML parsing
  • Retrieving social media content via APIs
  • Here's the dataset, but what's next?

Web scrapers based on HTML parsing¶

  • What is HTML
  • Browser -> right click -> inspect
  • Different approaches to parsing HTML data
  • 5 minute demo

old scraper.png

Q1: What if we want to download a large amount of content from a website?

To scrape a website, which is usually written in HTML, we first need to understand its structure. Here's an example.

html-1.png

html-2.png

html-2.jpg
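In case the screenshots above don't render, here is a minimal, self-contained sketch of the kind of structure the demo below relies on: nested tags carrying attributes such as class, which a parser can use to pick out specific elements. The inline HTML string is made up for illustration; only the class name mirrors the one used in the demo.

In [ ]:
# a minimal sketch: parsing nested tags by class from an inline HTML string,
# so it runs without any network access (the HTML itself is made up)
from bs4 import BeautifulSoup

sample_html = """
<html>
  <body>
    <a class="td-image-wrap" href="https://example.com/article-1">Article 1</a>
    <a class="td-image-wrap" href="https://example.com/article-2">Article 2</a>
  </body>
</html>
"""

page = BeautifulSoup(sample_html, "html.parser")
for link in page.find_all("a", class_="td-image-wrap"): # select by tag + class
    print(link["href"]) # pull out the attribute we need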

Different approaches to parsing HTML data

static scraper.png

Short demo with BeautifulSoup¶

In [ ]:
# necessary libraries
import requests # HTTP requests (handy for APIs; the scraper below uses urllib instead)
from urllib.request import Request, urlopen # build and send the page request
from bs4 import BeautifulSoup as soup # parse the returned HTML
import pandas as pd # store the scraped articles in a dataframe
In [ ]:
url = "https://seedworld.com/?s=crispr" #what page are we looking at?

req = Request(url , headers={'User-Agent': 'Mozilla/5.0'}) # make a request to the URL with a specified User-Agent header

webpage = urlopen(req).read() # read the webpage content             
soupy = soup(webpage, "html.parser") # parse the HTML content using BeautifulSoup

links = soupy.find_all("a", class_="td-image-wrap") # find all anchor tags with class "td-image-wrap"

link_list = []
# loop over the anchors and collect each article link
for link in links:
  try:
    sub_content_url = link["href"]
  except KeyError: # if there is no href attribute, record "NA"
    sub_content_url = "NA"
  link_list.append(sub_content_url)

print("I have found " + str(len(link_list)) + " links in the page you provided.")
I have found 44 links in the page you provided.
In [ ]:
link_list_sub = link_list[0:5] # keep only the first five links for the demo
In [ ]:
link
Out[ ]:
'https://seedworld.com/breaking-myths-surrounding-the-seed-sector/'
In [ ]:
rows = [] # collect one dict per article, then build the dataframe at the end

for link in link_list_sub:
  req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
  webpage = urlopen(req).read()
  soupy = soup(webpage, "html.parser")

  header = soupy.find("h1", class_="entry-title") # article headline
  header_text = header.text

  heady = soupy.find('header', class_='td-post-title') # header block that holds the date
  timer = heady.find("time", class_="entry-date updated td-module-date")
  timer_text = timer.text

  content = soupy.find("div", class_="td-ss-main-content") # main article body
  content_text = content.text

  rows.append({"header": header_text, "time": timer_text, "content": content_text})

df = pd.DataFrame(rows, columns=["header", "time", "content"]) # DataFrame.append is deprecated, so build from a list of rows instead
In [ ]:
df.head()

Simulating scrolling behavior with the Selenium package¶

selenium.jpg

  • Not going to demo it today (a rough sketch follows below)

  • APIs became popular before I dove deeper into Selenium

  • Online data retrieval is no longer a free-for-all
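Though not demoed today, here is a rough sketch of what simulated scrolling with Selenium can look like; the URL, scroll count, and pauses are placeholders, and a compatible browser driver (e.g., chromedriver) is assumed to be installed.

In [ ]:
# a rough sketch of simulated scrolling with Selenium (placeholders, not today's demo)
import time
from selenium import webdriver

driver = webdriver.Chrome() # assumes chromedriver is installed and on PATH
driver.get("https://example.com/infinite-feed") # placeholder page with dynamic content

for _ in range(5): # scroll a few times to trigger lazy loading
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2) # give the page time to render new items

html = driver.page_source # the rendered HTML can now go to BeautifulSoup as before
driver.quit()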

Anti-scraping challenges

  • CAPTCHA checks (a.k.a. your free labeling work for AI training)
  • IP address blocking
  • JavaScript challenges (dynamically rendered content)
  • Legal action against scrapers

Retrieving social media content via APIs¶

  • What is an API?
  • Social media APIs
  • Obtaining API keys
  • Third-party tools

What is an API?

-- Application Programming Interface

bank teller.png

"An interface that is ready-to-use and can retrieve pre-packaged information"

APIs from Twitter¶

twitter-api-1.jpg

Available endpoints

  • Tweets lookup
  • Access user timeline
  • Recent/full archive tweet search
  • Likes lookup
  • Follows lookup
  • ...

https://developer.twitter.com/en/docs/twitter-api/rate-limits

twitter-api-2.png
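As a sketch of what one of these endpoints looks like in practice, here is a minimal recent-search request against the v2 API; it assumes you already have a bearer token from a developer account, and the query features and quota you can actually use depend on your access tier.

In [ ]:
# a minimal sketch of the v2 recent search endpoint (requires a developer bearer token)
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN" # issued through the Twitter developer portal

response = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    params={"query": "crispr -is:retweet lang:en", "max_results": 10},
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
)
tweets = response.json().get("data", []) # a list of tweet objects (id + text by default)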

APIs from META¶

  • Most data-collection APIs remained available until late 2018
  • Currently geared mostly toward business use

not-for-academia-meta-api.png

Third-party tools¶

apify.png

API¶

IRB¶

Got the data, but what's next?¶

  • Rising ethical concerns around (automated) online data collection
  • Even tighter data access for academia
  • You don't really need that much data
  • You really do need that much data

Ethical concerns (as advertised)¶

  • Copyright
  • Privacy

Copyright

art-copyright.png

Privacy

twitter-privacy.png

Data accessibility for researchers¶

api reaper.png

  • Academic API access not working as expected
  • Tighter quota limits on data retrieval
  • Higher costs (as of Feb 2023)

vosoughi.png

Strict rate limits

limit rates.png

""You look at some of the conferences we attended, you know, 50% of the social computing papers would be about Twitter and sometimes even more, because that was the data that we had access to."

--- Dr. Kate Starbird, University of Washington

limit rate-2.png

You don't really need that much data.¶

  • ✅ Larger sample sizes reduce Type II errors (see the quick power sketch below).
  • ❗ A messy, very large social media sample plus simple OLS regressions can still produce trivial or spurious findings.

xyz contagion.png
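As a back-of-the-envelope illustration (the effect size, alpha, and power below are assumptions chosen for the example, not figures from the slide), a standard power calculation shows that even a small effect calls for thousands of observations per group, not millions of posts.

In [ ]:
# a rough illustration: required sample size for a two-sample t-test
# effect size, alpha, and power are assumptions chosen for the example
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.1, alpha=0.05, power=0.8)
print(round(n_per_group)) # roughly 1,600 per group, nowhere near millions of posts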

You really do need that much data.¶

  • For training machine learning models
  • Low data efficiency in current machine learning algorithms

"You can't buy a picture book with 1 million pictures of giraffes."

-- Dr. Stuart Russell, OBE, UC Berkeley

giraffe.jpg

Thank you¶

anqi.shao@wisc.edu

anqishao.com