PHS 7045: Advanced Programming
We will review three technologies to build automatic reports based on web-scraped data: R (the rvest and xml2 packages), Quarto, and GitHub Actions.
Before we proceed, let’s take a look at the specific goal:
Web-scrape PubMed to download a list of the most recent agent-based modeling (ABM) papers.
The resulting report can be viewed at https://github.com/UofUEpiBio/PHS-7045-egga.
We need to extract the information from this website:
Let’s start by looking into the Quarto document used to do so. You can download it from here.
What?
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites – Wikipedia
How?
The rvest R package provides various tools for reading and processing web data. Under the hood, rvest is a wrapper of the xml2 and httr R packages.
(In the case of dynamic websites, take a look at Selenium.)
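To make the wrapper relationship concrete, here is a toy comparison: rvest offers high-level helpers, while the same query can be written with xml2's lower-level functions. (The HTML string below is made up for illustration.)
library(rvest) # re-exports read_html() from xml2

# A made-up one-paragraph document
doc <- read_html("<p class='greeting'>hello</p>")

# rvest: high-level, using a CSS selector
html_text(html_element(doc, "p"))
#> [1] "hello"

# xml2: the lower-level equivalent, using XPath
xml2::xml_text(xml2::xml_find_first(doc, "//p"))
#> [1] "hello"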
We want to directly capture the table of COVID-19 death rates per country from Wikipedia.
library(rvest)
library(xml2)
# Reading the HTML document with the function xml2::read_html
covid <- read_html(
x = "https://en.wikipedia.org/w/index.php?title=COVID-19_pandemic_death_rates_by_country&oldid=1117643862"
)
# Let's see the output
covid
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
Warning
The current version of the Wikipedia document doesn't match the original XPath, so we will skip that example and directly call the function to retrieve the table.
We want to get the HTML table in the doc. To do so, we can use the functions xml2::xml_find_all() and rvest::html_table(). The first locates the part of the document that matches a given XPath expression; the second parses an HTML table into a data frame.
XPath, XML Path Language, is a query language to select nodes in an XML document.
An excellent tutorial can be found here
Modern Web browsers make it easy to use XPath!
Live Example! (inspect elements in Google Chrome, Mozilla Firefox, Internet Explorer, and Safari)
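As a quick illustration of the syntax (the HTML string below is made up):
library(xml2)

# A toy document with two tables
doc <- read_html(
  "<body><table class='wikitable'><tr><td>1</td></tr></table><table></table></body>"
)

# '//table' selects every <table> node anywhere in the document
xml_find_all(doc, "//table")

# Predicates in brackets filter nodes, e.g., by attribute value
xml_find_all(doc, "//table[@class='wikitable']")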
xml2 and the rvest package (cont. 2)
Now that we know what the path is, let's use it to extract the table, as sketched below.
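A minimal sketch of the extraction (the XPath expression here is an assumption; as noted above, Wikipedia's markup changes over time, so it may need adjusting):
# Locate the table node via XPath ...
table_node <- xml_find_all(
  covid, xpath = '//table[contains(@class, "wikitable")]'
)

# ... and parse the first match into a tibble
covid_table <- html_table(table_node[[1]])
head(covid_table)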
# A tibble: 6 × 4
Country `Deaths / million` Deaths Cases
<chr> <chr> <chr> <chr>
1 World[a] 885 7,075,455 776,840,500
2 Peru 6,601 220,975 4,526,977
3 Bulgaria 5,678 38,759 1,338,327
4 North Macedonia 5,428 9,990 352,049
5 Bosnia and Herzegovina 5,118 16,403 403,979
6 Hungary 5,069 49,095 2,236,646
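Scraped tables usually need some cleaning; note that the numeric columns above arrive as character because of the thousands separators. A minimal sketch, assuming the covid_table object and column names from the previous step:
# Strip the commas and coerce to numeric
covid_table$Deaths <- as.numeric(gsub(",", "", covid_table$Deaths))
covid_table$Cases  <- as.numeric(gsub(",", "", covid_table$Cases))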
GitHub Actions
In simple terms: free cloud computing time (2,000 minutes/month) on any OS, whenever you want.
For example, you can use it to render quarto documents, build Docker images, and R CMD check R packages (or any software) on various versions.
The core component of GitHub Actions is the workflow file, stored under .github/workflows:
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
  push:
    branches: [main, master]
  schedule:
    - cron: '0 0 * * 0' # https://crontab.guru/

name: Build it

jobs:
  Build:
    runs-on: ubuntu-latest
    container: rocker/tidyverse:4.2.2
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
      GITHUB_REPO: ${{ github.event.repository.name }}
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      # Installing quarto
      - uses: quarto-dev/quarto-actions/setup@v2
        with:
          version: 0.3.71

      - name: Install packages and render
        run: |
          install2.r xml2 quarto
          quarto render README.qmd

      # There's an error with EndBug, need to use the safe.directory
      # option. More here
      # https://git-scm.com/docs/git-config#Documentation/git-config.txt-safedirectory
      - name: Dealing with GitConfig
        run: |
          git config --global --add safe.directory /__w/${GITHUB_REPO}/${GITHUB_REPO}

      - uses: EndBug/add-and-commit@v9
        with:
          add: README.md
Let's look at it bit by bit.
When the action triggers:
When there's a push to the main or master branches.
And once a week, every Sunday at 00:00 (the cron fields are broken down below).
It runs on the latest version of Ubuntu, but within a container (rocker/tidyverse:4.2.2).
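For reference, cron entries have five fields: minute, hour, day of month, month, and day of week. A comment-only sketch of the schedule entry used above:
schedule:
  # minute=0, hour=0, any day of month, any month, day of week=0 (Sunday)
  - cron: '0 0 * * 0'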
  Build:
    runs-on: ubuntu-latest
    container: rocker/tidyverse:4.2.2
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
      GITHUB_REPO: ${{ github.event.repository.name }}
It also defines two environment variables: GITHUB_PAT and GITHUB_REPO.
The Build job has five steps:
Check out the repository (with the full git history, fetch-depth: 0).
Set up quarto version 0.3.71.
Install the xml2 and quarto R packages and render the README.qmd document.
Adjust git's safe.directory configuration.
Add and commit the rendered README.md back to the repository.
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - uses: quarto-dev/quarto-actions/setup@v2
        with:
          version: 0.3.71

      - name: Install packages and render
        run: |
          install2.r xml2 quarto
          quarto render README.qmd

      - name: Dealing with GitConfig
        run: |
          git config --global --add safe.directory /__w/${GITHUB_REPO}/${GITHUB_REPO}

      - uses: EndBug/add-and-commit@v9
        with:
          add: README.md
Of these five steps, steps 1, 2, and 5 were pulled directly from the GitHub Actions marketplace.
Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. Because all of the containers share the services of a single operating system kernel, they use fewer resources than virtual machines – Wikipedia
rocker/tidyverse comes with many R packages and their apt dependencies already installed, e.g., the tidyverse package, the devtools package, the rmarkdown package, some R Database Interface packages, the data.table package, the fst package, and the Apache Arrow R package. – Source: Rocker Project
https://github.com/UofUEpiBio/PHS-7045-egga