While the powers at be have sworn me to secrecy, I can probably sneak by with letting you know that I’ve spent the last month or two here at Cal prepping for an air quality project in Mongolia’s capital city: Ulaanbaatar (UB).
Now Ulaanbaatar is no ordinary study site– if UB were a stout, it’d be short, cold, and hazy with a dry finish. Its average wintertime temps approach -20°C, mountain walls tower several thousand feet above it, and annual rainfall hovers just below 8 inches. Easily one of the harshest capitals on Earth. To put it more succinctly, UB’s winters are looooong, its precipitation is seldom, and its temperature inversions are frequent. The result is an eight-month heating period during which large quantities of air pollution are created by heavy coal use (heating & electricity gen.) and left suspended in the air.
In fact, the World Health Organization identifies the airshed here as one of the top ten most-polluted on the planet. Recent studies (though they are few) estimate local wintertime particulate (PM2.5) concentrations at up to 7 times the WHO annual guidelines. And the numbers are even greater in the poorer, peri-urban ger regions just outside of downtown proper.
All this to say: UB is an exciting spot for an energy/health wonk like meself.
Annnnnyways, I’ve been just devouring information about Mongolia these past few weeks– I’ll do anything I can to prep for my upcoming excursion into the harshweatherness (and exquisite natural beauty [and amazingly warm culture]) of UB. Which leads me to my first foray into coding. You see, a tweeter by the name of @UB_Air has begun tweeting daily average air quality readings. The actual figures are difficult to parse in a meaningful way from the source feed (an official UB webpage), and so I thought:”well hey, why don’t I try to write some code to index data from this UB_Air feed!? I could analyze it, and manipulate it, and plot it, and it would be sweet.” And so I did. Weeks later and one MedEpi final project later, I’d taught myself enough R code (and cajoled enough friends into helping me) to scrape/manipulate/plot a bit of data.

… turns out somebody’s already figured out how to do this for the official source feed here: http://ubdata.cloudfoundry.com. Dang. (and awesome.) But it was an edifying project nonetheless.
Here’s the code, with a BIG shoutout to David Ruau at Brain Chronicle, whose R code for scraping the US Beijing Embassy’s twitter feed was ueber helpful and the backbone of my code (link to his code/site here):
library(twitteR)
library(ggplot2)
### Download tweets from UB_air Twitter Handle
tweets <- userTimeline(‘UB_air’,n=10000)
length(tweets)
### Create a function that allows you to target only those tweets with air quality data in them
PMGrep <- function(x) {
grep(“PM2.5 =”, x$getText(), value=T)
}
###Isolate contents of listed files in “aq” file
AQ <- unlist(lapply(tweets, PMGrep))
###Exclude redundant tweets. UB_Air tweets simultaneously in English and Mongolian. The result is duplicate reports. To ensure we are not double counting our data, we will exclude those made in Mongolian.
#The simplest way involves merely keeping only those tweets that maintain an English hashtag, #UBAir.
AQ<- AQ[grep("#UBAir", AQ)]
###Then ensure that we are only capturing data with time/date stamps
AQ<- AQ[grep("/[0123456789]“, AQ)]
### Isolate time fragment of tweet as new variable “time”
PM25 <-AQ
time<- sub(“.*(\\d\\d/\\d\\d/\\d\\d\\d\\d \\d\\d:\\d\\d).*”,”\\1″,PM25, perl=T)
### Standardize format of time (posix time)
time <- strptime(time, format=”%m/%d/%Y %R”)
###Isolate PM data
#Isolate PM2.5 readings
PM<-substr(PM25, 1,11)
#Remove non-numeric values
PM <- as.character(sub(“PM2.5 = “, “\\1″, PM, perl=T))
PM <- as.numeric(sub(“[[:alpha:]]$”, “\\1″, PM, perl=T))
###Create a data frame of times and PM levels
data <- data.frame(PM=PM, time=time)
data <- data[order(data$time),]
###Descriptive Statistics
summary(data$PM)
summary(data$time)
###Plot
#Basic Plot
p <- ggplot(data=data, aes(x=time, y=PM, group=1)) +
theme(
panel.background = element_rect(fill = “transparent”, colour = “gray”)
) +
geom_line(colour=”black”, size=0.5) +
geom_point(colour=”black”, size =2, shape=21, fill=”white”) +
ylim(0, max(data$PM))+
xlab(“Time and Day”) + ylab(“PM2.5, 24 Hour Average”)
ggtitle(“24hr Air Quality Averages (PM2.5), Ulaanbaatar”)
p
#Add WHO guidelines
#source: http://www.who.int/mediacentre/factsheets/fs313/en/index.html
pw <- p +
geom_hline(aes(yintercept=25), colour=”dark red”,label=”WHO 24hr Guideline”) +
geom_hline(aes(yintercept=10), linetype =”dashed”,label=”WHO Annual Guideline”, colour=”dark red”) +
ggtitle(“24hr Air Quality Averages (PM2.5), Ulaanbaatar”)+
geom_text (aes(x2,y2, label = texthere),
data.frame(x2=strptime(“11/6/2012 00:00″, format=”%m/%d/%Y %R”), y2=20, texthere=”WHO 24hr Guideline”)) +
geom_text (aes(x2,y2, label = texthere),
data.frame(x2=strptime(“11/29/2012 00:00″, format=”%m/%d/%Y %R”), y2=5, texthere=”WHO Annual Guideline”))
pw
#Define average
pwa <- pw +
geom_hline(aes(yintercept=(mean(data$PM))), colour=”blue”,label=”Average 24hr Guideline”)
pwa
#Plot Model of PM v. Time (linear w/ 95% confidence regions)
modell <- pw +
geom_smooth(method=lm, colour=”blue”)
modell
#Plot Smoothed Model of PM v. Time (loess w/95% confidence regions)
models <- pw +
geom_point(shape=1, colour=”black”) +
geom_smooth(colour=”blue”)
models