
Benjana Guraziu

R is an awesome language, but I wouldn’t be this excited about it if I weren’t so excited about the people that are in this community. NYR was a great conference that really highlighted the strength of the R community


Book club

Date: Saturday, July 14, 2018 Time: 11am Author: Meredith


Bringing the family together

My husband’s family throws a family reunion every year and this year we’ve been tasked with co-planning it. We were trying to decide on the best location for everyone, so I embarked on a mission to find the center of all of our residences.

library(tidyverse)
library(leaflet)

Geocoding the locations

I began by putting together a quick
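One way to think about "the center of all of our residences" is as a geographic midpoint. The sketch below is only an illustration in base R, not the post's own method: the function name geo_center and the three coordinates are hypothetical. It averages the locations as unit vectors on a sphere and converts the mean vector back to latitude and longitude.

```r
# Geographic midpoint of a set of points, treating the Earth as a sphere.
# geo_center is a hypothetical helper, not taken from the post.
geo_center <- function(lat, lon) {
  lat_r <- lat * pi / 180
  lon_r <- lon * pi / 180
  # Mean of the 3D unit vectors pointing at each location
  x <- mean(cos(lat_r) * cos(lon_r))
  y <- mean(cos(lat_r) * sin(lon_r))
  z <- mean(sin(lat_r))
  c(lat = atan2(z, sqrt(x^2 + y^2)) * 180 / pi,
    lon = atan2(y, x) * 180 / pi)
}

# Made-up residences (latitude, longitude in degrees)
lat <- c(40.7, 41.9, 34.1)
lon <- c(-74.0, -87.6, -118.2)
center <- geo_center(lat, lon)
```

Averaging raw latitudes and longitudes would give a similar answer for nearby points, but the vector average behaves better when locations are far apart.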


ML Kit with Brahim Elbouchikhi and Sachin Kotwani

He holds an MBA from Carnegie Mellon University, and dual bachelor’s degrees in Business Management and Computer Science from the University of Missouri - Columbia. His hobbies include traveling with his family, chasing his daughter around the house, and tinkering with mobile apps and backends


Time after time

I first learned how to check for autocorrelation via autocorrelation function (ACF) plots in R in a class on time series. However, the examples we worked on were all single, long-term time series with no missing values and no
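As a rough illustration of the missing-value case (a sketch on simulated data, not the post's own example), base R's acf() can skip missing observations when it is given na.action = na.pass. The AR coefficient and the positions of the gaps are arbitrary assumptions.

```r
set.seed(42)
# Simulated AR(1) series; the coefficient 0.7 is an arbitrary choice
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))

# Knock out a few observations to mimic gaps in the record
x[c(20, 21, 75, 140)] <- NA

# The default na.action (na.fail) would stop here; na.pass lets acf()
# compute each lag from the pairs of observations that are both present
res <- acf(x, na.action = na.pass, plot = FALSE)
lag1 <- res$acf[2]   # element 1 is lag 0, element 2 is lag 1
```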


A package for tidying nested lists

Run the app like


Creativity in Data Analysis

Missing data are present in almost every dataset and the most important question a data analyst can ask when confronted with missing data is “Why are the data missing?” It’s important to develop some understanding of the mechanism behind what makes the data missing in order to develop an appropriate strategy for dealing with missing data


Day 55-62: R: The Boring Bits

The “random walk on CRAN” project, however, has been on hold for a bit - and in truth today’s post is a bit of a cop out because there’s no package here at all and barely anything resembling code. Instead, it’s some initial thoughts about how to revisit some of my teaching material. For today though, I have a different goal… But still, it’s nice to think about what we might do if we have more


Mail merge in 2018 with R

Matt Dray Two-thousand and late Clip art! Fax machines! CD-ROMs! Dial-up modems! The World Wide Web! Mail merge! These exotic terms give me flashbacks to computer class at the turn of the millennium


Models are about what changes, and what doesn't

The purpose of most models is to understand change, and yet, considering what doesn’t change and should be kept constant can be equally important. Models are about what changes, and what doesn’t. Some are useful. In mathematics change is often best described with differential equations, and that’s how I will motivate and justify my models today


My first hackathon (part 1)

Gender pay gap hackathon Last weekend I went to my first hackathon. It was organised by the AI club for gender minorities, codebar and ellpha. We used data on the gender pay gap available here. I had a great time so I wanted to share my experience. This is the first part of my first hackathon


Re-referencing factor levels to estimate standard errors when there is interaction turns out to be a really simple solution

Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer


Shiny 1.1.0

Without this capability, when Shiny performs long-running calculations or tasks on behalf of one user, it stalls progress for all other Shiny users that are connected to the same process


Storrrify #satRdayCDF 2018

Now, let’s have a look at the day as tweeted by me… I obtained 22


Running python in the RStudio IDE

A quick look at running python inside the RStudio IDE. When version 1


Selection effects

My limited goals: Perhaps the central difference between working in the Stata environment and in R is that in R you always have to be declaring which data frame you are working with. In Stata, you just have one active data frame and then you can refer to the variables by their names alone



This post is my attempt to fill this gap. Rather than focusing on a specific task, I take a more general approach, and explain the fundamental abstractions underpinning TensorFlow. With a good grasp of these concepts, deep learning with TensorFlow becomes intuitive and straightforward


Fast Fiedler Vector Computation

While this is easy to implement, it comes with the huge drawback of computing many unnecessary eigenvectors. We just need one, but we calculate all 100 in the example. The bigger the graph, the bigger the overhead from computing all eigenvectors
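To make the contrast concrete, here is a small base R sketch on a 6-node path graph (an assumed toy example, not the graph from the post). It first takes the Fiedler vector from a full eigendecomposition, then recovers the same vector with a shifted power iteration that only ever tracks one vector. The helper fiedler_power is hypothetical, and the post's fast method may well be a different one.

```r
# Path graph on 6 nodes: adjacency matrix A and Laplacian L = D - A
n <- 6
A <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  A[i, i + 1] <- 1
  A[i + 1, i] <- 1
}
L <- diag(rowSums(A)) - A

# Naive approach: compute every eigenpair, keep only one.
# eigen() sorts eigenvalues in decreasing order for symmetric input,
# so the Fiedler vector (second-smallest eigenvalue) is column n - 1.
eig <- eigen(L, symmetric = TRUE)
fiedler_naive <- eig$vectors[, n - 1]

# Sketch of an alternative: power iteration on (shift * I - L) while
# projecting out the constant vector (the eigenvector for eigenvalue 0).
fiedler_power <- function(L, iters = 5000) {
  n <- nrow(L)
  shift <- 2 * max(diag(L))            # upper bound on the largest eigenvalue
  M <- shift * diag(n) - L
  x <- seq_len(n) - mean(seq_len(n))   # deterministic start, orthogonal to 1
  for (k in seq_len(iters)) {
    x <- as.numeric(M %*% x)
    x <- x - mean(x)                   # remove the constant component
    x <- x / sqrt(sum(x^2))            # normalise to unit length
  }
  x
}
fiedler_iter <- fiedler_power(L)
```

The two vectors agree up to sign. On a graph this small the iteration is not faster, but each step is only a matrix-vector product, which is what makes one-vector methods attractive for large sparse graphs.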


Golf, Tidy Data, and Using Data Analysis to Guide Strategy

Introduction I’m going to use this post to discuss some of the aspects of data science that interest me most (tidy data as well as using data to guide strategy). I’ll be discussing these topics through the lens of a data analysis of results from a few high school golf tournaments


Plotting house price and income trends

In this post we will create some plots of house prices and incomes for the United States and individual states. We will also try out the bea.R package to get data from the U.S. Bureau of Economic Analysis. We’ll end up with something like this: Per usual we’ll do it with R and I’ll include code so you can follow along


Solution to a frustrating rJava problem

Go to the command line and run: According to the solution at the aforementioned link, this will “create a link to libjvm.dylib inside R’s lib folder”


Using IRanges for Non-Integer Overlaps

Let’s actually test differences in speed by counting how many overlapping points there


Announcing new software review editors

The overall goals of rOpenSci are fully aligned with my interests and passions, both personally and also professionally as a Research Software Engineer, tasked with helping researchers make the most of their code and


Diagnosing Ruby Memory Issues in Cloud Foundry's API Server

Occasionally our end-users will use the platform in ways we might not have predicted, which results in unique and difficult-to-reproduce issues


How to upload files to GEO

I used my Google account. Soft link does not work for me… After your transfer is complete, you need to tell the NCBI. After file transfer is complete, please e-mail GEO with the following information:
- GEO account username (tangming2005@gmail.com);
- Names of the directory and files deposited;
- Public release date (required - up to 3 years from now - see FAQ)


Ideally, I Hope to Simply Copy and Run Your Example

That is why you should provide a fully reproducible example whenever possible, instead of describing all the steps to create such an example. For step 1, I had to copy it to an Rmd document in my RStudio session. For step 2, I had to copy the code and put it in an R code chunk


Multiple Sclerosis and exercise

Hi athletes. I recently went to a Multiple Sclerosis (MS) conference in Paris organized by the ARSEP foundation, mostly to see what people are doing on the immunology side of the disease, and I started looking into the foundation a bit more and that’s how I started doing this article


Odds ratios and logistic regression basics

Binary outcome variables that only take on two distinct values such as alive vs. not alive are very common in medicine and elsewhere
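A minimal sketch of how an odds ratio relates to logistic regression, using made-up counts (80 of 100 treated and 60 of 100 control subjects alive; these numbers are assumptions for illustration only). With a single binary predictor, the exponentiated glm() coefficient reproduces the odds ratio computed by hand.

```r
# Hypothetical data: 100 treated and 100 control subjects
treated <- c(rep(1, 100), rep(0, 100))
alive   <- c(rep(1, 80), rep(0, 20),   # treated: 80 alive, 20 not
             rep(1, 60), rep(0, 40))   # control: 60 alive, 40 not

# Odds ratio by hand: odds of being alive in each group, then their ratio
odds_treated <- 80 / 20
odds_control <- 60 / 40
or_by_hand <- odds_treated / odds_control

# Logistic regression: the coefficient is the log odds ratio
fit <- glm(alive ~ treated, family = binomial)
or_from_glm <- unname(exp(coef(fit)["treated"]))
```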


On Cache Invalidation

First of all, the main purpose of caching is speed. The basic idea is simple: if you know you are going to compute the same thing, you may just load the result saved from the previous run, and skip the computing this time. There are two keywords here: “the same thing”, and “the saved result”
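The two keywords can be illustrated with a tiny hand-rolled cache in base R. This is only a sketch: slow_square and cached_square are hypothetical names, "the same thing" is identified here by nothing more than the input value, and "the saved result" lives in an environment rather than on disk.

```r
cache <- new.env()
n_computed <- 0   # counts how often the slow function really runs

slow_square <- function(x) {
  n_computed <<- n_computed + 1
  Sys.sleep(0.05)   # stand-in for an expensive computation
  x^2
}

cached_square <- function(x) {
  key <- as.character(x)                 # what counts as "the same thing"
  if (exists(key, envir = cache, inherits = FALSE)) {
    return(get(key, envir = cache))      # load "the saved result"
  }
  result <- slow_square(x)
  assign(key, result, envir = cache)
  result
}

a <- cached_square(4)   # computed
b <- cached_square(4)   # loaded from the cache, slow_square() is skipped
```

Invalidation is then the question of when the key stops being a trustworthy description of the computation, for example when the function body changes but the input does not.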


R-Ladies Sydney Launch!

I thought I would spend all my PhD reading textbooks and learning new R techniques (see below). However, research, admin, and teaching can get in the way! I learned how to develop packages to complement my research and to also improve workflow. Learn from examples! E.g., this presentation was based off Alison Hill’s R-Ladies talk about blogdown! Work smart, not hard


Story time

I’ve often been asked what information should be put inside a user story, or what kind of template I use for a user story on a project. In order to provide a useful answer to this question, we must look back and see where user stories came from


Implementing Time Windowing in an Evented Streaming System

Hi there! Welcome to the second and final installment of my trending Twitter hashtags example series. In part 1, we covered the basic dataflow and logic of the application


Naming Things

The oldSchool namer generally mixes case in an R package, often capitalising the “R”, or going all in on ALL CAPS. Examples: Although personally I wouldn’t use this style as it can make it difficult to type, they have a certain charm, and are easy to google - provided you spell it


SciLifeLab Prize

Last year I submitted an entry to this competition and I enjoyed the experience, even if it was a bit rushed. The process of joining the competition is relatively straightforward: You don’t need to pay for competing! You already did the very hard part of completing your Ph.D


The tidyverse is for EDA, not packages

Because the tidyverse is a set of packages designed for interactive data analysis, this is, in short, a bad idea


Turn Robert's Beard Purple!

Links: I decided to participate in the Walk To End Alzheimer’s this year, coming up on August 25th here in Lexington


Coding Codes

One of my sons has signed up with the S.H.O.E


Continued Introduction

This is a continuation of my first post just to see how things look using blogdown. So… first, I’ll call one of the built-in data sets:

library(tidyverse)
library(hexSticker)
data(cars)
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25


Data Methods Discussion Site

I have learned more from Twitter than I ever thought possible, from those I follow and from my followers. Quick pointers to useful resources have been invaluable. I have also gotten involved in longer discussions


HTML Widgets for Non-HTML Output Formats

Screenshotting HTML widgets, Shiny apps, and arbitrary URLs works in any non-HTML output format: PDF, Word, EPUB, RTF, PowerPoint, etc


It's been a Month

It’s been a month since the coding period of GSoC started. My first evaluations are done. I have passed them successfully thanks to my mentor Colin. :) The first few weeks of the coding period were spent familiarizing myself with the basic concepts of Bayesian Inference


Looking into #KeepFamiliesTogether

This week I’m at the Seattle branch of the Summer Institute on Computational Social Science


RStudio Connect v1.6.4

There are a few breaking changes and a handful of new features that are highlighted below. We encourage you to upgrade as soon as possible! Please take note of important breaking changes before upgrading. RStudio Connect includes Pandoc 1 and will now also include Pandoc 2


The visual taming of a paradox

@drob has posted code to play with on Twitter today. To illustrate what he calls a veridical paradox, he’s posted the setup, the code, and the result of a coin-flipping experiment: There are some good and exact explanations in the thread for this at-first-glance puzzle. But I didn’t see a visualization that might give you quick intuition about what is going on


Using Slack for Academic Departmental Communication

versus This blog post was made possible thanks


GeoPAT 2

Now, try to do the same with the image below representing a land cover over a part of Eastern


One Little Thing



Principled Bayesian Workflow

He also gave succinct advice for model calibration and validation


Quick Guide to Gene Name Conversion

There are several popular naming systems for (human) genes: But, he did not add RefSeq names. So if you need to get RefSeq names into one of the others, you’ll have to do another step


Really not a fish? Scraping my mathematical family tree

From the above I deduced that I was allowed to scrape mathematicians’


The Role of Resources in Data Analysis

When learning about data analysis in school, you don’t hear much about the role that resources (time, money, and technology) play in the development of analysis. This is a conversation that is often had “in the hallway” when talking to senior faculty or mentors


How I Found Myself doing Data science in marketing

Since February 26 I have been working at Seedstars Ivory Coast, in a venture named Bora digital that works mostly in marketing and digital marketing


If Ranked-choice voting decided the (second) BEST Black Mirror episode

The data is particularly unique because many users did not make a full ranking of all 19 episodes, with some users ranking just a single episode as their favorite
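To see how incomplete rankings can still be counted, here is a minimal instant-runoff sketch in base R. The ballots and the function name irv_winner are hypothetical, and the tie-breaking rule (drop the first candidate among those with the fewest votes) is an assumption; the post's own tallying may differ.

```r
# Each ballot is a character vector of candidates in preference order.
# Ballots may be partial: a voter can rank a single favourite only.
irv_winner <- function(ballots) {
  candidates <- unique(unlist(ballots))
  repeat {
    # Highest-ranked remaining candidate on each ballot (NA if exhausted)
    top <- vapply(ballots, function(b) {
      b <- b[b %in% candidates]
      if (length(b) == 0) NA_character_ else b[1]
    }, character(1))
    counts <- table(factor(top, levels = candidates))
    # Stop on a strict majority of the ballots still in play
    if (max(counts) > sum(counts) / 2 || length(candidates) == 1) {
      return(names(counts)[which.max(counts)])
    }
    # Otherwise eliminate the candidate with the fewest first choices
    candidates <- setdiff(candidates, names(counts)[which.min(counts)])
  }
}

ballots <- list(
  c("A", "B"),
  c("A"),        # a single-episode ballot
  c("B", "C"),
  c("C", "B"),
  c("C", "B")
)
winner <- irv_winner(ballots)
```

In this toy election nobody has a majority in the first round, B is eliminated, and the ballot that ranked B first transfers to its next choice.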


Introductory Post

After a bit of research and testing, I have decided to start using the blogdown package in R in order to continue blogging


The Copernicus toolbox and the role of software in climate services

The best thing about C3S is that they are trying to foster the creation of an ecosystem of data services and, not surprisingly, software (design, development, architecture) plays a critical role here


Behind the Viz

Lots of packages here: The gather() function pivots the data, we name “episodes” as the key (what the column names will be called) and rankings as the values (what the data in those columns will now be called), while telling the function to not mess with the columns user, other, or


Day 51-52: Kabling

Not wanting to give away the list of things that might appear on bingo cards (or alternatively, not actually having written all the items yet!) I’ll need to find some content to use for this post… Oh yes little network, you had me at “hello”. Here’s what we get: It’s a good start, but it’s all structure and no style! No, wait, that doesn’t work. Why? There we


Federalist Papers

Every couple of weeks I like to explore data that’s brand new to me. I anticipate a one-hour, one-off project. Usually this turns out to be a beautiful lie, and the projects chew up much more time. Still, this enticing timeline is pulling me into new projects from time to time


Multivariate Adaptive Regression Splines in a Nutshell

Like standard linear regression, MARS uses the ordinary least squares (OLS) method to estimate the coefficient of each term. However, instead of an original predictor, each term in a MARS model is a basis function derived from original predictors. A basis function takes one of the following forms: MARS does not treat categorical predictors differently from standard linear


One Little Thing

One of the most frequent topics on which I blog in recent years is Chinese literature (of course, only in my Chinese blog). In particular, I often quote poems in my posts. To quote a poem in Markdown, you have to add two trailing spaces after every line of the poem
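The trailing-space rule is easy to automate. The helper below is a hypothetical sketch (add_hard_breaks is not a function from the post or from any package): it strips whatever whitespace already ends each line and appends exactly two spaces, which Markdown treats as a hard line break.

```r
# Append exactly two trailing spaces to every line of a poem
add_hard_breaks <- function(lines) {
  paste0(sub("[[:space:]]+$", "", lines), "  ")
}

poem <- c("Line one", "Line two ", "Line three")
out <- add_hard_breaks(poem)
```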



Say we have a fruit bowl–mmm!–consisting of strawberries, raspberries, blueberries, and blackberries. There are 50 strawberry slices, 20 blueberries, 15 raspberries, and 15 blackberries
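A quick simulation makes the link between sample size and uncertainty visible. This is a sketch under assumptions rather than the post's own code: the sample sizes (5 and 50 pieces), the number of repetitions, and the seed are arbitrary choices. Only the fruit counts come from the text.

```r
set.seed(1)

# The bowl described above: 100 pieces of fruit in total
bowl <- rep(c("strawberry", "blueberry", "raspberry", "blackberry"),
            times = c(50, 20, 15, 15))
true_prop <- mean(bowl == "strawberry")

# Repeatedly draw small and large samples (without replacement) and
# record the estimated share of strawberries each time
small <- replicate(1000, mean(sample(bowl, 5) == "strawberry"))
large <- replicate(1000, mean(sample(bowl, 50) == "strawberry"))

spread_small <- sd(small)
spread_large <- sd(large)
```

Both sets of estimates centre on the true share, but the estimates from the small samples are spread much more widely around it.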


Re-introduction to gghighlight

But, please forget about that gghighlight; gghighlight has become far more powerful and simple! So, let me re-introduce gghighlight. What do you do when you explore data that is too large to print? OK, good


How to debug AWS Application Load Balancers with minimal colourful vituperations

Introduction From the Temples of Testers, a browser bestowed a 504 gateway timeout in your newly deployed internal facing Application Load Balancer (ALB). There was a gnashing of molars and gurning of visages. Your ALB isn’t responding. Don’t panic and be “oh wow! heavy heavy heavy” like Neil from the Young Ones


People vs. Institutions in Data Analysis

Invest in businesses any idiot could run because someday one will. A perhaps more detailed version of this sentiment comes from fellow legendary investor Warren Buffett, in his testimony before the U.S


What I wish I'd known

Something I never tire of hearing is the story of how someone arrived at where they are


Build Binary R Packages for the Homebrew Version of R?

Personally I don’t really care whether a project is “sticky”, and I believe being magnetic is a lot more difficult and valuable than being sticky for an open source project. Okay, that is a bit of a digression for this blog post. If you use the Homebrew version of R, it will be super easy to upgrade or remove R in the future


Covariance -- A Visual Walk Through

In a previous post, I’ve looked at walking through the calculation of variance and standard deviation, visualizing each step. This post is dedicated to the visualization of another statistic: covariance
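For readers who want the arithmetic alongside the pictures, here is a small step-by-step sketch of the sample covariance in base R. The two vectors are made-up numbers, not data from the post.

```r
# Two hypothetical variables measured on the same four observations
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)

# Step 1: deviations of each value from its own mean
dx <- x - mean(x)
dy <- y - mean(y)

# Step 2: products of the paired deviations
products <- dx * dy

# Step 3: sum the products and divide by n - 1 (sample covariance)
cov_by_hand <- sum(products) / (length(x) - 1)
```

The result matches what the built-in cov(x, y) returns.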


Day 47-50: Paletter

Okay! It’s a Thursday evening. Solo parenting is over. My partner is back in town. The kids are in bed. Tina Turner is playing over the wireless. I’m the last one awake. Time for an R post, because that’s just the kind of girl I


Exploring European attitudes and behaviours using the European Social Survey

On the 4th of March 2018 I submitted the package to rOpenSci, intimidated but very excited about the peer review process. To my surprise, the process was enriching, respectful and transparent, unlike my previous experience in academic research


Quick Tip

This post originally appeared on the Mango Solutions blog


Real-time Streaming Pattern

Introduction I am starting a series of posts looking at a variety of data processing patterns used to build real-time stream processing applications, the use cases that the patterns relate to, and how you would go about implementing within


Uncertainty and Sample Size

As a hungry botanist I couldn’t think of a more a-peel-ing metaphor than fruit. Let’s say we have a fruit bowl consisting of strawberries, raspberries, blueberries, and blackberries


World Cup Club Representation

It’s common knowledge amongst association football fans that the World Cup, while being the highest profile event in the world, isn’t necessarily a display of the best football play in the


Late anniversary edition redux

This afternoon, I was looking over some simulations I plan to use in an upcoming lecture on multilevel models. I created these examples a while ago, before I started this blog


Principal Component Analysis

PCA Number of factors retained by psycho::n_factors() Extraction of the variables With the article about correlations, we saw that data from airquality were correlated. Sometimes it is necessary to use Principal Component Analysis (PCA) to determine uncorrelated variables in order to analyze data


Tidy matched pairs t-test

Introduction The matched pairs t-test is for comparing two measurements obtained on the same individual, such as a before and an after measurement. This is different from the two-sample t-test, which has two independent sets of measurements, one for each experimental condition, with each set collected on different
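The distinction can be checked directly in base R. In this sketch the before and after values are invented for illustration. It shows that a matched pairs t-test is the same thing as a one-sample t-test on the within-individual differences, and that treating the same numbers as two independent samples gives a different test.

```r
# Hypothetical before/after measurements on the same six individuals
before <- c(12.1, 11.4, 13.0, 12.7, 10.9, 11.8)
after  <- c(11.6, 11.0, 12.1, 12.9, 10.2, 11.1)

# Matched pairs t-test
paired <- t.test(before, after, paired = TRUE)

# Equivalent: one-sample t-test on the differences
one_sample <- t.test(before - after, mu = 0)

# Two-sample (Welch) t-test, which ignores the pairing
two_sample <- t.test(before, after)
```

The paired and one-sample versions return identical statistics and p-values; the two-sample version does not, because it cannot remove the variation between individuals.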


Eat near the Big Ben? That will cost you...

#MakeoverMonday is a fun data visualization initiative; most participants use Tableau as their preferred visualization tool. But I’ve used R and ggplot() and the organizers and participants have been very welcoming


My current research, for laypeople

I have struggled a lot to explain my current line of research to people in my life who aren’t scientists. But if I can’t explain my research to everyone, then I can’t claim to really know what I’m doing


The Best Experience in Remote Talks that I Have Given

For someone who usually does not prefer traveling (and cannot travel too far in a couple of years), I have to say this system is just perfect


The ssh Package

Because the ssh package is based on libssh it does not need to shell out. Therefore it works natively on all platforms without any runtime dependencies. Even on Windows


naniar 0.3.1

There were a few things that changed in this release, some of them big, some small, and some technical. Let’s break them down


(R) Fetching JSON

1 The Setting
2 Inspect a Website’s DOM & HTTP Requests
3 Extract JSON
3.1 Quick: JSON from File (copy & paste)
3.1.1 It’s all about [[1]]
3.1.2 Two Lines, Pt. 1
3.2 Robust: Fetch JSON from API with GET
3.3 Parse JSON
4 From JSON to Tidy
4.1 Tidy Date #1 with lubridate
4.2 Tidy Date #2: type_convert() to the Rescue / Two Lines, Pt



All these tweets are threads, so you’ll have to open them to see them: click on the blue bird on the right side of each tweet. Lately I’ve been processing some strong feelings related to feeling unwelcome, homesickness and loneliness


tiler 0.2.0 CRAN release

Lastly, consider the power of your system before attempting to make a ton of tiles for large images at very high resolutions


Could an Independent Yorkshire Win the World Cup - Rest of the World/UK

To save time, I’m going to use saved versions of the datasets I built up over the 5 blog posts. I won’t include the functions in this blog post either, but the article uses (at most very slightly modified) functions from the previous 5 posts. We first need to sort the players into either the UK vs


Day 39-46

This series of posts has been on hold for the last few days because I’ve been solo parenting and had a few deadlines at work. I have no idea how single parents


June 11th, 2018 R 3.5.0 on Debian and Ubuntu

Dirk, Michael and Johannes, June


Kalman Filter for a dynamic linear model in R

As an economist with a background in econometrics and forecasting I recognize that predictions are often (usually?) an exercise in futility. Forecasting, after all, is hard. While non-economists have great fun pointing this futility out, many critics miss out on why it’s so hard. There are at least two reasons why forecasting is hard


When 'Docker' meets 'Make'

Being a DevOps engineer, it’s very common that we use tools like AWS CLI, Docker/ECS, and Ansible to build continuous deployment solutions. It is also common to use tools like JenkinsCI to fully automate the deployment of your applications. Recently I have experienced that, due to some bizarre and varied reasons, you cannot always use CI


Could an Independent Yorkshire Win the World Cup - Simulate World Cups

Now that we have the teams for each county, we want to work out how well they would do at a world cup. For this, we need to know roughly what their ranking would be compared to actual nations


Plot geom_sf() On OpenStreetMap Tiles

Clearly display license attribution. For “Technical Usage Requirements” section, I have to read this more carefully. Let’s look at the requirements one by one. Valid HTTP User-Agent identifying application. Faking another app’s User-Agent WILL get you blocked. If known, a valid HTTP Referer


.rprofile: Julia Silge

KO: What is your name, job title, and how long have you been using R? KO: Wow! What were you all about before that? I have a bachelor’s degree in physics, a PhD in astrophysics, and I did a postdoc and research for a while


Could an Independent Yorkshire Win the World Cup - Picking Teams

First, we need a list of plausible formations, and the positions they contain. There’s a handy list of the default FIFA18 formations online which we’ll scrape


Peeking behind the curtain with {slidex}

I gave a lightning talk (slides here) this past weekend at the second annual Cascadia R Conference that was focused on creating and contributing new themes to the {xaringan} package, which is essentially a really well thought out and well-organized R Markdown wrapper around the remark


Unconf18 projects 4

In the spirit of exploration and experimentation at rOpenSci unconferences, these projects are not necessarily finished products or in scope for rOpenSci packages


What’s the IGO dataset?

This webpage is meant to provide students and the curious with a visual, explorable introduction to the dataset. The number of IGOs observed over the time period of 1815 to 2005 has dramatically increased. At the beginning of this period there were just a handful, but now they number more than 300


Article Round Up June 2018

The first article is quite long, but easily skim-able


Could an Independent Yorkshire Win the World Cup - Finding British Player's Birthplaces

To select our county teams, we need to know where each British player was born (and thus their ‘county’ nationality)


Fair is foul, and foul is fair

We see that the text variable contains one line of text for each row. Given this format, we can create a new data frame with a row for each word token found in the Bing lexicon of sentiment words. By using this lexicon, sentiment words are simply assigned a value of positive or negative. Have a look at the other options with ?get_sentiments


Power Analyses for an Unconditional Growth Model using {lmer}

We collected measures of these variables at three time points, approximately evenly spaced apart, and, for the purposes of these analyses, I decided to treat the data as if they were collected at precisely the same equally spaced interval for all participants