July 2021 – Working With Data

Using R — Packaging a C library in 15 minutes

Yes, this post condenses 50+ hours of learning into a 15-minute tutorial. Read ’em and weep. (That is, you read while I weep.)

Using R — Calling C code with Rcpp

In two previous posts we described how R can call C code with .C() and the more complex yet more robust option of calling C code with .Call(). Here we will describe how the Rcpp package can be used to greatly simplify your C code without forcing you to become an expert in C++.

Using R – .Call(“hello”)

In an introductory post on R APIs to C code, Calling C Code ‘Hello World!’, we explored the .C() function with some ‘Hello World!’ baby steps. In this post we will make a leap forward by implementing the same functionality using the .Call() function.

Using R – Calling C code ‘Hello World!’

One of the reasons that R has so much functionality is that people have incorporated a lot of academic code written in C, C++, Fortran and Java into various packages. Libraries written in these languages are often both robust and fast. If you are using R to support people in a particular field, you may be called upon to incorporate some outside code into your R environment. Unfortunately, much of the documentation on how to do this is written at a very high level. In this post we will distil some of the available information on calling C code from R into three “Hello World” examples.

Ten UNIX commands every data manager should know

Working with data from varied sources can be frustrating — some data will be in CSV format; some in XML; some available as HTML pages; other data as relational databases or MS Excel spreadsheets.

This post will cover the UNIX tools that every data manager needs to be familiar with in order to work with varied data sources.

Data Structures – Tabular vs. Relational

With enough effort it is possible to fit a square peg into a round hole. But we have all learned — sometimes more than once — that it is much easier if peg and hole have the same shape.

Logging and error handling in operational systems

Operational systems, by definition, need to work without human input. Systems are considered “operational” after they have been thoroughly tested and shown to work properly with a variety of input.

However, no software is perfect and no real-world system operates with 100% availability or 100% consistent input. Things occasionally go wrong – perhaps intermittently. In a situation with occasional failures it is vitally important to have good logging and error handling. The MazamaCoreUtils R package helps with these tasks.
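As a rough illustration of the idea, here is a minimal sketch in Python using the standard `logging` module. It is not the MazamaCoreUtils API; the function `parse_reading`, the logger name, and the sample inputs are all hypothetical. The point is only that bad input gets logged and skipped rather than halting an unattended run.

```python
import io
import logging

# Send log records to an in-memory buffer so the example is self-contained;
# an operational system would more likely log to files.
log_stream = io.StringIO()
logger = logging.getLogger("operational_sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.propagate = False  # keep records out of the root logger


def parse_reading(raw):
    """Convert one raw input value to float; log and skip bad input."""
    try:
        value = float(raw)
    except (TypeError, ValueError) as err:
        # Record what went wrong, then carry on instead of crashing.
        logger.error("skipping bad reading %r: %s", raw, err)
        return None
    logger.debug("parsed %r", raw)
    return value


# Hypothetical input containing two malformed values.
readings = [parse_reading(r) for r in ["1.5", "n/a", None, "42"]]
clean = [r for r in readings if r is not None]
```

After the run, `clean` holds only the values that parsed, and `log_stream.getvalue()` holds one line per event, which is what you would search when tracking down an intermittent failure.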

Data volumes

Despite what they say, size does matter.

Successful data management is all about finding the proper tools and formats for dealing with your data. There is no one-size-fits-all solution. The very first question you should be asking yourself is: “How much data are we talking about?”

Best Best Practices Ever!

Every once in a while I read something that is so insightful, so clearly written and so well documented that it enters my own personal pantheon of “Best Ever” documents. I recently added a new, simply divine article titled Best Practices for Scientific Computing and hope that everyone reading this post also takes the time to read that article. I’m including the outline here only to encourage you to read the article in its entirety. It is extremely well written.

When k-means clustering fails

Letting the computer automatically find groupings in data is incredibly powerful and is at the heart of “data mining” and “machine learning”. One of the most widely used methods for clustering data is k-means clustering. Unfortunately, k-means clustering can fail spectacularly as in the example below.
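One well-known failure mode can be sketched in a few lines of Python. This is a hypothetical illustration, not the example from the post: the data (two concentric rings), the starting centroids, and the helper names `ring` and `kmeans` are all assumptions. Because k-means assigns each point to the nearest centroid, it cuts the plane in half here instead of separating the inner ring from the outer one.

```python
import math


def ring(radius, n=8):
    """Return n points evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos((k + 0.5) * 2 * math.pi / n),
         radius * math.sin((k + 0.5) * 2 * math.pi / n))
        for k in range(n)
    ]


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


# Two "true" groups: an inner ring (radius 1) and an outer ring (radius 5).
points = ring(1.0) + ring(5.0)
true_ring = [0] * 8 + [1] * 8

# Assumed starting centroids on the left and right of the origin.
labels = kmeans(points, [(-3.0, 0.0), (3.0, 0.0)])

# Count how many points from each ring ended up in each cluster.
mix = {
    c: [sum(1 for t, lab in zip(true_ring, labels) if lab == c and t == r)
        for r in (0, 1)]
    for c in (0, 1)
}
```

With this setup each cluster ends up holding points from both rings, so the groupings found by k-means do not correspond to the rings at all.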