matrixStats: Consistent Support for Name Attributes via GSoC Project
Author: Angelina Panagopoulou, GSoC student developer, undergraduate in the Department of Informatics & Telecommunications (DIT), University of Athens, Greece
We are glad to announce recent CRAN releases of matrixStats with support for handling and returning name attributes. This feature makes matrixStats functions handle names in the same manner as the corresponding base R functions. In particular, the behavior of matrixStats functions is now the same as that of the apply() function in R, resolving the previous lack of, or inconsistent, handling of row and column names. The added support for names and dimnames attributes has already reached a wide, active user base, and we expect it to attract users and developers who needed this feature and therefore could not previously use the matrixStats package.
The matrixStats package provides high-performing functions operating on rows and columns of matrices. These functions are optimized such that both memory use and processing time are minimized. In order to minimize the overhead of handling name attributes, the naming support is implemented in native (C) code, where possible. In matrixStats (>= 0.60.0), handling of row and column names is optional. This is done to allow for maximum performance where needed. In addition, in order to avoid breaking some scripts and packages that rely on the previous semi-inconsistent behavior of functions, special care has been taken to ensure backward compatibility by default for the time being. We have validated the correctness of these newly implemented features by extending existing package tests to check name attributes, measuring the code coverage with the covr package, and checking all 358 reverse-dependency packages using the revdepcheck package.
Example
useNames is an argument added to each of the matrixStats functions that gained naming support. It takes the values TRUE, FALSE, or NA. For backward-compatibility reasons, the default value of useNames is NA, meaning the default behavior from earlier versions of matrixStats is preserved. If TRUE, the names or dimnames attribute of the result is set; if FALSE, the result has no name attributes. For example, consider the following 5-by-3 matrix with row and column names:
> x <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3, dimnames = list(letters[1:5], LETTERS[1:3]))
> x
            A          B          C
a  0.30292612  1.3825644 -0.2125219
b  0.15812229  2.7719647  1.6237263
c -0.09881700 -0.6468119 -0.6481911
d  0.38520941 -0.8466505 -0.4779964
e -0.01599926 -0.8907434  0.6334347
If we use the base R method to calculate row medians, we see that the names attribute of the results reflects the row names of the input matrix:
> library(stats)
> apply(x, MARGIN = 1, FUN = median)
          a           b           c           d           e
 0.30292612  1.62372626 -0.64681187 -0.47799635 -0.01599926
If we use the matrixStats function rowMedians() with useNames = TRUE, we get the same result as above:
> library(matrixStats)
> rowMedians(x, useNames = TRUE)
          a           b           c           d           e
 0.30292612  1.62372626 -0.64681187 -0.47799635 -0.01599926
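The agreement between the two approaches can also be checked programmatically. The following is a minimal sketch, assuming matrixStats (>= 0.60.0) is installed; the seed and the variable names by_apply and by_mstats are arbitrary choices made for illustration, so the values will differ from those printed above.

```r
library(matrixStats)

set.seed(42)  # arbitrary seed, used only to make this sketch reproducible
x <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3,
            dimnames = list(letters[1:5], LETTERS[1:3]))

by_apply  <- apply(x, MARGIN = 1, FUN = median)  # base R reference result
by_mstats <- rowMedians(x, useNames = TRUE)      # matrixStats with names

# The names must match exactly; all.equal() compares values with a
# numeric tolerance and also compares the names attribute.
stopifnot(identical(names(by_mstats), names(by_apply)))
stopifnot(isTRUE(all.equal(by_mstats, by_apply)))

# The column-wise counterpart should carry the column names instead.
stopifnot(identical(names(colMedians(x, useNames = TRUE)), colnames(x)))
```

The same pattern extends to other row and column functions by swapping in the matching base R summary function.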
If the name attributes are not of interest, we can use useNames = FALSE as in:
> rowMedians(x, useNames = FALSE)
[1] 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926
Doing so also avoids the overhead, in both time and memory, that otherwise comes from processing name attributes.
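How large this overhead is depends on the matrix size and the function used. One rough way to gauge it on your own data is to time both settings, as in the sketch below. The matrix dimensions, the number of repetitions, and the variable names are assumptions chosen for illustration. No timings are shown because they vary between machines and package versions.

```r
library(matrixStats)

set.seed(1)  # arbitrary seed for reproducibility
n <- 2000L
big <- matrix(rnorm(n * 50L), nrow = n, ncol = 50L,
              dimnames = list(paste0("row", seq_len(n)),
                              paste0("col", seq_len(50L))))

reps <- 200L  # repeat the call so the elapsed time is measurable

t_named <- system.time(for (i in seq_len(reps)) rowMedians(big, useNames = TRUE))
t_plain <- system.time(for (i in seq_len(reps)) rowMedians(big, useNames = FALSE))

# Elapsed wall-clock time in seconds for each setting
print(c(named = t_named[["elapsed"]], plain = t_plain[["elapsed"]]))

# Both settings return the same values; only the names attribute differs.
res_named <- rowMedians(big, useNames = TRUE)
res_plain <- rowMedians(big, useNames = FALSE)
stopifnot(isTRUE(all.equal(unname(res_named), res_plain)))
```

For more careful measurements, a dedicated benchmarking package would be a better fit than system.time().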
If we don’t specify useNames explicitly, the default is currently useNames = NA, which corresponds to the undocumented behavior that existed in matrixStats (< 0.60.0). For several functions, that corresponds to useNames = FALSE; for other functions it corresponds to useNames = TRUE; and for others it might set, say, row names but not column names. In our example, the default happens to be the same as useNames = FALSE:
> rowMedians(x) # default as in matrixStats (< 0.60.0)
[1] 0.30292612 1.62372626 -0.64681187 -0.47799635 -0.01599926
Future Plan
The future plan is to change the default value of useNames to TRUE or FALSE and eventually deprecate the backward-compatible behavior of useNames = NA. The default value of useNames is a design choice that requires further investigation. On the one hand, useNames = TRUE as the default is more convenient, but it creates additional performance and memory overhead when name attributes are not needed. On the other hand, making FALSE the default is appropriate for users and packages that rely on maximum performance. Whatever the new default becomes, we will make sure to work with package maintainers to minimize the risk of breaking existing code.
Google Summer of Code 2021
The project that introduced consistent support for name attributes in the matrixStats package is part of the R Project’s participation in the Google Summer of Code 2021.
Links
- The matrixStats GSoC 2021 project
- matrixStats CRAN page
- matrixStats GitHub page
- All commits during GSoC 2021 - author Angelina Panagopoulou
Authors
- Angelina Panagopoulou - Student Developer: I am an undergraduate in the Department of Informatics & Telecommunications (DIT) at the University of Athens.
- Jakob Peder Pettersen - Mentor: PhD Student, Department of Biotechnology and Food Science, Norwegian University of Science and Technology (NTNU). Jakob is a part of the Almaas Lab and does research on genome-scale metabolic modeling and behavior of microbial communities.
- Henrik Bengtsson - Co-Mentor: Associate Professor, Department of Epidemiology and Biostatistics, University of California San Francisco (UCSF). He is the author and maintainer of a large number of CRAN and Bioconductor packages including matrixStats.
Contributions
Phase I
- All functions implement useNames = NA/FALSE/TRUE using R code, and tests are written.
- Identify reverse dependency packages that rely on useNames = NA/FALSE/TRUE.
- New release on CRAN with useNames = NA. This allows useRs and package maintainers to complain if anything breaks.
Phase II
- Changed C code structure such that validateIndices() always returns R_xlen_t*. Clean up unnecessary macros.
  - Outcome: shorter compile times, smaller compiled package/library, fewer exported symbols.
- Simplify C API for setNames()/setDimnames().
- Implemented useNames = NA/FALSE/TRUE in C code where possible, along with cleanup work.
Summary
We have completed all goals that we had initially planned. The release 0.60.0 of matrixStats on CRAN included the contributions of GSoC Phase I (“implementation in R”) and a new release of version 0.60.1 includes the contributions of Phase II (“implementation in C”).
Experience
When I first heard about the Google Summer of Code, I really wanted to participate in it, but I thought that maybe I did not have the prerequisite knowledge yet. And it was true. It was difficult for me to find a project for which I had at least half of the listed prerequisites. So, I started looking for a project based on what I would be interested in doing during the summer. This project was an opportunity for me to learn a new programming language, the R language, and also to get in touch with advanced R. I am grateful for all the learning opportunities: programming in R, developing an R package, using a variety of tools that make developing R packages easier and more productive, working with GitHub tools, and interacting with the open source community. My mentors were understanding of my lack of experience and really helped me achieve this. Participating in Google Summer of Code 2021 as a student developer is definitely worth it, and I recommend that every student who wants to contribute to open source give it a try.
Acknowledgements
- The Google Summer of Code program for bringing more student developers into open source software development.
- Jakob Pettersen for being a great project leader and for his guidance and willingness to share his knowledge. Henrik Bengtsson, whose insight into and knowledge of the subject matter steered me through R package development. I am very grateful for the many useful discussions and valuable feedback.
- The members of the R community for building this welcoming community.