Using R, and Rcpp, efficiently calculate the upper-triangle of the matrix created by taking the dot-product of a single vector by its transposed self.

Given the following vector and its transposition
$$A = \begin{bmatrix}1 & 2 & 3 & 4\end{bmatrix}$$
$$B = A^T = \begin{bmatrix}1\\ 2\\ 3\\ 4\end{bmatrix}$$

The result of the dot-product:

$$A \bullet B = \begin{bmatrix}
1 & \pmb{2} & \pmb{3} & \pmb{4}\\
2 & 4 & \pmb{6} & \pmb{8}\\
3 & 6 & 9 & \pmb{12}\\
4 & 8 & 12 & 16\end{bmatrix}$$

The upper-triangle of the resulting matrix is indicating by those numbers in bold.

$$U = \begin{bmatrix}2 & 3 & 6 & 4 & 8 & 12\end{bmatrix}$$

Now, there are two common, base functions in R to compute this matrix and return its upper-triangle: %*% and tcrossprod(x)

Here are a couple wrapper functions, that return the resulting upper-triangle (using upper.tri()):

R.upper.tri <- function(v) {
  matrix = v %*% t(v)
  matrix[upper.tri(matrix)]
}

R.upper.tri.2 <- function(v) {
  matrix = tcrossprod(v)
  matrix[upper.tri(matrix)]
}

And the results:

> R.upper.tri(c(1,2,3,4))
[1]  2  3  6  4  8 12
> R.upper.tri.2(c(1,2,3,4))
[1]  2  3  6  4  8 12

You see we get exactly what we were looking for as U.

Now, both of these are very straight-forward and perfect for standard use cases. However, when it comes to performing this operation frequently, over larger vectors, the potential for performance improvements becomes increasingly noticeable.

# Using random numbers in the range 1:10, of increasing lengths, tested 100 times each:
> microbenchmark::microbenchmark(
+   R.upper.tri(runif(3, 1, 10)), 
+   R.upper.tri(runif(10, 1, 10)),
+   R.upper.tri(runif(100, 1, 10)),
+   R.upper.tri(runif(300, 1, 10))
+   ,times = 100
+ )
Unit: microseconds
                           expr     min        lq       mean   median        uq      max neval cld
   R.upper.tri(runif(3, 1, 10))  15.261   17.4010   32.96587   20.572   37.4495  146.260   100  a 
  R.upper.tri(runif(10, 1, 10))  17.245   19.3895   37.44702   23.358   46.8425  140.080   100  a 
 R.upper.tri(runif(100, 1, 10))  89.000   95.2420  223.28620  114.420  170.8850 6603.911   100  a 
 R.upper.tri(runif(300, 1, 10)) 785.662 1913.4350 2201.49373 2028.131 2123.3885 8881.646   100   b

The first place I tried was to implement a simple Rcpp function using Armadillo’s matrix functions:

// [[Rcpp::depends(RcppArmadillo)]]

#include <RcppArmadillo.h>
#include <Rcpp.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arma::rowvec vector_to_ut_mul(arma::vec v) {
  arma::mat mat = v * trans(v);
  mat.diag().zeros();
  return(trans(mat.elem(find(trans(trimatl(mat))))));
}

Test our vector A:

> vector_to_ut_mul(c(1,2,3,4))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    3    6    4    8   12

Now the same performance tests:

> microbenchmark::microbenchmark(
+   vector_to_ut_mul(runif(3, 1, 10)),
+   vector_to_ut_mul(runif(10, 1, 10)),
+   vector_to_ut_mul(runif(100, 1, 10)),
+   vector_to_ut_mul(runif(300, 1, 10)),
+   times = 100
+ )
Unit: microseconds
                                expr     min       lq      mean   median       uq      max neval cld
   vector_to_ut_mul(runif(3, 1, 10))   3.449   4.0235   5.36643   4.7860   5.6395   34.781   100 a  
  vector_to_ut_mul(runif(10, 1, 10))   4.688   5.7665   7.47304   6.6505   7.7330   28.154   100 a  
 vector_to_ut_mul(runif(100, 1, 10))  41.205  44.6345  49.28298  47.0500  50.9950  143.942   100  b 
 vector_to_ut_mul(runif(300, 1, 10)) 359.396 535.3850 580.06845 567.2695 602.2450 1000.952   100   c

So there is some noticeable improvements, but in the larger vectors we see the performance, expectedly, start to degrade.

So to make it faster, we can actually skip performing the full multiplication, rather iterate the vector itself only multiplying the necessary indices that make up the upper-triangle of the would-be matrix.

#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

//' Calculates the upper triangle of a vector of integers  if it
//' were converted to a matrix. This actually skips creating the
//' matrix, by only multiplying the necesseary indices of the
//' vector.
//'
//' @param v - A vector of integers
//' @export
// [[Rcpp::export]]
std::vector<int> vector_to_ut(std::vector<int> v) {
  int vL = v.size();
  int vS = ( (vL * (vL + 1)) / 2) - vL ;
  int s = 0;
  std::vector<int> vR( vS );
  for( int i = 2; i <= vL; i++ ) {
    for (int j = 0; j < i-1; j++ ) {
      vR[s] = v[j] * v[i-1];
      s++;
    }
  }
  return vR;
}

Again, test against our vector A:

> vector_to_ut(c(1,2,3,4))
[1]  2  3  6  4  8 12

Again with the same performance tests:

Unit: microseconds
                            expr     min       lq      mean   median       uq     max neval cld
   vector_to_ut(runif(3, 1, 10))   3.369   4.0515  10.44103   5.4200   7.6670 199.673   100 ab 
  vector_to_ut(runif(10, 1, 10))   3.738   4.4430   8.00061   6.0465   9.9895  40.958   100 a  
 vector_to_ut(runif(100, 1, 10))  10.069  12.4600  22.76412  17.4190  24.0350 112.748   100  b 
 vector_to_ut(runif(300, 1, 10)) 124.615 139.1790 175.37466 153.3485 195.0290 689.874   100   c

And the results of the larger operations, all together for comparison:

benchmark_ut(1:300,1000)

Unit: microseconds
                expr     min        lq      mean    median       uq       max neval
      R.upper.tri(v) 668.556 1291.3745 2483.8589 1616.2400 2126.235 112633.59  1000
    R.upper.tri.2(v) 674.230 1306.0820 2428.9174 1645.2025 2224.076  69641.66  1000
 vector_to_ut_mul(v) 342.695  628.1725 1375.3058 1101.8775 1358.649  16224.55  1000
     vector_to_ut(v)  42.420  118.7545  237.3842  190.1145  224.825  17368.39  1000