Does subsetting (matrices or arrays) always perform a partial copy?

Some large datasets are pushing memory and some functions I’m writing to the limit. I wanted to ask some questions about subsetting, of matrices and arrays in particular:

  1. Does defining a variable as a subset of another lead to a copy? For instance
    x <- matrix(rnorm(20*30), nrow=20, ncol=30)
    y <- x[, 1:10]

Some exploration with object_size() from pryr (https://rdrr.io/cran/pryr/man/object_size.html) seems to indicate that a copy is made when y is created, but I'd like to be sure (see the sketch after these two questions).

  2. If I pass a subset of a matrix/array as an argument to a function, does it get copied before the function starts? For instance, in
    x <- matrix(rnorm(20*30), nrow=20, ncol=30)
    y <- dnorm(0, mean=x[,1:10], sd=1)

I wonder whether the data in x[,1:10] are copied before being passed to dnorm.

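Here is the kind of check I have in mind, a minimal sketch using object_size() from pryr (lobstr::obj_size() behaves the same way) together with base R's tracemem():

    library(pryr)

    x <- matrix(rnorm(20*30), nrow=20, ncol=30)

    # Plain assignment does not copy: x and y point at the same memory,
    # so the combined size equals the size of x alone
    y <- x
    object_size(x)
    object_size(x, y)      # same as object_size(x) -> memory is shared

    # Subsetting allocates a new object: the combined size is (roughly) the sum
    z <- x[, 1:10]
    object_size(x, z)

    # tracemem() reports when an object is duplicated; modifying y after
    # `y <- x` triggers the deferred copy (copy-on-modify)
    tracemem(x)
    y[1, 1] <- 0           # prints a tracemem message: the shared memory is duplicated
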
I’ve heard that data.table allows one to work with subsets without copies being made (unless necessary), but it seems that approach is constrained to two dimensions only, with no support for higher-dimensional arrays.
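
For reference, the by-reference update data.table offers looks roughly like this (the V1 column name is just what as.data.table() assigns by default here):

    library(data.table)

    dt <- as.data.table(matrix(rnorm(20*30), nrow=20, ncol=30))
    tracemem(dt)           # watch for duplications
    dt[, V1 := V1 * 2]     # := updates the column in place; no tracemem message expected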

Cheers!

MalditoBarbudo,

As far as I know, yeah, R always works with copy-on-modify. Some libraries, as you mention (data.table), provide objects/classes that avoid this, but I’m not aware of any of them working with arrays (more than 2D). Maybe parquet or arrow have something like this?
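
To illustrate what copy-on-modify means for function arguments: the argument itself is not copied when the function is called, the copy only happens once the object gets modified. A minimal sketch:

    f <- function(m) {
      # no copy yet: m still points at the caller's matrix
      m[1, 1] <- 0         # copy-on-modify triggers here, inside the function
      m
    }

    x <- matrix(rnorm(20*30), nrow=20, ncol=30)
    tracemem(x)            # report whenever this object is duplicated
    invisible(f(x))        # the tracemem message appears at the assignment inside f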

pglpm,

Thank you for the suggestion! Worth looking at parquet and arrow indeed.

morcution,

+1 for parquet and arrow. If you’re pushing memory, it’s better to just treat it as a completely out-of-memory problem. If you can split the data into multiple parquet files with hive-style or directory partitioning, it will be more efficient. You don’t want the parquet files too small, though (I’ve heard people say 1 GB per file is ideal; colleagues at work like 512 MB per file, but that’s on an AWS setup).

The bonus is that once you’ve learned the packages, the same approach works for any out-of-memory dataset.
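
Roughly, that workflow with the arrow package looks like this (the data frame, the year partitioning column, and the paths are just made-up placeholders):

    library(arrow)
    library(dplyr)

    # toy data standing in for something too big for memory
    df <- data.frame(year  = rep(2015:2024, each = 1000),
                     value = rnorm(10000))

    # one directory per partition, hive-style: data_parquet/year=2015/...
    write_dataset(df, "data_parquet", format = "parquet", partitioning = "year")

    # later: open the dataset lazily; nothing is read into memory until collect(),
    # and the filter on year lets arrow skip whole partitions on disk
    ds  <- open_dataset("data_parquet")
    res <- ds |>
      filter(year >= 2020) |>
      collect()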
