Native data formats in SAS and R differ for important reasons. The main one is that these applications were built in different eras and therefore run differently, so each program's native format reflects how that program processes data. This blog post showcases the differences between data in their native formats in SAS and R.
Native Formats of Different Objects Differ Between SAS and R
Both SAS and R have native formats for objects that are not datasets. For example, one object you can create in SAS is a macro, and in R, you could create a matrix. But if you are doing data science and data analytics, in SAS, you will use a “dataset”, and in R, you will use a “data frame”. Those are the official names of the data objects you would use as the input to a regression, for example.
Native Format of SAS Dataset: *.sas7bdat
SAS native datasets are stored in a file type called *.sas7bdat. If you take my free online course in getting started with SAS OnDemand for Academics (ODA), I go into detail about this file format. I talk about the different attributes of this file type, its advantages and disadvantages in terms of usage, migration, and storage, and opportunities for interoperability.
Briefly, the *.sas7bdat format of a file has these attributes:
- Bloated. It tends to be larger than the same data in *.txt or *.csv format. That is because it includes a lot of SAS metadata that SAS builds into the file before exporting it.
- Hard to read. It really can only be read natively by SAS, but R packages such as haven can convert it to a format usable by R.
- Easy for SAS to read. That’s actually the main selling point of *.sas7bdat: SAS has no trouble importing it. That is the advantage of storing your datasets in this format if you run a SAS shop, but it’s a disadvantage if you are trying to do SAS/R integration.
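To make the R side of this concrete, here is a minimal sketch of reading a *.sas7bdat file into R. It assumes the haven package is installed (that is my choice of package for illustration, not something this post relies on elsewhere), and it uses the small example file that ships with haven so the code can run without a SAS installation. Swap in the path to your own file.

```r
# Assumption: the haven package is installed, e.g. install.packages("haven")
library(haven)

# haven ships a small example *.sas7bdat file; replace this path with your own.
sas_path <- system.file("examples", "iris.sas7bdat", package = "haven")

# Read the SAS dataset; the result is a tibble (a kind of data frame).
iris_sas <- read_sas(sas_path)

# Convert to a plain base R data frame if you prefer.
iris_df <- as.data.frame(iris_sas)

# Inspect the column names and types that came across from SAS.
str(iris_df)
```

Once the data are in a data frame, you can work with them like any other R dataset, including exporting them to another format.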
Native Format of R Dataset: *.rds
As I just highlighted, if you use SAS’s native data format when you are using SAS, SAS is very happy. It likes having all that metadata packed into that *.sas7bdat file. By contrast, R doesn’t really care that much whether you use its native format (meaning *.rds) or more typical formats such as *.txt or *.csv. What’s nice about *.rds, however, is that there are “no suRpRises”. That’s a bad pun, not a typo: I mean that reading an *.rds file into R spares you the unpleasant surprises you might get with a *.csv file.
Typical unpleasant surprises I have gotten reading non-rds datasets into R are:
- Weird variable names. Either R doesn’t see the name of the variable so it makes up something (like V1), or it changes it in a weird way, for example by replacing spaces or dashes with dots.
- Problems splitting variables in the right places. Some characters in a column can throw off R’s automatic reading process, so you get column splits in weird places. This often happens when you read in a *.csv that has a text column, and someone put a comma inside one of the values without quoting it. If you fix the data in R, export it as an *.rds, and read that back in, R won’t have that problem when you import it.
- Problems seeing goofy characters in variable values. For example, you might read in a *.csv in R and see some Chinese or Arabic characters in some values in some fields, but they aren’t supposed to be there. Something is interpreting something wrong, usually the file’s text encoding. If you clean that up and export as an *.rds, you won’t have that problem when you read in the dataset using R the next time.
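The variable-name surprises in the list above are easy to reproduce. The following is a small sketch in base R using a made-up three-column CSV written to a temporary file. The column names and values are hypothetical and only there for illustration.

```r
# Write a tiny, made-up CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "patient id,visit-date,notes",
  "1,2022-01-04,\"stable, no change\"",
  "2,2022-01-05,follow up"
), csv_path)

# Default read: R rewrites header names that are not valid R names.
df_default <- read.csv(csv_path)
names(df_default)
# "patient.id" "visit.date" "notes"

# check.names = FALSE keeps the header text exactly as it is in the file.
df_raw <- read.csv(csv_path, check.names = FALSE)
names(df_raw)

# If R is told there is no header row, it makes up names: V1, V2, V3.
df_nohead <- read.csv(csv_path, header = FALSE, skip = 1)
names(df_nohead)

# A comma inside a quoted value stays in one column;
# an unquoted comma would be treated as a column break.
df_default$notes[1]
```

None of this is a bug in R. It is just what happens when a plain-text format leaves R to guess at names, column breaks, and types.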
The solution to all these problems is to first read the dataset into R (for example, from *.csv format), edit it in R, and then export it as an *.rds file. Then, when you read it in next time, it will be in the format you want. You can watch those videos, and also access my example code on GitHub. So in the end, the purpose of using native formats in SAS and R is to make your data objects more compatible with the programs you are using, which improves your overall programming experience.
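As a rough sketch of that read, clean, and export workflow, here is one way it could look in base R. This is not the example code from my GitHub. The data, column names, and cleanup steps are hypothetical stand-ins for whatever your own dataset needs.

```r
# Hypothetical example data standing in for a messy CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "patient id,visit-date,group",
  "1,2022-01-04,control",
  "2,2022-01-05,treated"
), csv_path)

# 1. Read the CSV into R once.
df <- read.csv(csv_path, stringsAsFactors = FALSE)

# 2. Edit it in R: set clean names and the column types you actually want.
names(df) <- c("patient_id", "visit_date", "group")
df$visit_date <- as.Date(df$visit_date)
df$group <- factor(df$group, levels = c("control", "treated"))

# 3. Export the cleaned data frame as an *.rds file.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)

# 4. Next time, read the *.rds back in. Names, dates, and factor levels
#    come back as you left them, with no re-parsing of text.
df2 <- readRDS(rds_path)
str(df2)
```

In real use you would save to a permanent path rather than a temporary file, and step 4 would be the only step you need to run in later sessions.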
Updated January 4, 2022. Added GitHub link October 12, 2022.