Parses .Nesstar binary containers from India’s MoSPI (National Sample Survey rounds 64, 66, 68) into R data frames and exports them to CSV. Also works with NESSTAR files from other statistical agencies.
Documentation: https://nesstarr.saketlab.org
Installation
remotes::install_github("saketlab/nesstarR")
Quick start
library(nesstarR)
x <- nesstar_parse("path/to/file.Nesstar")
x
#> <nesstar_binary>
#> File : file.Nesstar
#> Datasets : 3
nesstar_datasets(x)
#> dataset_number row_count variable_count
#> 1 1 500000 42
#> 2 2 120000 18
#> 3 3 80000 31
nesstar_variables(x, dataset_number = 1)
df <- nesstar_read_dataset(x, dataset_number = 1)
df_subset <- nesstar_read_dataset(x, dataset_number = 1,
                                  columns = c("AGE", "DISTRICT", "INCOME"))
Metadata
NESSTAR files embed Huffman-compressed XML with variable labels and category codes:
meta <- nesstar_metadata(x)
ds <- meta$datasets[[1]]
ds$file_name # original survey file name
ds$variables[[1]] # list(name, label, categories)
For older NSS-format files without XML blocks, nesstar_metadata() returns NULL with a warning.
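The nested metadata list is often easier to use as a flat lookup table. The following is a minimal sketch of that, run against a mock `meta` object so it works without a .Nesstar file. The helper `variable_labels()` is hypothetical (not part of nesstarR), and the shape of `categories` (a named character vector) is an assumption; only its length is used here.

```r
# Mock object shaped like the list(name, label, categories) entries above.
# In real use, `meta` would come from nesstar_metadata(x).
meta <- list(datasets = list(list(
  file_name = "example.txt",
  variables = list(
    list(name = "SEX", label = "Sex of respondent",
         categories = c("1" = "Male", "2" = "Female")),  # assumed shape
    list(name = "AGE", label = "Age in years", categories = NULL)
  )
)))

# Hypothetical helper: one row per variable with its label and category count.
variable_labels <- function(meta, dataset_number = 1) {
  if (is.null(meta)) {
    # Files without XML blocks yield NULL metadata: return an empty table.
    return(data.frame(name = character(), label = character(),
                      n_categories = integer(), stringsAsFactors = FALSE))
  }
  vars <- meta$datasets[[dataset_number]]$variables
  data.frame(
    name = vapply(vars, function(v) v$name, character(1)),
    label = vapply(vars, function(v) v$label, character(1)),
    n_categories = vapply(vars, function(v) length(v$categories), integer(1)),
    stringsAsFactors = FALSE
  )
}

labels <- variable_labels(meta)
labels
```

Checking for NULL before indexing keeps the same code usable for files that carry no XML metadata.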
Export to CSV
nesstar_export() writes each dataset to its own CSV file, reading 50,000 rows at a time to keep memory use low:
nesstar_export(x, output_dir = "./data")
# Writes: data/file_ds1.csv.gz, data/file_ds2.csv.gz, ...
nesstar_export(x, output_dir = "./data",
               datasets = c(1, 2), compress = FALSE, chunk_size = 10000L)
Functions
| Function | Description |
|---|---|
| nesstar_parse() | Parse a .Nesstar file into a nesstar_binary object |
| nesstar_datasets() | List datasets with row/variable counts |
| nesstar_variables() | Variable names, types, and binary layout for a dataset |
| nesstar_read_dataset() | Read one dataset into a data frame (optional column subset) |
| nesstar_metadata() | Decode Huffman-compressed XML labels and category codes |
| nesstar_export() | Export datasets to CSV or CSV.gz (chunked) |
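The chunked export listed for nesstar_export() amounts to appending one block of rows at a time to an open, optionally compressed, connection. Below is a rough base R sketch of that pattern, not the package's implementation. `write_csv_chunked()` is a hypothetical helper, and the in-memory data frame stands in for rows that would be read from the binary file chunk by chunk.

```r
# Hypothetical helper (not part of nesstarR): append rows to a gzip-compressed
# CSV in blocks of `chunk_size`, writing the header with the first block only.
write_csv_chunked <- function(df, path, chunk_size = 50000L) {
  con <- gzfile(path, open = "wt")
  on.exit(close(con), add = TRUE)
  n <- nrow(df)
  start <- 1L
  while (start <= n) {
    end <- min(start + chunk_size - 1L, n)
    write.table(df[start:end, , drop = FALSE], con, sep = ",",
                row.names = FALSE, col.names = (start == 1L),
                qmethod = "double")
    start <- end + 1L
  }
  invisible(path)
}

# Usage: 25 rows written in chunks of 10, then read back for a round trip.
df <- data.frame(id = 1:25, value = sqrt(1:25))
out <- file.path(tempdir(), "example_ds1.csv.gz")
write_csv_chunked(df, out, chunk_size = 10L)
back <- read.csv(out)
```

Because only one chunk is formatted at a time, peak memory for the write step depends on `chunk_size` rather than on the total number of rows.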
Binary format
The format stores two column types: Mode 1 (fixed-width UTF-16LE strings, returned as character) and Mode 5 (packed numerics, ranging from 4-bit nibbles to IEEE 754 float64, returned as numeric). The parser auto-detects byte order and converts all strings to UTF-8. Files up to 5 GB are supported via 40-bit offset handling.
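The building blocks mentioned above can be illustrated in base R. This is a sketch under stated assumptions, not the package's decoder: all helper names are hypothetical, the high-nibble-first order is assumed, and little-endian is used as the default byte order. On offsets: a 32-bit offset tops out at 2^32 bytes (4 GiB), so addressing a 5 GB file needs more bits, and 40 bits reach 2^40 bytes.

```r
# Split each byte into two 4-bit values.
# Assumption: the high nibble comes first; the real layout may differ.
unpack_nibbles <- function(bytes) {
  b <- as.integer(bytes)
  as.vector(rbind(b %/% 16L, b %% 16L))
}

# Decode a 5-byte (40-bit) unsigned offset into a double.
# Doubles hold integers exactly up to 2^53, so 40 bits fit without loss.
read_uint40 <- function(bytes, endian = c("little", "big")) {
  endian <- match.arg(endian)
  stopifnot(length(bytes) == 5L)
  if (endian == "big") bytes <- rev(bytes)
  sum(as.numeric(as.integer(bytes)) * 256^(0:4))
}

# Convert raw UTF-16LE bytes to a UTF-8 string.
utf16le_to_utf8 <- function(bytes) {
  iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
}

# Read one IEEE 754 float64 from 8 raw bytes.
read_float64 <- function(bytes, endian = "little") {
  readBin(bytes, what = "double", n = 1L, size = 8L, endian = endian)
}

nib <- unpack_nibbles(as.raw(c(0x12, 0xAB)))                    # 1 2 10 11
off <- read_uint40(as.raw(c(0x00, 0x00, 0x00, 0x00, 0x01)))     # 2^32
txt <- utf16le_to_utf8(as.raw(c(0x41, 0x00, 0x47, 0x00, 0x45, 0x00)))  # "AGE"
one <- read_float64(as.raw(c(0, 0, 0, 0, 0, 0, 0xF0, 0x3F)))    # 1
```

The last offset example is the first value a 32-bit field cannot represent, which is why a wider offset is needed for files above 4 GiB.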