Parses .Nesstar binary containers from India’s MoSPI (National Sample Survey rounds 64, 66, 68) into R data frames and exports them to CSV. Also works with NESSTAR files from other statistical agencies.

Documentation: https://nesstarr.saketlab.org

Installation

remotes::install_github("saketlab/nesstarR")

Quick start

library(nesstarR)

x <- nesstar_parse("path/to/file.Nesstar")
x
#> <nesstar_binary>
#>  File      : file.Nesstar
#>  Datasets  : 3

nesstar_datasets(x)
#>   dataset_number row_count variable_count
#> 1              1    500000             42
#> 2              2    120000             18
#> 3              3     80000             31

nesstar_variables(x, dataset_number = 1)

df <- nesstar_read_dataset(x, dataset_number = 1)
df_subset <- nesstar_read_dataset(x, dataset_number = 1,
                                   columns = c("AGE", "DISTRICT", "INCOME"))

Metadata

NESSTAR files embed Huffman-compressed XML with variable labels and category codes:

meta <- nesstar_metadata(x)

ds <- meta$datasets[[1]]
ds$file_name       # original survey file name
ds$variables[[1]]  # list(name, label, categories)

For older NSS-format files without XML blocks, nesstar_metadata() returns NULL with a warning.
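
As a rough illustration of working with this structure, the sketch below flattens one dataset's variables into a name/label lookup table. It relies only on the `name` and `label` fields mentioned above; the helper `variable_labels()` and the mock `meta` object are hypothetical and not part of the package, and the real metadata may carry additional or differently shaped fields.

```r
# Hypothetical stand-in for the result of nesstar_metadata(x); only the
# `name` and `label` fields are assumed to exist.
meta <- list(datasets = list(list(
  file_name = "example.txt",
  variables = list(
    list(name = "AGE", label = "Age in years"),
    list(name = "DISTRICT", label = "District code"),
    list(name = "INCOME", label = NULL)  # a variable without a label
  )
)))

# Hypothetical helper: build a name/label lookup for one dataset.
# Returns a zero-row data frame when the metadata is NULL.
variable_labels <- function(meta, dataset_index = 1L) {
  empty <- data.frame(name = character(0), label = character(0),
                      stringsAsFactors = FALSE)
  if (is.null(meta)) return(empty)
  vars <- meta$datasets[[dataset_index]]$variables
  if (length(vars) == 0) return(empty)

  # Extract one field as a single string, using NA when it is missing
  field <- function(v, key) {
    val <- v[[key]]
    if (is.null(val)) NA_character_ else as.character(val)[1]
  }

  data.frame(
    name  = vapply(vars, field, character(1), key = "name"),
    label = vapply(vars, field, character(1), key = "label"),
    stringsAsFactors = FALSE
  )
}

labels <- variable_labels(meta)
```

Passing `NULL` (the result for files without XML blocks) yields an empty table, so downstream code does not need a separate branch for older files.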

Export to CSV

nesstar_export() writes each dataset to its own CSV file (gzip-compressed unless compress = FALSE), reading 50,000 rows at a time by default to keep memory use low:

nesstar_export(x, output_dir = "./data")
# Writes: data/file_ds1.csv.gz, data/file_ds2.csv.gz, ...

nesstar_export(x, output_dir = "./data",
               datasets = c(1, 2), compress = FALSE, chunk_size = 10000L)
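
The chunked pattern itself can be sketched in base R. This is a general illustration of writing a gzip-compressed CSV in fixed-size row chunks, not the package's internal implementation; `write_csv_chunked()` and its `read_chunk` callback are hypothetical names, and a toy data frame stands in for a parsed dataset.

```r
# Hypothetical helper: stream rows to a gzip-compressed CSV in chunks.
# `read_chunk(from, to)` must return a data frame for rows from..to.
write_csv_chunked <- function(read_chunk, n_rows, path, chunk_size = 50000L) {
  stopifnot(n_rows >= 1, chunk_size >= 1)
  con <- gzfile(path, open = "wt")
  on.exit(close(con))

  starts <- seq(1, n_rows, by = chunk_size)
  for (i in seq_along(starts)) {
    from <- starts[i]
    to <- min(from + chunk_size - 1, n_rows)
    chunk <- read_chunk(from, to)
    # Write the header only with the first chunk; later chunks append rows
    write.table(chunk, con, sep = ",", row.names = FALSE,
                col.names = (i == 1), qmethod = "double")
  }
  invisible(path)
}

# Toy data standing in for a dataset that would otherwise be read from disk
toy <- data.frame(id = 1:25,
                  grp = rep(c("a", "b,c", "d", "e", "f"), 5),
                  stringsAsFactors = FALSE)
out <- tempfile(fileext = ".csv.gz")
write_csv_chunked(function(from, to) toy[from:to, , drop = FALSE],
                  n_rows = nrow(toy), path = out, chunk_size = 10L)

# read.csv() decompresses gzip files transparently
back <- read.csv(out, stringsAsFactors = FALSE)
```

Only one chunk is held in memory at a time, which is why a smaller chunk size lowers peak memory use at the cost of more write calls.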

Functions

Function                 Description
nesstar_parse()          Parse a .Nesstar file into a nesstar_binary object
nesstar_datasets()       List datasets with row/variable counts
nesstar_variables()      Variable names, types, and binary layout for a dataset
nesstar_read_dataset()   Read one dataset into a data frame (optional column subset)
nesstar_metadata()       Decode Huffman-compressed XML labels and category codes
nesstar_export()         Export datasets to CSV or CSV.gz (chunked)

Binary format

Columns come in two types: Mode 1 (fixed-width UTF-16LE strings, returned as character) and Mode 5 (packed numerics, ranging from 4-bit nibbles to IEEE 754 float64, returned as numeric). The parser auto-detects byte order and converts all strings to UTF-8. Files up to 5 GB are supported via 40-bit offset handling.
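
To make these encodings concrete, here is a minimal base R sketch of the low-level decoding steps. It is illustrative only and does not reproduce the package's parser: the helper names are hypothetical, little-endian byte order and NUL padding of string fields are assumptions, and the high-nibble-first ordering is a guess at the packing convention.

```r
# Decode one fixed-width UTF-16LE field to UTF-8.
# Assumption: unused trailing space is padded with NUL code units.
decode_utf16le <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) %% 2 == 0)
  units <- matrix(as.integer(bytes), nrow = 2)  # one column per code unit
  used <- which(colSums(units) > 0)
  if (length(used) == 0) return("")
  n <- max(used)                                # last non-NUL code unit
  iconv(list(bytes[seq_len(2 * n)]), from = "UTF-16LE", to = "UTF-8")
}

# Split each byte into two 4-bit values.
# Assumption: the high nibble comes first.
unpack_nibbles <- function(bytes, high_first = TRUE) {
  v <- as.integer(bytes)
  hi <- v %/% 16L
  lo <- v %% 16L
  if (high_first) as.vector(rbind(hi, lo)) else as.vector(rbind(lo, hi))
}

# Read a 40-bit unsigned offset from 5 little-endian bytes.
# A double holds integers exactly up to 2^53, so 40 bits fit safely.
read_uint40_le <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) == 5)
  sum(as.numeric(as.integer(bytes)) * 256^(0:4))
}

# Read one IEEE 754 float64 from 8 little-endian bytes.
read_float64_le <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) == 8)
  readBin(bytes, what = "double", n = 1L, size = 8L, endian = "little")
}

text_value <- decode_utf16le(as.raw(c(0x41, 0x00, 0x42, 0x00, 0x00, 0x00, 0x00, 0x00)))
nibbles    <- unpack_nibbles(as.raw(c(0x1F, 0xA0)))
offset     <- read_uint40_le(as.raw(c(0x00, 0x00, 0x00, 0x00, 0x01)))
number     <- read_float64_le(writeBin(1.5, raw(), endian = "little"))
```

The 40-bit case is the reason offsets are handled as doubles here: R integers are 32-bit, so an offset beyond 2 GB would overflow `integer` storage.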

License

MIT