Working with Data?

Data Re-use

Locating Data

Columbia University Libraries Information System (CULIS) Resources:
Other Resources:
  • re3data & Databib  Registry of Research Data Repositories: a global registry of research data repositories, now merged with Databib, a tool for helping people identify and locate online repositories of research data
  • ODiSEA  International Registry on Research Data: highlighting the combined efforts of six Spanish universities to make open access data available
  • BioSharing Registers “well-constituted efforts developing standards for describing and sharing biosciences experiments, ensuring these resources are informative and discoverable”
  • Nature  A list of recommended repositories for items published in Nature
  • WebProtege A free and open source ontology-development tool that is capable of “interacting, constructing and consuming knowledge” with approaching technologies.

Data Licenses

Data often have associated licenses that detail reuse permissions and restrictions. It is important to check these to make sure that you can use the data in the manner you wish. (Interested in licensing your own data? See here)


Data Storage Options Provided by Columbia

Research Data Storage, Sharing, and Transfer Options

Active storage @ Columbia

  • Columbia researchers in all areas generally use storage infrastructure that is provided at the school, department or division level by one of several local IT groups. If these resources are not available or adequate, researchers frequently use their own research money to build storage solutions, which could include staff time in addition to hardware, software, and a backup regimen.
  • CUIT provides all active Columbia UNI holders with a small default amount (20-80 MB) of home directory file storage. You can use this to create a basic personal homepage on Columbia’s central Unix system (called CUNIX). This space is backed up with a regimen that includes offsite copies. You can request an increase over the default amount; these requests are considered on a case-by-case basis.
  • CUIT provides a Central LAN service for a fee that is used primarily by some departments on campus so that their administrative staff have access to secure storage space managed by CUIT. This space is mounted on a user’s work PC and its contents are backed up with a regimen that includes offsite copies. This service uses Microsoft Exchange server technology which provides each user with a Microsoft Exchange or “Alpha” email account, calendar, contacts, etc., and each department with optional shared storage space.
  • Backup services: Although there is no centrally supported backup service, SpiderOak, a secure cloud-based computer backup solution, has signed a BAA with Columbia allowing for the storage of sensitive data. CUMC Security has certified SpiderOak for the backup of PII and PHI data. The certification was originally for use by a single research project, but MSPH IT will maintain it for individual users. SpiderOak supports a variety of plan types.

Archival storage @ Columbia

  • Looking to make your research outputs findable and accessible by others? Try Academic Commons, Columbia’s online research repository. You may deposit individual files of up to 10 GB in size at no charge. When depositing individual files larger than 100 MB, please contact CDRS for upload arrangements.
  • In support of the NSF requirements for data management and sharing plans, Academic Commons will also accept individual files of more than 10 GB, and up to 100 GB, for a one-time charge of $5 per GB, payable at the time of deposit.
  • Researchers expecting to preserve files larger than 100 GB should discuss their special needs with CDRS.
  • Contact the appropriate Columbia Libraries archive to discuss the options for archiving physical data.
  • For other storage options see here.
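
To budget for a deposit, the Academic Commons size tiers above can be sketched as a small calculation. This is a rough illustration only: the function name `deposit_fee_usd` is hypothetical, and it assumes that the $5 per GB charge applies to the full size of a file in the 10 GB to 100 GB tier, which the text above does not spell out. Confirm actual charges with CDRS.

```python
from typing import Optional

FREE_LIMIT_GB = 10     # files up to this size are deposited at no charge
PAID_LIMIT_GB = 100    # larger files need to be discussed with CDRS
RATE_USD_PER_GB = 5    # one-time charge per GB in the paid tier


def deposit_fee_usd(size_gb: float) -> Optional[float]:
    """Estimate the one-time fee, or return None if CDRS must be consulted."""
    if size_gb <= 0:
        raise ValueError("file size must be positive")
    if size_gb <= FREE_LIMIT_GB:
        return 0.0
    if size_gb <= PAID_LIMIT_GB:
        # Assumption: the rate applies to the whole file,
        # not only to the portion above 10 GB.
        return float(size_gb * RATE_USD_PER_GB)
    return None
```

Under these assumptions, `deposit_fee_usd(25)` gives 125.0, while `deposit_fee_usd(250)` returns `None` to signal that the file falls outside the published tiers.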

Storage media for long-term data archiving

There are a number of options that have advantages and disadvantages. They include:

  • Hard drives
  • Tape backups
  • Servers
  • Cloud storage services

CDs and DVDs are not good options for long-term storage. They have a maximum life span of five to ten years.

Columbia’s Evolving Research Data Storage Strategy

Research Data Storage Approaches at Columbia

Launch of Shared Research Computing Facility

Managing the Big Data Explosion (EDUCAUSE Review)

File Format

Recommended formats for long-term data archiving

(Adapted from MIT,

  • Nonproprietary
  • Uncompressed
  • Unencrypted
  • Commonly used by your research community
  • Using an open, documented standard


  • Audio: AIFF, MP3, MXF, WAVE
  • Bundlers or Containers: BagIt, GZIP, TAR, StuffIt, ZIP
  • Databases: CSV, XML
  • Geospatial: DBF, GeoTIFF, NetCDF
  • Images, moving: AVI, MOV, MPEG, MXF
  • Images, still: BMP, GIF, JPEG 2000, PDF, PNG, TIFF
  • Statistics: ASCII, DTA, POR, SAS, SAV
  • Tabular data: CSV
  • Text: ASCII, HTML, PDF/A, UTF-8, XML
  • Web archive: WARC
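
One way to apply the list above is to scan a project folder and flag files that are not in one of the listed formats. The sketch below is illustrative only: the extension set is an assumed, partial mapping from the format names above to common file extensions, and `flag_unlisted_formats` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List

# Assumed mapping from formats named above to common extensions (not exhaustive).
# An extension alone cannot tell PDF/A from ordinary PDF, so ".pdf" is accepted as-is.
LISTED_EXTENSIONS = {
    ".aiff", ".mp3", ".wav", ".mxf", ".csv", ".xml", ".tif", ".tiff",
    ".png", ".gif", ".bmp", ".jp2", ".txt", ".html", ".pdf", ".nc",
    ".warc", ".tar", ".gz", ".zip",
}


def flag_unlisted_formats(root: str) -> List[Path]:
    """Return files under root whose extension is not in the assumed set."""
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        # Compare case-insensitively so "SCAN.TIF" matches ".tif".
        if path.is_file() and path.suffix.lower() not in LISTED_EXTENSIONS:
            flagged.append(path)
    return flagged
```

A flagged file is not necessarily a problem. It is a prompt to check whether the format is open and documented, or whether a copy should be exported to one of the formats above before archiving.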


Also consider preserving the software needed to work with the data, or how easily the data can be migrated to open file formats.

USGS Data Management: Data & File Format

Library of Congress Recommended Format Specifications

There need not be a digital dark age – how to save our data for the future by Matthew Woollard

Describing Data (Metadata)


Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. There are three main types: descriptive, administrative, and structural.

In order to help make your data usable and accessible to you and others in the future, you need to create and archive accurate metadata along with your data.

Some content

(adapted from Cornell University,

  • Contact information
  • Geographic locations
  • Details about units of measure
  • Abbreviations or codes used in the dataset instrument
  • Protocol information
  • Survey tool details
  • Provenance
  • Version information
  • And much more…

Best Practice

(Adapted from

  • Make metadata central to your study design or research project. Adding metadata after the fact is expensive and time-consuming.
  • Use existing metadata standards and controlled vocabularies where possible.
  • At the very least, supply these Dublin Core metadata elements (or related elements in other metadata schemes):
      • Creator name(s)
      • Title of dataset
      • File information (what programs are needed to open and work with the data)
      • Methodology
  • To increase discoverability (as well as compliance with grant requirements), supply these suggested metadata:
      • Permanent identifier(s) (DOIs, PubMed IDs, etc.)
      • External link(s)/URI(s) (if multiple copies of this dataset exist in other databases or websites)
      • Coverage/creation dates
      • Grant number(s)
  • Consider supplying preservation metadata (e.g. technical specifications, MD5 checksums, etc.) in addition to general metadata that describes the data set.
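
As a rough illustration of the practices above, a small metadata record can be stored next to a data file as a JSON "sidecar". The sketch below is an example under stated assumptions, not a prescribed schema: the field names loosely follow the elements listed above, and `md5_checksum`, `build_metadata_record`, and `record_to_json` are hypothetical helper names.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable


def md5_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata_record(path: Path, creators: Iterable[str], title: str,
                          file_info: str, methodology: str) -> dict:
    """Assemble a minimal record; field names are illustrative only."""
    return {
        "creator": list(creators),
        "title": title,
        "file_information": file_info,
        "methodology": methodology,
        # Preservation metadata: lets a future reader verify the file is intact.
        "preservation": {
            "file_name": path.name,
            "size_bytes": path.stat().st_size,
            "md5": md5_checksum(path),
        },
    }


def record_to_json(record: dict) -> str:
    """Serialize the record as readable JSON for a sidecar file."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Recomputing the checksum later and comparing it with the stored value is a simple way to detect silent corruption of an archived file.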


Data Description Standards by the DCC

USGS Data Management Describing Data – Metadata


Data Security Policy at Columbia

See the Computing and Technology section of the Columbia Administrative Policy Library for relevant policies. These include:

Best practices for data security

Physical Security

  • Restrict access to computers, offices, and storage media.
  • Store lab notebooks and samples in locked cabinets.
  • Only let trusted individuals troubleshoot computer problems.
  • Use appropriate environmental controls.

Network Security

  • Keep confidential and sensitive data on computers not connected to the Internet.
  • Keep virus protection up to date.
  • Do not send confidential data via email or FTP (and if you must, use encryption).
  • Use passwords on files and computers.
  • Dispose of data properly at the end of the retention period.


  • Any portable device (flash drive, laptop) used to store confidential or identifying data must be encrypted.
  • Data files with protected health information must be encrypted.
  • Keep passwords and keys on paper in a secure location and in an encrypted file.

Online Trust Alliance (OTA) Security and Privacy Enhancing Best Practices

NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy