Find answers to common questions about:
Data Management and Storage
Who is responsible for data management?
Columbia’s Retention and Access to Research Data Policy (Columbia UNI required) names the Principal Investigator (PI) of a research project as responsible for determining what data need to be retained and for setting up systems for organizing and archiving project data.
What is a data management plan?
A data management plan (DMP) is a brief document that outlines how you will collect, organize, manage, store, secure, back up, preserve, and share your data.
Am I required to create a DMP for my grant application?
Check with your research sponsor. Programs that currently require DMPs include:
- National Science Foundation (NSF): All directorates
- National Endowment for the Humanities (NEH), Office of Digital Humanities: Digital Humanities Implementation Grants
How should I structure my DMP?
A template that can help you create a DMP is available on the NSF Data Management Plan Requirements page. Although this template was created in response to the NSF’s data management and sharing policies, it can serve as a starting point for any DMP.
Does Columbia have a policy on data retention?
Yes. You can find the Retention and Access to Research Data policy on the website of the Office of the Executive Vice President for Research. The policy includes the following statement: “Research data must be archived for a minimum of three years after the final project close-out, with original data retained wherever possible.”
Be aware that other policies and laws, such as the Health Insurance Portability and Accountability Act (HIPAA), may require a longer period of data retention.
What data storage options does Columbia provide?
An overview of the options currently available at Columbia:
- Columbia researchers in all areas generally use storage infrastructure that is provided at the school, department or division level by one of several local IT groups. If these resources are not available or adequate, researchers frequently use their own research money to build storage solutions, which could include staff time in addition to hardware, software, and a backup regimen.
- CUIT provides all active Columbia UNI holders with a small default amount (20-80 MB) of home directory file storage. You can use this to create a basic personal homepage on Columbia’s central Unix system (called CUNIX). This space is backed up with a regimen that includes offsite copies. You can request an increase over the default amount by writing to firstname.lastname@example.org. These requests are considered on a case-by-case basis.
- For a fee, CUIT provides a Central LAN service, used primarily by departments that want their administrative staff to have access to secure storage space managed by CUIT. This space is mounted on a user’s work PC and its contents are backed up with a regimen that includes offsite copies. This service uses Microsoft Exchange server technology, which provides each user with a Microsoft Exchange or “Alpha” email account, calendar, contacts, etc., and each department with optional shared storage space.
- In Academic Commons, Columbia’s online research repository, you may deposit individual files of up to 10 GB in size at no charge. When depositing individual files larger than 100 MB, please contact CDRS for upload arrangements.
- In support of the NSF requirements for data management and sharing plans, Academic Commons will also accept individual files of more than 10 GB, and up to 100 GB, for a one-time charge of $5 per GB, payable at the time of deposit.
- Researchers expecting to preserve files larger than 100 GB should discuss their special needs with CDRS.
- Contact the appropriate Columbia Libraries archive to discuss the options for archiving physical data.
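The Academic Commons size limits and charges listed above can be turned into a quick estimate. The following is a minimal sketch, not an official calculator: the function name `deposit_charge_usd` is hypothetical, and it assumes that the $5-per-GB rate applies to the full size of a file over 10 GB rather than only to the portion above 10 GB. Confirm the actual charge with CDRS before budgeting.

```python
def deposit_charge_usd(file_size_gb: float) -> float:
    """Estimate the one-time Academic Commons charge for a single file.

    Assumption (not confirmed by the policy text): the $5-per-GB rate
    applies to the file's full size, not only to the part above 10 GB.
    """
    if file_size_gb <= 0:
        raise ValueError("file size must be positive")
    if file_size_gb <= 10:
        # Files of up to 10 GB are deposited at no charge.
        return 0.0
    if file_size_gb <= 100:
        # Files over 10 GB and up to 100 GB: one-time charge of $5 per GB.
        return 5.0 * file_size_gb
    # Larger files are outside the published rate.
    raise ValueError("files over 100 GB need special arrangements with CDRS")
```

Under these assumptions, `deposit_charge_usd(40)` returns `200.0`, and a 150 GB file raises an error as a reminder to contact CDRS.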
What are the best storage media for long-term data archiving?
There are a number of options, which all have advantages and disadvantages. They include:
- Hard drives
- Tape backups
- Cloud storage services
What about CDs or DVDs?
CDs and DVDs are not good options for long-term storage. They have a life span of five to ten years at most.
What are the best formats for long-term data archiving?
Choose formats that are:
- Commonly used by your research community
- Based on an open, documented standard
Examples by data type:
- Audio: AIFF, MP3, MXF, WAVE
- Bundlers or Containers: BagIt, GZIP, TAR, StuffIt, ZIP
- Databases: CSV, XML
- Geospatial: DBF, GeoTIFF, NetCDF
- Images, moving: AVI, MOV, MPEG, MXF
- Images, still: BMP, GIF, JPEG 2000, PDF, PNG, TIFF
- Statistics: ASCII, DTA, POR, SAS, SAV
- Tabular data: CSV
- Text: ASCII, HTML, PDF/A, UTF-8, XML
- Web archive: WARC
Considerations: Preservation of the software needed to work with the data, or ease of migration to open file format types
More information: Library of Congress’ Sustainability of Digital Formats
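One way to apply the format list above is to scan a project folder's file names and flag anything that may need migration before archiving. This is a minimal sketch: the extension-to-category mapping in `PREFERRED_EXTENSIONS` is an assumption made for illustration (the list above names formats, not file extensions), it covers only some of the listed formats, and the function name is hypothetical.

```python
from pathlib import Path

# Hypothetical mapping from common file extensions to the categories above.
# It is illustrative and incomplete; extend it for your own discipline.
PREFERRED_EXTENSIONS = {
    ".aiff": "Audio", ".wav": "Audio", ".mp3": "Audio",
    ".tar": "Bundlers or Containers", ".gz": "Bundlers or Containers",
    ".zip": "Bundlers or Containers",
    ".nc": "Geospatial", ".dbf": "Geospatial",
    ".png": "Images, still", ".tif": "Images, still", ".tiff": "Images, still",
    ".csv": "Tabular data",
    ".txt": "Text", ".html": "Text", ".xml": "Text",
    ".warc": "Web archive",
}


def files_needing_review(filenames):
    """Return the names whose extension is not in PREFERRED_EXTENSIONS."""
    return [
        name for name in filenames
        if Path(name).suffix.lower() not in PREFERRED_EXTENSIONS
    ]
```

For example, `files_needing_review(["results.csv", "notes.docx", "scan.TIF"])` returns `["notes.docx"]`. A flagged file is not necessarily unsuitable; it is simply a prompt to check whether an open, documented alternative exists.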
What are best practices for backing up data?
Ideally, you should have three copies:
- the original
- a copy kept locally on a different device or media
- a copy kept remotely (preferably in another geographic area to serve as a backup in case of a local catastrophic event)
You should also have a schedule for regularly checking on the accessibility of backup copies, and a plan for migrating data to new file or media formats.
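Regularly checking backup copies can be partly automated by comparing checksums. The sketch below is one possible approach, not a Columbia-provided tool: it computes a SHA-256 digest of the original file and reports whether each copy is present and identical. The function names `sha256_of` and `check_copies` are hypothetical, and the copies are assumed to be reachable as ordinary file paths (for example, a mounted external drive or network share).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(original, copies):
    """Map each copy path to 'ok', 'missing', or 'mismatch'."""
    expected = sha256_of(original)
    report = {}
    for copy in copies:
        copy = Path(copy)
        if not copy.is_file():
            report[str(copy)] = "missing"
        elif sha256_of(copy) == expected:
            report[str(copy)] = "ok"
        else:
            report[str(copy)] = "mismatch"
    return report
```

Running `check_copies` on a schedule, and recording the results, gives early warning when a local or remote copy has gone missing or has been altered.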
What is metadata?
Metadata is structured information that describes, explains, locates, and otherwise makes it easier to retrieve and use an information resource. There are three main types: descriptive, administrative, and structural.
In order to help make your data usable, and accessible to you and others in the future, you need to create and archive accurate metadata along with your data.
Check out some of the data description / metadata schemas listed here: Data Description Standards by the DCC
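To make the three metadata types concrete, the sketch below builds a small record that could be saved as a JSON file alongside a dataset. It is an illustration only: the field names are hypothetical and do not follow any particular schema, and all of the example values are invented. For real deposits, use the schema recommended by your discipline or repository.

```python
import json


def build_metadata_record(title, creator, created, file_format, files):
    """Group a few illustrative fields under the three metadata types."""
    return {
        # Descriptive: helps people find and identify the data.
        "descriptive": {"title": title, "creator": creator},
        # Administrative: helps manage the data (dates, formats, rights).
        "administrative": {"date_created": created, "format": file_format},
        # Structural: records how the parts of the dataset fit together.
        "structural": {"files": list(files)},
    }


# Hypothetical example values.
record = build_metadata_record(
    title="Example survey responses",
    creator="A. Researcher",
    created="2024-01-15",
    file_format="CSV",
    files=["responses_part1.csv", "responses_part2.csv"],
)
print(json.dumps(record, indent=2))
```

Storing the record in a plain-text format such as JSON keeps the metadata readable without special software.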
Does Columbia have policies about data security?
- Business Continuity and Recovery Policy
- Data Classification Policy
- Data Sanitation/Disposal of Electronic Equipment Policy
- Electronic Data Security Breach Reporting and Response
- Electronic Information Resources Security Policy
- Encryption Policy
- Email Usage Policy
- Information Resources Usage Policy
- Information Security Charter
- Information Security Risk Management Policy
- Network Protection Policy
- Access Control and Log Management
- Registration and Protection of Systems
- Registration and Protection of Endpoints
- Sanitization and Disposal of Information Resources
What are best practices for data security?
- Restrict access to computers, offices, and storage media.
- Store lab notebooks and samples in locked cabinets.
- Only let trusted individuals troubleshoot computer problems.
- Use appropriate environmental controls.
- Keep confidential and sensitive data on computers not connected to the Internet.
- Keep virus protection up to date.
- Do not send confidential data via email or FTP (and if you must, use encryption).
- Use passwords on files and computers.
- Dispose of data properly at the end of the retention period.
- Any portable device (flash drive, laptop) used to store confidential or identifying data must be encrypted.
- Data files with protected health information must be encrypted.
- Keep passwords and keys on paper in a secure location and in an encrypted file.
What data encryption options does Columbia offer?
You can obtain the following from CUIT. Some costs may be involved:
- GuardianEdge Hard Disk Encryption
- Savant Protection application whitelisting software
Other options include:
- BitLocker (for Windows removable storage devices)
- Encrypting File System (Windows native)
- FileVault (Mac native)
- 7-Zip (Windows)
- TrueCrypt (Windows, OS X, Linux)
Which research sponsors require data sharing?
The requirement affecting the largest number of Columbia researchers is the NSF data sharing policy. As part of this policy, the NSF requires that all grant applications include a two-page data management plan. For more information, see our page on the NSF data management policy requirements.
Other funders with data sharing requirements include the National Institutes of Health (for projects with $500,000 or more of direct costs in any one year), the Howard Hughes Medical Institute, and the Wellcome Trust.
Do journals require data sharing?
Am I required to archive raw, processed, or final research data?
Check the requirements of your funder or publisher. Journal sharing policies usually require that the data underlying your published article be made available. The NSF policy, however, potentially encompasses a much broader definition of data, and “may include, but is not limited to: data, publications, samples, physical collections, software and models.”
Where can I put my data so that it is accessible to others?
One good option is an online repository. Data and institutional repositories are available for many types of data. Columbia’s repository, Academic Commons, accepts data from any field. Learn more about depositing data in Academic Commons here.
Subject-based repositories are available for researchers in a range of disciplines. Learn more on the Data Repositories page.
Is data under copyright?
Copyright and ownership questions around data can be complex and vary depending on the jurisdiction. In the U.S., facts cannot be copyrighted, but an original compilation of facts, such as a database, may be copyrightable. Expressive data (data involving a certain amount of judgement or creativity) or an expressive representation of data, such as a graph, is copyrightable.
See “Facts and Non-Creative Works” on this page of the Copyright Advisory Office website.
I am the Principal Investigator on a sponsored research project at Columbia. What rights do I have to the data that result from the research project?
Though sponsors grant research funds to the Trustees of Columbia University, usually the PI acts as steward of the research data and makes decisions on its use and distribution within the parameters of sponsor and Columbia guidelines.
See the Intellectual Property section under “Obligations and Responsibilities of Officers of Instruction and Research” in the Faculty Handbook for more information on Columbia policies. Make sure you are aware of your obligations under these policies and those of your research sponsor.
Can I take the data from the research projects on which I was a PI if I leave Columbia?
You should not assume you can take data with you. Whether or not you can do so depends on many factors, including the status of the project and the policies of the research sponsor and Columbia. Contact Sponsored Projects Administration for more information.
I am a post-doctoral researcher at Columbia. What rights do I have to the data that result from my research activities?
The answer to this question depends on the specific circumstances of your research. Talk to your work supervisor or contact the Office of Postdoctoral Affairs.