Working with Data?
- Data + Design offers a great starting point for understanding, collecting (survey design), cleaning, transforming, and visualizing data: https://infoactive.co/data-design
- Also, take a look at the Tools & Resources section!
Columbia University Libraries Information System (CULIS) Resources:
- Columbia Data Repositories
- Columbia University Library data collections include spatial (GIS) and numeric data catalogs. The Data Service within the Digital Social Science Center (DSSC) focuses on users who do quantitative work and need assistance with either GIS or statistical software.
- re3data & Databib Registry of Research Data Repositories: a global registry of research data repositories, merged with Databib, a tool for helping people identify and locate online repositories of research data
- ODiSEA International Registry on Research Data: highlighting the combined efforts of six Spanish universities to make open access data available
- BioSharing: registers “well-constituted efforts developing standards for describing and sharing biosciences experiments, ensuring these resources are informative and discoverable”
- Nature: a list of recommended repositories for items published in Nature
- WebProtege: a free and open source ontology-development tool that is capable of “interacting, constructing and consuming knowledge” with approaching technologies.
Data often have associated licenses that detail reuse permissions and restrictions. It is important to check these to make sure that you can use the data in the manner you wish. (Interested in licensing your own data? See here)
Data Storage Options Provided by Columbia
Active storage @ Columbia
- Columbia researchers in all areas generally use storage infrastructure that is provided at the school, department or division level by one of several local IT groups. If these resources are not available or adequate, researchers frequently use their own research money to build storage solutions, which could include staff time in addition to hardware, software, and a backup regimen.
- CUIT provides all active Columbia UNI holders with a small default amount (20-80 MB) of home directory file storage. You can use this to create a basic personal homepage on Columbia’s central Unix system (called CUNIX). This space is backed up with a regimen that includes offsite copies. You can request an increase over the default amount by emailing email@example.com. These requests are considered on a case-by-case basis.
- CUIT provides a Central LAN service for a fee. It is used primarily by departments on campus so that their administrative staff have access to secure storage space managed by CUIT. This space is mounted on a user’s work PC, and its contents are backed up with a regimen that includes offsite copies. The service uses Microsoft Exchange server technology, which provides each user with a Microsoft Exchange or “Alpha” email account, calendar, contacts, etc., and each department with optional shared storage space.
- Backup services: Although there is no centrally supported backup service, SpiderOak, a secure cloud-based computer backup solution, has signed a Business Associate Agreement (BAA) with Columbia that allows for the storage of sensitive data. CUMC Security has certified SpiderOak for the backup of PII and PHI data, originally for use by a single research project, but MSPH IT will maintain the certification for individual users. SpiderOak supports a variety of plan types, which may be reviewed here: https://spideroak.com
Archival storage @ Columbia
- Looking to make your research outputs findable and accessible by others? Try Academic Commons, Columbia’s online research repository. You may deposit individual files of up to 10 GB in size at no charge. When depositing individual files larger than 100 MB, please contact CDRS for upload arrangements.
- In support of the NSF requirements for data management and sharing plans, Academic Commons will also accept individual files of more than 10 GB, and up to 100 GB, for a one-time charge of $5 per GB, payable at the time of deposit.
- Researchers expecting to preserve files larger than 100 GB should discuss their special needs with CDRS.
- Contact the appropriate Columbia Libraries archive to discuss the options for archiving physical data.
- For other storage options see here.
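The Academic Commons size tiers described above can be sketched as a small lookup. This is only an illustration: the function name `deposit_fee` is hypothetical, and the text does not say whether the $5 per GB applies to the whole file or only to the portion over 10 GB. The sketch assumes the whole file, so confirm the actual charge with CDRS before budgeting.

```python
FREE_LIMIT_GB = 10    # files up to this size are deposited at no charge
PAID_LIMIT_GB = 100   # largest file accepted for the per-GB charge
RATE_PER_GB = 5       # one-time charge in US dollars per GB


def deposit_fee(size_gb):
    """Return the one-time fee in dollars, or None if CDRS must be consulted.

    Assumption (not stated in the guide): the per-GB rate applies to the
    whole file, not only to the part above 10 GB.
    """
    if size_gb <= 0:
        raise ValueError("file size must be positive")
    if size_gb <= FREE_LIMIT_GB:
        return 0
    if size_gb <= PAID_LIMIT_GB:
        return RATE_PER_GB * size_gb
    return None  # larger than 100 GB: discuss special needs with CDRS


for size in (2, 40, 250):
    print(size, "GB ->", deposit_fee(size))
```

A result of `None` marks the case where no fee can be computed because files larger than 100 GB need a separate conversation with CDRS.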
Storage media for long-term data archiving
Several options are available, each with its own advantages and disadvantages. They include:
- Hard drives
- Tape backups
- Cloud storage services
CDs and DVDs are not good options for long-term storage. They have a life span of five to 10 years maximum.
Recommended formats for long-term data archiving
(Adapted from MIT, http://libraries.mit.edu/data-management/store/formats/)
Prefer formats that are:
- Commonly used by your research community
- Based on an open, documented standard
Examples by data type:
- Audio: AIFF, MP3, MXF, WAVE
- Bundlers or Containers: BagIt, GZIP, TAR, StuffIt, ZIP
- Databases: CSV, XML
- Geospatial: DBF, GeoTIFF, NetCDF
- Images, moving: AVI, MOV, MPEG, MXF
- Images, still: BMP, GIF, JPEG 2000, PDF, PNG, TIFF
- Statistics: ASCII, DTA, POR, SAS, SAV
- Tabular data: CSV
- Text: ASCII, HTML, PDF/A, UTF-8, XML
- Web archive: WARC
Also consider preserving the software needed to work with the data, or how easily the data can be migrated to open file formats.
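As one illustration of moving data into the formats listed above, the following minimal sketch writes a small table to CSV with UTF-8 encoding and wraps it in a TAR container, using only the Python standard library. The function name `export_and_bundle`, the file names, and the sample rows are all hypothetical placeholders.

```python
import csv
import tarfile
import tempfile
from pathlib import Path


def export_and_bundle(rows, header, out_dir):
    """Write rows to a UTF-8 CSV file and wrap it in an uncompressed TAR archive."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "measurements.csv"  # hypothetical file name
    tar_path = out_dir / "dataset.tar"       # hypothetical archive name

    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

    # Mode "w" writes a plain TAR; "w:gz" would add GZIP compression
    with tarfile.open(tar_path, "w") as archive:
        archive.add(csv_path, arcname=csv_path.name)
    return csv_path, tar_path


with tempfile.TemporaryDirectory() as tmp:
    csv_file, tar_file = export_and_bundle(
        rows=[("site-1", 21.5), ("site-2", 19.8)],
        header=("site", "temperature_c"),
        out_dir=tmp,
    )
    with tarfile.open(tar_file) as archive:
        print(archive.getnames())
```

The demonstration writes to a temporary directory that is deleted afterwards; for a real dataset, point `out_dir` at the folder you intend to archive.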
Describing Data (Metadata)
Metadata is structured information that describes, explains, locates, and otherwise makes it easier to retrieve and use an information resource. There are three main types: descriptive, administrative, and structural.
In order to help make your data usable and accessible to you and others in the future, you need to create and archive accurate metadata along with your data.
Details worth recording include (adapted from Cornell University, http://data.research.cornell.edu/content/writing-metadata):
- Contact information
- Geographic locations
- Details about units of measure
- Abbreviations or codes used in the dataset instrument
- Protocol information
- Survey tool details
- Version information
- And much more…
(Adapted from http://libraries.iub.edu/describing-data-metadata)
- Make metadata central to your study design or research project. Adding metadata after the fact is expensive and time-consuming.
- Use existing metadata standards and controlled vocabularies where possible.
- At the very least, supply these Dublin Core metadata elements (or related elements in other metadata schemes):
  - Creator name(s)
  - Title of dataset
  - File Information (what programs are needed to open and work with the data)
- To improve discoverability (as well as compliance with grant requirements), supply these suggested metadata:
  - Permanent Identifier(s) (DOIs, PubMed IDs, etc.)
  - External link(s)/URI(s) (if multiple copies of this dataset exist in other databases or websites)
  - Coverage/Creation Dates
  - Grant Number(s)
- Consider supplying preservation metadata (e.g. technical specifications, MD5 checksums) in addition to general metadata that describes the data set.
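A minimal sketch of how the recommendations above might be put into practice: it computes an MD5 checksum for a data file and stores it next to a few descriptive elements in a simple record. The function names, the dictionary keys, and the sample values are illustrative assumptions, not a formal Dublin Core serialization. Use the schema your repository or research community expects.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(path, creator, title, software):
    """Collect descriptive and preservation metadata for one file.

    The key names are hypothetical; they loosely mirror the elements listed above.
    """
    path = Path(path)
    return {
        "creator": creator,
        "title": title,
        "file_information": software,  # programs needed to open the data
        "file_name": path.name,
        "size_bytes": path.stat().st_size,
        "md5": md5_checksum(path),
    }


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "survey.csv"  # hypothetical data file
    data_file.write_text("id,answer\n1,yes\n", encoding="utf-8")
    record = build_record(
        data_file,
        creator="Example Researcher",
        title="Example survey responses",
        software="Any CSV-capable spreadsheet or statistics package",
    )
    print(json.dumps(record, indent=2))
```

Recomputing the checksum later and comparing it with the stored value is a simple way to detect whether a file has changed or been corrupted in storage.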
Data Security Policy at Columbia
- Business Continuity and Recovery Policy
- Data Classification Policy
- Data Sanitation/Disposal of Electronic Equipment Policy
- Electronic Data Security Breach Reporting and Response
- Electronic Information Resources Security Policy
- Encryption Policy
- Email Usage Policy
- Information Resources Usage Policy
- Information Security Charter
- Information Security Risk Management Policy
- Network Protection Policy
- Access Control and Log Management
- Registration and Protection of Systems
- Registration and Protection of Endpoints
- Sanitization and Disposal of Information Resources
Best practices for data security
- Restrict access to computers, offices, and storage media.
- Store lab notebooks and samples in locked cabinets.
- Only let trusted individuals troubleshoot computer problems.
- Use appropriate environmental controls.
- Keep confidential and sensitive data on computers not connected to the Internet.
- Keep virus protection up to date.
- Do not send confidential data via email or FTP (and if you must, use encryption).
- Use passwords on files and computers.
- Dispose of data properly at the end of the retention period.
- Any portable device (flash drive, laptop) used to store confidential or identifying data must be encrypted.
- Data files with protected health information must be encrypted.
- Keep passwords and keys on paper in a secure location and in an encrypted file.