Your computer's hard drive is a good storage device, but it is not immune from disaster. Hard drives fail without warning, and people delete files by accident. Data recovery is possible in some situations, but there is no guarantee that your data can be recovered. For that reason, backup is essential. Talk to the IT staff in your department or institution about backup services for your data and other important files.
A good backup strategy has two characteristics: currency and physical separation.
In other words, your backup data should be as current as possible and located somewhere different from the original data. Backup services offered by your department or institution may make copies automatically on a regular basis, which keeps the backup copy current, but in some cases you may be responsible for updating your backup by manually copying files. Talk to your IT staff for details. Backup services offered by your department or institution will ensure physical separation.
If your department or institution does not provide a backup service, consider these approaches:
Ideally, you should use a combination of two backup strategies for your data. Do not depend on a single backup. For example, you could use an external hard drive in addition to your institution's backup services.
What about the cloud?
You can store data and other files with cloud services such as Amazon S3, Dropbox, SpiderOak, Google Drive, Apple iCloud, and Microsoft SkyDrive. The cost will vary depending on the size of your data files. A limited amount of storage is often free.
It is important to understand that cloud services have licenses and legal obligations that may expose your data to legal risks and violate any privacy or confidentiality requirements associated with your data. Read the license carefully before using any cloud service. All the services listed above store your data on servers in the U.S., and therefore your data will be subject to U.S. laws. Your institution or funding agency may have policies about using these services, so check with them before opening an account.
Other backup strategies
Get in the habit
If your backup strategy involves manually copying files to a storage drive, get in the habit of copying your data on a regular basis. As a rule, make a copy of your data whenever you make changes to your data. For example, if your data changes on a daily basis, then make a copy every day.
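If you copy files by hand, a small script can make the habit easier to keep. The following is a minimal sketch in Python, not a replacement for a managed backup service. The function name `backup_folder` and the dated folder layout are assumptions made for illustration. For physical separation, the backup location should be on a different device from the original, such as an external hard drive.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def backup_folder(source: Path, backup_root: Path, today: Optional[date] = None) -> Path:
    """Copy `source` into a dated subfolder of `backup_root` and return the copy's path."""
    today = today or date.today()
    # One subfolder per day, named YYYYMMDD, so earlier copies are kept.
    target = backup_root / today.strftime("%Y%m%d") / source.name
    # dirs_exist_ok=True lets a second run on the same day refresh that day's copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

For example, `backup_folder(Path("my_data"), Path("/path/to/external_drive/backups"))` would copy a hypothetical `my_data` folder into a subfolder named after today's date. Both paths are placeholders to replace with your own.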
Why cite data?
It may seem odd to cite data files, but it makes sense if you think of data as a kind of literature. When you cite data (even your own data) in a transparent manner, you give readers a reason to trust your work. At the same time, by providing a trail back to the data files, you give readers an opportunity to reproduce and verify your conclusions. Beyond issues of trust and verification, citing data plays an important role in establishing credit and maintaining intellectual property.
If you produce and publish data files, then you benefit from data citation. When researchers cite your data files, they alert peers and colleagues to your work, establish your data in the scholarly record, and make it possible to measure the impact of your research.
Data centres and repositories often provide citation guidelines for data files, so check with the centre or repository before publishing your paper. If you cannot obtain any guidelines for your data, then you should try to match the standard in your field. When in doubt, aim to create a citation that includes the following information:
Some repositories may have additional rules about attribution and acknowledgement. Consult the repository before citing a dataset.
What is a Data Management Plan?
A Data Management Plan describes your data files, your plans for data storage, and rules for sharing your data.
Why make a Data Management Plan?
Writing a Data Management Plan can help you identify challenges associated with the long-term accessibility and sustainability of your data. It can also help you organize your research process. If you're working with co-investigators and/or research assistants, your Data Management Plan will provide consistent guidelines for handling data.
In the United States, researchers must include a Data Management Plan with grant proposals submitted to the National Science Foundation (NSF) and other funding agencies. We expect that SSHRC, CIHR, and NSERC in Canada will follow the NSF example in the near future.
Creating a Data Management Plan
In general, a Data Management Plan should include:
This is a general overview, and individual funding agencies may have different or additional requirements. Please consult your funding agency for guidelines. Talk to the data librarian at your institution for assistance with your Data Management Plan.
While SSHRC does not require a data management section in the grant application, the organization is committed to the principle that research data collected with grant funds belong in the public domain. SSHRC's Research Data Archiving Policy outlines what the organization requires from grant recipients with regard to sharing research data.
The SSHRC policy states that:
"All research data collected with the use of SSHRC funds must be preserved and made available for use by others within a reasonable period of time. SSHRC considers 'a reasonable period' to be within two years of the completion of the research project for which the data was collected."
Canadian Institutes of Health Research (CIHR)
CIHR does not require a data management section in the grant application, but it has two important policies related to data management: CIHR Policy on Access to Research Outputs and CIHR Grants and Awards Management.
CIHR expects that researchers will:
"deposit bioinformatics, atomic, and molecular coordinate data into the appropriate public database (e.g. gene sequences deposited in GenBank) immediately upon publication of research results."
"retain original data sets for a minimum of five years (or longer if other policies apply)." This applies to published and unpublished data.
In addition, if the research involves clinical trials, the researchers must "deposit aggregate data in an unbiased, publicly accessible database (clinical trials registry)."
Natural Sciences and Engineering Research Council of Canada (NSERC)
NSERC does not appear to have any policies related to data management at this time, nor does a data management section appear on the grant application.
Genome Canada's Guidelines for Funding Large-Scale Genomics Research Projects include a section titled "Handling of Data and Resources", which states that applications must include clearly defined policies and plans for managing the data and resources to be generated.
Data Analysis – plan must include
i) A diagram showing the data flow for the information created by all project components.
ii) A description of the data flow.
iii) A description of the computer analysis strategies for the data.
iv) A plan for the long-term preservation (archiving) of the analysis results and, where appropriate, raw data.
v) A description of personnel requirements needed to realize the data analysis.
Genome Canada also has a Data Release and Resource Sharing Policy in which it encourages researchers to share their data and resources as rapidly as possible and have a plan in place to speed up the process.
Data and Resource Sharing – plan must address the following issues
i) Data and resource types – what data and resources will be generated?
ii) Timing and mechanism for sharing – for each data and resource type, when, how and where will these be made available. Where there are recognised public databases and repositories these must be used, and if none are currently available, what are the plans for making the resource in question available to the community at large.
iii) Quality – what quality control/assurance mechanisms will be in place?
iv) Standards – are there community standards for the data and/or resources being generated, how will the project conform to these. Genome Canada expects that data and resources generated will conform with internationally accepted standards, and reference to these standards should be made when these are available.
v) Ethical, privacy and confidentiality issues – if the data could be of a potentially sensitive nature, how will this be handled? Where the research involves human subjects, how will the interests of the research participants be protected? How does the Data and Resource Sharing Plan comply with the terms of the consent?
vi) Intellectual Property (IP) – will there be any restrictions or delays on data and/or resource sharing to ensure protection of any IP or proprietary data and/or resources?
vii) Terms and Conditions – what terms and conditions, if any, of access and use of the data and/or resources will be implemented? Please note, when making data and resources available, researchers cannot place limits on questions posed or methods used, nor require co-authorship as a condition for receiving data or resources.
Genome Canada has a Policy on Access to Research Publications as well.
Michael Smith Foundation for Health Research (MSFHR)
MSFHR has no requirement to include a data management section in the grant application. However, it does have an Open Access to Research Outputs Policy.
The policy states that award recipients are required:
"to deposit bioinformatics, atomic, and molecular coordinate data, as already required by most journals, into the appropriate public database immediately upon publication of research results."
Ontario Institute for Cancer Research (OICR)
OICR does not require a data management section in any of its grant applications. However, the organization does have a Policy on Open Access Publication and Data Retention, which states that:
"OICR requires OICR-supported scientists to deposit bioinformatics, atomic and molecular coordinate data, and source code for software into the appropriate public database, as already required by most journals, immediately upon publication of research results."
Why is file naming important?
Data files, like any other physical or electronic files, need to be uniquely labeled and well organized in order to be identifiable and accessible by researchers, students, and lab staff. Clear, consistent, and self-explanatory file names make identification a lot easier. They also provide a simple mechanism for version control.
File naming essentials
To be helpful, file names must have conventions that everyone on your team follows.
Create self-explanatory names
Other researchers and lab staff should be able to understand the meaning of your file names without a guide. The more descriptive and self-explanatory the name, the easier it will be to identify the data. Furthermore, since files will sometimes be moved from their original location, the descriptive information should be independent of location.
Keep track of versions
Datasets will often have multiple versions. You can identify versions by including a sequential version number in the file name (e.g. v02, v03). When version numbers become difficult to manage, including the date of the revision in the file name (e.g. filename_20101124.csv) can help distinguish between versions.
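A team convention is easier to follow when the names are generated rather than typed by hand. The sketch below shows one possible convention, assuming a pattern of description, two-digit version number, and revision date. The function name `make_file_name` and the exact pattern are illustrative assumptions, not a standard; adapt them to whatever your team agrees on.

```python
import re
from datetime import date


def make_file_name(description: str, version: int, revised: date, extension: str) -> str:
    """Build a file name of the form description_vNN_YYYYMMDD.ext."""
    # Replace every run of characters other than letters and digits with one underscore,
    # so the name contains no spaces or punctuation.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", description).strip("_").lower()
    # Zero-pad the version (v02, v03, ...) so files sort in order.
    return f"{slug}_v{version:02d}_{revised:%Y%m%d}.{extension.lstrip('.')}"
```

For example, `make_file_name("Soil samples (site A)", 2, date(2010, 11, 24), "csv")` produces `soil_samples_site_a_v02_20101124.csv`. Because the description, version, and date are all in the name, the file stays identifiable even if it is moved out of its original folder.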
File formats may seem inconsequential, but they can have a huge impact on the long-term accessibility and sustainability of your data. As software evolves, some formats become inaccessible or unreadable. When this happens, the data stored in those formats will be effectively lost. To avoid this fate, it's important to use file formats that have the best chance of remaining accessible and widely supported over the long term.
Recommended formats for data
The University of Oregon has an overview of recommended formats for a wide variety of common data types.
What is Intellectual Property?
Intellectual property is a work, invention, or expression that is created wholly or partly using one's intellect. This form of knowledge can be legally protected. Canada has four main types of statutory intellectual property protection:
What are your Intellectual Property Rights over data you generated?
The intellectual property rights for research data are a bit of a grey area in Canadian law. Raw data created during a research project is considered neither an invention (i.e. patentable) nor an expression of an idea (i.e. able to be copyrighted) in the Canadian legal system. However, data is important, unique, and valuable information to both the creator and future users, and it can carry some rights provisions. These rights are determined by a number of variables that can influence the degree of your intellectual property protection.
If you do retain a high degree of intellectual property rights over the data, you need to consider how you want that data to be used by the wider community. It might be beneficial to define terms and conditions with respect to sharing and using your data. For example, you may have data that requires discretion or special handling, such as data restricted due to third-party agreements or ethical and privacy concerns. You may want to keep this material "darkened" (unavailable to users) or build certain restrictions into the data permissions.
Using another researcher's data
Depending on the data, your project may have to obtain permissions for use. If the data is online and openly accessible, be sure to adhere to any conditions of use (repositories and data centres will often have their own policies regarding access permissions) and remember to always cite the data. See our overview of data citation for more information.
In some situations, you may need to consult your institution's Intellectual Property Management Office or Office of Technology Transfer to navigate the restrictions and permissions associated with the data you seek.
What is metadata?
Metadata is basic information about your data. It's sometimes called data documentation. Metadata usually includes information about the researchers, the name of the dataset, the date the dataset was created, the topic of the research, the repository that holds the data, any terms and conditions associated with the data, and so on. In general, metadata provides context for data. Metadata helps researchers and students distinguish your dataset from similar or related datasets in a data repository or library catalogue. It also helps students, postdocs, lab staff, and colleagues understand the data you've collected.
Each discipline has its own rules or customs for metadata, and most data repositories have formal metadata standards for submissions. If you plan to deposit your data in a repository, you will probably have to create metadata for your dataset or, at least, provide the repository with enough information to create metadata. Becoming familiar with the metadata standard in your discipline will make this process a lot easier. Metadata is often structured as a series of fields and recorded in XML format.
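To make "a series of fields recorded in XML" concrete, here is a minimal sketch that writes a handful of fields as XML using Python's standard library. The element names (`metadata`, `title`, `creator`, and so on) and the function name `build_metadata` are hypothetical and do not follow any particular metadata standard. A real repository will specify its own field names and structure.

```python
import xml.etree.ElementTree as ET
from typing import Mapping


def build_metadata(fields: Mapping[str, str]) -> str:
    """Return an XML string with one child element per metadata field."""
    # Hypothetical root element; a real standard defines its own.
    root = ET.Element("metadata")
    for name, value in fields.items():
        # Each field becomes <name>value</name>; special characters are escaped.
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `build_metadata({"title": "Example dataset", "creator": "A. Researcher", "date": "2010-11-24"})` returns a single XML string containing those three fields in order. The values shown are placeholders.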
Creating metadata for your data
You should plan to use a metadata standard that's acceptable to your data repository. Talk to the repository manager about your research. He or she may have resources or tools, such as an online form, that make it easy to create and upload metadata. For additional assistance, talk to the data librarian at your library.
If you compose metadata yourself, try to capture the basic details of your project in clear, straightforward language. Feel free to recycle material you used in research proposals or grant applications. Remember that your audience will include researchers and students. What will they need to know about your data?
The parts of metadata
Whether you'll be creating metadata yourself or providing information to an archivist, librarian, or curator, you'll have to describe basic facts about your data:
Many metadata standards require additional information, such as:
If your metadata standard requires information about research design, data quality, and variables, you may have to describe:
Based on information from the UK Data Archive.
There are a wide variety of metadata standards, and each has its own rules. Some are fairly simple and minimalist, and others are complex and elaborate. Nevertheless, most metadata standards have a lot in common. The examples below illustrate some of the diversity and some of the similarities.
Metadata from the Dryad repository. This metadata describes data associated with an article published in the Journal of Evolutionary Biology. Note that the researchers stipulated an embargo period: the data will not be available to the public until the article has been published.
Metadata from Scholars Portal Dataverse. Like the example above, the data is related to a published article. Note that someone added an access restriction: "Data is restricted to current U of T faculty, staff, and students only. Redistribution is prohibited."
Metadata from the Polar Data Catalogue (excerpt). This shows a section of the metadata that describes the location of the study, keywords, and conditions of use.
Why share your data?
It's becoming more common for researchers to publish their data at some point. There are good reasons for publishing data:
Finding a data repository
If you wish to share your data, it's important to choose a trusted repository or data centre that facilitates long-term preservation and access restrictions for your data. Check out these options:
It may be helpful to create a Data Sharing Plan in order to define your expectations for researchers and students who are interested in using your data. The NIH has created a helpful document outlining the key elements to consider when preparing a Data Sharing Plan.