Research Data Repositories

Backup

Why backup?

Your computer's hard drive is a good storage device, but it is not immune from disaster. Hard drives fail without warning, and people delete files by accident. Data recovery is possible in some situations, but there is no guarantee that your data can be recovered. For that reason, backup is essential. Talk to the IT staff in your department or institution about backup services for your data and other important files.

Backup basics

A good backup strategy has two characteristics:

  1. up-to-date copies
  2. physical separation

In other words, your backup data should be as current as possible and located somewhere different from the original data. Backup services offered by your department or institution may make copies automatically on a regular basis, which keeps the backup copy current, but in some cases you may be responsible for updating your backup by manually copying files. Talk to your IT staff for details. Backup services offered by your department or institution will ensure physical separation.

DIY backup

If your department or institution does not provide a backup service, consider these approaches:

  1. If you are a faculty member or graduate student at an Ontario university, you can use the Scholars Portal Dataverse service. You can store your data in a private, secure, password-protected workspace. You can access Dataverse from any internet connection. Dataverse is popular in the social science community, but you can use it to store a wide variety of data files. If you wish, you can use Dataverse to share your data with your co-investigators or colleagues.
  2. Store a copy of your data on an external hard drive. You can make a copy manually or use the backup functionality in Windows (Backup and Restore) or Mac OS (Time Machine). Large-capacity external hard drives are relatively inexpensive. Ideally, your external drive should be in a different location from your main computer. Plan to replace your external drive every 4-5 years. Keep in mind that, like any electronic device, an external drive can be stolen or damaged, so consider stashing your drive in a safe place when not in use.

Ideally, you should use a combination of two backup strategies for your data. Do not depend on a single backup. For example, you could use an external hard drive in addition to your institution's backup services.

What about the cloud?

You can store data and other files with cloud services such as Amazon S3, Dropbox, SpiderOak, Google Drive, Apple iCloud, and Microsoft SkyDrive. The cost will vary depending on the size of your data files. A limited amount of storage is often free.

It is important to understand that cloud services have licenses and legal obligations that may expose your data to legal risks and violate any privacy or confidentiality requirements associated with your data. Read the license carefully before using any cloud service. All the services listed above store your data on servers in the U.S., and therefore your data will be subject to U.S. laws. Your institution or funding agency may have policies about using these services, so check with them before opening an account.

Other backup strategies

  1. USB sticks ("flash drives" or "thumb drives") are fairly robust, but they have some limitations. First, they will deteriorate and fail after a certain amount of use. The more you use the drive, the faster it will fail. If you copy data to a USB stick on a frequent basis, plan to replace your USB stick after a year of use. In addition, they are easy to lose.
  2. Burning your data to CD-ROM or DVD-ROM discs is acceptable for short-term, occasional backup of small amounts of data, but it is not recommended for frequent backup or long-term backup. If you make copies of your data on a regular basis, an external hard drive will be cheaper. An external hard drive will also hold far more data than a CD-ROM or DVD-ROM. Moreover, CD-ROM and DVD-ROM discs degrade over time, so they are not suitable for long-term preservation.

Get in the habit

If your backup strategy involves manually copying files to a storage drive, get in the habit of copying your data on a regular basis. As a rule, make a copy of your data whenever you make changes to your data. For example, if your data changes on a daily basis, then make a copy every day.
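
If you copy files manually, a short script can make the habit easier to keep. The following is a minimal sketch in Python, not a replacement for an institutional backup service. It copies a folder into a dated folder on a backup drive and compares checksums to confirm that each file was copied intact. The function names and the example paths are hypothetical.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_folder(source: Path, backup_root: Path, today: Optional[date] = None) -> Path:
    """Copy `source` into a dated folder under `backup_root` and verify each file."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    target = backup_root / f"{source.name}_{stamp}"
    # dirs_exist_ok lets a second run on the same day refresh the existing copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    for original in source.rglob("*"):
        if original.is_file():
            copy = target / original.relative_to(source)
            if sha256_of(original) != sha256_of(copy):
                raise IOError(f"Checksum mismatch for {copy}")
    return target
```

For example, `backup_folder(Path("my_project_data"), Path("/Volumes/ExternalDrive/backups"))` would create a folder named `my_project_data_YYYYMMDD` on the external drive. Both paths are placeholders for your own locations.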

Citing Data

Why cite data?

It may seem odd to cite data files, but it makes sense if you think of data as a kind of literature. When you cite data (even your own data) in a transparent manner, you give readers a reason to trust your work. At the same time, by providing a trail back to the data files, you give readers an opportunity to reproduce and verify your conclusions. Beyond issues of trust and verification, citing data plays an important role in establishing credit and maintaining intellectual property.

If you produce and publish data files, then you benefit from data citation. When researchers cite your data files, they alert peers and colleagues to your work, establish your data in the scholarly record, and make it possible to measure the impact of your research.

Citation essentials

Data centres and repositories often provide citation guidelines for data files, so check with the centre or repository before publishing your paper. If you cannot obtain any guidelines for your data, then you should try to match the standard in your field. When in doubt, aim to create a citation that includes the following information:

  1. Author or Creator or Originator or Responsible Party (could be an individual, group, institution, or agency)
  2. Date published
  3. Name of the dataset (include an edition or version number if possible)
  4. Data centre or repository
  5. Web address, DOI, or other unique, persistent, global link
  6. Date accessed or retrieved

Some repositories may have additional rules about attribution and acknowledgement. Consult the repository before citing a dataset.
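
As an illustration, the six elements listed above can be held in a small record and assembled into a citation string. This is a sketch only: the class name, the field names, the punctuation, and the example values are all assumptions, and the output does not follow any official citation style. Use your repository's or your field's guidelines where they exist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """The citation elements from the list above (field names are hypothetical)."""
    creator: str
    date_published: str
    title: str
    repository: str
    link: str
    date_accessed: str
    version: Optional[str] = None

    def format(self) -> str:
        # Append the edition or version number to the title when one is known.
        title = self.title if self.version is None else f"{self.title} ({self.version})"
        return (
            f"{self.creator} ({self.date_published}). {title}. "
            f"{self.repository}. {self.link}. Accessed {self.date_accessed}."
        )


# Entirely made-up values, for illustration only.
example = DataCitation(
    creator="Example Research Group",
    date_published="2010",
    title="Hypothetical Survey Dataset",
    repository="Example Data Repository",
    link="https://doi.org/10.0000/example",
    date_accessed="2010-11-24",
    version="v02",
)
print(example.format())
```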

Based on information from University of Oregon and IASSIST.

Data Management Planning

What is a Data Management Plan?

A Data Management Plan describes your data files, your plans for data storage, and rules for sharing your data.

Why make a Data Management Plan?

Writing a Data Management Plan can help you identify challenges associated with the long-term accessibility and sustainability of your data. It can also help you organize your research process. If you're working with co-investigators and/or research assistants, your Data Management Plan will provide consistent guidelines for handling data.

In the United States, researchers must include a Data Management Plan with grant proposals submitted to the National Science Foundation (NSF) and other funding agencies. We expect that SSHRC, CIHR, and NSERC in Canada will follow the NSF example in the near future.

Creating a Data Management Plan

In general, a Data Management Plan should include:

  1. A description of the types of data that will be produced or collected. You should identify the file formats in which your data will be stored, maintained, and made available. This section could include the methodology or explanation of how the data will be collected.
  2. A description of the standards (metadata) and tools that you will use to annotate or describe your data.
  3. A description of how your data will be organized, archived, and protected during your research project. This should include storage methods and backup procedures for your data, as well as the physical and digital resources needed. Any security or protection measures required for sensitive material or intellectual property should be addressed here as well.
  4. A description of access policies, terms and conditions for data sharing. How will the data be accessed and/or shared during and after the project? Who can access it? Explain rationale for any restrictions on who may access the data and under what conditions, and if these restrictions will change over time.
  5. Plans for eventual termination or transition of the data collection after completion of the research project. Where will the data be preserved for the long-term? How will it remain accessible?

This is a general overview, and individual funding agencies may have different or additional requirements. Please consult your funding agency for guidelines. Talk to the data librarian at your institution for assistance with your Data Management Plan.

Data Management Plans Specific to Canadian Funding Bodies
 
Social Sciences and Humanities Research Council (SSHRC)

While SSHRC does not require a data management section in the grant application, the organization is committed to the principle that research data collected with grant funds belong in the public domain. SSHRC's Research Data Archiving Policy outlines what the organization requires from grant recipients with regard to sharing research data.

The SSHRC policy states that:

"All research data collected with the use of SSHRC funds must be preserved and made available for use by others within a reasonable period of time. SSHRC considers "a reasonable period" to be within two years of the completion of the research project for which the data was collected."

Canadian Institutes of Health Research (CIHR)

CIHR does not require a data management section in the grant application, but it has two important policies related to data management: CIHR Policy on Access to Research Outputs and CIHR Grants and Awards Management.

CIHR expects that researchers will:

"deposit bioinformatics, atomic, and molecular coordinate data into the appropriate public database (e.g. gene sequences deposited in GenBank) immediately upon publication of research results."

and

"retain original data sets for a minimum of five years (or longer if other policies apply)." This applies to published and unpublished data.

In addition, if the research involves clinical trials, the researchers must "deposit aggregate data in an unbiased, publicly accessible database (clinical trials registry)."

Natural Sciences and Engineering Research Council of Canada (NSERC)

NSERC does not appear to have any policies related to data management at this time, nor does a data management section appear on the grant application.

Genome Canada

Genome Canada has a section in their Guidelines for Funding Large-Scale Genomics Research Projects titled "Handling of Data and Resources", in which it states that applications must include clearly defined policies and plans for managing the data and resources to be generated. 

Data Analysis plan must include
i) A diagram showing the data flow for the information created by all project components.
ii) A description of the data flow.
iii) A description of the computer analysis strategies for the data.
iv) A plan for the long-term preservation (archiving) of the analysis results and, where appropriate, raw data.
v) A description of personnel requirements needed to realize the data analysis.

Genome Canada also has a Data Release and Resource Sharing Policy in which it encourages researchers to share their data and resources as rapidly as possible and have a plan in place to speed up the process.

Data and Resource Sharing – plan must address the following issues
i) Data and resource types – what data and resources will be generated?
ii) Timing and mechanism for sharing – for each data and resource type, when, how and where will these be made available. Where there are recognised public databases and repositories these must be used, and if none are currently available, what are the plans for making the resource in question available to the community at large.
iii) Quality – what quality control/assurance mechanisms will be in place?
iv) Standards – are there community standards for the data and/or resources being generated, how will the project conform to these. Genome Canada expects that data and resources generated will conform with internationally accepted standards, and reference to these standards should be made when these are available.
v) Ethical, privacy and confidentiality issues – if the data could be of a potentially sensitive nature, how will this be handled? Where the research involves human subjects, how will the interests of the research participants be protected? How does the Data and Resource Sharing Plan comply with the terms of the consent?
vi) Intellectual Property (IP) - will there be any restrictions or delays on data and/or resource sharing to ensure protection of any IP or proprietary data and/or resources?
vii) Terms and Conditions - what terms and conditions, if any, of access and use of the data and/or resources will be implemented? Please note, when making data and resources available, researchers cannot place limits on questions posed or methods used, nor require co-authorship as a condition for receiving data or resources.

Genome Canada has a Policy on Access to Research Publications as well.

Michael Smith Foundation for Health Research (MSFHR)

MSFHR has no requirement to include a data management section in the grant application. However, it does have an Open Access to Research Outputs Policy.

The policy states that award recipients are required:

"to deposit bioinformatics, atomic, and molecular coordinate data, as already required by most journals, into the appropriate public database immediately upon publication of research results."

Ontario Institute for Cancer Research (OICR)

OICR does not require a data management section in any of their grant applications. However, the organization does have a Policy on Open Access Publication and Data Retention, which states that:

"OICR requires OICR-supported scientists to deposit bioinformatics, atomic and molecular coordinate data, and source code for software into the appropriate public database, as already required by most journals, immediately upon publication of research results"

File Naming

Why is file naming important?

Data files, like any other physical or electronic files, need to be uniquely labeled and well organized in order to be identifiable and accessible by researchers, students, and lab staff. Clear, consistent, and self-explanatory file names make identification a lot easier. They also provide a simple mechanism for version control.

File naming essentials

Be consistent

To be helpful, file names must have conventions that everyone on your team follows.

  1. Establish rules for creating file names, folder names, and general directory structure
  2. Always include the same information in each file name, in the same order (e.g. the date is always YYYYMMDD)
  3. Consistency tips:         
    • Do not use special characters: & , * % # ; ( ) ! @ $ ^ ~ ' { } [ ] ? < >
    • Use only one period and only before the file extension. Use an underscore instead of a period or space everywhere else in the file name (e.g. file_name.doc)
    • Use application-specific, lowercase codes in the 3-letter file extension (e.g. csv, gif, shp, por, cml, hdf)
    • When using sequential numbering, insert leading zeros to allow for multi-digit versions. For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100 (001, 010, 100), and so on.
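
The consistency tips above can be checked automatically. The sketch below is one possible way to do it in Python, assuming your team adopts exactly these rules. The function names and the wording of the messages are hypothetical, and you should adjust the checks to match your own conventions.

```python
from typing import List

# Special characters the tips above say to avoid.
FORBIDDEN = set("&,*%#;()!@$^~'{}[]?<>")


def check_file_name(name: str) -> List[str]:
    """Return a list of problems found in `name` (an empty list means none)."""
    problems = []
    bad = sorted(set(name) & FORBIDDEN)
    if bad:
        problems.append("special characters: " + " ".join(bad))
    if " " in name:
        problems.append("contains a space; use an underscore")
    if name.count(".") != 1:
        problems.append("must contain exactly one period, before the extension")
    else:
        stem, extension = name.split(".")
        if not stem:
            problems.append("missing name before the extension")
        if not extension or extension != extension.lower():
            problems.append("extension should be present and lowercase")
    return problems


def numbered(prefix: str, index: int, total: int, extension: str) -> str:
    """Build a sequentially numbered name with enough leading zeros for `total`."""
    width = len(str(total))
    return f"{prefix}_{index:0{width}d}.{extension}"
```

With these definitions, `check_file_name("file_name.doc")` reports no problems, and `numbered("run", 7, 100, "csv")` gives `run_007.csv`.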

Create self-explanatory names

Other researchers and lab staff should be able to understand the meaning of your file names without a guide. The more descriptive and self-explanatory the name, the easier it will be to identify the data. Furthermore, since files will sometimes be moved from their original location, the descriptive information should be independent of location.

  1. Brevity is appreciated - try to keep file and folder names under 32 characters
  2. Where applicable, include relevant information such as:
    • Unique identifier (e.g. grant/funding number in folder name)
    • Project or research data name
    • Conditions (instrument, temperature, location, etc.)
    • Run of experiment (sequential)
    • Date (in file properties too)

Example: InternationalPolarPeopleProject_survey_Nunavut_20101124/10089/en.txt

If you already have files that do not follow your naming convention, you can rename them in bulk with an application such as ReNamer (Windows XP/Vista/7), Rename Master (Windows XP/Vista/7), Batch File Rename (Mac), or a similar tool.

Keep track of versions

Datasets will often have multiple versions. You can identify versions by including a sequential version number in the file name (e.g. v02, v03). When sequential numbers become difficult to manage, including the date of the revision in the file name (e.g. filename_20101124.csv) can help distinguish between versions.
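
Both naming approaches are easy to automate. The following Python sketch assumes file names that end in `_vNN` followed by a lowercase extension, which is one possible convention rather than a required one. The function names are hypothetical.

```python
import re
from datetime import date

# Matches names such as "survey_v02.csv" (an assumed convention).
VERSION_PATTERN = re.compile(r"^(?P<base>.+)_v(?P<number>\d+)\.(?P<ext>[a-z0-9]+)$")


def next_version(name: str) -> str:
    """Return `name` with its _vNN version number increased by one."""
    match = VERSION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"no _vNN version number in {name!r}")
    digits = match.group("number")
    bumped = int(digits) + 1
    # Keep the original zero padding so that names still sort correctly.
    return f"{match.group('base')}_v{bumped:0{len(digits)}d}.{match.group('ext')}"


def dated(base: str, extension: str, revised: date) -> str:
    """Return a name that carries the revision date as YYYYMMDD."""
    return f"{base}_{revised:%Y%m%d}.{extension}"
```

For example, `next_version("survey_v02.csv")` gives `survey_v03.csv`, and `dated("filename", "csv", date(2010, 11, 24))` gives `filename_20101124.csv`.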

You can also use version control software such as Bazaar (Windows, Ubuntu, OS X, etc.), Codeville (runs on all major platforms), Mercurial (Windows, OS X, Linux, etc), or another.

Sources: University of Oregon and North Carolina Department of Cultural Resources.

File Formats

Formats matter

File formats may seem inconsequential, but they can have a huge impact on the long-term accessibility and sustainability of your data. As software evolves, some formats become inaccessible or unreadable. When this happens, the data stored in those formats will be effectively lost. To avoid this fate, it's important to use file formats that have the best chance of remaining accessible and widely supported over the long term.

General tips

  1. You're always safer with formats that are commonly used in your field, so aim to use formats that the majority of your peers use for their data.
  2. For long-term preservation, avoid formats that can only be opened by a single program or by using a specific piece of hardware. If you capture and process data in a format that's tied to particular software or hardware, plan to store that data in a format that's more suited to long-term preservation.
  3. When in doubt, ask the data librarian at your institution or the data manager at your repository for advice about file formats.

Recommended formats for data

The University of Oregon has an overview of recommended formats for a wide variety of common data types.

Intellectual Property

What is Intellectual Property?

Intellectual property is a work, invention, or expression that is created wholly or partly using one's intellect. This form of knowledge can be legally protected. Canada has four main types of statutory intellectual property protection:

  1. Patents for inventions
  2. Trademarks for brand identity
  3. Designs for product appearance
  4. Copyright for material

What are your Intellectual Property Rights over data you generated?

The intellectual property rights for research data are a bit of a grey area in Canadian law. Raw data created during a research project is not considered an invention (i.e. patentable) nor an expression of an idea (i.e. able to be copyrighted) in the Canadian legal system. However, data is important, unique, and valuable information to both the creator and future users, and it can carry some rights provisions. These rights are determined by a number of variables that can influence the degree of your intellectual property protection.

  1. Your university's/institution's regulations and policies - In many universities, data collected under research projects funded by the federal or provincial governments belongs to the university. Look into your institution's policies or, if applicable, check with your institution's Intellectual Property Management Office or Office of Technology Transfer.
  2. The conventions of your field and/or your relationship with your supervisor - These factors may determine the degree of acknowledgment you are accorded for participation in a project and therefore influence your intellectual property rights over the research data.
  3. Your funding/grant source - Some funding bodies will attach no intellectual property claims to the research they fund, while others may claim either licensing rights, a share of royalties, or require that the data be made broadly available to the research community through public data archiving or other methods. Again, consult with your institution's Intellectual Property Management Office or Office of Technology Transfer and make sure that you are fully aware of the terms and conditions of the funding before signing an agreement.

If you do retain a high degree of intellectual property rights over the data, you need to consider how you want that data to be used by the wider community. It might be beneficial to define terms and conditions with respect to sharing and using your data. For example, you may have data that requires discretion or special handling, such as data restricted due to third-party agreements or ethical and privacy concerns. You may want to keep this material "darkened" (unavailable to users) or build certain restrictions into the data permissions.

Using another researcher's data

Depending on the data, your project may have to obtain permissions for use. If the data is online and openly accessible, be sure to adhere to any conditions of use (repositories and data centres will often have their own policies regarding access permissions) and remember to always cite the data. See our overview of data citation for more information.

In some situations, you may need to consult your institution's Intellectual Property Management Office or Office of Technology Transfer to navigate the restrictions and permissions associated with the data you seek.

For more information on Intellectual Property, please see the Canadian Association for Graduate Studies' Guide.

Metadata for Datasets

What is metadata?

Metadata is basic information about your data. It's sometimes called data documentation. Metadata usually includes information about the researchers, the name of the dataset, the date the dataset was created, the topic of the research, the repository that holds the data, any terms and conditions associated with the data, and so on. In general, metadata provides context for data. Metadata helps researchers and students distinguish your dataset from similar or related datasets in a data repository or library catalogue. It also helps students, postdocs, lab staff, and colleagues understand the data you've collected.

Each discipline has its own rules or customs for metadata, and most data repositories have formal metadata standards for submissions. If you plan to deposit your data in a repository, you will probably have to create metadata for your dataset or, at least, provide the repository with enough information to create metadata. Becoming familiar with the metadata standard in your discipline will make this process a lot easier. Metadata is often structured as a series of fields and recorded in XML format.

Creating metadata for your data

You should plan to use a metadata standard that's acceptable to your data repository. Talk to the repository manager about your research. He or she may have resources or tools, such as an online form, that make it easy to create and upload metadata. For additional assistance, talk to the data librarian at your library.

If you compose metadata yourself, try to capture the basic details of your project in clear, straightforward language. Feel free to recycle material you used in research proposals or grant applications. Remember that your audience will include researchers and students. What will they need to know about your data?

The parts of metadata

Whether you'll be creating metadata yourself or providing information to an archivist, librarian, or curator, you'll have to describe basic facts about your data:

  1. The names of the lead investigators.
  2. The title of your project.
  3. An abstract.
  4. The temporal coverage of your data.
  5. Any publication date associated with your data.
  6. The file format(s) used to store your dataset.
  7. The subject or topic of your research.
  8. Any rights and access restrictions associated with your data.
  9. Your contact information.
  10. The name of the data repository that stores your data.
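
To show how these basic facts might look as a series of fields recorded in XML, here is a small Python sketch. The element names and all of the values are made up for illustration and do not follow any real metadata standard. Your repository's standard will define its own field names and structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names and values, one entry per item in the list above.
record = {
    "investigator": ["A. Researcher", "B. Colleague"],
    "title": "Hypothetical Survey Project",
    "abstract": "A short description of the project and its data.",
    "temporal_coverage": "2009-01-01 to 2010-11-24",
    "publication_date": "2011",
    "file_format": ["csv", "txt"],
    "subject": "Example topic",
    "rights": "Restricted to project members until publication.",
    "contact": "researcher@example.org",
    "repository": "Example Data Repository",
}


def to_xml(fields: dict) -> str:
    """Serialize the fields as child elements of a single <dataset> element."""
    root = ET.Element("dataset")
    for name, value in fields.items():
        # A list becomes one repeated element per entry.
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, name).text = item
    return ET.tostring(root, encoding="unicode")


print(to_xml(record))
```

Repeating an element for each investigator or file format is one common way to record multiple values, but it is a design choice here, not a rule.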

Many metadata standards require additional information, such as:

  1. The spatial coordinates of your observations or study area.
  2. Any instruments used in data collection.
  3. Hardware and software used in data collection and processing.
  4. Any scale or resolution associated with your data.
  5. Relationships between datasets or files.
  6. Relationships between datasets and codebooks or data dictionaries.
  7. The version of each dataset or file and the nature of any modifications.

If your metadata standard requires information about research design, data quality, and variables, you may have to describe:

  1. Data collection protocol and sampling design.
  2. Data validation, checking, proofing, cleaning, and other quality assurance procedures.
  3. Names, labels, and descriptions for variables, and their values.
  4. Explanation of codes and classification schemes.
  5. Codes of, and reasons for, missing values.
  6. Derived data created after collection, with code, algorithm, or command file used to create them.
  7. Weighting and grossing variables created.
  8. Data listing with descriptions for cases, individuals, or items studied.

Based on information from the UK Data Archive.

Metadata Examples

There is a wide variety of metadata standards, and each has its own rules. Some are fairly simple and minimalist, and others are complex and elaborate. Nevertheless, most metadata standards have a lot in common. The examples below illustrate some of the diversity and some of the similarities.

Example 1

Metadata from the Dryad repository. This metadata describes data associated with an article published in the Journal of Evolutionary Biology. Note that the researchers stipulated an embargo period: the data will not be available to the public until the article has been published.

Example 2

Metadata from Scholars Portal Dataverse. Like the example above, the data is related to a published article. Note that someone added an access restriction: "Data is restricted to current U of T faculty, staff, and students only. Redistribution is prohibited."

Example 3

Metadata from the Polar Data Catalogue (excerpt). This shows a section of the metadata that describes the location of the study, keywords, and conditions of use.

Sharing Data

Why share your data?

It's becoming more common for researchers to publish their data at some point. There are good reasons for publishing data:

  1. In some cases, funding agencies, institutions, and employers require researchers to publish their data within a certain period of time. The delay gives you an opportunity to publish your findings before publishing your data.
  2. It is often a good way to back up and archive your data (especially if you deposit your data with a trusted repository or data centre).
  3. It may increase your data's citation rate and stimulate interest in your research.
  4. It establishes a public record of your research and its publication date.
  5. It can help speed up further research in certain disciplines.

Finding a data repository

If you wish to share your data, it's important to choose a trusted repository or data centre that facilitates long-term preservation and access restrictions for your data. Check out these options:

  1. If you are affiliated with an academic institution, ask a librarian about your institutional repository. It should provide a secure place to store and share your data. These repositories will be subject to your institution's policies regarding access permissions, so make sure terms and conditions are acceptable.
  2. There are discipline/field-specific repositories and data centres that provide data storage, preservation, and access controls. These repositories will also have their own access permissions and, depending on their location, may be subject to different laws. Read the terms and conditions before depositing your data.

It may be helpful to create a Data Sharing Plan in order to define your expectations for researchers and students who are interested in using your data. The NIH has created a helpful document outlining the key elements to consider when preparing a Data Sharing Plan.