Research Data Management

Research Data Management (RDM) means taking care of your research data in an organized and responsible way throughout the entire data life cycle. Well-managed research data follows the FAIR Data Principles, meaning others should be able to understand your data and reproduce your results. Effective RDM helps your project meet legal, ethical, and funding requirements while ensuring that research outputs remain discoverable, reusable, trustworthy, citable, and protected for the long term.

Email the Data and Visualization Librarian, Siti Lei (siti.lei@dukekunshan.edu.cn), for support with RDM, data management plan (DMP) creation, and data deposit.

Visit the Office of Research Support (RSO) at DKU for support with research applications and funding opportunities.

What is ‘Research Data’?

Research data is the information collected, observed, generated, or created for the purpose of analysis and the production of original research results, together with any associated documentation, code, or scripts. Research data can be digital or analog, and includes both primary data created by the researcher(s) and secondary data obtained from other sources (such as census datasets or materials produced by other researchers).

Research data is diverse:
  • Raw/processed data (e.g., survey results, sensor readings, experimental measurements)
  • Text and documents (e.g., interview transcripts, fieldnotes, sketches, manuscripts)
  • Images, audio, video (e.g., medical images, photos of fieldwork, audio recordings)
  • Code and software (e.g., code scripts, models, algorithms for analysis)
  • Digital projects (e.g., website, StoryMap, 3D models, games)
  • Derived datasets (e.g., cleaned/aggregated data produced from raw data)

For yourself and your project:

  • Efficiency: RDM saves time and resources in the long term
  • Data Integrity: RDM ensures accuracy and reliability of your data
  • Transparency & Replication: RDM makes your research process clear and enables others to replicate your results
  • Preservation, Sharing, & Reuse: RDM supports long-term access and future use of your data

There are also practical reasons:

  • Journal Policies: many journals require authors to share supporting data and include a data sharing plan or data availability statement as part of their publication requirements
  • Funding Requirements: research funding agencies require clear data management plans in grant proposals to ensure that research data is managed in accordance with open science principles
  • Ethics Compliance: researchers are responsible for managing data ethically and responsibly

FAIR stands for Findable, Accessible, Interoperable, Reusable.
  • Findable: For data to be findable there must be sufficient metadata; there must be a unique and persistent identifier; and the data must be registered or indexed in a searchable resource.
  • Accessible: To be accessible, metadata and data should be readable by humans and by machines, and they must reside in a trusted repository.
  • Interoperable: Data must share a common structure, and metadata must use recognized, formal terminologies for description.
  • Reusable: Data and collections must have clear usage licenses and clear provenance, and meet relevant community standards for the domain.
* Refer to the National Library of Medicine, https://www.nlm.nih.gov/oet/ed/cde/tutorial/02-200.html
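
As a rough illustration of how the FAIR principles translate into practice, the Python sketch below checks a dataset's metadata record for a few fields that loosely map to each principle. The field names (`identifier`, `title`, `repository`, `format`, `license`, `provenance`) and the sample record are assumptions made for this example, not a DKU or repository standard; real repositories define their own metadata schemas.

```python
# Assumed field names, each loosely mapped to a FAIR principle.
FAIR_FIELDS = {
    "identifier": "Findable: unique, persistent identifier",
    "title": "Findable: descriptive metadata",
    "repository": "Accessible: trusted repository holding the data",
    "format": "Interoperable: common, documented structure",
    "license": "Reusable: clear usage license",
    "provenance": "Reusable: how the data was produced",
}


def missing_fair_fields(record: dict) -> list[str]:
    """Return the assumed FAIR-related fields that are absent or blank."""
    return [
        name for name in FAIR_FIELDS
        if not str(record.get(name) or "").strip()
    ]


# Hypothetical record; the identifier is a placeholder, not a real DOI.
record = {
    "identifier": "doi:10.0000/placeholder",
    "title": "Campus air quality sensor readings",
    "repository": "",
    "format": "CSV",
    "license": "CC BY 4.0",
}

for name in missing_fair_fields(record):
    print(f"Missing {name} ({FAIR_FIELDS[name]})")
```

A check like this only confirms that the fields are filled in; whether the identifier is truly persistent or the license is appropriate still needs human review.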

Best practices for RDM involve the entire data lifecycle, from the start to the end of a project. The main stages include Create, Store, Use, Share, Archive, and Destroy, each governed by applicable policies, rules, laws, and regulations that ensure ethical and responsible data handling.

The lifecycle of research data does not end when a project concludes. Instead, researchers are responsible for guiding data through stages of long-term preservation and potential reuse, ensuring that data remain accessible, secure, and valuable beyond the original study.

* Image courtesy of the University of Virginia Library Research Data Services + Sciences, http://data.library.virginia.edu/data-management/lifecycle 

| Classification Level | Sensitivity | Explanation | Storage Requirements | Examples |
| --- | --- | --- | --- | --- |
| 1 | Non-confidential | Research data that can be accessed by the general public | Storage must be properly configured according to DKU requirements | Research data that has been de-identified in accordance with applicable rules; published research; published information about Duke Kunshan University; public-facing websites |
| 2 | Benign information to be held confidentially | Research data that Duke Kunshan University has chosen to keep confidential but the disclosure of which would not harm the institution | Storage must be properly configured according to DKU requirements | Unpublished research data; drafts of research papers; patent applications; work-in-progress papers |
| 3 | Sensitive or confidential information | Research data that, if disclosed, could cause risk of material harm or legal liability to individuals or Duke Kunshan University | Must not be stored on personal devices unless such devices are encrypted according to DKU requirements | Research data containing personally identifiable information and not classified in Level 4; Duke Kunshan IDs when associated with information that could identify individuals; any personal data protected under Chinese laws and regulations and not classified in Level 4 or 5 |
| 4 | Very sensitive information | Research data that would likely cause serious harm to individuals or Duke Kunshan University if disclosed | Must be stored on the DKU Protected Network; should not be transferred locally unless thoroughly anonymized and verified | Individually identifiable financial or medical information; information commonly used to establish identity that is protected by Chinese laws and regulations and not classified in Level 5; individually identifiable genetic information that is not in Level 5; national security information; passwords and PINs that can be used to access confidential information |
| 5 | Extremely sensitive information | Research data that would cause severe harm to individuals or the University if disclosed | Must be stored entirely on the DKU Protected Network at all times | Research data covered by a regulation or agreement that requires that it be stored or processed in a high security environment on the Duke Kunshan University Protected Network (DKUPN); certain individually identifiable medical records and genetic information, categorized as extremely sensitive |

* Refer to Data Security and Storage by DKU Research Support Office.

Data Management Plan

A Data Management Plan (DMP) is a document that describes how research data will be generated, managed, shared, and stored throughout a research project, from its start to after its completion. Funding agencies, research institutions, and journals often require a DMP to ensure that data are well-organized, secure, and reusable. Researchers should manage research data in accordance with the DMP to ensure responsible stewardship and future reuse.

The Office of Research Support at DKU provides Data Management Plans guidance to assist researchers in creating an effective DMP.

Support

Recommended tools for creating a DMP:

  • DMPTool developed by the University of California
  • DMPonline developed by the Digital Curation Centre in the UK
  • DMP Assistant developed by the Digital Research Alliance of Canada

For assistance with developing a DMP, contact the Data and Visualization Librarian, Siti Lei (siti.lei@dukekunshan.edu.cn).

Submit your completed DMP to the Office of Research Support (research-support@dukekunshan.edu.cn).

Research group procedures (aka ‘lab procedures’, ‘standard operating procedures’) set expectations for working in collaborative research environments. They vary by group but typically cover policies (e.g., data ownership, confidentiality), workflows (e.g., file naming, version control), roles and responsibilities, use of space and equipment, approved tools and software, and general research and data management practices.

They differ from a DMP in scope: research group procedures define how collaboration and data stewardship are organized across multiple projects, while a DMP is project-specific, detailing how data will be collected, stored, shared, and preserved in alignment with those procedures.

Establishing and documenting onboarding and offboarding procedures is essential for all research groups and collaborative projects. These procedures should include clear actions related to research data to standardize knowledge transfer and ensure that all team members have appropriate access to information, systems, and files. Effective procedures help reduce the risk of data loss or mishandling and ensure compliance with institutional and data security standards.

Onboarding procedures for research data may include:

  • Reviewing relevant policies, procedures, and documentation
  • Reviewing data management expectations and best practices
  • Reviewing available tools and resources for data storage and collaboration
  • Reviewing or creating data workflows for the project
  • Clarifying roles and responsibilities for data management

Offboarding procedures for research data may include:

  • Transferring ownership of files and shared drives
  • Updating documentation and metadata for data files
  • Selecting files for retention, archiving, or secure deletion
  • Removing permissions and access to systems, drives, and repositories

Ethical responsibility is essential to research. Consent and ethics safeguard participants, ensure that data is collected accurately and lawfully, and foster trust in research outcomes. At DKU, researchers should plan for consent and ethical approval before beginning data collection to ensure compliance with institutional policies, Chinese regulations, and responsible data management practices.

Things to consider before data collection:

  • Informed Consent: Participants should be clearly informed about what data will be collected, how it will be used, and who will have access. Their consent must be voluntary, documented (signed or digital), and allow withdrawal at any time.
  • Ethical Approval: Ethics protocols cover security, access and retention for human data. Research projects at DKU involving human participants require review and approval from the Research Support Office (RSO).

Contact the Institutional Review Board (IRB) to have your research’s ethical protocols reviewed and approved.

Folder & File Organization

Folder

Instead of storing data and files in default computer locations (e.g., Desktop or Downloads), you should create separate folders to organize them by category. 

Your folder directory structure should prioritize clarity and easy discoverability. Keep it simple – limit the structure to no more than 4 levels and 10 or fewer subfolders within each level.
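
If you want to check an existing project against this guideline, a short script can do the counting for you. The Python sketch below is one possible approach: the function name `check_structure` is hypothetical, and it assumes that "levels" are counted below the project's top folder (so the top folder itself is level 0).

```python
from pathlib import Path

MAX_DEPTH = 4        # folder levels allowed below the project root
MAX_SUBFOLDERS = 10  # subfolders allowed inside any single folder


def check_structure(root: Path) -> list[str]:
    """Return warnings for folders that exceed the depth or width limits."""
    warnings = []
    folders = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for folder in folders:
        # Depth is the number of path parts between the root and this folder.
        depth = len(folder.relative_to(root).parts)
        if depth > MAX_DEPTH:
            warnings.append(f"{folder}: nested {depth} levels deep")
        subfolders = [p for p in folder.iterdir() if p.is_dir()]
        if len(subfolders) > MAX_SUBFOLDERS:
            warnings.append(f"{folder}: {len(subfolders)} subfolders")
    return warnings
```

Calling `check_structure(Path("MyProject"))` (a placeholder folder name) returns an empty list when the structure stays within both limits.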

File

Organizing your research files in a clear and consistent way makes your data easier to understand, share, and keep safe for the long term. It also saves you time when you need to find or reuse your files later. A good system should be descriptive, well-structured, and used consistently, with clear documentation explaining how the data was created, collected, and processed, as well as any information needed to help others interpret and reuse it accurately.

Examples of data documentation include:

  • README files – describe the file organization and naming system
  • Codebooks – explain attributes/codes and their meanings
  • Data dictionaries – define variables and fields
  • Scripts – record data processing and analysis steps
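
A data dictionary is easier to start when the variable names are filled in for you. The Python sketch below is a minimal example of that idea: it reads the header of a CSV file's text and produces a blank data dictionary with one row per variable. The function name and the dictionary columns (`variable`, `example_value`, `description`, `units`) are assumptions for illustration; adapt them to your discipline's conventions.

```python
import csv
import io


def data_dictionary_skeleton(csv_text: str) -> str:
    """Return a blank data dictionary (as CSV text) with one row per column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("the CSV text has no header row")
    header = rows[0]
    first_row = rows[1] if len(rows) > 1 else []

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["variable", "example_value", "description", "units"])
    for index, name in enumerate(header):
        # Show the first data value as a hint; leave the rest to the researcher.
        example = first_row[index] if index < len(first_row) else ""
        writer.writerow([name, example, "", ""])
    return out.getvalue()


# Invented sample data for demonstration only.
sample = "participant_id,age,response\nP001,34,agree\n"
print(data_dictionary_skeleton(sample))
```

The description and units columns are left empty on purpose, since only the researcher can supply them. If the data is sensitive, consider removing the example values before sharing the dictionary.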

Recommended tools for creating data documentation:

Tips for naming folders and files:

  • Keep names short (under 32 characters)
  • Name files differently from folders
  • Use alphanumeric characters (avoid special characters such as & , * % # ( ) ! @ $ ^ ~ ‘ { } [ ] ? < > –)
  • Use CamelCase or underscores instead of periods or spaces
  • Use the ISO 8601 date format: YYYYMMDD
  • Use meaningful and unique names
  • Use leading zeros (e.g., for a sequence of 1 to 100, use 001 to 100)
  • Use version labels if needed (e.g., v1, v001, or v1_1 rather than “final2” or “revised”), or use a version control system such as Git
  • Be consistent!
 
An example of a filename convention:
  • YYYYMMDD_ContentDescription_Version.ext
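
To apply a convention like this consistently, you can generate and check filenames with a small script. The Python sketch below is one way to do it; the helper names and the three-digit `v001` version style are assumptions chosen for this example, not a required DKU format.

```python
import re
from datetime import date

# Assumed pattern for YYYYMMDD_ContentDescription_Version.ext
FILENAME_PATTERN = re.compile(r"^\d{8}_[A-Za-z0-9]+_v\d{3}\.[a-z0-9]+$")


def build_filename(day: date, description: str, version: int, ext: str) -> str:
    """Return a name such as 20251013_InterviewData_v002.csv."""
    # Convert the description to CamelCase, dropping spaces and punctuation.
    words = re.findall(r"[A-Za-z0-9]+", description)
    camel = "".join(word[0].upper() + word[1:] for word in words)
    if not camel:
        raise ValueError("description must contain letters or digits")
    # Leading zeros keep versions in order when files are sorted by name.
    return f"{day:%Y%m%d}_{camel}_v{version:03d}.{ext.lower().lstrip('.')}"


def follows_convention(name: str) -> bool:
    """Check whether a filename matches the assumed pattern."""
    return FILENAME_PATTERN.fullmatch(name) is not None


print(build_filename(date(2025, 10, 13), "interview data", 2, "CSV"))
```

`follows_convention` can be run over an existing folder listing to spot files that drifted from the agreed pattern.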

File format is important for long-term data preservation and accessibility. Whenever possible, use open, non-proprietary, and widely supported formats (e.g., CSV, TXT, TIFF, or XML) rather than proprietary ones (e.g., Excel .xlsx, SPSS .sav, or Photoshop .psd) that may require specific software to open.

Open formats increase the likelihood that your data can be accessed, shared, and reused in the future. It is also helpful to document the file formats used in your project and explain any software dependencies.

When proprietary formats are unavoidable, consider saving an additional copy in an open or standardized format for preservation.

Version control helps you track changes to your data, documents, and code over time, ensuring that earlier versions can be recovered if needed. Clear version control practices help facilitate accuracy, reproducibility, and accountability throughout the research process.

When designing a file naming system, consider including version numbers or dates in filenames (e.g., 20251013_InterviewData_v002.csv), and maintain a change log or brief note describing what was modified in each version.
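
For projects that track versions through filenames rather than a tool such as Git, the bookkeeping can be partly automated. The Python sketch below shows one possible approach under that assumption: `next_version_name` increments a `_vNNN` suffix, and `log_change` appends a dated, tab-separated line to a plain-text change log. Both function names and the log layout are hypothetical.

```python
import re
from datetime import date
from pathlib import Path

# Matches a version suffix such as _v002 directly before the file extension.
VERSION_RE = re.compile(r"_v(\d+)(?=\.[^.]+$)")


def next_version_name(filename: str) -> str:
    """Return the filename with its version number increased by one."""
    match = VERSION_RE.search(filename)
    if match is None:
        raise ValueError(f"no version suffix found in {filename!r}")
    digits = match.group(1)
    # Keep the same zero-padded width (v002 becomes v003).
    bumped = str(int(digits) + 1).zfill(len(digits))
    return filename[:match.start(1)] + bumped + filename[match.end(1):]


def log_change(log_path: Path, filename: str, note: str, day: date) -> None:
    """Append one line (date, file, note) to a plain-text change log."""
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"{day.isoformat()}\t{filename}\t{note}\n")


print(next_version_name("20251013_InterviewData_v002.csv"))
```

Saving the edited data under the new name and then calling `log_change` keeps the earlier version intact and leaves a record of what changed.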

Tips for data backup:

  • Back up after a major edit/alteration (not after every save)
  • Create backup copies for:
    • Things you cannot replicate
    • Things that would be difficult or take a lot of effort or resources to recreate
  • Have multiple copies, saved in different places
    • 3-2-1 rule (3 copies, 2 types of storage, 1 of which is offsite)
  • Automate your backup process when possible
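
One way to act on the 3-2-1 rule is to script the copy step and confirm that each copy is intact. The sketch below uses only the Python standard library. The function names and the choice of SHA-256 checksums for verification are illustrative, and the destination folders stand in for whatever storage (external drive, departmental server, approved cloud folder) your project actually uses.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> dict[str, bool]:
    """Copy a file to each destination folder and verify every copy."""
    expected = sha256_of(source)
    results = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = folder / source.name
        shutil.copy2(source, copy)  # copy2 also preserves timestamps
        # A copy is trusted only if its checksum matches the original.
        results[str(copy)] = sha256_of(copy) == expected
    return results
```

The function returns a mapping from each copy's path to `True` when its checksum matches the original. A script like this can then be scheduled with your operating system's task scheduler so that backups do not depend on memory.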

Consider using the following storage options:

  • Personal computer hard drive
  • External hard drives (with provisions)
  • Departmental servers (if available)
  • Cloud storage (if appropriate)
 

Support

For assistance with additional storage space for research data, contact the Office of Information Technology (OIT) at DKU.

Understanding the risks associated with your data can help you adequately protect it. This is important for:

  • Supporting integrity of your research data (e.g., no unauthorized modifications)
  • Protecting against data loss or intellectual property theft
  • Protecting confidential/sensitive data from unauthorized access
  • Ensuring compliance with sponsor or partnership agreements

Check out DKU Data Security and Storage for guidance on identifying your data's classification level and storage requirements based on its sensitivity, confidentiality, and relevance to human subjects.

Sensitive Data

Sensitive data refers to information that, if disclosed, could cause harm to individuals, organizations, national security, or society.

Tips for managing sensitive data:
  • Encrypt data if it is stored outside a secure server environment
  • Encryption is optional if data is stored inside a secure server environment
  • Maintain a clear access log or audit trail of who opens or edits the data
  • Document storage locations and sensitivity levels in a metadata or README file
  • Review permissions and security settings periodically
  • Permanently destroy all copies after the official retention period using approved deletion tools

Confidential Data

Confidential data refers to any information that is subject to legal or contractual obligations to be kept private or restricted to authorized individuals or parties, who are entrusted to safeguard it from unauthorized access, misuse, disclosure, modification, loss, or theft.

Tips for managing confidential data:
  • Store identifiable information (like names or IDs) separately from research files in an encrypted folder
  • Keep an inventory showing where personal data are stored and who has access
  • Use consistent file naming to indicate anonymized or restricted content
  • Limit access to authorized team members only and update permissions regularly
  • Follow the institution’s retention schedule and securely delete files when no longer needed
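
The first tip, keeping identifiers apart from research files, can be sketched in Python as follows. The column names, the `study_code` field, and the sample rows are invented for illustration. The code only separates the data, so encrypting the key table and restricting access to it remain separate steps. Note that replacing names with codes is pseudonymization, not full anonymization.

```python
import csv
import io
import secrets


def split_identifiers(csv_text: str, id_columns: list[str]):
    """Split rows into research data and a separate identifier key table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    research_rows, key_rows, used_codes = [], [], set()
    for row in reader:
        # Random codes reveal nothing about the person or the row order.
        code = secrets.token_hex(4)
        while code in used_codes:
            code = secrets.token_hex(4)
        used_codes.add(code)
        key_rows.append(
            {"study_code": code, **{col: row[col] for col in id_columns}}
        )
        research_rows.append(
            {"study_code": code,
             **{k: v for k, v in row.items() if k not in id_columns}}
        )
    return research_rows, key_rows


# Fictional sample data for demonstration only.
sample = "name,student_id,score\nLi Wei,1001,88\nAna Cruz,1002,92\n"
research, key = split_identifiers(sample, ["name", "student_id"])
print(sorted(research[0].keys()))
```

The research rows can then be saved with the working files, while the key table goes to the encrypted folder with tighter access permissions.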

Human Data

Human data refers to information obtained from or about individuals, communities, and groups. Human data may be considered sensitive and/or confidential and may be subject to specific ethical, legal, and contractual obligations.

Tips for managing human data:
  • Store consent forms, ethics approvals, and related documentation in a labeled folder
  • Organize transcripts, notes, and recordings using a structured and consistent folder hierarchy
  • Keep de-identification notes describing which personal details were removed or replaced
  • Use version control to record updates or cleaning steps for qualitative materials
  • Archive only anonymized or aggregated human data in repositories after the project ends

Finding Data Sources

In addition to the data you create yourself, you can explore the following sources to find secondary or third-party datasets:

  • Work with Secondary Data – guidelines for reviewing terms of use, copyright, and citation requirements when accessing databases and using secondary data or datasets.
  • Data Resource Search Tool – this tool provides access to both licensed (proxy-based) and open databases, helping the DKU community discover a wide range of data resources.
  • Data Availability Statements – many academic journals include data availability statements that describe how to access the datasets associated with a published article.

DKU Library’s data and visualization services provide workshops, software tutorials, and resource guides to support data processing and analysis. Topics include: 

Check the Office of Information Technology (OIT) and DKUL’s Tools and Software for available resources.

If you plan to use a campus computer, check out Public Devices for information. 

When using Artificial Intelligence (AI) to process or analyze research data, researchers must apply strict ethical, legal, and security safeguards. Appropriate consent, privacy protection, and institutional approval are mandatory before working with AI.
 

Be cautious about uploading research data to open AI tools: doing so risks exposing unpublished or confidential information, and the data may be retained and used to train the AI model without the researcher’s permission. Sensitive and personal data should never be uploaded to or exposed through such AI tools.

Check out DKU AI Literacy: Policies & Guidelines for more information.

After the Research Project

You’ve completed your project! Now what should you do with all the data?

  • Retention: Intentionally keeping data after a project is completed. Reasons for data retention may include:
    • Meeting funder or institutional requirements
    • Supporting your research if it is ever questioned
    • Allowing for further or follow-up analysis in the future
    • Preserving unique or irreplaceable data
  • Preservation: A set of managed activities that ensure your data remains stable, usable, and accessible for as long as needed.
  • Sharing: Making your data available to others for validation, reuse, or future research.

Note: Before sharing your research data with others or depositing your research data to a repository, check with the Office of Research Support (RSO) to ensure safe and compliant submission with national laws and policies. In some cases, you may be required to de-identify or destroy your data, especially when it involves sensitive or confidential information.

As researchers at DKU, individuals are responsible for understanding and complying with the national laws and institutional policies of both China and the U.S. that regulate research data management and cross-border data exchange.

China Laws

The Cybersecurity Law of the People’s Republic of China (CSL) is a foundational regulation that ensures network security, protects national sovereignty in cyberspace, and safeguards the rights of citizens, organizations, and the public interest. It promotes the secure development of China’s digital economy by establishing systems for network security classification, user information protection, and critical information infrastructure management.

The Personal Information Protection Law of the People’s Republic of China (PIPL) is a special law that aims to protect rights and interests in personal information, standardize personal information processing activities, and promote the rational use of personal information. It designs a system for the entire process of personal information processing, puts forward strict requirements for the protection of sensitive information and the cross-border provision of personal information, and clarifies the rights of individuals and the obligations of processors.

The Data Security Law of the People’s Republic of China (DSL) is a fundamental law governing data processing and protection in China. It aims to ensure data security, promote lawful data use, and safeguard national sovereignty and public interests. The law introduces systems for data classification, risk assessment, security review, and incident handling to protect the rights of individuals and organizations while supporting secure data development and utilization.

The Provisions on Facilitating and Promoting Cross-Border Data Flow aim to balance data security with international data exchange. They establish guidelines for data classification, requiring critical data to be stored domestically while allowing non-sensitive data to flow across borders. Companies must conduct risk assessments, implement security measures, and obtain user consent for data transfers. The provisions encourage international cooperation, streamline compliance procedures, and promote data-driven innovation. They also emphasize protecting personal information and ensuring transparency in data processing. Overall, the framework seeks to foster global digital trade while safeguarding national security and individual privacy.

U.S. Laws

The Common Rule (45 CFR 46) is the main U.S. regulation governing research involving human subjects. It protects participants’ rights, welfare, and privacy through ethical standards for data collection, storage, and use. The rule requires Institutional Review Board (IRB) approval and informed consent, ensuring that identifiable and sensitive data are handled responsibly and securely in federally funded research.

The Data Management and Sharing Policy by the National Institutes of Health (NIH) sets national standards for managing, preserving, and sharing research data. It requires all NIH-funded researchers to submit a Data Management and Sharing Plan (DMSP) outlining how data will be documented, protected, and made accessible. The policy promotes transparency, reproducibility, and alignment with open science and FAIR data principles.

The Federal Information Security Modernization Act (FISMA) mandates strict security standards for information systems managed by federal agencies and contractors. It establishes a framework for protecting data confidentiality, integrity, and availability through risk assessments, access controls, and regular monitoring. FISMA ensures that research projects involving federal data or funding comply with federal cybersecurity and data protection requirements.

Data repositories are online platforms that store, organize, and preserve datasets, often making them available for sharing and reuse. They are widely used by research communities to share and discover data.

There are three main types of data repositories:

  • Disciplinary repositories focus on a particular area of research or type of data. They often have requirements for data formats, documentation, and metadata. You can find disciplinary data repositories by checking in with your peers, reviewing relevant journals for recommendations, or reviewing re3data, a registry of research data repositories.
  • Multidisciplinary/generalist repositories are not focused on a particular field and typically accept all types of data. Some examples of multidisciplinary repositories include FRDR, Dryad, Zenodo, and figshare.
  • Institutional data repositories are generalist repositories provided by a specific institution. Duke Research Data Repository is Duke University’s institutional data repository. It accepts research data from research conducted at or under the auspices of Duke University.

Understanding copyright and licensing helps define how your data can be shared and reused. Always include a license statement in your metadata or README to clarify permissions and restrictions for your users.

  • Copyright Ownership: In most cases, the creator or principal investigator (PI) holds copyright to the data they produce, unless otherwise stated in a grant, institutional, or collaborative agreement.
  • Secondary Data: If your project includes data collected or created by others, review and respect the original license or terms of use before redistributing or modifying it.
  • Licensing: When sharing your own data, attach a clear license that specifies how others may use it.
  • Institutional & Funder Policies: Some funders or institutions may have requirements about data ownership, rights retention, or preferred licensing models. Check these before publishing or depositing your dataset.