Data management

This tutorial is in progress and will be updated with additional examples and graphics.

Fundamentals

  • The technology is less important than the planning. However…
    • Excel can cause issues, so use it with caution
    • IT vs information professionals ⇒ IT + information professionals
      • Different types of expertise and experience – IT can recommend software based on institutional licensing agreements, cost, and hardware requirements, but may have limited experience using the software in the context of research projects. Information professionals (i.e. librarians, archivists, research facilitators) are more likely to have a working knowledge of implementing different software or systems for data management, as well as of the legal, ethical, and practical restrictions surrounding information in scholarly work.
        • Ideally consult with both together!
    • Portability of data is the most important factor
      • Avoid proprietary file formats whenever possible – some instruments will require that you use their specific file format, but you can frequently convert the data to a nonproprietary format afterward. The ADS Guides to Good Practice include recommendations for file formats that are both accessible and suitable for preservation.
      • Avoid database systems and software that do not provide an easy way to export ALL of the data in a structured format. Expect to have to switch systems at some point during your project, and run tests to make sure you won't have to re-enter data, re-run experiments, or lose anything if you do. Test both exporting and importing – see the round-trip sketch below!

Make sure that whatever software or system you use to store data, you are able to get all of the data out without losing anything. Test exports (and imports) before committing!
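
As a minimal sketch of that kind of round-trip test, here is what it might look like with a SQLite database and pandas; the database, table, and file names are hypothetical and the checks would need to be adapted to your own system:

    import sqlite3
    import pandas as pd

    con = sqlite3.connect("excavation.db")

    # Export: dump the full (hypothetical) artifacts table to a
    # nonproprietary, structured format (CSV).
    original = pd.read_sql("SELECT * FROM artifacts", con)
    original.to_csv("artifacts_export.csv", index=False)

    # Import: read the export back in, as if migrating to a new system.
    reimported = pd.read_csv("artifacts_export.csv")

    # Verify: every column and row should survive the round trip.
    # (Checking values and types matters too - CSV can silently change both.)
    assert list(original.columns) == list(reimported.columns)
    assert len(original) == len(reimported)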

Planning

  • Plan for publication and reuse from the start
    • Funding requirements – funders frequently require data management plans that detail (or briefly describe) how you will store, process, share/disseminate, and archive the data you collect. They also prompt you to consider any sensitive aspects of your data and to plan how you will protect them appropriately

Everything is driven by your research question.

Building

  • Your starting point should be determining your unit of analysis, i.e. what does a record represent in your database, and how granular do your records need to be to answer your research questions?
    • Artifact vs trench – store data about an artifact in an artifact record and data about a trench in a trench record; don't combine them into a single table
      • Tables are structured so that columns are the “fields” and each row is a single “record”
      • An artifact record can reference the specific trench where the artifact came from without having to reproduce all of the data in that trench's record within the Trench Table. This is achieved through unique identifiers. What the identifier is or how it is structured is less important than ensuring that identifiers are completely unique and that every new record is assigned one (see the schema sketch below).

Each record in a database should have a unique identifier.
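
To make this concrete, here is a minimal sketch of the artifact/trench structure in SQLite; the table and field names are illustrative assumptions, not a prescription:

    import sqlite3

    con = sqlite3.connect("excavation.db")
    con.executescript("""
        -- One record per trench; trench_id is the unique identifier (primary key).
        CREATE TABLE trenches (
            trench_id   TEXT PRIMARY KEY,
            excavator   TEXT,
            date_opened TEXT
        );

        -- One record per artifact. Instead of repeating the trench's data,
        -- each artifact stores only that trench's identifier (a foreign key).
        CREATE TABLE artifacts (
            artifact_id TEXT PRIMARY KEY,
            trench_id   TEXT REFERENCES trenches(trench_id),
            material    TEXT,
            weight_g    REAL
        );
    """)

Note that the units live in the field name (weight_g) – a habit that pays off during data entry (see Gotchas! below).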

  • Normalization and other terms…
    • It can be helpful to know some of the jargon surrounding data management, but understanding the concepts and why they are important is far more useful
Term: Normalization
Definition: Database normalization is the process of organizing the attributes of a database to reduce or eliminate data redundancy (storing the same data in multiple places).
Significance: Redundancy not only wastes time because you are entering the same data in multiple spots, it also increases the chances of introducing errors into the dataset.

Term: Primary key
Definition: The primary key of a relational table holds a unique value that identifies a single record in the table.
Significance: Primary keys are used to connect tables so that you do not have to enter redundant information (see above), but even more importantly, they are the unique identifiers that allow you to attach all kinds of information to a physical object. They are the bedrock of a database and its analyses.

Term: Foreign key
Definition: A foreign key links records in one table to another by matching a primary key in that other table.
Significance: This is the field that stores the unique identifier for a record in a different table – it is how you retrieve all of the other information that would have been redundant to enter in this table.

Term: Relationship
Definition: How different tables are linked to each other – there are multiple options, including one-to-one, one-to-many, and many-to-many.
Significance: If you are building out a database, think through what type of relationship you are going to need. For example, an artifact can only be from one trench, but a trench will most likely have multiple artifacts. Similarly, a petrography record can only be related to a single sherd.

Term: Controlled vocabulary
Definition: A standardized set of words or phrases.
Significance: Think dropdown lists – these are great ways to speed up data entry and to ensure that words are spelled the same way each time. Published controlled vocabularies or thesauri can support research across datasets and collections. They also remove a lot of the work that goes into building a vocabulary and promote collaboration.
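
Reusing the hypothetical schema from the sketch above, the example below shows a foreign key doing its job in a one-to-many relationship, plus one way a database can enforce a small controlled vocabulary (the terms are made up):

    import sqlite3

    con = sqlite3.connect("excavation.db")

    # One trench can hold many artifacts (one-to-many). The join follows each
    # artifact's foreign key back to its trench record, reassembling on demand
    # the information that would otherwise have been entered redundantly.
    for artifact_id, material, excavator in con.execute("""
        SELECT a.artifact_id, a.material, t.excavator
        FROM artifacts AS a
        JOIN trenches AS t ON a.trench_id = t.trench_id
    """):
        print(artifact_id, material, excavator)

    # A controlled vocabulary can also be enforced by the database itself,
    # here by restricting a field to an agreed-upon list of terms.
    con.execute("""
        CREATE TABLE sherds (
            sherd_id  TEXT PRIMARY KEY,
            condition TEXT CHECK (condition IN ('complete', 'fragmentary', 'unknown'))
        )
    """)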

Fields and what goes into them

  • What form does the data need to be in when collected and stored so that you can run analyses later?
    • Type of data field (text, numbers, etc.)
    • Integers and decimal fields – precision and storing leading/trailing zeroes (see the sketch after this list)
    • Consider whether you should use a controlled vocabulary
      • If you want to be able to query or filter easily
      • If you want to speed up data entry
      • If you have multiple people entering data (or really just know that consistency isn’t easy for anyone)
      • If you are collecting data across multiple sessions/seasons (see above regarding consistency)
      • If your data can be categorized into a vocabulary
    • Spatial data – record the instrument/source (and its accuracy)
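
One common casualty of numeric field types is an identifier with leading zeroes. A small sketch with pandas (the file and column names are hypothetical):

    import pandas as pd

    # If an identifier like "0042" lands in a numeric field, the leading
    # zeroes vanish: context_id is read as the integer 42.
    df = pd.read_csv("finds.csv")

    # Declaring the field as text preserves the identifier exactly as entered.
    df = pd.read_csv("finds.csv", dtype={"context_id": str})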

Mock up a dataset and try to run the analyses you want to use in the future on your real data – can you? Or do you need to store the data differently?

Gotchas! (Or tips to avoid headaches)

  • Excel + dates
    • Different versions of Excel use different systems for calculating the serial number that Excel will frequently convert your date to, so it is easy to introduce errors unintentionally, and back-calculating to a formatted date may not always be possible.
      • To avoid: format the column to text prior to entering any dates
    • Other issues to watch out for: if you have any identifiers or values that might be interpreted as dates, they may also be converted to a serial number by Excel unless you preformat the column to text. For example: an identifier of 1-12 may be interpreted as January 12th even though there isn’t a year.
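
To illustrate why back-calculating is fragile: the serial number Excel stores only makes sense if you know which date system produced it. A sketch of the two conversions, assuming you have already determined which system applies (the serial value here is made up):

    from datetime import datetime, timedelta

    serial = 44197  # a hypothetical value recovered from a mangled column

    # Windows Excel's default "1900" date system. The 1899-12-30 origin
    # accounts for Excel's historical leap-year bug (for serials >= 60).
    print(datetime(1899, 12, 30) + timedelta(days=serial))  # 2021-01-01

    # Older Mac Excel workbooks use the "1904" system (day 0 = 1904-01-01),
    # so the very same serial decodes to a different date.
    print(datetime(1904, 1, 1) + timedelta(days=serial))    # 2025-01-02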
  • Dates in general (how granular are you actually going to be? circa? text field, periods/ranges, etc)
    • Be realistic from the start
    • Consider using a chronology or resource like PeriodO so that your period or chronological designation is tied to a specific publication/source
  • Do not store information through design (e.g. highlights, color-coding text, or groupings)
    • This type of formatting is frequently lost when you convert to different file types
    • It is easy to lose or forget to document what the formatting indicates, and the meaning may simply not be clear to others
    • It can also be an accessibility issue for collaborators
  • Data backups – ideally automated, but don't forget that a data backup should not rely on a specific service or software continuing to operate (you need to be able to get the data out)
  • Data entry differences – let’s be honest, nobody reads the manual
    • don’t rely on a guide or manual or single training session
    • include units in the field name
    • include a description of the field with an example of how the data should be entered next to field
    • regularly review data entered by others and hold discussions to ensure that everyone is on the same page
  • Measurement diagrams for the win
    • Quick sketches or standardized diagrams that show where a measurement was taken facilitate consistency and make it easier to run analyses later
  • Sensitive data – flag fields that contain or may contain sensitive data from the start and plan how to handle them prior to publishing and/or depositing in a long-term preservation repository
  • Images, files, and attachments being moved or stored in cloud syncing applications like Dropbox, Google Photos, or Box
    • Databases frequently store a reference to the physical location of a file on a drive, not the actual file. If a file is moved, or if a cloud storage system changes its path conventions, the database will not be able to find the file without work on your part.
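
A periodic check can catch this drift before it becomes a crisis. A sketch, assuming the references live in a hypothetical attachments table with a file_path field:

    import sqlite3
    from pathlib import Path

    con = sqlite3.connect("excavation.db")

    # Walk every stored file reference and report the ones that no longer
    # resolve to an actual file on disk.
    for (file_path,) in con.execute("SELECT file_path FROM attachments"):
        if not Path(file_path).exists():
            print(f"Broken reference: {file_path}")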
  • File naming and storage conventions
    • Keep an archive of raw data that is never edited, i.e. make a copy of the file before doing any processing or running analyses (see the sketch after this list)
    • Make it clear what has been processed (and document how it was processed)
    • Consider whether you will be incorporating data from other sources and whether it should be stored differently (e.g. if you will also be depositing this data, do you need permission or the rights to deposit it?)
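
As a sketch of the raw-data archive mentioned above (the paths are hypothetical): copy the file before touching it, and make the archived copy read-only so it cannot be edited by accident.

    import shutil
    import stat
    from pathlib import Path

    raw = Path("downloads/survey_2024.csv")   # hypothetical incoming file
    archive = Path("archive") / raw.name

    # Copy first (copy2 preserves timestamps), then do all processing on the
    # working copy; the archived copy stays untouched.
    archive.parent.mkdir(exist_ok=True)
    shutil.copy2(raw, archive)

    # Make the archived copy read-only as a guard against accidental edits.
    archive.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)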

Additional resources

Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989

Hertz, M. I., & McNeill, A. S. (2024). Eleven quick tips for properly handling tabular data. PLOS Computational Biology, 20(11), e1012604. https://doi.org/10.1371/journal.pcbi.1012604

Data Carpentry lesson on Data Organization in Spreadsheets for Social Scientists