Spreadsheets and data entry

This page describes Mike’s rules (suggestions) for data entry with spreadsheets.

Organizing the data file
Data entry checklist
General tips
Examples of spreadsheets for a simple data set
Conclusions

These suggestions are general and apply to Microsoft Excel, LibreOffice Calc, and Google Sheets. General principles outlined apply to all spreadsheet apps, including Apple Numbers. The suggestions assume the user intends to import data into R for additional data processing and analysis.

As a team, decide and record roles for each team member: roles may include reporter, data entry, data verification, data calculator

Before collecting data, discuss and come to agreement — What is the primary aim (question) of the project? What is to be measured? How do we record data (paper lab notebook, digital notebook, spreadsheet)?

If multiple reporters of data are involved, then spend time with a data entry form. Google Forms is easy to use and it permits linking to Google Sheets. Once the form is done, simply add data via the form and the spreadsheet file is automatically populated with your data.

General scheme for a spreadsheet data file

Figure 1. A general scheme for a spreadsheet data file with five sheets (workbooks). For this example, the workbooks consist of a single metadata sheet and sheets for data of the two projects. Each project is organized into one sheet for raw data and one sheet for any processed data.

Data entry and data file checklist

Note: Lots of items here and clearly, some or more essential to include than others.

Description	Pertains to	Criteria
File extension	File	File saved as proper file type
Filename	File	Follows proper file naming protocols
One project per worksheet	File	If multiple projects, then each reported in its own sheet
Worksheets	File	Each sheet renamed
Censored data	Metadata	How are censored data marked
Describe measure	Metadata	For each measure or observation, include brief description of how measure was made
Instruments used	Metadata	Include with description the instrument used, including manufacturer if relevant
Metadata sheet included	Metadata	Includes one data dictionary sheet with appropriate documentation
Missing value defined	Metadata	Report how missing values treated. Do not use 0 for missing value
Numerical entry	Metadata	Report, cell type (decimal, fraction)
Project date	Metadata	Date of project recorded
Project description	Metadata	Identify the project and briefly describe purpose or question(s) asked
Project time	Metadata	If relevant, start and end time of project recorded
Primary data recorded	Metadata	Clearly identify raw vs processed numbers
Record keeping	Metadata	Identify format of original data and archiving strategy.
Significant figures	Metadata	Specifies number of significant figures used
Subject ID	Metadata	Data entry used subject ID
Team members	Metadata	Metadata contains list of personnel and roles
Truncated data	Metadata	How are truncated data marked?
Units defined	Metadata	Metadata contains list of units for each observation or measurement type
Consistent data organization	Sheet	Data are organized in stacked (narrow) or unstacked (wide) format
No formatting	Sheet	No borders, no color, no bold type, etc.
No mixed data types	Sheet	Each column contains numbers or text, not mix of both
Observation Comments	Sheet	Include notes needed to clarify observations
Observation date	Sheet	If relevant, date observation recorded
Observation time	Sheet	If relevant, time observation recorded
Rows by Columns	Sheet	Data oriented by rows (individual subject) and columns: (variables)
Subject ID vs Observer	Sheet	Observer and subject clearly defined
Variable names	Sheet	Proper variable names, one word or acronym, no spaces

Tips for using spreadsheet apps for data entry

Because this page is about data entry and spreadsheets, we’ll move past general discussions about how to measure and to record data to focus on how to do data entry.

Before working with spreadsheet data, turn off
- automatic completion/input features.
  LibreOffice Calc: Tools → AutoInput;
  Microsoft Excel: File → Options → Advanced and clear the Enable AutoComplete for cell values check box;
  Google Sheets: Tools → Autocomplete, then deselect Enable autocomplete.
- automatic correction features.
  LibreOffice Calc: Tools → Auto Correct options → Options and Local options tabs, then deselect selections;
  Microsoft Excel: File → Options → Proofing and clear the various check boxes;
  Google Sheets apparently lacks auto correct (?).
- Why? Because you need complete control of data entry. Auto complete functions GUESS your intent; Auto correct may change entries like DAta to Data. Yes, that’s probably what you intended, but these “predictive text” features are known to cause data entry problems (see Ziemann and Abeysooriya 2021).
  - So yeah, proofing your data entries needs to be done and so it’s a good idea to share the process of data entry with some one — one person types the other proof reads each entry.
    - AI tools may provide assistance — real-time data validation via vision tools that compare a raw data source against the entries made by data entry in the spreadsheet.
    - AI can use advanced OCR to pull data from tables on an image.
  - After data entry is complete, you can turn back on Spell check etc and partially automate proof reading.
Spend time with the metadata, also known as the data dictionary or codebook.
- Enter names of observers and their contact information
- Enter date and time of observations
- Include statement about the time format used, e.g., month-day-year is common in USA, but not other countries.
- Similarly for time, include statement of time format. Spreadsheets are notorious for
- List the variable by name (remember, short names are best, no spaces, no special characters) and include a description of the variable name, even if it seems obvious.
- Declare the data type and units for each variable.
- Enter date and time of data entry — which can be different than date/time of observation and data collection.
- Provide a brief overview of what the data is about — what purpose was the data collected for?
- Record names of persons who did data entry and proofing.
After creating the sheet with metadata, create a second sheet for the raw data or primary data.
- Set up the spreadsheet as rows and columns. Columns are for variables, rows are for observations. Observations typically are the sampling unit.
  - For example, repeat observations of blood pressure on a subject, the row starts with the subject id, then each column is a single reading of blood pressure.
    - For composite measures like blood pressure, there should be a Systolic column and a Diastolic column — avoid writing composite measures as 120/80.
- Do not mix type: a column should have either numbers or text, not both.
  - For example, 120 mmHg is not a good entry because both number and text are included. Thus, the entire column is treated as text.
    - And besides, you reported units in the metadata, didn’t you?
Create a third worksheet for data wrangling, the process of converting raw data into a usable form. For example, our raw data for calorimeter includes mass of water, initial and final water temperatures, initial and final mass of food sample. We want the kilocalories per gram of food stuff; formulas to accomplish these calculations should be in this third worksheet.
- Continuing on this theme, plots belong in their own worksheets, not the sheets with raw data.
Decide how missing data will be identified in the data set.
- Missing values can be identified in the spreadsheet as an empty or blank cell; during import into R, R can be instructed to treat empty cells as missing values as NA for Not Applicable.
  - Python uses None and NaN for Not a Number in the pandas library.
This is just a start, and rather than reading me, go to Preparing Data in Excel at College of Public Health at University of Nebraska Medical Center. Their site, and many others, provide good guidelines that will stand you well.

Examples of spreadsheets for a simple data set.

It goes without saying — it’s better to get the raw data into a spreadsheet file than not at all. There’s plenty of guidance out there for applying data wrangling approaches to cleaning and transforming raw data into usable form. It’s also obvious — it’s easier to anticipate the structure and perform data entry according to how the structured data need to be for subsequent analytical steps. Figure 2 through Figure 4 show examples of simple data sets, but with increasing data cleaning demands.

Figure 2 shows an ideal, simple spreadsheet entry. No mixed data types in columns, the column header is in the first row with simple, descriptive one word entries, and data begins in the next row. We assume that the data dictionary, including units and descriptions of Brand and other column headers are in a separate worksheet in the spreadsheet file.

Figure 2. Screenshot of a typical data sheet, just rows and columns with short, descriptive words as column headers. LibreOffice Calc. Image included in Part 05. R packages: Making R do more.

Figure 3 shows the same spreadsheet entry Fig 2. However, rows one through four contain elements of a data dictionary: units and descriptions of Brand and other column headers. This format is acceptable; it just requires the user to enter additional commands to specify the format and location of the data.

Figure 3. Screenshot of a data sheet with a few rows of meta information and the data in row 5 with short, descriptive words as column headers. LibreOffice Calc. Image included in Part 05. R packages: Making R do more.

Figure 4 shows the same data, but entered in a way that will require substantial data wrangling. Columns B through K now contain mixed data types in the cells, a mix of numbers and text. Header information in column B and C now also include spaces and special characters that are likely to cause problems during import. For example, during data import, you can force R to import names like bagWeight (g), but it would require enclosing the name in single quotes, like ‘bagWeight (g)’. However, including () in an object or variable name will cause a syntax error when called.

Figure 4. Screenshot of a data sheet with mixed data types — numbers and text in the same cell — and column headers with spaces, units, and other characters — bagEmpty (g) — that will cause R problems during import. LibreOffice Calc.

Conclusions

Enforcing strict data entry rules is probably better for long-term data integrity and operational efficiency, while data wrangling after the fact is often necessary for disparate or legacy data sources. The ideal approach is to combine both: implementing validation rules at the source (pre-entry) and using automated, documented workflows for necessary cleaning later (post-entry). Modern spreadsheet apps include many features designed to help at both stages; R programming language includes many packages to assist data wrangling.