Information Regarding Data Collection Software

JCOIN Methodology and Advanced Analytics Resource Center (MAARC)

Version 1.1

October 7, 2024

Selecting which software to use when collecting demographic, clinical and survey data (especially data obtained via chart review, interview or self-completed questionnaire) is one of the most important decisions you make when executing a study. Often this decision is made based merely on the availability of an institutional license, the familiarity of a study staff member with a particular package, or simple inertia (i.e., “We have always used this before.”). Since switching software after a study has begun is costly and risks both substantial disruption and a reduction in data quality (possibly even data loss), it is worth spending some time up front to make sure that your choice is carefully considered.

Good research, including collecting high quality data, can be done with most, if not all, commonly used, modern software packages. What primarily distinguishes them is the amount of time and effort required, as well as the flexibility they provide. Thus, this note is not intended as a zealous endorsement of a single software application. Rather, we start by providing a brief overview of what we believe are the most important requirements to consider when selecting a data collection application for use in collecting quantitative data (collection of qualitative data is outside the scope of this note). We then go on to explain how Research Electronic Data Capture (REDCap) meets these requirements, as well as listing several additional features (and some limitations) of REDCap. Readers are encouraged to evaluate other software packages they may be considering according to these criteria.

Some data collection software packages, including REDCap, combine features for data collection, study administration, data management, and data analysis (including report generation and visualization) all within the same system. We believe that while it can be beneficial to use the same system for data collection and study administration, data management and analysis are best handled using different software after the data have been exported. Software packages such as Stata or R, or languages such as Python or Julia, are much more powerful and efficient for performing data management and analysis. Moreover, these environments permit the work of data curators and analysts to be easily documented and reproduced by other researchers, a key component of the NIH Data Management and Sharing Policy that took effect in 2023. For this reason, we will not comment on data management and/or analysis capabilities here.

Key Requirements

The following are key requirements of any software system used to collect quantitative data for biomedical or social research:

1. The system should explicitly support your study design and your data collection protocol(s); this not only facilitates conducting the study, but also leads to better data quality. While the flexibility of a general-purpose system may seem attractive, using one requires much more experience and effort to implement an instrument well suited to your study. In addition, a purpose-built system will typically have a more appropriate metadata model, which can greatly facilitate generating data products for analysis and archiving that are consistent with good data sharing practices.

2. The system should include the following core features for form design: (a) adequate field types (e.g., integer, float, fixed choice, string, date and datetime, file upload, etc.) with modern interface widgets to facilitate and validate data entry; (b) the ability to organize a questionnaire into sections, adjust the layout to facilitate completion, and add instructions and notes for the user; (c) the ability to skip individual items or entire sections based on characteristics of the participant and/or responses to previous items; and (d) the ability to readminister the instrument or selected sections such as when doing longitudinal follow-up or collecting information about multiple events, samples, etc.

3. The system should be deployable securely via the internet. This allows simultaneous data collection by multiple people across multiple locations (including self-administration by participants), ensures that data are immediately accessible centrally (e.g., to permit study monitoring), and ensures that the data are secure as soon as they have been collected (e.g., even if a study laptop is stolen).

4. The system should be secure, should meet federal requirements for the collection of research data (e.g., be HIPAA compliant), and should be managed and deployed in such a way as to provide reliable service and ensure data security (e.g., by creating regular, dependable backups).

5. The system should provide a fine-grained and flexible role-based permissions service. This permits restricting access for individual study personnel to only those pieces of information or data they need to do their job. For example, staff involved only in screening should not be able to view data collected after enrollment, and staff involved in monitoring data quality and completeness should not have access to identifying information about study participants.

6. If your study involves recruiting participants and collecting data at multiple sites, then the data collection system must support this explicitly. Specifically, it must restrict staff at each site to viewing only those data collected at that site and, if necessary, permit a site to manage its own secondary (i.e., satellite) sites.

7. Metadata should be easily accessible, ideally programmatically (e.g., via an open Application Programming Interface (API); see paragraph immediately below). This should include both study-level metadata (e.g., definition of study arms, study events, etc.) and variable-level metadata (e.g., field names, types, descriptions, response options, etc.). Being able to extract these metadata in a usable format greatly facilitates study documentation and the creation of stand-alone data products for analysis and archiving.

8. Finally, the data themselves should be exportable in an open format that preserves all the information collected and facilitates downstream curation and manipulation, also ideally via an open API (see paragraph immediately below). Some systems target a specific analytic package (e.g., exporting the data in Stata or SAS format) to increase convenience, but these often make setting up reproducible, efficient data management and curation procedures more difficult, and the use of non-open formats is inconsistent with the FAIR principles (i.e., that data should be Findable, Accessible, Interoperable and Reusable).

For those unfamiliar with the concept of an Application Programming Interface (API), this permits third-party developers to interact programmatically with a software system such as REDCap or Qualtrics to extend the functionality of the system and/or to integrate it into an existing workflow. For example, in the case of software for data collection, an API could be used as part of a workflow to update a dashboard for monitoring study progress without having to repeatedly export the data manually, or for pushing information into the system such as participant data collected through a different mechanism. Importantly, while APIs are typically intended for software engineers, data managers or data scientists often use them effectively to increase efficiency and reproducibility.
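As a rough illustration of the dashboard workflow described above, the following Python sketch builds a record-export request of the kind accepted by REDCap's API and tallies the exported records by site. The host name, the token, and the `site` field name are placeholders introduced for this example (they do not come from this note), and the function that contacts the server is defined but not called here.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

# Placeholder endpoint: substitute the API URL of your own REDCap instance.
API_URL = "https://redcap.example.edu/api/"  # hypothetical host


def build_export_payload(token, fields=None):
    """Build the form fields for a flat JSON export of project records."""
    payload = {
        "token": token,        # project-specific API token (keep it secret)
        "content": "record",
        "format": "json",
        "type": "flat",
        "returnFormat": "json",
    }
    # Optionally limit the export to selected fields (array-style parameters).
    for i, name in enumerate(fields or []):
        payload[f"fields[{i}]"] = name
    return payload


def export_records(token, fields=None, url=API_URL):
    """POST the request and decode the JSON response (requires network access)."""
    data = urllib.parse.urlencode(build_export_payload(token, fields)).encode()
    request = urllib.request.Request(url, data=data)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def enrollment_by_site(records, site_field="site"):
    """Count exported records per site; 'site' is an assumed field name."""
    return Counter(r.get(site_field, "") or "(missing)" for r in records)
```

A scheduled job could call `export_records(...)` and pass the result to `enrollment_by_site(...)` to refresh a progress dashboard, avoiding repeated manual exports.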

We recognize that in certain applications, additional features may be important or even required. However, if you are considering using data collection software that does not meet the requirements listed above, make sure to consider carefully the implications of this and how you will work around it, if necessary.

Research Electronic Data Capture (REDCap)

REDCap is a data collection system that was created at Vanderbilt University and is now developed and maintained by the REDCap Consortium; it is widely used by the academic research community, with an estimated 3.3 million end-users in 159 countries. Although the software is free for non-commercial use, it is not open-source and requires membership in the Consortium. REDCap was initially designed for use in biomedical research, especially for collecting data during clinical trials. However, it now includes extensive survey capabilities which can be used for epidemiological and social research and is also used to support research activities (e.g., collecting metadata on laboratory samples, administering experiments through use of its randomization capabilities, etc.).

REDCap does an excellent job of meeting the requirements listed above:

1. REDCap’s primary focus is biomedical research, and it includes built-in support for many common study designs (e.g., single and multiple arm studies including both parallel and other designs, longitudinal studies, etc.). In addition, its design and flexibility permit easy implementation of additional studies such as crossover designs, complicated multi-stage experiments, etc., and its survey features accommodate not only clinical but also population-based and social research. It includes a rich set of features to facilitate study execution such as a randomization module that meets the strict requirements of a randomized clinical trial (i.e., not just A/B testing) and features for study management and monitoring. Finally, its metadata model is based on CDISC, a widely used and accepted standard for clinical research.

2. REDCap’s forms and fields have evolved considerably over time and are now quite full featured, including a wide range of field types, modern interface widgets with helpers and validation, the ability to define and use multiple reserve (i.e., missing value) codes, and the ability to customize a form’s layout easily without advanced programming knowledge. Perhaps most importantly, REDCap provides flexible and easy-to-use mechanisms for defining different paths through an instrument, including the ability to skip items or forms based on complicated conditions, repeat forms at different events or even within the same event (e.g., as when repeating a form for each of a number of biological samples from a single participant), and define complicated follow-up schedules which may even vary within a study (e.g., different follow-up for different study groups).

3. REDCap is a secure, web-based system that may be accessed via a standard browser or with a mobile app on a phone or tablet, available for both iOS and Android.

4. Individual REDCap instances are hosted and maintained by institutional members of the REDCap Consortium. This includes most academic medical centers and research universities. The software itself includes several security features (e.g., two-factor authentication) and is HIPAA compliant, and individual instances are typically managed in a way that meets applicable federal, state, and institutional requirements. Individual institutions may impose additional security measures at their discretion (e.g., limiting access to the campus network or VPN). As described below, the MAARC can provide access to the REDCap instance hosted by the University of Chicago for those JCOIN investigators who would not otherwise have access to REDCap. This REDCap instance is validated for 21 CFR Part 11 compliance. Data are stored in a secure data center at the University of Chicago equipped to house systems falling under federal guidelines including the Health Insurance Portability and Accountability Act (HIPAA) and the Federal Information Security Management Act (FISMA). Only MAARC staff have access to the data, and the dataset can be shared with the JCOIN Hub providing the data at a mutually agreeable update frequency.

5. REDCap has a role-based permissions service that permits restricting access at the level of individual data collection forms and to individual software functions.

6. REDCap fully supports data collection across multiple sites, including permitting a given site to manage its own secondary (i.e., satellite) sites.

7. REDCap’s metadata model is an extension of the CDISC standard, and both study-level and variable-level metadata may be exported in CDISC ODM format (version 1.3.1), making them easily usable by existing software. Variable-level metadata may also be exported in the form of a comprehensive data dictionary, which can serve as human-readable documentation and may be used to review and even modify an existing questionnaire. Metadata may be exported through the web interface or via an open Application Programming Interface (API).

8. Data may be exported from REDCap in a wide range of formats, including open formats (e.g., CSV), either through the web interface or via the API. When combined with the metadata in Item (7) above, this permits the automatic creation of an open, portable, and fully documented data package that may be used internally, shared with collaborators, and submitted to a data repository. The open-source software package dataforge includes functions for this purpose, as well as to facilitate data management and curation following export through the REDCap API. This can save an enormous amount of time as well as reduce the chance of data manipulation errors.

In sum, REDCap’s focus on biomedical research, including clinical trials and other types of studies, makes it easy to implement and deploy an appropriate and professional data collection instrument in most cases, while at the same time being sufficiently flexible to accommodate many unique study elements.
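To make the value of exportable variable-level metadata more concrete, here is a minimal Python sketch that turns data dictionary rows into a simple codebook. It assumes rows shaped like REDCap's data dictionary export (keys such as `field_name`, `field_type`, and `select_choices_or_calculations`, with response options written as "code, label" pairs separated by "|"). The sample rows are invented for illustration, and this is not how the dataforge package works internally.

```python
def parse_choices(raw):
    """Turn 'code, label | code, label' into a dict mapping code to label."""
    choices = {}
    for part in raw.split("|"):
        part = part.strip()
        if not part:
            continue
        # Split on the first comma only, since labels may contain commas.
        code, _, label = part.partition(",")
        choices[code.strip()] = label.strip()
    return choices


def build_codebook(metadata):
    """Summarize each field's name, form, type, label, and response options."""
    codebook = []
    for row in metadata:
        entry = {
            "name": row["field_name"],
            "form": row["form_name"],
            "type": row["field_type"],
            "label": row["field_label"],
        }
        if row["field_type"] in ("radio", "dropdown", "checkbox"):
            raw = row.get("select_choices_or_calculations", "")
            entry["choices"] = parse_choices(raw)
        codebook.append(entry)
    return codebook


# Invented example rows (hypothetical fields, for illustration only).
SAMPLE_METADATA = [
    {"field_name": "record_id", "form_name": "screening", "field_type": "text",
     "field_label": "Record ID", "select_choices_or_calculations": ""},
    {"field_name": "prior_tx", "form_name": "screening", "field_type": "radio",
     "field_label": "Any prior treatment?",
     "select_choices_or_calculations": "1, Yes | 2, No, never"},
]

for entry in build_codebook(SAMPLE_METADATA):
    print(entry)
```

A codebook of this kind can be saved alongside the exported data so that the resulting package documents itself.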

Additional Features

Here are some of REDCap’s additional features (more information is available on their website):

• Although REDCap is a web application, the mobile app may be used to enter data offline on either a phone or tablet (both iOS and Android); data may then be uploaded when an internet connection is available. This permits collection of data in situations where internet access is not available (e.g., in correctional facilities or rural areas).

• Many existing measures already have REDCap forms available which can be reused, thereby reducing the amount of work required to prepare for data collection (e.g., REDCap forms are available for all of the common data elements (CDEs) in the HEAL CDE Repository).

• REDCap has a well-designed consent module that may be used to screen and consent research participants, including doing so remotely (e.g., when a participant is at home). This is easy to set up and is consistent with best practices for obtaining consent from human subjects.

• As noted above, REDCap has a randomization module designed to support the rigorous requirements of a randomized clinical trial. Although at present you can only define a single randomization per project, multiple projects can be combined to support more complicated randomization schemes. This module may also be used to support randomized laboratory or psychological experiments.

• REDCap’s survey mode may be used to administer self-completed questionnaires permitting remote data collection; forms may be designed or modified to make them easier for research participants to complete themselves, and participant-specific codes may be generated for additional confidentiality and to ensure that only the designated participant may complete the questionnaire. This process can then be driven using automated emails, making it efficient and scalable for large studies. Questionnaires may be set to allow the participant to come back and complete and/or update the information later, if desired.

• REDCap includes several functions to facilitate administering a study, such as sending automated emails when certain events occur. These functions can be leveraged to increase efficiency and reduce the workload for study personnel.

• There is considerable (and growing) familiarity with REDCap among academic research institutions; this makes things easier for the staff administering a study, and facilitates certain administrative tasks (e.g., interactions with the IRB, with IT groups, etc.).

• REDCap permits rigorous control over editing previously collected data (e.g., when correcting or cleaning data), including a mechanism for requiring supervisor approval, when necessary.

• Finally, REDCap has an open, well-documented RESTful Application Programming Interface (API), as well as data entry triggers (aka webhooks) that permit the development of automated processes and interoperability with other data collection systems or web-based services, thereby extending its capabilities.
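As one hedged illustration of how a data entry trigger might feed an automated process, the Python sketch below decodes a form-encoded trigger body and reports when a form has been saved as complete. It assumes the trigger posts fields named `record`, `instrument`, and `<instrument>_complete`, with a status of "2" meaning complete. The sample body is invented, and the field names should be verified against the documentation for your own REDCap instance.

```python
from urllib.parse import parse_qs


def parse_trigger(body):
    """Decode a form-encoded trigger body into a flat dict of strings."""
    return {key: values[0] for key, values in parse_qs(body).items()}


def completed_instrument(event):
    """Return (record, instrument) if the saved form is complete, else None."""
    instrument = event.get("instrument")
    # Assumed status coding: "0" incomplete, "1" unverified, "2" complete.
    if instrument and event.get(f"{instrument}_complete") == "2":
        return event.get("record"), instrument
    return None


# Invented example of what a trigger might post (hypothetical values).
SAMPLE_BODY = (
    "project_id=12&instrument=baseline&record=101"
    "&redcap_event_name=enrollment_arm_1&baseline_complete=2"
)

event = parse_trigger(SAMPLE_BODY)
print(completed_instrument(event))
```

In practice a small web service would receive the POST, call functions like these, and then use the API to fetch the full record or notify study staff.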

Limitations

Despite the many strengths and features described above, REDCap does have some limitations that are worth noting (as of REDCap version 13.1.27):

• While recent versions of REDCap have considerably increased the extent to which the layout and visual appearance of data collection forms and questionnaires can be customized, there remain limits to what can be done. Users who need to implement highly customized layouts and/or interface elements from the ground up and have the expertise to do so may wish to consider other systems. While the MAARC can provide general advice on alternatives for implementing and deploying custom interfaces, we do not have the resources or expertise to provide specific guidance for systems other than REDCap.

• Although REDCap supports the full UTF-8 character set and therefore permits data collection forms to be created in most languages, it does not have a built-in mechanism for simultaneously supporting multiple languages (i.e., internationalization). In cases where questionnaires need to be administered in two or three languages only (e.g., English and Spanish), it is straightforward to create the questionnaire in the first language, translate it to the second (e.g., by having a bilingual translator work directly with the exported data dictionary), and then deploy both questionnaires in parallel within the same project using REDCap’s skip logic to administer the appropriate one to each participant once he or she has indicated a preferred language. However, if you need to administer a questionnaire in more than two or three languages, you may want to consider using a third-party REDCap extension (at least one extension has been written for this purpose) or a different system altogether.

• REDCap’s randomization module does not support response adaptive designs, i.e., those in which the assignment probabilities vary over the course of the study based on the characteristics and/or outcomes of those who have been randomized thus far. Note that this is not a reason to avoid using REDCap for data collection, however, since response adaptive randomization may be implemented by using REDCap’s API and webhooks to interoperate with an external web-based service providing adaptive randomization (e.g., the MAARC has written an open source, web-based application to perform urn randomization and can provide support for it upon request).

• Finally, while it is possible to extend REDCap’s functionality, such extensions must be written in PHP and deployed by the REDCap system administrator, and many institutions are understandably reluctant to enable third-party extensions for REDCap. Note that the main limitation here is in developing new interface elements (e.g., multilingual support), since new backend capabilities (e.g., interoperability with other systems) can often be added through use of the API and webhooks.
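To give a sense of what an external randomization service might compute, here is a minimal Python sketch of a generic urn scheme, in which each assignment shifts the probabilities toward the other arm(s). It is not the MAARC's application. The class name, the parameters `alpha` (starting balls per arm) and `beta` (balls added to each non-chosen arm), and the seed are assumptions made for this example.

```python
import random


class UrnRandomizer:
    """Generic urn scheme: draw a ball, then add balls for the other arm(s)."""

    def __init__(self, arms=("A", "B"), alpha=1, beta=1, seed=None):
        if alpha < 1:
            raise ValueError("alpha must be at least 1 so the urn is never empty")
        self.balls = {arm: alpha for arm in arms}
        self.beta = beta
        self.rng = random.Random(seed)

    def probabilities(self):
        """Current assignment probability for each arm."""
        total = sum(self.balls.values())
        return {arm: count / total for arm, count in self.balls.items()}

    def assign(self):
        """Draw an arm in proportion to its balls, then rebalance the urn."""
        arms = list(self.balls)
        weights = [self.balls[arm] for arm in arms]
        chosen = self.rng.choices(arms, weights=weights, k=1)[0]
        # The drawn ball is returned; every other arm gains beta balls.
        for arm in arms:
            if arm != chosen:
                self.balls[arm] += self.beta
        return chosen


urn = UrnRandomizer(seed=2024)
assignments = [urn.assign() for _ in range(10)]
print(assignments, urn.probabilities())
```

A service built around logic like this could be called from a data entry trigger and write the assigned arm back to the participant's record through the API.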

MAARC REDCap Services

The MAARC has considerable experience in the use of REDCap, including implementing complicated study designs, and in writing software and services to extend REDCap’s capabilities to accommodate unique study requirements. We have also written software to facilitate the management, curation and packaging of data collected using REDCap. Note that while we can provide guidance on the general use of REDCap and on implementing a specific study, we do not have sufficient resources to create data collection forms or reports, to export and/or manipulate data, or to provide REDCap user support.

In cases where your institution hosts its own REDCap instance, we strongly advise using that as opposed to a REDCap instance hosted elsewhere. Your local instance will be more familiar to your IRB and other administrative units, and it is likely that you will be able to obtain timelier, more direct assistance when needed (e.g., many institutions have dedicated REDCap support staff and provide training, office hours, etc.). However, if you would like to use REDCap but do not have access to it through your institution or otherwise, the MAARC can provide access to the REDCap instance hosted at the University of Chicago for JCOIN-funded studies. Data are stored in a secure data center at the university equipped to house systems falling under federal guidelines including the Health Insurance Portability and Accountability Act (HIPAA) and the Federal Information Security Management Act (FISMA); this REDCap instance is also validated for 21 CFR Part 11 compliance.

Please contact us if you would like to schedule a pre-application REDCap consultation and/or are interested in any of these services. The MAARC will provide a one-hour initial consultation to interested applicants to answer questions they might have about REDCap’s capabilities, use of REDCap at the University of Chicago, and the supporting software and services provided by the MAARC. If necessary, a one-hour follow-up consultation will also be provided.

FAQs

Does our data collection program need to be HIPAA compliant?

My institution does not support REDCap because we do not have the staff to join the REDCap Consortium. How can we still use REDCap for JCOIN?

If you do not have access to REDCap through your institution, the MAARC is able to provide access to REDCap hosted at the University of Chicago for JCOIN-funded studies. When necessary, the MAARC will also broker assistance between study staff and the University of Chicago’s REDCap user services; this will be limited to resolving problems due to the way the system is administered such as service disruptions, problems following software upgrades, problems with user access, or more generally instances where the system is not behaving in the expected manner. Note that the MAARC does not have sufficient resources to create data collection forms or reports, to export and/or manipulate data, or to provide general REDCap user support.

We have a community partner who will be collecting data. Can other users have access to the data collection system (i.e., REDCap)?

For biomarkers and other laboratory data, does REDCap interface with those systems to bring those data in, or do they have to be entered manually?

REDCap’s strength is as a data collection system, not a data management system. Thus, in nearly all cases where data are being obtained from additional sources (e.g., biomarker data from one or more laboratories), we strongly suggest that those data be combined with the data collected in REDCap after the latter have been exported; we are glad to provide guidance in creating appropriate pipelines and workflows for doing so. The only exception would be if the data from other sources are necessary to administer the study, in which case we can provide guidance on how to set up an automated procedure to bring those data into REDCap using its API.
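The export-then-combine approach described above can be sketched in a few lines of Python. This example joins laboratory results onto exported survey rows by participant identifier and lists any laboratory rows that have no matching participant. The `record_id` key, the `hcv_ab` column, and the sample data are hypothetical, and a real pipeline would read the two files from disk rather than from inline strings.

```python
import csv
import io

# Invented sample data standing in for a REDCap CSV export and a lab file.
REDCAP_EXPORT = "record_id,age\n101,34\n102,51\n"
LAB_RESULTS = "record_id,hcv_ab\n101,negative\n103,positive\n"


def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_lab_results(survey_rows, lab_rows, key="record_id"):
    """Left-join lab columns onto survey rows; also return unmatched lab IDs."""
    labs = {row[key]: row for row in lab_rows}
    lab_columns = [c for c in (lab_rows[0] if lab_rows else {}) if c != key]
    merged = []
    for row in survey_rows:
        lab = labs.get(row[key], {})
        # Participants without lab results get empty strings, not dropped rows.
        merged.append({**row, **{c: lab.get(c, "") for c in lab_columns}})
    survey_ids = {row[key] for row in survey_rows}
    unmatched = sorted(k for k in labs if k not in survey_ids)
    return merged, unmatched


merged, unmatched = merge_lab_results(read_rows(REDCAP_EXPORT), read_rows(LAB_RESULTS))
print(merged)
print("Lab rows without a matching participant:", unmatched)
```

Reporting unmatched identifiers, rather than silently discarding them, makes it easier to catch data entry or labeling errors before analysis.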

We will be in a location that does not have Wi-Fi or internet access. Can we use REDCap?

We work with a Spanish-speaking population. Can REDCap work with other languages?

We want to collect social network data. Does REDCap allow for that?

While REDCap does not have built-in functionality for collecting social network data, its existing features can often be used to simulate such functionality. The MAARC has experience in doing this and has successfully used REDCap to collect social network data for several studies, some with quite specialized requirements. We are glad to share our experiences and advise on strategies for using REDCap to collect social network data. We are also glad to provide guidance on the use of other purpose-built systems (e.g., Network Canvas), either alone or in conjunction with REDCap, when necessary.

We have a subscription to Qualtrics. Can we still use that for data collection in JCOIN?

Yes, Qualtrics can be used to collect data for your JCOIN study. Here is a feature comparison between REDCap and Qualtrics focusing on those features most important for data collection.

Feature	REDCap	Qualtrics
Explicit support for common study designs including longitudinal studies	Yes	No
Rich data entry widgets	Yes	No
Custom form layout and design	Yes²	Yes (requires programming)
Skip logic	Yes	Yes
Arbitrary flow through questionnaire	Limited³	Yes (“survey flow”)
Existing forms for common biomedical measures	Yes	No
Consent module	Yes	No⁴
Randomization module	Yes	Limited⁵
Self-completed questionnaires	Yes (survey mode)	Yes
Support for multiple data collection sites	Yes	Limited (“collaborations”)
Detailed, role-based permissions	Yes	No
Require approval for data edits	Yes	No
Export data formats	CSV, Stata, R, SAS, SPSS, CDISC ODM	CSV, SPSS, XML, JSON, Tableau
Established metadata model	Yes (CDISC ODM)	No (proprietary QSF file)
HIPAA compliant	Yes	Yes
Secure, web-based interface	Yes	Yes
Mobile app for offline use	Yes	Yes
API	Yes	Yes
Supports multiple languages	Limited⁶	Yes (“translations”)

¹It is easier to create new data entry widgets in Qualtrics.

²Recent versions of REDCap have added considerably more functionality in this area, though complete control (as with Qualtrics) is not possible.

³Limited to what can be done by assigning forms to events and utilizing skip logic.

⁴Signatures may be entered into a questionnaire, but no support is provided for a standard consent workflow.

⁵Limited to A/B testing; does not support best-practice RCT workflows.

⁶Can handle full UTF-8 character set for non-English languages; simultaneous multiple language support available via third-party plugin, though institutions may be unwilling to install this.

As shown in the table above, the primary advantages of REDCap over Qualtrics are its support for common biomedical study designs and their components (e.g., existing forms for standard measures, consent module, randomization module, multiple site support, and detailed, role-based permissions), and the ability to export the resulting data and metadata in a format that facilitates creating analytical files and shareable data packages (via the dataforge package). On the other hand, the primary advantages of Qualtrics are its flexibility in designing questionnaires (e.g., arbitrary paths through the instrument, designing new data entry widgets and layouts from scratch) and support for multiple simultaneous languages, though the former requires programming expertise. Thus, use of Qualtrics may involve some additional work to package data for sharing via the MAARC’s infrastructure. If Qualtrics is used, please plan for necessary staff time to support this work. The MAARC is available for pre-application consultation to support planning for data management tasks necessary to meet the RFA requirements around data sharing for successful applicants. A one-hour consultation is available to all interested applicants to provide information about the MAARC’s infrastructure and answer questions about how data collection systems may interact with it. If necessary, a one-hour follow-up consultation will also be provided.