Blogs and Articles

Understand What I Mean, Not What I Say

 

October 2, 2017 | I think we can all agree that word choice is important when trying to communicate one’s intention. Most of us have likely had a good chuckle after our phones inappropriately auto-corrected a text message. In personal communications, we can explain and clarify in order to make ourselves understood. But in information technology, the wrong term or the wrong meaning can have damaging implications for querying, reporting, and analyzing data.

In the field of biobanking, where it is common for biobanks to evolve for different purposes, in disparate locations, and with isolated informatics systems, it’s no surprise that communication within and across biobanks, even within one institution, is challenging and fraught with miscommunications. I’ve been part of countless conversations in which the meanings of the terms sample, specimen, and biospecimen are endlessly debated. If one’s goal is to share specimens and combine information from biobanks across multiple institutions, the difficulties escalate.

Biobanks and clinical research organizations have two general options when attempting to harmonize or standardize their data management practices: 1) everyone can adopt a common terminology, or 2) biobanks can continue to use their own terms and map them to an agreed-upon standard.

In option one, using a common terminology, all users of the database use the same terms and definitions. The terms and definitions may be put forth by an external standards organization or may be an internally developed common terminology. This approach does not require mapping or translation and is straightforward and orderly. It is best suited for an organization that has several biobanks under a single governance structure. Querying should result in accurate reports since the meaning of each term is clearly defined and commonly accepted. If starting a biobank from scratch, planning ahead and establishing a transparent policy requiring the use of a centrally maintained, standard terminology makes this route easier to employ.

At academic medical centers especially, it is common to have dozens if not hundreds of existing biobanks, and an equal number of information systems. Migrating these legacy data into a single standardized system means that many or most existing terms will need to change. Changing the terms is a technical implementation issue. Changing the way people talk and use the words is another thing altogether. An old “change the light bulb” joke comes to mind: “It only takes one psychiatrist to change a light bulb, but the light bulb has to want to change.” Similarly, transition of an institution to standard biobanking terminology is unlikely to succeed unless the biobanks themselves desire the change and are invested in the outcome.

Setting Definitions

The second option, mapping to a set of terms or definitions to aggregate data, has the benefit of allowing individual banks to continue to use the terms with which they are comfortable, while allowing merging and querying of aggregated data. This option is best suited for collaborations and for institutions that share samples. However, mapping takes significant time, and information can be lost.

In my experience, it is rare for biobanks to create and maintain a detailed, easily understood data dictionary, with a comprehensive list of terms and adequate definitions. Therefore, a mapping process requires at least one person, possibly more, who is intimately familiar with the term meanings, and a data manager who can carry out the mapping and translation process. Inconsistencies in how the existing terms are used and managed may lead to the need for further data cleaning.

Mapping does not always result in a one-to-one relationship; one term may become two. For example, a common way to record a sample type may be “EDTA Plasma”, combining both the preservative (EDTA) and biological material (Plasma) in a single field. This single field would map to two separate fields in order to more clearly interpret its meaning and not lose information. Information about the relationships between the terms may be lost if mapping only term-to-term.
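As a rough illustration of a one-to-two mapping, the sketch below splits a combined local label into separate preservative and material fields. The field names, the lookup table, and the "Serum" row are assumptions made for this example; only the “EDTA Plasma” case comes from the text above.

```python
from typing import NamedTuple, Optional


class SampleType(NamedTuple):
    preservative: Optional[str]
    material: str


# Hypothetical mapping table from one biobank's local labels to two fields.
# Only "EDTA Plasma" is taken from the article; the "Serum" row is illustrative.
LOCAL_TO_STANDARD = {
    "edta plasma": SampleType(preservative="EDTA", material="Plasma"),
    "serum": SampleType(preservative=None, material="Serum"),
}


def map_sample_type(local_term: str) -> SampleType:
    """Translate a local label into separate preservative and material fields."""
    # Normalize case and spacing so "EDTA  plasma" and "EDTA Plasma" match.
    key = " ".join(local_term.split()).casefold()
    try:
        return LOCAL_TO_STANDARD[key]
    except KeyError:
        # Unmapped terms are flagged for expert review rather than guessed at.
        raise ValueError(f"No mapping defined for local term: {local_term!r}") from None


mapped = map_sample_type("EDTA  plasma")
print(mapped.preservative, mapped.material)
```

Raising an error on unmapped terms is a deliberate choice in this sketch: it sends unfamiliar labels back to the person who knows what they mean instead of silently dropping information.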

A better alternative is to map to an ontology, which not only contains terms and definitions, but also contains the relationships between them. A comprehensive biobanking ontology is human readable and machine computable and can be used repeatedly to bring disparate data sources together (DOI: 10.1186/s13326-016-0068-y). The use of a well-constructed ontology gets us closer to the goal of speaking the same language.
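To show why relationships matter, here is a minimal sketch of a toy ontology in which each term points to a broader term. The terms, the "is_a" links, and the sample records are all hypothetical and are not taken from the ontology cited above; a real project would load a published ontology rather than hand-code one.

```python
# Hypothetical mini-ontology: each term points to its broader term ("is_a").
IS_A = {
    "plasma": "blood derivative",
    "serum": "blood derivative",
    "blood derivative": "biospecimen",
    "urine": "biospecimen",
}


def ancestors(term):
    """Return every broader term reachable from `term` through is_a links."""
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found


def is_kind_of(term, broader):
    """True if `term` is `broader` itself or one of its descendants."""
    return term == broader or broader in ancestors(term)


# Illustrative records from two banks that label their samples differently.
records = [("bank A", "plasma"), ("bank B", "serum"), ("bank B", "urine")]

# One query on the broader term finds both plasma and serum records.
matches = [r for r in records if is_kind_of(r[1], "blood derivative")]
print(matches)
```

A flat term-to-term mapping could not answer this query, because nothing in it would record that plasma and serum are both derived from blood.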

Both approaches are time-consuming and require substantial stakeholder engagement. At minimum, data managers, biobank subject matter experts, and the end-users of the data must provide input. A project leader needs to drive the project, help prioritize decisions, and curate input from stakeholders. In option one the hard work is done up front, before biobank implementation; in option two the work is done after the biobank databases are in use, which slows down querying. When planning a project, I recommend doubling, if not tripling, the initial time estimate. With institutions widely aiming to maximize specimen and data utility, decisions like these are at the forefront for major biobanks working to leverage their specimen collections and the valuable annotation stored with them, which is critical to quality and downstream analyses.

This article was first published at http://www.bio-itworld.com/2017/10/02/understand-what-i-mean-not-what-i-say.aspx