Navigating uncharted territories

The unexpected yet much-needed emergence of data science at the National Institutes of Health

Yuyang Zhong
Coding it Forward

--

This article is co-authored by Yuyang Zhong & William Huang.

Aerial view of the Clinical Center, NIH Campus, Bethesda, MD | Credit: National Institutes of Health

With few expectations of the summer ahead of us, William and I entered the Civic Digital Fellowship in its second year as a fully-remote program. With the pandemic still in full swing, I graduated from UC Berkeley with a degree in psychology and data science, while William wrapped up his first year at UCLA studying electrical engineering. Our paths crossed when we started as Civic Digital Fellows with the Office of AIDS Research (OAR) at the National Institutes of Health (NIH), the central coordinating office for all HIV/AIDS-related research across the NIH.

As a research granting agency, the NIH funds both internal and external research projects advancing science and innovation in all areas of health. Funds are designated for a particular research focus; in our case, that focus was HIV/AIDS research, and the funds are referred to at NIH as “AIDS dollars.” AIDS dollars carry a specific grant review protocol that requires the manual review and categorization of research by area of emphasis. This is a labor-intensive process in which research scientists must manually classify and encode thousands of grants so that a group of experts may review each grant.

This process is where William and I stepped in. We were tasked with creating predictive models that automatically tag the correct research codes based on the grant’s content. Throughout the summer, we iterated through various natural language processing techniques and arrived at Doc2Vec, a document embedding model that transforms each document into a high-dimensional vector, which can then be compared against the vectors of other documents using simple linear algebra. We trained our Doc2Vec model on ten years’ worth of grants and used the resulting vectors as the basis of our classification models, approaching the problem with both supervised and unsupervised methods to predict labels for each grant. We finished with a final test accuracy of close to 80%: roughly four out of five grants were labeled correctly, with the potential to save thousands of hours of manual labor!
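To make the "simple linear algebra" concrete, here is a minimal sketch of one common way to compare document vectors: cosine similarity, followed by a nearest-neighbor lookup that borrows the label of the most similar labeled grant. This is an illustration under assumptions, not the model we built: the three-dimensional vectors, the code names, and the `predict_code` helper are all hypothetical, and real Doc2Vec vectors have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_code(new_vector, labeled_vectors):
    """Return the code of the labeled grant most similar to new_vector."""
    best_code, _ = max(
        ((code, cosine_similarity(new_vector, vec)) for code, vec in labeled_vectors),
        key=lambda pair: pair[1],
    )
    return best_code


# Hypothetical toy vectors standing in for Doc2Vec output.
labeled_grants = [
    ("basic", [0.9, 0.1, 0.0]),
    ("behavioral", [0.1, 0.8, 0.3]),
    ("applied", [0.0, 0.3, 0.9]),
]

new_grant = [0.8, 0.2, 0.1]
print(predict_code(new_grant, labeled_grants))  # basic
```

A single nearest neighbor is about the simplest supervised approach there is; it is shown here only because it makes the role of the vectors easy to see.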

While it was fascinating to explore the outputs of our model from the perspective of data science, it was imperative that we could easily communicate our findings to the brilliant scientists and researchers across the OAR so that anyone could leverage and explore the model. In that spirit, we created visualizations of the topic clusters from our corpus of grants. Using a dimensionality reduction technique called t-distributed Stochastic Neighbor Embedding (t-SNE), we created the following scatter plot that reveals how grants relate to one another. We then used the scatter plot to develop an interactive web application that allowed anyone to filter and explore our findings easily. Some notable distinctions to point out:

  • The horizontal axis (the first t-SNE dimension) shows a clear distinction between basic research (orange and green dots) vs. behavioral and applied research (purple dots).
  • Specific topics and modalities (e.g., novel gene therapy for HIV, conference grants) were recognized by the model and thus placed near each other. They appear as smaller, distinct clusters across the plot.
  • Repeated trials and experiments were placed close together, making them easy to identify (they appear as overlapping dots).
Visualization of the t-SNE output from our Doc2Vec model of HIV/AIDS-related grants. The colors represent the different research areas of emphasis from the original grant metadata.
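For readers curious about what the t-SNE step looks like in code, the following is a rough sketch using scikit-learn's `TSNE` class. It is not our actual pipeline: the vectors are synthetic stand-ins for document vectors, and the group names, dimensions, and `perplexity` value are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for document vectors: 3 groups of 20 points in 50 dimensions.
centers = rng.normal(scale=5.0, size=(3, 50))
vectors = np.vstack([center + rng.normal(size=(20, 50)) for center in centers])

# Hypothetical area-of-emphasis labels, used only to color the points.
areas = np.repeat(["area A", "area B", "area C"], 20)

# Project to 2 dimensions; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
coords = tsne.fit_transform(vectors)

print(coords.shape)  # (60, 2)
```

Each row of `coords` is an (x, y) position for one document, which can be passed to any plotting library and colored by `areas` to produce a scatter plot like the one above.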

Beyond these distinctions, all of us (Fellows, staff, and leadership alike) were fascinated by the power and potential of natural language processing. Not only did this model help power predictive models in a later part of our project, but it also contextualized the data science we were doing by transforming sheer numbers into visual, comprehensible graphs that non-technical audiences could easily digest.

Our short time working in the government offered us a glimpse of the impact we could make as young technologists and left us incredibly proud of what we accomplished together. This work, however, was not without its challenges.

Data science is a relatively new discipline, and this uncharted territory makes it challenging for existing government structures to implement data strategies effectively. During our experience, basic tools like version control and standardized environment setups, which have come to be expected in software and data science development, were completely absent. Moreover, many of the databases were decentralized and compartmentalized, with each one containing partial information that we had to work to join together. Disconnects arose easily among offices, institutes, and centers tackling similar problems within the same agency.

These minor challenges did not stop us from accomplishing our goals, but they do represent room for growth in government. To us, this means that working in civic technology spaces, using data science for the public good, is a viable option for any tech professional, whether you just graduated from college or you are pivoting after a decorated career in the industry. The government needs us to continue to improve and innovate. Civic technology presents us with an opportunity that intertwines technology and social good, and our fellowship experience is a testament to that. There’s something to be hopeful about: a new generation of talented technologists from across the globe, eager to create meaningful social change in the uncertain modern era.

Yuyang Zhong and William Huang were 2021 Civic Digital Fellows at the National Institutes of Health. To learn more about Coding it Forward and the Civic Digital Fellowship, visit our website.

--

Program Manager at Coding it Forward | Still learning how psychology and data science work. Boba Connoisseur.