Building for the baton pass

What I learned from building an ETL pipeline at the U.S. Census Bureau

Wren McQueary
Coding it Forward

Photo by Zach Lucero on Unsplash

“How tricky can it be?” I wondered as I planned an approach for the beginning of my Census Bureau fellowship with Coding it Forward. A few days later, the question returned as I drafted a flowchart. Later, as I opened up PyCharm and typed out the first line of code, I thought about it again. But oh, how the most innocuous challenges can humble!

This summer, I’ve been working with the Business Frame Team at the U.S. Census Bureau on a data engineering project to format and extract data from PDF files. Specifically, the files are Franchise Disclosure Documents (FDDs), which contain information about a franchised enterprise, such as McDonald’s, Carvel, or Hertz. The documents are often 300–500 pages long, and roughly 15–150 pages are taken up by a roster table that lists every physical store under the brand. My project has been to extract those tables and convert them to formatted database entries. Once completed, the project will help automate data ingest into the Census Bureau’s newly created Business Frame: a collection of rich, harmonized business data from multiple sources linked back to the Business Register, a comprehensive list of businesses in the U.S. The Frames Program improves the Census Bureau’s ability to explore and compile data into new measurements for external policymakers and researchers.

Franchise data comes in PDFs and needs to be parsed into tables formatted for the Census Bureau’s Business Register. The franchise disclosure document in this picture is already publicly available.

Ask questions early and often.

I consider myself socially anxious, but I’ve learned from experience the importance of asking questions at work, especially when joining a new team or project. I made it a point to ask lots of questions at the start of the fellowship, and that choice remains one of my proudest. I got to know my team better and built personal connections, learned more about the project’s context and motivation, and made a more complete plan for my project. By asking questions early and often, I set up the conditions to produce better work.

The competitiveness built into applying to schools and jobs often trains us to over-compare ourselves to others. I think most of my peers can relate to the anxiety of wondering whether you’re enough during these periods, shutting down, and hiding ignorance by not asking questions. Fortunately, it’s been safe to let go of that competitiveness in all the public interest spaces I’ve inhabited so far. People stick around in these spaces because they’re compassionate and seek to nurture others. It’s also beneficial for newcomers to adopt this outlook because it encourages learning from experienced folks. And if your anxious side worries that you’re going to reveal an inadequacy by asking so many questions, remember that being inquisitive also signals a personality fit. The conscientiousness that drives a person to ask about the details and care about the “why” is so important in public interest work.

Be judicious about when to get stuck in the weeds.

Having asked enough questions to develop a clear vision of what my tool needed to do, I set out to build it in Python. I attempted to use two main packages: pdfminer.six to parse nontabular content from PDFs and camelot to extract tabular content. My main challenges would be FDDs’ lack of standardization and the fact that PDF files, unlike HTML or other markup documents, carry almost no internal structure marking where a table or row begins and ends. I knew I needed my code to be robust enough to handle a wide range of FDD layouts with little structural information, so generalizability was at the front of my mind as I began developing the tool. But as I continued working on the script, I realized that my preemptive focus on generalizability was delaying a proof of concept. I was also getting stuck in the weeds, focusing on solving issues I had created rather than learning more about the context my tool would operate in and the purpose it needed to serve.
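To make the division of labor between the two packages concrete, here is a minimal sketch, not the actual pipeline. It assumes pdfminer.six is used to read each page's text and locate the roster pages, and camelot is then pointed at only those pages. The `ROSTER_HINT` pattern and all function names are hypothetical, chosen for illustration; real FDDs label their rosters in many different ways.

```python
import re
from typing import Iterable, List

# Hypothetical heuristic (an assumption, not from the project): treat any page
# that mentions a "list of franchisees/outlets" as part of the roster.
ROSTER_HINT = re.compile(r"list of (franchisees|outlets)", re.IGNORECASE)


def find_roster_pages(page_texts: Iterable[str]) -> List[int]:
    """Return 1-based page numbers whose text matches the roster hint."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if ROSTER_HINT.search(text)
    ]


def to_page_spec(pages: List[int]) -> str:
    """Collapse page numbers into a comma-separated spec such as '3-5,9'."""
    spans: List[List[int]] = []
    for page in sorted(set(pages)):
        if spans and page == spans[-1][1] + 1:
            spans[-1][1] = page  # extend the current run of consecutive pages
        else:
            spans.append([page, page])  # start a new run
    return ",".join(
        str(first) if first == last else f"{first}-{last}"
        for first, last in spans
    )


def extract_roster_tables(pdf_path: str):
    """Find roster pages with pdfminer.six, then extract their tables with camelot."""
    # Imported here so the helpers above work without the PDF libraries installed.
    import camelot
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    page_texts = (
        "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )
        for page_layout in extract_pages(pdf_path)
    )
    pages = find_roster_pages(page_texts)
    if not pages:
        return []
    # "stream" is camelot's flavor for tables without ruled cell borders;
    # whether it suits a given FDD is itself an assumption to check per layout.
    tables = camelot.read_pdf(pdf_path, pages=to_page_spec(pages), flavor="stream")
    return [table.df for table in tables]
```

Narrowing camelot to the roster pages keeps it from running table detection on hundreds of pages of contract text, which is one plausible way to keep a first prototype fast and simple.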

A flowchart highlighting the many circumstances that could break the pipeline at one point in my development process.

I chose to pivot, building a simple, brittle tool that worked end-to-end on just a small percentage of FDDs. By building a prototype early, I could illustrate my project and solicit meaningful feedback from the rest of the team. That gave me both a way to improve the existing tool and a functional starting point to which I could add robustness. I then ran the tool on different FDD layouts, adding functionality for one layout at a time.

Don’t lose sight of the contexts surrounding your work. Don’t add generality without knowing what you’re generalizing for. Instead, use test cases to determine what needs to be generalized.
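One way to picture "use test cases to determine what needs to be generalized" is a parser that starts with a single layout and gains a new handler only when a real sample fails. The sketch below is illustrative only: the two row layouts, the field names, and the `layout` registry are all hypothetical, not taken from actual FDDs or from my tool.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]
Parser = Callable[[str], Optional[Row]]

# Each parser handles one known layout; new ones are added only when a
# sample row fails to parse.
PARSERS: List[Parser] = []


def layout(parser: Parser) -> Parser:
    """Register a layout-specific parser (hypothetical helper)."""
    PARSERS.append(parser)
    return parser


@layout
def comma_layout(line: str) -> Optional[Row]:
    """Rows like 'Name, Street, City, ST 12345'."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 4:
        return None
    name, street, city, state_zip = parts
    state_and_zip = state_zip.split()
    if len(state_and_zip) != 2:
        return None
    state, zip_code = state_and_zip
    return {"name": name, "street": street, "city": city,
            "state": state, "zip": zip_code}


@layout
def pipe_layout(line: str) -> Optional[Row]:
    """Rows like 'Name | Street | City | ST | 12345' (added after a sample failed)."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 5:
        return None
    return dict(zip(("name", "street", "city", "state", "zip"), parts))


def parse_row(line: str) -> Optional[Row]:
    """Try each known layout in turn; None means a new test case was found."""
    for parser in PARSERS:
        row = parser(line)
        if row is not None:
            return row
    return None
```

A row that comes back as `None` is not a crash but a signal: it becomes the next test case, and the handler written for it is generality that a real document actually asked for.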

Be kind to yourself.

As the ten weeks of my fellowship neared their end, it became clear that the approach I had spent most of the fellowship developing wasn’t panning out. The structural gamut of FDDs was too vast for my approach to handle reliably. I found myself anxious and frustrated. Impostor syndrome reared its head, and I grew slower and less enthusiastic about applying for jobs. I socially withdrew from the CIF community I had participated in so enthusiastically. I started to ruminate about all the time I had spent — seemingly wasted — pursuing that approach. And I started overworking myself, fixating on making up for lost time. It turned into a self-imposed punishment — one that needed to stop.

I realized that, as often happens when working on a project whose mission you care about, I hadn’t been tending to my emotional boundaries surrounding work. I reminded myself that, yes, my tool may be unsatisfactory now, but I’m a single person working as hard as I reasonably can, and this is a large challenge that can’t be solved in just ten weeks. I stopped kicking myself for being so shortsighted and reminded myself that hindsight is always clearer. And the underwhelming distance I’d covered is still okay. Perfect is the enemy of good.

I remembered how during Coding it Forward’s welcome ceremony, Federal CIO Advisor Noreen Hecmanczuk and CFPB Chief Technologist Erie Meyer emphasized the baton pass: that our ten weeks of work rarely constitute a complete project from start to finish, but rather one chapter in the story of a longer project. My team at the Census Bureau graciously offered for me to stay with them for another ten weeks, and I plan to use that time to build the tool into something I’m proud of, but even then, I accept that I probably won’t see the end of its development.

Maintain healthy emotional boundaries around work. Perfect is the enemy of good. You are enough, and your contribution is enough. Build for the baton pass.

Thank you to Jessica Wellwood, Erica Marquette, Mike Kornbau, Christine Cai, Michael Booker, Lori Zehr, Jess Pearson, and everyone else on the Business Frame team for creating a welcoming space, helping me learn and contribute, and being the best team a gal could ask for. Thanks to Marcelle Goggins, Rachel Dodell, Ariana Soto, and Yuyang Zhong for guiding me through the world of job applications for public interest tech. Thank you to Sarah Bier for reminding me that I’m enough.

Wren McQueary is a data scientist and software engineer with an M.S. in Computer Science from George Mason University. This summer, she worked at the U.S. Census Bureau on the Business Frame Team, building an end-to-end data pipeline from scratch in Python to convert millions of pages of minimally structured PDF files into formatted database entries, to better inform policymaking and research, enhance data discoverability, and reduce false matches.
