
Research Projects.


Throughout my undergrad and post-bacc experience, my research has had two major themes: AI accountability and socio-political impact. I believe that the algorithms that govern our lives, often unseen, should be accessible to and understood by the public. Whether or not they understand the inner workings of such algorithms, citizens have a right to know how their data may be used and deserve access to these tools to perform their own analysis.

 

Consent in Crisis: The Rapid Decline of the AI Data Commons

2024 | Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund

Abstract

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.
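As a rough illustration of what checking a robots.txt restriction can look like, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the audit pipeline from the paper; the sample file contents, the `example.com` URL, and the helper name `is_fully_restricted` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_fully_restricted(robots_txt: str, user_agent: str) -> bool:
    """Return True if `user_agent` is not allowed to fetch the site root.

    This is a simplification: a real audit would also need to handle
    partial restrictions, crawl delays, and changes over time.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder domain; only the path is matched.
    return not parser.can_fetch(user_agent, "https://example.com/")


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, "fully restricted:", is_fully_restricted(SAMPLE_ROBOTS_TXT, agent))
```

In this sample file the AI-specific crawler is blocked from the whole site while other agents are only kept out of one directory, which is the kind of agent-specific difference the abstract describes.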

Measuring Western-Centricity Within AI Text Data

In progress

Overview

Natural Language Processing (NLP) data’s bias towards Western, educated, and English representation is widely (and correctly) known, but there isn’t a good reference that rigorously quantifies this skew, especially not one that is systematic and addresses popular new pretraining and finetuning resources.

In response to the Data Provenance Initiative, I have been collaborating on this project with members of the MIT Media Lab within the Center for Constructive Communication and Human Dynamics. 

Since most datasets are evaluated by the improvements they bring to downstream applications – not on the actual contents of the data itself – we’re in a unique position to examine the state of Western-centricity and to address the problem from the ground up.

Building a Context-Aware Question Answering System to Automate Interview Sense-Making

2023 | Campbell Lund

Overview

How can we use Large Language Models (LLMs) to analyze text at scale? In this project, I utilized context-aware text embeddings to build a document storage and retrieval system with interview transcripts. I was able to retrieve the interviewee's answers to questions despite the organic nature of how a question is phrased by the interviewer. The project was conducted as a part of my Research Fellowship at Wellesley and in relation to establishing a technical Center for Ethics and Equity at the college. Due to data privacy, I cannot share source code. 
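Since the source code cannot be shared, the following is only a generic sketch of embedding-based retrieval, not the project's implementation. It assumes each stored answer already has a vector from some embedding model (the model, the cosine-similarity ranking, the function names, and the toy vectors and texts are all assumptions for illustration).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=1):
    """Return the k stored texts whose vectors are closest to the query.

    `store` is a list of (text, vector) pairs. In a real system the
    vectors would come from an embedding model; here they are supplied
    by the caller.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical placeholder texts and hand-written toy vectors.
    store = [
        ("answer about funding", [0.9, 0.1, 0.0]),
        ("answer about housing", [0.0, 0.2, 0.9]),
    ]
    # A differently phrased question is assumed to embed near the
    # matching answer, which is what makes retrieval robust to wording.
    print(retrieve([0.8, 0.2, 0.1], store, k=1))
```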

Surveying LLMs to Improve NLP Techniques

2023 | Campbell Lund

Overview

Large Language Models (LLMs) have incredible power to improve traditional Natural Language Processing (NLP) tasks. In my research position at the Wellesley Lab for Ethics and Equitable Digital Technology, I explored how to utilize public models for topic modeling and classification tasks. I've created accessible tutorials for getting started with the OpenAI API and Google Cloud Vertex AI, which are linked below.

Political Redistricting and Assessing Fairness

2023 | Campbell Lund, Neel Dhulipala

Abstract

This paper explores how algorithms can be designed and implemented to solve the problem of political redistricting. Specifically, we examine and explain the inner workings of Brian Olson’s BDistricting algorithm applied to the state of Pennsylvania (PA). Using census data, we compare the output of the BDistricting algorithm with the current Congressional Districts of PA. We scored them based on two fairness metrics: Efficiency Gap and the number of Minority Opportunity Districts, which are defined in this paper. We found that the BDistricting algorithm performed better at representing minorities but had a higher Efficiency Gap than the current map. A main takeaway from our study is the importance of community context in interpreting the results of redistricting algorithms.

The Impact of TikTok’s Engagement Algorithm on Political Polarization

2018 | Campbell Lund, Shirui Zhong

Abstract

Political polarization has increased dramatically in the United States throughout the last decade [1]. Similarly, social media applications have seen an increase in the percentage of users who turn to the platform as a regular source of news [4]. Naturally, as social media platforms grow, the algorithms that suggest content become more efficient at detecting what users are likely to interact with. In this paper, we aim to study how engagement algorithms play a role in political polarization through the creation of echo chambers. Specifically, we will focus on the speed and percentage that TikTok curates one’s feed to contain political content based on user interest. We will analyze these differences for three user cases: a liberal user, a conservative user, and an independent user. We hope to measure how political affiliation impacts the rate at which echo chambers form and to classify which side of the political spectrum – if any – TikTok falls on. Our findings are consistent with user demographics data that TikTok is a “liberal app” and we observed a stronger echo chamber effect for a conservative user case.
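For readers unfamiliar with the Efficiency Gap named in the redistricting abstract, here is a minimal sketch of one commonly used two-party formulation: the difference in "wasted" votes divided by total votes. The paper gives its own definition, which may differ in details such as the winning threshold; the function name and the vote counts below are hypothetical.

```python
def efficiency_gap(districts):
    """Compute a two-party efficiency gap from per-district vote counts.

    `districts` is a list of (votes_a, votes_b) pairs. A losing party
    wastes all of its votes in a district; the winner wastes every vote
    beyond half of the district total (an assumed threshold; conventions
    vary). Returns (wasted_a - wasted_b) / total_votes, so a positive
    value means party A wasted more votes than party B.
    """
    wasted_a = 0.0
    wasted_b = 0.0
    total_votes = 0
    for votes_a, votes_b in districts:
        district_total = votes_a + votes_b
        needed = district_total / 2
        if votes_a > votes_b:
            wasted_a += votes_a - needed
            wasted_b += votes_b
        elif votes_b > votes_a:
            wasted_b += votes_b - needed
            wasted_a += votes_a
        else:
            # Tie: both sides are counted equally, so the gap is unaffected.
            wasted_a += votes_a
            wasted_b += votes_b
        total_votes += district_total
    if total_votes == 0:
        raise ValueError("no votes recorded")
    return (wasted_a - wasted_b) / total_votes


if __name__ == "__main__":
    # Hypothetical three-district map, invented for illustration.
    sample_map = [(70, 30), (40, 60), (40, 60)]
    print(f"Efficiency gap: {efficiency_gap(sample_map):.3f}")
```

In the sample map, party A wins one district by a wide margin and narrowly loses the other two, so it wastes more votes overall and the gap comes out positive.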

Creative Projects.

I enjoy building things both digital and tangible. In addition to the often theoretical, technical work I produce, I flex my creativity through art and craft. I have experience in ceramics, sculpture, printmaking, and digital design. View my curated portfolio or browse the assorted images below.
