b2o: boundary 2 online

Tag: data science

  • Alexander R. Galloway — Big Bro (Review of Wendy Hui Kyong Chun, Discriminating Data: Correlation, Neighborhoods, and the New Politics of Recognition)

    a review of Wendy Hui Kyong Chun, Discriminating Data: Correlation, Neighborhoods, and the New Politics of Recognition (MIT Press, 2021)

    by Alexander R. Galloway

    I remember snickering when Chris Anderson announced “The End of Theory” in 2008. Writing in Wired magazine, Anderson claimed that the structure of knowledge had inverted. It wasn’t that models and principles revealed the facts of the world, but the reverse, that the data of the world spoke their truth unassisted. Given that data were already correlated, Anderson argued, what mattered was to extract existing structures of meaning, not to pursue some deeper cause. Anderson’s simple conclusion was that “correlation supersedes causation…correlation is enough.”

    This hypothesis — that correlation is enough — is the thorny little nexus at the heart of Wendy Chun’s new book, Discriminating Data. Chun’s topic is data analytics, a hard target that she tackles with technical sophistication and rhetorical flair. Focusing on data-driven tech like social media, search, consumer tracking, AI, and many other things, her task is to exhume the prehistory of correlation, and to show that the new epistemology of correlation is not liberating at all, but instead a kind of curse recalling the worst ghosts of the modern age. As Chun concludes, even amid the precarious fluidity of hyper-capitalism, power operates through likeness, similarity, and correlated identity.

    While interleaved with a number of divergent polemics throughout, the book focuses on four main themes: correlation, discrimination, authentication, and recognition. Chun deals with these four as general problems in society and culture, but also, interestingly, as specific scientific techniques. For instance, correlation has a particular mathematical meaning, as well as a philosophical one. Discrimination is a social pathology but it’s also integral to discrete rationality. I appreciated Chun’s attention to details large and small; she’s writing about big ideas — essence, identity, love and hate, what does it mean to live together? — but she’s also engaging directly with statistics, probability, clustering algorithms, and all the minutiae of data science.

    In crude terms, Chun rejects the — how best to call it — the “anarcho-materialist” turn in theory, typified by someone like Gilles Deleuze, where disciplinary power gave way to distributed rhizomes, schizophrenic subjects, and irrepressible lines of flight. Chun’s theory of power isn’t so much about tessellated tapestries of desiring machines as it is the more strictly structuralist concerns of norm and discipline, sovereign and subject, dominant and subdominant. Big tech is the mechanism through which power operates today, Chun argues. And today’s power is racist, misogynist, repressive, and exclusionary. Power doesn’t incite desire so much as stifle and discipline it. In other words, George Orwell’s old grey-state villain, Big Brother, never vanished. He simply migrated into a new figure, Big Bro, embodied by tech billionaires like Mark Zuckerberg or Larry Page.

    But what are the origins of this new kind of data-driven power? The reader learns that correlation and homophily, or “the notion that birds of a feather naturally flock together” (23), not only subtend contemporary social media platforms like Facebook, but were in fact originally developed by eugenicists like Francis Galton and Karl Pearson. “British eugenicists developed correlation and linear regression” (59), Chun notes dryly, before reminding us that these two techniques are at the core of today’s data science. “When correlation works, it does so by making the present and future coincide with a highly curated past” (52). Or as she puts it insightfully elsewhere, data science doesn’t so much anticipate the future as predict the past.
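Chun’s point that correlation has a precise mathematical meaning, not just a philosophical one, can be made concrete. Below is a minimal sketch of Pearson’s product-moment correlation coefficient, the statistic Pearson formalized; the data series are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    # Pearson's product-moment correlation: the covariance of the two
    # variables divided by the product of their standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two perfectly correlated (invented) series yield r = 1.
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # → 1.0
```

The coefficient says nothing about causes; it only scores how tightly two already-collected series move together, which is exactly the epistemological wager Chun is interrogating.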

    If correlation (pairing two or more pieces of data) is the first step of this new epistemological regime, it is quickly followed by some additional steps. After correlation comes discrimination, where correlated data are separated from other data (and indeed internally separated from themselves). This entails the introduction of a norm. Discriminated data are not simply data that have been paired, but measurements plotted along an axis of comparison. One data point may fall within a normal distribution, while another strays outside the norm within a zone of anomaly. Here Chun focuses on “homophily” (love of the same), writing that homophily “introduces normativity within a supposedly nonnormative system” (96).
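The normative cut described above can be illustrated with the simplest discriminating statistic, the z-score, which plots each measurement along an axis of comparison to the mean. The readings and the 1.5 threshold here are invented for illustration.

```python
import statistics

def z_scores(values):
    # Each value's distance from the mean, in units of standard
    # deviation: the axis along which "normal" and "anomalous" split.
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

readings = [10.0, 10.2, 9.9, 10.1, 15.0]
for r, z in zip(readings, z_scores(readings)):
    label = "anomaly" if abs(z) > 1.5 else "normal"
    print(f"{r}: z = {z:+.2f} ({label})")
```

Nothing in the data marks the last reading as deviant; the norm is introduced by the statistic itself, which is Chun’s point about homophily introducing normativity into a supposedly nonnormative system.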

    The third and fourth moments in Chun’s structural condition, tagged as “authenticity” and “recognition,” complete the narrative. Once groups are defined via discrimination, they are authenticated as a positive group identity, then ultimately recognized, or we could say self-recognized, by reversing the outward-facing discriminatory force into an inward-facing act of identification. It’s a complex libidinal economy that Chun patiently elaborates over four long chapters, linking these structural moments to specific technologies and techniques such as Bayes’ theorem, clustering algorithms, and facial recognition technology.

    A number of potential paths emerge in the wake of Chun’s work on correlation, which we will briefly mention in passing. One path would be toward Shane Denson’s recent volume, Discorrelated Images, on the loss of correlated experience in media aesthetics. Another would be to collide Chun’s critique of correlation in data science with Quentin Meillassoux’s critique of correlation in philosophy, notwithstanding the significant differences between their two projects.

    Correlation, discrimination, authentication, and recognition are the manifest contents of the book as it unfolds page by page. At the same time Chun puts forward a few meta arguments that span the text as a whole. The first is about difference and the second is about history. In both, Chun reveals herself as a metaphysician and moralist of the highest order.

    First Chun picks up a refrain familiar to feminism and anti-racist theory, that of erasure, forgetting, and ignorance. Marginalized people are erased from the archive; women are silenced; a subject’s embodiment is ignored. Chun offers an appealing catch phrase for this operation, “hopeful ignorance.” Many people in power hope that by ignoring difference they can overcome it. Or as Chun puts it, they “assume that the best way to fight abuse and oppression is by ignoring difference and discrimination” (2). Indeed this posture has been central to political liberalism for a long time, in, for instance, John Rawls’ derivation of justice via a “veil of ignorance.” For Chun the attempt to find an unmarked category of subjectivity — through that frequently contested pronoun “we” — will perforce erase and exclude those structurally denied access to the universal. “[John Perry] Barlow’s ‘we’ erased so many people,” Chun notes in dismay. “McLuhan’s ‘we’ excludes most of humanity” (9, 15). This is the primary crime for Chun, forgetting or ignoring the racialized and gendered body. (In her last book, Updating to Remain the Same, Chun reprinted a parody of a well-known New Yorker cartoon bearing the caption “On the Internet, nobody knows you’re a dog.” The posture of ignorance, of “nobody knowing,” was thoroughly critiqued by Chun in that book, even as it continues to be defended by liberals.)

    Yet if the first crime against difference is to forget the mark, the second crime is to enforce it, to mince and chop people into segregated groups. After all, data is designed to discriminate, as Chun takes the better part of her book to elaborate. Data systems are engines of difference, and it’s no coincidence that Charles Babbage called his early calculating machine a “Difference Engine.” Data is designed to segregate, to cluster, to group, to split and mark people into micro identities. We might label this “bad” difference. Bad difference is when the naturally occurring multiplicity of the world is canalized into clans and cliques, leveraged for the machinations of power rather than the real experience of people.

    To complete the triad, Chun has proposed a kind of “good” difference. For Chun authentic life is rooted in difference, often found through marginalized experience. Her muse is “a world that resonates with and in difference” (3). She writes about “the needs and concerns of black women” (49). She attends to “those whom the archive seeks to forget” (237). Good difference is intersectional. Good difference attends to identity politics and the complexities of collective experience.

    Bad, bad, good — this is a triad, but not a dialectical one. Begin with 1) the bad tech posture of ignoring difference; followed by 2) the worse tech posture of specifying difference in granular detail; contrasted with 3) a good life that “resonates with and in difference.” I say “not dialectical” because the triad documents difference changing position rather than the position of difference changing (to paraphrase Catherine Malabou from her book on Changing Difference). Is bad difference resolved by good difference? How to tell the difference? For this reason I suggest we consider Discriminating Data as a moral tale — although I suspect Chun would balk at that adjective — because everything hinges on a difference between the good and the bad.

    Chun’s argument about good and bad difference is related to an argument about history, revealed through what she terms the “Transgressive Hypothesis.” I was captivated by this section of the book. It connects to a number of debates happening today in both theory and culture at large. Her argument about history has two distinct waves, and, following the contradictory convolutions of history, the second wave reverses and inverts the first.

    Loosely inspired by Michel Foucault’s Repressive Hypothesis, Chun’s Transgressive Hypothesis initially describes a shift in society and culture roughly coinciding with the Baby Boom generation in the late Twentieth Century. Let’s call it the 1968 mindset. Reacting to the oppressions of patriarchy, the grey-state threats of centralized bureaucracy, and the totalitarian menace of “Nazi eugenics and Stalinism,” liberation was found through “‘authentic transgression’” via “individualism and rebellion” (76). This was the time of the alternative, of the outsider, of the nonconformist, of the anti-authoritarian, the time of “thinking different.” Here being “alt” meant being left, albeit a new kind of left.

    Chun summons a familiar reference to make her point: the Apple Macintosh advertisement from 1984 directed by Ridley Scott, in which a scary Big Brother is dethroned by a colorful lady jogger brandishing a sledgehammer. “Resist, resist, resist,” is how Chun puts the mantra. “To transgress…was to be free” (76). Join the resistance, unplug, blow your mind on red pills. Indeed the existential choice from The Matrix — blue pill for a life of slavery mollified by ignorance, red pill for enlightenment and militancy tempered by mortal danger — acts as a refrain throughout Chun’s book. In sum the Transgressive Hypothesis “equated democracy with nonnormative structures and behaviors” (76). To live a good life was to transgress.

    But this all changed in 1984, or thereabouts. Chun describes a “reverse hegemony” — a lovely phrase that she uses only twice — where “complaints against the ‘mainstream’ have become ‘mainstreamed’” (242). Power operates through reverse hegemony, she claims, “The point is never to be a ‘normie’ even as you form a norm” (34). These are the consequences of the rise of neoliberalism, fake corporate multiculturalism, Ronald Reagan and Margaret Thatcher but even more so Bill Clinton and Tony Blair. Think postfordism and postmodernism. Think long tails and the multiplicity of the digital economy. Think woke-washing at the CIA and Spike Lee shilling cryptocurrency. Think Hypernormalization, New Spirit of Capitalism, Theory of the Young Girl, To Live and Think Like Pigs. Complaints against the mainstream have become mainstreamed. And if power today has shifted “left,” then — Reverse Hegemony Brain go brrr — resistance to power shifts “right.” A generation ago the Q Shaman would have been a left-wing nut nattering about the Kennedy assassination. But today he’s a right-wing nut (alas still nattering about the Kennedy assassination).

    “Red pill toxicity” (29) is how Chun characterizes the responses to this new topsy-turvy world of reverse hegemony. (To be sure, she’s only the latest critic weighing in on the history of the present; other well-known accounts include Angela Nagle’s 2017 book Kill All Normies, and Mark Fisher’s notorious 2013 essay “Exiting the Vampire Castle.”) And if libs, hippies, and anarchists had become the new dominant, the election of Donald Trump showed that “populism, paranoia, polarization” (77) could also reemerge as a kind of throwback to the worst political ideologies of the Twentieth Century. With Trump the revolutions of history — ironically, unstoppably — return to where they began, in “the totalitarian world view” (77).

    In other words these self-styled rebels never actually disrupted anything, according to Chun. At best they used disruption as a kind of ideological distraction for the same kinds of disciplinary management structures that have existed since time immemorial. And if Foucault showed that nineteenth-century repression also entailed an incitement to discourse, Chun describes how twentieth-century transgression also entailed a novel form of management. Before it was “you thought you were repressed but in fact you’re endlessly sublating and expressing.” Now it’s “you thought you were a rebel but disruption is a standard tactic of the Professional Managerial Class.” Or as Jacques Lacan said in response to some young agitators in his seminar, vous voulez un maître, vous l’aurez. Slavoj Žižek’s rendering, slightly embellished, best captures the gist: “As hysterics, you demand a new master. You will get it!”

    I doubt Chun would embrace the word “hysteric,” a term indelibly marked by misogyny, but I wish she would, since hysteria is crucial to her Transgressive Hypothesis. In psychoanalysis, the hysteric is the one who refuses authority, endlessly and irrationally. And bless them for that; we need more hysterics in these dark times. Yet the lesson from Lacan and Žižek is not so much that the hysteric will conjure up a new master out of thin air. In a certain sense, the lesson is the reverse, that the Big Other doesn’t exist, that Big Brother himself is a kind of hysteric, that power is the very power that refuses power.

    This position makes sense, but not completely. As a recovering Deleuzian, I am indelibly marked by a kind of antinomian political theory that defines power as already heterogeneous, unlawful, multiple, anarchic, and material. However I am also persuaded by Chun’s more classical posture, where power is a question of sovereign fiat, homogeneity, the central and the singular, the violence of the arche, which works through enclosure, normalization, and discipline. Faced with this type of power, Chun’s conclusion is, if I can compress a hefty book into a single writ, that difference will save us from normalization. In other words, while Chun is critical of the Transgressive Hypothesis, she ends up favoring the Big-Brother theory of power, where authentic alternatives escape repressive norms.

    I’ll admit it’s a seductive story. Who doesn’t want to believe in outsiders and heroes winning against oppressive villains? And the story is especially appropriate for the themes of Discriminating Data: data science of course entails norms and deviations; but also, in a less obvious way, data science inherits the old anxieties of skeptical empiricism, where the desire to make a general claim is always undercut by an inability to ground generality.

    Yet I suspect her political posture relies a bit too heavily on the first half of the Transgressive Hypothesis, the 1984 narrative of difference contra norm, even as she acknowledges the second half of the narrative where difference became a revanchist weapon for big tech (to say nothing of difference as a bona fide management style). This leads to some interesting inconsistencies. For instance Chun notes that Apple’s 1984 hammer thrower is a white woman disrupting an audience of white men. But she doesn’t say much else about her being a woman, or about the rainbow flag that ends the commercial. The Transgressive Hypothesis might be the quintessential tech bro narrative but it’s also the narrative of feminism, queerness, and the new left more generally. Chun avoids claiming that feminism failed; but she’s also savvy enough to avoid saying that it succeeded. And if Sadie Plant once wrote that “cybernetics is feminization,” for Chun it’s not so clear. According to Chun the cybernetic age of computers, data, and ubiquitous networks still orients around structures of normalization: masculine, white, straight, affluent and able-bodied. Resistant to such regimes of normativity, Chun must nevertheless invent a way to resist those who were resisting normativity.

    Regardless, for Chun the conclusion is clear: these hysterics got their new master. If not immediately they got it eventually, via the advent of Web 2.0 and the new kind of data-centric capitalism invented in the early 2000s. Correlation isn’t enough — and that’s the reason why. Correlation means the forming of a general relation, if only the most minimal generality of two paired data points. And, worse, correlation’s generality will always derive from past power and organization rather than from a reimagining of the present. Hence correlation for Chun is a type of structural pessimism, in that it will necessarily erase and exclude those denied access to the general relation.

    Writing with narrative poignancy and an attention to the ideological conditions of everyday life, Chun highlights alternative relations that could hopefully replace the pessimism of correlation. Such alternatives might take the form of a “potential history” or a “critical fabulation,” phrases borrowed from Ariella Azoulay and Saidiya Hartman, respectively. For Azoulay potential history means to “‘give an account of diverse worlds that persist’”; for Hartman, critical fabulation means “to see beyond numbers and sources” (79). Though a slim offering covering only a few pages, these references to Azoulay and Hartman nevertheless indicate an appealing alternative for Chun, and she ends her book where it began, with an eloquent call to acknowledge “a world that resonates with and in difference.”

    _____

    Alexander R. Galloway is a writer and computer programmer working on issues in philosophy, technology, and theories of mediation. Professor of Media, Culture, and Communication at New York University, he is author of several books and dozens of articles on digital media and critical theory, including Protocol: How Control Exists after Decentralization (MIT, 2006); Gaming: Essays in Algorithmic Culture (University of Minnesota, 2006); The Interface Effect (Polity, 2012); Laruelle: Against the Digital (University of Minnesota, 2014); and, most recently, Uncomputable: Play and Politics in the Long Digital Age (Verso, 2021).

    Back to the essay

     

  • Data and Desire in Academic Life

    a review of Erez Aiden and Jean-Baptiste Michel, Uncharted: Big Data as a Lens on Human Culture (Riverhead Books, reprint edition, 2014)
    by Benjamin Haber
    ~

    On a recent visit to San Francisco, I found myself trying to purchase groceries when my credit card was declined. As the cashier was telling me this news, and before I really had time to feel any particular way about it, my leg vibrated. I’d received a text: “Chase Fraud-Did you use card ending in 1234 for $100.40 at a grocery store on 07/01/2015? If YES reply 1, NO reply 2.” After replying “yes” (which was recognized even though I failed to follow instructions), I swiped my card again and was out the door with my food. Many have probably had a similar experience: most if not all credit card companies automatically track purchases for a variety of reasons, including preventing fraud, tracking illegal activity, and offering tailored financial products and services. As I walked out of the store, for a moment, I felt the power of “big data,” how real-time consumer information can be read as a predictor of a stolen card in less time than I had to consider why my card had been declined. It was a too rare moment of reflection on those networks of activity that modulate our life chances and capacities, mostly below and above our conscious awareness.

    And then I remembered: didn’t I buy my plane ticket with the points from that very credit card? And in fact, hadn’t I used that card on multiple occasions in San Francisco for purchases not much less than the amount my groceries cost? While the near-instantaneous text provided reassurance before I could consciously recognize my anxiety, the automatic card decline was likely not a sophisticated real-time data-enabled prescience, but a rather blunt instrument, flagging the transaction on the basis of two data points: distance from home and amount of purchase. In fact, there is plenty of evidence to suggest that the gap between data collection and processing, between metadata and content, and between the current reality of data and its speculative future is still quite large. While Target’s pregnancy-predicting algorithm was a journalistic sensation, the more mundane computational confusion that has Gmail constantly serving me advertisements for trade and business schools shows the striking gap between the possibilities of what is collected and the current landscape of computationally prodded behavior. The text from Chase, your Klout score, the vibration of your FitBit, or the probabilistic genetic information from 23andMe are all primarily affective investments in mobilizing a desire for data’s future promise. These companies and others are opening new ground for discourse via affect, creating networked infrastructures for modulating the body and social life.

    I was thinking about this while reading Uncharted: Big Data as a Lens on Human Culture, a love letter to the power and utility of algorithmic processing of the words in books. Though ostensibly about the Google Ngram Viewer, a neat if one-dimensional tool to visualize the word frequency of a portion of the books scanned by Google, Uncharted is also unquestionably involved in the mobilization of desire for quantification. Though about the academy rather than financialization, medicine, sports or any other field being “revolutionized” by big data, its breathless boosterism and obligatory cautions are emblematic of the emergent datafied spirit of capitalism, a celebratory “coming out” of the quantifying systems that constitute the emergent infrastructures of sociality.
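The quantity the Ngram Viewer plots is simple enough to sketch: the relative frequency of a word among all words published in a given year, graphed over time. Here is a toy version over an invented two-book-per-year corpus; the texts and years are hypothetical.

```python
from collections import Counter

# Hypothetical miniature corpus: year -> texts "published" that year.
corpus = {
    1900: ["the data of the world", "theory and data"],
    2000: ["big data data data", "the end of theory"],
}

def word_frequency(word, year):
    # Relative frequency of `word` among all tokens from that year,
    # the basic quantity an ngram viewer graphs over time.
    counts = Counter()
    for text in corpus[year]:
        counts.update(text.lower().split())
    return counts[word] / sum(counts.values())

print(word_frequency("data", 1900))  # → 0.25
print(word_frequency("data", 2000))  # → 0.375
```

The one-dimensionality is visible in the sketch itself: every question about a word’s history is collapsed into a single ratio per year, with the books themselves black-boxed behind the counts.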

    While published fairly recently, in 2013, Uncharted already feels dated in its strangely muted engagement with the variety of serious objections to sprawling corporate and state-run data systems in the post-Snowden, post-Target, post-Ashley Madison era (a list that will always be in need of updating). There is still the dazzlement about the sheer magnificent size of this potential new suitor—“If you wrote out all five zettabytes that humans produce every year by hand, you would reach the core of the Milky Way” (11)—all the more impressive when explicitly compared to the dusty old technologies of ink and paper. Authors Erez Aiden and Jean-Baptiste Michel are floating in a world of “simple and beautiful” formulas (45), “strange, fascinating and addictive” methods (22), producing “intriguing, perplexing and even fun” conclusions (119) in their drive to colonize the “uncharted continent” (76) that is the English language. The almost erotic desire for this bounty is made more explicit in their tongue-in-cheek characterization of their meetings with Google employees as an “irresistible… mating dance” (22):

    Scholars and scientists approach engineers, product managers, and even high-level executives about getting access to their companies’ data. Sometimes the initial conversation goes well. They go out for coffee. One thing leads to another, and a year later, a brand-new person enters the picture. Unfortunately this person is usually a lawyer. (22)

    There is a lot to unpack in these metaphors, the recasting of academic dependence on data systems designed and controlled by corporate entities as a sexy new opportunity for scholars and scientists. There are important conversations to be had about these circulations of quantified desire; about who gets access to this kind of data, the ethics of working with companies who have an existential interest in profit and shareholder return, and the cultural significance of wrapping business transactions in the language of heterosexual coupling. Here however I am mostly interested in the real allure that this passage and others speak to, and the attendant fear that mostly whispers, at least in a book written by Harvard PhDs with TED talks to give.

    For most academics in the social sciences and the humanities “big data” is a term more likely to get caught in the throat than inspire butterflies in the stomach. While Aiden and Michel certainly acknowledge that old-fashioned textual analysis (50) and theory (20) will have a place in this brave new world of charts and numbers, they provide a number of contrasts to suggest the relative poverty of even the most brilliant scholar in the face of big data. One hypothetical in particular, which is not directly answered but is strongly implied, spoke to my discipline specifically:

    Consider the following question: Which would help you more if your quest was to learn about contemporary human society—unfettered access to a leading university’s department of sociology, packed with experts on how societies function, or unfettered access to Facebook, a company whose goal is to help mediate human social relationships online? (12)

    The existential threat at the heart of this question was catalyzed for many people in Roger Burrows and Mike Savage’s 2007 “The Coming Crisis of Empirical Sociology,” an early canary singing the worry of what Nigel Thrift has called “knowing capitalism” (2005). Knowing capitalism speaks to the ways that capitalism has begun to take seriously the task of “thinking the everyday” (1) by embedding information technologies within “circuits of practice” (5). For Burrows and Savage these practices can and should be seen as a largely unrecognized world of sophisticated and profit-minded sociology that makes the quantitative tools of academics look like “a very poor instrument” in comparison (2007: 891).

    Indeed, as Burrows and Savage note, the now ubiquitous social survey is a technology invented by social scientists, folks who were once seen as strikingly innovative methodologists (888). Despite ever more sophisticated statistical treatments, however, the social survey, now over forty years old, remains the heart of social-scientific quantitative methodology in a radically changed context. And while declining response rates, a constraining nation-based framing, and competition from privately funded surveys have all decreased the efficacy of academic survey research (890), nothing has threatened the discipline like the embedded and “passive” collecting technologies that fuel big data. And with these methodological changes come profound epistemological ones: questions of how, when, why and what we know of the world. These methods are inspiring changing ideas of generalizability and new expectations around the temporality of research. Does it matter, for example, that studies have questioned the accuracy of the FitBit? The growing popularity of these devices suggests at the very least that sociologists should not count on empirical rigor to save them from irrelevance.

    As academia reorganizes around the speculative potential of digital technologies, there is an increasing pile of capital available to those academics able to translate between the discourses of data capitalism and a variety of disciplinary traditions. And the lure of this capital is perhaps strongest in the humanities, whose scholars have been disproportionately affected by state economic retrenchment on education spending that has increasingly prioritized quantitative, instrumental, and skill-based majors. The increasing urgency in the humanities to use bigger and faster tools is reflected in the surprisingly minimal hand-wringing over the politics of working with companies like Facebook, Twitter and Google. If there is trepidation in the Ngram project recounted in Uncharted, it is mostly coming from Google, whose lawyers and engineers have little incentive to bother themselves with the politically fraught, theory-driven, Institutional Review Board slow lane of academic production. The power imbalance of this courtship leaves those academics who decide to partner with these companies at the mercy of their epistemological priorities and, as Uncharted demonstrates, the cultural aesthetics of corporate tech.

    This is a vision of the public humanities refracted through the language of public relations and the “measurable outcomes” culture of the American technology industry. Uncharted has taken to heart the power of (re)branding to change the valence of your work: Aiden and Michel would like you to call their big data inflected historical research “culturomics” (22). In addition to a hopeful attempt to coin a buzzy new word about the digital, culturomics linguistically brings the humanities closer to the supposed precision, determination and quantifiability of economics. And lest you think this multivalent bringing of culture to capital—or rather the renegotiation of “the relationship between commerce and the ivory tower” (8)—is unseemly, Aiden and Michel provide an origin story to show how futile this separation has been.

    But the desire for written records has always accompanied economic activity, since transactions are meaningless unless you can clearly keep track of who owns what. As such, early human writing is dominated by wheeling and dealing: a menagerie of bets, chits, and contracts. Long before we had the writings of prophets, we had the writing of profits. (9)

    And no doubt this is true: culture is always already bound up with economy. But the full-throated embrace of culturomics is not a vision of interrogating and reimagining the relationship between economic systems, culture and everyday life; [1] rather it signals the acceptance of the idea of culture as transactional business model. While Google has long imagined itself as a company with a social mission, it is a publicly held company that will be punished by investors if it neglects its bottom line of increasing the engagement of eyeballs on advertisements. The Ngram Viewer does not make Google money, but it perhaps increases public support for their larger book-scanning initiative, which Google clearly sees as a valuable enough project to invest many years of labor and millions of dollars to defend in court.

    This vision of the humanities is transactionary in another way as well. While much of Uncharted is an attempt to demonstrate the profound, game-changing implications of the Ngram Viewer, there is a distinctly small-questions, cocktail-party-conversation feel to this type of inquiry, one that seems, ironically, more useful in preparing ABD humanities and social science PhDs for jobs in the service industry than in training them for the future of academia. It might be more precise to say that the Ngram Viewer is architecturally designed for small answers rather than small questions. All is resolved through linear projection: a winner and a loser, or stasis. This is a vision of research where the precise nature of the mediation (what books have been excluded? what is the effect of treating all books as equally revealing of human culture? what about those humans whose voices have been systematically excluded from the written record?) is ignored, and where the actual analysis of books, and indeed the books themselves, are black-boxed from the researcher.

    Uncharted speaks to the perils of doing research under the cloud of existential erasure and to the failure of academics to lead with a different vision of the possibilities of quantification. Collaborating with the wealthy corporate titans of data collection requires accepting these companies' own existential mandate: make tons of money by monetizing a dizzying array of human activities while speculatively reimagining the future so as to maintain that cash flow. For Google, this is a vision in which all activities, not just "googling," are collected and analyzed in a seamlessly updating centralized system. Cars, thermostats, video games, photos, and businesses are integrated not for the public benefit but because of the power of scale to sell, rent, or advertise products. Data is promised as a deterministic balm for the unknowability of life, and Google's participation in academic research lends it the credibility to be your corporate (sen.se) mother. What, we might imagine, are the speculative possibilities of networked data not beholden to shareholder value?
    _____

    Benjamin Haber is a PhD candidate in Sociology at CUNY Graduate Center and a Digital Fellow at The Center for the Humanities. His current research is a cultural and material exploration of emergent infrastructures of corporeal data through a queer theoretical framework. He is organizing a conference called “Queer Circuits in Archival Times: Experimentation and Critique of Networked Data” to be held in New York City in May 2016.

    Back to the essay

    _____

    Notes

    [1] A project desperately needed in academia, where terms like “neoliberalism,” “biopolitics” and “late capitalism” more often than not are used briefly at end of a short section on implications rather than being given the critical attention and nuanced intentionality that they deserve.


  • Who Big Data Thinks We Are (When It Thinks We're Not Looking)

    a review of Christian Rudder, Dataclysm: Who We Are (When We Think No One’s Looking) (Crown, 2014)
    by Cathy O’Neil
    ~
    Here’s what I’ve spent the last couple of days doing: alternately reading Christian Rudder’s new book Dataclysm and proofreading a report by AAPOR that discusses the benefits, dangers, and ethics of using big data (which is mostly “found” data originally meant for some other purpose) as a replacement for public surveys, with their carefully constructed data collection processes and informed consent. The AAPOR folks have asked me to provide tangible examples of the dangers of using big data to infer things about public opinion, and I am tempted to simply ask them all to read Dataclysm as exhibit A.

    Rudder is a co-founder of OKCupid, an online dating site. His book mainly pertains to how people search for love and sex online, and how they represent themselves in their profiles.

    Here’s something I will mention for context on his data explorations: Rudder likes to crudely provoke, as he displayed when he wrote a recent post explaining how OKCupid experiments on users. He enjoys playing the part of the somewhat creepy detective, peering into what OKCupid users thought was a somewhat private place to prepare themselves for the dating world. It’s the online equivalent of a video camera in a changing booth at a department store, a practice he defended not so subtly on a recent episode of the NPR show On The Media, and which was written up here.

    I won’t dwell on that aspect of the story because I think it’s a good and timely conversation, and I’m glad the public is finally waking up to what I’ve known for years is going on. I’m actually happy Rudder is so nonchalant about it because there’s no pretense.

    Even so, I’m less happy with his actual data work. Let me tell you why I say that with a few examples.

    Who Are OKCupid Users?

    I spent a lot of time with my students this summer saying that a standalone number isn’t interesting: you have to compare it to some baseline that people can understand. So if I told you how many black kids have been stopped and frisked this year in NYC, I’d also need to tell you how many black kids live in NYC for you to get an idea of the scope of the issue. It’s a basic fact about data analysis and reporting.

    When you’re dealing with populations on dating sites and you want to draw conclusions about the larger culture, the relevant baseline is how well the members of the dating site represent the population as a whole. Rudder doesn’t provide this. Instead he just says there are lots of OKCupid users for the first few chapters, and then, after he’s made a few spectacularly broad statements, on page 104 he compares OKCupid users to internet users generally, but not to the general population.

    It’s an inappropriate baseline, introduced too late. I’m not sure about you, but I don’t have a keen sense of the population of internet users; I’m pretty sure very young kids and old people are not well represented, but that’s about it. My students would have known to compare a population to the census. It needs to happen.
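    The baseline point above can be sketched in a few lines of Python (every number here is hypothetical, invented purely for illustration): a raw count only becomes meaningful once it is divided by the relevant reference population, such as census figures.

```python
# Toy illustration of why raw counts need a baseline (all numbers hypothetical).
# A site's raw user counts by age group tell us little until we normalize
# against a reference population, e.g. census figures.

site_users = {"18-29": 50_000, "30-49": 30_000, "50+": 5_000}
census_pop = {"18-29": 53_000_000, "30-49": 83_000_000, "50+": 110_000_000}

for group in site_users:
    rate = site_users[group] / census_pop[group]
    print(f"{group}: {rate:.6f} of the census population uses the site")
```

    The raw counts make the 50+ group look tiny; the per-capita rates show how skewed the site is relative to the population, which is exactly the comparison a "who we are" claim needs.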

    How Do You Collect Your Data?

    Let me back up to the very beginning of the book, where Rudder startles us by showing us that the men that women rate “most attractive” are about their age whereas the women that men rate “most attractive” are consistently 20 years old, no matter how old the men are.

    Actually, I am projecting. Rudder never tells us exactly what the rating is, how it’s worded, or how the profiles are presented to the different groups. And that’s a problem, one he ignores completely until much later in the book, when he mentions that how survey questions are worded can have a profound effect on how people respond; but there his target is someone else’s survey, not his own OKCupid environment.

    Words matter, and they matter differently for men and women. If there were a button for “eye candy,” for example, we might expect women to choose young men more often. If my guess is correct and the term in use is “most attractive,” then for men it might well trigger a sexual concept whereas for women it might trigger a different social construct; indeed I would assume it does.

    Since this is a dating site rather than a porn site, we are not filtering for purely visual appeal; we are looking for relationships. We are thinking beyond what turns us on physically and asking ourselves: who would we want to spend time with? Who would our family like us to be with? Who would make us attractive to ourselves? Those are different questions that provoke different answers, and they are culturally interesting questions, which Rudder never explores. A lost opportunity.

    Next, how does the recommendation engine work? I can well imagine that, once you’ve rated Profile A highly, an algorithm finds Profile B such that “people who liked Profile A also liked Profile B.” If so, then there’s yet another reason to worry that results like the ones Rudder describes are produced in part by the feedback loop the recommendation engine engenders. But he doesn’t explain how his data is collected, how it is prompted, or the exact words that are used.

    Here’s a clue that Rudder is confused by his own facile interpretations: men and women both state that they are looking for relationships with people around their own age or slightly younger, and they end up messaging people slightly younger than they are, but not many years younger. So forty-year-old men do not message twenty-year-old women.

    Is this sad sexual frustration? Is this, in Rudder’s words, the difference between what they claim they want and what they really want behind closed doors? Not at all. This is more likely the difference between how we live our fantasies and how we actually realistically see our future.

    Need to Control for Population

    Here’s another frustrating bit from the book: Rudder talks about how hard it is for older people to get a date, but he doesn’t correct for population. And since he never tells us how many OKCupid users are older, nor compares his users to the census, I cannot infer it.

    Here’s a graph from Rudder’s book showing the age of men who respond to women’s profiles of various ages:

    [Image: dataclysm chart 1]

    We’re meant to be impressed with Rudder’s line: “for every 100 men interested in that twenty year old, there are only 9 looking for someone thirty years older.” But here’s the thing: maybe there are 20 times as many 20-year-olds as there are 50-year-olds on the site? In which case, yay for the 50-year-old chicks. After all, those histograms look pretty healthy in shape, and they might be differently sized simply because the population itself is drastically different at different ages.
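    The per-capita correction being asked for here is a one-liner. With hypothetical population numbers (Rudder does not publish them, which is the complaint), the same raw counts can flip the story:

```python
# Toy version of the population correction (all numbers hypothetical):
# raw counts of men messaging women of a given age look lopsided until
# they are divided by how many women of that age are on the site.

messages_received = {20: 100, 50: 9}   # Rudder-style raw counts
women_on_site = {20: 200, 50: 10}      # invented population sizes

for age in messages_received:
    per_capita = messages_received[age] / women_on_site[age]
    print(f"age {age}: {per_capita:.2f} messages per woman")
```

    Under these invented numbers the 50-year-olds actually receive more messages per person than the 20-year-olds, even though the raw counts differ by a factor of eleven. Without the denominators, the raw counts support no conclusion at all.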

    Confounding

    One of the worst statistical mistakes in the book is his handling of the experiment in turning off pictures. Rudder ignores the concept of confounders altogether, a concept he is again miraculously aware of in the next chapter, on race.

    To be more precise, Rudder talks about the experiment when OKCupid turned off pictures. Most people went away when this happened but certain people did not:

    [Image: dataclysm chart 2]

    Some of the people who stayed on went on “blind dates.” Those people, whom Rudder calls the “intrepid few,” had a good time with their dates no matter how unattractive the dates were deemed to be by OKCupid’s attractiveness measure. His conclusion: people preselect for attractiveness, which is actually unimportant to them.

    But here’s the thing: that’s only true for people who were willing to go on blind dates. What he’s done is select for people who are not superficial about looks, and then collect data suggesting they are not superficial about looks. That doesn’t mean OKCupid users as a whole are not superficial about looks; the ones who are just got the hell out when the pictures went dark.
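    This selection effect is easy to simulate. The model below is mine, not Rudder’s: give every user an invented “superficiality” score, let superficial users be more likely to leave when pictures go dark, and the people who remain will look unconcerned with looks even though the full population is not.

```python
# Simulation of selection bias (model and parameters are hypothetical):
# users with higher "superficiality" scores are more likely to leave
# when pictures are turned off, so data collected from the remainers
# understates how much looks matter to the whole population.
import random

random.seed(0)

# Each user gets a superficiality score in [0, 1].
users = [random.random() for _ in range(10_000)]

# Probability of staying is 1 minus the superficiality score.
remainers = [s for s in users if random.random() > s]

avg_all = sum(users) / len(users)
avg_remainers = sum(remainers) / len(remainers)
print(f"avg superficiality, all users: {avg_all:.2f}")
print(f"avg superficiality, remainers: {avg_remainers:.2f}")  # noticeably lower
```

    Measuring only the remainers and announcing that “looks don’t matter” is exactly the mistake being described: the conclusion is baked into who stayed in the sample.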

    Race

    This brings me to the most interesting part of the book, where Rudder explores race. Again, it ends up being too blunt by far.

    Here’s the thing: race is a big deal in this country, and racism is a heavy criticism to fire at people, so you need to be careful, and that’s a good thing, because it’s important. The way Rudder throws the term around is careless, and he risks rendering it meaningless by not having a careful discussion. The frustrating part is that I think he actually has the data for a very good discussion; he just doesn’t make the case the way the book is written.

    Rudder pulls together stats on how men of all races rate women of all races on an attractiveness scale of 1 to 5. They show that non-black men find women of their own race attractive, and find black women, in general, less attractive. Interesting, especially when you immediately follow that up with similar stats from other U.S. dating sites and, most importantly, with the fact that outside the U.S. we do not see this pattern. Unfortunately that crucial fact is buried at the end of the chapter, and instead we get this embarrassing quote right after the opening stats:

    And an unintentionally hilarious 84 percent of users answered this match question:

    Would you consider dating someone who has vocalized a strong negative bias toward a certain race of people?

    in the absolute negative (choosing “No” over “Yes” and “It depends”). In light of the previous data, that means 84 percent of people on OKCupid would not consider dating someone on OKCupid.

    Here Rudder just completely loses me. Am I “vocalizing” a strong negative bias towards black women if I am a white man who finds white women and Asian women hot?

    Especially when you consider that, as consumers of social platforms and sites like OKCupid, we are trained to rate everything we come across in order to get better offerings, it is a step too far for the detective on the other side of the camera to turn around and point fingers at us for doing what we’re told. Indeed, this sentence plunges Rudder’s narrative deep into creepy and provocative territory, and he never fully returns, nor does he seem to want to. Rudder seems to confuse provocation for thoughtfulness.

    This is, again, a shame. The issues of what we are attracted to, what we can imagine doing, how we imagine that will look to our wider audience, and how our culture informs those imaginings are all in play here, and a careful conversation could have drawn them out in a non-accusatory and much more useful way.


    _____

    Cathy O’Neil is a data scientist and mathematician with experience in academia and the online ad and finance industries. She is one of the most prominent and outspoken women working in data science today, and was one of the guiding voices behind Occupy Finance, a book produced by the Occupy Wall Street Alt Banking group. She is the author of “On Being a Data Skeptic” (Amazon Kindle, 2013), and co-author with Rachel Schutt of Doing Data Science: Straight Talk from the Frontline (O’Reilly, 2013). Her Weapons of Math Destruction is forthcoming from Random House. She appears on the weekly Slate Money podcast hosted by Felix Salmon. She maintains the widely-read mathbabe blog, on which this review first appeared.

    Back to the essay