Algorithms, archives and evidence of culture

Last week I attended the Australian Society of Archivists Annual Conference. One of my jobs was to talk about the cultural perspectives of technologies. I talked about algorithms, machine learning, and constructions of evidence of culture. I think I scared people.

Below is a video of the extended talk. It goes for just over 10 mins. At the end, I talk about my Mediated recordkeeping model and how it might be useful in exploring these expanding contexts and complexities of culture.

I am keen to explore the role of machine learning in cultural heritage spaces. Who wants to help?

Transcript:

Slide 1:

Image: http://www.npr.org/sections/13.7/2017/03/22/521059752/would-you-become-an-immortal-machine

Hello, I am Dr. Leisa Gibbons from Curtin University. I teach archives and preservation to undergrad and post grad students. In my research, I explore sociotechnical issues, impacts and implications of acquisition and preservation of online content and the role that archivists can, do and might play in the formation of digital cultural heritage.

In this presentation I am going to share with you some intriguing information about algorithms and machine learning I have been collecting over the last year or so, so that I might talk about the nature and purpose of web archiving and how it is possible to understand evidence of culture as it is being valued and formed over spacetime.

Originally, this presentation was designed in PechaKucha style where 20 slides are shown for 20 seconds each. This presentation has 13 slides with the last one being quite a lot longer than 20 seconds.

Slide 2

Image: http://theconversation.com/no-more-playing-games-alphago-ai-to-tackle-some-real-world-challenges-78472

This year Professor Geoff Goodhill, from the University of Queensland, wrote about AlphaGo, an AI program designed to learn to play Go. AlphaGo learns via use of neural networks and extraction of key ideas.

Slide 3

Image: https://www.forbes.com/sites/kalevleetaru/2017/09/16/ai-gaydar-and-how-the-future-of-ai-will-be-exempt-from-ethical-review/#3f3e339e2c09

You’ve probably heard about the algorithm created by Stanford researchers that predicts sexual orientation from photographs of a person’s face? This is also generated with learning neural network technology.

Yet, as Professor Geoff Goodhill mentioned, there is no known way to interrogate the network to directly read out what these key ideas are that help the algorithm make decisions. Instead they can only study its outputs and hope to learn from these.

Slide 4

Image: https://ichef.bbci.co.uk/news/624/cpsprodpb/D8FD/production/_96194555_3c7a28f4-bf98-4df5-96ea-476616b896cd.jpg

A couple of years ago, Vladan Joler and colleagues in Belgrade began investigating the inner workings of Facebook. This image is a flow chart that they created on how our interactions with Facebook create data – which show how we, as Facebook users, are in fact doing unpaid work for Facebook – so they can sell us stuff.

We all know this of course, but perhaps we think less about what this might mean in 20 or 150 years’ time for data privacy and surveillance, given that the data we give Facebook is used to calculate our ethnic affinity (Facebook’s term), sexual orientation, political affiliation, social class, travel schedule and much more.

Slide 5

Image: https://aeon.co/essays/judge-jury-and-executioner-the-unaccountable-algorithm?curator=MediaREDEF

In 2013, a community of scholars and activists gathered in the US to examine and discuss the social justice impact of algorithmic accountability or #algacc. They raised more questions than answers about the impact of data surveillance and our right to know what and how data collected about us is being used.

Slide 6

Image: https://pbs.twimg.com/media/CkmaUGgXEAAROUu.jpg

UCLA Assistant Professor Safiya Noble writes about algorithms of oppression and how the data they use to learn reinforces existing structures of racism and sexism. Safiya talks about how a Google search she undertook on the search term “black girls” often suggested porn sites and un-moderated discussions about “why black women are so sassy” or “why black women are so angry” – presenting a disturbing portrait of black womanhood in modern society.

Slide 7

Image: http://assets.pewresearch.org/wp-content/uploads/sites/14/2017/02/06143847/PI_2017.02.08_algorithms_0-01.png

Researchers at the Pew Research Center identified seven main themes about the algorithm era.

As part of sharing these concerns they tell a story of how Microsoft engineers recently created a Twitter bot named “Tay” in an attempt to chat with Millennials by responding to their prompts, but within hours it was spouting racist, sexist, Holocaust-denying tweets based on algorithms that had it “learning” how to respond to others based on what was tweeted at it.

Slide 8

Image: https://media.newyorker.com/photos/5931822ff7120e02cf40436a/master/w_649,c_limit/Nijhuis-Big-Data-2.jpg

This year, US Professor Ben Shneiderman proposed that there should be a regulatory body called a National Algorithms Safety Board, which would provide oversight for high-stakes algorithms.

Slide 9

Image: http://www.abc.net.au/news/2017-03-20/algorithms-flowchart-illustration/8360072

In Australia, there are at least 20 separate parts of law that allow the government to give computers the power to make decisions: decisions that used to be made by a human and that can have important consequences.

These laws allow for computers to make decisions about social security, taxation, parental leave, superannuation, migration, biosecurity and child support. In every case, some kind of algorithm may be used to make decisions, yet we have no knowledge of how these work.

These are powerful and disturbing stories about the creation and use of data, the role the internet plays and the shaping role that mathematics and computers are playing in our society. This brings me to web archiving.

Slide 10

Image:

One of the most basic tenets of all data science is that data doesn’t exist in a vacuum, it is the result of a massive pipeline of explicit and implicit decisions

yet so much of the output of the data science world proceeds as if data can be cleanly separated from the contexts in which it is created.

Nowhere is this more apparent than the world of web archiving.

Researcher Kalev Leetaru wrote an article for Forbes recently that starts with this paragraph. This was not his first dig at how poorly web archiving is conceptualised and constructed. He started in 2012, talking about how the lack of documentation regarding even the most critical decisions like inclusion criteria, seed lists and third-party crawl donors means that we have precious little insight into how these archives were constructed and what biases may be manifest through those myriad decisions.

This is not a new conversation for me either. But algorithms and the rate of change in our virtual spaces and technologies are raising the stakes.

Slide 11

Image:

When it comes to using data to understand the world around us, the most important question revolves around how well that data reflects the phenomena we are attempting to study.

Kalev rightly asks questions about the nature of web archiving. When it comes to using data to understand the world around us, the most important question revolves around how well that data reflects the phenomena we are attempting to study. Do Twitter-based studies of human society truly reflect the dreams and fears of global society or are they systematically biased geographically and demographically? Do the breaking news events surfaced by the Facebook Trending Topics module exclude much of the continent of Africa and is Africa as a whole largely absent from the datasets we use to understand the world? Does the relative dearth of analytic algorithms for languages other than English mean we miss critical trends?

Slide 12

Image: https://specials-images.forbesimg.com/imageserve/52cd2b055def4f42b28d687712caf2aa/960×0.jpg?fit=scale

All this exploration of algorithms and the internet comes back to a question I have been raising for a decade now – what is evidence of culture? And in this question, what is the role of the archivist and the archives in the construction and dissemination of cultural heritage?

If web archives are online cultural heritage, how is their construction being understood and documented? As Kalev points out – does the medium examined define the results?

This raises the question of what web archives are actually evidence of. But how do we interrogate the notion of evidence of culture?

Slide 13

Image: Mediated recordkeeping model

I want to share with you a model I created from research on how to understand the complexity of evidence of culture in online spaces. This model is an attempt to make sense of how and why people interact with recorded information – the purposes, the values, and the nature of memory as it is created, shared, accessed and managed over time in various and complex ways, including in response to technologies, other people and entities, and various mechanisms, systems and tools that help to enable and empower, as well as disempower and make hidden.

I want to share with you the three important areas it represents:

Firstly, memory and evidence as processes are separate but intrinsically linked. The processes of memory-making have a relationship to multiple systems, mechanisms and perspectives involved in establishing evidence.

Secondly, how people create is linked to how they see and identify themselves, what they are interested in, how they identify with various communities, as well as what values they perceive according to various community cultures. Narrative is vital to understanding this as it is a tool that can construct and communicate multiple and simultaneous realities, identify and make sense of the self within groups, community and society, and is imbued with power; of dominant, counter and competing narratives and as a mechanism for memory-making and knowledge preservation.

Thirdly, interaction occurs in conjunction with an understanding of action at various levels, as well as in relation to how people use, value and experience technologies including what technologies afford or do not allow to help people achieve their goals in creating and sharing something of who they are.

This model shows all these points of view to exist simultaneously and in multiples. How an individual understands their identity and work is not necessarily how it is seen by someone else. So when the archivist creates, in the creator dimension by documenting the world, they should be taking into account the varied, diverse and potentially incommensurable complexities that make up this map of how we understand cultural heritage as evidence of culture.

If we see algorithms as part of a continuum of mediated memories where and how do they fit in? Whose narratives are being told and what do we need to know about mandates to understand their contexts as memory? I don’t have any answers today but this is something I am about to examine.

But what my research into algorithms is beginning to reveal is the deep and complex nature of the relationship we have with data and machines. Recordkeeping is a memory-making process that contributes to evolving values, purposes and interactions over spacetime including memory (as making and remembering), narrative (as personal, sharing and evolving), evidence (as constructions of value and meaning) and technologies (as mediators and facilitators).

Archivists, and I count myself as one, need to consider what this means as to how we understand culture as evidence and heritage as it is being formed. Archivists also need to understand and challenge their role in the system so that they may empower, discover and transform to meet multiple needs over time. Flexibility, adaptability and a need to understand what is being valued and who by as it is being created are essential to any transformation. That includes transformation within ourselves as professionals as well as the transformation of what role archives as constructions of evidence play in society.

Thank you.

Computational archival science – what is it?

Recently, I submitted a paper to the Computational Archival Science (CAS) workshop about my research exploring ways to develop tools and methods to support and undertake automated appraisal for cultural heritage. The project is at the conception stage and I am exploring the existing methods and tools that identify data from documents and map them as networked contexts. I am starting with a pilot project focusing on the Zoetic Walls street art project in Cleveland.

The project is heavily conceptual and explores what it means to document and manage cultural heritage that exists physically and virtually, and has significant ephemerality issues. The goal is to explore how it is possible to do this work in a way that engages with multiple stakeholders and various contexts that contribute to a social phenomenon that has been given some meaning and value. The ultimate goal is to design appraisal tools that might be able to be used in a much more participatory way. It is heavily conceptual right now and right at the beginning. A great place for some feedback from people who are interested in archival science and data. Or so I thought.

My short paper for the workshop was rejected, although the reviewers said it was well written. From the comments and then a closer reading of the workshop history and goals, I realise that I was trying to pitch a research project about archives to mostly digital humanities scholars who have their own particular view about what CAS is as well as what archival science is more broadly. I wrote an email responding to the reviewer comments, but it bounced back as it is one of those email boxes that are not read. I could not find an email address for them either, so I have copied the email I wrote below.

The main issue that the reviewers seemed to have was that my work was too conceptual and that the conceptual aspects I was talking about are not “computational archival science.” Not just not suited or not part of the conversation, but actually not CAS. The email I wrote addresses comments from the second reviewer mostly and while I have not copied the second reviewer’s comments here, I think you get the gist from my email.

I feel like I must have missed a conversation somewhere about the relationship that computational archival science has with the actual discipline of archival science (or archival studies as it is also called here in the US). Are there any archival scholars out there who have been involved in this CAS that can enlighten me? I have read the definition listed on the website and wonder primarily about how the term “archival thinking” has been used and its relationship to the use of “archival science.” I checked and I believe my work fits into the notion of Computational Thinking (CT) as being “the thought processes involved in formulating a problem and expressing its solution(s) in such a way that a computer—human or machine—can effectively carry out”. Of course I am happy to receive feedback and improve on my work and I can see where I could improve on the paper to expressly address CT, especially at this pilot stage. However, my concern remains this issue with what “archival thinking” means and how this is carried out in CT contexts as CAS.

To the workshop organizers;

Thank you for the feedback. From the reviewer feedback it is clearer to me that this workshop focus is on data science and machine learning from digital humanities perspectives.

It is unfortunate to hear that while engagement with archival science theories and principles is being asked for, engagement with archives and archival work is not suitable for the workshop. I am of course engaged with exploring what it means to document cultural heritage from an archives-as-institution point of view as this is one of the ways that big data and linked data gets created. However, my research also attempts to grapple with alternative conceptual, practical and technological approaches to appraisal via crafting and testing methods to identify data as context (and then what that data tells us about the social phenomenon). I do see that I did not explore the notion of automation enough from an archival processing point of view and the potential role it can play for digital humanities research.

Regards literature on conceptual modelling, I am not addressing or using conceptual modelling, rather context entity mapping and network analysis using existing archival science standards and methodologies. It is possible to explore links between conceptual modelling from a computer science and information systems point of view, but it is really the topic for another paper. In archival science there is not much literature on conceptual modelling other than possibly the OAIS model (which is a functional model, albeit from a computer science idea of conceptual modelling) or the models developed from InterPARES (mostly business process models), or the records continuum model (conceptual and activity model) and other entity models developed by the Records Continuum Research Group (context entity models), so I am not sure what was expected in relation to this.

Regards the comment about being heavily influenced by critical theory, I do not refer to or engage in any critical theory literature or frameworks in this paper. As was mentioned clearly in the paper, my paradigmatic approach is social constructivism. The comment about being heavily influenced by critical theory in archives indicates a lack of understanding about engagement by archival science scholars with critical theory.  

Finally, it is regrettable that a workshop on archival science does not want to engage with the “four walls” of the archives and provide a space to engage with challenges that can be made from within and outside of this context. I do not work within those four walls, but I study them and what they mean in various conceptual, technological and practical ways, and any engagement with archival science, computational or not, requires exploration of their impact and meaning.  

Thank you for your time.

Regards,

Leisa

 

 

How to research decision-making in the archival discipline

Finally I have some books in my office!

There are a couple of things that have been on my mind for a while – the concept of an ‘archival record’ and how people make decisions about what to archive. In my recent research*, I examined some of the activities and interactions that occur in the formation of cultural heritage. My work looks at online social spaces (social media – specifically YouTube), and so in a way looks closely at technology. In the model I developed I specifically dedicated one area of it to mediated memories – a term I borrowed from José van Dijck’s book of the same title that I spotted on the catalogue of new items coming out when working in a bookshop and so bought and then devoured it and then started a degree in research – and well, here we are.

José van Dijck’s book is about memory in a digital age, and I did apply it in that way within my model, but I think about mediated as being not just about technology and what it does, but the systems that support decision-making in relation to technology. Because technology does what we tell it to, but how we do it is shaped by how the technology does or does not work. In my model the contexts of mediated memories concern the tools that support memory-making – tools, local systems, shared systems, collaborative systems, archival systems. These are not just places for stuff, but active systems that support memory co-creation, capture, organisation, curation and pluralisation. More about those terms in a future blog post.

This gets me back to the concept of archival record. As a records professional (encompassing all activities related to recordkeeping), I am confused by this term. I must have read it already many times, but I am now thinking about it in relation to building a new archival course – how do I explain what this is? Why is there a difference between a record and an archival record apart from it has been identified as one and perhaps managed in an archive? Can people who are not archivists decide something is an archival record? Is its inherent archival-ness important in making this decision?

Back to mediated memories – the only part of my model that mentions archives at all. Archival systems, however, are in my mind not about archives but about the ability to make a decision related to how a record is managed. Yet, a local system can also be an archival system – they are not mutually exclusive. I looked up “archival” on the SAA Glossary (such a great tool – thanks Richard) and note that it mentions “enduring” value. The definition of the term “archival records” also mentions “enduring value”. This is an interesting term and one I will explore again later, but in the meantime, thinking about mediated memories and the role that decisions have in making memory, and how it is managed, I wonder if the term archival records is defined only in relation to the physicality of the record – that it is tangible and located in an archive? An archival system will have records – as much as a local system will have records. The differences are about how enduring the records are (as decided by someone), and how much organising they go through in order to be managed over time. Does this ultimately mean that the more “archival” a record is, the more metadata it has where the metadata shows its enduring value through time?

The concept of archival and archival records as being enduring, long-lived, permanent, is problematic within a social media context. Social media is inherently ephemeral (defined in the SAA Glossary as: Useful or significant for a limited period of time. Ephemera are things generally designed to be discarded after use). The idea of ephemera implies there is no enduring value, and this is not necessarily true. Of course archives, libraries and other memory institutions collect ephemera, but it is treated differently from records – for various reasons. Yet, the networks and systems that provide contexts for ephemera are not necessarily captured – the decision-making that goes on in relation to ephemera as archival record really begins at the archival “door” (some refer to it as the threshold – a term I am not comfortable with). But there are decisions that are made about ephemera, and in social digital spaces, these decisions are part of the network of systems – local, shared, collaborative – the tools that are used.

I am not sure exactly where this line of thinking ends. I am interested in how decisions, by anyone, determine value over time. I am also interested in how the network contributes context to understanding something like enduring value. I wonder whether, if the archival system is linked to but separate from the record, something of the decision-making and an understanding of enduring value remains.

My research in this area looks at how individuals and communities make decisions about memory – the making, the tools, the stories told. This links to how people make decisions about their own identities, and the value of their stories – the making of (personal and community) memory. My previous research (get the published copy here) indicated that archival and other cultural heritage institutions, when collecting digital content from the web in particular, do not capture or manage all the context that contributes to how the thing/document/content/record was created in the first place – the decisions made about value, story and memory by the people who created it.

*OK, it was my PhD, but I am trying to get away from saying that. I really feel like I need to move on.