Doctoral project
Quantifying relationships and differences across language editions in multilingual knowledge repository Wikipedia
Name
Anna Samoilenko
Status
Completed
Completion of doctorate
Primary supervisor
Prof. Dr. Steffen Staab
Wikipedia is a prominent socio-technological phenomenon with a direct impact on modern society. Its content matters to the human readers who search it for information. Beyond human readers, many AI and recommendation systems, as well as the search engine Google, rely on the Wikipedia database for automatic content generation. Finally, Wikipedia is very important for academics: it provides, for the first time, an opportunity to study collaboration patterns on a large scale, and it is the largest multilingual repository of human knowledge ever created.
My thesis focuses on Wikipedia as a prominent example of a complex multilingual source of sociocultural data. I continue the line of research introduced by Brent Hecht and others, who provided ample evidence that the language editions of Wikipedia are stand-alone projects with unique editor bases and original content, not mere translations of the larger Wikipedias such as English or German. My aim is to advance this knowledge by quantifying relationships and differences across language communities on Wikipedia. In particular, I focus on two novel and understudied domains: (1) mapping communities of shared information interests across multilingual editor communities on Wikipedia, and (2) quantifying content differences, using a specific knowledge domain as an example. Although my research themes are closely connected with the social sciences, my approach is purely computational. Specifically, my work contributes several new scalable frameworks and operationalisation strategies that are useful for quantifying trends in large-scale multilingual data. This involves, for example, selecting meaningful variables and units of comparison, and finding suitable statistical measures and computational methods to extract meaningful insights from the selected variables.
I introduce four case studies that explore empirical questions about Wikipedia, and use them to illustrate the validity of my approaches. Regarding the relationships between multilingual editing communities on Wikipedia (1), my research answers the following empirical questions: How can a network of shared information interests be constructed quantitatively from large-scale multilingual Wikipedia editing data? What factors best explain the strength of bilateral ties and the formation of clusters? Is the set of language editions that cover a given Wikipedia concept random? Do certain editions show consistent interest in editing the same concepts? What socio-linguistic features explain common editing interests between language communities on Wikipedia?
Regarding content differences across editions (2), I focus on one of the most cross-lingually relevant knowledge domains, history, and address the following questions: What are the most documented periods of the last 1,000 years of history in Wikipedia? What are the temporal focal points in descriptions of national histories in Wikipedia? Are country timelines consistent across language editions? How do the descriptions of national histories in English Wikipedia compare to the corresponding articles in Encyclopedia Britannica? What are the differences in the temporal and topical aspects of coverage, and in the linguistic presentation of the material?
My research highlights important similarities and gaps across language editions of Wikipedia. I discuss the implications of these findings for the development and application of AI systems and automatic content generation, as well as for the economic, academic, and knowledge-consumption aspects of modern society.