Latent Personal Analysis (LPA) has just been published with Springer Nature in the UMUAI journal
Glad to share a new paper with Hagit Ben-Shoshan in the journal User Modeling and User-Adapted Interaction (UMUAI). It describes Latent Personal Analysis (LPA), a method for exploring complex domains, with uses in user/entity modeling, impersonation detection in social media, music, and biology.
The method builds a domain and, for each entity in that domain, a signature and a distance from the domain. In language, an author's signature within a domain can be derived, loosely speaking, from the popular words the author does not use and the otherwise infrequent words the author uses often. Both the distance and the signature weight missing popular terms more heavily and discount the vast tail of rare words missing from the person's vocabulary. Paraphrasing Debussy, style is the silence between the words: personal style is also measured by popular domain words missing from a person's vocabulary.
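To make the idea concrete, here is a minimal sketch in Python. It is not the paper's exact formulation: it assumes the domain is the pooled term frequencies of all entities, and it uses a smoothed, symmetric KL-style term as a stand-in for the distance. The function names, the `epsilon` smoothing value, and the toy texts are all hypothetical.

```python
from collections import Counter
import math


def term_frequencies(tokens):
    """Normalize raw token counts into relative frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def domain_frequencies(entities):
    """Assumption: the domain is the pooled vocabulary of all entities."""
    pooled = [tok for tokens in entities.values() for tok in tokens]
    return term_frequencies(pooled)


def signature(entity_tokens, domain, epsilon=1e-4, top_k=5):
    """Return (distance, top terms) for one entity against the domain.

    Each term contributes (p_local - p_domain) * log(p_local / p_domain),
    a symmetric KL-style quantity that is never negative. A missing term
    is given the small frequency `epsilon`, so a missing popular term
    contributes a lot and a missing rare term contributes little.
    The sign records overuse (+) or underuse (-) relative to the domain.
    """
    local = term_frequencies(entity_tokens)
    contributions = {}
    for term, p_dom in domain.items():
        p_loc = local.get(term, epsilon)
        magnitude = (p_loc - p_dom) * math.log(p_loc / p_dom)
        sign = 1 if p_loc >= p_dom else -1
        contributions[term] = sign * magnitude
    distance = sum(abs(v) for v in contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return distance, ranked[:top_k]


# Toy, made-up texts standing in for three authors.
entities = {
    "a": "the cat sat on the mat the cat".split(),
    "b": "the dog sat on the log the dog".split(),
    "c": "cat dog sat on mat log".split(),  # never uses the popular word "the"
}
domain = domain_frequencies(entities)
distance, top_terms = signature(entities["c"], domain)
print(distance, top_terms)
```

In this toy domain, the strongest element of author "c"'s signature is the missing popular word "the", which enters with a negative sign.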
How about applications?
- Quite a lot of insights for authorship attribution and the digital humanities.
For example, here's a style heatmap of 19th-century writers, comparing the distances between books' signatures to determine similarity.
Qualitative exploration: The book Robinson Crusoe was taken as a domain and each chapter as an entity.
The lower panel shows the most frequent words in the book (excluding stopwords). Interestingly, God is one of the principal words in the domain, i.e., in the book. However, in the last chapter, the word God appears much less often than in the rest of the book, while the word lough is much more frequent.
- In social media, LPA can be used for better user modeling. For example, look at the signatures of these two IMDb reviewers. This is a content-based exploration, exposing preferred as well as less preferred genres.
- Impersonation detection in social media:
LPA can detect a form of sockpuppetry, that is, several user accounts authored by a single person.
Interestingly, it can also flag a Front User account, a single account operated by multiple authors.
- Music: Spotify's 2017 dataset of all songs streamed that year, with their listening frequency in each country, was the basis for an analysis that produced an LPA distance and signature for each country. Countries that listen more to popular music have a lower distance. Examples are Canada, Switzerland, and Australia. These countries' signatures are characterized by small differences in how often hit songs are listened to. Countries distant from the domain are those where local music is preferred, for example, Turkey or Uruguay.
Here's a PCA of the distance matrix of the countries' Spotify LPA signatures. The first dimension accounts for 47.7% of the variance, and the second for 16.8%.
- Biology: LPA was used to compare the spectral spread of sub-repertoires of B cell clones (B cells with a shared mother cell) within a person. We confirmed previous findings that gut and blood tissues have separate repertoires.
We further identified a third branch of clonal patterns, typical of the lymphatic organs (spleen, MLN, and bone marrow) and separate from the other two categories. We also showed that the spleen gives the closest picture of the entire repertoire, and that person-popular clones are equally popular in the spleen. The LPA biology paper is with Uri Alon and Uri Hershberg.
- Diving into literary debates. A few have questioned the authorship of the Shakespearean poem "A Lover's Complaint". When a domain is created from all of Shakespeare's writings, most works have a typical distance, which is not the case for the poem in question.
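For readers who want to try the kind of PCA view used in the Spotify country analysis above, here is a minimal sketch with NumPy. It assumes that each row of a square distance matrix is treated as a feature vector and projected onto its top two principal components. The function name and the toy matrix are hypothetical; the numbers are made up and are not the Spotify data.

```python
import numpy as np


def pca_of_distance_matrix(distances, n_components=2):
    """Project the rows of a distance matrix onto its top principal components.

    Returns the projected coordinates and the share of variance
    explained by each kept component.
    """
    X = np.asarray(distances, dtype=float)
    X = X - X.mean(axis=0)  # center each column
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    coords = U[:, :n_components] * S[:n_components]
    return coords, explained[:n_components]


# Made-up pairwise distances between four unnamed entities.
toy = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
coords, ratio = pca_of_distance_matrix(toy)
print(coords)
print(ratio)
```

The two values in `ratio` play the same role as the per-dimension variance percentages quoted for the Spotify plot.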