Since the 1990s, cultural heritage institutions have been investing in digital technologies to address a growing public demand for open and permanent access to information resources. Accordingly, galleries, libraries, archives, and museums worldwide have strategically focused on the digitization of their holdings. The next step involved the development of digital collections and services in support of online research and learning.1
While enabling direct access to cultural and scientific heritage, digitization of archival materials has also fostered their preservation, virtual collocation on web portals, and the creation of an integrated learning environment. As a result, libraries have seen a substantial increase in the use of digitized materials that have attracted diverse users.2 Researchers, scholars, educators, entrepreneurs, and web surfers have engaged with content on an unprecedented scale because it is available in digital format.3
This digital shift in the library world continues to accelerate. Due to the pandemic crisis in 2020, print collections have rapidly become unavailable. Research and learning have moved to a virtual environment for the immediate future, and perhaps for good. Digital content has suddenly transitioned from being a preview of the physical collection to its primary access point.4
Digital collections, however, are not simply representations of physical collections. Rather, they are resources in their own right.5 Unlike physical collections, digital ones have detailed metadata and, often, full text available due to OCR conversion of text images into machine-encoded data. Both metadata and data can be mined, analyzed, and visualized. Text mining refers to discovering and extracting meaningful patterns from large numbers of text documents. Practitioners of computational linguistics and digital humanities call these agglomerations of texts corpora or capta.6 Text analysis of corpora or capta at its very basic level involves looking at words’ frequencies, contexts, lexical preferences, and relations among textual elements. Visualization refers to the design of the graphic representations of data objects and their relationships.7 It is commonly viewed as an essential part of textual data analysis that helps to make quantitative information legible and easy to comprehend.8
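The most basic form of text analysis mentioned above, counting word frequencies across a corpus, can be sketched in a few lines. This is only an illustrative Python sketch, not the method used later in this report, which relies on R. The function name `word_frequencies`, the `min_length` cutoff, and the two sample sentences are assumptions invented for the example.

```python
import re
from collections import Counter

def word_frequencies(documents, min_length=3):
    """Count how often each word occurs across a list of text documents."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and keep runs of letters (with an optional inner apostrophe)
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        # Skip very short tokens such as "to" or "in"
        counts.update(w for w in words if len(w) >= min_length)
    return counts

# Hypothetical OCR output, invented for illustration only
sample_corpus = [
    "The telescope was shipped to the observatory in March.",
    "Observatory staff tested the telescope and the camera.",
]
frequencies = word_frequencies(sample_corpus)
print(frequencies.most_common(3))
```

A real corpus would usually also need a stop-word list so that function words such as "the" do not dominate the counts.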
Digital collections’ data are open for exploration and analysis just as any other data are. Like humanists, who now examine how technology is changing our understanding of the liberal arts, we can employ digital tools to see how they may shape our understanding of archival curation and librarianship.9 We begin by examining some pragmatic reasons for visualization of digital collections. Then our focus will move to the particular application of visualization tools, specifically the R programming language, including ggplot2 and RShiny. R has been designed to facilitate data analysis, statistics, statistical programming, and graphics.10
There are multiple reasons to visualize a digital collection’s data, but one of the most important is to appeal to our natural visual abilities. Humans are very good at seeing patterns. The biggest part of our brain cortex is involved in visual perception.11 According to the Light Switch Theory by Andrew Parker, the eyes and vision have been the primal driving force of biological evolution since the Cambrian Period, approximately 543 million years ago.12 The rapid development of visual perception has thus been considered a fundamental survival strategy for animal species, including humans. Tuned by evolution, human vision has always been our most critical sense for detecting and extracting information from our physical surroundings.13
Naturally, various forms of visualization, as means of communication utilized by people, have been present worldwide throughout human history. Cave paintings and rock art, maps, calendars, genealogical trees, chronological tables, and time lines, as well as modern-day charts and graphs, have all been used to convey essential information so that it can be grasped quickly.14 This high efficiency of an image that “is worth a thousand words” has recently played a leading role in the development of present-day screen culture. In his book on this global phenomenon, Richard Butsch formally defines screen culture and points to its continuing relevance:
Screens are about images more than language, a modern form of visual culture. . . . It is the lived culture that arises when people interact with and through screen media. As everyday life fills with screen activity—averaging nine hours per day for American adults in 2017—screen becomes an increasingly important aspect of the broader culture infiltrating and influencing all other elements.15
Indeed, visualization presently seems to dominate all communications.16 Films, television, video games, video websites, mobile applications, and social media circulate images on a massive scale. We interact with dashboards, maps, newspaper charts, app graphs, and text messages enhanced by emojis or animated GIFs on an hourly basis. Visual representations have become an integral part of our social and professional communication practices. Scott Berinato refers to visualization as a lingua franca used in knowledge economies of the twenty-first century.17 He argues that the attractiveness of visualization stems from its ability to summarize and simplify complex data and thus help to make sense of them. In addition, the practice of data visualization has now been demystified and democratized.18 It is no longer in the purview of only software experts and professional coders. Due to technological advances and growing demand, tools for creating graphs and charts have become open and available to everyone who is willing to learn a new language.19 If libraries are to participate in and contribute to screen culture, then perhaps it is time to learn its language and go visual.
There is a growing literature on the relevance of graphics for digital libraries. Visualization is often used to support exploration and discovery, content analysis, and communication about the collection. For example, graphical representations of digital collections are considered to be a great alternative to text-based interfaces and search boxes, especially for nonexperts and casual users.20 Unlike empty search fields that rely on the users’ input and background knowledge, graphs and diagrams provide a comprehensive overview of the collection easily understandable by all users. Along the same lines, “generous interfaces” are designed to show graphs of digital collections up front on web portals in order to both spark users’ interest and inspire further exploration of digitized material.21 In addition to providing an overview of a collection’s scope and content, generous interfaces include the contexts for the collection, a display of the relationships among collection items, and a quick closer look at selected images. These graphic overviews are natural starting points for browsing large sets of digital items, identifying relevant topics and patterns, selecting pertinent documents and images, and finally focusing on their details.22 Recent implementations of visual search and discovery involve interactive interfaces where users navigate digital collections as virtual galleries.23 According to Eric Phetteplace, interactivity adds a discovery layer that enables users to become active agents in finding new patterns in data and putting new interpretations on them.24
Indeed, graphics foster not only intended searches for information, but also serendipitous findings. According to Windhager and colleagues, following graphics and diagrams often leads users to discoveries of diverse perspectives about a collection’s data.25 These unexpected findings, in turn, may inspire new approaches to examination of historical evidence and new ways of thinking about the nature of primary sources and information at large. Since charts and diagrams may easily reflect a diversity of views and the complexity of information, they seem to be very well suited for multithreaded investigation.26
Similarly, archivists and curators find the application of graphics extremely useful for the analysis of large digital collections.27 Visualization allows curators to examine the structure and organization of a collection, its content and provenance, relationships among the collection’s items, the scope and size of the collection, and the number of files and their formats, as well as text patterns in documents and visual patterns among images. In addition, graphs may reveal distributions of various documents and images over time that provide remarkable insight into the process of collection development. Monitoring progress of a collection also involves the assessment of its metadata in terms of their completeness and quality. Computing applications used for visualization fully expose all inconsistencies and missing values across metadata fields. Visualization then may also be used as an effective tool for metadata quality control. All these observations about a collection’s structure and a description of its content come with a growing understanding of data enabled by visualization. Richard Hamming once pointed out that the purpose of computing is insight—not numbers.28 By the same token, the purpose of visualization is grasp rather than graphics.
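The metadata quality check described above, exposing missing values across metadata fields, can be illustrated with a small completeness report. The following is a minimal Python sketch under stated assumptions: the field names, the sample records, and the helper `field_completeness` are all hypothetical, and a plain-text bar stands in for a real chart.

```python
def field_completeness(records, fields):
    """Return the share of records with a non-empty value for each field."""
    total = len(records)
    report = {}
    for field in fields:
        # Treat missing keys, None, and whitespace-only strings as empty
        filled = sum(1 for r in records if str(r.get(field) or "").strip())
        report[field] = filled / total if total else 0.0
    return report

# Hypothetical metadata records, invented for illustration only
records = [
    {"title": "Letter to a colleague", "date": "1931-05-02", "creator": "Unknown"},
    {"title": "Observation notes", "date": "", "creator": None},
    {"title": "Photograph of telescope", "date": "1948", "creator": "  "},
    {"title": "", "date": "1950-11-20"},
]

completeness = field_completeness(records, ["title", "date", "creator"])
for field, share in completeness.items():
    # One '#' per 5 percent of records with a value in this field
    print(f"{field:<8} {share:.0%} {'#' * round(share * 20)}")
```

Even this crude display makes the weakest field stand out at a glance, which is the point of using visualization for quality control.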
To enable readers to get a good grasp on data, graphics need to communicate information clearly. Graphs and diagrams are meant to explain data, not to obscure them.29 According to Richie Cotton, the effectiveness of a plot can be measured by two criteria: how many insights readers can get from the plot and how fast they can do so.30 Some plots are specifically designed to tell stories. There seems to be a strong relationship between data, visualization, and narrative, especially when it comes to graphic representation of time and chronology.31 The concept of mapping duration to space, and particularly to the length of a line, was first put into practice by Joseph Priestley in 1764 to compare the life spans of 2,000 famous people.32 The idea of time represented by a measurable line has evolved ever since, reflected in drawn streams, chains, trees, and now time lines. Time has become an organizing principle for diagrams. But it is also the organizing principle for narratives. Like graphs, narratives represent events in time.33 Like graphs, stories help people see.
Digital data have their own stories to tell.34 They also have their own ways of doing so. Traditionally, graphs and diagrams have been used to support narratives with envisioned concrete evidence embedded in a text sequence. This sequence usually starts with an overview of data in order to set the scene and context for their interpretation. After setting the opening scene, the creator of a graph guides the readers through its most prominent features and smoothly directs users’ attention from one point of interest to the next one. Less prominent data features, which do not add to the main story, are typically left out, as the author tries to present the data in the most convincing way to advance the line of a given argument. However, the full complexity of digital data may not be adequately represented by one leading story line.35 Rather, it calls for multiple interpretations depending on readers’ interests. Once again interactivity of graphs seems to invite and enable users to create their own story lines and paths of discovery. The plots developed by users may diverge considerably from the order suggested by authors and follow various unexpected directions. Users may remix data and reinvent the entire interpretation of a collection. A combination of a traditional narrative approach and interactive elements that foster user-driven exploration is becoming a standard in designing visual representations.
Indeed, visual communication tends to work best when the audience gets engaged in the communication process.36 Clearly, active learning radically improves comprehension of data. It also awakens users’ curiosity. The key objective for developing visualizations of digital collections should then be to inspire and actively engage users with digital content. But the users need to be familiar with visual interfaces and feel comfortable interacting with them in the first place.37
The GLAM Labs community provides support for users seeking to feel more comfortable interacting with visualizations.38 Galleries, libraries, archives, and museums (GLAM) have promoted digital content for reuse and experimentation in educational, commercial, and artistic projects. For the GLAM Labs community, digital cultural heritage is not just to contemplate, but also to fully engage with in creative ways. Accordingly, Europeana Pro, the British Library Digital Scholarship Department, the Library of Congress’s LC Labs, and the Digital Public Library of America—DPLA Pro all now offer comprehensive guidelines for how to work and innovate with digital collections.39
The development of digital cultural heritage agglomerations, like the Europeana Project that started in 2007 and the DPLA established in 2010, has paralleled the rise of the digital humanities (DH) as a new research field with its own questions and methodology. Digital collections are primary research sources for DH scholars, whereas data visualization is one of their essential methods for text and data analysis.40 Anne Burdick and her associates argue that visualization in fact provides “graphical legibility to analytical results.”41 In their view, geo-temporal visualizations and mapping allow scholars to examine complex interrelations among cultural, social, and historical phenomena.
In contrast with a traditional humanities research approach that emphasizes individual authorship, a digital approach fosters cooperation and partnership. Based on a survey of five hundred scholars, librarians, and archivists, Jessica Wagner Webster shows that there are multiple opportunities for successful collaboration among these stakeholders in regard to digital projects.42 Interestingly, the roles that these stakeholders play do not always align with expected tasks, in which DH scholars come up with research questions and interpret the results, while librarians and archivists are in charge of digital tools and their implementation. One of the benefits of interdisciplinary projects is a variety of views brought together. This is exactly where innovation begins.
As discussed earlier, graphic representation is among our most important tools for organizing data and sharing information. The process of creating effective visualizations of given data involves some important preliminary steps, including selecting visualization tools and preprocessing data compiled in a table. The first step of this preprocessing is examining the data table and getting familiar with collection data and metadata. Gaining detailed knowledge about data allows consideration of which aspects of information contained in the data might be of interest to collection users or curators.
One of the tools that may help a user to gain clear insight into collection data is OpenRefine.43 It is a free, open source software application for working with raw and messy data. OpenRefine allows for importing data in various formats, exploring large data sets, cleaning and transforming data, and also linking data sets with web services: for instance, getting geographic coordinates for addresses. OpenRefine runs on all major operating systems, including Windows, macOS, and Linux.
An OpenRefine project operates similarly to a spreadsheet or a table consisting of columns with metadata elements and rows of data. The rows can be filtered by various criteria and can also be edited. OpenRefine allows for detailed examination of the collection content and its description.
After examining data, the next step involves developing specific questions about a digital collection that compiled data may potentially address. In fact, these questions inform the initial mining of raw data and metadata. For this reason, this is the stage where interdisciplinary collaboration is particularly relevant because it brings diverse interests and questions together. Depending on the questions asked, the relevant pieces of information are extracted from the data table. These selected pieces are then closely examined for completeness and consistency. By nature all data, including metadata, tend to be messy. Therefore cleaning or tidying data is an essential prerequisite for their effective visual display.
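As a small illustration of the cleaning step, the sketch below normalizes inconsistent spellings of the same metadata value so that they are counted as one category. It is written in Python purely as an example. The helper `normalize_label` and the sample values are hypothetical, and real collections will usually need more rules than whitespace and letter case.

```python
from collections import Counter

def normalize_label(value):
    """Trim, collapse internal whitespace, and lowercase a metadata value."""
    return " ".join(value.split()).lower()

# Hypothetical "format" values as they might appear in a messy export
raw_formats = ["Letter", "letter ", "LETTER", "Photograph", "photograph", "  Map", "map"]

before = Counter(raw_formats)
after = Counter(normalize_label(v) for v in raw_formats)

print(len(before), "distinct values before cleaning")
print(len(after), "distinct values after cleaning:", dict(after))
```

Comparing the number of distinct values before and after cleaning is a quick consistency check in its own right.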
Following data cleanup, the next step, often necessary, is transforming data. Data transformation allows for obtaining defined values necessary for plotting and graphic display. For example, extracting time measures in specific units from string date representations, aggregating data points according to different categories, and applying mathematical formulas to column values may be needed to obtain well-defined sets of specific numbers. The graphs are created by subjecting clean and transformed data to plotting functions that translate numbers into their graphic representations. The final step in the visualization process is tuning graphs for clarity.
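The transformation examples named above, extracting a time unit from string dates and aggregating data points by category, might look like the following Python sketch. The regular expression, the function names `extract_year` and `items_per_decade`, and the sample date strings are assumptions made for illustration, and the text bars are only a stand-in for a plotting function.

```python
import re
from collections import Counter

# Matches the first four-digit year between 1500 and 2099 (an assumed range)
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

def extract_year(date_string):
    """Return the first recognizable year in a date string, or None."""
    match = YEAR_PATTERN.search(date_string or "")
    return int(match.group(1)) if match else None

def items_per_decade(date_strings):
    """Aggregate items by decade, skipping dates with no recognizable year."""
    years = (extract_year(d) for d in date_strings)
    return Counter((y // 10) * 10 for y in years if y is not None)

# Hypothetical date values in mixed formats, invented for illustration only
raw_dates = ["1931-05-02", "May 1934", "ca. 1948", "undated", "11/20/1950", ""]

decades = items_per_decade(raw_dates)
for decade in sorted(decades):
    # One '#' per item dated to this decade
    print(f"{decade}s {'#' * decades[decade]}")
```

Dates with no recognizable year are dropped here without notice. In practice it is worth counting them separately, since a large share of undated items is itself a finding about the collection.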
As mentioned earlier, the number of tools for visualization is continually increasing. Many of them, including IBM Many Eyes, Library of Congress Viewshare, Microsoft Excel, Tableau, D3.js, FusionCharts, Google Charts, Dygraphs, Infogram, Plotly, IBM Watson Analytics, Tableau Public, TimelineJS, StorymapJS, Google Maps, and Historypin are discussed elsewhere.44
This report focuses on the application of the R programming language and its specific packages—ggplot2 for visualization and RShiny for interactive visualization. The next chapter addresses the methodology of learning the R programming language, the general workflow for basic visualization, and an introduction to graphic representations of data that serve specific analytical tasks.
I would like to thank Robert Weyrauch for inspiration to learn and use R, Pawel Musial for his guidance and patience, Ellen Bosman for her active support for this project, and also Matthew Martinez and Tiffany Schirmer, along with all current and former team members at NMSU Library, for their tremendous work on the Tombaugh Papers Collection.