Scaling Digital Library Technology
from Research to Production
Summary
- Balance the timescale disparity between technological and educational change. Technology changes rapidly, while shifts in education policy and practice evolve at a different, and often much slower, rate.
- Provide one functioning product (the nsdl.oercommons.org web portal) for library builders and users while simultaneously integrating new technologies or products being developed by projects supported by the NSDL program.
- Incorporate standardized metadata schema and vocabularies at a time when the purpose of metadata and philosophies about its use were in constant flux because the field itself had not come to agreement.
- Plan for the sustainability of projects that were created with short-term funding.
Despite such challenges, NSDL made progress in the areas of metadata creation, automatic extraction, and the alignment between educational standards and curriculum resources.
Lessons Learned
- Technology development processes and solutions do not scale easily, especially within the context of a distributed project.
- Assumptions about the ease of integrating new technology (i.e., that plug-and-play code was a solution for all projects) were not substantiated.
- Each NSDL project had different requirements, which made communication and technology integration difficult, time consuming, and resource intensive.
Essay: Scaling Digital Library Technology from Research to Production
Introduction
At the time of NSDL's inception, other national and international organizations were building digital library prototypes and conducting research about metadata. The state-of-the-art in digital libraries advanced rapidly over the next decade and leveraged some of the infrastructure and lessons from the global technology sector. However, the vision for NSDL presented unique design requirements that distinguished it from other efforts. These requirements included creating both the NSDL.org web portal and a distributed network of integrated digital libraries, as well as building research components into the NSF's funding stream that would yield tools designed to add value to all NSDL projects.
The guiding principle of developing a central NSDL portal was somewhat eclipsed by the emergence of web search engines (e.g., Google and Yahoo), which allowed users to navigate directly to resources of interest. For that reason, NSDL needed to design and implement services and collections that added value for user groups in ways that searching the open Web could not. Additionally, the Core Integration (CI) team was charged with coordination, which included providing guidance and technical support to other NSDL projects. These requirements posed unanticipated challenges and created tensions between centralized and distributed collections and between operational and experimental capabilities, all of which played out in large and small ways throughout the decade. Pressure often existed to "go live" with research, which resulted in an experience for digital library builders (and users) somewhat analogous to building an airplane while trying to fly it.
Although NSDL faced a variety of challenges, its development resulted in research, collections, and tools that have been shared through more than 100 papers and presentations about NSDL projects at conferences and in journals in a variety of subject-specific disciplines. (See the NSDL Comprehensive Bibliography.)
The Multidimensional Integration Challenges for NSDL
Balancing Operations, Educational Priorities, and Research
At the time the NSDL program was conceived, the Web was still primarily a medium to deliver and consume resources rather than a platform for interaction. It was inevitable that the technologies and infrastructure required at NSDL's inception would be largely obsolete almost as soon as they were built. In contrast, the educational issues NSDL hoped to address were often long-standing, systemic problems. Changes in teaching and learning typically require retooling of teacher education programs, curricula, educational materials, and policies, all of which happen slowly in comparison to the pace of technological change. Thus a challenge for NSDL—one faced by any similar project—was the difference in timescales between technological and educational change. This disparity had at least two technology-related consequences:
- Developing digital tools and services was a fertile field with a steady stream of new innovations that could be applied in the NSDL context. These innovations filled the technology research space with "bright shiny objects" that distracted NSF reviewers and evaluators from the more fundamental, education-oriented needs for infrastructure.
- Infrastructure developed for some of the earlier projects—although still useful and used—became outdated, and little or no funding was allocated to refresh these systems.
Compounding the timescale disparities were conflicting priorities: new development activities often occurred simultaneously with efforts to provide a production-ready system to builders and users. Among the many resulting challenges, it was especially difficult to disseminate lessons learned at both the CI and project levels in time to dovetail with projects' development phases and their varied levels of need for CI support. This challenge grew at least in proportion to the number of projects funded by the NSDL program. (For more information, see Endnote 1: A Brief History of NSDL Infrastructure Development.)
Example: The Challenge of Integrating Technology across NSDL Projects
The Shibboleth authentication system was integral to early design requirements as a way to allow users to log in at one digital library and then be passed seamlessly to other libraries across the NSDL network without logging in again. This system would allow libraries to integrate services such as bookmarking and saving resources while also providing data about users' activities across NSDL in order to improve their experience. The Shibboleth technology was available long before most NSDL projects were ready to integrate its functionality. However, by the time a critical mass of libraries decided Shibboleth could be useful, they were unable to agree on the information that should be gathered from users at login or on data retention and privacy policies. By consensus, Shibboleth eventually was removed from the list of design requirements.
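To make the intended handoff concrete, the sketch below mimics the single sign-on pattern Shibboleth would have provided: a user's home library issues a signed, short-lived assertion at login, and any other federated library verifies that assertion instead of prompting for credentials again. This is a conceptual illustration only; real Shibboleth deployments use SAML assertions and public-key trust, and the names and shared-secret scheme here are invented for brevity.

```python
import hashlib
import hmac
import json
import time

# Hypothetical shared secret that federation members would exchange out of
# band. Real Shibboleth uses SAML and public-key trust, not HMAC; this
# sketch only illustrates the "log in once, be trusted elsewhere" handoff.
FEDERATION_KEY = b"example-shared-secret"

def issue_token(user_id: str, home_library: str) -> str:
    """Identity-provider side: sign a short-lived assertion at first login."""
    payload = json.dumps({"user": user_id, "idp": home_library,
                          "expires": time.time() + 3600})
    sig = hmac.new(FEDERATION_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def accept_token(token: str) -> dict | None:
    """Service-provider side: any federated library verifies the assertion
    instead of asking the user to log in again."""
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(FEDERATION_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # forged or corrupted assertion
    claims = json.loads(payload)
    return claims if claims["expires"] > time.time() else None  # reject stale tokens

token = issue_token("teacher42", "library-a.example.org")  # login at one library
print(accept_token(token))                                  # accepted at another
```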
Metadata and Standardization
From the inception of the NSDL program, the NSDL metadata repository needed a way to share metadata with other repositories and information providers using common descriptors. This goal endured, but a philosophical shift occurred around the purpose of metadata. Initially, metadata was regarded as a tool for search and discovery in the centralized NSDL.org portal; now it is viewed as one tool among many for giving teachers and learners a context for using STEM learning resources. Concurrently, a shift occurred in the process of search and discovery: from driving users to the NSDL.org portal for metadata discovery to putting metadata and STEM resources in the path of the user (i.e., in other websites or NSF-funded projects).
Also from the outset, NSDL emphasized standardization, especially for metadata. The level of standardization desired led to tension between those who saw standards as a requirement for interoperability (good) and those who saw them as constraints on creativity (bad). For example, requiring that resources be tagged with U.S. grade-level designations made it possible to conduct grade-specific searches. However, such tagging was irrelevant or even impossible for some collections and communities. Because of the wide differences among NSDL participants, committees and projects had difficulty agreeing on what level of detail should be required and even what standards should be used, excerpted, or adapted (e.g., IEEE LOM, Dublin Core-Ed, GEM). (We note that this conflicted view permeates the educational technology industry, not just the NSDL program, portal, or projects.)
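The grade-level tagging trade-off can be shown in a few lines of code. The records and grade bands below are invented; the point is that tagged resources become findable in grade-specific searches, while untagged collections silently disappear from the results.

```python
# Illustrative records only; not actual NSDL metadata.
records = [
    {"title": "Plate Tectonics Interactive", "grades": {"6", "7", "8"}},
    {"title": "Intro to Fourier Series",     "grades": {"13", "14"}},  # undergraduate
    {"title": "Antarctic Ice Core Archive",  "grades": set()},  # grade level inapplicable
]

def grade_search(records, grade):
    """Return only records explicitly tagged for the requested grade.
    Untagged resources drop out of the results entirely, which is the
    tension described above: tagging helps searchers but penalizes
    collections for which grade levels are irrelevant or unknowable."""
    return [r for r in records if grade in r["grades"]]

print([r["title"] for r in grade_search(records, "7")])
# ['Plate Tectonics Interactive']  (the untagged archive never surfaces)
```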
Despite these challenges, NSDL achieved promising advances in metadata creation, automatic extraction, and standards alignment. NSDL projects developed remarkable tools for these purposes, ranging from manual approaches to fully automated systems that apply natural language processing and computational linguistics. These capabilities have gained importance with the advent of the Common Core Standards, and they coincide with larger trends in data mining and data analytics. As an interdisciplinary community of practice that overlapped many standards-setting organizations, NSDL significantly influenced the broader metadata community. (For more information, see Endnote 2: The Costs and Benefits of Using Different Metadata Schemas.)
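As a toy illustration of automatic standards alignment, the sketch below scores a resource description against candidate standards using cosine similarity of word counts. The standard identifiers and text are paraphrased for illustration; the actual NSDL tools (such as the Content Alignment Tool discussed below) used far more sophisticated natural language processing.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for real linguistic features."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Paraphrased example standards, keyed by illustrative identifiers.
standards = {
    "ESS2-2": "explain how plate tectonics and erosion change earth's surface",
    "PS1-4":  "model particle motion in states of matter and phase changes",
}

resource = ("an interactive simulation of plate tectonics showing "
            "how plates shape earth's surface")
rv = vectorize(resource)
scores = {sid: cosine(rv, vectorize(desc)) for sid, desc in standards.items()}
print(max(scores, key=scores.get), scores)  # best-matching standard and all scores
```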
The work involved in creating NSDL also led to technical, procedural, and scaling advances that were reflected in two seminal digital library research papers:
- Metadata Aggregation and 'Automated Digital Libraries': A Retrospective on the NSDL Experience, Best Paper award at the 2006 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
- Representing Contextualized Information in the NSDL, Best Paper award at the 2006 European Conference on Digital Libraries (ECDL)
Examples: Aligning Resources with Educational Standards
- In 2004, the Syracuse Center for Natural Language Processing developed tools for aligning content to standards via semantic analysis, building on a standards database from JES & Co. This effort led to funding for integration of the Content Alignment Tool (from Syracuse University) with WGBH's Teachers' Domain Educational Standards Correlation tool (TD-ESC) and the Achievement Standards Network (ASN).
- Also in 2004, Eduworks Corp and the New Media Center were funded to develop a tool for creating lifelong personal collections. Results underscored the importance of automatically tagging resources with metadata as they were uploaded into a collection rather than relying on users to provide it.
- Building on previous work on standards alignment, the last-funded Pathway, ICPalms (FY2011), is creating "a widget-based portal...with embedded tools, services, content, and professional development that together aim to bridge standards, curriculum, instruction, and assessment, through collaboration and customization by individual users."
- Many groups are using the foundational NSDL metadata collection to represent NSDL resources in a context relevant to their respective audiences, including Science.gov, netTrekker, and a variety of commercial K-12 learning management systems. As of 2012, the central organization was also pursuing further work on two recent metadata standards initiatives: the Learning Resource Metadata Initiative, a project of the Association of Educational Publishers and Creative Commons, and the Learning Registry, a consortium that includes the U.S. Department of Education, the Department of Defense, non-profit organizations, and U.S. and international companies.
Scalability, Project Support, and Persistence
The initial funding model for NSDL.org and NSDL projects did not realistically address the need for persistence of collections or services constructed with short-term NSF grants. Although ongoing funding was planned for the infrastructure maintained by the CI, these plans failed to reflect the fact that costs for archiving, harvesting, and other integration functions all grew in proportion to the number of collections added. Hence, as the number of NSDL projects increased, maintenance and sustainability became an issue for CI.
To address the need for more support in the areas of maintenance and persistence, NSF added requirements that new NSDL proposals include long-term sustainability plans so that content and services would continue to be available, at least at a project level. In some cases, projects met this requirement by forming successful partnerships with professional societies or by embedding collections and services within a university library or department. The NSF Pathways initiative further codified the emphasis on project sustainability by requiring explicit plans and memoranda of understanding (MOUs) with organizations that would assume responsibility for projects' content and services once NSF funding ended.
In the final years of the NSF's NSDL program, its directors set up a mechanism to pool 15% of all project funding to support longevity (e.g., metadata migration, architecture refresh). This approach provided sustainability for the CI team and gave individual projects the moral authority to influence global technical plans while enabling a smaller group to make consistent decisions, including those that might disadvantage some projects to better support the larger community. Although this requirement came towards the end of the NSDL program and we cannot report on its long-term effects, this funding model is a solution that future projects might consider.
As the NSDL program transitions to different organizational and operational settings, it will be important to deal with archival issues to ensure that NSDL resources continue to be available and to help NSDL projects such as Pathways take full advantage of the emerging NSF cyberlearning initiatives. Collectively and individually, the various NSDL projects have amassed considerable technical knowledge about what it takes to develop and sustain a technical infrastructure to support effective communication among projects as well as a repository architecture. We hope that NSF will find ways to build on this experience in future projects.
Lessons Learned
- Technology development processes and solutions do not scale easily, especially in the context of a distributed project.
- Assumptions about the ease of integrating new technology (i.e., that plug-and-play code was a solution for all projects) were not substantiated.
- Each NSDL project required different levels of support, and this intensive communication and additional technical assistance were not initially anticipated or included in budgets at either the CI or project levels.
Endnotes
Endnote 1: A Brief History of NSDL Infrastructure Development
From 2001 to 2005, the CI created a central library infrastructure focused on accumulating a maximal set of metadata representing STEM-relevant resources. Metadata records about STEM resources were automatically harvested from a large number of existing digital collections and from the new digital collections of STEM-focused educational resources created by NSDL projects. Automatic harvesting posed significant challenges: metadata quality, formats, and vocabularies varied dramatically across collections, so it was difficult for either the CI or other NSDL projects to make use of the metadata. Based on what was learned from the initial research and implementation, much effort was devoted between 2006 and 2011 to agreeing on metadata schemas, aligning vocabularies, and cleaning up existing metadata in NSDL.org. This production-ready metadata provides the foundation for new research on metadata that pertains to the use of STEM resources, such as learning application readiness and paradata.
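Harvesting of this kind was built on the OAI-PMH protocol. The minimal sketch below issues a ListRecords request for Dublin Core metadata and extracts identifiers and titles. The endpoint URL is a placeholder; a production harvester would also handle resumption tokens, incremental (from/until) harvests, and error responses.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH data provider; substitute a real repository endpoint.
BASE_URL = "https://example.org/oai"

# Standard OAI-PMH and Dublin Core XML namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc":  "http://purl.org/dc/elements/1.1/",
}

def harvest(base_url):
    """Fetch one page of oai_dc records and yield (identifier, titles)."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for rec in tree.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        yield header.findtext("oai:identifier", namespaces=NS), titles

for identifier, titles in harvest(BASE_URL):
    print(identifier, titles)
```

Even this toy version hints at the quality problem described above: nothing in the protocol guarantees that two providers fill dc:title, subject vocabularies, or grade-level fields the same way.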
As the development of the library progressed, the CI learned that the initial database structure for the central library infrastructure could not scale to support the exponentially increasing number of metadata records. The next iteration of NSDL library infrastructure was based in part on the Fedora repository architecture, but the scale of NSDL required significant enhancements to this software, including the creation of a highly stable system backup configuration and a "network overlay" architecture. This architecture expresses relationships among resources and collections as links, potentially adding significant value for technical service providers and users. Where the earlier metadata-centric architecture supported basic search and discovery, the current network-overlay architecture allows NSDL.org to provide more context about resources. The results of research on these Fedora software enhancements were incorporated into the Fedora Commons distribution, now in use by over 300 repositories around the world.
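The contrast between a metadata-centric store and a network overlay can be sketched in a few lines: resources, collections, standards, and annotations become nodes, and relationships become explicit, traversable links that assemble context around a resource. The node names and relation types below are invented for illustration; the actual system stored such relationships in an enhanced Fedora repository.

```python
from collections import defaultdict

# Adjacency list of typed links: subject -> [(relation, object), ...].
links = defaultdict(list)

def add_link(subject: str, relation: str, obj: str) -> None:
    links[subject].append((relation, obj))

# Illustrative overlay: a collection, a resource, a standard, an annotation.
add_link("collection:earth-science", "hasMember", "resource:plate-tectonics-sim")
add_link("resource:plate-tectonics-sim", "alignsTo", "standard:ESS2-2")
add_link("review:teacher-comment-17", "annotates", "resource:plate-tectonics-sim")

def context_for(node: str) -> dict:
    """Everything directly linked to a node, in both directions: the extra
    context the overlay can surface alongside basic search results."""
    outgoing = links.get(node, [])
    incoming = [(rel, subj) for subj, rels in links.items()
                for rel, obj in rels if obj == node]
    return {"outgoing": outgoing, "incoming": incoming}

print(context_for("resource:plate-tectonics-sim"))
```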
In addition to the research outcomes resulting from work conducted by the CI, a number of software, tool, service, and application products were developed by projects funded by the NSDL program. It is difficult to identify, let alone classify, specific products separately from related collections, often because the products have become so seamlessly integrated into NSDL infrastructure. Sometimes the barriers to integration were as much human as technical, so not all products achieved uptake or broader use.
Endnote 2: The Costs and Benefits of Using Different Metadata Schemas
To illustrate further the issues associated with standardization, a number of collection development projects were funded early in NSDL to explore the costs and benefits of using different metadata schemas and vocabularies to describe their collections. Naturally, these projects valued their own priorities, meeting the needs of discipline- or age-specific user groups, more than providing metadata to the NSDL central metadata repository. To achieve uniformity during harvesting, descriptively rich metadata for specific disciplines or age ranges were simplified prior to storage and display in NSDL.org, resulting in a significant loss of information (a sketch of this "leveling" follows the list below). This simplification created a confusing experience for users and frustration for projects because their efforts were not adequately reflected on NSDL.org. User interface studies and subsequent analyses of metadata yielded some lessons learned and generated further issues:
- Granularity. The level of resources being cataloged and displayed in NSDL.org ranged from small applets to large collections. Users needed a way to distinguish among them and to understand the differences.
- Vocabulary. Each STEM discipline and age group used different terms (e.g., for subject areas or pedagogical strategies). This variance presented both a problem for users searching a centralized portal, such as NSDL.org, and a coordination challenge for individual projects.
- Ownership. Projects viewed metadata as their intellectual property and a value-added contribution that would benefit users. One major challenge became how to build trust between projects and CI while showing the value of contributing to a central portal, even though data was lost through metadata "leveling."
- Interchange. If sharing metadata with a centralized portal resulted in loss of data, then how would projects that wanted to share metadata agree on common elements among themselves?
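As promised above, here is a minimal sketch of the leveling problem: mapping a rich, discipline-specific record down to a small common schema necessarily discards fields the contributing project valued. All field names and values are illustrative, not drawn from any actual NSDL schema.

```python
# A rich record as a contributing project might have cataloged it.
RICH_RECORD = {
    "title": "Ocean Circulation Visualizer",
    "description": "Interactive model of thermohaline circulation.",
    "gradeBand": "9-12",
    "pedagogy": "inquiry-based, small-group",   # lost in leveling
    "instrument": "buoy array, CTD profiles",   # lost in leveling
    "assessmentIncluded": True,                 # lost in leveling
}

# The smaller "leveled" schema shared by the central repository.
COMMON_FIELDS = {"title", "description", "gradeBand"}

def level(record: dict, keep: set = COMMON_FIELDS) -> tuple[dict, list]:
    """Project a rich record onto the common schema, reporting what is lost."""
    kept = {k: v for k, v in record.items() if k in keep}
    dropped = sorted(set(record) - keep)
    return kept, dropped

leveled, lost = level(RICH_RECORD)
print(leveled)
print("information lost:", lost)  # the source of projects' frustration
```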