Past and Future Collaborative Expedition Workshops (3W65)
Collaborative Expedition Workshop #74, June 10, 2008, at NSF (3W66)
4201 Wilson Blvd., Room 1235 NSF, Board Room (3W67)
Draft Title: Overcoming I/O Bottlenecks in Full Data Path Processing: Intelligent, Scalable Data Management from Data Ingest to Computation Enabling Access and Discovery (3W68)
- How to RSVP, Workshop Location/Directions, and Remote Teleconferencing (3XPI)
- Print Version (3XTS)
- A. Draft Workshop Purpose (3W69)
- B. Draft Workshop Questions (3W6A)
- C. Draft Agenda (3W6C)
- D. Draft Resources (3W6D)
- E. Workshop Series Background (3W6E)
A. Draft Workshop Purpose (3W6F)
Participants will explore opportunities, including relevant developments in technology research, computer system architectures, technologies, and information processing approaches, that could help optimize effective and efficient processing and throughput of ultra-large data collections at petascale and approaching exascale. The scope is the full system data path, from data ingest through computation, enabling access and discovery by systems or people. Presentations and the ensuing workshop discussion will emphasize potential advances in computer science and engineering that enable intelligent, highly scalable, data-intensive computation, including analysis, discovery, integration, and/or visualization of ultra-large-scale data collections. (3XR0)
This workshop was jointly developed by the Emerging Technology Subcommittee of the Architecture and Infrastructure Committee, CIOC, and three Coordinating Groups ("CGs") of the Subcommittee on Networking and Information Technology Research and Development: Social, Economic and Workforce Implications of IT and IT Workforce Development CG, Human-Computer Interaction and Information Management CG, and High-End Computing CG. (3XR1)
It is likely that how we design our physical and virtual knowledge sharing environments (including scientific knowledge that will influence policy-making and innovation) will play a pivotal role in the continued vitality and creativity of our 21st century democracy. The workshop will open up dialogue to facilitate "bootstrapping" among multiple frontier communities and institutions committed to advancing civic design in the public realm, including scientific, educational, and cultural heritage institutions. It is an opportunity to understand current design challenges faced by leaders in frontier research settings, whose efforts will indelibly shape all of our cyberinfrastructure experiences in years to come. The workshop also supports information exchange among Federal Enterprise Architecture improvement activities advancing citizen-centric government in 2008, including Architecture Principles for The US Government (issued by CIO Council, effective date Aug. 24, 2007). (3XOS)
"It is probably true quite generally that in the history of human thinking the most fruitful developments frequently take place at those points where two different lines of thought meet. These lines may have their roots in quite different parts of human culture, in different times or different cultural environments or different religious traditions: hence if they actually meet, that is, if they are at least so much related to each other that a real interaction can take place, then one may hope that new and interesting developments may follow." Werner Heisenberg (3XOP)
"Creativity is a process that can be observed only at the intersection where individuals, domains, and fields interact." Csikszentmihalyi, 1999 (3XOQ)
"Architecture is the thoughtful making of space." Louis Kahn (3XOR)
Workshop planning provides an opportunity to experience shared stewardship around broad mission goals that include: (3XOT)
- To be of service, in cross-boundary settings, not only to the region, but to the nation (3XOU)
- To contribute to successful innovation toward citizen-centric government (3XOV)
- To learn by doing, to put into practice the results of our own dialogue (3XOW)
- To experience the kind of complex, multidimensional organizational situation that is providing the background for strategic leadership (3XOX)
B. Draft Workshop Questions (3W6H)
Socio-Economic questions in creative tension with the Information Lifecycle Management challenges (technical) of this workshop (3XLF)
- 1. What are the conducive conditions for the creativity and governance needed among networked scientific and scholarly communities so results and implications flow in a timely manner into science and innovation policy channels? (3XPE)
- 2. What common messages for advancing science are resonant across communities with in-depth and diverse experience with distributed collaboration, collections development, and scholarly knowledge infrastructure? (3XPD)
- 3. What are the Public Good aspects of Scientific Organizing, Knowledge Diffusion, and Innovation currently being advanced in the Public Realm? (3XOY)
- 4. What institutions and organizations have a shared mission for improved science and innovation policy as reflected in their strategic plans? (3XOZ)
- 5. What are the current and future contributions of light-weight aggregator tools for advancing discovery, shared understanding, and organizing that scales across individuals, communities of practice, and institutions? Examples in use by this workshop community include wiki namesake pages, the Emerging Technology Life-cycle process, and Strategy Markup Language (StratML). (3XP1)
- 6. How can relevant science policy and innovation stakeholders tap "build to share" principles being advanced by forward-looking information stewardship organizations, including: (3XP2)
- a) Digital data and information communities advancing sound approaches for electronically stored information. Examples include librarians, curators, web content managers, ontologists, researchers, artists, historians, data managers, and records managers. (3XP3)
- b) Open Standards bodies and consortia (3XP4)
- c) Universities and university consortia (3XP5)
- d) International stewardship associations (3XP6)
- e) Virtual organizations (3XP7)
- 7. How do we create simulations that help us strategize and act effectively during rapid change – including the need for rapid discernment (moral and ethical implications) by people representing multiple disciplines with multiple "scientific languages"? (3XP8)
- 8. What strategies are emerging to advance the public's awareness and participation in science, global virtual collections, and scholarly knowledge infrastructures? (3XP9)
- 9. How do we build from the best of past scientific research and also draw upon generational differences and cyberinfrastructure opportunities in a manner that reinforces strengths? (3XPA)
- 10. What are the emerging strategies for advancing scholarly knowledge infrastructures, collections management, and public web content with the resilience to mitigate disruptions or degradations of service over time? (3XPB)
- 11. How do we provide the right sets of information flowing into and out of science-based, mission-rehearsal simulations, etc. so the policy nuggets travel up even when the learning is experiential? (3XPC)
C. Tentative Agenda (all times should be considered as approximate and subject to pace and conduct of discussion) (3W6J)
8:30am - Check-in and Coffee (3W6K)
8:45am - Workshop Overview . [ slides ] . [ audio ] (3W6L)
Susan Turnbull, GSA, Co-chair, Emerging Technology Subcommittee, AIC Representative to DRM WG, and Co-chair, Social, Economic and Workforce Implications of IT and IT Workforce Development Working Group, Subcommittee on Networking and Information Technology Research and Development (NITRD SEW) (3W6M)
Almadena Chtchelkanova, Ph.D., Program Director, Directorate for Computer and Information Science and Engineering, Computing and Communication Foundations Division, The National Science Foundation, and member, High End Computing Research and Development Interagency Working Group, Subcommittee on Networking and Information Technology Research and Development (NITRD HEC-R&D) (3W6N)
Robert Chadduck, Computer Engineer/ Principal Technologist, Electronic Records Archive (ERA), Program Management Office, National Archives and Records Administration (NARA), & member, Human-Computer Interaction and Information Management Coordinating Group, Subcommittee on Networking and Information Technology Research and Development (NITRD HCI&IM) (3W6O)
Richard N. Spivack, Ph.D., Economist, Impact Analysis Office, Technology Innovation Program, NIST, and Co-chair, Emerging Technology Subcommittee, AIC (3XQG)
Christopher Greer, Ph.D., Director, The National Coordination Office, Networking and Information Technology Research and Development, The Executive Office of the President . [ slides ] . [ audio ] (3XO2)
9:15am – Introductions: Attendee self introductions, including brief statements of interests and questions in light of perspectives and practical operational challenges to optimize throughput in processing ultra-large scale data collections - Robert Chadduck, NARA, Moderator (3XR2)
10:00am – Panel One: Data Management Approaches Contributing to Optimize I/O in Full Data Path Processing . [ audio ] (3XR3)
Robert Chadduck, NARA, Moderator (3XR4)
Dr. Michael Folk, Ph.D., Director, The HDF Group (“THG”) - HDF5 Experiences with I/O Bottlenecks . [ slides ] . [ audio ] (3XR5)
Abstract: I/O bottlenecks are an important consideration driving the design of software and formats for managing science and engineering data. HDF5, a widely used format for managing large and/or complex data objects and collections, must deal with many different kinds of bottlenecks, which can be characterized as four general types: architectural bottlenecks, bottlenecks due to characteristics of data and information objects, bottlenecks in accessing and operating on objects, and bottlenecks in usability and accessibility. We will describe these types of bottlenecks, then describe some ways in which the HDF5 format and software, as well as the HDF Group, are able to address them. (3XTR)
Dr. David Du, Ph.D., Program Director, Directorate for Computer and Information Science and Engineering, Division of Computer and Network Systems, The National Science Foundation, Long term End-to-End Security, Privacy, and Provenance . [ slides ] . [ audio ] (3XR7)
Louis Reich, NASA Goddard Space Flight Center / Computer Sciences Corporation - Research Findings Concerning the Utility and Scalability of the XFDU and Related Technologies in the Packaging and Validation of Very Large Digital Information Products . [ slides ] . [ audio ] (3XU2)
11:30am – Lunch (3W6S)
12:30pm – Panel Two: Mass Storage Systems & Technologies Interests and Research (3W6T)
Dr. Reagan Moore, Ph.D., Director, Data Intensive Cyberinfrastructure Environments Groups, The University of California, San Diego, Managing massive data collections, Moderator . [ slides ] . [ audio ] (3XS5)
Abstract: Modern data management systems support the organization of distributed data sets into shared collections. However, as the collection size increases to the petabyte level and the number of files increases to hundreds of millions, administrative tasks become onerous. Rule-based data management systems automate the execution of administrative tasks, minimizing the effort required to manage massive data collections. Rule-based systems can also be used to validate assessment criteria, demonstrating compliance of the data management system with management policies. The iRODS data grid technology casts management policies as rules that control the execution of procedures. The procedures are cast as sets of micro-services that are executed directly at the remote storage location where the file resides. This approach ensures that management policies are enforced no matter which client is used to access the shared collection. (3XS6)
Policy Virtualization using Rule-based Data Grids, Arcot Rajasekar, Mike Wan, Wayne Schroeder, Reagan Moore; University of California, San Diego; (sekar,mwan,schroede,moore@sdsc.edu) (3XS7)
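The idea in the abstract above, policies expressed as rules that trigger chains of small procedures, can be illustrated with a minimal Python sketch. This is not the iRODS rule language or API; the names `Rule`, `RuleEngine`, `compute_checksum`, `add_replica`, and `build_example_engine` are hypothetical and chosen only for illustration, and a file is assumed to be represented by a plain dictionary.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List

# A micro-service is modeled here as a small function that updates one file record.
MicroService = Callable[[dict], None]


def compute_checksum(record: dict) -> None:
    """Hypothetical micro-service: store a SHA-256 checksum of the content."""
    record["checksum"] = hashlib.sha256(record["content"]).hexdigest()


def add_replica(record: dict) -> None:
    """Hypothetical micro-service: count one more replica of the file."""
    record["replicas"] = record.get("replicas", 1) + 1


@dataclass
class Rule:
    event: str                          # event that triggers the rule, e.g. "ingest"
    condition: Callable[[dict], bool]   # policy test applied to the file record
    actions: List[MicroService]         # micro-services run in order when it matches


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def fire(self, event: str, record: dict) -> List[str]:
        """Run every matching rule and return the names of the actions applied."""
        applied = []
        for rule in self.rules:
            if rule.event == event and rule.condition(record):
                for action in rule.actions:
                    action(record)
                    applied.append(action.__name__)
        return applied


def build_example_engine() -> RuleEngine:
    """Example policy set (assumed): checksum everything, replicate archive files."""
    engine = RuleEngine()
    engine.add_rule(Rule("ingest", lambda r: True, [compute_checksum]))
    engine.add_rule(
        Rule("ingest", lambda r: r.get("collection") == "archive", [add_replica])
    )
    return engine
```

Because the engine sits with the stored record rather than in any one client, every access path that raises the `ingest` event gets the same policy applied, which is the property the abstract emphasizes.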
Michelle Butler, Technical Program Manager, Storage Enabling Technologies Group, The National Center for Supercomputing Applications ("NCSA"), "Blue Waters" Project . [ slides ] . [ audio ] (3XR9)
Paul Nowoczynski, Advanced Data Management Specialist, The Pittsburgh Supercomputing Center, "Petascale Storage Systems Research" . [ slides ] . [ audio ] (3XT4)
Dr. Ethan Miller, Ph.D., Associate Professor, The Department of Computer Science, The University of California, Santa Cruz (UCSC), Search and Indexing for Petabyte-scale Storage and Beyond . [ slides ] . [ audio ] (3XTT)
Abstract: Storage systems are growing to encompass many petabytes of data contained in billions of files. While it may be possible to store such data, it is becoming ever more difficult to actually find anything in such a large storage system. Existing tools are often resource-intensive (web-scale search), and make it difficult (if not impossible) to browse the file system or find files with specific characteristics, let alone mine the store for data. (3XU9)
- Major research issues include: (3XTU)
- building indexes for petabyte-scale storage (3XTV)
- scalable search mechanisms (3XTW)
- data mining on large scale storage, including potentially running code at the data (3XTX)
- extraction of indexing terms from non-textual documents such as scientific data sets (3XTY)
- incorporating provenance and other contextual information into indexes (3XTZ)
- answering search queries accurately and precisely, particularly in systems with billions of files (3XU0)
- the interaction of indexing and power management for large-scale low-power (archival) storage. (3XU1)
Dr. Martin Swany, Professor, Department of Computer and Information Sciences, University of Delaware, Logistical Networking: Buffering in the Network . [ slides ] . [ audio ] (3XVL)
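As a small illustration of the first research issues listed above (building an index and finding files with specific characteristics), the following Python sketch keeps an in-memory inverted index over file attributes. It is a toy under stated assumptions, not a design from the talk: the class name `MetadataIndex`, the attribute names, and the file paths are hypothetical, and a petabyte-scale system would need a partitioned, persistent index rather than one dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class MetadataIndex:
    """Toy inverted index: maps (attribute, value) pairs to sets of file paths."""

    def __init__(self) -> None:
        self._postings: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, path: str, attributes: Dict[str, str]) -> None:
        # Each attribute/value pair becomes one posting list entry for the file.
        for key, value in attributes.items():
            self._postings[(key, value)].add(path)

    def search(self, **criteria: str) -> List[str]:
        """Return paths matching ALL given attribute=value criteria, sorted."""
        if not criteria:
            return []
        # .get avoids creating empty posting lists for unknown pairs.
        matches = [self._postings.get(item, set()) for item in criteria.items()]
        return sorted(set.intersection(*matches))


def build_example_index() -> MetadataIndex:
    """Example contents (made up) used to exercise the index."""
    index = MetadataIndex()
    index.add("/proj/climate/run1.h5", {"owner": "alice", "type": "hdf5"})
    index.add("/proj/climate/notes.txt", {"owner": "alice", "type": "text"})
    index.add("/proj/fusion/run7.h5", {"owner": "bob", "type": "hdf5"})
    return index
```

A query such as `build_example_index().search(owner="alice", type="hdf5")` answers from the posting lists alone, without walking the file system, which is the point of indexing when the store holds billions of files.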
2:00pm – Panel Three: File Systems and I/O Interests and Research (3XLC)
Dr. Gary Grider, Ph.D., Los Alamos National Laboratory, Moderator and Highlights of File Systems and I/O Research and Implications for Information Lifecycle Management (ILM) Challenges . [ slides ] . [ audio ] (3XRC)
Abstract: The High End Computing Interagency Working Group (HEC IWG) is chartered with coordinating US Government investments in Research and Development (R&D) for HEC. The HEC FSIO Technical Advisory Group (TAG) is chartered with providing guidance to the HEC IWG in the area of File Systems and I/O (FSIO). The importance of HEC FSIO, along with its research needs and priorities, will be discussed. (3XS8)
Dr. Garth Gibson, Ph.D., Carnegie Mellon University, Failure in Supercomputers and Supercomputer Storage . [ slides ] . [ audio ] (3XRD)
Abstract: The largest computer systems have entered the era of Peta operations per second and will climb to Exa operations per second over the next decade, largely on the strength of more cores per chip and more chips per system. The inevitable consequence of increasing component counts is more parts that can fail, higher failure rates, more concurrent failures and more effort devoted to coping with and recovering from failures -- a key role for storage systems. In this talk I will review historical data on failure rates in supercomputers to project future failure rates, review growing limitations on traditional fault tolerance strategies for supercomputers based on high-speed checkpointing to parallel storage systems, and address the increasing failure issues in storage components. (3XVU)
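The relationship between component counts, failure rates, and checkpointing described above can be made concrete with two textbook first-order formulas. This sketch is not the speaker's data or projection method: it assumes independent components with identical, exponentially distributed failures, so system MTBF is component MTBF divided by component count, and it uses Young's approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF). All numbers passed in are illustrative inputs, not measurements.

```python
import math


def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """System MTBF under the assumption of independent, identical components."""
    return component_mtbf_hours / component_count


def young_checkpoint_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the optimal time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def checkpoint_overhead(checkpoint_cost_hours: float, interval_hours: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints (ignores rework after failure)."""
    return checkpoint_cost_hours / (interval_hours + checkpoint_cost_hours)


if __name__ == "__main__":
    # Illustrative inputs only: per-component MTBF and checkpoint cost are assumptions.
    component_mtbf = 100_000.0   # hours
    checkpoint_cost = 0.5        # hours to write one checkpoint
    for count in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(component_mtbf, count)
        interval = young_checkpoint_interval(checkpoint_cost, mtbf)
        overhead = checkpoint_overhead(checkpoint_cost, interval)
        print(f"{count:>7} components: MTBF {mtbf:.1f} h, "
              f"interval {interval:.2f} h, overhead {overhead:.1%}")
```

Under these assumptions, holding checkpoint cost fixed while the component count grows shrinks the interval and raises the share of time spent checkpointing, which is one way to see the "growing limitations" on checkpoint-based fault tolerance that the abstract refers to.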
Henry Newman, Instrumental, Inc., Emerging Role of Standards in Information Lifecycle Management (ILM) . [ slides ] . [ audio ] (3XRE)
Abstract: The data path (the path from the application to the storage device) is fractured between many different standards bodies and has changed little over the last 20 years. During this same period the requirements for data lifecycle management have exploded, forcing the development of user space applications to manage a broad set of ILM requirements, from Sarbanes-Oxley, to the Health Insurance Portability and Accountability Act (HIPAA), to preservation archive, to High Performance Computing archive, and many more. The limitations in the data path need to be reviewed, and a new paradigm for data management needs to be considered: standardizing the requirements and moving them from user space applications into a standards-based framework, so that data can migrate to new systems without these complex user space applications. (3XS4)
Bob Rogers, Chief Technology Officer, Application Matrix, LLC and ISM SNIA working group, Challenges and Opportunities of Information Lifecycle Management (ILM) . [ slides ] . [ audio ] (3XRF)
Abstract: Information Lifecycle Management (ILM) does not come in a box; nor is that likely any time in the near future. The automation of information management policies regarding protection, security, and retention requires a different way of thinking about information, its value to the organization, and the business needs for the use of the information. This presentation will touch on current issues and opportunities and suggest technologies poised to benefit from ILM tools and techniques. (3XTI)
3:30pm – Open Discussion: Opportunities for Synergy in Next Steps, Including Potential Commonalities in Technologies or Approaches in Response to “Hard Problems” to Optimize Full Data Path Throughput (3XRG)
4:00pm – Poster Sessions Led by University Researchers: HEC-URA FSIO research program, mass storage technologies, data management approaches, etc. (3XRH)
Garth Gibson, Carnegie Mellon University, Petascale Data Management: Guided by Measurement (3XRI)
Matthew Wolf, The Georgia Institute of Technology, Managed Streams: Scalable I/O for Petascale Computing (3XVM)
Abstract: By overlapping communication and computation, asynchronous methods can reduce the impact of I/O on the running time of data-intensive HPC codes. DataTap is such a mechanism, comprised of a lightweight application library coupled with platform services that manage the resulting structured stream I/O requests. This poster reports on a novel state-aware mechanism, managed streams, for controlling how and when asynchronous background data transfers are carried out on large-scale HPC machines. State-awareness is shown to dramatically reduce the perturbation experienced by applications for realistic petascale codes, with experimental results attained on the ORNL Cray XT3/XT4 machine running fusion science application code. (3XVN)
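The general technique named in the first sentence of this abstract, overlapping output with computation, can be sketched in a few lines of Python. This is not DataTap and does not implement the state-aware managed streams mechanism; the names `AsyncWriter` and `run_steps` are hypothetical, and the bounded queue is simply one assumed way to limit how many buffers are in flight.

```python
import queue
import threading


class AsyncWriter:
    """Hands output chunks to a background thread so the caller can keep computing."""

    def __init__(self, sink, max_pending: int = 4) -> None:
        self._sink = sink                               # any object with write(bytes)
        self._queue = queue.Queue(maxsize=max_pending)  # bounds in-flight buffers
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, chunk: bytes) -> None:
        # Returns immediately unless max_pending chunks are already waiting.
        self._queue.put(chunk)

    def _drain(self) -> None:
        # Single consumer: chunks reach the sink in the order they were submitted.
        while True:
            chunk = self._queue.get()
            if chunk is None:        # sentinel posted by close()
                break
            self._sink.write(chunk)

    def close(self) -> None:
        """Flush everything still queued, then stop the background thread."""
        self._queue.put(None)
        self._thread.join()


def run_steps(num_steps: int, sink) -> None:
    """Stand-in simulation loop: compute a step, hand its output off, continue."""
    writer = AsyncWriter(sink)
    for step in range(num_steps):
        result = sum(i * i for i in range(1000)) + step   # placeholder computation
        writer.submit(f"step {step}: {result}\n".encode())
    writer.close()
```

With a real file or network sink, the blocking write happens in the background thread while the main loop moves on to the next step. Deciding when those background transfers should run so they do not perturb the application is the part the poster's managed streams address, and it is outside this sketch.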
Phil Carns, Argonne National Laboratory (3XRK)
Walt Ligon, Clemson University (3XRL)
Pete Wyckoff, The Ohio Supercomputer Center (3XRM)
Alok Choudhary, Northwestern University (3XRO)
Remzi Arpaci-Dusseau, University of Wisconsin (3XRP)
Xian-He Sun, The Illinois Institute of Technology (3XRQ)
Tzi-cker Chiueh, The State University of New York at Stony Brook (3XRS)
Julio Lopez, Carnegie Mellon University, Data-Intensive Computing Research at CMU (3XVO)
Abstract: Various research initiatives at Carnegie Mellon are aimed at enabling scalable computation on massive datasets. For example, the DISC project is developing new capabilities to facilitate data analytics in science applications. The Computational Databases (CoDs) project is exploring the use of database techniques in the context of scientific computing. The Tashi project is being developed as a free software infrastructure for managing data-intensive computations in shared clusters. Self-* Storage is being explored to increase automation in large-scale cluster-based storage. (3XVP)
Xiaodong Zhang, The Ohio State University (3XRT)
Mahmut Kandemir, Pennsylvania State University (3XRU)
Scott Brandt, The University of California, Santa Cruz (3XRV)
Paul Nowoczynski & Jared Yanovich; The Pittsburgh Supercomputing Center; Results and Developments in Scalable Lightweight Storage Hierarchies (3XRW)
Yong Chen; Department of Computer Science, The Illinois Institute of Technology; Server-Push Architecture for Improving I/O Access Performance (3XRX)
Nawab Ali, Department of Computer Science, The Ohio State University; Redesigning Parallel File Systems Using Object-based Storage Devices (3XRY)
Xin Li, University of Rochester, Reference-Driven Performance Anomaly Identification (3XS0)
5:30pm or 6:00pm – Adjourn (3XRZ)
D. Draft Resources (3W6Y)
- Open Archival Information System Reference Model, Jan. 2004 (3XU5)
- Open Archival Information System, CCSDS, Jan. 2002 (3XU6)
- draft XML Formatted Data Unit and Construction Rules, February, 2008 (3XU7)
- Modeling and Simulation at the Exascale for Energy and Environment, Department of Energy, Office of Science (3W6Z)
- Open Science Grid (3XUA)
- Earth System Grid (3XUB)
- Center for Enabling Distributed Petascale Science (3XUC)
- DOE ASCR SciDAC Petascale Data Storage Institute, http://www.pdsi-scidac.org (3XVQ)
- Carnegie Mellon University Parallel Data Laboratory, http://www.pdl.cmu.edu (3XVR)
- IETF Parallel NFS (NFSv4.1), http://www.ietf.org/html.charters/nfsv4-charter.html and http://www.pnfs.com (3XVS)
- The Computer Failure Data Repository, http://cfdr.usenix.org (3XVT)
- 25th IEEE Symposium on Massive Storage Systems and Technologies, http://storageconference.org (3Z39)
E. Workshop Series Background (3W70)
Purpose and Audience: GSA's USA Services/ Intergovernmental leads monthly Collaborative Expedition workshops to advance the quality of citizen-government dialogue and collaborations at the crossroads of intergovernmental initiatives, Communities of Practice, Federal IT research and IT user agencies. The workshops seek to advance collaborative innovations in government and community services such as emergency preparedness, environmental monitoring, healthcare and law enforcement. (3W71)
The workshops serve individuals from government, business, and non-government organizations to practice an emerging societal form, Intergovernmental Communities of Practice (CoPs), in light of the Citizen-Centric Government goal of the President’s Management Agenda and the Public Information Access provisions of the E-government Act of 2002. (3W72)
Each workshop organizes participation around a common purpose, larger than any institution, including government. By learning how to appreciate multiple perspectives around potentials and realities of this larger “purpose”, subsequent actions by individuals representing many forms of expertise can be better expressed in their home and collaborative settings. By centering around people and the "whole system" challenges they organize around, IT design and development processes can mature with less risk and greater national yield of breakthrough performance. (3W73)
Joint workshop sponsors, in addition to GSA, include the Emerging Technology Subcommittee of the Architecture and Infrastructure Committee and Coordinating Groups of the Subcommittee on Networking and Information Technology Research and Development, including the Social, Economic and Workforce Implications of IT and IT Workforce Development CG, High End Computing CG, High Confidence Software and Systems CG, Software Design and Productivity CG, and Human-Computer Interaction and Information Management CG. These organizations value this “frontier outpost” to open up quality conversations, augmented by information technology, to leverage the collaborative capacity of united, but diverse sectors of society, seeking to discover, frame, and act on national potentials. (3W74)