Past and Future Collaborative Expedition Workshops    (3W65)

Collaborative Expedition Workshop #74, June 10, 2008, at NSF    (3W66)

4201 Wilson Blvd., Room 1235 NSF, Board Room    (3W67)

Draft Title: Overcoming I/O Bottlenecks in Full Data Path Processing: Intelligent, Scalable Data Management from Data Ingest to Computation Enabling Access and Discovery    (3W68)

A. Draft Workshop Purpose    (3W6F)

Participants will explore opportunities, including relevant developments in technology research, in computer system architectures, technologies, and information processing approaches that could make petascale (and eventually exascale) processing and throughput of ultra-large data collections more effective and efficient across the full processing data path, from data ingest through computation to access and discovery by systems or people. Presentations and the ensuing workshop discussion will emphasize potential advances in computer science and engineering that enable intelligent, highly scalable, data-intensive computation, including analysis, discovery, integration, and/or visualization of ultra-large-scale data collections.    (3XR0)

This workshop was jointly developed by the Emerging Technology Subcommittee of the Architecture and Infrastructure Committee, CIOC, and three Coordinating Groups (“CGs”) of the Subcommittee on Networking and Information Technology Research and Development: the Social, Economic and Workforce Implications of IT and IT Workforce Development CG, the Human-Computer Interaction and Information Management CG, and the High-End Computing CG.    (3XR1)

It is likely that how we design our physical and virtual knowledge sharing environments (including scientific knowledge that will influence policy-making and innovation) will play a pivotal role in the continued vitality and creativity of our 21st century democracy. The workshop will open up dialogue to facilitate "bootstrapping" among multiple frontier communities and institutions committed to advancing civic design in the public realm, including scientific, educational, and cultural heritage institutions. It is an opportunity to understand current design challenges faced by leaders in frontier research settings, whose efforts will indelibly shape all of our cyberinfrastructure experiences in years to come. The workshop also supports information exchange among Federal Enterprise Architecture improvement activities advancing citizen-centric government in 2008, including Architecture Principles for The US Government (issued by CIO Council, effective date Aug. 24, 2007).    (3XOS)

"It is probably true quite generally that in the history of human thinking the most fruitful developments frequently take place at those points where two different lines of thought meet. These lines may have their roots in quite different parts of human culture, in different times or different cultural environments or different religious traditions: hence if they actually meet, that is, if they are at least so much related to each other that a real interaction can take place, then one may hope that new and interesting developments may follow." Werner Heisenberg    (3XOP)

"Creativity is a process that can be observed only at the intersection where individuals, domains, and fields interact." Csikszentmihalyi, 1999    (3XOQ)

"Architecture is the thoughtful making of space." Louis Kahn    (3XOR)

Workshop planning provides an opportunity to experience shared stewardship around broad mission goals that include:    (3XOT)

B. Draft Workshop Questions    (3W6H)

Socio-Economic questions in creative tension with the Information Lifecycle Management challenges (technical) of this workshop    (3XLF)

C. Tentative Agenda (all times should be considered as approximate and subject to pace and conduct of discussion)    (3W6J)

8:30am - Check-in and Coffee    (3W6K)

8:45am - Workshop Overview . [ slides ] . [ audio ]    (3W6L)

Susan Turnbull, GSA, Co-chair, Emerging Technology Subcommittee, AIC Representative to DRM WG, and Co-chair, Social, Economic and Workforce Implications of IT and IT Workforce Development Working Group, Subcommittee on Networking and Information Technology Research and Development (NITRD SEW)    (3W6M)

Almadena Chtchelkanova, Ph.D., Program Director, Directorate for Computer and Information Science and Engineering, Computing and Communication Foundations Division, The National Science Foundation, and member, High End Computing Research and Development Interagency Working Group, Subcommittee on Networking and Information Technology Research and Development (NITRD HEC-R&D)    (3W6N)

Robert Chadduck, Computer Engineer/Principal Technologist, Electronic Records Archive (ERA), Program Management Office, National Archives and Records Administration (NARA), and member, Human-Computer Interaction and Information Management Coordinating Group, Subcommittee on Networking and Information Technology Research and Development (NITRD HCI&IM)    (3W6O)

Richard N. Spivack, Ph.D., Economist, Impact Analysis Office, Technology Innovation Program, NIST, and Co-chair, Emerging Technology Subcommittee, AIC    (3XQG)

9:00am - Welcome    (3XO1)

Christopher Greer, Ph.D., Director, The National Coordination Office, Networking and Information Technology Research and Development, The Executive Office of the President . [ slides ] . [ audio ]    (3XO2)

9:15am - Introductions: Attendee self-introductions, including brief statements of interests and questions in light of perspectives and practical operational challenges in optimizing throughput when processing ultra-large-scale data collections - Robert Chadduck, NARA, Moderator    (3XR2)

10:00am - Panel One: Data Management Approaches Contributing to Optimize I/O in Full Data Path Processing . [ audio ]    (3XR3)

Robert Chadduck, NARA, Moderator    (3XR4)

Dr. Michael Folk, Ph.D., Director, The HDF Group (“THG”) - HDF5 Experiences with I/O Bottlenecks . [ slides ] . [ audio ]    (3XR5)

Abstract: I/O bottlenecks are an important consideration driving the design of software and formats for managing science and engineering data. HDF5, a widely used format for managing large and/or complex data objects and collections, must deal with many different kinds of bottlenecks, which can be characterized as four general types: architectural bottlenecks, bottlenecks due to characteristics of data and information objects, bottlenecks in accessing and operating on objects, and bottlenecks in usability and accessibility. We will describe these types of bottlenecks, then describe some ways in which the HDF5 format and software, as well as the HDF Group, are able to address them.    (3XTR)

Dr. David Du, Ph.D., Program Director, Directorate for Computer and Information Science and Engineering, Division of Computer and Network Systems, The National Science Foundation, Long term End-to-End Security, Privacy, and Provenance . [ slides ] . [ audio ]    (3XR7)

11:30am - Lunch    (3W6S)

12:30pm - Panel Two: Mass Storage Systems & Technologies Interests and Research    (3W6T)

Dr. Reagan Moore, Ph.D., Director, Data Intensive Cyberinfrastructure Environments Groups, The University of California, San Diego, Managing massive data collections, Moderator . [ slides ] . [ audio ]    (3XS5)

Abstract: Modern data management systems support the organization of distributed data sets into shared collections. However, as the collection size increases to the petabyte level and the number of files increases to hundreds of millions, administrative tasks become onerous. Rule-based data management systems automate the execution of administrative tasks, minimizing the effort required to manage massive data collections. Rule-based systems can also be used to validate assessment criteria, demonstrating compliance of the data management system with management policies. The iRODS data grid technology casts management policies as rules that control the execution of procedures. The procedures are cast as sets of micro-services that are executed directly at the remote storage location where the file resides. This approach ensures that management policies are enforced no matter which client is used to access the shared collection.    (3XS6)
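
The rule-based approach described in this abstract can be illustrated with a minimal Python sketch. This is not the iRODS API or its rule language: the `RuleEngine` class, the `on_ingest` event name, the two micro-service functions, and the `archive-site` location are all hypothetical names chosen for illustration. The sketch only shows the general idea that policies are expressed as rules, that each rule triggers an ordered set of small procedures, and that the rules run on the server side so every client is subject to them.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A micro-service is modeled here as a function that acts on a file record.
MicroService = Callable[[dict], None]


@dataclass
class RuleEngine:
    """Hypothetical server-side engine mapping an event to ordered micro-services."""
    rules: Dict[str, List[MicroService]] = field(default_factory=dict)

    def add_rule(self, event: str, *services: MicroService) -> None:
        self.rules.setdefault(event, []).extend(services)

    def fire(self, event: str, record: dict) -> dict:
        # Every request passes through this method, so the policy is applied
        # regardless of which client issued the request.
        for service in self.rules.get(event, []):
            service(record)
        return record


def compute_checksum(record: dict) -> None:
    # Micro-service: record a SHA-256 digest for later integrity checks.
    record["checksum"] = hashlib.sha256(record["data"]).hexdigest()


def schedule_replica(record: dict) -> None:
    # Micro-service: note a replica target ("archive-site" is a made-up name).
    record.setdefault("replicas", []).append("archive-site")


engine = RuleEngine()
engine.add_rule("on_ingest", compute_checksum, schedule_replica)

stored = engine.fire("on_ingest", {"name": "run42.dat", "data": b"abc"})
print(stored["checksum"], stored["replicas"])
```

Keeping each micro-service small and side-effect-focused is a design choice of this sketch; it makes a policy readable as a list of steps and lets an auditor check compliance by inspecting the attributes each step leaves on the record.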

Policy Virtualization using Rule-based Data Grids, Arcot Rajasekar, Mike Wan, Wayne Schroeder, Reagan Moore; University of California, San Diego; (sekar,mwan,schroede,    (3XS7)

Michelle Butler, Technical Program Manager, Storage Enabling Technologies Group, The National Center for Supercomputing Applications ("NCSA"), "Blue Waters" Project . [ slides ] . [ audio ]    (3XR9)

Paul Nowoczynski, Advanced Data Management Specialist, The Pittsburgh Supercomputing Center, "Petascale Storage Systems Research" . [ slides ] . [ audio ]    (3XT4)

Dr. Ethan Miller, Ph.D., Associate Professor, The Department of Computer Science, The University of California, Santa Cruz (UCSC), Search and Indexing for Petabyte-scale Storage and Beyond . [ slides ] . [ audio ]    (3XTT)

Abstract: Storage systems are growing to encompass many petabytes of data contained in billions of files. While it may be possible to store such data, it is becoming ever more difficult to actually find anything in such a large storage system. Existing tools are often resource-intensive (web-scale search) and make it difficult (if not impossible) to browse the file system or find files with specific characteristics, let alone mine the store for data.    (3XU9)

Dr. Martin Swany, Professor, Department of Computer and Information Sciences, University of Delaware, Logistical Networking: Buffering in the Network . [ slides ] . [ audio ]    (3XVL)

2:00pm - Panel Three: File Systems and I/O Interests and Research    (3XLC)

Abstract: The High End Computing Interagency Working Group (HEC IWG) is chartered with coordinating US Government investments in Research and Development (R&D) for HEC. The HEC FSIO Technical Advisory Group (TAG) is chartered with providing guidance to the HEC IWG in the area of File Systems and I/O (FSIO). The importance of HEC FSIO, along with its research needs and priorities, will be discussed.    (3XS8)

Dr. Garth Gibson, Ph.D., Carnegie Mellon University, Failure in Supercomputers and Supercomputer Storage . [ slides ] . [ audio ]    (3XRD)

Abstract: The largest computer systems have entered the era of Peta operations per second and will climb to Exa operations per second over the next decade, largely on the strength of more cores per chip and more chips per system. The inevitable consequence of increasing component counts is more parts that can fail, higher failure rates, more concurrent failures and more effort devoted to coping with and recovering from failures -- a key role for storage systems. In this talk I will review historical data on failure rates in supercomputers to project future failure rates, review growing limitations on traditional fault tolerance strategies for supercomputers based on high-speed checkpointing to parallel storage systems, and address the increasing failure issues in storage components.    (3XVU)
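
The scaling argument in this abstract can be made concrete with a small worked calculation in Python. The numbers below are illustrative assumptions, not data from the talk: a per-component mean time between failures (MTBF) of 1,000,000 hours, component counts of 10,000 and 100,000, and a checkpoint cost of 0.5 hours. The sketch also assumes independent components with constant failure rates, and it uses Young's first-order approximation for the checkpoint interval, which is a standard textbook model and not necessarily the one used by the speaker.

```python
import math


def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    # With independent components and constant failure rates, the rates add,
    # so system MTBF falls inversely with the number of components.
    return component_mtbf_hours / n_components


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    # Young's first-order approximation of the checkpoint interval
    # that minimizes lost work: sqrt(2 * checkpoint_cost * MTBF).
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def checkpoint_overhead(checkpoint_cost_hours: float, interval_hours: float) -> float:
    # Fraction of wall-clock time spent writing checkpoints (ignores rework).
    return checkpoint_cost_hours / (interval_hours + checkpoint_cost_hours)


COMPONENT_MTBF = 1_000_000.0   # hours, assumed for illustration
CHECKPOINT_COST = 0.5          # hours per checkpoint, assumed for illustration

small = system_mtbf_hours(COMPONENT_MTBF, 10_000)    # 100 hours
large = system_mtbf_hours(COMPONENT_MTBF, 100_000)   # 10 hours

interval_small = young_interval_hours(CHECKPOINT_COST, small)
interval_large = young_interval_hours(CHECKPOINT_COST, large)

overhead_small = checkpoint_overhead(CHECKPOINT_COST, interval_small)
overhead_large = checkpoint_overhead(CHECKPOINT_COST, interval_large)

print(f"10,000 parts:  MTBF {small:.0f} h, interval {interval_small:.2f} h, "
      f"overhead {overhead_small:.1%}")
print(f"100,000 parts: MTBF {large:.0f} h, interval {interval_large:.2f} h, "
      f"overhead {overhead_large:.1%}")
```

Under these assumptions, a tenfold increase in component count cuts system MTBF tenfold and pushes the checkpoint share of run time up sharply unless checkpoint cost falls as well, which is the pressure on storage bandwidth that the abstract refers to.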

Abstract: The data path (the path from the application to the storage device) is fractured between many different standards bodies and has changed little over the last 20 years. During this same period, the requirements for data lifecycle management have exploded, forcing the development of user space applications to manage a broad set of ILM requirements, from Sarbanes-Oxley, to the Health Insurance Portability and Accountability Act (HIPAA), to preservation archives, to High Performance Computing archives, and many more. The limitations in the data path need to be reviewed, and a new paradigm for data management needs to be considered: standardizing the requirements and moving them from user space applications into a standards-based framework, so that data can migrate to new systems without these complex user space applications.    (3XS4)

Bob Rogers, Chief Technology Officer, Application Matrix, LLC and ISM SNIA working group, Challenges and Opportunities of Information Lifecycle Management (ILM) . [ slides ] . [ audio ]    (3XRF)

Abstract: Information Lifecycle Management (ILM) does not come in a box, nor is that likely any time in the near future. The automation of information management policies regarding protection, security, and retention requires a different way of thinking about information, its value to the organization, and the business needs for the use of the information. This presentation will touch on current issues and opportunities and suggest technologies poised to benefit from ILM tools and techniques.    (3XTI)

3:30pm - Open Discussion: Opportunities for Synergy in Next Steps, Including Potential Commonalities in Technologies or Approaches in Response to “Hard Problems” to Optimize Full Data Path Throughput    (3XRG)

4:00pm - Poster Sessions Led by University Researchers: HEC-URA FSIO research program, mass storage technologies, data management approaches, etc.    (3XRH)

Garth Gibson, Carnegie Mellon University, Petascale Data Management: Guided by Measurement    (3XRI)

Matthew Wolf, The Georgia Institute of Technology, Managed Streams: Scalable I/O for Petascale Computing    (3XVM)

Abstract: By overlapping communication and computation, asynchronous methods can reduce the impact of I/O on the running time of data-intensive HPC codes. DataTap is such a mechanism, comprised of a lightweight application library coupled with platform services that manage the resulting structured stream I/O requests. This poster reports on a novel state-aware mechanism, managed streams, for controlling how and when asynchronous background data transfers are carried out on large-scale HPC machines. State-awareness is shown to dramatically reduce the perturbation experienced by applications for realistic petascale codes, with experimental results attained on the ORNL Cray XT3/XT4 machine running fusion science application code.    (3XVN)
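
The general idea of overlapping computation with asynchronous output can be sketched in a few lines of Python. This is not DataTap and does not model the state-aware scheduling described in the poster: it is a plain producer/consumer pattern in which a background thread drains a bounded queue while the main loop keeps computing. The function names, the queue depth of 4, and the in-memory `sink` list standing in for storage are all assumptions made for illustration.

```python
import queue
import threading


def writer(q: queue.Queue, sink: list) -> None:
    # Background thread: drains buffered results while the main thread computes.
    while True:
        item = q.get()
        if item is None:       # sentinel: the producer has finished
            break
        sink.append(item)      # stand-in for a slow write to storage


def run(steps: int, depth: int = 4) -> list:
    # A bounded queue applies back-pressure: if the writer falls behind by
    # more than `depth` items, the compute loop blocks instead of using
    # unbounded memory.
    q: queue.Queue = queue.Queue(maxsize=depth)
    sink: list = []
    t = threading.Thread(target=writer, args=(q, sink))
    t.start()
    for step in range(steps):
        result = step * step   # stand-in for one step of computation
        q.put((step, result))  # returns as soon as the queue has space
    q.put(None)                # tell the writer to stop
    t.join()                   # wait until all pending output is flushed
    return sink


output = run(5)
print(output)
```

With a single producer and a single consumer, the queue preserves ordering, so the output arrives in step order. Deciding when the background transfers are allowed to run, which is the subject of the poster, is left out of this sketch.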

Phil Carns, Argonne National Laboratory    (3XRK)

Walt Ligon, Clemson University    (3XRL)

Pete Wyckoff, The Ohio Supercomputer Center    (3XRM)

Alok Choudhary, Northwestern University    (3XRO)

Remzi Arpaci-Dusseau, University of Wisconsin    (3XRP)

Xian-He Sun, The Illinois Institute of Technology    (3XRQ)

Tzi-cker Chiueh, The State University of New York at Stony Brook    (3XRS)

Julio Lopez, Carnegie Mellon University, Data-Intensive Computing Research at CMU    (3XVO)

Abstract: Various research initiatives at Carnegie Mellon are aimed at enabling scalable computation on massive datasets. For example, the DISC project is developing new capabilities to facilitate data analytics in science applications. The Computational Databases (CoDs) project is exploring the use of database techniques in the context of scientific computing. The Tashi project is being developed as a free software infrastructure for managing data-intensive computations in shared clusters. Self-* Storage is being explored to increase automation in large-scale cluster-based storage.    (3XVP)

Xiaodong Zhang, The Ohio State University    (3XRT)

Mahmut Kandemir, Pennsylvania State University    (3XRU)

Scott Brandt, The University of California, Santa Cruz    (3XRV)

Paul Nowoczynski & Jared Yanovich; The Pittsburgh Supercomputing Center; Results and Developments in Scalable Lightweight Storage Hierarchies    (3XRW)

Yong Chen; Department of Computer Science, The Illinois Institute of Technology; Server-Push Architecture for Improving I/O Access Performance    (3XRX)

Nawab Ali, Department of Computer Science, The Ohio State University; Redesigning Parallel File Systems Using Object-based Storage Devices    (3XRY)

Xin Li, University of Rochester, Reference-Driven Performance Anomaly Identification    (3XS0)

5:30pm or 6:00pm - Adjourn    (3XRZ)

D. DRAFT Resources    (3W6Y)

E. Workshop Series Background    (3W70)

Purpose and Audience: GSA's USA Services/Intergovernmental leads monthly Collaborative Expedition workshops to advance the quality of citizen-government dialogue and collaborations at the crossroads of intergovernmental initiatives, Communities of Practice, Federal IT research and IT user agencies. The workshops seek to advance collaborative innovations in government and community services such as emergency preparedness, environmental monitoring, healthcare and law enforcement.    (3W71)

The workshops serve individuals from government, business, and non-government organizations to practice an emerging societal form, Intergovernmental Communities of Practice (CoPs), in light of the Citizen-Centric Government goal of the President’s Management Agenda and the Public Information Access provisions of the E-government Act of 2002.    (3W72)

Each workshop organizes participation around a common purpose, larger than any institution, including government. By learning how to appreciate multiple perspectives around potentials and realities of this larger “purpose”, subsequent actions by individuals representing many forms of expertise can be better expressed in their home and collaborative settings. By centering around people and the "whole system" challenges they organize around, IT design and development processes can mature with less risk and greater national yield of breakthrough performance.    (3W73)

Joint workshop sponsors, in addition to GSA, include the Emerging Technology Subcommittee of the Architecture and Infrastructure Committee and Coordinating Groups of the Subcommittee on Networking and Information Technology Research and Development, including the Social, Economic and Workforce Implications of IT and IT Workforce Development CG, High End Computing CG, High Confidence Software and Systems CG, Software Design and Productivity CG, and Human-Computer Interaction and Information Management CG. These organizations value this “frontier outpost” to open up quality conversations, augmented by information technology, to leverage the collaborative capacity of united but diverse sectors of society seeking to discover, frame, and act on national potentials.    (3W74)