Repository Platforms for Research Data IG Activity Overview 2015-03-09 RPRD P5 Meeting

2015-03-09 RPRD P5 Meeting

March 23, 2015 at 11:06 pm #138292
David Wilcox
Member
Agenda

11:30 – 12:00: Summary and Introductions

– Introduce co-chairs

– Report current group status

– Call for volunteer to take notes

– Call for co-chair from scientific data/repository background

– Brief introductions

12:00 – 12:15: Group Logistics

– Platform for document collaboration

– Editor(s) of document when it gets closer to publishing

– Monthly virtual meetings of group members

– Timeline for the work to be undertaken

12:15 – 12:25: Reviewing Relevant Work

– Work within RDA

– Need volunteers to interact with relevant RDA groups

– Work outside the RDA

– Need volunteers to review and summarize

12:25 – 12:40: Use Case Methodology

– Consensus on a use case format

– How/where will we gather use cases?

12:40 – 12:55: Functional Requirements

– How to relate use cases to functional requirements

12:55 – 13:00: Wrap-up and Next Steps

– How to join the group

– First virtual meeting

– Next in-person meeting

Here is the draft timeline that we plan to discuss; it is broken down over 18 months.

Months 1-3:

– Review relevant work within RDA

– Review relevant work outside RDA

– Create guideline for use case formatting

– Define how use cases will be related to functional requirements

– Identify sources and criteria for use cases

Months 4-6:

– Gather and format use cases according to guidelines

Months 7-10:

– Define functional requirements based on use case analysis

– Relate use cases to functional requirements

Months 11-12:

– Create draft report

Months 13-14:

– Publish draft report and circulate for feedback

Months 15-16:

– Edit report based on feedback

Months 17-18:

– Publish and distribute final report

Minutes
- Purpose to align functional requirements and use cases for repository platforms.
- BoF resulted in application to be a working group, through RDA processes.
- Want to create a guide/matrix to be used by repository developers. An almost officially approved Interest Group. A little more flexibility than a working group—but want to produce results in 12-18 months.
- Call for a co-chair from international community, scientific data/repository background—current co-chairs are from North America.
  - Question from the audience about diversity — should the group represent smaller disciplinary repositories, or is it primarily focused on the large repository platforms? Chairs are looking for those in the room who signed up for the initial working group, who signed up for the case statement. Want to encourage broad participation, but need to limit the number of chairs. This is an action-oriented group, looking to have people join in and contribute.
  - Not trying to exclude any size or type of repository. Envisioning that other interest groups can spin off from this one for, say, visualization
  - Will be based on specific use cases, not on a survey of landscape of existing repositories. Start from use cases being developed in RDA.
- Present
  - Research data manager at Columbia
  - Wendy Koslowski – Cornell
  - Tricia Cruz – DataOne
  - iRODS – interested in comparison of repository systems to data grid technologies
  - Thorny Staples – Smithsonian – trying to build research support system for Smithsonian, for researchers to take advantage of from the beginning so they leave behind curable data at the end.
  - US Dept of Agriculture – building a prototype system for managing agricultural data
  - Several people for Petabyte scale repositories in sciences
  - SDSC – building own platform for managing large data, make data available quickly to researchers and others
  - Peter Wittenberg – built up large archive over last ten years, built own robust system. Maintenance of system is not doable, changing perhaps to Fedora. Member of Technical Advisory Board.
  - Repository for neuroscience data. Switching to CKAN or DKAN, interested in other options.
- Discussion of group logistics
  - Will be gathering use cases, discussion, building up functional requirements. Need a platform for collaboration on documents.
    
    Proposal: Do early drafting of documents in Google Docs for collaborative document writing, then move to wiki space for later versions that are getting closer to published final versions.
    
    No objections to this from the group.
    
    Anticipate that early on many members of the group will be doing this work, editing draft documents, etc. Towards end of group work, probably makes sense to have a smaller group of people doing editing, managing submissions, so ideas and edits are not stepping on top of each other.
    
    Comment: please make sure that all work, including that in Google Docs, is accessible from the wiki (have world-viewable rights on the docs so everyone can see them while being formed?)
    
    Comment: many groups are collecting use cases — how to integrate with their efforts?
    
    Comment: use case collection. Other groups have developed Google Forms for data collection, and a way to show data after it has been collected.
    
    Have talked to libraries for research data group and long tail group about coordinating with them. Have a proposed method. Want to try to find ways to make use of existing work.
  - Meetings. Want to have these monthly perhaps—can use mailing list, but good to try to find a way to do voice communication. Timing – 10am US eastern seems to be a good compromise, proposing that. Plan to send out a Doodle poll to determine monthly calls
    
    Comment: another working group has two conference calls, one geared towards the west coast and Asia, one for the east coast and Europe.
- Timeline for next 18 months – draft timeline.
  - Months 1-3: how to properly scope the effort—possible to be very broad in how you define a repository, open to wide definition initially. First few months are focused on scoping appropriately.
    
    Review relevant work within RDA
    
    Review relevant work outside RDA
    
    Create guideline for use case formatting
    
    Define how use cases will be related to functional requirements
    
    Identify sources and criteria for use cases
  - Months 4-6
    
    Gather and format use cases according to guidelines
  - Months 7-10
    
    Define functional requirements
  - Months 11-12
    
    Create draft report
  - Comments:
    
    Looking at data index. Won’t be a repository, more a series of metadata connecting to repositories. Bigger repositories vs smaller repositories—bigger ones will say we have been doing this for years. In this, is there an intention to develop a series of best practices—requirements that a good repository should fulfill? Not just data quality, but operational quality as well.
    
    Scope of this effort is on relating research use case data to functional requirements for repositories, at the software level. What does a given repository platform need to do in order to satisfy these use cases.
    
    Look at certification working group for more on how to run a repository.
    
    One approach that is missing – this area seems to lend itself to a systematic review of the literature. There is a healthy and robust set of published papers on requirements for data repositories. Do a systematic review of these? Column A would be a set of requirements from the literature, then do an organic ground-up review of case statements; when you examine the two together you might come up with a really good truth. Example of Michael Witt’s paper. Collate this data in a meaningful way? Merge results of systematic review of literature with the work that comes out of the organic review of use cases. Could fast track the effort.
    
    Yes, this is implied in the initial three month review of work outside of RDA. Don’t want to duplicate work that has already been done.
- Matrix model of what the output might look like – showing draft matrix on the slide
  - Rating matrix for importance of certain functions, in a way that software developers can understand and translate. Functional requirements across columns, Use cases are the rows. E.g., researcher data related to a journal article. Rate how essential each of the functions defined is, to this use case.
  - Not about creating a data seal of approval type of list of requirements, which comes at things from the outside. Rather creating sets of functionality appropriate for specific desired types of use cases.
    
    Comments:
    
    Don’t understand the matrix? Functional requirements in columns don’t seem to align with the use cases in the rows? Explained what the example means
    
    Like the approach. But, be careful that this is not just a repository for the ultimate data sets, called collections at the end of the process, related to articles etc through a DOI. But what many repositories do in creating data, they use handles as data is being created. Look at DFT model, creating intermediate data that you need to refer to with handles, but don’t have a DOI yet.
    
    Yes, will come out of how we define use cases and start creating requirements
    
    Important to distinguish the purpose of the repository. Is validating the content important, and who is that for—the users of the data? Different for a preservation repository? Different requirements for different types of repositories.
    
    Left hand row of data – use cases could be different uses of the repository data, like harvesting. Different types of communities, different levels of use.
    
    Launch of group’s work is to come up with these use cases and create rich definition of what the use cases will cover.
    
    Will be tricky to figure out – most repositories are moving on from search and extraction to analysis, adding compute capabilities to compare the database. Repositories are becoming something more, will be tricky to capture and define what a repository is
    
    Yes. Could go very broad in what a repository is, not sure where yet to draw the line, because if we get too broad can get unachievable scope. Won’t start from zero, but need to spend some time getting to this definition.
    
    Have seen that functional requirements can get very detailed, repositories are very complex systems with dozens of tasks. What level of detail are we trying to get to? Could be hundreds of functional requirements, could be hard to read or have too much information.
    
    Need to be careful to produce something that is not just correct, but easy to use.
    
    Perhaps as a help—many repository systems are built on top of storage systems. Might help to provide the kind of layer system, so we know we are not having to talk about cloud or other platforms that repository sits on top of. Limit the scope by making a way to separate out these kinds of things.
    
    Yes. Perhaps will do something like splitting out platform type questions into different tabs.
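The matrix model described above (use cases as rows, functional requirements as columns, each cell rating how essential the function is to the use case) could be sketched roughly as follows. This is only an illustration under stated assumptions: the four-step rating scale, the column names, and the class and method names are all hypothetical and were not defined by the group; only the example use case ("researcher data related to a journal article") comes from the discussion.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical rating scale: the minutes do not define one.
RATING_LABELS = {0: "not needed", 1: "optional", 2: "important", 3: "essential"}


@dataclass
class RatingMatrix:
    """Use cases as rows, functional requirements as columns."""

    requirements: List[str]
    ratings: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def rate(self, use_case: str, requirement: str, rating: int) -> None:
        """Record how essential a requirement is to a use case."""
        if requirement not in self.requirements:
            raise KeyError(f"unknown requirement: {requirement}")
        if rating not in RATING_LABELS:
            raise ValueError(f"rating must be one of {sorted(RATING_LABELS)}")
        self.ratings.setdefault(use_case, {})[requirement] = rating

    def essential_for(self, use_case: str) -> List[str]:
        """Return the requirements rated essential for one use case."""
        row = self.ratings.get(use_case, {})
        return [r for r in self.requirements if row.get(r, 0) == 3]

    def as_rows(self) -> List[List[str]]:
        """Render a header row plus one row per use case.

        Unrated cells are treated as 0 ("not needed"), which is an assumption.
        """
        table = [["Use case"] + self.requirements]
        for use_case, row in self.ratings.items():
            cells = [RATING_LABELS[row.get(r, 0)] for r in self.requirements]
            table.append([use_case] + cells)
        return table


# Column names are illustrative assumptions, not requirements agreed by the group.
matrix = RatingMatrix(["Persistent identifiers", "Metadata harvesting", "Versioning"])
article_data = "Researcher data related to a journal article"
matrix.rate(article_data, "Persistent identifiers", 3)
matrix.rate(article_data, "Metadata harvesting", 2)

for line in matrix.as_rows():
    print(" | ".join(line))
```

Keeping the requirement list explicit and separate from the ratings is one way to catch a rating filed against a column that does not exist, which may matter if the number of functional requirements grows into the hundreds as some attendees feared.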
- Other groups to coordinate with – volunteers to interact with other groups and bring back information to inform this group. Don’t want to drag on for two years and still be having more and more work; want to produce something that is helpful to move RDA forward.
  - Data Fabric
  - Data Foundation and Terminology
  - Certification
- Use case template, standard for how to collect use cases – user story based approach used by Fedora and Islandora
  - Primary Actor, Scope, Level (priority), Story. Example from GitHub – Islandora and Fedora4 interest group, use case template.
  - Stay away from defining technical implementation, let that come later. State in plain language.
  - Simple format—other groups in RDA use more complex formats. Purpose of this one—if you had a five-page user story, would break that down into a series of shorter use cases.
  - Comments:
    
    Implementation of this? Challenge in agile process is connecting across user stories—this is a very vertical approach. How does your workflow relate to this?
    
    In agile, would create an epic that connects the individual stories. Might want to group functional requirements and user stories.
    
    This approach allows you to identify needs vs. requirements, which is fine—requirements can get way too detailed. How to handle expectations though, for people who really are looking for requirements, expectations about performance, etc? Any thought of how to capture that?
    
    This is very much a draft, interested in improving this template. Not sure how to incorporate those unstated expectations.
    
    Use this process in XSEDE. What are the quality attributes that people want to see? Incorporate those into the use case. Capture what people are really expecting out of the thing: reliability, performance, storage. Often people don’t think about this when they are writing the use cases because they assume the basics will just be there.
    
    Need to make these expectations very concrete
    
    How do we want to come to functional requirements and metrics? Get these from the use cases? Or, have the use cases on the one hand, and make functional requirements that would meet these?
    
    Literature review and looking at other systems may give us benchmarks. What kind of functions can we also pull out of the use cases. Functional requirements may come from a few different sources.
    
    Should also have a field in the template for users to put in functional requirements.
    
    Question: is the output of the group advisory in nature? May or may not want to take on this role, but there are a lot of people out there who would look to output as advisory document. Not only produce the rubric of what is, but advisory tags – mandatory if applicable, optional, will remain silent on… Important that the distinction be made from the get-go, whether this is advisory. Other IG experience—best they could do was give high-level principles, but not tell people how to implement privacy and licensing in their environment.
    
    Answer: don’t want to take this much further than saying, this type of function can meet this type of use case. Thinking that advisory function is beyond the scope of this group. Might spawn a working group that says, what kind of platforms are out there that would meet these kinds of requirements
    
    Hesitation on advisory nature—danger of being too encouraging of, one repository to do it all.
    
    Don’t expect to find at the end of the day that there is one platform to perform all of these functions.
    
    Think that position is correct, as far as approach. Encourage taking the approach, of letting people take this and form guidance from a working group on how to create advisory documents on what kind of platforms to use in a specific context
    
    Idea: survey the end users of these systems.
    
    Yes. Makes sense insofar as their use cases can speak to requirements for software, and don’t get into things like policy.
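The use case template discussed above (Primary Actor, Scope, Level, Story) could be captured in a small structure like the one below. This is a minimal sketch, not the Fedora/Islandora template itself: the two optional fields reflect suggestions raised in the comments (quality attributes, and a field for functional requirements), and the class name, field types, and example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UseCase:
    """One short user story in the four-field format from the minutes."""

    primary_actor: str
    scope: str
    level: str  # priority
    story: str  # plain language, no technical implementation details
    # Optional fields suggested in the comments; not part of the draft template.
    quality_attributes: List[str] = field(default_factory=list)
    functional_requirements: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Format the use case as labelled plain-text lines."""
        lines = [
            f"Primary Actor: {self.primary_actor}",
            f"Scope: {self.scope}",
            f"Level: {self.level}",
            f"Story: {self.story}",
        ]
        if self.quality_attributes:
            lines.append("Quality attributes: " + ", ".join(self.quality_attributes))
        if self.functional_requirements:
            lines.append(
                "Functional requirements: " + ", ".join(self.functional_requirements)
            )
        return "\n".join(lines)


# Hypothetical example values, for illustration only.
example = UseCase(
    primary_actor="Researcher",
    scope="Data repository",
    level="High",
    story=(
        "As a researcher, I want to deposit the data behind a journal "
        "article so that readers can find and cite it."
    ),
    quality_attributes=["reliability", "performance"],
)
print(example.to_text())
```

Leaving the optional fields empty by default keeps the format simple, as the chairs intended, while still giving contributors a place to state expectations that would otherwise go unwritten.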
- Wrap-up
  - For those who are interested in continuing discussion by joining group, or just following discussion:
    
    http://www.rd-alliance.org/groups/repository-platforms-research-data.html
    
    Can sign up for mailings
  - Next working group is at Plenary 6 in Paris.
  - Will start work about a month from now.
  - Comments:
    
    What the group is doing is excellent. But, also have a wiki space to talk about experience with systems, as well as use cases?
    
    Looking at what people might have actually done could generate additional good information, functional requirements that aren’t captured initially.
    
    This is a good point, both of these, could be part of the potential working group that gets spawned from the interest group