Dear all,
here are my notes from today's call. Please feel free to extend.
Our next call is scheduled for Aug 17, 13:00 UTC.
Notes:
* We discussed several use cases, particularly from RPID, in terms of
how the strawman fits and what fields might be missing
o Typical properties were related to time, geolocation; there are
also more 'domain properties' like temperature, wind speed - but
are these actually necessary for fundamental decision making?
o Can we form 'packages' out of this? like: KI for trust, KI for
geo/time, KI for environment?
* RPID: 4 exemplary scenarios; often need more information than in the
strawman; can this be broken down into parts, where parts of the
scenarios are enabled purely by the trust KI?
  o Example 1: weather scenario: sensor network data, grouped by
    date and published as 'daily research objects'; uses all the
    mandatory fields in the strawman. The second part is the analysis
    part, but RPID has not yet proceeded to the filtering case. From
    what is there, it looks like creation date (already in the
    strawman) and device ID (cannot be included) are important.
o Example 2: rice genomics: phenotypes & genotypes data;
copyright/licensing is a big issue - who created data, who
published data; also uses derivedFrom; future properties may be:
publication date, also geoinformation
* Discussion at IU: pulling info from domains will just make the
profile bigger; this is not what we want, but no clarity on what
else to do, so the discussion stopped. But the problem remains
unsolved.
o This is familiar also from the previous PIT group work. We also
got to that point and did not have any answers.
* One way to approach this: What is the value of the limited profile?
If this stands on its own, what does it enable? Does it enable
enough (cost/benefit ratio)?
  o Dublin Core or DataCite must have faced this before. Can we learn
    from them? But: this is not about the metadata fields themselves,
    but about the conceptual process that leads to including or not
    including them (or in what form). We want to clarify that process
    for our KI decisions.
* Currently, we can't see a clear limit to the profile. So we want to
structure the decisions on what field to include or not include.
  o Ulrich got somewhere with this: he noted down some first ideas for
    structuring: graphs; some sort of ordering (the typical 'date' use
    case, but also geolocation; string ordering); patterns in strings;
    (there were more - I did not catch all of them.)
  o It is all geared towards giving easy 'yes'/'no' answers
* Ulrich got there by thinking about RDA discussions and the currents
  that run in them. Example: versioning discussions in RDA have always
  been based on different understandings of what versioning is, e.g.
  version numbers of objects vs. graph lineage (git etc.); they are
  orthogonal, but both provide some form of ordering - and therefore,
  ordering in principle seems to be an interesting/relevant aspect;
  comparability of versions seems to be important
o Can we find more such examples?
o Guideline is always: What information is at a generic level
required to crawl through DOs? and then it all ends up at these
yes/no decisions
o TW: another example for a recurring RDA discussion could be
granularity/collection/subsetting - but what is the
generalization of this that leads to yes/no decisions?
  o Another example: patterns in strings: this is about searching -
    search questions are always about string pattern matching; this
    again leads to yes/no decisions (a small sketch follows these notes)
    + Can we also look the other way, i.e.: which processes ultimately
      boil down to a pattern matching question?
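To make the 'easy yes/no' theme concrete, here is a minimal, hypothetical Python sketch (field names and values are illustrative only, not part of the strawman) showing how date ordering, version comparison, and string pattern search all reduce to boolean decisions:

    # Hypothetical sketch: recurring themes reduced to yes/no decisions.
    # Field names ("creationDate", "version", "title") are illustrative only.
    import re
    from datetime import date

    record = {"creationDate": date(2017, 8, 3),
              "version": (1, 2),
              "title": "daily research object 2017-08-03"}

    # Ordering on dates: "was this object created after a cutoff?" -> yes/no
    created_after = record["creationDate"] > date(2017, 1, 1)

    # Ordering on versions: linear version numbers and graph lineage are
    # orthogonal notions, but both still answer "is A earlier than B?" -> yes/no
    is_earlier = record["version"] < (2, 0)

    # Pattern matching on strings: "does the title match a pattern?" -> yes/no
    matches = re.search(r"daily research object \d{4}-\d{2}-\d{2}",
                        record["title"]) is not None

    print(created_after, is_earlier, matches)  # three booleans drive the crawl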
--
Dr. Tobias Weigel
Abteilung Datenmanagement
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45 a • 20146 Hamburg • Germany
Phone: +49 40 460094-104
Email: ***@***.***
URL: http://www.dkrz.de
ORCID: orcid.org/0000-0002-4040-0215
Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784
Author: Ulrich Schwardmann
Date: 04 Aug, 2017
Dear Tobias, all
thanks Tobias for streamlining my straying thoughts.
The major problem I see, in yesterday's discussion and also in earlier
discussions, is this: on the one hand we define a strawman set of types
meant to be elementary for every decision process in crawling through
data, and on the other hand we see that the use cases we have generally
need some additional specific types not covered by the strawman.

The consequence is that with the strawman alone we cannot crawl through
the data once and get the answers the use cases need; we need a second
run for the queries specific to the individual use cases. Such a
double-run approach will in most cases be highly inefficient.
We tried several times to extend the strawman in different directions,
but this always led to doubts about a specific decision and questions
about extending a bit further. In general the discussion ended with the
statement that we cannot decide, so we stop at this point and look at
what we have. Which means that, in the end, we propose inefficient
processes for most of our known use cases and probably for all of those
we do not know yet.

Another approach was to define the strawman plus a couple of more
community-specific sets of types. This might help to extend the number
of use cases covered, but in the end it again creates a set of
exceptions which will need a second run. For me, the inefficiency
caused by the need for a complete second run is a big obstacle to
general acceptance of such a solution.
And this leads back to the abstract starting point of this group: the
problem of giving guidelines on how we can get easy Yes/No answers when
crawling through masses of DOs. The emphasis on "easy" cannot be
overstated here: often the most resource-consuming simulations on our
HPC system boil down to a Yes or No to a certain question.

My suggestion, however, is not to select specific types as the
guideline, but to define how easy Yes/No answers can be provided; in
other words, to look more closely at the decision processes themselves
that take place during the individual crawling steps, and to try to
define criteria for selecting info types this way.

This would not directly name the types allowed as kernel information,
but would allow much broader flexibility. It is comparable to defining
a set by a function rather than by naming its elements (a small sketch
below illustrates this).
Naming types always restricts the semantics and therefore immediately
and dramatically restricts the use cases, which are always driven by a
specific semantics behind them. Saying that a type has to satisfy a
specific criterion, like being a number, means that any kind of
semantics directly related to numbers can be used, which is a huge
field. And why numbers, for instance? Because there are lots of very
easy decision processes possible on numbers. But again, not everything
can be covered by numbers.
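As an illustration of this 'set defined by a function' idea, here is a small hypothetical Python sketch; the criterion 'can be read as a number' is only one possible example, and none of the field names below belong to the strawman:

    # Hypothetical sketch: admit kernel info types by a criterion, not a whitelist.
    def is_numeric(value) -> bool:
        """Criterion: the value can be interpreted as a number."""
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    def admitted_by_criterion(record: dict, criterion) -> dict:
        """Keep only fields whose values satisfy the criterion. Semantically
        very different fields (file size, wind speed, ...) are admitted, as
        long as easy numeric decisions are possible on them."""
        return {k: v for k, v in record.items() if criterion(v)}

    # Compare with a fixed whitelist, which pins down the semantics in advance:
    WHITELIST = {"creationDate", "checksum"}

    record = {"fileSize": "20480", "windSpeed": 7.2, "license": "CC-BY"}
    print(admitted_by_criterion(record, is_numeric))            # fileSize, windSpeed
    print({k: v for k, v in record.items() if k in WHITELIST})  # {} - nothing admitted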
Going deeper, this problem has two aspects:
* the landscape of the crawling process itself
* making the selections of DOs during the crawling process for the
  overall output of the process

The first relies on some graph structure of identifiers; the second
needs easy Yes/No answers for the decisions on whether to follow a
particular path in that graph. These decisions can be made from
information about the graph structure at the current node itself and/or
from additional information available at the current node, such as the
availability of certain types and, more often, the content of the data
in particular types.
This gives three categories of decisions in that crawling process:
- local graph structure
- local type structure
- particular type content

The first two lead in general to easy decisions by construction. The
third is the one that causes more trouble. For this one we need to
define more closely what the easy decision process itself could be.
In principle this means that, for an info type together with a given
condition, a service/function exists that reliably answers Yes/No or
True/False within a certain short time frame.
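A minimal sketch of what a single crawling step could look like under this view, assuming the three decision categories above are checked in order; all names (decide_follow, decision_fn, the node layout) are hypothetical, not an agreed design:

    # Hypothetical sketch: one crawl step deciding whether to follow a node,
    # using the three categories: local graph structure, local type structure,
    # and particular type content (the latter via a timed decision function).
    import time

    def decide_follow(node: dict, required_types: set, field: str,
                      decision_fn, time_budget_s: float = 0.01) -> bool:
        # 1) local graph structure: e.g. only follow nodes with outgoing links
        if not node.get("links"):
            return False
        # 2) local type structure: the required info types must be present
        if not required_types.issubset(node.get("types", {})):
            return False
        # 3) particular type content: delegate to a decision service/function
        #    that must answer True/False within a short time frame
        start = time.monotonic()
        answer = decision_fn(node["types"][field])
        if time.monotonic() - start > time_budget_s:
            return False  # the decision was not 'easy'; do not follow this path
        return bool(answer)

    node = {"links": ["some/other-pid"], "types": {"creationDate": "2017-08-03"}}
    print(decide_follow(node, {"creationDate"}, "creationDate",
                        lambda d: d >= "2017-01-01"))  # True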
One could say that for kernel information one only allows types for
which decision processes exist (and are used) that fulfil these
criteria, simple as that. But it certainly makes sense for this group
to discuss the suggested structure in more depth and to give more
advice to data managers.
As a rough guideline one could say that both the condition and the info
type data need to be simple, and they need to be compatible (for
instance, one cannot ask for values > 1 as the condition and provide a
string as the info type data, unless some conversion/mapping is in
place).
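This compatibility requirement could be checked mechanically; the sketch below is a hypothetical illustration of the '> 1 on a string' example, including an optional conversion/mapping (the function names are mine, not agreed terminology):

    # Hypothetical sketch: condition and info type data must be compatible.
    def greater_than_one(value) -> bool:
        """Condition: value > 1. Only meaningful for numeric data."""
        return value > 1

    def evaluate(condition, value, mapping=None) -> bool:
        """Apply the condition; if the raw value is incompatible, try a mapping."""
        try:
            return condition(value)
        except TypeError:
            if mapping is None:
                raise ValueError("condition and info type data are incompatible")
            return condition(mapping(value))

    print(evaluate(greater_than_one, 3))                   # True: numeric data
    print(evaluate(greater_than_one, "3", mapping=float))  # True: string mapped to a number
    # evaluate(greater_than_one, "3") would fail: no conversion/mapping in place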
Examples of such info type / condition combinations would be (a sketch
follows this list):
* The boolean True/False itself is the trivial kind of type for a
  decision process, but booleans are certainly valuable candidates,
  because they allow the fastest possible decision process.
* Types that have some sort of well-defined order, such that it is easy
  to ask for greater, less or equal. One could also allow the much
  larger class of semi-(partial) orderings, because incomparability can
  simply give the answer No. Examples of ordered types would be
  numbers, strings under lexicographical order, and structured strings
  like dates, geolocations, etc. With a semi-ordering one can also
  compare nodes in a graph, lists or arrays (including strings again,
  now viewed as arrays), sets, and possibly dictionaries viewed as
  sets. This list covers a range from versioning, in both of its
  orthogonal aspects, to update dates or geolocation.
* Types that have some structure that can be explored by pattern
  matching, because in general pattern matching is a fast decision
  process. Examples would be strings, but also numbers and probably all
  the examples of the semi-ordering above. By the way, this makes
  semi-orderings even more interesting in this context, because there
  would be at least two approaches for easy decisions. This list covers
  a range from typical publication queries to (simple) semantic web
  queries.
* Others have to be explored... Let's look at the use cases to see
  whether we missed something. What about the devices example we had in
  the discussion yesterday?
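To illustrate these kinds, here is a hypothetical Python sketch with one predicate per kind; the partial-order case uses the subset relation as a stand-in and shows how incomparability can simply yield No:

    # Hypothetical sketch: the three kinds of info type / condition combinations.
    import re

    # 1) Boolean content: the value itself is already the decision.
    def bool_decision(value: bool) -> bool:
        return value

    # 2) Ordered content: totally ordered values answer >, <, == directly;
    #    a semi-(partial) ordering may also be incomparable, which maps to No.
    def partial_order_decision(a: set, b: set) -> bool:
        """Example partial order: subset relation on sets. Incomparable -> False."""
        return a <= b  # False covers both "b strictly before a" and "incomparable"

    # 3) Pattern-matchable content: a regular expression gives a fast yes/no.
    def pattern_decision(value: str, pattern: str) -> bool:
        return re.search(pattern, value) is not None

    print(bool_decision(True))                                # True
    print(3 < 5, "2017-08-03" < "2017-08-17")                 # total orders: True True
    print(partial_order_decision({"a"}, {"a", "b"}))          # True  (comparable)
    print(partial_order_decision({"a", "c"}, {"a", "b"}))     # False (incomparable)
    print(pattern_decision("doi:10.1234/abc", r"^doi:10\."))  # True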
Finally, one could end up with a three-level structure (a sketch
follows the list):
* one level that defines the strawman as the most generic part of
  kernel information, while clearly saying that this will not be enough
  in most cases,
* one level that defines a couple of well-known examples of kinds of
  info types (like integer, time, etc., or like ordered or matchable
  types) and the related kinds of decisions (like >, belongs to, or
  matches) on these types,
* and one level given by the more generic definition above: that for an
  info type together with a given condition, a service/function needs
  to exist that reliably answers Yes/No or True/False within a certain
  short time frame.
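One way to picture the three levels is as a layered profile description, as in the hypothetical sketch below; the field names and kind names are placeholders, not agreed terminology:

    # Hypothetical sketch: the suggested three-level structure as data/code.
    import time

    # Level 1: the strawman as the most generic set of kernel info fields
    # (placeholder field names, not the agreed strawman).
    STRAWMAN_FIELDS = {"creationDate", "checksum", "derivedFrom"}

    # Level 2: well-known kinds of info types and the decisions they support.
    WELL_KNOWN_KINDS = {
        "integer":   [">", "<", "=="],
        "datetime":  [">", "<", "=="],
        "ordered":   [">", "<", "==", "incomparable -> No"],
        "matchable": ["matches", "belongs to"],
    }

    # Level 3: the generic requirement - some service/function must exist
    # that reliably answers True/False under a given condition, quickly.
    def satisfies_generic_requirement(decision_fn, value, condition,
                                      time_budget_s: float = 0.01) -> bool:
        start = time.monotonic()
        answer = decision_fn(value, condition)
        if time.monotonic() - start > time_budget_s:
            raise TimeoutError("decision process is not 'easy' for this type")
        return bool(answer)

    print(satisfies_generic_requirement(lambda v, c: v > c, 42, 1))  # True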
There is certainly a lot to do on such an approach, but I think it
could be a good starting point to overcome these boundary problems we
have.
Author: Tobias Weigel
Date: 16 Aug, 2017
Hello Ulrich,
in view of tomorrow's call, I've gone through your notes again and came
up with a couple of detail questions (can ask tomorrow), but more
importantly, observations regarding future directions of the group:
a) We need a better and somewhat formal description of the crawling &
selection process, including the decision-making part
b) We need a framework for condition specifications. You've already put
some cornerstones in place (categories) and some essential requirements
(simple machine-readability, compatibility).
Also, I'm musing about how complete your various points are - e.g. the
three crawling decision categories - I have not found gaps so far,
which is good, but it might still be worth thinking about extensions.
But the two items above may be more directly relevant to the next steps.
Best, Tobias
Author: Mark Parsons
Date: 16 Aug, 2017
Hi all,
I just joined this group. Where is the call-in info? I'd like to join, if I may.
cheers,
-m.
Mark A. Parsons
0000-0002-7723-0950
Senior Research Scientist
Tetherless World Constellation
Rensselaer Polytechnic Institute
http://tw.rpi.edu
+1 303 941 9986
Skype: mark.a.parsons
mail: 1550 Linden Ave., Boulder CO 80304, USA