Francis Cave from the ACAP project has posted a great response to my earlier post expressing concerns. The high-order bit is that “1.0” is a misnomer; ACAP is a work in progress, and the ACAP team is committed to refining the proposal in light of suggestions and critiques from those interested in its success. That’s all for the best.
Cave also raises some more specific issues on which I have more to say:
The list of participants in the ACAP project includes a lot more publishers than search engines. Thus, it’s tilted towards trying to express those things publishers would like to express, but without a corresponding sensitivity to what’s feasible for a search engine to do. (See, for example, Andy Oram’s critique of ACAP’s technical demands.) His group would have liked more input from the search side; I see the kinds of critiques Andy and I and others are now offering as a slightly belated substitute for the input the search engines wouldn’t or couldn’t provide.
The ACAP team didn’t originally plan to use the Robot Exclusion Protocol at all, but then chose to implement their proposals as REP extensions
based upon the blindingly obvious fact that REP is the established way for content owners to communicate routinely with crawler operators, and it will be far easier for crawler operators to implement extensions to what they are already able to interpret in REP than to propose an entirely new protocol.
That’s right, and while the decision creates transition issues of one sort (as I discussed in my first post), it does also add transition issues of a different sort. I stand by my best-practice recommendation:
[A]nyone writing an ACAP file [should] use ACAP-permissions-reference in their robots.txt to send all crawlers that speak ACAP to an external file that consists of pure ACAP.
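To make the recommendation concrete, here is a minimal sketch of what the crawler side might do. It assumes the field is written as a single robots.txt line of the form "ACAP-permissions-reference: <URL>"; that line syntax, the function name, and the example.com URL are my assumptions for illustration, not something taken from the ACAP draft.

```python
from typing import Optional

# Assumption: the field appears on its own line as
# "ACAP-permissions-reference: <URL>", matched case-insensitively.
FIELD = "acap-permissions-reference"

# Hypothetical robots.txt: ordinary REP rules plus one pointer line.
SAMPLE = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "ACAP-permissions-reference: http://example.com/acap.txt\n"
)


def find_acap_reference(robots_txt: str) -> Optional[str]:
    """Return the URL of the external ACAP file, or None if there is none."""
    for raw_line in robots_txt.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        # Split on the first colon only, so the URL's own colons survive.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == FIELD:
            return value.strip() or None
    return None


if __name__ == "__main__":
    print(find_acap_reference(SAMPLE))
```

A crawler that speaks ACAP would fetch the referenced file and read only that; a crawler that doesn't would ignore the unfamiliar line and carry on with the plain REP rules.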
What’s In, What’s Out
I made fun of the number of admittedly not-ready-for-prime-time “features” in ACAP 1.0. Now that I know that 1.0 isn’t really 1.0, I’m much more tolerant of having those features “in.” They can be discussed; implementers can articulate which of them work and which of them don’t. The lack of polish doesn’t bother me. Cave and I are entirely on the same page at this level: let’s talk about which features work and fix or replace the ones that don’t.
Dates and Times
Cave says that “just adopting standard ISO formats for date-times isn’t a total solution.” I’m curious to know why. Wouldn’t content owners want that level of specificity? And wouldn’t search engines want standard formats with standard parsers?
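As a small illustration of the "standard parsers" point, here is a sketch using Python's standard library. It only shows that an ISO 8601 date-time with an explicit UTC offset can be parsed and compared without custom code; the helper name and the sample timestamps are mine, and nothing here reflects how ACAP itself writes dates.

```python
from datetime import datetime, timezone


def parse_iso(text: str) -> datetime:
    """Parse an ISO 8601 date-time with an explicit offset and convert to UTC."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Without an offset the instant is ambiguous, so refuse it.
        raise ValueError("expected an explicit UTC offset: " + text)
    return parsed.astimezone(timezone.utc)


if __name__ == "__main__":
    # The same instant written in two different offsets.
    a = parse_iso("2008-01-15T09:30:00+01:00")
    b = parse_iso("2008-01-15T08:30:00+00:00")
    print(a == b)
```

Whatever the format leaves unsolved, ambiguity about which instant is meant shouldn't be part of it.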
Interpretations by Prior Arrangement
This is a tough issue. Cave:
There are several examples in our proposals of forms of expression that search engines, unless they make a special arrangement with the content owner, would be bound to treat as “cloaking”.
That’s a good point, and I hadn’t realized it. I’d like to know more; perhaps there’s a way that honest content providers could establish that what they’re doing isn’t cloaking as defined today.
Maybe implementors on the receiving side would like us to divide our proposals more clearly into a core set of features that don’t involve such issues and a supplementary set that might. From a publisher perspective a number of what you might see as “non-core” features are quite fundamental to what they need to be able to communicate, so at this stage I don’t think it would be helpful to create such a separation.
I think implementors need a clean separation between features they’re expected to implement, and ones they aren’t. Certainly, if there are going to be any legal dimensions to ACAP—a place that lots of content owners want to go—a cleanly defined boundary is utterly essential. If publishers see a feature as “fundamental,” then either ACAP needs to be iterated until it expresses that feature cleanly and in a fully specified syntax, or the publishers need to wake up and understand why their “needs” are inconsistent with technical reality. I suspect the grab bag that is ACAP 1.0 includes examples of both feasible and infeasible “fundamentals.”
Used Resource Types
“Present” is the output of one of those committee processes; part of the cost of doing technical work in a group. The typos I noticed in the list of resource types have been corrected. Dropping “extract” simplifies a bit, but I would still suggest factoring the list along three dimensions:
- What information is involved? You could use a resource in toto, use an excerpt from it, use an abstracted/condensed summary of it, and so on.
- Where the used version is not the original itself, who is responsible for generating the used version? The search engine or the publisher?
- How old should the “used” version be? For example, there’s the version originally indexed, and there’s the current version on the publisher’s server.
I may be simplifying away essential complexity here—correct me if I am—but this seems like a cleaner, easier-to-understand taxonomy of used versions than the list in the current ACAP draft. I wouldn’t try to intermingle these questions in the same attribute or verb.
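Here is one way the three dimensions could be kept apart, as a rough sketch rather than a proposal for actual ACAP syntax. Every name in it (Extent, Producer, Snapshot, UsedVersion and their values) is hypothetical; none of it comes from the ACAP draft.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Extent(Enum):
    """What information is involved?"""
    WHOLE = "whole"
    EXCERPT = "excerpt"
    SUMMARY = "summary"


class Producer(Enum):
    """Who generates a version that is not the original itself?"""
    SEARCH_ENGINE = "search-engine"
    PUBLISHER = "publisher"


class Snapshot(Enum):
    """How old should the used version be?"""
    AS_INDEXED = "as-indexed"
    CURRENT = "current"


@dataclass(frozen=True)
class UsedVersion:
    extent: Extent
    snapshot: Snapshot
    producer: Optional[Producer] = None  # only meaningful for derived versions

    def __post_init__(self) -> None:
        # The second question applies only when the used version
        # is not the original itself.
        if self.extent is Extent.WHOLE and self.producer is not None:
            raise ValueError("the original itself has no producer")
        if self.extent is not Extent.WHOLE and self.producer is None:
            raise ValueError("a derived version needs a producer")


if __name__ == "__main__":
    print(UsedVersion(Extent.EXCERPT, Snapshot.CURRENT, Producer.PUBLISHER))
```

Each question gets its own field, so a publisher-supplied excerpt of the current page and a search-engine-generated summary of the indexed copy differ in well-defined ways instead of being two unrelated entries in a flat list.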
(Side question: what are the semantics of a statement that it’s permitted to display a thumbnail of a music file?)
I proposed generalizing the max-length attribute, and yes, it was already on their to-do list. Yay.