The Laboratorium : Automated Content Access Progress

Francis Cave from the ACAP project has posted a great response to my earlier post expressing concerns. The high-order bit is that “1.0” is a misnomer; ACAP is a work in progress, and the ACAP team is committed to refining the proposal in light of suggestions and critiques from those interested in its success. That’s all for the best.

Here are the more specific issues Cave raises on which I have something more to say:

Participation

The list of participants in the ACAP project includes a lot more publishers than search engines. Thus, it’s tilted towards trying to express those things publishers would like to express, but without a corresponding sensitivity to what’s feasible for a search engine to do. (See, for example, Andy Oram’s critique of ACAP’s technical demands.) Cave’s group would have liked more input from the search side; I see the kinds of critiques Andy and I and others are now offering as a slightly belated substitute for the input the search engines wouldn’t or couldn’t provide.

REP

The ACAP team didn’t originally plan to use the Robot Exclusion Protocol at all, but then chose to implement their proposals as REP extensions

based upon the blindingly obvious fact that REP is the established way for content owners to communicate routinely with crawler operators, and it will be far easier for crawler operators to implement extensions to what they are already able to interpret in REP than to propose an entirely new protocol.

That’s right, and while the decision avoids transition issues of one sort, it does also add transition issues of a different sort (as I discussed in my first post). I stand by my best-practice recommendation:

[A]nyone writing an ACAP file [should] use ACAP-permissions-reference in their robots.txt to send all crawlers that speak ACAP to an external file that consists of pure ACAP.
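
As a rough sketch of what this recommendation asks of a crawler, the following Python looks for an ACAP-permissions-reference line in a robots.txt body and returns the location it points to. The `field: value` line syntax, the case-insensitive matching, the function name, and the `/acap.txt` path are all assumptions made for illustration; the ACAP draft itself defines the actual grammar.

```python
from typing import Optional


def find_acap_reference(robots_txt: str) -> Optional[str]:
    """Return the value of the first ACAP-permissions-reference line, if any.

    Assumes a "field: value" line syntax like conventional robots.txt,
    with '#' starting a comment. This is an illustrative guess, not the
    grammar from the ACAP specification.
    """
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "acap-permissions-reference":
            return value.strip() or None
    return None


# Hypothetical robots.txt: plain REP rules plus a pointer to a pure-ACAP file.
example = """\
User-agent: *
Disallow: /private/
ACAP-permissions-reference: /acap.txt
"""

print(find_acap_reference(example))
```

A crawler that doesn’t speak ACAP would simply skip the unknown line and keep obeying the ordinary REP rules, which is the point of keeping the two vocabularies in separate files.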

What’s In, What’s Out

I made fun of the number of admittedly not-ready-for-prime-time “features” in ACAP 1.0. Now that I know that 1.0 isn’t really 1.0, I’m much more tolerant of having those features “in.” They can be discussed; implementers can articulate which of them work and which of them don’t. The lack of polish doesn’t bother me. Cave and I are entirely on the same page at this level: let’s talk about which features work and fix or replace the ones that don’t.

Dates and Times

Cave says that “just adopting standard ISO formats for date-times isn’t a total solution.” I’m curious to know why. Wouldn’t content owners want that level of specificity? And wouldn’t search engines want standard formats with standard parsers?
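
To make the “standard parsers” point concrete, here is a small Python sketch that reads an ISO 8601 date-time using nothing but the standard library. The timestamp and the `is_expired` helper are invented for illustration and are not drawn from any ACAP example.

```python
from datetime import datetime, timezone

# Hypothetical expiry time in ISO 8601 form, with an explicit UTC offset.
stamp = "2007-12-11T16:19:00+00:00"

# datetime.fromisoformat ships with the standard library (Python 3.7+).
expires = datetime.fromisoformat(stamp)


def is_expired(now: datetime) -> bool:
    """True once the given moment is at or past the expiry time."""
    return now >= expires


print(expires.isoformat())
print(is_expired(datetime(2007, 12, 12, tzinfo=timezone.utc)))
```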

Interpretations by Prior Arrangement

This is a tough issue. Cave:

There are several examples in our proposals of forms of expression that search engines, unless they make a special arrangement with the content owner, would be bound to treat as “cloaking”.

That’s a good point, and I hadn’t realized it. I’d like to know more; perhaps there’s a way that honest content providers could establish that what they’re doing isn’t cloaking as defined today.

Cave again:

Maybe implementors on the receiving side would like us to divide our proposals more clearly into a core set of features that don’t involve such issues and a supplementary set that might. From a publisher perspective a number of what you might see as “non-core” features are quite fundamental to what they need to be able to communicate, so at this stage I don’t think it would be helpful to create such a separation.

I think implementors need a clean separation between features they’re expected to implement, and ones they aren’t. Certainly, if there are going to be any legal dimensions to ACAP (a place that lots of content owners want to go), a cleanly defined boundary is utterly essential. If publishers see a feature as “fundamental,” then either ACAP needs to be iterated until it expresses that feature cleanly and in a fully-specified syntax, or the publishers need to wake up and understand why their “needs” are inconsistent with technical reality. I suspect the grab bag that is ACAP 1.0 includes examples of both feasible and infeasible “fundamentals.”

Used Resource Types

“Present” is the output of one of those committee processes; part of the cost of doing technical work in a group. The typos I noticed in the list of resource types have been corrected. Dropping “extract” simplifies a bit, but I would still suggest factoring the list along three dimensions:

What information is involved? You could use a resource in toto, use an excerpt from it, use an abstracted/condensed summary of it, and so on.
Where the used version is not the original itself, who is responsible for generating the used version? The search engine or the publisher?
How old should the “used” version be? For example, there’s the version originally indexed, and there’s the current version on the publisher’s server.

I may be simplifying away essential complexity here—correct me if I am—but this seems like a cleaner, easier-to-understand taxonomy of used versions than the list in the current ACAP draft. I wouldn’t try to intermingle these questions in the same attribute or verb.
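
One way to picture the factoring is as three independent fields rather than one combined list. The Python below is only a sketch of that idea: the class names, field names, and enumerated values are my own placeholders, not vocabulary from the ACAP draft.

```python
from dataclasses import dataclass
from enum import Enum


class Extent(Enum):      # what information is involved?
    WHOLE = "whole"
    EXCERPT = "excerpt"
    SUMMARY = "summary"


class Producer(Enum):    # who generates the used version?
    SEARCH_ENGINE = "search-engine"
    PUBLISHER = "publisher"


class Version(Enum):     # how old may the used version be?
    AS_INDEXED = "as-indexed"
    CURRENT = "current"


@dataclass(frozen=True)
class UsedResource:
    extent: Extent
    producer: Producer
    version: Version


# For example: a snippet cut by the search engine from the copy it indexed.
snippet = UsedResource(Extent.EXCERPT, Producer.SEARCH_ENGINE, Version.AS_INDEXED)
print(snippet)
```

Each question gets its own field, so adding a new kind of excerpt doesn’t multiply the number of named resource types. (The producer question only matters when the used version isn’t the original itself; a real design would have to say what it means for the “whole” case.)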

(Side question: what are the semantics of a statement that it’s permitted to display a thumbnail of a music file?)

Max-length

I proposed generalizing the max-length attribute, and yes, it was already on their to-do list. Yay.

December 11, 2007 at 4:19 PM

Francis Cave

Here are my responses to the further specific issues that you raise.

Participation. Yes, now that I know that you and Andy Oram are taking a detailed interest in our work, I’ll certainly be taking an interest in what you both have to say about further developments in ACAP, as these unfold. We have certainly known all along that what is expressible and what is feasible to process are two different sets of things, and the key to successful implementation of ACAP is identifying the overlap between these two sets. However, our tests with one search engine so far do give us confidence that at least what we’ve tested is feasible.

REP. I agree that many publishers may find it easier to manage their ACAP expressions in a separate file from their conventional REP expressions. Several different strategies for managing access and usage policies are possible. A lot will depend upon what proves to be the best way of incorporating policy management into existing workflow and content management systems. One can even imagine extreme cases in which all permissions are embedded in content and robots.txt says very little of importance.

Dates and times. We decided to express time-limits in units of whole days because we couldn’t see a practical way in which it could be any more precise than that, given that a search engine cannot specify a precise time at which an item will be added to or removed from a massively-parallel indexing system. This is no better or worse than the current situation, in which a publisher doesn’t know how long it will take for a permission change in robots.txt to be reflected in a search engine index. An item might be queued up for indexing today but not hit the indexes for hours or even days.

Interpretation by prior arrangement. Our view is that what different crawler operators (i.e. search engines and other content aggregators) will find problematic will vary, so our current plan is, rather than define our own list of what is or is not in a fundamental list, to provide a mechanism whereby crawler operators can describe the ability of their crawler(s) to interpret ACAP in a “crawler description” record that they could publish as part of their crawler FAQ. A crawler description record could be a chunk of XML, or similar, that describes (inter alia) which features of ACAP have been implemented for general use, which have been implemented for limited use by prior arrangement, and which have not been implemented. This is an idea that has not yet got beyond the early internal drafting stage, so any thoughts on that would be welcome.
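
Purely as an illustration of the idea, and not as anything the ACAP project has drafted, a crawler description record and a reader for it might look something like the sketch below. Every element name, attribute name, support level, and the crawler name `ExampleBot` is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; all element, attribute, and feature names are invented.
RECORD = """\
<crawler-description crawler="ExampleBot">
  <feature name="disallow-crawl" support="general"/>
  <feature name="time-limit" support="prior-arrangement"/>
  <feature name="max-length" support="none"/>
</crawler-description>
"""


def support_by_level(xml_text: str) -> dict:
    """Group feature names by their declared support level."""
    levels: dict = {}
    for feature in ET.fromstring(xml_text).iter("feature"):
        levels.setdefault(feature.get("support"), []).append(feature.get("name"))
    return levels


print(support_by_level(RECORD))
```

A publisher’s tooling could read such a record to warn when an ACAP file relies on features a given crawler only honours by prior arrangement, or not at all.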

Used resource types. I agree that what we have proposed is a conflation of a number of properties in different dimensions. One dimension (version) is already made partially explicit by ‘current’ and ‘old’. It may be that these different dimensions could be made more explicit and more precise, but this is bound to lengthen the forms of expression. I should stress that we don’t see REP as providing a definitive rendition of ACAP semantics, just one that is sufficient for crawlers to use (although it must of course be well-defined and unambiguous). We’re already working on an XML expression of ACAP semantics that separates out some of the dimensions that you mention. However, I tend to espouse the “principle of functional granularity”, which states that something is only worth expressing if there is a need to express it. By that principle I’m not convinced that either originators or recipients of ACAP permissions need the multi-dimensional level of granularity that you propose. But I remain open to persuasion.

Snippets and thumbnails. You raise an interesting point about audio thumbnails that is being addressed as we turn our attention more to non-text resources. We know that some search engines include extracts from audio-visual resources in search results - are these snippets or thumbnails? I suppose they’re snippets if they’re extracts of specific relevance to the search criteria (i.e. composed of frame-sequences indexed using speech recognition), and thumbnails if not. Image feature analysis is not, as I understand it, practical for real-time web crawling/indexing, and most images are indexed based upon the text that surrounds them in a web page, so image snippets in the strict sense - i.e. details extracted from an image, based upon search criteria - are not really a practical proposition at the moment.

Francis Cave, ACAP Technical Project Manager

December 11, 2007 at 4:27 PM

James Grimmelmann

I like the idea of “crawler description” records. I’m guessing that the work that goes into creating the format and the experience of having search companies create them will feed back usefully into the design of the format on the content-provider side. That’s a good conversation to be encouraging; it should make the technical disagreements between search engines and content providers clearer, and help everyone find common ground.
