Anchor text for ClueWeb12

We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:

  • ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)
The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. Web pages were truncated at 50KB before the anchors were extracted. To keep the file manageable, the anchor text for a page is cut off once more than 10MB of anchors have been collected for it. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files with the fields (TREC-ID, URL, ANCHOR TEXT).
The anchor text extraction is described in (please cite the report if you use the data in your research):
The source code is available from: http://mirex.sourceforge.net.
(See also Anchor Text for ClueWeb09.)
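
The (TREC-ID, URL, ANCHOR TEXT) layout described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the MIREX code: it assumes one record per line, UTF-8 text, and that any additional tabs belong to the anchor text field. The names `AnchorRecord`, `parse_anchor_lines` and `read_anchor_file` are hypothetical and chosen for illustration.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One line of the anchor data: (TREC-ID, URL, ANCHOR TEXT)."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_lines(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Yield one record per well-formed line.

    Assumption: each line holds exactly one record, and any tab after
    the second one is part of the anchor text.
    """
    for line in lines:
        parts = line.rstrip("\r\n").split("\t", 2)
        if len(parts) != 3:
            # Skip lines that do not have all three fields.
            continue
        yield AnchorRecord(*parts)


def read_anchor_file(path: str) -> Iterator[AnchorRecord]:
    """Stream records from a gzipped file without loading it into memory.

    Assumption: the text is UTF-8; undecodable bytes are replaced
    rather than raising an error.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_anchor_lines(handle)
```

Because the download is about 30.4 GB gzipped, streaming is preferable to reading a whole file at once; for instance, `for record in read_anchor_file(path): ...` processes one page at a time, where `path` points at one of the downloaded files.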

9 Responses to “Anchor text for ClueWeb12”

  1. BetaLearner Says:
    Hi, we have tried to download ClueWeb12_Anchors, but it does not seem to be working. Is there any other way to obtain the data? Or, if we run MIREX on a Hadoop cluster with only ten machines, how long will it take? Thanks~
  2. Djoerd Hiemstra Says:

    We ran the anchor text extraction on 80 machines at the SARA supercomputing center in Amsterdam, where it took about 2 hours. The same task for ClueWeb09 on our own 16-machine cluster took 11 hours. My guess is that the anchor text extraction should finish on 10 machines in about 24 hours, but it really depends on the machines you have (our machines were bought in 2008 for 1000 Euro per machine, so they are neither modern nor very sophisticated).

    I am very sorry about the anchors being offline. The system seeding the anchors crashed, and it will not be up before August 5, because of holidays and internal reorganizations.

  3. Djoerd Hiemstra Says:

    The seeder is up again. If you are experiencing problems downloading or bad transfer rates, please check if your university blocks or slows down BitTorrent connections.

  4. Alessandro Sordoni Says:
    Dear Sir, I tried to download ClueWeb12_Anchors, but it does not seem to be working. Is the seeder up by now? Thanks a lot.
  5. Djoerd Hiemstra Says:

    Dear Alessandro, I am sorry, the seeder is off-line again; the seeding system is now replaced by another machine. Using BitTorrent was not a particularly good idea. We will try to get things running again in the upcoming days. Djoerd.

  6. Djoerd Hiemstra Says:

    Apologies, the torrent had to be recreated, so anyone who partially downloaded data before today has to start over by removing the old torrent and downloading the torrent on this page.

  7. Djoerd Hiemstra Says:

    Our ClueWeb anchor texts are now tracked by Academic Torrents. http://academictorrents.com

  8. Janek B. Says:
    The torrent seems to be dead again. Is there any way to still get the archive with the ClueWeb12 anchor texts? It’d be really helpful for some research we’re doing.
  9. Djoerd Hiemstra Says:

    Dear Janek, sorry for the late reply; this was sent exactly when I left for the holidays. I think it is on-line again, can you please check? Best wishes, Djoerd.
