Internet Archive
-
gowarc Public
Read and write WARC files in Go
-
Zeno Public
State-of-the-art web crawler 🔱
-
wiki-references-db Public
Data models and scripts to build a database of references (broadly defined) appearing on Wikipedia and other wikis
-
openlibrary Public
One webpage for every book ever published!
-
heritrix3 Public
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
-
RevisionChest Public
Transforms Wikipedia XML dumps into a more compact, stream-friendly format