CatalogTodo r1 - 17 Apr 2005 - 20:14 - Main.joe


Unlike most other distance learning catalogs, the main purpose of the GNA distance learning catalog is to serve as a test bed for research and development of web technologies. Two technologies which are extensively used in this catalog are automated data extraction and automated topic classification.

The first technology uses pattern matching to extract course information from a web page. The system is trained by adding tags to a web entry, and it then uses an edit distance algorithm to extract the course information. Courses are added through a three-tier process, depending on how many there are:

  • 1-5 courses are added through our form-based entry
  • 5-50 courses are typed in manually, with data entry outsourced to http://www.suntecindia.com/
  • 50+ courses are added using the automated data extractor
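
As an illustration of the edit distance idea behind the automated extractor, here is a minimal sketch in Python. It is not the catalog's actual code. The function names, the field labels, the "label: value" line layout, and the threshold `max_distance` are all assumptions made for this example. It computes the Levenshtein distance and uses it to match slightly varying field labels on a page against the labels learned from a tagged entry.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


def extract_fields(labels, page_lines, max_distance=2):
    """Hypothetical helper: map 'label: value' lines to the closest trained label.

    A line is accepted only if its label is within max_distance edits of a
    trained label; everything else is ignored.
    """
    fields = {}
    for line in page_lines:
        if ":" not in line:
            continue
        prefix, value = line.split(":", 1)
        prefix = prefix.strip().lower()
        best = min(labels, key=lambda lab: edit_distance(prefix, lab.lower()))
        if edit_distance(prefix, best.lower()) <= max_distance:
            fields[best] = value.strip()
    return fields


if __name__ == "__main__":
    page = ["Titel: Intro to Logic", "Instructors: A. Smith", "Posted: today"]
    print(extract_fields(["title", "instructor"], page))
```

In this sketch a misspelled or pluralised label still matches its trained counterpart, while an unrelated label such as "Posted" is too far away and is skipped. A real extractor would likely compare page structure rather than just labels.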

The second technology still needs some work. We began with naive Bayesian classification, but we still get a lot of odd matches. We are trying to improve topic classification.
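
For readers unfamiliar with naive Bayesian classification, the following is a minimal Python sketch of a multinomial naive Bayes topic classifier with add-one smoothing. It is not the catalog's implementation. The class name, the whitespace tokenizer, and the training sentences and topic names are assumptions chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTopicClassifier:
    """Multinomial naive Bayes over whitespace-separated words (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of training texts
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def train(self, text, topic):
        words = self.tokenize(text)
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the topic with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_topic, best_score = None, -math.inf
        for topic, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[topic].values())
            for word in self.tokenize(text):
                count = self.word_counts[topic][word]
                # add-one (Laplace) smoothing keeps unseen words from zeroing the score
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


if __name__ == "__main__":
    clf = NaiveBayesTopicClassifier()
    # Hypothetical training data, not taken from the catalog
    clf.train("introduction to calculus and linear algebra", "mathematics")
    clf.train("organic chemistry laboratory methods", "chemistry")
    clf.train("algebra and geometry for beginners", "mathematics")
    print(clf.classify("linear algebra basics"))
```

Note that this kind of classifier always returns some topic, even for a description whose words never appeared in training. That is one way odd matches can arise in a sketch like this one.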

Another area of research we have added is a system of filter plugins, which resembles the tagging used in social bookmarking.
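
One way to picture tag-like filter plugins is a registry of named predicates, where each filter that matches a course acts as a tag on it. The Python sketch below rests entirely on that assumption. The registry, the decorator, the course fields (`level`, `fee`) and the two example filters are hypothetical and are not taken from the catalog's code.

```python
from typing import Callable, Dict, List, Set

Course = Dict[str, str]

# Hypothetical plugin registry: tag name -> predicate over a course record
FILTERS: Dict[str, Callable[[Course], bool]] = {}


def filter_plugin(tag: str):
    """Decorator that registers a predicate under a tag name."""
    def register(func: Callable[[Course], bool]) -> Callable[[Course], bool]:
        FILTERS[tag] = func
        return func
    return register


@filter_plugin("high-school")
def is_high_school(course: Course) -> bool:
    # Assumes a free-text "level" field on the course record
    return "high school" in course.get("level", "").lower()


@filter_plugin("free")
def is_free(course: Course) -> bool:
    # Assumes a "fee" field stored as a string
    return course.get("fee") == "0"


def tags_for(course: Course) -> Set[str]:
    """All tags whose filter matches the course."""
    return {tag for tag, matches in FILTERS.items() if matches(course)}


def courses_with(tag: str, courses: List[Course]) -> List[Course]:
    """Courses selected by a single filter, like browsing a tag."""
    return [c for c in courses if FILTERS[tag](c)]


if __name__ == "__main__":
    catalog = [
        {"title": "Algebra I", "level": "High School", "fee": "0"},
        {"title": "Organic Chemistry", "level": "Undergraduate", "fee": "120"},
    ]
    print(tags_for(catalog[0]))
    print([c["title"] for c in courses_with("high-school", catalog)])
```

With this design, adding a new filter means writing one small function, and the set of matching filters for a course behaves much like the set of tags on a bookmark.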

Aside from this, the whole system runs on PostgreSQL under Mandriva Cooker. This is bleeding-edge software, so we find a number of bugs, which we report to the open source community.

=List of catalog projects=

  • Add filter for high school courses
  • Create an e-mail (mailto) link for new courses
This site is powered by the TWiki collaboration platform. Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors and is licensed under the terms of the GNU Free Documentation License.