Ptt1819 Assignment4

Domain Exploration (Draft)

Task

Get acquainted with the Wikipedia dumps: https://dumps.wikimedia.org/

  1. Explore articles from the categories in the table below.
  2. Create an overview of the kinds of topics that are covered (for example, computer games as a kind of software).
  3. Use lexico-syntactic patterns, text-based clustering, or graph-based analysis to identify groups of articles.
  4. Insight: Without a depth restriction, every category leads to around 6 million articles. Restricting the scope is therefore a possible simplification when extracting features.

No  Contribution          Group
1   Software engineering  Philipp, Bartosz, Nils
2   Software              Philipp, Bartosz, Nils
3   Hardware              Alexander, André
4   Learning
5   Animals               Tim, Anita
X   Open to further suggestions for similarly large categories.
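
The depth restriction mentioned in step 4 can be sketched as a breadth-first walk over the category graph that stops at a fixed depth. This is only a minimal illustration: the function name collect_articles, the two dictionaries, and the toy category names are assumptions made for this example. In practice the graph would have to be built from the dump first, for instance from the category link data.

```python
from collections import deque


def collect_articles(subcats, members, root, max_depth):
    """Return {article title: depth at which it was first reached}.

    subcats maps a category to its direct subcategories,
    members maps a category to the articles directly inside it.
    Both structures are hypothetical inputs for this sketch.
    """
    seen = {root}
    queue = deque([(root, 0)])
    articles = {}
    while queue:
        category, depth = queue.popleft()
        # Breadth-first order guarantees the first depth seen is the smallest.
        for title in members.get(category, ()):
            articles.setdefault(title, depth)
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in subcats.get(category, ()):
            if child not in seen:  # the category graph may contain cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return articles


# Toy data (invented for illustration, not taken from the dump).
subcats = {
    "Software": ["Video games", "Operating systems"],
    "Video games": ["Video game characters"],
    "Video game characters": ["Fictional animals"],
}
members = {
    "Software": ["Computer program"],
    "Video games": ["Tetris"],
    "Operating systems": ["Linux"],
    "Video game characters": ["Mario"],
    "Fictional animals": ["Bambi"],
}

shallow = collect_articles(subcats, members, "Software", max_depth=1)
deep = collect_articles(subcats, members, "Software", max_depth=3)
```

Recording the depth per article also helps with the question of where a domain breach occurs: articles that first appear at a large depth are the natural candidates to inspect.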

Submission

  • Submit your code in a repository and send us the link to it via E-mail.
  • Prepare an intermediate presentation of 10 mins:
    • How do you intend to process the data technically?
    • What features do you intend to retrieve?
    • What preprocessing steps are necessary?
    • How do you intend to process the features?
    • What is your hypothesis on a possible outcome of your exploration?
    • Present first insights.

Ideas

The following ideas may help you obtain an overview.

  • Which nouns are the most frequent ones over all articles?
  • Does a clustering based on nouns provide any primary insights?
  • Which hypernym relations exist? For example, in 'Java is a programming language', 'programming language' is a hypernym of 'Java'.
  • Can you identify frequent names using Named Entity Recognition technology?
  • Can you identify a depth level, where a completely unrelated domain is reached? Can you identify the culprit categories leading to this domain breach?
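
As one possible starting point for the hypernym idea, a single lexico-syntactic pattern ("X is a Y") can be matched with a regular expression. This is a rough sketch under simplifying assumptions: the pattern, the stop-word list, and the function name extract_is_a are chosen for this example only, and a real solution would more likely rely on part-of-speech tags or noun chunks than on capitalization.

```python
import re

# Capitalized term, then "is a"/"is an", then a run of lowercase words.
IS_A = re.compile(
    r"\b(?P<hypo>[A-Z]\w*(?: [A-Z]\w*)*) is an? (?P<hyper>[a-z]+(?: [a-z]+)*)"
)

# Words that usually end the hypernym phrase (a hand-picked assumption).
STOP = {"that", "which", "for", "of", "in", "used", "developed",
        "and", "with", "by", "from", "to", "on"}


def extract_is_a(text):
    """Return (hyponym, hypernym) pairs found via the 'X is a Y' pattern."""
    pairs = []
    for match in IS_A.finditer(text):
        words = []
        for word in match.group("hyper").split():
            if word in STOP:
                break  # cut the phrase at the first stop word
            words.append(word)
        if words:
            pairs.append((match.group("hypo"), " ".join(words)))
    return pairs


sample = ("Java is a programming language. "
          "Tetris is a puzzle video game developed by Alexey Pajitnov.")
pairs = extract_is_a(sample)
```

Counting how often each hypernym occurs across all articles of a category would then give a first, coarse grouping of the articles.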