Ptt1819 Assignment4
Domain Exploration (Draft)
Task
Get acquainted with Wikipedia dumps. https://dumps.wikimedia.org/
- Explore articles from the categories in the table below.
- Create an overview of the kinds of topics that are covered (for example, computer games as one kind of software).
- Make use of lexico-syntactic patterns, text-based clustering, or graph-based analysis to identify groups of articles.
- Insight: Without a depth restriction, following the subcategories of any category leads to around 6 million articles. Restricting the traversal depth is therefore a useful simplification when extracting features.
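The depth restriction mentioned above can be sketched as a breadth-first walk over the category graph. This is only a minimal illustration: the function name `collect_categories`, the in-memory `subcats` mapping, and the toy graph are assumptions made for this example, not part of the assignment. In practice the mapping would have to be built from the dump first.

```python
from collections import deque


def collect_categories(subcats, root, max_depth):
    """Breadth-first walk over a category graph, stopping at max_depth.

    subcats: dict mapping a category name to a list of its subcategories
             (hypothetical in-memory structure built from a dump).
    Returns a dict {category: depth at which it was first reached}.
    """
    depth_of = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        depth = depth_of[current]
        if depth == max_depth:
            continue  # do not expand categories at the depth limit
        for child in subcats.get(current, []):
            if child not in depth_of:  # the graph may contain cycles
                depth_of[child] = depth + 1
                queue.append(child)
    return depth_of


# Toy graph, invented purely for illustration.
toy = {
    "Software": ["Video games", "Operating systems"],
    "Video games": ["Video game characters"],
    "Video game characters": ["Fictional animals"],
    "Fictional animals": ["Software"],  # cycle back to the root
}

print(collect_categories(toy, "Software", 2))
```

Because the first-reached depth is stored for every category, the same structure can be used to see at which depth an unrelated category (here "Fictional animals") first appears.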
No. | Category | Group |
---|---|---|
1 | Software engineering | Philipp, Bartosz, Nils |
2 | Software | Philipp, Bartosz, Nils |
3 | Hardware | Alexander, André |
4 | Learning | |
5 | Animals | Tim, Anita |
X | Open to further suggestions for similarly large categories. | |
Submission
- Submit your code in a repository and send us the link via email.
- Prepare an intermediate presentation of 10 minutes:
  - How do you intend to process the data technically?
  - What features do you intend to retrieve?
  - What preprocessing steps are necessary?
  - How do you intend to process the features?
  - What is your hypothesis on a possible outcome of your exploration?
  - Present first insights.
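As one possible starting point for the technical processing, the sketch below streams pages from a dump-style XML file and pulls out the category links in the wikitext. It is a sketch under assumptions: the helper names (`local`, `iter_pages`, `categories_of`), the embedded sample document, and its namespace URI are made up for illustration, and real dumps should be inspected before relying on this layout.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]] in wikitext.
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)")


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        yield title, text
        elem.clear()  # release the page's children once it is consumed


def categories_of(wikitext):
    """Return the category names linked from a page's wikitext."""
    return [name.strip() for name in CATEGORY_LINK.findall(wikitext)]


# Tiny stand-in for a dump file; the content is invented for this example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Java (programming language)</title>
    <revision><text>Java is a programming language. [[Category:Software]]</text></revision>
  </page>
  <page>
    <title>Cat</title>
    <revision><text>The cat is a domestic animal. [[Category:Animals|Cat]]</text></revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, categories_of(text))
```

For a compressed dump, the `io.BytesIO(SAMPLE)` stream could be replaced by a file object from `bz2.open(path, "rb")`, where `path` is wherever the downloaded file is stored.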
Ideas
The following ideas may help you obtain an overview.
- Which nouns are the most frequent ones over all articles?
- Does a clustering based on nouns provide any primary insights?
- Which hypernym relations exist? For example, in 'Java is a programming language', 'programming language' is a hypernym of 'Java'.
- Can you identify frequent names using Named Entity Recognition technology?
- Can you identify a depth level at which a completely unrelated domain is reached? If so, can you identify the culprit categories that lead to this domain breach?
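The hypernym idea above can be tried out with a very simple lexico-syntactic pattern of the form "X is a Y". The sketch below is a deliberately naive illustration: the regular expression, the stop-word list, and the names `IS_A`, `STOP`, and `extract_is_a` are assumptions made for this example. It only handles single-word capitalised subjects, and a real exploration would probably need part-of-speech tagging to delimit the noun phrases properly.

```python
import re
from collections import Counter

# "<Capitalised word> is a/an <lowercase words>" as a naive hypernym pattern.
IS_A = re.compile(r"\b([A-Z][\w+#]*) is an? ([a-z]+(?: [a-z]+)*)")

# Words at which the hypernym phrase is cut off (illustrative, not complete).
STOP = {"that", "which", "for", "of", "in", "and", "developed", "designed", "used"}


def extract_is_a(sentence):
    """Return (hyponym, hypernym phrase) pairs found in one sentence."""
    pairs = []
    for match in IS_A.finditer(sentence):
        words = []
        for word in match.group(2).split():
            if word in STOP:
                break
            words.append(word)
        if words:
            pairs.append((match.group(1), " ".join(words)))
    return pairs


# Example sentences written for this illustration.
sentences = [
    "Java is a programming language.",
    "Python is an interpreted programming language that emphasises readability.",
    "Git is a version control system.",
]

pairs = [pair for sentence in sentences for pair in extract_is_a(sentence)]
for hyponym, hypernym in pairs:
    print(hyponym, "->", hypernym)

# Count the head word (last word) of each hypernym phrase.
heads = Counter(hypernym.split()[-1] for _, hypernym in pairs)
print(heads.most_common())
```

Counting the head words of the extracted phrases gives a rough picture of which kinds of things a category contains, which ties in with the noun-frequency idea in the list.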
page revision: 6, last edited: 28 Jan 2019 16:20