Analysis assignment for ESE 2013 course


  • Assignment posted on 31 May 2013.
  • Phase 1 (team assembly and research sketch): 31 May 2013 — 5 June 2013.
  • Discussion/supervision on phase 1: 4 June 2013.
  • Register teams and project via email to softlang@uni-koblenz.de: 6 June 2013.
  • Phase 2 (research design and execution): 6 June - 26 June 2013.
  • Discussion/supervision on phase 2: 18 June 2013.
  • Final student presentations due on 27 June 2013.


Students are supposed to design and execute a research project that involves the analysis of software artifacts. Some overall research directions were discussed in preparation for the assignment. Team work (with a team size of 2-3) is strongly encouraged to reach a critical mass within the time available and to better challenge the research questions and methodology. The assignment consists of two phases. Phase 1 aims at the assembly of student research teams and agreement on a sketch of the research questions and methodology. Phase 2 concerns the actual research work. (The experiment assignment will run in parallel with the present assignment, at least in part, as it starts one or two weeks later.)


  • Define a working title for the project (of just a few words).
  • Define an overall (comprehensible, perhaps imprecise) research question.
  • Define one or more hypotheses to be supported or falsified.
  • Discuss elements of a theory from which your hypotheses follow.
  • Define a corpus of software artifacts to be used for the analysis.
  • Define the methodology of measuring and analysing.
  • Implement your analysis, gather, visualize, and interpret your data.
  • Discuss threats to validity and future work options.
  • If possible, relate to some other research papers for related work.

Available topics

Related or different topics may also be acceptable, but precise topics need to be agreed upon with the teaching staff during Phase 1.

  1. Metadata inference (see recent lecture)
  2. Code-sharing detection (6 interested students)
  3. Java bytecode comparison (3 interested students)
  4. PyPi build idioms (1 interested student)
  5. API usage analysis (2 interested students)
  6. Comment/source code alignment (1 interested student)
  7. API evolution/growth speed across hosting platforms (5 interested students)

The first two topics are particularly endorsed by the Software Languages Team because they directly relate to open research questions that the team needs to address anyway. Thus, the first two topics are described in some additional detail.

Metadata inference

See the lecture of the same name for some background.

Rationale: The overall goal is to work towards a single database that contains knowledge about technologies (packages, APIs, etc.) from many different platforms. In the 101project, at this point, we know of many Java APIs, but we don't know much about other platforms. It should be feasible to systematically integrate a few more platforms, as the corresponding package databases systematically list all available technologies.

Research sketch:

  • Pick a "platform" from which to extract metadata about technologies.
  • Locate metadata available for the technologies on a platform.
  • Identify an access path (an API?) for automated data extraction from the platform.
  • Retrieve a listing of technologies with metadata from the platform.
  • Determine what sort of tagging/categorization techniques are used.
  • Measure numbers of technologies and tags/categories.
  • Represent all results in JSON.
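The measurement and JSON-representation steps above can be sketched as follows. This is a minimal illustration in Python; it assumes a small hard-coded sample of metadata records in place of an actual platform API call, and the record fields (`name`, `categories`) are invented for illustration, not any platform's actual schema.

```python
import json
from collections import Counter

# Hypothetical sample of technology metadata, standing in for records
# retrieved from a platform's package index (e.g., via its web API).
sample_records = [
    {"name": "requests", "categories": ["http", "web"]},
    {"name": "flask", "categories": ["web", "framework"]},
    {"name": "numpy", "categories": ["scientific", "arrays"]},
]

def summarize(records):
    """Measure numbers of technologies and tags/categories,
    and represent the result as a JSON document."""
    tag_counts = Counter(tag for r in records for tag in r["categories"])
    summary = {
        "technologyCount": len(records),
        "distinctTags": len(tag_counts),
        "tagFrequencies": dict(tag_counts),
    }
    return json.dumps(summary, indent=2, sort_keys=True)

print(summarize(sample_records))
```

A real study would replace `sample_records` with records extracted from the chosen platform; the aggregation and JSON output would stay essentially the same.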

Optional, advanced questions:

  • Discuss whether usage of the technologies could be automatically detected.
  • Discuss whether platform data could be reasonably integrated into the 101project.

Code-sharing detection

Rationale: In the 101project, there are some languages for which there are many contributions. In fact, the most popular languages are Java and Haskell at this point (with several dozen contributions each). The contributions are known to be highly similar because they often share common code elements. For instance, there are just a few Java-based data models across the many different Java-based contributions. This research should help with assessing the similarity of the contributions, basically with the help of simple clone-detection techniques.

Research sketch:

  • Pick a "language". This should probably be Java, but other candidates are Python and JavaScript. (Haskell is probably already addressed by an ongoing effort.)
  • Perform filename similarity analysis.
  • Perform file content similarity analysis. (For instance, test for identical source files.)
  • Perform file fragment similarity analysis.
  • Determine some useful way of presenting the similarity information.
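The three similarity analyses above can be approximated with standard-library tools alone. The following Python sketch assumes an in-memory mapping from contribution names to their files (the sample contributions are invented for illustration); it compares filename sets via Jaccard similarity, detects byte-identical file contents via hashing, and scores fragment-level similarity with a text diff (`difflib`).

```python
import difflib
import hashlib

# Hypothetical contributions: name -> {filename: content}.
contributions = {
    "javaComposition": {"Company.java": "class Company { }",
                        "Cut.java": "class Cut { }"},
    "javaInheritance": {"Company.java": "class Company { }",
                        "Cut.java": "class Cut { int x; }"},
}

def filename_similarity(a, b):
    """Jaccard similarity of the two contributions' filename sets."""
    fa, fb = set(a), set(b)
    return len(fa & fb) / len(fa | fb)

def identical_files(a, b):
    """Shared filenames whose contents hash identically (clone candidates)."""
    digest = lambda s: hashlib.sha1(s.encode()).hexdigest()
    return [f for f in set(a) & set(b) if digest(a[f]) == digest(b[f])]

def fragment_similarity(a, b, f):
    """Character-based similarity ratio (0..1) of one shared file's content."""
    return difflib.SequenceMatcher(None, a[f], b[f]).ratio()

c1, c2 = contributions["javaComposition"], contributions["javaInheritance"]
print(filename_similarity(c1, c2))      # identical filename sets -> 1.0
print(identical_files(c1, c2))          # only Company.java is byte-identical
print(fragment_similarity(c1, c2, "Cut.java"))
```

Scaling this up would mean walking the actual contribution directories instead of an in-memory dictionary, and moving from whole-file diffs to fragment extraction, but the three comparison levels remain the same.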

This work may entail the development of simple fact or fragment extractors and clone detection tools (e.g., based on text diff). There is considerable experience with all these aspects in the 101project.


Code-sharing detection


  • Andreas Dausenau
  • Kevin Klein
  • Johannes Klöckner
  • Michael Monschau
  • Thomas Schmorleiz

Other data:

  • Languages under analysis: Java or JavaScript

Comment/source code alignment


  • Kevin Keul
  • Artur Daudrich
  • Marcel Heinz

Java bytecode comparison


  • Jan-Hendrik Borth
  • Matthias Paul
  • Peter Tissen

Metadata inference


  • Nicolas Beck
  • Tobias Keweloh
  • Kevin Klein
  • Jan Rüther
  • Thomas Schmorleiz