
# Ecosystem Analytics

## Motivation
A popular form of software reuse is the use of distributable packages. Packages are modular
libraries that bundle reusable and extensible functionality. As an example, a software project in
need of parsing JSON files can use a parsing package instead of developing and maintaining this
functionality itself. Furthermore, developers can also build new packages on top of existing
packages. As an example, a developer building a general-purpose parser for the web can depend on
and build upon the JSON parser package.

To make packages available in software projects, package managers provide an automated workflow to
fetch remote packages, resolve compatibility constraints between them and then integrate them into
projects. Package managers can also be seen as "batteries included" for programming languages. For
example, pip[^5.1] and Cargo[^5.2], the official package managers for Python and Rust, provide
their users with a vast collection of community-developed and tested packages. The use of package
managers is also immensely popular; npm, the de facto package manager for JavaScript, recorded 18
billion package downloads in one month @Linux2016.

Additionally, there is an increasing trend towards creating open source projects. In the time
span of one week, npm registered 4,685 newly created packages @Linux2016. An enabling factor of
this trend is social coding platforms such as GitHub[^5.3], which provide an open environment for
package communities to develop, review and use newly developed projects. This leads to early
adoption and widespread use within package communities.

As a consequence, software projects that import functionality from remotely developed packages and
at the same time export functionality as a package for others are becoming increasingly
interdependent. In a shared environment such as npm or GitHub, the interdependence between software
projects forms a global graph-like structure known as a _software ecosystem_, where nodes represent
software projects and edges represent the dependencies between them.

The interdependence between software projects in a shared environment encompasses more than the
code that they depend on. Several attempts have therefore been made to define the phenomenon of
software ecosystems. Messerschmitt and Szyperski @Messerschmitt2003 state that a software ecosystem
is "a collection of software products that have some given degree of symbiotic relationships".
Along similar lines, Lungu @Lungu2009 adds a context to the symbiotic relationship between
projects: "A software ecosystem is a collection of software projects which are developed and
co-evolve in the same environment." An element overlooked by both definitions is the
socio-technical relationship between dependent projects. Mens et al. @Mens2013 incorporate this
aspect into Lungu's @Lungu2009 definition with the following extension: "by explicitly considering
the communities involved (e.g. user and developer communities) as being part of the software
ecosystem". Stallman @Stallman2002 opposes the use of the term ecosystem to describe
interdependent networks of software projects: "It is inadvisable
to describe the free software community, or any human community, as an ecosystem, because that word
implies the absence of ethical judgment." Although software ecosystems are subject to ethical
judgments, this does not prevent us from capturing interconnected software projects as an
ecosystem.

This chapter will use Mens et al.'s @Mens2013 definition of software ecosystems as it is the most
modern and comprehensive definition (i.e., also including the social interaction) to describe and
study interconnected software projects.

Software ecosystems are interesting from two analytics perspectives, one being the _landscape
perspective_ of contemporary software problems and the other being the _network perspective_ for
studying the propagation of contemporary software problems. Together they make up _ecosystem
analytics_. The two perspectives can yield meaningful insights to identify the spread of common
inefficiencies in software development processes or critical deficiencies in source code. As an
example, the network perspective of software ecosystems provides effective means to identify
projects affected by a vulnerability originating from a distant software project.
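
As a minimal sketch of the network perspective, the following Python snippet models an ecosystem as
a directed dependency graph and walks the reverse edges to collect every project that directly or
transitively depends on a vulnerable package. The package names and the `affected_projects` helper
are hypothetical and serve only as an illustration; real studies would build the graph from
registry metadata.

```python
from collections import defaultdict, deque


def affected_projects(dependencies, vulnerable):
    """Return all projects that directly or transitively depend on `vulnerable`.

    `dependencies` maps each project to the list of projects it depends on.
    """
    # Invert the edges so we can walk from a package to its dependents.
    dependents = defaultdict(set)
    for project, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(project)

    # Breadth-first traversal over the reverse dependency edges.
    affected = set()
    queue = deque([vulnerable])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)

    # The vulnerable package itself can only reappear in a cyclic graph.
    affected.discard(vulnerable)
    return affected


# Hypothetical toy ecosystem: project -> direct dependencies.
ecosystem = {
    "web-parser": ["json-parser"],
    "json-parser": ["string-utils"],
    "cli-tool": ["web-parser"],
    "logger": [],
}

print(sorted(affected_projects(ecosystem, "string-utils")))
# ['cli-tool', 'json-parser', 'web-parser']
```

In this toy graph, `cli-tool` never mentions `string-utils` directly, yet it is affected through two
intermediate packages, which is the kind of distant propagation the network perspective exposes.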

Through the lens of a literature survey, this chapter reviews and consolidates recent research on
ecosystem analytics from both a theoretical and practical standpoint. By synthesizing recent
studies, practices, open challenges and applications of ecosystem analytics, we aim to equip the
reader with a comprehensive overview and recommendations for future research in this field. To this
end, we have formulated the following three research questions:

* **RQ1**: What is the current state of the art in software analytics for ecosystem analytics?
* **RQ2**: What are the practical implications of the state of the art?
* **RQ3**: What are the open challenges in ecosystem analytics, for which future research is
  required?

Each of these research questions will be answered using recent papers written in this field of
research. This chapter is structured as follows: First, the research protocol is described in
detail. This includes decisions on which papers are included in the review. After this, the
research questions are answered using the previously stated set of papers.

## Research Protocol

We follow Kitchenham's @kitchenham2004procedures literature review protocol to systematically
arrive at relevant publications for answering the research questions in the previous section. As
suggested in @kitchenham2004procedures, the search strategy should be iterative and involve
consultation with experts in the field. Our search strategy consists of the following three
steps:

* Initial seed set from an expert in the field, MSc. Joseph Hejderup 
* Searching in a digital search engine, namely Google Scholar[^5.4]
* Selection of referenced papers based on findings from the previous two steps

### Initial seed

An expert in the field of ecosystem analytics, MSc. Joseph Hejderup, provided a list of thirteen
papers, tabulated in Table 4.1. For each paper in the table, we evaluate its relevance against our
research questions. In total, we consider three papers not to be relevant as they do not strongly
focus on the interconnections between projects (Table 4.2 presents a detailed explanation).

### Digital Search Engine

By identifying common and recurring keywords from the initial seed set, we construct the
following Google Scholar[^5.4] queries:

* "engineering software ecosystems" _(2014)_
* "software ecosystems" AND "empirical analysis" _(2018)_
* "software ecosystem" AND "empirical" _(2014)_
* "software ecosystem analytics" _(2014)_
* "software ecosystem" AND "analysis" _(2017)_

As we aim to uncover the state of the art in ecosystem analytics, we prefer the most recent
publications. Some queries, however, do not return any recent results. For example, restricting
the query "software ecosystem analytics" to 2018 or 2016 does not yield any findings. For three
queries, the most recent publications we could retrieve therefore date from 2014. We document this
by adding, in brackets, the year of the most recent publication to each query string above.

To determine the relevance of a paper, we examine the title and the abstract of each candidate
paper in each query result. Our selection process is the following: first, we read the title. If we
find the title to be relevant, we select the paper and continue with the next paper. If not, we
read the abstract as the last step. If the abstract is not relevant either, we discard the paper
and continue with the next paper. By following this process for each query result, we arrive at the
papers listed in Table 4.3.
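
The selection process can be sketched as a small two-stage filter. The snippet below is an
illustration only: it assumes that a paper with a relevant title is selected right away, and it
replaces the manual relevance judgment with two hypothetical predicate functions. The keyword check
and the candidate papers are made up for the example and are not the criteria or papers used in
this chapter.

```python
def select_papers(candidates, title_is_relevant, abstract_is_relevant):
    """Apply the title-first, abstract-second selection to a list of papers."""
    selected = []
    for paper in candidates:
        if title_is_relevant(paper["title"]):
            # A relevant title is enough; the abstract is not read.
            selected.append(paper)
        elif abstract_is_relevant(paper["abstract"]):
            # The title was not convincing, so the abstract decides.
            selected.append(paper)
        # Otherwise the paper is discarded.
    return selected


def mentions_ecosystem(text):
    """Hypothetical stand-in for a human relevance judgment."""
    return "ecosystem" in text.lower()


# Made-up candidate papers, purely for illustration.
candidates = [
    {"title": "Dependency updates in a software ecosystem", "abstract": ""},
    {"title": "Why developers update late", "abstract": "We study a package ecosystem."},
    {"title": "Faster loop optimisation", "abstract": "We speed up compiled loops."},
]

for paper in select_papers(candidates, mentions_ecosystem, mentions_ecosystem):
    print(paper["title"])
```

Passing the relevance checks in as functions keeps the control flow of the protocol separate from
the judgment itself, which in practice is made by a human reader.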

### Referenced papers
Based on the selected papers from the previous two steps, we extract their references and apply the
same selection strategy as for the digital search engine: we first read the title and then proceed
with the abstract if the title alone is not clearly relevant. The result is tabulated in Table 4.

## Answers
In this section, we answer our research questions by synthesizing information from our selected set
of papers.

### RQ1: What is the current state of the art in software analytics for ecosystem analytics?
<!-- Topics that are being explored, research methods, tools and datasets being used -->
To answer this research question, we explore topics in ecosystem analytics. For each explored
topic, we then summarize which research methods, tools, and datasets are being used.

#### Explored Topics
We find the usage of trivial packages, the impact of (breaking) changes in dependencies, the
quality of dependencies, and dependency networks to be prominent topics in ecosystem analytics.

Notable incidents like the Left-pad incident have sparked interest in studying trivial packages.
Research on this topic explores trivial packages both quantitatively, by analyzing how widely they
are used, and qualitatively, by investigating the reasons why developers choose to use them.
Schlueter @NPM2016 stresses that trivial packages can have a profound impact on software
ecosystems.

Breaking changes are another popular research topic. As with trivial packages, researchers
use both quantitative and qualitative methods. The impact of breaking changes and the way in
which developers react to these changes make it a very important topic.

Developing metrics to measure the health of an ecosystem is another popular area. Researchers are
experimenting with metrics to measure aspects such as the quality of dependencies.

Dependency networks allow researchers to gain insights into the evolution of software
ecosystems over time.

#### Research Methods
The studied papers cover a plethora of research methods. These methods can be divided into two
categories: quantitative and qualitative.

Many quantitative research papers analyze data from software ecosystems in a statistical manner.
The types of data depend heavily on the ecosystem used for analysis. Some papers go as far as using
the source code of packages @Abdalkareem2017, while other research focuses on the metadata of
software ecosystems, such as dependency networks @Kikas2017. Another recurring research method is
survival analysis, as used by @Decan2018, which estimates the survival rate of a population over
time. In software engineering, this has been successfully applied to open source projects.
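
To make the idea of survival analysis concrete, the sketch below implements the Kaplan-Meier
estimator, a standard non-parametric estimator of a survival function. This chapter does not state
which estimator the cited studies use, so the choice is an assumption, as are the function name
`kaplan_meier` and the toy data, which could be read as the number of months until a dependency is
updated.

```python
def kaplan_meier(durations, observed):
    """Estimate the survival function with the Kaplan-Meier estimator.

    durations[i] is the time until the event or until observation stopped.
    observed[i] is True if the event happened and False if the subject was
    censored (it was still "alive" when observation ended).
    Returns a list of (time, survival_probability) pairs, one per event time.
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects that have neither had the event nor been censored before t.
        at_risk = sum(1 for d in durations if d >= t)
        # Subjects whose event happens exactly at t.
        events = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Toy data: five subjects, two of which are censored.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]

for time, probability in kaplan_meier(durations, observed):
    print(time, round(probability, 2))
# 2 0.8
# 3 0.6
# 5 0.3
```

Censored subjects still count towards the population at risk until the moment observation stops,
which is what distinguishes this estimate from a plain fraction of subjects that had the event.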

Qualitative research within software ecosystems aims to gain a better understanding of the interplay
between developers and software ecosystems. Certain papers solely rely on the results of
qualitative research whereas some papers use both quantitative research and qualitative research to
triangulate their findings.

#### Ecosystems
As previously stated, studying dependencies between different software projects is one of the most
common topics. Therefore, researchers primarily study package managers and their centralized
repositories.

The most studied ecosystem is _npm_[^5.5], as used by Abdalkareem et al. @Abdalkareem2017, Bogart
et al. @Bogart2016, Decan, Mens, and Claes @Decan2017, Kikas et al. @Kikas2017, and Decan, Mens and
Grosjean @Decan2018. There are a few reasons why this is a popularly studied ecosystem: npm is the
largest software registry, containing more than twice as many packages as the next most populated
package registry in 2016 @Linux2016. Moreover, npm is the package manager of JavaScript, which is
the most used programming language according to a RedMonk survey @RedMonk2018. Another compelling
reason is that the majority of npm packages are openly developed and hosted on GitHub @Kikas2017.
This is also the case for RubyGems (Ruby) and Crates.io (Rust) @Kikas2017.

Recently, Decan, Mens, and Grosjean @Decan2018 used a data service called _libraries.io_, which
includes seven different packaging ecosystems: Cargo (Rust), CPAN (Perl), CRAN (R), npm
(JavaScript), NuGet (.NET), Packagist (PHP) and RubyGems (Ruby). Instead of manually scraping
dependency data from a single ecosystem, researchers should take advantage of a data source that
unifies multiple ecosystems, as it allows them to study one problem across multiple ecosystems at
once.

Apart from package-based ecosystems, several papers study other ecosystems. Bavota et al.
@Bavota2014 study the Apache ecosystem containing Java projects. For evaluating health metrics, Cox
et al. @Cox2015 use Maven Central. Claes et al. @Claes2015 study ten years of package
incompatibilities in testing and stable distributions of Debian i386. Robbes, Lungu, and
Röthlisberger @Robbes2012 opted for the Squeak/Pharo ecosystem. They state that this ecosystem
would provide support for answering their research questions.

#### Main research findings
Based on the findings of Abdalkareem et al. @Abdalkareem2017, trivial packages make up 16.8% of
npm packages. Although 10% of applications use trivial packages, only 45% of these packages have
test suites.

Robbes et al. @Robbes2012 mention the large impact API changes can have on an ecosystem. Bogart et
al. @Bogart2016 studied the attitude of developers towards breaking changes in dependencies. Their
main finding was that the ecosystem plays an essential role in the way developers deal with
breaking changes. Both papers conclude that developers generally do not respond in time to breaking
changes and that, as a result, breaking changes can have a large impact on a software ecosystem.
This conclusion is reinforced by the findings of Decan et al. @Decan2018, who show that frequent
changes can lead to an unstable dependency network due to transitive dependencies.

Not only do developers not react in a timely fashion to breaking changes, but Robbes et al.
@Robbes2012 also discover that developers are not quick to respond to API deprecation. To combat
this issue, Bavota et al. @Bavota2014 suggest that updates should only be done when they consist of
bug fixes, not API changes.

Attempts have also been made to find a metric that establishes the health of a dependency. Cox et
al. @Cox2015 contribute to this by providing a metric to establish the freshness of a dependency.
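
As a rough illustration of what a freshness measure could look like, the sketch below counts how
many releases a used dependency version lags behind the newest available release and averages that
lag over all dependencies of a project. This is one simple way to quantify freshness and is not
claimed to be the metric defined by Cox et al.; the function names, package names and version
histories are hypothetical.

```python
def releases_behind(release_history, used_version):
    """Count the releases published after `used_version`.

    `release_history` lists all versions of a package from oldest to newest.
    """
    newest_index = len(release_history) - 1
    return newest_index - release_history.index(used_version)


def mean_releases_behind(dependencies):
    """Average the release lag over all dependencies of one project.

    `dependencies` maps a package name to (release_history, used_version).
    """
    lags = [releases_behind(history, used) for history, used in dependencies.values()]
    return sum(lags) / len(lags) if lags else 0.0


# Hypothetical project with two dependencies.
project_dependencies = {
    "json-parser": (["1.0.0", "1.1.0", "2.0.0", "2.1.0"], "1.1.0"),
    "string-utils": (["0.9.0", "1.0.0"], "1.0.0"),
}

for name, (history, used) in project_dependencies.items():
    print(name, releases_behind(history, used))
# json-parser 2
# string-utils 0

print(mean_releases_behind(project_dependencies))
# 1.0
```

A count of missed releases ignores how large each release is and how long ago it was published, so
it should be read as a starting point rather than a complete health indicator.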

An interesting finding in the topic of package dependency networks by Kikas et al. @Kikas2017 is
that ecosystems, over time, become less dependent on a single popular package.

### RQ2: What are the practical implications of the state of the art?
<!-- Rather than theoretical, actual case studies done with findings -->
For this research question, we aim to find out the practical implications of the state of the art
as discussed in the previous section. As many of the discussed papers are case studies, we
summarize their findings in this section.

In a majority of the papers, we find that developers are slow to update their dependencies, or at
times do not update them at all. Hora et al. @Hora2016 suggest that the main reason for this is
that breaking changes cannot be solved in a uniform manner throughout an ecosystem, but rather need
a specific implementation for each system. We have also found that breaking changes are constantly
introduced when dependencies are updated. According to Raemaekers, van Deursen and Visser
@Raemaekers2017, about 33% of releases, either minor or major, contain a breaking change. Breaking
changes can cause compilation errors, thereby breaking the systems that depend on the changed
package.

Developers tend to react poorly to changes in their dependencies; Kula et al. @Kula2017 have found
in a survey of 4600 projects that 81.5% of the projects contain outdated dependencies with
potential security risks. Not only do developers not update their dependencies; according to an
empirical study conducted by McDonnell, Ray and Kim @McDonnell2013 on the Android API, they also
do not update their codebase with respect to the changes introduced by dependencies.

In the area of ecosystem health, Constantinou and Mens @Constantinou2017 have researched which
factors indicate that a developer is likely to abandon an ecosystem. Their study, which analyzed
GitHub[^5.3] issues and commits, has found that developers are more likely to abandon an ecosystem
when they 1) do not communicate with their fellow developers, 2) do not participate often in social
or technical activities and 3) have not reacted or committed for an extended period of time.
Another interesting characteristic of ecosystem health, studied by Kula et al. @Kula2017-2, is the
way in which projects age over time. Their study found that the usage over time of 81.7% of 4,659
popular GitHub projects can be fitted to a polynomial function of order higher than two.

Malloy and Power @Malloy2018 have studied the transition from Python 2 to Python 3. Python 3 has
been out since 2008, and the final major release of Python 2 (Python 2.7) dates from 2010. Both are
therefore (almost) 10 years old. Even so, Malloy and Power find that most Python developers choose
to maintain compatibility with both Python 2 and Python 3 by only using a subset of features from
Python 3. Malloy and Power @Malloy2018 state that developers are severely limiting themselves by
not using the new language features of Python 3.

Another interesting topic of research is the impact tools can have on ecosystems. Among these
tools are badges. Badges are annotations on software projects which display some information
about a software project. One of these badges can warn developers about outdated packages.
Based on the results of Trockman @Trockman2018, badges can have a positive impact on the speed at
which developers update their dependencies.

Overall, we can conclude that the way in which most developers currently manage their dependencies
leaves room for improvement. Dependencies are updated late or not at all, which exposes the
depending projects to risks such as the security issues of outdated dependencies. Dietrich, Jezek,
and Brada @Dietrich2014 have also found many problems in the Java ecosystem and have proposed a set
of relatively minor changes to both development tools and the Java language itself that could be
very effective. Possible improvements are highlighted in the answer to the last research question.

### RQ3: What are the open challenges in ecosystem analytics, for which future research is required?
<!--List of challenges, research questions and an aggregated set of open research items -->
This research question gives insight into the current open challenges in the field of ecosystem
analytics. It focuses on the challenges described in the studied papers.

The most common open challenge across almost all papers is the generalization of results. Most of
the studied papers base their results on a single ecosystem. This, in turn, means
that it is unclear whether the results hold for other ecosystems. For example, Claes et al.
@Claes2015 state that a possibility for future work is to investigate to what extent findings are
transferable between package-based software distributions.

However, Decan, Mens, and Grosjean @Decan2018 state, after researching dependency network evolution
for seven ecosystems, that they do not make any claims that their results can be generalized beyond
the main package managers for specific languages. This is because Decan, Mens, and Grosjean
@Decan2018 do not expect similar results for networks such as WordPress, as these packages tend to
be more high-level (e.g. used by end users instead of reused by other packages). This is also
reflected in the results obtained by Bogart et al. @Bogart2016, which show that values differ
per ecosystem. Overall, this shows that there is a lot of room for future research on
generalizing results beyond the already researched ecosystems.

Another persistent open challenge is the ability to determine the health of an ecosystem. Although
Jansen @Jansen2014 has provided OSEHO, "a framework that is used to establish the health of an open
source ecosystem", Jansen @Jansen2014 notes that "there is surprisingly little literature available
about open source ecosystem health". Kikas et al. @Kikas2017 agree, stating that a general goal
should be to provide analytics for maintainers about the overall ecosystem trends.

Furthermore, this challenge is related to determining the health of an individual system. Kikas et
al. @Kikas2017 state that "a measure quantifying dependency health in an ecosystem should be
developed". Moreover, according to Jansen @Jansen2014, determining the health of a system from an
ecosystem perspective is required to determine which systems to use. This problem also ties into
developing mechanisms that assist developers in selecting packages, in particular in finding the
best dependency according to the functional needs of the existing application.
Abdalkareem et al. @Abdalkareem2017 state that helping developers find the packages that best suit
their needs still has to be addressed. Kikas et al. @Kikas2017 agree that another general goal is
to provide maintainers with improved tooling to manage their dependencies.

Once dependencies have been chosen, another open challenge is to assist maintainers in keeping
them up to date. In order to find out when dependencies should be updated, there is a need
for developing new metrics. Bavota et al. @Bavota2014 state that their observations could be a
starting point to build recommenders for supporting developers in complex dependency upgrades. Cox
et al. @Cox2015 provide "a metric to aid stakeholders in deciding on whether the dependencies of a
system should be updated". However, Cox et al. @Cox2015 also list multiple refinements of this
metric that could still be researched.