This was extracted (@ 2024-11-20 22:10) from a list of minutes which have been approved by the Board.
Please note: the Board typically approves the minutes of the previous meeting at the beginning of every Board meeting; therefore, the list below does not normally contain details from the minutes of the most recent Board meeting.

WARNING: these pages may omit some original contents of the minutes.
This is due to changes in the layout of the source minutes over the years. Fixes are being worked on.

Meeting times vary; the exact schedule is available to ASF Members and Officers. Search for "calendar" in the Foundation's private index page (svn:foundation/private-index.html).

Nutch

16 Oct 2024 [Sebastian Nagel / Shane]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (14 years ago)
There are currently 23 committers and 23 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Joe Gilvary was added as committer and PMC member on 2024-08-07

## Project Activity:
1.20 was released on 2024-04-24.

The work on Nutch 1.21 is ongoing with bug fixes and improvements related to
the protocol layer, the fetcher, Java code quality and documentation. A draft
implementation of a protocol plugin to crawl SAMBA shares is under review.

## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) is on a low but steady level.

17 Jul 2024 [Sebastian Nagel / Justin]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (14 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
1.20 was released on 2024-04-24.

Development activity was focused on bug fixes and minor improvements.
We discussed an upgrade of the project to Java 17.

## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) is on a low but steady level.

17 Apr 2024 [Sebastian Nagel / Sander]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
Nutch 1.19 was released on 2022-08-22.

Work on the next Nutch release 1.20 continued and the release should land
during the next weeks. Code contributions in the last quarter focused
on upgrades of dependencies and the build configuration for the next release
but also included bug fixes and improvements.

We proposed a project for GSoC'24: "Overhaul the Nutch plugin framework".

## Community Health:
The number of contributions has slightly increased during the last three months.

17 Jan 2024 [Sebastian Nagel / Bertrand]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
Nutch 1.19 was released on 2022-08-22.

Work on the next Nutch release 1.20 continued at a slow pace, with a few bug fixes,
improvements to the code and build system, and an important dependency upgrade
(Tika 2.9.0 to 2.9.1) to address a CVE.

## Community Health:
The number of contributions decreased during the last month after a very
active preceding quarter.

18 Oct 2023 [Sebastian Nagel / Justin]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
Nutch 1.19 was released on 2022-08-22.

Work on the next Nutch release 1.20 continues. Important contributions were
the resolution of dependency conflicts around logging libraries (slf4j2 and
reload4j) to finally remove Log4j 1.x from all Nutch plugins, an upgrade of the
Apache Tika dependency, which required resolving a dependency conflict
(commons-io required in different versions by Tika and Hadoop), and
improvements to robots.txt handling to fully implement RFC 9309.

We discussed a road map for future Nutch features.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing
lists) significantly went up during the last weeks after summer, because
of a new active committer and increased preparations for the next release.

19 Jul 2023 [Sebastian Nagel / Shane]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Project Status:
Current project status: Ongoing with low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.19 was released on 2022-08-22.

Work on the next Nutch release continues with some bug fixes, improvements and
dependency upgrades. An indexer plugin for OpenSearch 2.x is under discussion.

## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) is on a low level, below that of the previous quarter.

19 Apr 2023 [Sebastian Nagel / Justin]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Shashanka Balakuntala Srinivasa on
 2020-08-01.
- No new committers. Last addition was Shashanka Balakuntala Srinivasa on
 2020-07-25.

## Project Activity:
Nutch 1.19 was released on 2022-08-22.

Work on the next Nutch release continues. A notable addition is the OpenSearch
indexer plugin. Remaining work was about bug fixes and dependency upgrades.

## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) is on a low but steady level.

18 Jan 2023 [Sebastian Nagel / Bertrand]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.19 was released on 2022-08-22.

Work on the next Nutch release continues. One focus was on resolving
dependency conflicts around logging libraries (slf4j2 and reload4j) to finally
remove Log4j 1.x from all Nutch plugins.


## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) is on a low but steady level.

19 Oct 2022 [Sebastian Nagel / Roman]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Final work on the release candidate
focused on dependency upgrades and adapting Docker build files to changes
introduced with 1.19.

We discussed whether we need to be concerned about growing the
community, given the long time since the last committer
addition. We agreed that we need to keep an eye on this, but also
concluded that Nutch is a success story when you
consider that it still has a community 12 years after becoming an Apache
TLP and 22 years after the first commit, also given that batch-based
crawlers aren't cutting-edge technology anymore.

## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) is on a steady level, with a visible increase in activity
during the release of 1.19.

20 Jul 2022 [Sebastian Nagel / Roy]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing with 11 Jira issues opened, 9 resolved during
the last quarter. Ongoing work includes the transition from Ant to Gradle to
build Nutch, dependency upgrades, improved error handling in the fetcher and
support for non-standard protocol implementations (e.g. smb://) and the
corresponding URLStreamHandlers by Nutch plugins.

## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) is on a low but steady level.

20 Apr 2022 [Sebastian Nagel / Sander]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing with 15 Jira issues opened, 10 resolved during
the last quarter. Ongoing work includes the transition from Ant to Gradle to
build Nutch, fixes to the fetcher and the protocol layer.

## Community Health:
The number of contributions (Jira issues, commits, activity on the
mailing lists) was on a comparably low level in the past quarter.

19 Jan 2022 [Sebastian Nagel / Roy]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing with 30 Jira issues opened, 27 resolved during
the last quarter. Ongoing work includes major dependency upgrades (Tika, log4j),
a review of Nutch job metrics, improvements of the protocol layer to also
include non-standard URL schemes (e.g. smb).

The migration of our website away from the Apache CMS to Hugo was finally done.

## Community Health:
Contributions (Jira issues, commits, activity on the mailing lists) have shown
an increase after a quiet summer quarter.

20 Oct 2021 [Sebastian Nagel / Roy]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing. The focus was on upgrading dependencies and
improving the HTTP protocol plugins.

We've split the Nutch WebApp (a web GUI to set up and run crawls) into
a separate repository to reduce the number of core dependencies and to
make the Nutch core codebase more maintainable and more secure, given
that there is little development happening on the WebApp code.

The migration away from the Apache CMS is still pending. We made some
progress during the last 3 months but work is proceeding slowly.

## Community Health:
Contributions (Jira issues, commits, activity on the mailing lists)
slowed down over the summer but have shown some increase during the last weeks.

21 Jul 2021 [Sebastian Nagel / Roman]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing.

The Nutch Docker image was upgraded to be based on Java 11 together with
a significant reduction of the Docker image size. We voted to accept the
donation of the Nutch-Helm project which enables the deployment of Nutch
containers on Kubernetes. The project was written by Lewis John McGibbney
(also a committer and PMC member of Nutch) and shall continue as a separate
code base under the umbrella of the Nutch project.

The Nutch PMC recently participated in the University of Southern
California Computer Science Senior Capstone Program which delivered
Fireant, a Dependabot-like service (tailored to Apache Ant + Ivy projects)
which creates pull requests to keep your dependencies secure and
up-to-date. More info can be found at https://github.com/fireant-bot/fireant.
We will most likely engage in an IP CLEARANCE process to donate Fireant to
either the Nutch or the Ant PMC in due course.

The Nutch PMC will participate in the Oregon State University Computer
Science Senior Capstone Program with a 9-month project which will primarily
focus on reimplementing the legacy Nutch build system (Ant + Ivy) with
either Maven or Gradle.

The migration away from the Apache CMS is still pending and has not
made any progress during the last 3 months.

## Community Health:
Traffic on mailing lists, issue reports and code contributions are on
a steady level.

21 Apr 2021 [Sebastian Nagel / Sheng]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala
 Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.18 was released on 2021-01-24 and fixes the XXE injection
vulnerability (CVE-2021-23901) reported on 2021-01-04.

Work on Nutch 1.19 is ongoing. As an important step forward, we completed
the upgrade to build and run on JDK 11.

The migration away from the Apache CMS is still pending and has not
made any progress during the last 3 months.

## Community Health:
Traffic on mailing lists, issue reports and code contributions are on
a low but steady level.

20 Jan 2021 [Sebastian Nagel / Niclas]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project based on Apache Hadoop® data structures and the MapReduce
data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

The last committer and PMC addition was Shashanka Balakuntala Srinivasa
on 2020-08-01.

## Project Activity:
Work on Nutch 1.18 continues with 13 JIRA issues opened and 12 resolved since
the last report.

The migration away from the Apache CMS has not made any progress during the
last 3 months.

We started work on running Nutch on Apache Tez as an alternative to MapReduce
on Hadoop YARN.

## Community Health:
Traffic on mailing lists and development activity are on a low but steady level.

21 Oct 2020 [Sebastian Nagel / Sander]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.


## Membership Data:
Apache Nutch was founded 2010-04-21 (10 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Shashanka Balakuntala Srinivasa was added to the PMC on 2020-08-01
- Shashanka Balakuntala Srinivasa was added as committer on 2020-07-25

## Project Activity:
Work on Nutch 1.18 continued (12 issues opened, 21 issues resolved).
Key aspects of the issues opened and worked on are improvements in the
build system (GitHub workflows for PRs, integration of SpotBugs
targets) and documentation of third-party licenses.

Notified by Infra about the end of the Apache CMS, we decided
to migrate the Nutch site to Hugo, following the Apache Jena
migration path. While initial steps are done, the long road of
converting content and templates is still ahead.

## Community Health:
Commit activity continued over summer but slowed down in the last weeks.
Traffic on the user mailing list is on a low but steady level.

15 Jul 2020 [Sebastian Nagel / Justin]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (10 years ago)
There are currently 20 committers and 20 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Roannel Fernandez on 2018-06-23.
- No new committers. Last addition was Roannel Fernandez on 2018-06-23.

## Project Activity:
In April we celebrated 10 years as an Apache top-level project.

Nutch 1.17 was released on 2020-06-18 with 60 issues resolved. Work on 1.18
has started.

## Community Health:
Traffic on mailing lists has somewhat increased and we see contributions
from new users.

15 Apr 2020 [Sebastian Nagel / Niclas]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (10 years ago)
There are currently 20 committers and 20 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Roannel Fernandez on 2018-06-23.
- No new committers. Last addition was Roannel Fernandez on 2018-06-23.

## Project Activity:
Work on 1.17 is proceeding with about 25 issues resolved so far, 14 more since
the last board report.

## Community Health:
Traffic on mailing lists has decreased significantly. Questions about Nutch
usage have moved from the user mailing list (13 mails during the
last quarter) to Stack Overflow (about 25 questions, see [1]).

[1] https://stackoverflow.com/search?tab=Newest&q=nutch%20is%3aquestion

15 Jan 2020 [Sebastian Nagel / Dave]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (9 years ago)

No new PMC members added in the last 3 months. Currently 20 committers and PMC
members.

Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

## Project Activity:

On Oct 11, 2019, both 1.16 and 2.4 were released.

With the release of 2.4 we resolved CVE-2016-6809, a vulnerability caused by
an upstream dependency. We also announced that we are retiring development on
the 2.x branch and advise users to use the 1.x/master branch instead.

Work on 1.17 is proceeding with about 11 issues resolved so far.

## Community Health:
The traffic on the mailing lists is on a steady level. We received
a couple of code contributions (PRs addressing open issues) from
a new contributor.

16 Oct 2019 [Sebastian Nagel / Myrle]

## Description:
Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (9 years ago)

No new PMC members added in the last 3 months. Currently 20 committers and
PMC members.

Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

## Project Activity:
The issues with the dependency management (see NUTCH-2669 and linked
issues) disappeared after an upgrade to the latest Tika version.

The releases of 1.16 and 2.4 are in progress; so far the votes have passed
successfully and we are waiting for the release artifacts to be mirrored.

With the release of 2.4 we will retire development on the 2.x branch, as no
committer is actively working on it. We will advise users to use the
1.x/master branch instead.

The migration of the Nutch Wiki from MoinMoin to Confluence was
finished in July but cleanup and restructuring are still desired.

There have been discussions about attracting contributors via the
Outreachy initiative or the Hacktoberfest event.

## Community Health:
There was a significant increase in code commits and resolved Jira issues.
Also the traffic on the mailing lists went up slightly. Reviews of and votes
for the release candidates have been done by
- (1.16) 6 committers + 1 user
- (2.4)  4 committers

17 Jul 2019 [Sebastian Nagel / Roman]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.15 was released on Aug 09 2018.

The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

Development on Nutch 1.16 has continued at a slower pace.
Issues with the dependency management (see NUTCH-2669 and linked
issues) currently block the releases. We still need to wait for
a fix in an upcoming Ivy release, find a reliable work-around,
or move to Maven as the build system (NUTCH-2292).

Migration of the Wiki from MoinMoin to Confluence is still in progress,
waiting for INFRA-18528 to add missing pages which failed to convert.


COMMUNITY

No new PMC members added in the last 3 months. Currently 20 committers and
PMC members.

Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list has dropped to a lower level:

 - user@nutch.apache.org:
    - 995 subscribers (down -12 in the last 3 months):
    - 29 emails sent to list (63 in previous quarter)

 - dev@nutch.apache.org:
    - 480 subscribers (down -9 in the last 3 months):
    - 166 emails sent to list (276 in previous quarter)
      (mostly machine-generated emails from Jira, Jenkins, Wiki)


JIRA ACTIVITY

 - 13 JIRA tickets created in the last 3 months
 - 20 JIRA tickets closed/resolved in the last 3 months

17 Apr 2019 [Sebastian Nagel / Phil]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.15 was released on Aug 09 2018.

The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

While work on Nutch 1.16 continues, we still plan to release Nutch 2.4
first. Issues with the dependency management (see NUTCH-2669 and linked
issues) currently block the releases. We need to wait for a fix in an upcoming
Ivy release or find a reliable work-around.

Migration of the Wiki from MoinMoin to Confluence is in progress
(waiting for INFRA-18076). We also plan to review the Wiki content and
structure, cleaning up outdated pages.


COMMUNITY

No new PMC members added in the last 3 months. Currently 20 committers and PMC
members.

Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 492 subscribers (down -8 in the last 3 months):
    - 278 emails sent to list (379 in previous quarter)

 - user@nutch.apache.org:
    - 1020 subscribers (down -15 in the last 3 months):
    - 63 emails sent to list (116 in previous quarter)


JIRA ACTIVITY

 - 25 JIRA tickets created in the last 3 months
 - 30 JIRA tickets closed/resolved in the last 3 months

16 Jan 2019 [Sebastian Nagel / Brett]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.15 was released on Aug 09 2018.

The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY
A discussion on the dev mailing list resulted in overall agreement to move the
build from Ant/Ivy to Maven: users are more familiar with Maven and we expect
it to simplify the release procedure.

The release of 1.16 should land in the next weeks with already 40 issues
resolved.


COMMUNITY

No new PMC members added in the last 3 months. Last committer and PMC addition
was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 500 subscribers (down -10 in the last 3 months):
    - 419 emails sent to list (309 in previous quarter)
      (mostly machine-generated emails from Jira, Jenkins, Wiki)

 - user@nutch.apache.org:
    - 1035 subscribers (down -1 in the last 3 months):
    - 116 emails sent to list (86 in previous quarter)


JIRA ACTIVITY

 - 36 JIRA tickets created in the last 3 months
 - 32 JIRA tickets closed/resolved in the last 3 months

17 Oct 2018 [Sebastian Nagel / Brett]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.15 was released on Aug 09 2018.

The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

The last release of the Nutch 1.x branch (1.15) brought
several major improvements: an upgrade to the new MapReduce
API, the possibility to index documents into multiple Solr
or Elasticsearch instances with configurable "routing",
improvements and fixes of the existing HTTP/HTTPS protocol
plugin and a new HTTP protocol plugin which supports http/2.

Current work includes minor improvements to the HTTP/HTTPS
protocol plugins and fixes of regressions relating to the
MapReduce API upgrades.


COMMUNITY

No new PMC members added in the last 3 months.
Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 511 subscribers (down -8 in the last 3 months):
    - 343 emails sent to list (742 in previous quarter)
      (mostly machine-generated emails from Jira, Jenkins, Wiki)

 - user@nutch.apache.org:
    - 1037 subscribers (down -5 in the last 3 months):
    - 99 emails sent to list (78 in previous quarter)


JIRA ACTIVITY

 - 32 JIRA tickets created in the last 3 months
 - 34 JIRA tickets closed/resolved in the last 3 months

18 Jul 2018 [Sebastian Nagel / Shane]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.14 was released on Dec 22 2017.

The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

A release of the Nutch 1.x branch (1.15) is in preparation and
should be made during the next two weeks.  It will include an
upgrade to the new MapReduce API (NUTCH-2375, now considered
stable), improvements to index data using various indexing
backends (Solr, etc.), improvements and fixes of the existing
HTTP/HTTPS protocol plugin and a new HTTP protocol plugin
which supports http/2.


COMMUNITY

We have gained two new committers since the last report, both of whom
joined us in June 2018:
- Omkar Reddy who completed his Nutch GSoC project in 2017
- Roannel Fernandez who contributed multiple improvements
  of indexer plugins

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 521 subscribers (down -5 in the last 3 months):
    - 750 emails sent to list (562 in previous quarter)
      (mostly machine-generated emails from Jira, Jenkins, Wiki)

 - user@nutch.apache.org:
    - 1043 subscribers (down -9 in the last 3 months):
    - 80 emails sent to list (189 in previous quarter)


JIRA ACTIVITY

 - 61 JIRA tickets created in the last 3 months
 - 88 JIRA tickets closed/resolved in the last 3 months

18 Apr 2018 [Sebastian Nagel / Bertrand]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.14 was released on Dec 22 2017.

The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

We are working hard to make the Nutch 1.x branch stable after the upgrade
to the new MapReduce API (NUTCH-2375), a huge change touching over 200 files
which introduced several regressions.

The release of 2.4 is on the agenda.

We still hope to participate again in GSoC but have no student applications
yet.


JIRA ACTIVITY

 - 59 JIRA tickets created in the last 3 months
 - 34 JIRA tickets closed/resolved in the last 3 months


COMMUNITY

No new committers and PMC members in the last 3 months,
last PMC addition was Ralf Kotowski on Wed Jun 14 2017.

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 525 subscribers (down -4 in the last 3 months):
    - 571 emails sent to list (872 in previous quarter)
      (mostly machine-generated emails from Jira, Jenkins, Wiki)

 - user@nutch.apache.org:
    - 1050 subscribers (down -10 in the last 3 months):
    - 189 emails sent to list (166 in previous quarter)

17 Jan 2018 [Sebastian Nagel / Rich]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.14 was released on Dec 22 2017.

The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

We released Nutch 1.14 in December with 37 issues fixed and 41 new features
and improvements, contributions made by 20 developers.

We see increased activity on Jira and GitHub, with contributions from new
developers whom we are keeping on the radar for committership.

We plan to release 2.4 during the next weeks.

We hope to participate again in GSoC and continue with the project
"Graph Generator Tool for Nutch", which Omkar Reddy started during GSoC 2017.
That work made huge progress upgrading the Hadoop API but did not address the
graph generation as described.


JIRA ACTIVITY

 - 54 JIRA tickets created in the last 3 months
 - 64 JIRA tickets closed/resolved in the last 3 months


COMMUNITY

No new committers and PMC members in the last 3 months, last PMC addition was
Ralf Kotowski on Wed Jun 14 2017.

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 530 subscribers (down -10 in the last 3 months)
    - 877 emails sent in the past 3 months (484 in the previous quarter)
      (mostly machine-generated emails from Jira, Jenkins, Wiki)

 - user@nutch.apache.org:
    - 1060 subscribers (down -8 in the last 3 months)
    - 166 emails sent in the past 3 months (212 in the previous quarter)

18 Oct 2017 [Sebastian Nagel / Ted]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop® data
structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

There has been no release since the last board report:
- Nutch 1.13 was released on Apr 1, 2017
- the last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

Issues
- 39 JIRA tickets created in the last 3 months
- 28 JIRA tickets closed/resolved in the last 3 months

Omkar Reddy has finished his GSoC 2017 project.


COMMUNITY

No new committers and PMC members in the last 3 months, last PMC addition was
Ralf Kotowski on Wed Jun 14 2017.

The traffic on the user mailing list is at a steady level:

- dev@nutch.apache.org:
 - 539 subscribers (up 3 in the last 3 months):
 - 537 emails sent to list (390 in previous quarter)
   (including 500 machine-generated emails from Jira,
    Jenkins, Apache Nutch Wiki)

- user@nutch.apache.org:
 - 1069 subscribers (down -12 in the last 3 months):
 - 216 emails sent to list (182 in previous quarter)

19 Jul 2017 [Sebastian Nagel / Brett]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

There has been no release since the last board report:
- Nutch 1.13 was released on Apr 1, 2017
- the last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

Issues
 - 28 JIRA tickets created in the last 3 months
 - 18 JIRA tickets closed/resolved in the last 3 months

Omkar Reddy is working on the GSoC 2017 project
"NUTCH-2369 - Graph Generator Tool for Nutch".


COMMUNITY

Ralf Kotowski became a committer and PMC member on Wed Jun 14 2017

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 536 subscribers (down -4 in the last 3 months):
    - 390 emails sent to list (356 in previous quarter)

 - user@nutch.apache.org:
    - 1084 subscribers (down -11 in the last 3 months):
    - 182 emails sent to list (172 in previous quarter)

19 Apr 2017 [Sebastian Nagel / Chris]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

There has been one release since the last board report:
- Nutch 1.13 was released on Apr 1, 2017
- the last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

Issues
 - 28 JIRA tickets created in the last 3 months
 - 26 JIRA tickets closed/resolved in the last 3 months

We have one student application for GSoC 2017.


COMMUNITY

Furkan Kamaci became a committer and PMC member on Tue Jan 31 2017.

The traffic on the user mailing list is at a steady level:

 - dev@nutch.apache.org:
    - 539 subscribers (up 6 in the last 3 months)
    - 396 emails sent in the past 3 months (164 in the previous cycle)

 - user@nutch.apache.org:
    - 1092 subscribers (down -2 in the last 3 months)
    - 174 emails sent in the past 3 months (240 in the previous cycle)

18 Jan 2017 [Sebastian Nagel / Isabel]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

There have been no releases since the last board report:

 - Nutch 1.12 was released on Jun 19 2016,
 - the last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY

Issues

 - 21 JIRA tickets created in the last 3 months
 -  8 JIRA tickets closed/resolved in the last 3 months

COMMUNITY

No new PMC members in the last 3 months, last PMC addition on May 23 2016.

While the traffic on the user mailing list is at a steady level, the traffic
on the development list has dropped within the last 3 months:

 - dev@nutch.apache.org:
    - 534 subscribers (down -3 in the last 3 months):
    - 164 emails sent to list (523 in previous quarter)

 - user@nutch.apache.org:
    - 1095 subscribers (down -4 in the last 3 months):
    - 242 emails sent to list (284 in previous quarter)

19 Oct 2016 [Sebastian Nagel / Jim]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

There have been no releases since the last board report:
- Nutch 1.12 was released on Jun 19 2016,
- the last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

Furkan Kamacı completed his GSoC project successfully and the developed code
is merged into Nutch's 2.x branch.

Four Nutch committers will give talks about crawler technology (not only Nutch)
at ApacheCon and Apache Big Data Europe in Seville.

Issues
 - 29 JIRA tickets created in the last 3 months
 - 25 JIRA tickets closed/resolved in the last 3 months


COMMUNITY

No new PMC members in the last 3 months, last PMC addition on May 23 2016.

The traffic on the mailing lists is at a steady level:
 - dev@nutch.apache.org:
    - 537 subscribers (down -8 in the last 3 months):
    - 487 emails sent to list (608 in previous quarter)
 - user@nutch.apache.org:
    - 1098 subscribers (down -7 in the last 3 months):
    - 280 emails sent to list (258 in previous quarter)

20 Jul 2016 [Sebastian Nagel / Shane]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.12 was released on Jun 19 2016,
the last release on the 2.x branch (2.3.1)
dates to Jan 20 2016.


CURRENT ACTIVITY

We are participating in Google Summer of Code 2016
with one student (Furkan Kamacı).

Issues
 - 44 JIRA tickets created in the last 3 months
 - 27 JIRA tickets closed/resolved in the last 3 months


COMMUNITY

Karanjeet Singh and Thamme Gowda became committers and PMC members
on Sat May 21 2016.

The traffic on the mailing lists is at a steady level:
 - dev@nutch.apache.org:
    - 546 subscribers (down -1 in the last 3 months):
    - 618 emails sent to list (914 in previous quarter)
 - user@nutch.apache.org:
    - 1108 subscribers (unchanged in the last 3 months):
    - 269 emails sent to list (335 in previous quarter)

20 Apr 2016 [Sebastian Nagel / Brett]

Apache Nutch™ is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora™ for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

Nutch 2.3.1 was released on Wed Jan 20 2016,
the last release on the 1.x branch dates to
Dec 06 2015.

CURRENT ACTIVITY

The version control was moved from svn to git
in February.

Issues
- 54 JIRA tickets created in the last 3 months
- 42 JIRA tickets closed/resolved in the last 3 months

COMMUNITY

No new committers and PMC members in the last 3 months.
Last committer addition was on Nov 08 2015.

We hope to participate in Google Summer of Code 2016
and have two student applications.

The traffic on the mailing lists is at a steady level:
 - dev@nutch.apache.org:
    - 547 subscribers (down -5 in the last 3 months):
    - 935 emails sent to list (787 in previous quarter)
 - user@nutch.apache.org:
    - 1105 subscribers (down -9 in the last 3 months):
    - 331 emails sent to list (245 in previous quarter)

20 Jan 2016 [Sebastian Nagel / Sam]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.11 was released on Dec 06 2015.


CURRENT ACTIVITY

Issues
- 62 JIRA tickets created in the last 3 months
- 47 JIRA tickets closed/resolved in the last 3 months

A vote on moving from svn to git is in progress. This option was
discussed last November and met with a positive response.


COMMUNITY

Michael James Joyce was added as committer and PMC member on Nov 08 2015.

The traffic on the mailing lists is at a steady level:
 - dev@nutch.apache.org:
    - 552 subscribers (down -15 in the last 3 months):
    - 818 emails sent to list (954 in previous quarter)
 - user@nutch.apache.org:
    - 1114 subscribers (down -5 in the last 3 months):
    - 259 emails sent to list (183 in previous quarter)

21 Oct 2015 [Sebastian Nagel / Sam]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.10 was released on May 06 2015. A vote on the next
release of the 2.x branch is ongoing.


CURRENT ACTIVITY

Cihad Güzel successfully finished his GSoC project.


COMMUNITY

Two people have joined the PMC and have become committers:
- Asitang Mishra on Sep 09 2015, and
- Sujen Shah on Sep 15 2015

The traffic on the mailing lists is at a steady level.

15 Jul 2015 [Sebastian Nagel / David]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

Nutch 1.10 was released on May 6th, 2015. The last release
on the 2.x branch was in January, 2015.

CURRENT ACTIVITY

We are participating in Google Summer of Code 2015
with two students. One failed to pass the mid-term
evaluation.

COMMUNITY

Giuseppe Totaro joined the PMC and became a committer
on April 21st, 2015.

The traffic on the mailing lists is at a steady level.

22 Apr 2015 [Sebastian Nagel / Shane]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

There have been no releases since the last board report:
- Nutch 2.3 was released on January 24, 2015,
- Nutch 1.9 in August 2014

The release of Nutch 1.10 is planned to follow soon after the
release of Tika 1.8, which will fix a licensing issue of a library
dependency (TIKA-1581).


CURRENT ACTIVITY

We hope to participate in Google Summer of Code 2015
and have 3 mentors and 6 students registered.

Nutch is used to prepare datasets for the TREC Dynamic
Domain Track (http://trec-dd.org/) as part of Memex and
NSF Polar projects.


COMMUNITY

Jorge Luis Betancourt Gonzalez has joined the PMC and become
a committer on February 18, 2015. Mo Omer followed on March 21.

18 Feb 2015 [Sebastian Nagel / Chris]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

Nutch 2.3 was released on January 24, 2015.

The release includes an important upgrade of the Gora persistence layer.
It also adds a REST API based Web Application which has been written within
the Google Summer of Code 2014.

There has been no release of the Nutch 1.x branch since the previous
report (Nutch 1.9 was released in August 2014). The release of Nutch 1.10
is planned for the next weeks.

CURRENT ACTIVITY

Chris Mattmann has begun projects related to Nutch in his CSCI 572
Search Engines class at USC. These include dynamic page rendering
and parsing with Ajax, porting of REST services from 2.x to 1.x, and
visualization of the crawl graph.

We plan to participate in Google Summer of Code 2015.

COMMUNITY

Jorge Luis Betancourt Gonzalez was invited to become a PMC
member and committer on Feb 7, 2015. The onboarding process is ongoing.

Last new committer: Talat Uyarer joined the PMC and committers on Mar 31, 2014.

The traffic on the mailing lists is at a steady level.

21 Jan 2015 [Sebastian Nagel / Bertrand]

No report was submitted.

17 Dec 2014

Change the Apache Nutch Project Chair

 WHEREAS, the Board of Directors heretofore appointed Julien Nioche
 to the office of Vice President, Apache Nutch, and

 WHEREAS, the Board of Directors is in receipt of the
 resignation of Julien Nioche from the office of Vice President,
 Apache Nutch, and

 WHEREAS, the Project Management Committee of the Apache
 Nutch project has chosen by vote to recommend Sebastian Nagel
 as the successor to the post;

 NOW, THEREFORE, BE IT RESOLVED, that Julien Nioche is relieved
 and discharged from the duties and responsibilities of the office
 of Vice President, Apache Nutch, and

 BE IT FURTHER RESOLVED, that Sebastian Nagel be and
 hereby is appointed to the office of Vice President, Apache Nutch, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification, or
 until a successor is appointed.

 Special Order 7C, Change the Apache Nutch Project Chair, was
 approved by Unanimous Vote of the directors present.

15 Oct 2014 [Julien Nioche / Doug]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We released Nutch 1.9 in August which fixed
several important bugs and added quite a few improvements.

There have been no releases of the Nutch 2.x branch since the previous
report. We are still planning to release 2.3 soon and are waiting for
a few JIRA issues to be resolved first. The next release will benefit
from improvements made to Apache Gora.

We successfully completed our first participation in GSoC and the Nutch
2.x branch now comes packaged with a self-contained Apache Wicket-based
Web Application. See more at [1].

COMMUNITY

There has been no change in the composition of the PMC and committers
list since Talat Uyarer joined the PMC and committers on 31/03/2014.

The traffic on the mailing lists is at a steady level.

We got 1 talk and 1 workshop accepted for ApacheCon EU. The workshop
will be conducted by 3 Nutch committers [2].

Nutch and related projects (Tika, Hadoop, Lucene) will probably get used
in a new DARPA project called Memex where 3 current and 1 emeritus
Nutch committers will be involved [3].

[1] http://nutch.apache.org/#sthash.WX93FBI4.dpuf
[2] http://s.apache.org/kwN
[3] http://www.darpa.mil/NewsEvents/Releases/2014/02/09.aspx

16 Jul 2014 [Julien Nioche / Greg]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

There has been a lot of activity since the previous report and we are
discussing releasing Nutch 1.9 within the next month or so. We fixed
several important bugs and committed various improvements to the trunk.

There have been no releases of the Nutch 2.x branch since the previous
report and there is limited activity on that branch. We are still
planning to release 2.3 during the summer and are waiting for a few
JIRA issues to be resolved first.

Apache Nutch has, for the first time, engaged in GSoC.
Lewis John McGibbney is working with student Fjodor Vershinin on the project
"Create a Wicket-based Web Application for Nutch" [0] which essentially
will allow ANYONE to access, run, configure, provision, queue and execute
Nutch crawl jobs within the browser.

Progress is as follows:
 * 1st progress report is here: http://wiki.apache.org/nutch/FirstReport
 * Documentation on REST API: http://wiki.apache.org/nutch/NutchRESTAPI
 * Mentors' comments are positive. The project is going to succeed.

It will be a large step forward for the Nutch project in general.

We also ported the Nutch website to Apache CMS, and set up an IRC channel
and a Twitter account for the project (https://twitter.com/ApacheNutch).

COMMUNITY

There has been no change in the composition of the PMC and committers
list since Talat Uyarer joined the PMC and committers on 31/03/2014.

The traffic on the mailing lists is at a steady level.

We submitted 2 talks and 1 workshop for ApacheCon EU. One of the talks
will present the results of a user survey [1] conducted by DigitalPebble Ltd.
The workshop would be done by 3 Nutch committers.

A Nutch-related talk has been submitted for LuceneRevolution by a
member of the community.

[0] https://issues.apache.org/jira/browse/NUTCH-841
[1] http://s.apache.org/zf

16 Apr 2014 [Julien Nioche / Roy]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Apache Nutch v1.8 was released on 17th March 2014 and contained many
improvements, bug fixes and dependency upgrades.

There have been no releases of the Nutch 2.x branch, but that branch is
benefiting from the work being done on Apache Gora, partly by Nutch users and
contributors.

COMMUNITY

Talat Uyarer joined the PMC and committers on 31/03/2014.
Ferdy Galema became emeritus in Feb 2014.

The traffic on the mailing lists is at a steady level.

15 Jan 2014 [Julien Nioche / Sam]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

No releases since the previous board report but quite a few bugfixes and
improvements, notably a more abstract document de-duplication mechanism
(https://issues.apache.org/jira/browse/NUTCH-656) and the removal of
deprecated code.

Nutch 2.x should soon benefit from improvements being done in Apache Gora,
in particular GORA-117.

We are seeing contributions and bugfixes from new users.

COMMUNITY

No new committers/PMC members since the previous report.

The traffic on the user and dev mailing lists is quite steady and
questions from new users usually get replied to reasonably quickly.

Julien Nioche gave a talk on Nutch at the Lucene/Solr Revolution EU
conference.

16 Oct 2013 [Julien Nioche / Brett]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

No releases since the previous board report but quite a few bugfixes and
improvements, notably a contribution from Amazon for indexing to AWS
CloudSearch (https://issues.apache.org/jira/browse/NUTCH-1517) (which needs
additional work) and a discussion on how to improve document deduplication
(https://issues.apache.org/jira/browse/NUTCH-656).

We are seeing contributions and bugfixes from new users.

COMMUNITY

No new committers/PMC members since the previous report.

The traffic on the user and dev mailing lists has come back to its usual
levels after a few months of exceptionally high activity.

A talk on Nutch by Julien Nioche has been accepted for Lucene/Solr
Revolution EU 2013.

DigitalPebble Ltd published a benchmark comparing the performance of both
versions of Nutch which should help improve Apache Gora and hence Nutch
2.x in the longer term.

17 Jul 2013 [Julien Nioche / Shane]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We have done 3 releases since the previous board report:
* Nutch 1.7 : release for the trunk branch
* Nutch 2.2 : release for the 2.x branch
* Nutch 2.2.1 : minor release to include an important bug fix (NUTCH-1591)

These releases contain numerous bugfixes and improvements, notably the
upgrades of various Apache-related dependencies (Hadoop, Tika) and the
addition of NUTCH-1047 in the 1.x branch, which allows new indexers to be
plugged in.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter.  Our user@ list in June had the highest traffic
since July 2011. We are also getting contributions and bugfixes from
new users.

It has been announced during the BerlinBuzzwords conference that the
CommonCrawl project [http://commoncrawl.org/] are now using Apache Nutch
for their future crawls.

17 Apr 2013 [Julien Nioche / Rich]

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as
a crawler, a link-graph database and parsing support handled by Apache Tika
for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

There have been no new releases since the last report but quite a few
improvements and issues fixed on both trunk and the 2.x branches, in
particular (NUTCH-1047) Pluggable indexing backends, which is a major
improvement and gives more flexibility to the indexing. The parsing of
robots.txt has been delegated to the Crawler Commons project.

Work has been done on improving the WIKI pages and limiting their access
as we were getting loads of spam.

There is one issue planned for GSOC 2013 [NUTCH-841].

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter.

No less than 3 new Committers / PMC Members have joined Nutch since
the previous report (Tejas Patil / Kiran Chitturi / Lufeng).

Chris Mattmann is actively teaching Nutch in his CSCI 572 Search Engines
and Information Retrieval class [http://www-scf.usc.edu/~csci572/]
at USC this semester (Spring 2013) and includes an assignment that uses
Nutch to crawl the FBI Vault dataset for students to explore and
experiment with.

The CommonCrawl project are planning to test-drive Nutch for a future
iteration of their dataset.

16 Jan 2013 [Julien Nioche / Rich]

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as
a crawler, a link-graph database and parsing support handled by Apache Tika
for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Nutch 1.6 has been released since the last report. There have been quite a
few improvements and issues fixed since on both trunk and the 2.x branches.
There is currently a discussion about a Nutch Admin GUI using Wicket.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter. Julien Nioche gave a talk about Nutch at
the ApacheCon Europe in November [http://s.apache.org/ndp] and
did an interview for InfoQ about Nutch 2 [http://s.apache.org/Sz9].
Julien also talked to the people from the CommonCrawl project
[http://commoncrawl.org/] about getting them to contribute some of their
code to Apache Nutch and get them to use it for their crawls. No tangible
results yet.

No new Committers or PMC Members since the previous report but a vote is
under way for inviting Tejas Patil to become a committer and PMC member.

17 Oct 2012 [Julien Nioche / Sam]

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as
a crawler, a link-graph database and parsing support handled by Apache Tika
for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We have been quite active lately. Nutch 2.1 has been released since the last
report and we should have a 1.6 release soon.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter. Julien Nioche will give a talk about Nutch at
the ApacheCon Europe.

No new Committers or PMC Members since the previous report.

25 Jul 2012 [Julien Nioche / Greg]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming from Apache
Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a
link-graph database and parsing support handled by Apache Tika for HTML and an
array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We have been very active lately. Nutch 1.5 has been released since the last
report and we have just released 1.5.1 which addresses some blocking issues
in 1.5.  We have also released Nutch 2.0 on the 7th July which is a major
milestone. We are working on a press announcement with Sally.

In April, the Apache Nutch PMC voted for Sebastian Nagel to become a Nutch
committer and PMC member.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high level
in the last quarter. There have not been any meetings or talks related to
Nutch since the previous report.

18 Apr 2012 [Julien Nioche / Jim]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming
from Apache Lucene, it now builds on Apache Solr adding web-specifics,
such as a crawler, a link-graph database and parsing support handled
by Apache Tika for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

No release has been made since the last report. A number of issues have
been filed in JIRA for 1.5 and we are planning to release Nutch 1.5
and the nutchgora branch shortly. Recent patches have upgraded some of
our dependencies on other Apache projects such as Tika 1.1 and Hadoop
1.0. A feature that was often mentioned on the user lists has been
committed (parse-metatags). The documentation on the wiki has been
improved.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter. Questions from users are usually answered
promptly by the community. There have not been any meetings or talks related
to Nutch since the previous report.

24 Jan 2012 [Julien Nioche / Shane]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a
crawler, a link-graph database and parsing support handled by Apache Tika for
HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Nutch 1.4 was released on 26th November 2011. A number of issues have been
filed in JIRA for 1.5. The GORA-based branch (2.0) is benefiting from the
progress made in GORA but a release is not yet planned. Recent patches have
upgraded some of our dependencies on other Apache projects such as Tika,
Hadoop or Gora.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high level
in the last quarter. Some of the committers met at ApacheCon NA 2011 in
November and discussed Nutch and its interaction with GORA. Finally, the book
Tika in Action (one of its co-authors, Chris Mattmann, is a Nutch committer and
PMC member) contains quite a few references to Nutch, which should contribute
to exposing the project to a wider audience.

26 Oct 2011 [Julien Nioche / Sam]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming from Apache
Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a
link-graph database and parsing support handled by Apache Tika for HTML and an
array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Due to the lack of progress on Nutch 2.0, it has been decided to move it from
trunk to a separate branch (nutchgora) and move the stable version 1.4 back into
trunk.

Work has started towards a release of 1.4, which should happen before the end of
this month and quite a few issues on JIRA have already been earmarked for v1.5.

The website has been modified to reflect the recommendations of the foundation
and our wiki pages have been greatly updated (mostly thanks to our new committer
Lewis John McGibbney).

COMMUNITY

There is increased traffic on the user and dev mailing lists, with more
non-committers providing help and advice but also contributing suggestions and
patches.

A few members of the community have expressed their concern about 2.0
(nutchgora) not being the main focus of the development but the majority of
users/committers seems happy to leverage the stable 1.x branch.

17 Aug 2011

Change the Apache Nutch Project Chair

    WHEREAS, the Board of Directors heretofore appointed Andrzej Bialecki
    to the office of Vice President, Apache Nutch, and

    WHEREAS, the Board of Directors is in receipt of the
    resignation of Andrzej Bialecki from the office of Vice President,
    Apache Nutch, and

    WHEREAS, the Project Management Committee of the Apache
    Nutch project has chosen by vote to recommend Julien Nioche
    as the successor to the post;

    NOW, THEREFORE, BE IT RESOLVED, that Andrzej Bialecki is relieved
    and discharged from the duties and responsibilities of the office
    of Vice President, Apache Nutch, and

    BE IT FURTHER RESOLVED, that Julien Nioche be and
    hereby is appointed to the office of Vice President, Apache Nutch, to
    serve in accordance with and subject to the direction of the
    Board of Directors and the Bylaws of the Foundation until
    death, resignation, retirement, removal or disqualification, or
    until a successor is appointed.

 Resolution 7C passed by unanimous roll call vote.

20 Jul 2011 [Andrzej Bialecki / Shane]

Andrzej Bialecki no longer has as much time for his role
as VP of the Nutch project and has started a thread on
stepping down and electing a new chair.

It's probably not in time for this month's board meeting,
but we'll have a resolution ready for next month.

Releases:

We're still working on the 2.0 Nutch branch. Nutch 2.0
integrates Gora to provide backend independence, allowing
Nutch to store its content in HBase, MySQL, HSQL and
Cassandra. We are currently focusing on the testing phase,
and trying to benchmark 2.0 compared to the 1.x series.

We rolled a 1.3 release, the most stable release of Nutch
to date. We also created a 1.4 branch and are actively working
on developing it. Chris Mattmann volunteered to do a 1.4 release
when the time comes.

Community:

Lewis John McGibbney was elected as a Nutch PMC member and
committer.

Mailing list activity is steady, alternating between folks
using Nutch 1.x, and those bleeding-edgers who are using the
2.0 trunk.

Students in Chris Mattmann's CSCI 572 Search Engines and
Information Retrieval course at USC are actively looking at
final projects involving Nutch.

20 Apr 2011 [Andrzej Bialecki / Doug]

Report for the Apache Nutch project: April 2011

There are no board level issues at this point in time.

Releases:

Work still progresses on the 2.0 Nutch branch which integrates
Gora to provide backend independence, allowing Nutch to store
its content in HBase, MySQL, HSQL and Cassandra. We are currently
focusing on the testing phase, and trying to benchmark 2.0 compared
to the 1.x series.

A number of improvements from 2.0 have been backported into 1.3.

Chris Mattmann has volunteered to RM the 1.3 release and hopefully
will cut an RC within the next few weeks.

Apache Gora made its first incubating release (0.1-incubating) and
we are working to upgrade Nutch to use the released version of Gora.

Community:

The Nutch PMC added Alexis Detreglode as a Committer and PMC member.

Mailing list activity is steady, alternating between folks using
Nutch 1.x, and those bleeding-edgers who are using the 2.0 trunk.

19 Jan 2011 [Andrzej Bialecki / Noirin]

Report for the Apache Nutch project: January 2011

There are no board level issues at this point in time.

Releases:

Work progresses on the 2.0 Nutch branch which integrates Gora to provide
backend independence, allowing Nutch to store its content in HBase, MySQL,
HSQL and Cassandra. We are currently focusing on the testing phase, and
trying to benchmark 2.0 compared to the 1.x series.

There has been some desire for patches and updates to the 1.x and we are
considering rolling a 1.3 release. If this comes to pass, Chris Mattmann has
volunteered to RM the release.

Community:

No new PMC members or committers were elected in this quarter. Otis
Gospodnetic decided to go Emeritus from the PMC, and the board has
ACK-ed.

Mailing list activity is steady, alternating between folks using Nutch 1.x,
and those bleeding-edgers who are using the 2.0 trunk.

Chris Mattmann gave a talk at ApacheCon NA on Nutch titled
"Lessons Learned in the Development of a Web-scale Search Engine:
Nutch2 and beyond".

20 Oct 2010 [Andrzej Bialecki / Bertrand]

=== Nutch Status Report: October 2010 ===

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

In September the project released a 1.2 release from the stable branch.

Nutch trunk has been merged with the so-called "nutchbase" branch,
which constitutes a major architectural change: the Nutch storage layer
now uses an object-relational mapping API called Gora (currently
undergoing incubation), with implementations for SQL databases, HBase
and Cassandra. This means that data collected and processed with Nutch
is now available to all third-party tools that can work with
these storage frameworks. The merge is now complete and bugfixing
continues, with the goal to reach a 2.0 release some time during Q1.

An additional branch was created with a snapshot of the codebase before
merging the Gora framework; it includes other refactoring and
delegation of functionality to external projects (such as Tika and
Solr). The purpose of this branch is to allow for some level of
backward-compatibility with Nutch 1.2, though most efforts now
concentrate on the trunk.

COMMUNITY

Markus Jelsma was voted as a new committer.

Andrzej Bialecki gave a talk on "Integration of Solr with crawlers:
Nutch, LCF and Aperture" at the Lucene Revolution conference in
Boston.

21 Jul 2010 [Andrzej Bialecki / Noirin]

=== Nutch Status Report: July 2010 ===

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

The move to the TLP has been completed.

The 1.1 release has been published. We plan to maintain the 1.x
branch in preparation for a maintenance 1.2 release some time in
Q3/Q4, and bugfixes are being applied to both trunk and 1.x as relevant.
Significant progress has been made in cleaning up the trunk version
according to the roadmap and delegating large parts of functionality to
Solr and Tika, and during the next two months we plan to merge trunk
with a branch known as Nutchbase, which uses Gora, a lightweight ORM
framework, to enable Nutch to use multiple storage backends.

COMMUNITY

No changes to the PMC or committers.

Chris A. Mattmann will give a talk at the ApacheCon Atlanta in November
on "Nutch 2.0 and beyond". Andrzej Bialecki will give a talk on
"Integration of Solr with crawlers: Nutch, LCF and Aperture" at the
Lucene Revolution conference in Boston in October.

Jim complimented the project on the format of their report.

16 Jun 2010 [Andrzej Bialecki / Greg]

Greg to pursue a report for Nutch.

19 May 2010 [Andrzej Bialecki / Brett]

This is the first report of the Nutch project as a TLP. Before April
2010 Nutch was a subproject of Apache Lucene.

Moving to TLP
=============
User, dev, and private mailing lists have been migrated to their new
locations under @nutch.apache.org. SVN and site have been moved to new
locations as well - see INFRA-2656 and INFRA-2657. In the following
weeks we plan to complete the move and restore the whole environment to a
working state under the new locations.

Development
===========
The project is in the process of releasing version 1.1, expected to be
completed within a week. The community has started discussing the design of
the next version of Nutch. There are many significant architectural changes
planned for the next version, in order to reduce code duplication and to
benefit from other Apache components, such as Tika, Solr and HBase. A
version of Nutch that uses an ORM framework to support different storage
implementations is expected to be merged with trunk/ some time during Q3.

21 Apr 2010

Establish the Apache Nutch Project

 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web search
 platform for distribution at no charge to the public.

 NOW, THEREFORE, BE IT RESOLVED, that a Project Management
 Committee (PMC), to be known as the "Apache Nutch Project",
 be and hereby is established pursuant to Bylaws of the
 Foundation; and be it further

 RESOLVED, that the Apache Nutch Project be and hereby is
 responsible for the creation and maintenance of software
 related to a large-scale web search platform; and be it further

 RESOLVED, that the office of "Vice President, Apache Nutch" be
 and hereby is created, the person holding such office to
 serve at the direction of the Board of Directors as the chair
 of the Apache Nutch Project, and to have primary responsibility
 for management of the projects within the scope of
 responsibility of the Apache Nutch Project; and be it further

 RESOLVED, that the persons listed immediately below be and
 hereby are appointed to serve as the initial members of the
 Apache Nutch Project:

   * Andrzej Bialecki <ab@apache.org>
   * Otis Gospodnetic <otis@apache.org>
   * Dogacan Guney <dogacan@apache.org>
   * Dennis Kubes <kubes@apache.org>
   * Chris Mattmann <mattmann@apache.org>
   * Julien Nioche <jnioche@apache.org>
   * Sami Siren <siren@apache.org>

 RESOLVED, that the Apache Nutch Project be and hereby
 is tasked with the migration and rationalization of the Apache
 Lucene Nutch sub-project; and be it further

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

 NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
 be appointed to the office of Vice President, Apache Nutch, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification,
 or until a successor is appointed.

 Special Order 7B, Establish the Apache Nutch Project, was
 approved by Unanimous Vote of the directors present.

27 Apr 2005

Nutch is nearly ready to attempt graduation.  Recently we ported our
wiki from SourceForge, so now the project is entirely hosted at Apache.

All committers are active.  We disabled several components when we
moved to Apache, due to license compatibility problems, but nearly all
of these have now been resolved.  The Nutch Organization filed a
Software Grant with the Apache Software Foundation, formally giving
all Nutch software to Apache.