Formal board meeting minutes from 2010 through present. Please Note: The board typically approves minutes from one meeting during the next board meeting, so minutes will be published roughly one month later than the scheduled date. Other corporate records are published, as is an alternate categorized view of all board meeting minutes.

Nutch

18 Jan 2017 [Sebastian Nagel / Isabel]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

There have been no releases since the last board report:

 - Nutch 1.12 was released on Jun 19 2016,
 - the last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY

Issues

 - 21 JIRA tickets created in the last 3 months
 -  8 JIRA tickets closed/resolved in the last 3 months

COMMUNITY

No new PMC members in the last 3 months; the last PMC addition was on May 23 2016.

While the traffic on the user mailing list is at a steady level, the traffic
on the development list has dropped within the last 3 months:

 - dev@nutch.apache.org:
    - 534 subscribers (down 3 in the last 3 months)
    - 164 emails sent to list (523 in previous quarter)

 - user@nutch.apache.org:
    - 1095 subscribers (down 4 in the last 3 months)
    - 242 emails sent to list (284 in previous quarter)

19 Oct 2016 [Sebastian Nagel / Jim]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

There have been no releases since the last board report:
- Nutch 1.12 was released on Jun 19 2016,
- the last release on the 2.x branch (2.3.1) dates to Jan 20 2016.


CURRENT ACTIVITY

Furkan Kamacı completed his GSoC project successfully and the developed code
is merged into Nutch's 2.x branch.

Four Nutch committers will give talks about crawler technology (not only Nutch)
at ApacheCon and Apache Big Data Europe in Seville.

Issues
 - 29 JIRA tickets created in the last 3 months
 - 25 JIRA tickets closed/resolved in the last 3 months


COMMUNITY

No new PMC members in the last 3 months; the last PMC addition was on May 23 2016.

The traffic on the mailing lists is at a steady level:
 - dev@nutch.apache.org:
    - 537 subscribers (down 8 in the last 3 months)
    - 487 emails sent to list (608 in previous quarter)
 - user@nutch.apache.org:
    - 1098 subscribers (down 7 in the last 3 months)
    - 280 emails sent to list (258 in previous quarter)

20 Jul 2016 [Sebastian Nagel / Shane]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.12 was released on Jun 19 2016,
the last release on the 2.x branch (2.3.1)
dates to Jan 20 2016.


CURRENT ACTIVITY

We are participating in Google Summer of Code 2016
with one student (Furkan Kamacı).

Issues
 - 44 JIRA tickets created in the last 3 months
 - 27 JIRA tickets closed/resolved in the last 3 months


COMMUNITY

Karanjeet Singh and Thamme Gowda became committers and PMC members
on Sat May 21 2016.

The traffic on the mailing lists is at a steady level:
 - dev@nutch.apache.org:
    - 546 subscribers (down 1 in the last 3 months)
    - 618 emails sent to list (914 in previous quarter)
 - user@nutch.apache.org:
    - 1108 subscribers (unchanged in the last 3 months)
    - 269 emails sent to list (335 in previous quarter)

20 Apr 2016 [Sebastian Nagel / Brett]

Apache Nutch™ is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop®
data structures and Apache Gora™ for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

Nutch 2.3.1 was released on Wed Jan 20 2016,
the last release on the 1.x branch dates to
Dec 06 2015.

CURRENT ACTIVITY

The version control was moved from svn to git
in February.

Issues
- 54 JIRA tickets created in the last 3 months
- 42 JIRA tickets closed/resolved in the last 3 months

COMMUNITY

No new committers and PMC members in the last 3 months.
Last committer addition was on Nov 08 2015.

We hope to participate in Google Summer of Code 2016
and have two student applications.

The traffic on the mailing lists is at a steady level:
- dev@nutch.apache.org:
   - 547 subscribers (down 5 in the last 3 months)
   - 935 emails sent to list (787 in previous quarter)
- user@nutch.apache.org:
   - 1105 subscribers (down 9 in the last 3 months)
   - 331 emails sent to list (245 in previous quarter)

20 Jan 2016 [Sebastian Nagel / Sam]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.11 was released on Dec 06 2015.


CURRENT ACTIVITY

Issues
- 62 JIRA tickets created in the last 3 months
- 47 JIRA tickets closed/resolved in the last 3 months

A vote on moving from svn to git is in progress. This option was
discussed last November and met with a positive response.


COMMUNITY

Michael James Joyce was added as committer and PMC on Nov 08 2015.

The traffic on the mailing lists is at a steady level:
- dev@nutch.apache.org:
   - 552 subscribers (down 15 in the last 3 months)
   - 818 emails sent to list (954 in previous quarter)
- user@nutch.apache.org:
   - 1114 subscribers (down 5 in the last 3 months)
   - 259 emails sent to list (183 in previous quarter)

21 Oct 2015 [Sebastian Nagel / Sam]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.


ISSUES

There are no issues requiring board attention at this time.


RELEASES

Nutch 1.10 was released on May 06 2015. A vote on the next
release of the 2.x branch is ongoing.


CURRENT ACTIVITY

Cihad Güzel successfully finished his GSoC project.


COMMUNITY

Two people have joined the PMC and have become committers:
- Asitang Mishra on Sep 09 2015, and
- Sujen Shah on Sep 15 2015

The traffic on the mailing lists is at a steady level.

15 Jul 2015 [Sebastian Nagel / David]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

Nutch 1.10 was released on May 6th, 2015. The last release
on the 2.x branch was in January, 2015.

CURRENT ACTIVITY

We are participating in Google Summer of Code 2015
with two students. One failed to pass the mid-term
evaluation.

COMMUNITY

Giuseppe Totaro joined the PMC and became a committer
on April 21st, 2015.

The traffic on the mailing lists is at a steady level.

22 Apr 2015 [Sebastian Nagel / Shane]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene®, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

There have been no releases since the last board report:
- Nutch 2.3 was released on January 24, 2015,
- Nutch 1.9 in August 2014

Nutch 1.10 is planned for release soon after Tika 1.8,
which will fix a licensing issue in a library
dependency (TIKA-1581).


CURRENT ACTIVITY

We hope to participate in Google Summer of Code 2015
and have 3 mentors and 6 students registered.

Nutch is used to prepare datasets for the TREC Dynamic
Domain Track (http://trec-dd.org/) as part of Memex and
NSF Polar projects.


COMMUNITY

Jorge Luis Betancourt Gonzalez has joined the PMC and become
a committer on February 18, 2015. Mo Omer followed on March 21.

18 Feb 2015 [Sebastian Nagel / Chris]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

RELEASES

Nutch 2.3 was released on January 24, 2015.

The release includes an important upgrade of the Gora persistence layer.
It also adds a REST API-based web application, which was written during
Google Summer of Code 2014.

There has been no release of the Nutch 1.x branch since the previous
report (Nutch 1.9 was released in August 2014). The release of Nutch 1.10
is planned for the coming weeks.

CURRENT ACTIVITY

Chris Mattmann has begun projects related to Nutch in his CSCI 572
Search Engines class at USC. These include dynamic page rendering
and parsing with Ajax, porting of REST services from 2.x to 1.x, and
visualization of the crawl graph.

We plan to participate in Google Summer of Code 2015.

COMMUNITY

Jorge Luis Betancourt Gonzalez has been invited to become a PMC
member and committer on Feb 7, 2015. The onboarding process is ongoing.

Last new committer: Talat Uyarer joined the PMC and committers on Mar 31, 2014.

The traffic on the mailing lists is at a steady level.

21 Jan 2015 [Sebastian Nagel / Bertrand]

No report was submitted.

17 Dec 2014

Change the Apache Nutch Project Chair

 WHEREAS, the Board of Directors heretofore appointed Julien Nioche
 to the office of Vice President, Apache Nutch, and

 WHEREAS, the Board of Directors is in receipt of the
 resignation of Julien Nioche from the office of Vice President,
 Apache Nutch, and

 WHEREAS, the Project Management Committee of the Apache
 Nutch project has chosen by vote to recommend Sebastian Nagel
 as the successor to the post;

 NOW, THEREFORE, BE IT RESOLVED, that Julien Nioche is relieved
 and discharged from the duties and responsibilities of the office
 of Vice President, Apache Nutch, and

 BE IT FURTHER RESOLVED, that Sebastian Nagel be and
 hereby is appointed to the office of Vice President, Apache Nutch, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification, or
 until a successor is appointed.

 Special Order 7C, Change the Apache Nutch Project Chair, was
 approved by Unanimous Vote of the directors present.

15 Oct 2014 [Julien Nioche / Doug]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We released Nutch 1.9 in August which fixed
several important bugs and added quite a few improvements.

There have been no releases of the Nutch 2.x branch since the previous
report. We are still planning to release 2.3 soon and are waiting for
a few JIRA issues to be resolved first. The next release will benefit
from improvements made to Apache Gora.

We successfully completed our first participation in GSoC and the Nutch
2.x branch now comes packaged with a self-contained Apache Wicket-based
web application; see [1].

COMMUNITY

There has been no change in the composition of the PMC and committers
list since Talat Uyarer joined the PMC and committers on 31/03/2014.

The traffic on the mailing lists is at a steady level.

We got 1 talk and 1 workshop accepted for ApacheCon EU. The workshop
will be conducted by 3 Nutch committers [2].

Nutch and related projects (Tika, Hadoop, Lucene) will probably get used
in a new DARPA project called Memex where 3 current and 1 emeritus
Nutch committers will be involved [3].

[1] http://nutch.apache.org/#sthash.WX93FBI4.dpuf
[2] http://s.apache.org/kwN
[3] http://www.darpa.mil/NewsEvents/Releases/2014/02/09.aspx

16 Jul 2014 [Julien Nioche / Greg]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

There has been a lot of activity since the previous report and we are
discussing releasing Nutch 1.9 within the next month or so. We fixed
several important bugs and committed various improvements to the trunk.

There have been no releases of the Nutch 2.x branch since the previous
report and there is limited activity on that branch. We are still
planning to release 2.3 during the summer and are waiting for a few
JIRA issues to be resolved first.

Apache Nutch has, for the first time, engaged in GSoC.
Lewis John McGibbney is working with student Fjodor Vershinin on the project
"Create a Wicket-based Web Application for Nutch" [0], which essentially
will allow anyone to access, run, configure, provision, queue and execute
Nutch crawl jobs within the browser.

Progress is as follows:
 * 1st report progress is here http://wiki.apache.org/nutch/FirstReport
 * Documentation on REST API => http://wiki.apache.org/nutch/NutchRESTAPI
 * The mentor's comments are positive. The project is going to succeed.

It will be a large step forward for the Nutch project in general.

We also ported the Nutch website to Apache CMS, and set up an IRC channel and
a Twitter account for the project (https://twitter.com/ApacheNutch).

COMMUNITY

There has been no change in the composition of the PMC and committers
list since Talat Uyarer joined the PMC and committers on 31/03/2014.

The traffic on the mailing lists is at a steady level.

We submitted 2 talks and 1 workshop for ApacheCon EU. One of the talks
will present the results of a user survey [1] conducted by DigitalPebble Ltd.
The workshop would be done by 3 Nutch committers.

A Nutch-related talk has been submitted for LuceneRevolution by a
member of the community.

[0] https://issues.apache.org/jira/browse/NUTCH-841
[1] http://s.apache.org/zf

16 Apr 2014 [Julien Nioche / Roy]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Apache Nutch v1.8 was released on 17th March 2014 and contained many
improvements, bug fixes and dependency upgrades.

There have been no releases of the Nutch 2.x branch but that branch is benefiting
from the work being done on Apache GORA, partly by Nutch users and contributors.

COMMUNITY

Talat Uyarer joined the PMC and committers on 31/03/2014.
Ferdy Galema became emeritus in Feb 2014.

The traffic on the mailing lists is at a steady level.

15 Jan 2014 [Julien Nioche / Sam]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

No releases since the previous board report but quite a few bugfixes and
improvements, notably a more abstract document de-duplication mechanism
(https://issues.apache.org/jira/browse/NUTCH-656) and the removal of
deprecated code.

Nutch 2.x should soon benefit from improvements being done in Apache Gora,
in particular GORA-117.

We are seeing contributions and bugfixes from new users.

COMMUNITY

No new committers or PMC members since the previous report.

The traffic on the user and dev mailing lists is quite steady and
questions from new users usually get replied to reasonably quickly.

Julien Nioche gave a talk on Nutch at the Lucene/Solr Revolution EU
conference.

16 Oct 2013 [Julien Nioche / Brett]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

No releases since the previous board report but quite a few bugfixes and
improvements, notably a contribution from Amazon for indexing to AWS
CloudSearch (https://issues.apache.org/jira/browse/NUTCH-1517) (which needs
additional work) and a discussion on how to improve document deduplication
(https://issues.apache.org/jira/browse/NUTCH-656).

We are seeing contributions and bugfixes from new users.

COMMUNITY

No new committers or PMC members since the previous report.

The traffic on the user and dev mailing lists has come back to its usual
levels after a few months of exceptionally high activity.

A talk on Nutch by Julien Nioche has been accepted for Lucene/Solr
Revolution EU 2013.

DigitalPebble Ltd published a benchmark comparing the performance of both
versions of Nutch, which should help improve Apache Gora and hence Nutch
2.x in the longer term.

17 Jul 2013 [Julien Nioche / Shane]

Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene™, the project has diversified
and now comprises two codebases, based respectively on Apache Hadoop
data structures and Apache Gora for leveraging NoSQL databases.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We have done 3 releases since the previous board report:
* Nutch 1.7 : release for the trunk branch
* Nutch 2.2 : release for the 2.x branch
* Nutch 2.2.1 : minor release to include an important bug fix (NUTCH-1591)

These releases contain numerous bugfixes and improvements, notably the
upgrades of various Apache-related dependencies (Hadoop, Tika) and the
addition of NUTCH-1047 in the 1.x branch, which allows new indexers to be
plugged in.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter.  Our user@ list in June had the highest traffic
since July 2011. We are also getting contributions and bugfixes from
new users.

It was announced during the Berlin Buzzwords conference that the
CommonCrawl project [http://commoncrawl.org/] is now using Apache Nutch
for its future crawls.

17 Apr 2013 [Julien Nioche / Rich]

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as
a crawler, a link-graph database and parsing support handled by Apache Tika
for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

There have been no new releases since the last report but quite a few
improvements and issues fixed on both trunk and the 2.x branches, in
particular (NUTCH-1047) Pluggable indexing backends, which is a major
improvement and gives more flexibility to the indexing. The parsing of
robots.txt has been delegated to the Crawler Commons project.

Work has been done on improving the WIKI pages and limiting their access
as we were getting loads of spam.

There is one issue planned for GSOC 2013 [NUTCH-841].

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter.

No fewer than 3 new Committers / PMC Members have joined Nutch since
the previous report (Tejas Patil / Kiran Chitturi / Lufeng).

Chris Mattmann is actively teaching Nutch in his CSCI 572 Search Engines
and Information Retrieval class [http://www-scf.usc.edu/~csci572/]
at USC this semester (Spring 2013) and includes an assignment that uses
Nutch to crawl the FBI Vault dataset for students to explore and
experiment with.

The CommonCrawl project are planning to test-drive Nutch for a future
iteration of their dataset.

16 Jan 2013 [Julien Nioche / Rich]

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as
a crawler, a link-graph database and parsing support handled by Apache Tika
for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Nutch 1.6 has been released since the last report. There have been quite a
few improvements and issues fixed since on both trunk and the 2.x branches.
There is currently a discussion about a Nutch Admin GUI using Wicket.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter. Julien Nioche gave a talk about Nutch at
the ApacheCon Europe in November [http://s.apache.org/ndp] and
did an interview for InfoQ about Nutch 2 [http://s.apache.org/Sz9].
Julien also talked to the people from the CommonCrawl project
[http://commoncrawl.org/] about getting them to contribute some of their
code to Apache Nutch and get them to use it for their crawls. No tangible
results yet.

No new Committers or PMC Members since the previous report but a vote is
under way for inviting Tejas Patil to become a committer and PMC member.

17 Oct 2012 [Julien Nioche / Sam]

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as
a crawler, a link-graph database and parsing support handled by Apache Tika
for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We have been quite active lately. Nutch 2.1 has been released since the last
report and we should have a 1.6 release soon.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter. Julien Nioche will give a talk about Nutch at
the ApacheCon Europe.

No new Committers or PMC Members since the previous report.

25 Jul 2012 [Julien Nioche / Greg]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming from Apache
Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a
link-graph database and parsing support handled by Apache Tika for HTML and an
array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

We have been very active lately. Nutch 1.5 has been released since the last
report and we have just released 1.5.1 which addresses some blocking issues
in 1.5.  We have also released Nutch 2.0 on the 7th July which is a major
milestone. We are working on a press announcement with Sally.

The Apache Nutch PMC has voted Sebastian Nagel to become a Nutch committer and
PMC member in April.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high level
in the last quarter. There have not been any meetings or talks related to
Nutch since the previous report.

18 Apr 2012 [Julien Nioche / Jim]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming
from Apache Lucene, it now builds on Apache Solr adding web-specifics,
such as a crawler, a link-graph database and parsing support handled
by Apache Tika for HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

No release has been made since the last report. A number of issues have
been filed in JIRA for 1.5 and we are planning to release Nutch 1.5
and the nutchgora branch shortly. Recent patches have upgraded some of
our dependencies on other Apache projects such as Tika 1.1 and Hadoop
1.0. A feature which was often mentioned on the user lists has
been committed (parse-metatags). The documentation on the WIKI has
been improved.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high
level in the last quarter. Questions from users are usually answered
promptly by the community. There have not been any meetings or talks related
to Nutch since the previous report.

24 Jan 2012 [Julien Nioche / Shane]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a
crawler, a link-graph database and parsing support handled by Apache Tika for
HTML and an array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Nutch 1.4 was released on 26th November 2011. A number of issues have been
filed in JIRA for 1.5. The GORA-based branch (2.0) is benefiting from the
progress made in GORA but a release is not yet planned. Recent patches have
upgraded some of our dependencies on other Apache projects such as Tika,
Hadoop or Gora.

COMMUNITY

The traffic on the user and dev mailing lists has kept a relatively high level
in the last quarter. Some of the committers met at ApacheCon NA 2011 in
November and discussed Nutch and its interaction with GORA. Finally the book
Tika in Action (one of the co-authors Chris Mattmann is a Nutch committer and
PMC member) contains quite a few references to Nutch, which should contribute
to exposing the project to a wider audience.

26 Oct 2011 [Julien Nioche / Sam]

DESCRIPTION

Apache Nutch is an open source web-search software project. Stemming from Apache
Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a
link-graph database and parsing support handled by Apache Tika for HTML and an
array of other document formats.

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

Due to the lack of progress of Nutch 2.0, it has been decided to move it from
trunk to a separate branch (nutchgora) and move the stable version 1.4 back into
trunk.

Work has started towards a release of 1.4, which should happen before the end of
this month, and quite a few issues on JIRA have already been earmarked for v1.5.

The website has been modified to reflect the recommendations of the foundation
and our wiki pages have been greatly updated (mostly thanks to our new committer
Lewis J. McGibbney).

COMMUNITY

There is increased traffic on the user and dev mailing lists, with more
non-committers providing help and advice but also contributing suggestions and
patches.

A few members of the community have expressed their concern about 2.0
(nutchgora) not being the main focus of the development but the majority of
users/committers seems happy to leverage the stable 1.x branch.

17 Aug 2011

Change the Apache Nutch Project Chair

    WHEREAS, the Board of Directors heretofore appointed Andrzej Bialecki
    to the office of Vice President, Apache Nutch, and

    WHEREAS, the Board of Directors is in receipt of the
    resignation of Andrzej Bialecki from the office of Vice President,
    Apache Nutch, and

    WHEREAS, the Project Management Committee of the Apache
    Nutch project has chosen by vote to recommend Julien Nioche
    as the successor to the post;

    NOW, THEREFORE, BE IT RESOLVED, that Andrzej Bialecki is relieved
    and discharged from the duties and responsibilities of the office
    of Vice President, Apache Nutch, and

    BE IT FURTHER RESOLVED, that Julien Nioche be and
    hereby is appointed to the office of Vice President, Apache Nutch, to
    serve in accordance with and subject to the direction of the
    Board of Directors and the Bylaws of the Foundation until
    death, resignation, retirement, removal or disqualification, or
    until a successor is appointed.

 Resolution 7C passed by unanimous roll call vote.

20 Jul 2011 [Andrzej Bialecki / Shane]

Andrzej Bialecki does not have as much time anymore
in his role as VP of the Nutch project and he has started a
thread on stepping down and electing a new chair.

It's probably not in time for this month's board meeting,
but we'll have a resolution ready for next month.

Releases:

We're still working on the 2.0 Nutch branch. Nutch 2.0
integrates Gora to provide backend independence, allowing
Nutch to store its content in HBase, MySQL, HSQL and
Cassandra. We are currently focusing on the testing phase,
and trying to benchmark 2.0 compared to the 1.x series.

We rolled a 1.3 release, the most stable release of Nutch
to date. We also created a 1.4 branch and are actively working
on developing it. Chris Mattmann volunteered to do a 1.4 release
when the time comes.

Community:

Lewis John McGibbney was elected as a Nutch PMC member and
committer.

Mailing list activity is steady, alternating between folks
using Nutch 1.x, and those bleeding-edgers who are using the
2.0 trunk.

Students in Chris Mattmann's CSCI 572 Search Engines and
Information Retrieval course at USC are actively looking at
final projects involving Nutch.

20 Apr 2011 [Andrzej Bialecki / Doug]

Report for the Apache Nutch project: April 2011

There are no board level issues at this point in time.

Releases:

Work still progresses on the 2.0 Nutch branch which integrates
Gora to provide backend independence, allowing Nutch to store
its content in HBase, MySQL, HSQL and Cassandra. We are currently
focusing on the testing phase, and trying to benchmark 2.0 compared
to the 1.x series.

A number of improvements from 2.0 have been backported into 1.3.

Chris Mattmann has volunteered to RM the 1.3 release and hopefully
will cut an RC within the next few weeks.

Apache Gora made its first incubating release (0.1-incubating) and
we are working to upgrade Nutch to use the released version of Gora.

Community:

The Nutch PMC added Alexis Detreglode as a Committer and PMC member.

Mailing list activity is steady, alternating between folks using
Nutch 1.x, and those bleeding-edgers who are using the 2.0 trunk.

19 Jan 2011 [Andrzej Bialecki / Noirin]

Report for the Apache Nutch project: January 2011

There are no board level issues at this point in time.

Releases:

Work progresses on the 2.0 Nutch branch which integrates Gora to provide
backend independence, allowing Nutch to store its content in HBase, MySQL,
HSQL and Cassandra. We are currently focusing on the testing phase, and
trying to benchmark 2.0 compared to the 1.x series.

There has been some desire for patches and updates to the 1.x and we are
considering rolling a 1.3 release. If this comes to pass, Chris Mattmann has
volunteered to RM the release.

Community:

No new PMC members or committers were elected in this quarter. Otis
Gospodnetic decided to go Emeritus from the PMC, and the board has
ACK-ed.

Mailing list activity is steady, alternating between folks using Nutch 1.x,
and those bleeding-edgers who are using the 2.0 trunk.

Chris Mattmann gave a talk at ApacheCon NA on Nutch titled
"Lessons Learned in the Development of a Web-scale Search Engine:
Nutch2 and beyond".

20 Oct 2010 [Andrzej Bialecki / Bertrand]

=== Nutch Status Report: October 2010 ===

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

In September the project published the 1.2 release from the stable branch.

Nutch trunk has been merged with the so-called "nutchbase" branch,
which constitutes a major architectural change: the Nutch storage layer
now uses an object-relational mapping API called Gora (currently
undergoing incubation), with implementations for SQL databases, HBase
and Cassandra. This means that data collected and processed with Nutch
is now available to all third-party tools that can work with
these storage frameworks. The merge is now complete and bugfixing
continues, with the goal of reaching a 2.0 release some time during Q1.

An additional branch was created from a snapshot of the codebase taken
before the Gora merge; it includes other refactoring and
delegation of functionality to external projects (such as Tika and
Solr). The purpose of this branch is to allow for some level of
backward-compatibility with Nutch 1.2, though most efforts now
concentrate on the trunk.

COMMUNITY

Markus Jelsma was voted in as a new committer.

Andrzej Bialecki gave a talk on "Integration of Solr with crawlers:
Nutch, LCF and Aperture" at the Lucene Revolution conference in
Boston.

21 Jul 2010 [Andrzej Bialecki / Noirin]

=== Nutch Status Report: July 2010 ===

ISSUES

There are no issues requiring board attention at this time.

CURRENT ACTIVITY

The move to the TLP has been completed.

The 1.1 release has been published. We plan to maintain the 1.x
branch in preparation for a maintenance 1.2 release some time in
Q3/Q4, and bugfixes are being applied to both trunk and 1.x as relevant.
Significant progress has been made in cleaning up the trunk version
according to the roadmap and delegating large parts of functionality to
Solr and Tika, and during the next two months we plan to merge trunk
with a branch known as Nutchbase, which uses the lightweight ORM
framework Gora to enable Nutch to use multiple storage backends.

COMMUNITY

No changes to the PMC or committers.

Chris A. Mattmann will give a talk at ApacheCon Atlanta in November
on "Nutch 2.0 and beyond". Andrzej Bialecki will give a talk on
"Integration of Solr with crawlers: Nutch, LCF and Aperture" at the
Lucene Revolution conference in Boston in October.

Jim complimented the project on the format of their report.

16 Jun 2010 [Andrzej Bialecki / Greg]

Greg to pursue a report for Nutch.

19 May 2010 [Andrzej Bialecki / Brett]

This is the first report of the Nutch project as a TLP. Before April
2010 Nutch was a subproject of Apache Lucene.

Moving to TLP
=============
User, dev, and private mailing lists have been migrated to their new
locations under @nutch.apache.org. SVN and site have been moved to new
locations as well - see INFRA-2656 and INFRA-2657. In the following
weeks we plan to complete the move and restore the whole environment to a
working state under the new locations.

Development
===========
The project is in the process of releasing version 1.1, expected to be
completed within a week. The community has started discussing the design
of the next version of Nutch. There are many significant architectural changes
planned for the next version, in order to reduce code duplication and to
benefit from other Apache components, such as Tika, Solr and HBase. A
version of Nutch that uses an ORM framework to support different storage
implementations is expected to be merged with trunk/ some time during Q3.

21 Apr 2010

Establish the Apache Nutch Project

 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web search
 platform for distribution at no charge to the public.

 NOW, THEREFORE, BE IT RESOLVED, that a Project Management
 Committee (PMC), to be known as the "Apache Nutch Project",
 be and hereby is established pursuant to Bylaws of the
 Foundation; and be it further

 RESOLVED, that the Apache Nutch Project be and hereby is
 responsible for the creation and maintenance of software
 related to a large-scale web search platform; and be it further

 RESOLVED, that the office of "Vice President, Apache Nutch" be
 and hereby is created, the person holding such office to
 serve at the direction of the Board of Directors as the chair
 of the Apache Nutch Project, and to have primary responsibility
 for management of the projects within the scope of
 responsibility of the Apache Nutch Project; and be it further

 RESOLVED, that the persons listed immediately below be and
 hereby are appointed to serve as the initial members of the
 Apache Nutch Project:

   * Andrzej Bialecki <ab@apache.org>
   * Otis Gospodnetic <otis@apache.org>
   * Dogacan Guney <dogacan@apache.org>
   * Dennis Kubes <kubes@apache.org>
   * Chris Mattmann <mattmann@apache.org>
   * Julien Nioche <jnioche@apache.org>
   * Sami Siren <siren@apache.org>

 RESOLVED, that the Apache Nutch Project be and hereby
 is tasked with the migration and rationalization of the Apache
 Lucene Nutch sub-project; and be it further

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

 NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
 be appointed to the office of Vice President, Apache Nutch, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification,
 or until a successor is appointed.

 Special Order 7B, Establish the Apache Nutch Project, was
 approved by Unanimous Vote of the directors present.

27 Apr 2005

Nutch is nearly ready to attempt graduation.  Recently we ported our
wiki from Sourceforge, so now the project is entirely hosted at Apache.

All committers are active.  We disabled several components when we
moved to Apache, due to license compatibility problems, but nearly all
of these have now been resolved.  The Nutch Organization filed a
Software Grant with the Apache Software Foundation, formally giving
all Nutch software to Apache.