This was extracted (@ 2024-10-16 22:10) from a list of minutes
which have been approved by the Board.
Please Note
The Board typically approves the minutes of the previous meeting at the
beginning of every Board meeting; therefore, the list below does not
normally contain details from the minutes of the most recent Board meeting.
WARNING: these pages may omit some original contents of the minutes.
Meeting times vary; the exact schedule is available to ASF Members and Officers. Search for "calendar" in the Foundation's private index page (svn:foundation/private-index.html).
Report was filed, but display is awaiting the approval of the Board minutes.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (14 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
1.20 was released on 2024-04-24. Development activity was focused on bug fixes and minor improvements. We discussed an upgrade of the project to Java 17.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) is on a low but steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Work on the next Nutch release 1.20 continued and the release should land within the next few weeks. Code contributions in the last quarter focused on upgrades of dependencies and the build configuration for the next release but also included bug fixes and improvements. We proposed a project for GSoC'24: "Overhaul the Nutch plugin framework".

## Community Health:
The number of contributions has slightly increased during the last three months.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Work on the next Nutch release 1.20 continued at a slow pace, with a few bug fixes, improvements to the code and build system, and an important dependency upgrade (Tika 2.9.0 to 2.9.1) to address a CVE.

## Community Health:
The number of contributions decreased during the last month after a very active preceding quarter.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Project Status:
Current project status: ongoing with medium to low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 22 committers and 22 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Tim Allison was added as committer and PMC member on 2023-07-19

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Work on the next Nutch release 1.20 continues. Important contributions were the resolution of dependency conflicts around logging libraries (slf4j2 and reload4j) to finally remove Log4j 1.x from all Nutch plugins, an upgrade of the Apache Tika dependency, which required resolving a dependency conflict (commons-io is required in different versions by Tika and Hadoop), and improvements to robots.txt handling to fully implement RFC 9309. We discussed a road map for future Nutch features.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) went up significantly in the weeks after summer, because of a new active committer and increased preparations for the next release.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Project Status:
Current project status: Ongoing with low activity
Issues for the board: none

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Work on the next Nutch release continues with some bug fixes, improvements and dependency upgrades. An indexer plugin for OpenSearch 2.x is under discussion.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) is on a low level, below that of the previous quarter.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (13 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.
- No new committers. Last addition was Shashanka Balakuntala Srinivasa on 2020-07-25.

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Work on the next Nutch release continues. A notable addition is the OpenSearch indexer plugin. Remaining work was about bug fixes and dependency upgrades.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) is on a low but steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Work on the next Nutch release continues. One focus was on resolving dependency conflicts around logging libraries (slf4j2 and reload4j) to finally remove Log4j 1.x from all Nutch plugins.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) is on a low but steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.19 was released on 2022-08-22. Final work on the release candidate focused on dependency upgrades and adapting Docker build files to changes introduced with 1.19.

We discussed whether we need to worry about growing the community, given the long time since the last committer addition. We agreed that we need to keep an eye on this, but also concluded that Nutch is a success story when you consider that it still has a community after 12 years as an Apache TLP and 22 years after the first commit, especially given that batch-based crawlers are no longer cutting-edge technology.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) is on a steady level, with a visible increase in activity during the release of 1.19.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing with 11 Jira issues opened, 9 resolved during the last quarter. Ongoing work includes the transition from Ant to Gradle to build Nutch, dependency upgrades, improved error handling in the fetcher, and support for non-standard protocol implementations (e.g. smb://) and the corresponding URLStreamHandlers by Nutch plugins.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) is on a low but steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (12 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing with 15 Jira issues opened, 10 resolved during the last quarter. Ongoing work includes the transition from Ant to Gradle to build Nutch, and fixes to the fetcher and the protocol layer.

## Community Health:
The number of contributions (Jira issues, commits, activity on the mailing lists) was on a comparably low level in the past quarter.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing with 30 Jira issues opened, 27 resolved during the last quarter. Ongoing work includes major dependency upgrades (Tika, log4j), a review of Nutch job metrics, and improvements of the protocol layer to also include non-standard URL schemes (e.g. smb).

The migration of our website away from the Apache CMS to Hugo was finally done.

## Community Health:
Contributions (Jira issues, commits, activity on the mailing lists) have shown an increase after a quiet summer quarter.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing. The focus was on upgrading dependencies and improving the HTTP protocol plugins.

We've split the Nutch WebApp (a web GUI to set up and run crawls) into a separate repository to reduce the number of core dependencies and to make the Nutch core codebase more maintainable and more secure, given that there is little development happening on the WebApp code.

The migration away from the Apache CMS is still pending. We made some progress during the last 3 months but work is proceeding slowly.

## Community Health:
Contributions (Jira issues, commits, activity on the mailing lists) slowed down over summer but have shown some increase during the last weeks.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.19 is ongoing. The Nutch Docker image was upgraded to be based on Java 11 together with a significant reduction of the Docker image size.

We voted to accept the donation of the Nutch-Helm project which enables the deployment of Nutch containers on Kubernetes. The project was written by Lewis John McGibbney (also a committer and PMC member of Nutch) and shall continue as a separate code base under the umbrella of the Nutch project.

The Nutch PMC recently participated in the University of Southern California Computer Science Senior Capstone Program which delivered Fireant, a Dependabot-like service (tailored to Apache Ant + Ivy projects) which creates pull requests to keep your dependencies secure and up-to-date. More info can be found at https://github.com/fireant-bot/fireant. We will most likely engage in an IP CLEARANCE process to donate Fireant to either the Nutch or Ant PMCs in due course.

The Nutch PMC will participate in the Oregon State University Computer Science Senior Capstone Program with a 9-month project which will primarily focus on reimplementing the legacy Nutch build system (Ant + Ivy) with either Maven or Gradle.

The migration away from the Apache CMS is still pending and has not made any progress during the last 3 months.

## Community Health:
Traffic on mailing lists, issue reports and code contributions are on a steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new committers and PMC members. Last addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Nutch 1.18 was released on 2021-01-24 and fixes the XXE injection vulnerability (CVE-2021-23901) reported on 2021-01-04.

Work on Nutch 1.19 is ongoing. As an important step forward, we completed the upgrade to build and run on JDK 11.

The migration away from the Apache CMS is still pending and has not made any progress during the last 3 months.

## Community Health:
Traffic on mailing lists, issue reports and code contributions are on a low but steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project based on Apache Hadoop® data structures and the MapReduce data processing framework.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

The last committer and PMC addition was Shashanka Balakuntala Srinivasa on 2020-08-01.

## Project Activity:
Work on Nutch 1.18 continues with 13 JIRA issues opened and 12 resolved since the last report.

The migration away from the Apache CMS has not made any progress during the last 3 months.

We started work on running Nutch on Apache Tez as an alternative to MapReduce, respectively Hadoop YARN.

## Community Health:
Traffic on mailing lists and development activity are on a low but steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (10 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- Shashanka Balakuntala Srinivasa was added to the PMC on 2020-08-01
- Shashanka Balakuntala Srinivasa was added as committer on 2020-07-25

## Project Activity:
Work on Nutch 1.18 continued (12 issues opened, 21 issues resolved). Key aspects of the issues opened and worked on are improvements in the build system (GitHub workflows for PRs, integration of SpotBugs targets) and documentation of third-party licenses.

Notified by Infra about the end of the Apache CMS, we decided to migrate the Nutch site to Hugo, following the Apache Jena migration path. While initial steps are done, the long road of converting content and templates is still ahead.

## Community Health:
Commit activity continued over summer but slowed down in the last weeks. Traffic on the user mailing list is on a low but steady level.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (10 years ago)
There are currently 20 committers and 20 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Roannel Fernandez on 2018-06-23.
- No new committers. Last addition was Roannel Fernandez on 2018-06-23.

## Project Activity:
In April we celebrated 10 years being an Apache top-level project.

Nutch 1.17 was released on 2020-06-18 with 60 issues resolved. Work on 1.18 has started.

## Community Health:
Traffic on mailing lists has somewhat increased and we see contributions from new users.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (10 years ago)
There are currently 20 committers and 20 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Roannel Fernandez on 2018-06-23.
- No new committers. Last addition was Roannel Fernandez on 2018-06-23.

## Project Activity:
Work on 1.17 is proceeding with about 25 issues resolved so far, 14 more since the last board report.

## Community Health:
Traffic on mailing lists has decreased significantly. Questions about Nutch usage have moved away from the user mailing list (13 mails during the last quarter) to Stack Overflow (about 25 questions, see [1]).

[1] https://stackoverflow.com/search?tab=Newest&q=nutch%20is%3aquestion
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (9 years ago)
No new PMC members added in the last 3 months.
Currently 20 committers and PMC members.
Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

## Project Activity:
On Oct 11, 2019, both 1.16 and 2.4 were released.

With the release of 2.4 we resolved CVE-2016-6809, a vulnerability caused by an upstream dependency. We also announced that we are retiring development on the 2.x branch and advise users to use the 1.x/master branch instead.

Work on 1.17 is proceeding with about 11 issues resolved so far.

## Community Health:
The traffic on the mailing lists is on a steady level. We received a couple of code contributions (PRs addressing open issues) from a new contributor.
## Description:
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Nutch was founded 2010-04-21 (9 years ago)
No new PMC members added in the last 3 months.
Currently 20 committers and PMC members.
Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

## Project Activity:
The issues with the dependency management (see NUTCH-2669 and linked issues) disappeared after an upgrade to the latest Tika version.

The releases of 1.16 and 2.4 are in progress; so far the votes have passed and we are waiting for the release artifacts to be mirrored.

With the release of 2.4 we will retire development on the 2.x branch, as no committer is actively working on it. We will advise users to use the 1.x/master branch instead.

The migration of the Nutch Wiki from MoinMoin to Confluence was finished in July but cleanup and restructuring are still desired.

There have been discussions about attracting people via the Outreachy initiatives or the Hacktoberfest event.

## Community Health:
There was a significant increase in code commits and resolved Jira issues. The traffic on the mailing lists also went up slightly.

Reviews of and votes for the release candidates have been done by
- (1.16) 6 committers + 1 user
- (2.4) 4 committers
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

ISSUES
There are no issues requiring board attention at this time.

RELEASES
Nutch 1.15 was released on Aug 09 2018.
The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY
Development on Nutch 1.16 has continued at a slower pace. Issues with the dependency management (see NUTCH-2669 and linked issues) currently block the releases. We still need to wait for a fix in an upcoming Ivy release, find a reliable work-around or move to Maven as build system (NUTCH-2292).

Migration of the Wiki from MoinMoin to Confluence is still in progress, waiting for INFRA-18528 to add missing pages which failed to convert.

COMMUNITY
No new PMC members added in the last 3 months.
Currently 20 committers and PMC members.
Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list has dropped to a lower level:
- user@nutch.apache.org:
  - 995 subscribers (down -12 in the last 3 months):
  - 29 emails sent to list (63 in previous quarter)
- dev@nutch.apache.org:
  - 480 subscribers (down -9 in the last 3 months):
  - 166 emails sent to list (276 in previous quarter)
    (mostly machine-generated emails from Jira, Jenkins, Wiki)

JIRA ACTIVITY
- 13 JIRA tickets created in the last 3 months
- 20 JIRA tickets closed/resolved in the last 3 months
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

ISSUES
There are no issues requiring board attention at this time.

RELEASES
Nutch 1.15 was released on Aug 09 2018.
The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY
While work on Nutch 1.16 continues, we still plan to release Nutch 2.4 first. Issues with the dependency management (see NUTCH-2669 and linked issues) currently block the releases. We need to wait for a fix in an upcoming Ivy release or find a reliable work-around.

Migration of the Wiki from MoinMoin to Confluence is in progress (waiting for INFRA-18076). We also plan to review the Wiki content and structure, cleaning up outdated pages.

COMMUNITY
No new PMC members added in the last 3 months.
Currently 20 committers and PMC members.
Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list is at a steady level:
- dev@nutch.apache.org:
  - 492 subscribers (down -8 in the last 3 months):
  - 278 emails sent to list (379 in previous quarter)
- user@nutch.apache.org:
  - 1020 subscribers (down -15 in the last 3 months):
  - 63 emails sent to list (116 in previous quarter)

JIRA ACTIVITY
- 25 JIRA tickets created in the last 3 months
- 30 JIRA tickets closed/resolved in the last 3 months
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

ISSUES
There are no issues requiring board attention at this time.

RELEASES
Nutch 1.15 was released on Aug 09 2018.
The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY
A discussion on the dev mailing list resulted in overall agreement to move the build from Ant/Ivy to Maven: users are more accustomed to Maven and we expect the move to simplify the release procedure.

The release of 1.16 should land in the next few weeks, with 40 issues already resolved.

COMMUNITY
No new PMC members added in the last 3 months.
Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list is at a steady level:
- dev@nutch.apache.org:
  - 500 subscribers (down -10 in the last 3 months):
  - 419 emails sent to list (309 in previous quarter)
    (mostly machine-generated emails from Jira, Jenkins, Wiki)
- user@nutch.apache.org:
  - 1035 subscribers (down -1 in the last 3 months):
  - 116 emails sent to list (86 in previous quarter)

JIRA ACTIVITY
- 36 JIRA tickets created in the last 3 months
- 32 JIRA tickets closed/resolved in the last 3 months
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

ISSUES
There are no issues requiring board attention at this time.

RELEASES
Nutch 1.15 was released on Aug 09 2018.
The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY
The last release of the Nutch 1.x branch (1.15) brought several major improvements: an upgrade to the new MapReduce API, the possibility to index documents into multiple Solr or Elasticsearch instances with configurable "routing", improvements and fixes of the existing HTTP/HTTPS protocol plugin and a new HTTP protocol plugin which supports HTTP/2.

Current work includes minor improvements to the HTTP/HTTPS protocol plugins and fixes of regressions relating to the MapReduce API upgrades.

COMMUNITY
No new PMC members added in the last 3 months.
Last committer and PMC addition was Roannel Fernandez on Sat Jun 23 2018.

The traffic on the user mailing list is at a steady level:
- dev@nutch.apache.org:
  - 511 subscribers (down -8 in the last 3 months):
  - 343 emails sent to list (742 in previous quarter)
    (mostly machine-generated emails from Jira, Jenkins, Wiki)
- user@nutch.apache.org:
  - 1037 subscribers (down -5 in the last 3 months):
  - 99 emails sent to list (78 in previous quarter)

JIRA ACTIVITY
- 32 JIRA tickets created in the last 3 months
- 34 JIRA tickets closed/resolved in the last 3 months
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

ISSUES
There are no issues requiring board attention at this time.

RELEASES
Nutch 1.14 was released on Dec 22 2017.
The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY
A release of the Nutch 1.x branch (1.15) is in preparation and should be made during the next two weeks. It will include an upgrade to the new MapReduce API (NUTCH-2375, now considered stable), improvements to index data using various indexing backends (Solr, etc.), improvements and fixes of the existing HTTP/HTTPS protocol plugin and a new HTTP protocol plugin which supports HTTP/2.

COMMUNITY
We got two new committers since the last report, both having joined us in June 2018:
- Omkar Reddy who completed his Nutch GSoC project in 2017
- Roannel Fernandez who contributed multiple improvements of indexer plugins

The traffic on the user mailing list is at a steady level:
- dev@nutch.apache.org:
  - 521 subscribers (down -5 in the last 3 months):
  - 750 emails sent to list (562 in previous quarter)
    (mostly machine-generated emails from Jira, Jenkins, Wiki)
- user@nutch.apache.org:
  - 1043 subscribers (down -9 in the last 3 months):
  - 80 emails sent to list (189 in previous quarter)

JIRA ACTIVITY
- 61 JIRA tickets created in the last 3 months
- 88 JIRA tickets closed/resolved in the last 3 months
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases.

ISSUES
There are no issues requiring board attention at this time.

RELEASES
Nutch 1.14 was released on Dec 22 2017.
The last release on the 2.x branch (2.3.1) dates to Jan 20 2016.

CURRENT ACTIVITY
We are working hard to make the Nutch 1.x branch stable after the upgrade to the new MapReduce API (NUTCH-2375), a huge change over 200 files which introduced several regressions.

The release of 2.4 is on the agenda.

We still hope to participate again in GSoC but have no student applications yet.

JIRA ACTIVITY
- 59 JIRA tickets created in the last 3 months
- 34 JIRA tickets closed/resolved in the last 3 months

COMMUNITY
No new committers and PMC members in the last 3 months, last PMC addition was Ralf Kotowski on Wed Jun 14 2017.

The traffic on the user mailing list is at a steady level:
- dev@nutch.apache.org:
  - 525 subscribers (down -4 in the last 3 months):
  - 571 emails sent to list (872 in previous quarter)
    (mostly machine-generated emails from Jira, Jenkins, Wiki)
- user@nutch.apache.org:
  - 1050 subscribers (down -10 in the last 3 months):
  - 189 emails sent to list (166 in previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES Nutch 1.14 was released on Dec 22 2017. The last release on the 2.x branch (2.3.1) dates to Jan 20 2016. CURRENT ACTIVITY We released Nutch 1.14 in December with 37 issues fixed and 41 new features and improvements, contributions made by 20 developers. We see increased activity on Jira and GitHub, with contributions from new developers whom we are keeping on the radar for committership. We plan to release 2.4 during the next weeks. We hope to participate again in GSoC and continue with the project "Graph Generator Tool for Nutch", which was started last year during GSoC 2017 by Omkar Reddy and made huge progress upgrading the Hadoop API but did not address the graph generation as described. JIRA ACTIVITY - 54 JIRA tickets created in the last 3 months - 64 JIRA tickets closed/resolved in the last 3 months COMMUNITY No new committers and PMC members in the last 3 months, last PMC addition was Ralf Kotowski on Wed Jun 14 2017. The traffic on the user mailing list is at a steady level: - dev@nutch.apache.org: - 530 subscribers (down -10 in the last 3 months) - 877 emails sent in the past 3 months (484 in the previous quarter) (mostly machine-generated emails from Jira, Jenkins, Wiki) - user@nutch.apache.org: - 1060 subscribers (down -8 in the last 3 months) - 166 emails sent in the past 3 months (212 in the previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES There was no release since the last board report: - Nutch 1.13 was released on Apr 1, 2017 - the last release on the 2.x branch (2.3.1) dates to Jan 20 2016. CURRENT ACTIVITY Issues - 39 JIRA tickets created in the last 3 months - 28 JIRA tickets closed/resolved in the last 3 months Omkar Reddy has finished his GSoC 2017 project. COMMUNITY No new committers and PMC members in the last 3 months, last PMC addition was Ralf Kotowski on Wed Jun 14 2017. The traffic on the user mailing list is at a steady level: - dev@nutch.apache.org: - 539 subscribers (up 3 in the last 3 months): - 537 emails sent to list (390 in previous quarter) (including 500 machine-generated emails from Jira, Jenkins, Apache Nutch Wiki) - user@nutch.apache.org: - 1069 subscribers (down -12 in the last 3 months): - 216 emails sent to list (182 in previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES There was no release since the last board report: - Nutch 1.13 was released on Apr 1, 2017 - the last release on the 2.x branch (2.3.1) dates to Jan 20 2016. CURRENT ACTIVITY Issues - 28 JIRA tickets created in the last 3 months - 18 JIRA tickets closed/resolved in the last 3 months Omkar Reddy is working on the GSoC 2017 project "NUTCH-2369 - Graph Generator Tool for Nutch". COMMUNITY Ralf Kotowski became a committer and PMC member on Wed Jun 14 2017 The traffic on the user mailing list is at a steady level: - dev@nutch.apache.org: - 536 subscribers (down -4 in the last 3 months): - 390 emails sent to list (356 in previous quarter) - user@nutch.apache.org: - 1084 subscribers (down -11 in the last 3 months): - 182 emails sent to list (172 in previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES There was one release since the last board report: - Nutch 1.13 was released on Apr 1, 2017 - the last release on the 2.x branch (2.3.1) dates to Jan 20 2016. CURRENT ACTIVITY Issues - 28 JIRA tickets created in the last 3 months - 26 JIRA tickets closed/resolved in the last 3 months We have one student application for GSoC 2017. COMMUNITY Furkan Kamaci became a committer and PMC member on Tue Jan 31 2017. The traffic on the user mailing list is at a steady level: - dev@nutch.apache.org: - 539 subscribers (up 6 in the last 3 months) - 396 emails sent in the past 3 months (164 in the previous cycle) - user@nutch.apache.org: - 1092 subscribers (down -2 in the last 3 months) - 174 emails sent in the past 3 months (240 in the previous cycle)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES There have been no releases since the last board report: - Nutch 1.12 was released on Jun 19 2016, - the last release on the 2.x branch (2.3.1) dates to Jan 20 2016. CURRENT ACTIVITY Issues - 21 JIRA tickets created in the last 3 months - 8 JIRA tickets closed/resolved in the last 3 months COMMUNITY No new PMC members in the last 3 months, last PMC addition on May 23 2016. While the traffic on the user mailing list is at a steady level, the traffic on the development list has dropped within the last 3 months: - dev@nutch.apache.org: - 534 subscribers (down -3 in the last 3 months): - 164 emails sent to list (523 in previous quarter) - user@nutch.apache.org: - 1095 subscribers (down -4 in the last 3 months): - 242 emails sent to list (284 in previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES There have been no releases since the last board report: - Nutch 1.12 was released on Jun 19 2016, - the last release on the 2.x branch (2.3.1) dates to Jan 20 2016. CURRENT ACTIVITY Furkan Kamacı completed his GSoC project successfully and the developed code is merged into Nutch's 2.x branch. Four Nutch committers will give talks about crawler technology (not only Nutch) at ApacheCon and Apache Big Data Europe in Seville. Issues - 29 JIRA tickets created in the last 3 months - 25 JIRA tickets closed/resolved in the last 3 months COMMUNITY No new PMC members in the last 3 months, last PMC addition on May 23 2016. The traffic on the mailing lists is at a steady level: - dev@nutch.apache.org: - 537 subscribers (down -8 in the last 3 months): - 487 emails sent to list (608 in previous quarter) - user@nutch.apache.org: - 1098 subscribers (down -7 in the last 3 months): - 280 emails sent to list (258 in previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES Nutch 1.12 was released on Jun 19 2016, the last release on the 2.x branch (2.3.1) dates to Jan 20 2016. CURRENT ACTIVITY We are participating in Google Summer of Code 2016 with one student (Furkan Kamacı). Issues - 44 JIRA tickets created in the last 3 months - 27 JIRA tickets closed/resolved in the last 3 months COMMUNITY Karanjeet Singh and Thamme Gowda became committers and PMC members on Sat May 21 2016. The traffic on the mailing lists is at a steady level: - dev@nutch.apache.org: - 546 subscribers (down -1 in the last 3 months): - 618 emails sent to list (914 in previous quarter) - user@nutch.apache.org: - 1108 subscribers (same as in the last 3 months): - 269 emails sent to list (335 in previous quarter)
Apache Nutch™ is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop® data structures and Apache Gora™ for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES Nutch 2.3.1 was released on Wed Jan 20 2016, the last release on the 1.x branch dates to Dec 06 2015. CURRENT ACTIVITY The version control was moved from svn to git in February. Issues - 54 JIRA tickets created in the last 3 months - 42 JIRA tickets closed/resolved in the last 3 months COMMUNITY No new committers and PMC members in the last 3 months. Last committer addition was on Nov 08 2015. We hope to participate in Google Summer of Code 2016 and have two student applications. The traffic on the mailing lists is at a steady level: - dev@nutch.apache.org: - 547 subscribers (down -5 in the last 3 months): - 935 emails sent to list (787 in previous quarter) - user@nutch.apache.org: - 1105 subscribers (down -9 in the last 3 months): - 331 emails sent to list (245 in previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES Nutch 1.11 was released on Dec 06 2015. CURRENT ACTIVITY Issues - 62 JIRA tickets created in the last 3 months - 47 JIRA tickets closed/resolved in the last 3 months A vote about moving from svn to git is in process. This option was discussed last November and met with a positive response. COMMUNITY Michael James Joyce was added as committer and PMC member on Nov 08 2015. The traffic on the mailing lists is at a steady level: - dev@nutch.apache.org: - 552 subscribers (down -15 in the last 3 months): - 818 emails sent to list (954 in previous quarter) - user@nutch.apache.org: - 1114 subscribers (down -5 in the last 3 months): - 259 emails sent to list (183 in previous quarter)
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES Nutch 1.10 was released on May 06 2015. A vote on the next release of the 2.x branch is ongoing. CURRENT ACTIVITY Cihad Güzel successfully finished his GSoC project. COMMUNITY Two people have joined the PMC and become committers: - Asitang Mishra on Sep 09 2015, and - Sujen Shah on Sep 15 2015 The traffic on the mailing lists is at a steady level.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES Nutch 1.10 was released on May 6th, 2015. The last release on the 2.x branch was in January, 2015. CURRENT ACTIVITY We are participating in Google Summer of Code 2015 with two students. One failed to pass the mid-term evaluation. COMMUNITY Giuseppe Totaro joined the PMC and became a committer on April 21st, 2015. The traffic on the mailing lists is at a steady level.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene®, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES There have been no releases since the last board report: - Nutch 2.3 was released on January 24, 2015, - Nutch 1.9 in August 2014 The release of Nutch 1.10 is planned for soon after the release of Tika 1.8, which will fix a licensing issue of a library dependency (TIKA-1581). CURRENT ACTIVITY We hope to participate in Google Summer of Code 2015 and have 3 mentors and 6 students registered. Nutch is used to prepare datasets for the TREC Dynamic Domain Track (http://trec-dd.org/) as part of Memex and NSF Polar projects. COMMUNITY Jorge Luis Betancourt Gonzalez joined the PMC and became a committer on February 18, 2015. Mo Omer followed on March 21.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. RELEASES Nutch 2.3 was released on January 24, 2015. The release includes an important upgrade of the Gora persistence layer. It also adds a REST API based Web Application which was written during the Google Summer of Code 2014. There has been no release of the Nutch 1.x branch since the previous report (Nutch 1.9 was released in August 2014). The release of Nutch 1.10 is planned for the next weeks. CURRENT ACTIVITY Chris Mattmann has begun projects related to Nutch in his CSCI 572 Search Engines class at USC. These include dynamic page rendering and parsing with Ajax, porting of REST services from 2.x to 1.x, and visualization of the crawl graph. We plan to participate in Google Summer of Code 2015. COMMUNITY Jorge Luis Betancourt Gonzalez was invited to become a PMC member and committer on Feb 7, 2015. The onboarding process is ongoing. Last new committer: Talat Uyarer joined the PMC and committers on Mar 31, 2014. The traffic on the mailing lists is at a steady level.
No report was submitted.
WHEREAS, the Board of Directors heretofore appointed Julien Nioche to the office of Vice President, Apache Nutch, and WHEREAS, the Board of Directors is in receipt of the resignation of Julien Nioche from the office of Vice President, Apache Nutch, and WHEREAS, the Project Management Committee of the Apache Nutch project has chosen by vote to recommend Sebastian Nagel as the successor to the post; NOW, THEREFORE, BE IT RESOLVED, that Julien Nioche is relieved and discharged from the duties and responsibilities of the office of Vice President, Apache Nutch, and BE IT FURTHER RESOLVED, that Sebastian Nagel be and hereby is appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. Special Order 7C, Change the Apache Nutch Project Chair, was approved by Unanimous Vote of the directors present.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY We released Nutch 1.9 in August which fixed several important bugs and added quite a few improvements. There have been no releases of the Nutch 2.x branch since the previous report. We are still planning to release 2.3 soon and are waiting for a few JIRA issues to be resolved first. The next release will benefit from improvements made to Apache Gora. We successfully completed our first participation in GSoC and the Nutch 2.x branch now comes packaged with a self-contained Apache Wicket-based Web Application; see [1]. COMMUNITY There has been no change in the composition of the PMC and committers list since Talat Uyarer joined the PMC and committers on 31/03/2014. The traffic on the mailing lists is at a steady level. We got 1 talk and 1 workshop accepted for ApacheCon EU. The workshop will be conducted by 3 Nutch committers [2]. Nutch and related projects (Tika, Hadoop, Lucene) will probably get used in a new DARPA project called Memex where 3 current and 1 emeritus Nutch committers will be involved [3]. [1] http://nutch.apache.org/#sthash.WX93FBI4.dpuf [2] http://s.apache.org/kwN [3] http://www.darpa.mil/NewsEvents/Releases/2014/02/09.aspx
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY There has been a lot of activity since the previous report and we are discussing releasing Nutch 1.9 within the next month or so. We fixed several important bugs and committed various improvements to the trunk. There have been no releases of the Nutch 2.x branch since the previous report and there is limited activity on that branch. We are still planning to release 2.3 during the summer and are waiting for a few JIRA issues to be resolved first. Apache Nutch has, for the first time, engaged in GSoC. Lewis John McGibbney is working with student Fjodor Vershinin on the project "Create a Wicket-based Web Application for Nutch" [0], which essentially will allow anyone to access, run, configure, provision, queue and execute Nutch crawl jobs within the browser. Progress is as follows: * 1st report progress is here http://wiki.apache.org/nutch/FirstReport * Documentation on REST API => http://wiki.apache.org/nutch/NutchRESTAPI * Mentor comments are positive. The project is going to succeed. It will be a large step forward for the Nutch project in general. We also ported the Nutch website to Apache CMS, and set up an IRC channel and a Twitter account for the project (https://twitter.com/ApacheNutch). COMMUNITY There has been no change in the composition of the PMC and committers list since Talat Uyarer joined the PMC and committers on 31/03/2014. The traffic on the mailing lists is at a steady level. We submitted 2 talks and 1 workshop for ApacheCon EU. One of the talks will present the results of a user survey [1] conducted by DigitalPebble Ltd. The workshop would be done by 3 Nutch committers.
A Nutch-related talk has been submitted for LuceneRevolution by a member of the community. [0] https://issues.apache.org/jira/browse/NUTCH-841 [1] http://s.apache.org/zf
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY Apache Nutch v1.8 was released on 17th March 2014 and contained many improvements, bug fixes and dependency upgrades. There have been no releases of the Nutch 2.x branch but that branch is benefiting from the work being done on Apache Gora, partly by Nutch users and contributors. COMMUNITY Talat Uyarer joined the PMC and committers on 31/03/2014. Ferdy Galema became emeritus in Feb 2014. The traffic on the mailing lists is at a steady level.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY No releases since the previous board report but quite a few bugfixes and improvements, notably a more abstract document de-duplication mechanism (https://issues.apache.org/jira/browse/NUTCH-656) and the removal of deprecated code. Nutch 2.x should soon benefit from improvements being made in Apache Gora, in particular GORA-117. We are seeing contributions and bugfixes from new users. COMMUNITY No new committers/PMC members since the previous report. The traffic on the user and dev mailing lists is quite steady and questions from new users usually get replied to reasonably quickly. Julien Nioche gave a talk on Nutch at the Lucene/Solr Revolution EU conference.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY No releases since the previous board report but quite a few bugfixes and improvements, notably a contribution from Amazon for indexing to AWS CloudSearch (https://issues.apache.org/jira/browse/NUTCH-1517) (which needs additional work) and a discussion on how to improve document deduplication (https://issues.apache.org/jira/browse/NUTCH-656). We are seeing contributions and bugfixes from new users. COMMUNITY No new committers/PMC members since the previous report. The traffic on the user and dev mailing lists has come back to its usual levels after a few months of exceptionally high activity. A talk on Nutch by Julien Nioche has been accepted for Lucene/Solr Revolution EU 2013. DigitalPebble Ltd published a benchmark comparing the performance of both versions of Nutch, which should help improve Apache Gora and hence Nutch 2.x in the longer term.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, based respectively on Apache Hadoop data structures and Apache Gora for leveraging NoSQL databases. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY We have done 3 releases since the previous board report: * Nutch 1.7 : release for the trunk branch * Nutch 2.2 : release for the 2.x branch * Nutch 2.2.1 : minor release to include an important bug fix (NUTCH-1591) These releases contain numerous bugfixes and improvements, notably the upgrades of various Apache-related dependencies (Hadoop, Tika) and the addition of NUTCH-1047 in the 1.x branch, which allows new indexers to be plugged in. COMMUNITY The traffic on the user and dev mailing lists has kept a relatively high level in the last quarter. Our user@ list in June had the highest traffic since July 2011. We are also getting contributions and bugfixes from new users. It was announced at the Berlin Buzzwords conference that the CommonCrawl project [http://commoncrawl.org/] is now using Apache Nutch for its future crawls.
Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY There have been no new releases since the last report but quite a few improvements and issues fixed on both trunk and the 2.x branches, in particular (NUTCH-1047) Pluggable indexing backends, which is a major improvement and gives more flexibility to the indexing. The parsing of robots.txt has been delegated to the Crawler Commons project. Work has been done on improving the WIKI pages and limiting their access as we were getting loads of spam. There is one issue planned for GSOC 2013 [NUTCH-841]. COMMUNITY The traffic on the user and dev mailing lists has kept a relatively high level in the last quarter. No less than 3 new Committers / PMC Members have joined Nutch since the previous report (Tejas Patil / Kiran Chitturi / Lufeng). Chris Mattmann is actively teaching Nutch in his CSCI 572 Search Engines and Information Retrieval class [http://www-scf.usc.edu/~csci572/] at USC this semester (Spring 2013) and includes an assignment that uses Nutch to crawl the FBI Vault dataset for students to explore and experiment with. The CommonCrawl project are planning to test-drive Nutch for a future iteration of their dataset.
Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY Nutch 1.6 has been released since the last report. There have been quite a few improvements and issues fixed since on both trunk and the 2.x branches. There is currently a discussion about a Nutch Admin GUI using Wicket. COMMUNITY The traffic on the user and dev mailing lists has kept a relatively high level in the last quarter. Julien Nioche gave a talk about Nutch at the ApacheCon Europe in November [http://s.apache.org/ndp] and did an interview for InfoQ about Nutch 2 [http://s.apache.org/Sz9]. Julien also talked to the people from the CommonCrawl project [http://commoncrawl.org/] about getting them to contribute some of their code to Apache Nutch and get them to use it for their crawls. No tangible results yet. No new Committers or PMC Members since the previous report but a vote is under way for inviting Tejas Patil to become a committer and PMC member.
Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY We have been quite active lately. Nutch 2.1 has been released since the last report and we should have a 1.6 release soon. COMMUNITY The traffic on the user and dev mailing lists has kept a relatively high level in the last quarter. Julien Nioche will give a talk about Nutch at the ApacheCon Europe. No new Committers or PMC Members since the previous report.
DESCRIPTION Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY We have been very active lately. Nutch 1.5 has been released since the last report and we have just released 1.5.1 which addresses some blocking issues in 1.5. We have also released Nutch 2.0 on the 7th July which is a major milestone. We are working on a press announcement with Sally. The Apache Nutch PMC has voted Sebastian Nagel to become a Nutch committer and PMC member in April. COMMUNITY The traffic on the user and dev mailing lists has kept a relatively high level in the last quarter. There have not been any meetings or talks related to Nutch since the previous report.
(Nutch)
DESCRIPTION Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY No release has been made since the last report. A number of issues have been filed in JIRA for 1.5 and we are planning to release Nutch 1.5 and the nutchgora branch shortly. Recent patches have upgraded some of our dependencies on other Apache projects such as Tika 1.1 and Hadoop 1.0. A functionality which was often mentioned on the user lists has been committed (parse-metatags). The documentation on the WIKI has been improved. COMMUNITY The traffic on the user and dev mailing lists has kept a relatively high level in the last quarter. Questions from users are usually answered promptly by the community. There have not been any meetings or talks related to Nutch since the previous report.
DESCRIPTION Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY Nutch 1.4 was released on 26th November 2011. A number of issues have been filed in JIRA for 1.5. The Gora-based branch (2.0) is benefiting from the progress made in Gora but a release is not yet planned. Recent patches have upgraded some of our dependencies on other Apache projects such as Tika, Hadoop or Gora. COMMUNITY The traffic on the user and dev mailing lists has kept a relatively high level in the last quarter. Some of the committers met at ApacheCon NA 2011 in November and discussed Nutch and its interaction with Gora. Finally, the book Tika in Action (one of the co-authors, Chris Mattmann, is a Nutch committer and PMC member) contains quite a few references to Nutch, which should contribute to exposing the project to a wider audience.
DESCRIPTION Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY Due to the lack of progress of Nutch 2.0, it has been decided to move it from trunk to a separate branch (nutchgora) and move the stable version 1.4 back into trunk. Work has started towards a release of 1.4, which should happen before the end of this month, and quite a few issues on JIRA have already been earmarked for v1.5. The website has been modified to reflect the recommendations of the foundation and our wiki pages have been greatly updated (mostly thanks to our new committer Lewis J. McGibbney). COMMUNITY There is increased traffic on the user and dev mailing lists, with more non-committers providing help and advice but also contributing suggestions and patches. A few members of the community have expressed their concern about 2.0 (nutchgora) not being the main focus of the development but the majority of users/committers seem happy to leverage the stable 1.x branch.
WHEREAS, the Board of Directors heretofore appointed Andrzej Bialecki to the office of Vice President, Apache Nutch, and WHEREAS, the Board of Directors is in receipt of the resignation of Andrzej Bialecki from the office of Vice President, Apache Nutch, and WHEREAS, the Project Management Committee of the Apache Nutch project has chosen by vote to recommend Julien Nioche as the successor to the post; NOW, THEREFORE, BE IT RESOLVED, that Andrzej Bialecki is relieved and discharged from the duties and responsibilities of the office of Vice President, Apache Nutch, and BE IT FURTHER RESOLVED, that Julien Nioche be and hereby is appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. Resolution 7C passed by unanimous roll call vote.
Andrzej Bialecki does not have as much time anymore in his role as VP of the Nutch project and he has started a thread on stepping down and electing a new chair. It's probably not in time for this month's board meeting, but we'll have a resolution ready for next month. Releases: We're still working on the 2.0 Nutch branch. Nutch 2.0 integrates Gora to provide backend independence, allowing Nutch to store its content in HBase, MySQL, HSQL and Cassandra. We are currently focusing on the testing phase, and trying to benchmark 2.0 compared to the 1.x series. We rolled a 1.3 release, the most stable release of Nutch to date. We also created a 1.4 branch and are actively working on developing it. Chris Mattmann volunteered to do a 1.4 release when the time comes. Community: Lewis John McGibbney was elected as a Nutch PMC member and committer. Mailing list activity is steady, alternating between folks using Nutch 1.x, and those bleeding-edgers who are using the 2.0 trunk. Students in Chris Mattmann's CSCI 572 Search Engines and Information Retrieval course at USC are actively looking at final projects involving Nutch.
Report for the Apache Nutch project: April 2011 There are no board level issues at this point in time. Releases: Work still progresses on the 2.0 Nutch branch which integrates Gora to provide backend independence, allowing Nutch to store its content in HBase, MySQL, HSQL and Cassandra. We are currently focusing on the testing phase, and trying to benchmark 2.0 compared to the 1.x series. A number of improvements from 2.0 have been backported into 1.3. Chris Mattmann has volunteered to RM the 1.3 release and hopefully will cut an RC within the next few weeks. Apache Gora made its first incubating release (0.1-incubating) and we are working to upgrade Nutch to use the released version of Gora. Community: The Nutch PMC added Alexis Detreglode as a Committer and PMC member. Mailing list activity is steady, alternating between folks using Nutch 1.x, and those bleeding-edgers who are using the 2.0 trunk.
Report for the Apache Nutch project: January 2011 There are no board level issues at this point in time. Releases: Work progresses on the 2.0 Nutch branch which integrates Gora to provide backend independence, allowing Nutch to store its content in HBase, MySQL, HSQL and Cassandra. We are currently focusing on the testing phase, and trying to benchmark 2.0 compared to the 1.x series. There has been some desire for patches and updates to the 1.x and we are considering rolling a 1.3 release. If this comes to pass, Chris Mattmann has volunteered to RM the release. Community: No new PMC members or committers were elected in this quarter. Otis Gospodnetic decided to go Emeritus from the PMC, and the board has ACK-ed. Mailing list activity is steady, alternating between folks using Nutch 1.x, and those bleeding-edgers who are using the 2.0 trunk. Chris Mattmann gave a talk at ApacheCon NA on Nutch titled "Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond".
=== Nutch Status Report: October 2010 === ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY In September the project released version 1.2 from the stable branch. Nutch trunk has been merged with the so-called "nutchbase" branch, which constitutes a major architectural change - the Nutch storage layer now uses an object-relational mapping API called Gora (currently undergoing incubation), with implementations for SQL databases, HBase and Cassandra. This means that data collected and processed with Nutch now becomes available to all third-party tools that can work with these storage frameworks. The merge is now complete and bugfixing continues, with the goal to reach a 2.0 release some time during Q1. An additional branch was created with a snapshot of the codebase before merging the Gora framework, but which includes other refactoring and delegation of functionality to external projects (such as Tika and Solr). The purpose of this branch is to allow for some level of backward-compatibility with Nutch 1.2, though most efforts now concentrate on the trunk. COMMUNITY Markus Jelsma was voted in as a new committer. Andrzej Bialecki gave a talk on "Integration of Solr with crawlers: Nutch, LCF and Aperture" at the Lucene Revolution conference in Boston.
=== Nutch Status Report: July 2010 === ISSUES There are no issues requiring board attention at this time. CURRENT ACTIVITY The move to the TLP has been completed. The 1.1 release has been published. We plan to maintain the 1.x branch in preparation for a maintenance 1.2 release some time in Q3/Q4, and bugfixes are being applied to both trunk and 1.x as relevant. Significant progress has been made in cleaning up the trunk version according to the roadmap and delegating large parts of functionality to Solr and Tika, and during the next two months we plan to merge trunk with a branch known as Nutchbase, which uses a lightweight ORM framework Gora to enable Nutch to use multiple storage backends. COMMUNITY No changes to the PMC or committers. Chris A. Mattmann will give a talk at the ApacheCon Atlanta in November on "Nutch 2.0 and beyond". Andrzej Bialecki will give a talk on "Integration of Solr with crawlers: Nutch, LCF and Aperture" at the Lucene Revolution conference in Boston in October.
Jim complimented the project on the format of their report.
Greg to pursue a report for Nutch.
This is the first report of the Nutch project as a TLP. Before April 2010 Nutch was a subproject of Apache Lucene.

Moving to TLP
=============
User, dev, and private mailing lists have been migrated to their new locations under @nutch.apache.org. SVN and site have been moved to new locations as well - see INFRA-2656 and INFRA-2657. In the following weeks we plan to complete the move to restore all environment to a working state under the new locations.

Development
===========
The project is in the process of releasing version 1.1, expected to be completed within a week. Community started discussing the design of the next version of Nutch. There are many significant architectural changes planned for the next version, in order to reduce code duplication and to benefit from other Apache components, such as Tika, Solr and HBase. A version of Nutch that uses an ORM framework to support different storage implementations is expected to be merged with trunk/ some time during Q3.
WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the "Apache Nutch Project", be and hereby is established pursuant to Bylaws of the Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web search platform; and be it further

RESOLVED, that the office of "Vice President, Apache Nutch" be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project:

* Andrzej Bialecki <ab@apache.org>
* Otis Gospodnetic <otis@apache.org>
* Dogacan Guney <dogacan@apache.org>
* Dennis Kubes <kubes@apache.org>
* Chris Mattmann <mattmann@apache.org>
* Julien Nioche <jnioche@apache.org>
* Sami Siren <siren@apache.org>

RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged.
NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. Special Order 7B, Establish the Apache Nutch Project, was approved by Unanimous Vote of the directors present.
Nutch is nearly ready to attempt graduation. Recently we ported our wiki from Sourceforge, so now the project is entirely hosted at Apache. All committers are active. We disabled several components when we moved to Apache, due to license compatibility problems, but nearly all of these have now been resolved. The Nutch Organization filed a Software Grant with the Apache Software Foundation, formally giving all Nutch software to Apache.