This was extracted (@ 2024-11-20 22:10) from a list of minutes
which have been approved by the Board.
Please Note
The Board typically approves the minutes of the previous meeting at the
beginning of every Board meeting; therefore, the list below does not
normally contain details from the minutes of the most recent Board meeting.
WARNING: these pages may omit some original contents of the minutes.
Meeting times vary; the exact schedule is available to ASF Members and Officers. Search for "calendar" in the Foundation's private index page (svn:foundation/private-index.html).
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Released 3.0.0-BETA2 in July. Working towards a 3.0.0 release.

## Community Health:
Statistics reporter is down as of this writing (Oct 9 2024). General sense is that community health is the same as last quarter.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Our last 2.x release was at the beginning of April, and we're in the process of releasing 3.0.0-BETA2. We've dramatically improved configurability in the tika-pipes modules, and we added a gRPC server. We've made numerous other improvements throughout the project. We've also managed to keep up with @dependabot. :)

## Community Health:
CHI is at 4.70. The project stats are not available as I write this report. There has been some slowdown in activity because of $DAYJOBs, but we've seen some great activity in the gRPC server and towards increasing configurability in tika-pipes.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Released 2.9.2 on April 2. This included upgrades to dependencies and a few bug fixes. The project is working towards a 3.0.0-BETA2, and hopefully a 3.0.0 shortly thereafter. We're making improvements to our docker deployments and helm charts. We're working on integrating fully recursive extraction of raw bytes and text+metadata for embedded files into our pipes modules. We're making progress towards adding a gRPC server.

## Community Health:
Chi is still a healthy 4.7. We're continuing to be on the lookout for new PMC/committers.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 3.0.0-BETA in mid December. We're aiming for a 3.0.0 release in the next few weeks. The big difference between the 2.x and 3.x branch is that the 3.x branch will require Java 11. We plan to maintain the 2.x branch for 6 months after we release 3.0.0. We've seen a decrease in CVEs over the last quarter.

## Community Health:
Community health is still a robust 4.7. We saw a decline in traffic on dev@ likely due to the end of year/holiday season.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.9.0 on 28 August. We're working towards 3.0.0-BETA, which will require Java 11 and transition from javax to jakarta. We anticipate starting that release process in mid to late October. We continue to improve file type detection, fix small bugs and update dependencies. We're discussing running a 2.9.1 release soon to benefit from commons-compress's recent fix of CVE-2023-42503.

## Community Health:
Our community health score is a healthy 4.70. We've seen no significant changes in email traffic, commits or JIRA issues since last quarter.
@Christofer: follow up about ghost vote
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.8.0 on 15 May, and we've started the preliminary regression tests in preparation for the next release. We've had some great contributions from a new user of Tika for file type detection via a collaboration using Common Crawl data. We updated our regression corpus to remove most of the truncated PDFs from Common Crawl and to add ~100k new PDFs from https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

## Community Health:
We've seen increases in issues opened and closed (largely driven by the new mime patterns). Our CHI is 4.70 (Healthy).
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Our last release (2.7.0) was on 3 Feb 2023, and we're working towards the release of 2.8.0 in the next few weeks. We're expanding our file type detection based on data from Common Crawl. We continue to make bug fixes and improvements throughout the code base.

## Community Health:
We've seen a decrease in opened issues, commits and dev list emails. The new requirement to register for a JIRA account may be suppressing opened issues, but we have no evidence of that. Our CHI is 4.70 (Healthy).
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We had a minor release of 2.6.0 in November. We are starting to discuss goals for a 3.x release, although we have no planned release date.

## Community Health:
The project's Chi is unchanged from last quarter -- 4.7, healthy. We've seen decreases in email, JIRA and commits. We suspect this is because of the stabilization and adoption of the 2.x branch. There's a trivial increase in GitHub PRs likely driven by dependabot.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Our 1.x branch reached EoL on 30 September 2022. We released the last 1.x version, 1.28.5, on 14 September. We also released the next minor revision of our 2.x branch, 2.5.0, on 3 October.

## Community Health:
No major changes. The project's Chi is 4.7, still healthy. There's an increase in GitHub PRs driven by dependabot and our addition of some dependencies that have daily/weekly releases.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, PDF, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.4.0 on May 2 and 2.4.1 on June 17. We released two security-related fixes to our 1.x branch in May and one in June. The new functionality in our 2.x branch has included dramatically improving customization entry-points and adding a generalized rendering interface.

## Community Health:
We saw an increase in JIRA activity and commits and a modest decrease in PRs opened and closed and activity on our mailing lists. Our CHI is 5.5 (Healthy).
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
On February 10th, we announced that our 1.x branch is in security-only maintenance until a final end-of-life on 30 September 2022.

We made a 2.3.0 release and a 1.28.1 release in February, and we're on the cusp of new releases for both branches. These releases include security-related fixes in our code base and in our dependencies.

We continue to improve our documentation for the 2.x branch, and we've helped several people with questions on the breaking changes in the new branch.

We had a painful antisemitic/Nazi Google-Meet bomb during our Meetup in January, and we've taken steps to limit membership and access to our Meetup account.

## Community Health:
We haven't seen any significant changes in community health. Since we added dependabot, we've seen a significant increase in PRs, but otherwise we have slight decreases in issues, commits and mail traffic. We take it as a good sign that traffic has decreased as people are migrating to our 2.x branch, apparently without too many surprises. Our Community Health Score (Chi) is 6.33 (Healthy).
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 33 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is roughly 9:8.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released two versions of our main branch, 2.2.0 on December 16 and 2.2.1 on December 23, to upgrade dependencies (partly in response to CVEs in log4j2 and jdom2) and to fix some regressions.

We made a breaking change in our 1.x branch to upgrade from log4j 1.x to log4j2, and we released Tika 1.28 on December 23.

We started a virtual Meetup group for Tika, and we've held two hands-on tutorials (one on tika-eval, and one on the tika-pipes module). We have another meetup planned for January 2022, and we look forward to holding these every month or so.

## Community Health:
We haven't seen any significant changes in community health. We have seen an increase in PRs and commit activity and a slight decrease in email traffic and JIRA issues. Our Community Health Score (Chi) is 6.33 (Healthy).
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
On 23 September, ASF's V.P., Data Privacy, requested via private@ that our project shut down our public regression corpus (largely files pulled out of Common Crawl). We asked to carry on that discussion publicly and received no response.

The regression corpus is an unparalleled resource for parser developers on Tika, POI, PDFBox and others. As evidenced by its use in processing files in both the Panama Papers and, recently, in the Pandora Papers (see below), Apache Tika and its dependencies need to be tested on large corpora of naturally occurring files. The Chief Technology Officer of the PDF Association has written two posts on the critical need and transformative power of our bug tracker corpus (https://www.pdfa.org/a-new-stressful-pdf-corpus/ and https://www.pdfa.org/stressful-pdf-corpus-grows/).

We need to find a solution that will enable this resource to continue, perhaps with a strict robots.txt file and password protection. We look forward to working with ASF's Data Privacy V.P. to find a solution.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 33 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is roughly 9:8.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.0.0 on 2021-07-19 and 2.1.0 on 2021-08-23. We've gotten useful feedback on our 2.x branch and we're continuing to improve that. We're also working to improve the documentation to help users migrate to the 2.x branch. We're grateful to see some major projects making the migration, including Datafari/FranceLabs (https://twitter.com/francelabs/status/1447470094783819778).

Apache Tika appeared in the news this quarter as being a critical component of the International Consortium of Investigative Journalists' (ICIJ) platform (https://github.com/ICIJ/datashare) used to analyze the Pandora Papers (https://www.wired.co.uk/article/pandora-papers-leak). Previously, the ICIJ reported using Tika to process the Panama Papers (https://source.opennews.org/articles/people-and-tech-behind-panama-papers/).

## Community Health:
The CHI score went up from 8.37 in the last quarter to 9.6. We've seen modest decreases in JIRA activity, email lists and PRs. We're not sure of the underlying causes.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 33 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is roughly 9:8.

Community changes, past quarter:
- Nicholas DiPiazza was added as a PMC member on 2021-07-06
- Nicholas DiPiazza was added as committer on 2021-06-03

## Project Activity:
We released 2.0.0-BETA on 2021-05-25. This includes the new pipes module, which improves robustness, scalability and ease of integration at scale. We look forward to releasing a stable 2.0.0 in the next quarter.

In Tika 2.x, we're also ending the reliance on two custom forks we had to create for security and backwards compatibility reasons. That we're once more relying on external, maintained dependencies for these capabilities is an important step forward.

We released 1.27 on 2021-07-06. This includes numerous bug fixes and dependency updates.

## Community Health:
The project continues to have strong community health (8.37 Chi score). Activity on mailing lists, commits and pull requests was slightly down over the last quarter. However, the number of JIRA issues opened and closed both increased.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 32 committers and 31 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Peter Lee on 2020-11-24.
- No new committers. Last addition was Peter Lee on 2020-11-25.

## Project Activity:
We released 2.0.0-ALPHA on 2021-01-16 and a stable release, 1.26, on 2021-03-29. The ALPHA release is an exciting step towards a BETA or stable 2.x release in the next month or so.

We recently added several languages to our language detector and made improvements in our mock parser, which allows users to harden their pipelines against parser failures.

We're nearing completion on a new pipes module that will allow for easier integration with datastores (e.g. S3) and search engines, such as Apache Solr.

## Community Health:
Code contributors have increased in the last quarter, and we've seen an impressive increase in email traffic. Our Community Health Score (Chi) is "Super Healthy". We'll continue to be on the lookout for potential new committers/PMC members.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 32 committers and 31 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- Peter Lee was added to the PMC on 2020-11-24
- Peter Lee was added as committer on 2020-11-25

## Project Activity:
We released 1.25 in November 2020. This version included numerous dependency upgrades, a fix for a critical license issue with Adobe's xmpcore, and several new parsers. We are on the cusp of a release of Tika 2.0.0-ALPHA.

On our file corpus development side project, we gathered "stressful" attachments from 35 parser issue trackers. This includes more than a million files (551GB). These are critical for stress testing our own parsers, and we're making the corpus available to other open source and commercial projects: https://corpora.tika.apache.org/base/docs/bug_trackers/. See, for example: https://www.pdfa.org/a-new-stressful-pdf-corpus/ and https://www.pdfa.org/stressful-pdf-corpus-grows/

## Community Health:
As noted above, we've added Peter Lee as a committer/PMC. A number of our JIRA and GitHub health metrics were down slightly in the last quarter. We attribute this to the holidays/new year. However, we saw an uptick in user@ traffic and a slight increase in commits.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 31 committers and 30 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
We're on the cusp of releasing 1.25. We have a blocker with an accidentally included, ASL-incompatible license in a dependency from Adobe (https://issues.apache.org/jira/browse/TIKA-3204).

We've made good progress in a major refactoring for Tika 2.0.0 that was based on a significant amount of earlier work by committer/PMC Bob Paulin. This refactoring will allow for cleaner dependency management and more modularized parsers. Once we release 1.25, we should be ready to start releasing 2.0.0-ALPHA.

We completed the migration for our primary branch from 'master' to 'main.'

## Community Health:
We've seen an uptick in issues and in PRs. Much of this increase in activity is driven by a commons committer who has taken an interest in improving our codebase. We are on the lookout to expand our committership/PMC.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 31 committers and 30 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
We released 1.24.1 on April 21. This release included numerous security fixes (CVE-2020-9489) which we identified through a new fuzzing module.

We've moved our regression testing server and corpus from Rackspace to a new server kindly hosted by a committer on PDFBox. We started a new mailing list (corpora-dev@tika.apache.org) for this resource to enable cross-project discussion (POI, PDFBox, Tika and Commons Compress) and to encourage contributions and input from a wider audience.

We've removed whitelist/blacklist terminology from the project, and we are in the process of migrating from 'master' branch to 'main'.

## Community Health:
Our Community Health Score of 6.33 suggests we are doing well. We've seen a slight increase in traffic on our dev list. Commits, issues and traffic on the dev list have decreased slightly, but nothing worthy of board attention.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention at this time.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 31 committers and 30 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
1.24 was released on 2020-03-17.

We added a fuzzing module to identify denial of service (DoS) vulnerabilities, and we're currently preparing a 1.24.1 release that fixes several DoS vulnerabilities, primarily in our dependencies.

We've had mixed success in getting some of our (ASF-licensed but non-ASF) dependencies to fix their code in a timely manner, and we've had to fork some dependencies and release them separately. We continue to work with these projects to improve security.

## Community Health:
We've seen decreases in email and issue traffic in the past quarter, but nothing alarming.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention at this time.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 32 committers and 31 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
1.23 was released on 2019-12-06. This included improved file type detection and a new parser for XLIFF files.

Nicholas DiPiazza recently contributed a parser for OneNote files, which will be available in the next release.

Over the last two or three quarters, we've seen an increase in reports of vulnerabilities in our dependencies. We upgrade when we can, but there are some upstream dependencies out of our control that have required some non-trivial solutions, including forking and fixing (as a last resort).

## Community Health:
We've seen an increase in email, commits and other activity compared with last quarter, but overall, no significant changes.
## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention at this time.

## Membership Data:
Apache Tika was founded 2010-04-20 (9 years ago)
There are currently 32 committers and 31 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- Tilman Hausherr was added to the PMC on 2019-10-02
- Tilman Hausherr was added as committer on 2019-10-02

## Project Activity:
1.22 was released on 2019-08-01.

SooMyung Lee (soomyung) and JinSup Kim (ddoleye) contributed a parser for HWP v5 files.

We added significant improvements in language coverage for the tika-eval module by collaborating with OpenNLP: we now detect 121 (vs. 75) languages and have common words lists for 121 (vs. 21) languages. This means we can now identify potential problems/regressions in content extraction for 121 languages.

## Community Health:
No significant changes in community health. There were decreases in dev@ email traffic and in closed tickets, but we saw a slight increase in contributors and opened PRs.

The team gave two conference presentations on Tika -- ApacheCon NA and Activate -- and we have an upcoming talk at ApacheCon EU.
## Description: - Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: - There are no issues requiring board attention at this time. ## Activity: - The project had one release in the last quarter. - We're seeing increased reports of vulnerabilities in our dependencies on our JIRA. We added ossindex-maven-plugin a while ago and update as indicated, but a non-trivial share of the communications and bug reports we receive now focuses on this topic. - We plan to start the release process in the next few weeks for the next version. ## Health report: - Nothing noteworthy in email, commits, etc. ## PMC changes: - Currently 30 PMC members. - No new PMC members added in the last 3 months - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018 ## Committer base changes: - Currently 31 committers. - No new committers added in the last 3 months - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018 ## Releases: - 1.21 was released on Sun May 19 2019 ## Mailing list activity: - Regular fluctuations; nothing significant to report. - dev@tika.apache.org: - 190 subscribers (down -1 in the last 3 months): - 716 emails sent to list (358 in previous quarter) - user@tika.apache.org: - 363 subscribers (down -3 in the last 3 months): - 70 emails sent to list (54 in previous quarter) ## JIRA activity: - 53 JIRA tickets created in the last 3 months - 45 JIRA tickets closed/resolved in the last 3 months
## Description: - Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: There are no issues requiring board attention at this time. ## Activity: - We've been making improvements to our PDFParser, and we added a CSV detector/parser. Other than that, it has been a quiet quarter. We look forward to our next release as soon as the next versions of PDFBox and POI are available. ## Health report: - Health is in good shape. No significant changes in health. ## PMC changes: - Currently 30 PMC members. - No new PMC members added in the last 3 months - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018 ## Committer base changes: - Currently 31 committers. - No new committers added in the last 3 months - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018 ## Releases: - Last release was 1.20 on Fri Dec 21 2018 ## Mailing list activity: - We've seen a drop-off in emails to the dev list compared with last quarter. This may reflect a lower commit rate, but we do not know what is driving this. - dev@tika.apache.org: - 192 subscribers (down -3 in the last 3 months): - 363 emails sent to list (767 in previous quarter) - user@tika.apache.org: - 366 subscribers (up 1 in the last 3 months): - 54 emails sent to list (54 in previous quarter) ## JIRA activity: - 44 JIRA tickets created in the last 3 months - 23 JIRA tickets closed/resolved in the last 3 months
## Description: - Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: - There are no issues requiring board attention at this time. ## Activity: - The project had two releases in the last quarter: one a bug fix and one with more substantial changes. - The project refreshed its 1TB regression corpus (hosted by Rackspace) from Common Crawl data, with heavy oversampling in binary file formats and a better diversity of character encodings and languages. This revealed several areas for improvements in Tika and its dependencies. - Work continues on the new 2.x based master. ## Health report: - Health is in decent shape. No significant changes in health. ## PMC changes: - Currently 30 PMC members. - No new PMC members added in the last 3 months - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018 ## Committer base changes: - Currently 31 committers. - No new committers added in the last 3 months - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018 ## Releases: - 1.19.1 was released on Mon Oct 08 2018 - 1.20 was released on Fri Dec 21 2018 ## JIRA activity: - 60 JIRA tickets created in the last 3 months - 39 JIRA tickets closed/resolved in the last 3 months
## Description: - Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: - There are no issues requiring board attention at this time. ## Activity: - The project had two releases in the last quarter. In addition to numerous bug fixes, we've focused on improving robustness via tika-server and some new options in our ForkParser. - In the two recent releases, we've fixed several CVEs, either in our own code or by notifying upstream projects and then upgrading dependencies. Two security researchers, Tobias Ospelt and Rohan Padhye, have reported numerous issues identified via fuzzing. We've granted Tobias access to our regression vm to continue his work on our 1TB of regression files. - We added a security page to our site to record vulnerabilities: http://tika.apache.org/security.html - Work continues on the new 2.x based master. ## PMC changes: - Currently 30 PMC members. - No new PMC members added in the last 3 months - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018 ## Committer base changes: - Currently 31 committers. - No new committers added in the last 3 months - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018 ## Releases: - 1.19.1 was released on Tue Oct 9 2018 - 1.19 was released on Mon Sep 17 2018 ## Mailing list activity: - dev@tika.apache.org: - 198 subscribers (up 3 in the last 3 months): - 703 emails sent to list (588 in previous quarter) - user@tika.apache.org: - 361 subscribers (up 4 in the last 3 months): - 55 emails sent to list (49 in previous quarter) ## JIRA activity: - 64 JIRA tickets created in the last 3 months - 45 JIRA tickets closed/resolved in the last 3 months
## Description: Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: - There are no issues requiring board attention at this time. ## Activity: - The project is currently finalizing the contents of a new 1.19 release with updates to the Mimetype detection and deep learning modules as well as several other important upgrades. - Work continues on the new 2.x based master. ## PMC changes: - Currently 30 PMC members. - Thejan Wijesinghe was added to the PMC on Tue Apr 17 2018 ## Committer base changes: - Currently 31 committers. - Thejan Wijesinghe was added as a committer on Wed Apr 18 2018 ## Releases: - 1.18 was released on Mon Apr 23 2018 ## Mailing list activity: - dev@tika.apache.org: - 194 subscribers (down -1 in the last 3 months): - 639 emails sent to list (762 in previous quarter) - user@tika.apache.org: - 356 subscribers (up 0 in the last 3 months): - 50 emails sent to list (59 in previous quarter) ## JIRA activity: - 58 JIRA tickets created in the last 3 months - 33 JIRA tickets closed/resolved in the last 3 months
WHEREAS, the Board of Directors heretofore appointed David Meikle (dmeikle) to the office of Vice President, Apache Tika, and WHEREAS, the Board of Directors is in receipt of the resignation of David Meikle from the office of Vice President, Apache Tika, and WHEREAS, the Project Management Committee of the Apache Tika project has chosen by vote to recommend Tim Allison (tallison) as the successor to the post; NOW, THEREFORE, BE IT RESOLVED, that David Meikle is relieved and discharged from the duties and responsibilities of the office of Vice President, Apache Tika, and BE IT FURTHER RESOLVED, that Tim Allison be and hereby is appointed to the office of Vice President, Apache Tika, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. Special Order 7F, Change the Apache Tika Project Chair, was approved by Unanimous Vote of the directors present.
## Description: Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: - There are no issues that require the board's attention at this time. ## Activity: - The project is currently finalising the contents of a new 1.18 release with updates to the Mimetype detection, a new XPS parser, improvements to the OCR Parser and updated base libraries. - Work continues on the new 2.x based master with a significant addition being the new composite parser strategy work (TIKA-1509 [0]) added by Nick Burch. ## PMC changes: - Currently 29 PMC members. - No new PMC members added in the last 3 months - Last PMC addition was Madhav Sharan on Thu Aug 31 2017 - A vote has just finished to add a new PMC member ## Committer base changes: - Currently 30 committers. - No new committers added in the last 3 months - Last committer addition was Madhav Sharan at Thu Aug 31 2017 ## Releases: - 1.17 was released on Wed Dec 13 2017 ## Mailing list activity: - dev@tika.apache.org: - 195 subscribers (down -1 in the last 3 months): - 776 emails sent to list (575 in previous quarter) - user@tika.apache.org: - 357 subscribers (up 7 in the last 3 months): - 65 emails sent to list (66 in previous quarter) ## JIRA activity: - 71 JIRA tickets created in the last 3 months - 47 JIRA tickets closed/resolved in the last 3 months [0] https://issues.apache.org/jira/projects/TIKA/issues/TIKA-1509
## Description: Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: - There are no issues that require the board's attention at this time. ## Board Questions: - mt: Why was the "2.0.6 release" thread on private@ ? It looks as if it could/should have been on dev@ - It started off on dev@, but it looks like Tim was trying to ask whether it could be pushed ahead of ApacheCon and 'moved' it to private@ - We will avoid this in the future. - bd: Note that the names of people who have been invited to join the PMC but haven't replied yet shouldn't be included in reports, in case they decline. - Apologies, will note this for the future. ## Activity: - Apache Tika 1.17 was released in December with key updates including: - Automatic image captioning - Phonetic run handling in Excel and Word - Many bug fixes and improvements - We've now changed our repository, with master now focusing on the 2.x series - Discussion has now started on some of the key changes previously discussed in the Tika 2.0 roadmap[1] - Sergey Beryozkin's TikaIO Apache Beam component has now stabilised and will be available in Beam 2.3.0 ## PMC changes: - Currently 29 PMC members. - No new PMC members added in the last 3 months - Last PMC addition was Madhav Sharan on Thu Aug 31 2017 ## Committer base changes: - Currently 30 committers. 
- No new committers added in the last 3 months - Last committer addition was Madhav Sharan at Thu Aug 31 2017 ## Releases: - 1.17 was released on Wed Dec 13 2017 ## Mailing list activity: - dev@tika.apache.org: - 196 subscribers (down -4 in the last 3 months): - 582 emails sent to list (592 in previous quarter) - user@tika.apache.org: - 349 subscribers (down -3 in the last 3 months): - 76 emails sent to list (35 in previous quarter) ## JIRA activity: - 71 JIRA tickets created in the last 3 months - 39 JIRA tickets closed/resolved in the last 3 months [1] https://wiki.apache.org/tika/Tika2_0RoadMap
## Description: Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. ## Issues: There are no issues that require the board's attention at this time. ## Activity: - Progress is steady towards a 1.17 release with bug fixes and new features such as Image-to-Text captioning and improved Phonetic string handling. - Sergey Beryozkin added Apache Tika into Apache Beam[1] as an input component, which has triggered cross-community collaboration. - We've also started catch-up work on the 2.x branch to align with updates made on the master branch (1.x series). - Google has released a Tika package for Go (thanks Tyler Bui-Palsulich), which adds to the community-released bindings [2]. ## PMC changes: - Currently 29 PMC members. - Madhav Sharan was added to the PMC on Thu Aug 31 2017 ## Committer base changes: - Currently 30 committers. - Madhav Sharan was added as a committer on Thu Aug 31 2017 ## Releases: - Last release was 1.16 on Wed Jul 12 2017 ## Mailing list activity: - dev@tika.apache.org: - 202 subscribers (down -2 in the last 3 months): - 595 emails sent to list (1396 in previous quarter) - user@tika.apache.org: - 352 subscribers (up 6 in the last 3 months): - 35 emails sent to list (135 in previous quarter) ## JIRA activity: - 46 JIRA tickets created in the last 3 months - 30 JIRA tickets closed/resolved in the last 3 months [1] https://issues.apache.org/jira/browse/BEAM-2328 [2] https://s.apache.org/CJ2a
What is Tika? ============= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies. Issues ====== There are no issues that need the board's attention. Releases ======== The latest release, Tika 1.16, was made on 17 Jul 2017, and 1.15 on 23 May 2017. These releases contain some great features including Age Recognition, Image Captioning based on this paper[2], a new tika-eval module to compare output between versions, new parsers for WordPerfect and QuattroPro, and much more. Work has now begun on 1.17 and continues on the 2.X branch. Community ========= A community member was voted[1] in as both a committer and PMC member on 27 Jun 2017 and is currently in the process of being invited to join. Prior to that, the last addition was Luis Filipe Nassif on Wed Apr 12 2017. Mailing list activity on dev@ was 204 subscribers (down -3 in the last 3 months), with 1473 emails sent to list (1010 in previous quarter). Mailing list activity on user@ was 347 subscribers (up 1 in the last 3 months), with 135 emails sent to list (36 in previous quarter). [1] https://s.apache.org/Sp4E [2] https://arxiv.org/abs/1411.4555
=== Apache Tika Status Report : April 2017 === What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release, Tika 1.14, was made on Wed Oct 19 2016. Version 1.15 has been the focus of the last period, with a release candidate to be built once the new POI release is out. There are lots of new features such as WordPerfect and QuattroPro parsers, a SAX based parser for Office formats, new language detectors, inline image extraction in PDFs, and a new tika-eval module to allow evaluation of extraction between different systems. Work has also continued on the new 2.X branch. Community ========================= The team have been actively publicising different uses of Apache Tika including the Panama Papers Investigation, which won the Pulitzer prize[1], and the new Google / elastic cloud search offering [2]. Apache Tika is participating in the Google Summer of Code with two mentors, Chris Mattmann and Thamme Gowda. We are looking forward to receiving proposals. Mailing list activity on dev@ was at 235, 405 and 175 messages in Feb, Mar and Apr 2017, respectively. user@ was at 14, 6 and 6 messages, during the same timeframe. [1] https://s.apache.org/XW3A [2] https://s.apache.org/yzvc
=== Apache Tika Status Report : January 2017 === What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release, Tika 1.14, was made on Wed Oct 19 2016. Version 1.15 is now underway with new features such as the WordPerfect and QuattroPro parsers, a SAX based parser for Office formats, and inline image extraction in PDFs. Work has also continued on the new 2.X branch. Community ========================= Luís Filipe Nassif was added as a committer on Tue Oct 18 2016. Tika has featured in an article from Chris Mattmann entitled 'Searching deep and dark: Building a Google for the less visible parts of the web'[1]. Mailing list activity on dev@ was at 351, 359 and 162 messages in Nov, Dec and Jan 2016/17, respectively. user@ was at 41, 2 and 17 messages, during the same timeframe. [1] https://s.apache.org/qzen
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release of Tika was made in May 2016. Version 1.14 has been progressing and will be released soon, including new features such as the Image Recognition parser, OCR within PDFs, and a range of new mime type support. Work has also continued on the new 2.X branch. Community ========================= There have been no new committers added in this period. The last committer joined in June 2016. Tim Allison has blogged via the Open Preservation Foundation on the regression pack work he kicked off to make Apache PDFBox, Apache POI and Apache Tika more robust[1]. Mailing list activity on dev@ was at 242, 476 and 115 messages in Aug, Sep and Oct 2016, respectively. user@ was at 5, 56 and 14 messages, during the same timeframe. [1] https://s.apache.org/z9QL
=== Apache Tika Status Report : July 2016 === What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The 1.13 release of Tika was made in May 2016[1], including various upgrades of dependencies, fixes including a security vulnerability (CVE-2016-4434)[2], and improvements around Named Entity Recognition. The 2.X stream is now being actively worked on. Community ========================= The Tika PMC added Thamme Gowda as a committer and PMC member in June 2016. Chris Mattmann is mentoring Anastasija Mensikova as part of the Google Summer of Code 2016. She is working on integrating OpenNLP's Sentiment Analysis in Tika. Mailing list activity on dev@ was at 344, 328 and 77 messages in May, Jun and Jul 2016, respectively. user@ was at 41, 6 and 17 messages, during the same timeframe. [1] https://s.apache.org/AFE1 [2] https://s.apache.org/Bbth
=== Apache Tika Status Report : April 2016 === What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The 1.12 release of Tika was made in February 2016[1], including various fixes and a new NamedEntityParser using Apache OpenNLP. A 1.13 release is imminent, with work well underway on the 2.X stream. Community ========================= Along with Apache Solr, Tika was part of the stack that powered[1] the analysis of the Panama Papers leak. A Twitter account, @ApacheTika, has been set up for the project to aid announcements and engage with the community. The Tika PMC added Bob Paulin as a committer and PMC member in September 2015. Mailing list activity on dev@ was at 463, 423 and 259 messages in Feb, Mar and Apr 2016, respectively. user@ was at 85, 10 and 21 messages, during the same timeframe. [1] https://s.apache.org/6A7M [2] https://s.apache.org/QMJ3
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The 1.11 release of Tika was made in October 2015[1], improving MIME type support and adding a parser for GROBID (GeneRation Of BIbliographic Data). Work has also started on the 2.X stream. Community ========================= The Tika PMC added Bob Paulin as a committer and PMC member in September 2015. Mailing list activity on dev@ was at 170, 185 and 136 messages in Nov, Dec and Jan 2015/16, respectively. user@ was at 9, 3 and 6 messages, during the same timeframe. [1] http://s.apache.org/TDD
Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The 1.10 release of Tika was made in August, upgrading support to Java 7 and adding new features to make configuration easier. Discussions are underway for the 1.11 release as well as progression towards a 2.X stream. Community ========================= The Tika PMC added Bob Paulin as a committer and PMC member in September. There were two talks on Tika by Nick Burch and Michael Starch[1][2] at ApacheCon Big Data in Budapest, as well as a gathering of interested parties. Apache Tika has now been wrapped as a Perl Module[4], extending the list of community client libraries available. Mailing list activity on dev@ was at 397, 388 and 150 messages in Aug, Sep and Oct 2015, respectively. user@ was at 45, 25 and 23 messages, during the same timeframe. [1] http://sched.co/3zt7 [2] http://sched.co/40Zd [3] http://s.apache.org/Mmo [4] https://metacpan.org/release/RIBUGENT/Apache-Tika-0.04
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The 1.9 release of Tika was made last month (June 2015) with new features such as cTakes[1] integration and probabilistic MIME detection. Work has now started on the 1.10 development stream. Community ========================= The Tika PMC added Luis Filipe Nassif in March 2015 and Giuseppe Totaro in April 2015 as committers and PMC Members. The authors (Chris Mattmann and Jukka Zitting) of the Tika in Action book have donated the examples from the book to Tika. These have been included in a new tika-examples sub-module. There have been articles published on NASA and the Jet Propulsion Lab’s involvement in the Memex project, which features Apache projects including Tika[2]. Mailing list activity on dev@ was at 317, 307 and 63 messages in May, June and July 2015, respectively. user@ was at 27, 39 and 9 messages, during the same timeframe. [1] https://wiki.apache.org/tika/cTAKESParser [2] http://s.apache.org/Mmo
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release (1.7) was made in January 2015. There has been much progress since then, with a release candidate for 1.8 currently being voted on. Community ========================= The Tika PMC added Luis Filipe Nassif in March 2015 and Giuseppe Totaro in April 2015 as committers and PMC Members. There are six talks related to Tika scheduled to take place at ApacheCon NA 2015. Chris Mattmann has registered to be a mentor for Google Summer of Code 2015, with some Tika issues marked as potential projects. Mailing list activity on dev@ was at 445, 891 and 107 messages in February, March and April 2014, respectively. user@ was at 15, 15 and 1 messages, during the same timeframe.
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release (1.7) was made in January 2015[1], with many new features including an OCR Parser based on Tesseract, improvements to the Tika JAXRS Server and a number of parser fixes & enhancements. Discussions have now started on the dev@ list to outline a roadmap[2] for what a new 2.X stream could look like for the evolution of Tika. Community ========================= The Tika PMC added Konstantin Gribov as a committer and PMC Member in January 2015. Mailing list activity on dev@ was at 389, 253 and 289 messages in November, December and January 2014, respectively. user@ was at 10, 28 and 29 messages, during the same timeframe. [1] http://s.apache.org/u0p [2] http://s.apache.org/DSm
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release (1.6) was made in September 2014. Work has started on version 1.7 with bug fixes already complete and new features such as OCR Parsing being worked on. Community ========================= The Tika PMC added Ann Bryant Burgess as a committer and PMC Member in August 2014. New community-developed bindings[1] have been created for Tika, including a binding for NodeJS and an OpenShift Cartridge for Apache Tika Server. Discussions have started on the mailing list about a potential Tika meetup at ApacheCon EU 2014. Nick Burch is also presenting a talk titled 'What's With The 1s and 0s?' on using Tika and other related tools to analyse binary content[2]. Mailing list activity on dev@ was at 393, 364 and 119 messages in August, September and October 2014, respectively. user@ was at 23, 38 and 12 messages, during the same timeframe. [1] http://s.apache.org/Y7w [2] http://sched.co/1pbkX7n
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release (1.5) was made in February 2014. Work has continued on version 1.6 with many bug fixes and new features, including many new file formats. A discussion thread has started for a 1.6 release candidate. Community ========================= The Tika PMC added Lewis John McGibbney as a committer and PMC Member in June 2014. A Tika Hackathon session took place at the ApacheCon NA 2014 conference, kicking off improvements to our JAX-RS module. There were also presentations by Annie Bryant, Nick Burch and Jukka Zitting as part of the main conference. Mailing list activity on dev@ was at 298, 671 and 17 messages in May, June and July 2014, respectively. user@ was at 15, 31 and 18 messages, during the same timeframe.
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release (1.5) was made in February 2014. Since then progress has been steady on version 1.6 with a number of bug fixes and improvements. Community ========================= No new committers or PMC members were added since the last report. Prior to this the last new Committer and PMC Member was added in January 2014. Tika is well represented at ApacheCon NA with four talks from three different speakers (Jukka Zitting, Nick Burch and Annie Burgess). There are also plans to conduct a couple of Hackathon sessions during the conference. Mailing list activity on dev@ was at 293, 226 and 29 messages in February, March and April 2014, respectively. user@ was at 26, 35 and 1 messages, during the same timeframe.
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release (1.4) was made in July 2013. Work has progressed on version 1.5 with a number of bug fixes and improvements. A discussion thread is underway on creating a version 1.5 release candidate. Community ========================= The Tika PMC has voted to add Hong-Thai Nguyen as a committer and PMC Member, with ACK to the board earlier this week. Prior to this, the last committer and PMC Member was added in July 2013. Discussion has progressed around integrating Any23 components of value into Tika. This is not in full swing yet; however, there is broad agreement on the approach, with some initial patches being proposed and integrated. Mailing list activity on dev@ was at 84, 155 and 28 messages in November and December 2013 and January 2014, respectively. user@ was at 14, 10 and 0 messages, during the same timeframe.
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= The last release (1.4) was made in July 2013 and work is currently underway on version 1.5 with 23 issues currently resolved, comprising a mixture of bug fixes and new features. Community ========================= The Tika PMC added Tim Allison as a committer and PMC Member in July 2013. Chris Mattmann has won a National Science Foundation proposal for a project at the University of Southern California to deliver an open source framework for metadata exploration, automatic text mining and information retrieval of polar data using Apache Tika[1]. Mailing list activity on dev@ was at 103, 86 and 29 messages in August, September and October 2013, respectively. user@ was at 4, 18 and 0 messages, during the same timeframe. [1] http://s.apache.org/QqY
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= Version 1.4 was released on the 2nd of July[1]. This release included several important bugfixes and new features, including improvements to the REST server and parser components. Work is now underway on version 1.5. Community ========================= No new committers or PMC members were added since the last report, with both the last committer and PMC member added in August 2012. Mailing list activity on dev@ was at 126, 138 and 54 messages in May, June and July 2013, respectively. user@ was at 7, 15 and 8 messages, during the same timeframe. [1] http://s.apache.org/7hB
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= Version 1.3 was released on the 22nd of January[1]. This release included several important bugfixes and new features, including better handling of embedded files. Work is now underway on version 1.4 with 15 issues resolved and 20 open to date. Community ========================= No new committers or PMC members were added since the last report. We have added a potential new feature as part of the ASF's potential projects within the Google Summer of Code program[2]. Mailing list activity on dev@ was at 174, 70 and 11 messages in February, March and April 2012, respectively. user@ was at 53, 32 and 10 messages, during the same timeframe. [1] http://s.apache.org/PDH [2] https://issues.apache.org/jira/browse/TIKA-605
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Issues ========================= There are no issues that need the board's attention. Releases ========================= Work continues on version 1.3 with 47 resolved and 18 open/in progress JIRA tickets adding new features and providing bug fixes, with a discussion thread underway to assess the need for a new release. Community ========================= No new committers or PMC members were added since the last report. Jukka Zitting presented a session at ApacheCon EU titled "Content Extraction With Apache Tika" [1]. Mailing list activity on dev@ was at 124, 109 and 21 messages in November, December 2012 and January 2013, respectively. user@ was at 19, 18 and 6 messages, during the same timeframe. [1] http://s.apache.org/CUc
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognised by IANA and other standards bodies. Releases ========================= We released version 1.2 on the 12th of July 2012[1]. This contained new features such as the JAX-RS based network server and XMP metadata handling, along with new file formats and parser improvements. Work is currently underway on version 1.3 with 22 resolved and 19 open/in progress JIRA tickets, adding support for open graph metadata, correct rounding of geodata information and improved mime type detection for JPEG 2000 formats. Community ========================= The Tika PMC added Sergey Beryozkin (July 2012), Ingo Renner (July 2012) and Jörg Ehrlich (August 2012) as PMC members and committers. As sponsor, the Tika PMC voted to recommend the graduation of the Any23 incubator project to a TLP. This passed[3] and following the Incubator PMC vote, the board approved the graduation resolution. The Tika PMC voted to recommend Dave Meikle as the new chair[4]. This was accepted by the board in August 2012 and Chris has now handed duties over to Dave. Jukka Zitting is scheduled to speak about Tika at ApacheCon Europe. The session is titled 'Content extraction with Apache Tika'[5] and shows how Tika can be used with a Lucene or Solr search index. Mailing list activity on dev@ was at 173, 61 and 7 messages in August, September and October 2012, respectively. user@ was at 34, 31 and 1 messages, during the same timeframe. [1] http://s.apache.org/Vzr [2] http://s.apache.org/HoO [3] http://s.apache.org/gHE [4] http://s.apache.org/kBQ [5] http://s.apache.org/lnR
WHEREAS, the Board of Directors heretofore appointed Chris Mattmann to the office of Vice President, Apache Tika, and WHEREAS, the Board of Directors is in receipt of the resignation of Chris Mattmann from the office of Vice President, Apache Tika, and WHEREAS, the Project Management Committee of the Apache Tika project has chosen to recommend David Meikle as the successor to the post; NOW, THEREFORE, BE IT RESOLVED, that Chris Mattmann is relieved and discharged from the duties and responsibilities of the office of Vice President, Apache Tika, and BE IT FURTHER RESOLVED, that David Meikle be and hereby is appointed to the office of Vice President, Apache Tika, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. Special Order 7A, Change the Apache Tika Project Chair, was approved by Unanimous Vote of the directors present.
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies. Releases ========================= Progress towards the 1.2 release continues. There have been a few recent threads discussing making an RC ([1] and [2]). We anticipate the 1.2 RC and official release arriving in the next month or so. The 1.2 RC addresses 63 issues [3] including new features (e.g., Tika JAX-RS network server [4]), bug fixes (e.g., misuse of HTTP content-encoding header [5]) and an enhanced approach to dealing with metadata key naming and representation [6] including XMP support. Community ========================= The Tika PMC added Ray Gauss as a Tika PMC member and committer in May 2012. The Tika PMC is still sponsoring the Any23 incubator project [7], which is progressing along nicely and getting ready to make their first Incubator release. Chris started a thread [8] on private to discuss potentially rotating the chair. So far there hasn't been strong positive or negative reception to this suggestion. Mailing list activity on dev@ was at 174, 102 and 134 messages in May, June and July 2012, respectively. user@ was at 26, 28 and 45 messages, during the same timeframe. [1] http://s.apache.org/MFq [2] http://s.apache.org/AMZ [3] http://s.apache.org/w9 [4] https://issues.apache.org/jira/browse/TIKA-593 [5] https://issues.apache.org/jira/browse/TIKA-431 [6] http://wiki.apache.org/tika/MetadataRoadmap [7] http://incubator.apache.org/any23/ [8] http://s.apache.org/P7c
(Tika)
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies. Releases ========================= We released Tika 1.1 on 3/23/12. The current work on Tika 1.2 includes 14 of 33 issues already fixed. These issues include a cool new Tika JAX-RS network server [2] that really helped foster good will between the Apache Tika and CXF communities. Sergey Beryozkin from CXF, and Maxim Valyanskiy from Tika really led the way. Besides the network server, MIME type support for the scientific data file format FITS, used heavily in the astronomy community, was added [3]; the ability to extract embedded images from PowerPoint files [4] and improvements to the way that Tika loads Detectors and Parsers in an OSGi environment [5] were also added in the current trunk development branch. There has been discussion of adding GDAL support to Tika [6], which would add hundreds of spatial formats, and the ability to parse and detect them, to Tika. Community ========================= No new PMC members/committers were added in the last quarter. The Tika PMC is still sponsoring the Any23 incubator project [7], which is progressing along nicely and getting ready to make their first Incubator release. Mailing list activity on dev@ remained steady in January, February, and March 2012 (189, 125 and 200 messages, respectively) but slowed in April 2012 (48 messages), while the user activity remained consistent in January, February and March 2012 (51, 46 and 37 messages). No user questions in April 2012 yet.
[1] http://s.apache.org/dWz [2] https://issues.apache.org/jira/browse/TIKA-593 [3] https://issues.apache.org/jira/browse/TIKA-874 [4] https://issues.apache.org/jira/browse/TIKA-883 [5] https://issues.apache.org/jira/browse/TIKA-884 [6] https://issues.apache.org/jira/browse/TIKA-605 [7] http://s.apache.org/gJ
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies. Releases ========================= We released Tika 1.0 on 11/7/11 [1] to coincide with ApacheCon NA 2011. Since then we've been working on the 1.1 release, with 42 issues already resolved in JIRA [2] and 1.1 likely to be shipped in the next quarter. Community ========================= We added Jerome Charron and Antoni Mylka to the Tika PMC in November 2011. At ApacheCon NA, and since, there have been some discussions regarding Tika and the ODF Toolkit [3], as well as the Tika PMC's sponsorship of the Any23 incubator project [4], which is progressing along. Mailing list activity on dev@ remained steady in November and December 2011 (259 and 287 messages, respectively) but slowed in January 2012 (21 messages), while the user activity remained consistent in November and December 2011 (47 and 65 messages), but has been quiet in January 2012 (12 messages). Chris Mattmann gave a talk at ApacheCon NA 2011 on "Apache Tika: One Point Oh!" [5] commemorating the upcoming 1.0 Tika release. Chris also had some discussions about Any23 and Tika with Lewis John McGibbney at ApacheCon NA 2012. Lewis is a Nutch PMC member, the candidate for the Gora VP with the current board resolution and an Any23 Incubator committer. Press ========================= Chris, Jukka and Sally have published a press release [6] about Tika's use at NASA, as well as its use at Day Software and other companies. The press release went out during ApacheCon NA 2011 and was perfectly timed with the event.
Tika In Action ========================== Chris Mattmann and Jukka Zitting completed the Manning book called "Tika in Action" [7] in time for ApacheCon NA 2011. The book is now available in print, ebook and mobi editions. Yay! [1] http://s.apache.org/AE6 [2] http://s.apache.org/HLX [3] https://issues.apache.org/jira/browse/TIKA-737 [4] http://s.apache.org/gJ [5] http://na11.apachecon.com/talks/19391 [6] http://s.apache.org/N0I [7] http://manning.com/mattmann/
What is Tika? ========================= Apache Tika is a dynamic toolkit for content detection, analysis, and extraction. It allows a user to understand, and leverage information from, a growing list of over 1200 different file types including most of the major types in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as recognized by IANA and other standards bodies. Releases ========================= We rolled the Tika 0.10 release on 9/30/2011 [1]. We opted for a 0.10 instead of 1.0 to try and time the 1.0 release with ApacheCon. We've got a few weeks left and are going to try and make it! Community ========================= We added Michael McCandless to the Tika PMC on August 29th, 2011. The Tika PMC agreed to sponsor the Any23 (Anything to Triples) Incubator project [3]. Any23 is a semantic understanding toolkit, whose goal is to extract information from, to detect, and to reason over most of the current semantic document formats including RDF, OWL, etc. Any23 leveraged Tika in its existing framework at Googlecode, and we see the projects hopefully having a lot of synergy going forward. Any23 was accepted into the Incubator on October 1, 2011 [4]. Mailing list activity on dev@ is growing (197, 356, and 186 in August, September and October 2011, respectively), while the user activity grew a little bit in August and September (70 and 66 messages), but has been relatively quiet in October (3 messages). Chris Mattmann will give a talk at ApacheCon NA 2011 on "Apache Tika: One Point Oh!" [5] commemorating the upcoming 1.0 Tika release. Press ========================= Chris, Jukka and Sally have drafted a press release about Tika's use at NASA, as well as its use at Day Software and other companies. The goal is to have this coincide with the 1.0 release and ApacheCon NA 2011.
Tika In Action ========================== Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in Action" [6] and the book is in its final copyediting stages and it should be in print in time for ApacheCon NA 2011. [1] http://tika.apache.org/0.10/index.html [2] http://s.apache.org/HzK [3] http://s.apache.org/HWO [4] http://s.apache.org/gJ [5] http://na11.apachecon.com/talks/19391 [6] http://manning.com/mattmann/
Releases ========================= Progress towards the 1.0 release continues and we hope to roll a 1.0 in the Q3 timeframe. Development activity is progressing, and there are a number of issues being worked on the user and dev lists. Notably, the Tika command line interface now outputs JSON [1], new document formats are being worked and/or improved: ole2 and ooxml [2], the pcap format [3], the CHM format [4], the PRT format [5] and some new Font parsers [6]. Based on dev discussion, Tika's MIME identifier is becoming more prominently used in the Aperture project [7]. There was also some discussion regarding Tika's relationship with Apache OOo [8]. Community ========================= No new committers or PMC members elected in this quarter. Mailing list activity on dev@ remains steady (near the ~150 message range), while user@ is coming in at around ~50 or so messages per month. Chris Mattmann will give a talk at ApacheCon NA 2011 on "Apache Tika: One Point Oh!" commemorating the upcoming 1.0 Tika release. Chris Mattmann's CS572 Search Engines class students at USC are doing final projects related to Search technologies, and one of them is from Fernando Arreola [6] who is contributing new font parsers as mentioned above. Press ========================= Chris and Jukka and Sally are coordinating with Priscilla Vega from JPL to make a press release about Tika's use at NASA, as well as its use at Day Software and other companies. The goal is to have this coincide with the 1.0 release and OSCON. Tika In Action ========================== Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in Action", and it is now complete and handed off to production. All Chapters of the book and the appendices are now available on the MEAP page [9]. The book is set to be published in the Q2/Q3 timeframe of 2011. 
[1] https://issues.apache.org/jira/browse/TIKA-213 [2] https://issues.apache.org/jira/browse/TIKA-652 [3] https://issues.apache.org/jira/browse/TIKA-658 [4] https://issues.apache.org/jira/browse/TIKA-245 [5] https://issues.apache.org/jira/browse/TIKA-679 [6] http://s.apache.org/17Z [7] http://s.apache.org/Z4r [8] http://s.apache.org/3aC [9] http://www.manning.com/mattmann/
Releases ========================= We made our 0.9 release [1] in February 2011. This release included some critical bug fixes including a fix that re-enabled extraction of metadata from Scientific Data File Formats (HDF+NetCDF) from the command line. In addition, the release included a patch [2] that significantly reduced the number of pulled in dependencies having to do with NetCDF. Support for parsing via external forking was also added [3]. See [4] for a full list of changes. Progress towards the 1.0 release continues and we hope to roll a 1.0 in the Q2/Q3 timeframe. Community ========================= We added Oleg Tikhonov to the Tika PMC in April 2011 [5]. Mailing list activity on user@ and dev@ remains steady (near the ~100 message range). Tika In Action ========================== Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in Action", and it is progressing steadily. We completed a full draft of the entire book, and Chapters 9 and 10 are now available on the MEAP page [6]. The book is set to be published in the Q2/Q3 timeframe of 2011. [1] http://s.apache.org/1lE [2] http://issues.apache.org/jira/browse/TIKA-596 [3] http://issues.apache.org/jira/browse/TIKA-556 [4] http://www.apache.org/dist/tika/CHANGES-0.9.txt [5] http://s.apache.org/Jur [6] http://www.manning.com/mattmann/
Releases ========================= We made our 0.8 release [1] in November 2010. It's been a long time coming, and there were over 98 JIRA issues [2] addressed in the release. Work progresses towards a patch release (either 0.8.1 or 0.9) and Chris Mattmann plans to RM it and roll a release candidate hopefully in the next month. This release should fix a number of smaller issues found after folks have upgraded to 0.8. Community ========================= We added Maxim Valyanskiy to the Tika PMC in November 2010 [3]. Chris Mattmann gave a talk on Tika titled Scientific Data Curation and Processing with Apache Tika [4] at ApacheCon NA in November 2010 during the Lucene and friends session. The morning talk was well attended and it was great to finally meet everyone in person! Tika In Action ========================== Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in Action", and it is progressing steadily. We completed our 2/3 book review and Chapters 1-8 of the book are now available [5] through Manning's Early Access Program or MEAP. [1] http://s.apache.org/W1Dh [2] http://s.apache.org/73R [3] http://s.apache.org/dj0 [4] http://s.apache.org/2ak [5] http://www.manning.com/mattmann/
Releases ========================= Progress towards an 0.8 release continues. There has been some recent activity by Jukka Zitting [1] towards allowing for easily using Apache Tika in environments requiring Compressed RTF / TNEF / LZFU parsing. Along these lines a broader discussion [2] of how to use Apache Tika with parsing libraries that are GPL-licensed ensued and it was determined that plugins could be developed and hosted externally (e.g., at Github) and used with an official Apache Tika release from the ASF just by dropping the plugin jars onto a deployed version of Apache Tika's classpath. This provides a great solution for folks who want to use Apache Tika in environments where they need to parse formats which require non-ASLv2 friendly libraries. We've had some nice documentation updates occur since the last report as well. Arturo Beltran contributed a "get Tika parsing up and running in 5 minutes" quick start guide in TIKA-464 [3]. Nick Burch also added documentation on the Container-aware Tika detection capabilities in TIKA-477 [4]. Paul Jakubik started a great wiki discussion on container-based Metadata formats in [5], the first major use of the Tika wiki [6]. Chris Mattmann has since done a bit of reorganizing of information, adding the Tika logo to the wiki as well. Other various notable development items include an RSS Feed Parser (TIKA-466 [7]) from Julien Nioche and Chris Mattmann, Container-aware Mime Detection (TIKA-477 [4]) from Nick Burch, Jukka Zitting et al., and various Charset improvements (e.g., TIKA-471 [8] and TIKA-529 [9]) from Ken Krugler. Tika's website is now hooked up to SVNpubsub thanks to Jukka Zitting [10]. One of the blockers to the 0.8 release, getting NetCDF pushed to Maven Central, has seen some good progress as of late. Chris Mattmann has created a space for the NetCDF jars in the Sonatype OSS free hosting area that is synced to Maven Central [11]. 
Community ========================= There have been no new Tika PMC members or committers elected in this quarter. A podcast for Tika recorded in April 2010 between Chris Mattmann and Rich Bowen was posted on the Feathercast.org website and mentioned on the Tika mailing lists [12]. The Tika community was consulted for two website updates. The first site update was to post a link and image of the Tika in Action book [13]. The second site update was to allow for the selection of search engine provider used to search the Tika website [14]. Tika In Action ========================== Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in Action", and it is progressing steadily. We are currently doing our 2/3 book review and Chapters 1-5 of the book are now available [15] through Manning's Early Access Program or MEAP. [1] http://github.com/jukka/jtnef [2] http://s.apache.org/JGz [3] https://issues.apache.org/jira/browse/TIKA-464 [4] https://issues.apache.org/jira/browse/TIKA-477 [5] http://wiki.apache.org/tika/MetadataDiscussion [6] http://wiki.apache.org/tika/ [7] https://issues.apache.org/jira/browse/TIKA-466 [8] https://issues.apache.org/jira/browse/TIKA-471 [9] https://issues.apache.org/jira/browse/TIKA-529 [10] https://issues.apache.org/jira/browse/TIKA-473 [11] https://issues.apache.org/jira/browse/TIKA-407 [12] http://s.apache.org/Zmd [13] http://s.apache.org/XUK [14] http://s.apache.org/RgA [15] http://manning.com/mattmann/
Shane appreciates the detail in this report.
Releases ========================= There's been a bunch of development activity and mailing list activity in Tika on a broad range of issues: related to charset detection improvement [1] from Ken Krugler, to BoilerPlate extraction [2] (also from Ken), to improving HTML parsing overall [3] [4] [5] with contributions from Julien Nioche, Ken Krugler, Jukka Zitting, and our new committer Nick Burch (see Community section). There has also been a flurry of activity on new Parsers, improving parsing detection with Timeouts, Geospatial information representation, improving PDF parsing and a host of other issues that indicate the development community of Tika is thriving. We've slipped a bit on our original intention to release 0.8 in this month's timeframe. One thing that will need to get solved is the upload of some current external dependencies (e.g., NetCDF, Boilerpipe) to Maven Central. Chris Mattmann and Ken Krugler volunteered to take the lead on this. 0.8 isn't far off, but it's dependent on getting those jars up to Maven central. Community ========================= The Tika PMC elected to add Nick Burch [6] to the Tika PMC and committers group. Nick is an ASF member, and the VP of Apache POI, an important external library dependency of many of the Tika parsers. Welcome, Nick! Chris Mattmann is teaching CS572: Information Retrieval and Web Search Engines at USC this summer [7], and several of the final projects in his class are related to Tika. There is a plan to contribute the final code produced from some of the students in the form of JIRA issues and patches. Tika In Action ========================== Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in Action", and it is progressing steadily. We have completed our 1/3 book review and the book is now available [8] through Manning's Early Access Program or MEAP. 
IBM Developerworks Article on Tika ================================== Chris Mattmann and Oleg Tikhonov published a short IBM Developerworks intro article on Tika [9]. [1] http://issues.apache.org/jira/browse/TIKA-459 [2] http://issues.apache.org/jira/browse/TIKA-462 [3] http://issues.apache.org/jira/browse/TIKA-394 [4] http://issues.apache.org/jira/browse/TIKA-460 [5] http://issues.apache.org/jira/browse/TIKA-463 [6] MID: AANLkTinayUCu9pgv8p-ds3SFFMxjpCMLGkxhMTQiTFm7@mail.gmail.com [7] http://sunset.usc.edu/classes/cs572_2010/ [8] http://manning.com/mattmann/ [9] http://s.apache.org/FXa
This is the second report from Apache Tika in its new TLP status approved at the April 2010 board meeting. TLP migration ========================= All complete! Website has been updated, SVN taken care of, mailing lists migrated and UNIX groups and domain taken care of. Thanks to Gavin for handling this. Releases ========================= Since our last report, we've committed some important bug fixes including TIKA-379 [1] which fixed and allowed HTML elements and attributes to be available in the parsed XHTML provided by Tika, and some mime type detection fixes in particular TIKA-417 [2]. We still think we are on target to release 0.8 within the next month or so. License Issue ========================= During the last board meeting, there was a question brought up regarding the licensing of UCAR/NCAR's NetCDF java library, which was used to implement TIKA-400 netCDF Tika Parser [3]. The question concerned an "advertising" clause in the UCAR/NCAR license, and the board asked Chris to follow up on it. Chris took the issue over to legal-discuss@ and a resolution to the matter was reached [4] which included adding some text to NOTICE.txt and LICENSE.txt in Tika, fulfilling the advertising clause via the Apache NOTICE mechanism. This issue was tracked and fixed in Tika in TIKA-432 [5], and is now solved in the current 0.8 trunk, removing a roadblock to the 0.8 release. Community ========================= The Tika PMC elected to add Julien Nioche [6] to the Tika PMC and committers group. Julien is a Nutch committer, and has been providing quality patches to Tika (incl. the fix for TIKA-379) and good mailing list support. Jukka Zitting presented on Tika at the Lucene Eurocon [7] and Berlin Buzzwords [8] conferences. Tika In Action ========================== Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in Action", and it is progressing steadily.
We are preparing for a 1/3 book review likely in the next week or so, and the book is about to be available in Manning's Early Access Program or MEAP, electronically, any day now. Cheers, Chris [1] http://issues.apache.org/jira/browse/TIKA-379 [2] http://issues.apache.org/jira/browse/TIKA-417 [3] http://issues.apache.org/jira/browse/TIKA-400 [4] http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201005.mbox/%3CAANLkTinXO4AZkYDm0L83SipPXl48kYC8enFRhamJCBvJ@mail.gmail.com%3E [5] http://issues.apache.org/jira/browse/TIKA-432 [6] http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3CC83020F3.14F6B%25Chris.A.Mattmann@jpl.nasa.gov%3E [7] http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika [8] http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
This is the first report from Apache Tika in its new TLP status approved at the April 2010 board meeting. TLP migration ========================= Progress is being made in transitioning Tika off of the Lucene site and infrastructure and into its new TLP home. The current status is: (Gavin from the INFRA team grouped the below into a TLP migration task [1]) 1. mailing lists migration, filed INFRA-2645 [2], Gavin working on it as of last Saturday 2. SVN, filed INFRA-2646 [3], status is same as for #1 3. UNIX groups, and home on www.apache.org/dist [4], status same as #1 4. Domain name for tika.apache.org [5], status same as #1 Releases ========================= Steady progress on the 0.8 Tika release is being made, with contributions made to add more file formats (netCDF is a recent addition), to allow Tika to be used in server environments with their own classloaders (like Apache SOLR), and to improve image metadata extraction. We hope to release 0.8 within the next month or so. Community ========================= The Tika community reached out to the NetCDF community in order to get their NetCDF jars released to Maven central (currently TIKA-400 [6] relies on NetCDF jars from an external Maven2 repository that's not synced with Central). Jukka Zitting pointed out that this isn't a best practice, so we are working with the NetCDF community to resolve this, and they have been receptive [7], in particular John Caron from UCAR/NCAR is working to help us out. [1] https://issues.apache.org/jira/browse/INFRA-2692 [2] https://issues.apache.org/jira/browse/INFRA-2645 [3] https://issues.apache.org/jira/browse/INFRA-2646 [4] https://issues.apache.org/jira/browse/INFRA-2647 [5] https://issues.apache.org/jira/browse/INFRA-2676 [6] https://issues.apache.org/jira/browse/TIKA-400 [7] http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/201004.mbox/%3C4BD758B7.60202@unidata.ucar.edu%3E
The thredds/cdm dependency appears to have an "advertising" clause. To be followed up on legal-discuss.
WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to content detection and analysis for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the "Apache Tika Project", be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Tika Project be and hereby is responsible for the creation and maintenance of software related to a content analysis and detection toolkit; and be it further RESOLVED, that the office of "Vice President, Apache Tika" be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Tika Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Tika Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Tika Project: * Chris A. Mattmann (mattmann@apache.org) * Jukka Zitting (jukka@apache.org) * Ken Krugler (kkrugler@apache.org) * Keith Bennett (kbennett@apache.org) * Mark Harwood (mharwood@apache.org) * Dave Meikle (dmeikle@apache.org) * Sami Siren (siren@apache.org) * Rida Benjelloun (ridabenjelloun@apache.org) NOW, THEREFORE, BE IT FURTHER RESOLVED, that Chris A. 
Mattmann be appointed to the office of Vice President, Apache Tika, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed; and be it further RESOLVED, that the Apache Tika Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Tika sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Tika sub-project encumbered upon the Apache Lucene Project are hereafter discharged. Special Order 7C, Establish the Apache Tika Project, was approved by Unanimous Vote of the directors present.
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Tika entered incubation on March 22nd, 2007. Community * Dave Meikle was just voted in as a new committer. * Paolo Mottadelli will present Tika at ApacheCon US. Development * Tika 0.2 should be released soon. * Usage documentation has been added to the website. Issues before graduation: * The current plan is to graduate as a Lucene subproject, which could happen soon as the incubation criteria seem to be met.
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Tika entered incubation on March 22nd, 2007. Community * Tika community remains relatively small, with just a handful of active members Development * Work towards Tika 0.2 continues; Chris Mattmann has volunteered to be the release manager Issues before graduation: * Increase the size and diversity of the community (or graduate into a Lucene subproject?)
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Tika entered incubation on March 22nd, 2007. Community * Niall Pemberton joined the project as a committer and PPMC member * The number of issues reported by external contributors is growing gradually * There was a Fast Feather Talk on Tika in ApacheCon EU 2008 * We have good contacts especially with Apache POI and PDFBox Development * We are working towards Tika 0.2 * Metadata handling improvements are being discussed Issues before graduation: * Increase the size of the community
Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Tika entered incubation on March 22nd, 2007. Community There have been a number of positive items within Tika during the last few months. The traffic on the Tika mailing list has increased significantly (typically 2 or 3 questions and 1 or 2 commits every day or every other day), and there have been a lot of recent inquiries from external projects wanting to collaborate with Tika (including Aperture, PDFBox and a fellow developing a JSON library currently hosted at Google Code). In addition, Tika's architecture has recently become a topic of discussion (as we'll see below). We recently elected Keith Bennett as a new committer to Tika. Keith has been spearheading many of the new patches committed to Tika, as well as participating in discussions about the architecture and future direction of the project. Tika will be represented at the "Fast Feather" track at ApacheCon US by Jukka Zitting. The rest of the community is helping to create the content for the presentation. The abstract is listed below: Tika is a new content analysis framework born from the desire to factor out commonality from the Apache Nutch search engine framework. Tika provides a mime detection framework, an extensible parsing framework and metadata environment for content analysis. Though in its nascent stages, progress on Tika has recently taken shape and the project is nearing a stable 0.1 release. In this talk, we'll describe the core APIs of Tika and discuss its use in several distinct domains including search engines, scientific data dissemination and an industrial setting. Development There has been a flurry of JIRA issues and code activity [1], including 47 issues currently in JIRA, with 32 resolved issues, 14 closed issues, and 2 open major/minor issues in progress.
Tika's Parser interface (one of its key components) has just undergone a major overhaul led by Jukka Zitting, and Chris Mattmann has recently contributed a MimeType system (with help from fellow Apache Nutch committer Jerome Charron) to Tika. We also cleaned up and refactored large parts of the rest of the code (removing references to LiusLite and branding the project wherever possible with the Tika name), in preparation for an upcoming 0.1 release. Chris Mattmann has led an effort to carve out the existing MimeType detection system in Apache Nutch [2] and replace it with Tika's improved MimeType detection system. There is a patch sitting in JIRA right now [3], and barring objections, Nutch will rely on Tika for its MimeType detection abilities. Also active recently were committers Bertrand Delacretaz, Sami Siren and Rida Benjelloun, committing patches and improvements wherever needed. Issues before graduation No changes since our last report: the Tika project is still at an early stage of incubation. We need to continue bringing in the initial codebases and are targeting an initial incubating release (0.1) probably within the next month. We also need to work on growing the community and figuring out how to best interact with external parser projects. 1. http://issues.apache.org/jira/browse/TIKA 2. http://lucene.apache.org/nutch/ 3. http://issues.apache.org/jira/browse/NUTCH-562
Tika is a toolkit for detecting and extracting metadata and structured text content from various document formats using existing parser libraries. Tika entered incubation on March 22nd, 2007. Community: * The Tika mailing list has seen increased activity in recent weeks, with some new people showing interest in Tika's goals. * Grant Ingersoll brought the Aperture framework to our attention (http://aperture.sourceforge.net/), which has similar goals to Tika. We will look at possible synergies. Development: * No code has been committed since our last report, but some initial code is ready in JIRA and should be committed soon. Issues before graduation: * No changes since our last report: the Tika project is still at an early stage of incubation. We need to continue bringing in the initial codebases and probably target an initial incubating release later this year. We also need to work on growing the community and figuring out how to best interact with external parser projects.
Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Tika entered incubation on March 22nd, 2007. Community The Tika mailing lists have been relatively quiet lately, probably because with little code we don't yet have many concrete issues to talk about. Development We saw the first piece of Tika code when Chris A. Mattmann ported the Nutch metadata framework to Tika. Rida Benjelloun has created a version of the Lius codebase to be included in Tika, and the code is currently in the issue tracker. Issues before graduation The Tika project is still at an early stage of incubation. We need to continue bringing in the initial codebases and probably target an initial incubating release later this year. We also need to work on growing the community and figuring out how to best interact with external parser projects.
Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Incubating since: March 22nd, 2007. __Community__ We had a good project bootstrap meeting as a part of the text analysis BOF at the ApacheCon EU in Amsterdam. The resulting ideas were summarized on the project mailing list, and the first design threads have started. __Development__ We've started discussing the design of the Tika toolkit. It seems like we will select one of the existing codebases listed in the project proposal as the basis of an early 0.1 release, and start refactoring the code into a more generic toolkit. The Tika svn tree is still empty, but I expect us to see the first code commits before the next report. __Infrastructure__ All the initial infrastructure is now in place. There is still some activity on the temporary Tika wiki on the Google Project hosting service, so we may end up requesting a Tika wiki to be set up on the ASF infrastructure. __Issues before graduation__ The Tika project is still at an early stage of incubation. The most important tasks before graduation are to develop and release the Tika codebase and to grow a diverse and sustainable project community.
iPMC Reviewers: rdonkin Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Tika entered incubation on March 22nd, 2007. The Tika project has just started. The basic infrastructure (mailing lists, subversion, issue tracker, web site) is mostly in place; the only thing still missing is one committer account. We expect to get started with the actual design and code work during the next few weeks.