The Apache Software Foundation

This was extracted (@ 2024-11-20 22:10) from a list of minutes which have been approved by the Board.
Please note: The Board typically approves the minutes of the previous meeting at the beginning of every Board meeting; therefore, the list below does not normally contain details from the minutes of the most recent Board meeting.

WARNING: these pages may omit some original contents of the minutes.
This is due to changes in the layout of the source minutes over the years. Fixes are being worked on.

Meeting times vary; the exact schedule is available to ASF Members and Officers, who can search for "calendar" in the Foundation's private index page (svn:foundation/private-index.html).

Tika

16 Oct 2024 [Tim Allison / Jeff]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Released 3.0.0-BETA2 in July. Working towards a 3.0.0 release.

## Community Health:
Statistics reporter is down as of this writing (Oct 9 2024). General sense is
that community health is the same as last quarter.

17 Jul 2024 [Tim Allison / Justin]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Our last 2.x release was at the beginning of April, and we're in the process
of releasing 3.0.0-BETA2. We've dramatically improved configurability in the
tika-pipes modules, and we added a gRPC server. We've made numerous other
improvements throughout the project. We've also managed to keep up with
@dependabot. :)

## Community Health:
CHI is at 4.70. The project stats are not available as I write this report.
There has been some slowdown in activity because of $DAYJOBs, but we've seen
some great activity in the gRPC server and towards increasing configurability
in tika-pipes.

17 Apr 2024 [Tim Allison / Craig]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Released a 2.9.2 on April 2. This included upgrades to dependencies and a few
bug fixes. The project is working towards a 3.0.0-BETA2, and hopefully a 3.0.0
shortly thereafter. We're making improvements to our docker deployments and
helm charts. We're working on integrating fully recursive extraction of raw
bytes and text+metadata for embedded files into our pipes modules. We're
making progress towards adding a gRPC server.

## Community Health:
Chi is still a healthy 4.7. We're continuing to be on the lookout for new
PMC/committers.

17 Jan 2024 [Tim Allison / Sander]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (14 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 3.0.0-BETA in mid December. We're aiming for a 3.0.0 release in
the next few weeks. The big difference between the 2.x and 3.x branch is that
the 3.x branch will require Java 11. We plan to maintain the 2.x branch for 6
months after we release 3.0.0. We've seen a decrease in CVEs over the last
quarter.

## Community Health:
Community health is still a robust 4.7. We saw a decline in traffic on dev@
likely due to the end of year/holiday season.

18 Oct 2023 [Tim Allison / Sharan]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.9.0 on 28 August. We're working towards 3.0.0-BETA, which will
require Java 11 and transition from javax to jakarta.  We anticipate starting
that release process in mid to late October. We continue to improve file type
detection, fix small bugs and update dependencies.

We're discussing running a 2.9.1 release soon to benefit from
commons-compress's recent fix of CVE-2023-42503.

## Community Health:
Our community health score is a Healthy 4.70. We've seen no significant
changes in email traffic, commits or JIRA issues since last quarter.

@Christofer: follow up about ghost vote

19 Jul 2023 [Tim Allison / Christofer]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Project Status:
Current project status: Ongoing
Issues for the board: None

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.8.0 on 15 May, and we've started the preliminary regression
tests in preparation for the next release.  We've had some great contributions
from a new user of Tika for file type detection via a collaboration using
Common Crawl data.

We updated our regression corpus to remove most of the truncated PDFs from
Common Crawl and to add ~100k new PDFs from
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

## Community Health:
We've seen increases in issues opened and closed (largely driven by the new
mime patterns). Our CHI is 4.70 (Healthy).

19 Apr 2023 [Tim Allison / Sharan]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Our last release (2.7.0) was on 3 Feb 2023, and we're working towards the
release of 2.8.0 in the next few weeks. We're expanding our file type
detection based on data from Common Crawl.  We continue to make bug fixes and
improvements throughout the code base.

## Community Health:
We've seen a decrease in opened issues, commits and dev list emails.  The new
requirement to register for a JIRA account may be suppressing opened issues,
but we have no evidence of that. Our CHI is 4.70 (Healthy).

18 Jan 2023 [Tim Allison / Roy]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (13 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We had a minor release of 2.6.0 in November. We are starting to discuss goals
for a 3.x release, although we have no planned release date.

## Community Health:
The project's Chi is unchanged from last quarter -- 4.7, healthy. We've seen
decreases in email, JIRA and commits.  We suspect this is because of the
stabilization and adoption of the 2.x branch. There's a trivial increase in
GitHub PRs likely driven by dependabot.

19 Oct 2022 [Tim Allison / Willem]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
Our 1.x branch reached EoL on 30 September 2022.  We released the last 1.x
version, 1.28.5, on 14 September.  We also released the next minor revision of
our 2.x branch, 2.5.0, on 3 October.

## Community Health:
No major changes.  The project's Chi is 4.7, still healthy. There's an
increase in GitHub PRs driven by dependabot and our addition of some
dependencies that have daily/weekly releases.

20 Jul 2022 [Tim Allison / Bertrand]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, PDF, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.4.0 on May 2 and 2.4.1 on June 17. We released two
security-related fixes to our 1.x branch in May and one in June.  The new
functionality in our 2.x branch has included dramatically improving
customization entry-points and adding a generalized rendering interface.

## Community Health:
We saw an increase in JIRA activity and commits and a modest decrease in PRs
opened and closed and activity on our mailing lists. Our CHI is 5.5 (Healthy).

20 Apr 2022 [Tim Allison / Sander]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 32 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
On February 10th, we announced that our 1.x branch is in security-only
maintenance until a final end-of-life on 30 September 2022. We made a 2.3.0
release and a 1.28.1 release in February, and we're on the cusp of new
releases for both branches.  These releases include security related fixes in
our code base and in our dependencies.

We continue to improve our documentation for the 2.x branch, and we've helped
several people with questions on the breaking changes in the new branch.

We had a painful antisemitic/Nazi Google-Meet bomb during our Meetup in
January, and we've taken steps to limit membership and access to our Meetup
account.

## Community Health:
We haven't seen any significant changes in community health.  Since we added
dependabot, we've seen a significant increase in PRs, but otherwise we have
slight decreases in issues, commits and mail traffic. We take it as a good
sign that traffic has decreased as people are migrating to our 2.x branch,
apparently without too many surprises. Our Community Health Score (Chi) is
6.33 (Healthy).

19 Jan 2022 [Tim Allison / Justin]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (12 years ago)
There are currently 33 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is roughly 9:8.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released two versions of our main branch, 2.2.0 on December 16 and 2.2.1
on December 23, to upgrade dependencies (partly in response to CVEs in log4j2
and jdom2) and to fix some regressions.  We made a breaking change in our 1.x
branch to upgrade from log4j 1.x to Log4j2, and we released Tika 1.28 on
December 23.

We started a virtual Meetup group for Tika, and we've held two hands-on
tutorials (one on tika-eval, and one on the tika-pipes module).  We have
another meetup planned for January 2022, and we look forward to holding these
every month or so.

## Community Health:
We haven't seen any significant changes in community health.  We have seen an
increase in PRs and commit activity and a slight decrease in email traffic and
JIRA issues.  Our Community Health Score (Chi) is 6.33 (Healthy).

20 Oct 2021 [Tim Allison / Roman]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
On 23 September, ASF's V.P., Data Privacy, requested via private@ that our
project shut down our public regression corpus (largely files pulled out of
Common Crawl).  We asked to carry on that discussion publicly and received no
response.

The regression corpus is an unparalleled resource for parser developers on
Tika, POI, PDFBox and others. As evidenced by its use in processing files in
both the Panama Papers and, recently, in the Pandora papers (see below),
Apache Tika and its dependencies need to be tested on large corpora of
naturally occurring files.  The Chief Technology Officer of the PDF
Association has written two posts on the critical need and transformative
power of our bug tracker corpus
(https://www.pdfa.org/a-new-stressful-pdf-corpus/ and
 https://www.pdfa.org/stressful-pdf-corpus-grows/).

We need to find a solution that will enable this resource to continue, perhaps
with a strict robots.txt file and password protection.  We look forward to
working with ASF's Data Privacy V.P. to find a solution.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 33 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is roughly 9:8.

Community changes, past quarter:
- No new PMC members. Last addition was Nicholas DiPiazza on 2021-07-05.
- No new committers. Last addition was Nicholas DiPiazza on 2021-06-03.

## Project Activity:
We released 2.0.0 on 2021-07-19 and 2.1.0 on 2021-08-23.  We've gotten useful
feedback on our 2.x branch and we're continuing to improve that.  We're also
working to improve the documentation to help users migrate to the 2.x branch.
We're grateful to see some major projects making the migration, including
Datafari/FranceLabs
(https://twitter.com/francelabs/status/1447470094783819778).

Apache Tika appeared in the news this quarter as being a critical component of
the International Consortium of Investigative Journalists' (ICIJ) platform
(https://github.com/ICIJ/datashare) used to analyze the Pandora Papers
(https://www.wired.co.uk/article/pandora-papers-leak).  Previously, the ICIJ
 reported using Tika to process the Panama Papers
 (https://source.opennews.org/articles/people-and-tech-behind-panama-papers/).

## Community Health:
The CHI score went up from 8.37 in the last quarter to 9.6.  We've seen modest
decreases in JIRA activity, email lists and PRs.  We're not sure of the
underlying causes.

21 Jul 2021 [Tim Allison / Sharan]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 33 committers and 32 PMC members in this project.
The Committer-to-PMC ratio is roughly 9:8.

Community changes, past quarter:
- Nicholas DiPiazza was added as a PMC member on 2021-07-06
- Nicholas DiPiazza was added as committer on 2021-06-03

## Project Activity:
We released 2.0.0-BETA on 2021-05-25.  This includes the new pipes module
which improves robustness, scalability and ease of integration at scale. We
look forward to releasing a stable 2.0.0 in the next quarter.

In Tika 2.x, we're also ending the reliance on two custom forks we had to
create for security and backwards compatibility reasons. That we're once
more relying on external, maintained dependencies for these capabilities
is an important step forward.

We released 1.27 on 2021-07-06.  This includes numerous bug fixes and
dependency updates.

## Community Health:
The project continues to have strong community health (8.37 Chi score).
Activity on mailing lists, commits and pull requests was slightly down over
the last quarter.  However, the number of JIRA issues opened and closed both
increased.

21 Apr 2021 [Tim Allison / Sharan]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 32 committers and 31 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Peter Lee on 2020-11-24.
- No new committers. Last addition was Peter Lee on 2020-11-25.

## Project Activity:
We released 2.0.0-ALPHA on 2021-01-16 and a stable release, 1.26,
on 2021-03-29. The ALPHA release is an exciting step towards a BETA
or stable 2.x release in the next month or so.

We recently added several languages to our language detector and
made improvements in our mock parser, which allows users to harden
their pipelines against parser failures.  We're nearing completion on
a new pipes module that will allow for easier integration with datastores
(e.g. S3) and search engines, such as Apache Solr.


## Community Health:
Code contributors have increased in the last quarter, and we've seen an
impressive increase in email traffic. Our Community Health Score (Chi)
is "Super Healthy". We'll continue to be on the lookout for potential
new committers/PMC members.

20 Jan 2021 [Tim Allison / Bertrand]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (11 years ago)
There are currently 32 committers and 31 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- Peter Lee was added to the PMC on 2020-11-24
- Peter Lee was added as committer on 2020-11-25

## Project Activity:
We released 1.25 on 2020-11-40.  This version included numerous dependency
upgrades, a fix for a critical license issue with Adobe's xmpcore, and several new
parsers. We are on the cusp of a release of Tika 2.0.0-ALPHA.

On our file corpus development side project, we gathered "stressful"
attachments from 35 parser issue trackers. This includes more than a million
files (551GB).  These are critical for stress testing our own parsers, and
we're making the corpus available to other open source and commercial
projects: https://corpora.tika.apache.org/base/docs/bug_trackers/. See, for
example: https://www.pdfa.org/a-new-stressful-pdf-corpus/ and
https://www.pdfa.org/stressful-pdf-corpus-grows/

## Community Health:
As noted above, we've added Peter Lee as a committer/PMC. A number of our JIRA
and GitHub health metrics were down slightly in the last quarter.  We attribute
this to the holidays/new year. However, we saw an uptick in user@ traffic and
a slight increase in commits.

21 Oct 2020 [Tim Allison / Patricia]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 31 committers and 30 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
We're on the cusp of releasing 1.25.  We have a blocker with an accidentally
included, ASL-incompatible license in a dependency from Adobe
(https://issues.apache.org/jira/browse/TIKA-3204).

We've made good progress in a major refactoring for Tika 2.0.0 that was based
on a significant amount of earlier work by committer/PMC Bob Paulin.  This
refactoring will allow for cleaner dependency management and more modularized
parsers.  Once we release 1.25, we should be ready to start releasing
2.0.0-ALPHA.

We completed the migration for our primary branch from 'master' to 'main.'

## Community Health:
We've seen an uptick in issues and in PRs.  Much of this increase in activity
is driven by a commons committer who has taken an interest in improving our
codebase.  We are on the lookout to expand our committership/PMC.

15 Jul 2020 [Tim Allison / Craig]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 31 committers and 30 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
We released 1.24.1 on April 21.  This release included numerous security fixes
(CVE-2020-9489) which we identified through a new fuzzing module.

We've moved our regression testing server and corpus from Rackspace to a new
server kindly hosted by a committer on PDFBox.  We started a new mailing list
(corpora-dev@tika.apache.org) for this resource to enable cross-project
discussion (POI, PDFBox, Tika and Commons Compress) and to encourage
contributions and input from a wider audience.

We've removed whitelist/blacklist terminology from the project, and we are in
the process of migrating from 'master' branch to 'main'.

## Community Health:
Our Community Health Score of 6.33 suggests we are doing well. We've seen a
slight increase in traffic on our dev list.  Commits, issues and traffic on
the dev list have decreased slightly, but nothing worthy of board attention.

15 Apr 2020 [Tim Allison / Niclas]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.


## Issues:
There are no issues requiring board attention at this time.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 31 committers and 30 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
1.24 was released on 2020-03-17. We added a fuzzing module to identify denial
of service (DoS) vulnerabilities, and we're currently preparing a 1.24.1 release
that fixes several DoS vulnerabilities, primarily in our dependencies. We've
had mixed success in getting some of our (ASF-licensed but non-ASF) dependencies
to fix their code in a timely manner, and we've had to fork some dependencies
and release them separately.  We continue to work with these projects to
improve security.

## Community Health:
We've seen decreases in email and issue traffic in the past quarter,
but nothing alarming.

15 Jan 2020 [Tim Allison / Shane]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention at this time.

## Membership Data:
Apache Tika was founded 2010-04-20 (10 years ago)
There are currently 32 committers and 31 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Tilman Hausherr on 2019-10-02.
- No new committers. Last addition was Tilman Hausherr on 2019-10-03.

## Project Activity:
1.23 was released on 2019-12-06. This included improved file type
detection and a new parser for XLIFF files.

Nicholas DiPiazza recently contributed a parser for OneNote files, which
will be available in the next release.

Over the last two or three quarters, we've seen an increase in reports
of vulnerabilities in our dependencies.  We upgrade when we can, but
there are some upstream dependencies out of our control that have required
some non-trivial solutions, including forking and fixing (as a last resort).

## Community Health:
We've seen an increase in email, commits and other activity
compared with last quarter, but overall, no significant changes.

16 Oct 2019 [Tim Allison / Dave]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

## Issues:
There are no issues requiring board attention at this time.

## Membership Data:
Apache Tika was founded 2010-04-20 (9 years ago). There are currently 32
committers and 31 PMC members in this project. The Committer-to-PMC ratio is
roughly 1:1.

Community changes, past quarter:
- Tilman Hausherr was added to the PMC on 2019-10-02
- Tilman Hausherr was added as committer on 2019-10-02

## Project Activity:
1.22 was released on 2019-08-01

SooMyung Lee (soomyung) and JinSup Kim (ddoleye) contributed a parser for HWP
v5 files.

We added significant improvements in language coverage for the tika-eval
module by collaborating with OpenNLP: we now detect 121 (vs. 75) languages and
have common words lists for 121 (vs. 21) languages. This means we can now
identify potential problems/regressions in content extraction for 121
languages.

## Community Health:
No significant changes in community health. There were decreases in @dev email
traffic and in closed tickets, but we saw a slight increase in contributors
and opened PRs. The team gave two conference presentations on Tika --
ApacheCon NA and Activate -- and we have an upcoming talk at ApacheCon EU.

17 Jul 2019 [Tim Allison / Shane]

## Description:
 - Apache Tika is a dynamic toolkit for content detection, analysis, and
   extraction. It allows a user to understand, and leverage information from,
   a growing list of over 1200 different file types including most of the major
   types in existence (MS Office, Adobe, Text, Images, Video, Code, and
   science data) as recognised by IANA and other standards bodies.


## Issues:
 - There are no issues requiring board attention at this time.

## Activity:
 - The project had one release in the last quarter.
 - We're seeing increased reports of vulnerabilities in our dependencies on
   our JIRA. We added ossindex-maven-plugin a while ago and update as
   indicated, but a non-trivial amount of communications/bugs reported are now
   focused on this topic.
 - We plan to start the release process in the next few weeks for the next
   version.

## Health report:
 - Nothing noteworthy in email, commits, etc.

## PMC changes:

 - Currently 30 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018

## Committer base changes:

 - Currently 31 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018

## Releases:

 - 1.21 was released on Sun May 19 2019

## Mailing list activity:

 - Regular fluctuations; nothing significant to report.

 - dev@tika.apache.org:
    - 190 subscribers (down -1 in the last 3 months):
    - 716 emails sent to list (358 in previous quarter)

 - user@tika.apache.org:
    - 363 subscribers (down -3 in the last 3 months):
    - 70 emails sent to list (54 in previous quarter)


## JIRA activity:

 - 53 JIRA tickets created in the last 3 months
 - 45 JIRA tickets closed/resolved in the last 3 months

17 Apr 2019 [Tim Allison / Joan]

## Description:
 - Apache Tika is a dynamic toolkit for content detection, analysis, and
   extraction. It allows a user to understand, and leverage information from,
   a growing list of over 1200 different file types including most of the major
   types in existence (MS Office, Adobe, Text, Images, Video, Code, and
   science data) as recognised by IANA and other standards bodies.

## Issues:
 There are no issues requiring board attention at this time.

## Activity:
 - We've been making improvements to our PDFParser, and we added a CSV
   detector/parser.  Other than that, it has been a quiet quarter.  We look
   forward to our next release as soon as the next versions of PDFBox and POI
   are available.

## Health report:
 - Health is in good shape. No significant changes in health.

## PMC changes:

 - Currently 30 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018

## Committer base changes:

 - Currently 31 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018

## Releases:

 - Last release was 1.20 on Fri Dec 21 2018

## Mailing list activity:

 - We've seen a drop-off in emails to the dev list compared with last quarter.
   This may reflect a lower commit rate, but we do not know what is driving
   this.

 - dev@tika.apache.org:
    - 192 subscribers (down -3 in the last 3 months):
    - 363 emails sent to list (767 in previous quarter)

 - user@tika.apache.org:
    - 366 subscribers (up 1 in the last 3 months):
    - 54 emails sent to list (54 in previous quarter)


## JIRA activity:

 - 44 JIRA tickets created in the last 3 months
 - 23 JIRA tickets closed/resolved in the last 3 months

16 Jan 2019 [Tim Allison / Phil]

## Description:

 - Apache Tika is a dynamic toolkit for content detection, analysis, and
   extraction. It allows a user to understand, and leverage information from,
   a growing list of over 1200 different file types including most of the major
   types in existence (MS Office, Adobe, Text, Images, Video, Code, and
   science data) as recognised by IANA and other standards bodies.

## Issues:

 - There are no issues requiring board attention at this time.

## Activity:

 - The project had two releases in the last quarter, one a bug fix; and one
   with more substantial changes.

 - The project refreshed its 1TB regression corpus (hosted by Rackspace) from
   Common Crawl data, with heavy oversampling in binary file formats and a
   better diversity of character encodings and languages. This revealed
   several areas for improvements in Tika and its dependencies.

 - Work continues on the new 2.x based master.

## Health report:

 - Health is in decent shape. No significant changes in health.

## PMC changes:

 - Currently 30 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018

## Committer base changes:

 - Currently 31 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018

## Releases:

 - 1.19.1 was released on Mon Oct 08 2018
 - 1.20 was released on Fri Dec 21 2018

## JIRA activity:

 - 60 JIRA tickets created in the last 3 months
 - 39 JIRA tickets closed/resolved in the last 3 months

17 Oct 2018 [Tim Allison / Phil]

## Description:

 - Apache Tika is a dynamic toolkit for content detection, analysis, and
   extraction. It allows a user to understand, and leverage information from,
   a growing list of over 1200 different file types including most of the major
   types in existence (MS Office, Adobe, Text, Images, Video, Code, and
   science data) as recognised by IANA and other standards bodies.

## Issues:

 - There are no issues requiring board attention at this time.

## Activity:

 - The project had two releases in the last quarter.  In addition to numerous
   bug fixes, we've focused on improving robustness via tika-server and some
   new options in our ForkParser.

 - In the two recent releases, we've fixed several CVEs in our own code or by
   notifying and then upgrading dependencies. Two security researchers, Tobias
   Ospelt and Rohan Padhye, have reported numerous issues identified via
   fuzzing. We've granted Tobias access to our regression vm to continue his
   work on our 1TB of regression files.

 - We added a security page to our site to record vulnerabilities:
   http://tika.apache.org/security.html

 - Work continues on the new 2.x based master.

## PMC changes:

 - Currently 30 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Thejan Wijesinghe on Tue Apr 17 2018

## Committer base changes:

 - Currently 31 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Thejan Wijesinghe at Wed Apr 18 2018

## Releases:

 - 1.19.1 was released on Tue Oct 9 2018
 - 1.19 was released on Mon Sep 17 2018

## Mailing list activity:

 - dev@tika.apache.org:
    - 198 subscribers (up 3 in the last 3 months):
    - 703 emails sent to list (588 in previous quarter)

 - user@tika.apache.org:
    - 361 subscribers (up 4 in the last 3 months):
    - 55 emails sent to list (49 in previous quarter)


## JIRA activity:

 - 64 JIRA tickets created in the last 3 months
 - 45 JIRA tickets closed/resolved in the last 3 months

18 Jul 2018 [Tim Allison / Mark]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

## Issues:
 - There are no issues requiring board attention at this time.

## Activity:
 - The project is currently finalizing the contents of a new 1.19 release with
   updates to the Mimetype detection and deep learning modules as well as
   several other important upgrades.

 - Work continues on the new 2.x based master.

## PMC changes:

 - Currently 30 PMC members.
 - Thejan Wijesinghe was added to the PMC on Tue Apr 17 2018

## Committer base changes:

 - Currently 31 committers.
 - Thejan Wijesinghe was added as a committer on Wed Apr 18 2018

## Releases:

 - 1.18 was released on Mon Apr 23 2018

## Mailing list activity:

 - dev@tika.apache.org:
    - 194 subscribers (down -1 in the last 3 months):
    - 639 emails sent to list (762 in previous quarter)

 - user@tika.apache.org:
    - 356 subscribers (up 0 in the last 3 months):
    - 50 emails sent to list (59 in previous quarter)


## JIRA activity:

 - 58 JIRA tickets created in the last 3 months
 - 33 JIRA tickets closed/resolved in the last 3 months

20 Jun 2018

Change the Apache Tika Project Chair

 WHEREAS, the Board of Directors heretofore appointed David Meikle
 (dmeikle) to the office of Vice President, Apache Tika, and

 WHEREAS, the Board of Directors is in receipt of the resignation of
 David Meikle from the office of Vice President, Apache Tika, and

 WHEREAS, the Project Management Committee of the Apache Tika project
 has chosen by vote to recommend Tim Allison (tallison) as the successor
 to the post;

 NOW, THEREFORE, BE IT RESOLVED, that David Meikle is relieved and
 discharged from the duties and responsibilities of the office of Vice
 President, Apache Tika, and

 BE IT FURTHER RESOLVED, that Tim Allison be and hereby is appointed to
 the office of Vice President, Apache Tika, to serve in accordance with
 and subject to the direction of the Board of Directors and the Bylaws
 of the Foundation until death, resignation, retirement, removal or
 disqualification, or until a successor is appointed.

 Special Order 7F, Change the Apache Tika Project Chair, was
 approved by Unanimous Vote of the directors present.

18 Apr 2018 [David Meikle / Brett]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

## Issues:
- There are no issues that require the board's attention at this time.

## Activity:

- The project is currently finalising the contents of a new 1.18 release with
  updates to the Mimetype detection, a new XPS parser, improvements to the OCR
  Parser and updated base libraries.

- Work continues on the new 2.x based master with a significant addition being
  the new composite parser strategy work (TIKA-1509 [0]) added by Nick Burch

## PMC changes:

 - Currently 29 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Madhav Sharan on Thu Aug 31 2017
 - A vote has just finished to add a new PMC member

## Committer base changes:

 - Currently 30 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Madhav Sharan at Thu Aug 31 2017

## Releases:

  - 1.17 was released on Wed Dec 13 2017

## Mailing list activity:

 - dev@tika.apache.org:
    - 195 subscribers (down -1 in the last 3 months):
    - 776 emails sent to list (575 in previous quarter)

 - user@tika.apache.org:
    - 357 subscribers (up 7 in the last 3 months):
    - 65 emails sent to list (66 in previous quarter)

## JIRA activity:

  - 71 JIRA tickets created in the last 3 months
  - 47 JIRA tickets closed/resolved in the last 3 months

[0] https://issues.apache.org/jira/projects/TIKA/issues/TIKA-1509

17 Jan 2018 [David Meikle / Phil]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

## Issues:
- There are no issues that require the board's attention at this time.

## Board Questions:

  - mt: Why was the "2.0.6 release" thread on private@ ? It looks as if it
    could/should have been on dev@
    - It started off on dev@, but it looks like Tim was trying to ask whether it
      could be pushed ahead of ApacheCon and 'moved' it to private@.
    - We will avoid this in the future.

  - bd: Note that the names of people who have been invited to join the PMC
    but haven't replied yet shouldn't be included in reports, in case they
    decline.
    - Apologies, will note this for the future.

## Activity:

- Apache Tika 1.17 was released in December with key updates including:
  - Automatic image captioning
  - Phonetic run handling in Excel and Word
  - Many bug fixes and improvements
- We've now changed our repository so that master focuses on the 2.x series
- Discussion has now started on some of the key changes previously discussed
  in the Tika 2.0 roadmap[1]
- Sergey Beryozkin's TikaIO Apache Beam component has now stabilised and will
  be available in Beam 2.3.0

## PMC changes:

 - Currently 29 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Madhav Sharan on Thu Aug 31 2017

## Committer base changes:

 - Currently 30 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Madhav Sharan at Thu Aug 31 2017

## Releases:

  - 1.17 was released on Wed Dec 13 2017

## Mailing list activity:

  - dev@tika.apache.org:
     - 196 subscribers (down -4 in the last 3 months):
     - 582 emails sent to list (592 in previous quarter)

  - user@tika.apache.org:
     - 349 subscribers (down -3 in the last 3 months):
     - 76 emails sent to list (35 in previous quarter)

## JIRA activity:

  - 71 JIRA tickets created in the last 3 months
  - 39 JIRA tickets closed/resolved in the last 3 months

[1] https://wiki.apache.org/tika/Tika2_0RoadMap

18 Oct 2017 [David Meikle / Jim]

## Description:
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

## Issues:
There are no issues that require the board's attention at this time.

## Activity:
- Progress is steady towards a 1.17 release with bug fixes and new features
such as Image-to-Text captioning and improved Phonetic string handling.
- Sergey Beryozkin added Apache Tika into Apache Beam[1] as an input component
which has triggered cross community collaboration.
- We've also started catch-up work on the 2.x branch to align with updates made
on the master branch (1.x series).
- Google has released a Tika package for Go (thanks Tyler Bui-Palsulich) which
adds to the community released bindings [2].

## PMC changes:

- Currently 29 PMC members.
- Madhav Sharan was added to the PMC on Thu Aug 31 2017

## Committer base changes:

- Currently 30 committers.
- Madhav Sharan was added as a committer on Thu Aug 31 2017

## Releases:

- Last release was 1.16 on Wed Jul 12 2017

## Mailing list activity:

- dev@tika.apache.org:
 - 202 subscribers (down -2 in the last 3 months):
 - 595 emails sent to list (1396 in previous quarter)

- user@tika.apache.org:
 - 352 subscribers (up 6 in the last 3 months):
 - 35 emails sent to list (135 in previous quarter)

## JIRA activity:

- 46 JIRA tickets created in the last 3 months
- 30 JIRA tickets closed/resolved in the last 3 months

[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://s.apache.org/CJ2a

19 Jul 2017 [Dave Meikle / Bertrand]

What is Tika?
=============
Apache Tika is a dynamic toolkit for content detection, analysis,
and extraction. It allows a user to understand, and leverage
information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe,
Text, Images, Video, Code, and science data) as recognized by IANA
and other standards bodies.

Issues
======
There are no issues that need the board's attention.

Releases
========
The last release, Tika 1.16, was released on 17 Jul 2017, and 1.15 on
23 May 2017.

These releases contain some great features including Age Recognition,
Image Captioning based on this paper[2], a new tika-eval module to
compare output between versions, new parsers for WordPerfect and
QuattroPro, and much more.

Work has now begun on 1.17 and continues on the 2.X branch.

Community
=========
A community member was voted[1] in as both a committer and PMC member
on 27 Jun 2017 and is currently in the process of being invited to
join. Prior to that was Luis Filipe Nassif on Wed Apr 12 2017.

Mailing list activity on dev@ was 204 subscribers (down -3 in the
last 3 months), with 1473 emails sent to list (1010 in previous
quarter).

Mailing list activity on user@ was 347 subscribers (up 1 in the
last 3 months), with 135 emails sent to list (36 in previous quarter).

[1] https://s.apache.org/Sp4E
[2] https://arxiv.org/abs/1411.4555

19 Apr 2017 [Dave Meikle / Shane]

=== Apache Tika Status Report : April 2017 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release, Tika 1.14, was released on Wed Oct 19 2016.
Version 1.15 has been the focus of the last period, with a release candidate
to be built once the new POI release is out.

There are lots of new features such as WordPerfect and QuattroPro parsers,
SAX based parser for Office formats, new language detectors, inline image
extraction in PDFs, and a new tika-eval module to allow evaluation of extraction
between different systems.

Work has also continued on the new 2.X branch.

Community
=========================

The team have been actively publicising different uses of Apache Tika including
the Panama Papers Investigation which won the Pulitzer prize[1] and the new
Google / elastic cloud search offering [2].

Apache Tika is participating in the Google Summer of Code with two mentors in
Chris Mattmann and Thamme Gowda. We are looking forward to receiving proposals.

Mailing list activity on dev@ was at 235, 405 and 175 messages in Feb, Mar and
Apr 2017, respectively. user@ was at 14, 6 and 6 messages, during the same
timeframe.

[1] https://s.apache.org/XW3A
[2] https://s.apache.org/yzvc

18 Jan 2017 [Dave Meikle / Isabel]

=== Apache Tika Status Report : January 2017 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release, Tika 1.14, was released on Wed Oct
19 2016. Version 1.15 is now underway with current new features such as the
WordPerfect and QuattroPro parsers, SAX based parser for Office formats, and
inline image extraction in PDFs.

Work has also continued on the new 2.X branch.

Community
=========================

Luís Filipe Nassif was added as a committer on Tue Oct 18 2016

Tika was featured in an article from Chris Mattmann entitled 'Searching deep
and dark: Building a Google for the less visible parts of the web'[1].

Mailing list activity on dev@ was at 351, 359 and 162 messages in Nov, Dec and
Jan 2016/17, respectively. user@ was at 41, 2 and 17 messages, during the same
timeframe.

[1] https://s.apache.org/qzen

19 Oct 2016 [Dave Meikle / Isabel]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release of Tika was made in May 2016. Version 1.14 has been
progressing and will be released soon, including new
features such as the Image Recognition parser, OCR within PDFs, and support
for a range of new MIME types.

Work has also continued on the new 2.X branch.

Community
=========================

There have been no new committers added in this period. The last committer
joined in June 2016.

Tim Allison has blogged via the Open Preservation Foundation on the regression
pack work he kicked off to make Apache PDFBox, Apache POI and Apache Tika more
robust[1].

Mailing list activity on dev@ was at 242, 476 and 115 messages in Aug, Sep and
Oct 2016, respectively. user@ was at 5, 56 and 14 messages, during the same
timeframe.

[1] https://s.apache.org/z9QL

20 Jul 2016 [Dave Meikle / Chris]

=== Apache Tika Status Report : July 2016 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.13 release of Tika was made in May 2016[1], including various upgrades of dependencies,
fixes including a security vulnerability (CVE-2016-4434)[2], and improvements around Named
Entity Recognition. The 2.X stream is now being actively worked on.

Community
=========================

The Tika PMC added Thamme Gowda as a committer and PMC member in June 2016.

Chris Mattmann is mentoring Anastasija Mensikova as part of the Google Summer of Code 2016.
She is working on integrating OpenNLP's Sentiment Analysis in Tika.

Mailing list activity on dev@ was at 344, 328 and 77 messages in May, Jun and Jul
2016, respectively. user@ was at 41, 6 and 17 messages, during the same timeframe.

[1] https://s.apache.org/AFE1
[2] https://s.apache.org/Bbth

20 Apr 2016 [Dave Meikle / Chris]

=== Apache Tika Status Report : April 2016 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.12 release of Tika was made in February
2016[1], including various fixes and a new NamedEntityParser using Apache
OpenNLP. A 1.13 release is imminent with work well underway on the 2.X stream.

Community
=========================
Along with Apache Solr, Tika was part of the stack that
powered[1] the analysis of the Panama Papers leak.

A Twitter account, @ApacheTika, has been set up for the project to help with
announcements and engage with the community.

The Tika PMC added Bob Paulin as a committer and PMC member in September 2015.

Mailing list activity on dev@ was at 463, 423 and 259 messages in Feb, Mar and
Apr 2016, respectively. user@ was at 85, 10 and 21 messages, during the same
timeframe.

[1] https://s.apache.org/6A7M
[2] https://s.apache.org/QMJ3

20 Jan 2016 [Dave Meikle / Brett]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.11 release of Tika was made in October 2015[1], improving MIME type
support and adding a parser for GROBID (GeneRation Of BIbliographic Data).
Work has also started on the 2.X stream.

Community
=========================
The Tika PMC added Bob Paulin as a committer and PMC member in September 2015.

Mailing list activity on dev@ was at 170, 185 and 136 messages in Nov, Dec and
Jan 2015/16, respectively. user@ was at 9, 3 and 6 messages, during the same
timeframe.

[1] http://s.apache.org/TDD

21 Oct 2015 [Dave Meikle / Bertrand]

Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.10 release of Tika was made in August, upgrading support to Java 7 and
adding new features to make configuration easier. Discussions are underway for
the 1.11 release as well as progression towards a 2.X stream.

Community
=========================
The Tika PMC added Bob Paulin as a committer and PMC member in September.

There were two talks on Tika by Nick Burch and Michael Starch[1][2] at
ApacheCon Big Data in Budapest, as well as a gathering of interested parties.

Apache Tika has now been wrapped as a Perl Module[4], extending the list of
community client libraries available.

Mailing list activity on dev@ was at 397, 388 and 150 messages in Aug, Sep and
Oct 2015, respectively. user@ was at 45, 25 and 23 messages, during the same
timeframe.

[1] http://sched.co/3zt7
[2] http://sched.co/40Zd
[3] http://s.apache.org/Mmo
[4] https://metacpan.org/release/RIBUGENT/Apache-Tika-0.04

15 Jul 2015 [Dave Meikle / Rich]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major types
in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as
recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.9 release of Tika was made last month (June 2015) with new features such
as cTakes[1] integration and probabilistic MIME detection. Work has now started
on the 1.10 development stream.

Community
=========================
The Tika PMC added Luis Filipe Nassif in March 2015 and Giuseppe Totaro in
April 2015 as committers and PMC Members.

The authors (Chris Mattmann and Jukka Zitting) of the Tika in Action book have
donated the examples from the book to Tika. These have been included in a new
tika-examples sub-module.

There have been articles published on NASA and the Jet Propulsion Lab’s
involvement in the Memex project, which features Apache projects including
Tika[2].

Mailing list activity on dev@ was at 317, 307 and 63 messages in May, June and
July 2015, respectively. user@ was at 27, 39 and 9 messages, during the same
timeframe.

[1] https://wiki.apache.org/tika/cTAKESParser
[2] http://s.apache.org/Mmo

22 Apr 2015 [Dave Meikle / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.7) was made in January 2015. There has been much progress
since then with a release candidate for 1.8 currently being voted on.

Community
=========================
The Tika PMC added Luis Filipe Nassif in March 2015 and Giuseppe Totaro in
April 2015 as committers and PMC Members.

There are six talks related to Tika scheduled to take place at ApacheCon NA
2015.

Chris Mattmann has registered to be a mentor for Google Summer of Code 2015
with some Tika issues marked as potential projects.

Mailing list activity on dev@ was at 445, 891 and 107 messages in February,
March and April 2014, respectively. user@ was at 15, 15 and 1 messages,
during the same timeframe.

21 Jan 2015 [Dave Meikle / Ross]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.7) was made in January 2015[1], with many new features
including an OCR Parser based on Tesseract, improvements to the Tika JAXRS
Server and a number of parser fixes & enhancements.

Discussions have now started on the dev@ list to outline a roadmap[2] for what
a new 2.X stream could look like for the evolution of Tika.

Community
=========================
The Tika PMC added Konstantin Gribov as a committer and PMC Member in January
2015.

Mailing list activity on dev@ was at 389, 253 and 289 messages in November,
December 2014 and January 2015, respectively. user@ was at 10, 28 and 29
messages during the same timeframe.

[1] http://s.apache.org/u0p
[2] http://s.apache.org/DSm

15 Oct 2014 [Dave Meikle / Chris]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.6) was made in September 2014. Work has started on version
1.7 with bug fixes already complete and new features such as OCR Parsing being
worked on.

Community
=========================
The Tika PMC added Ann Bryant Burgess as a committer and PMC Member in August
2014.

New community-developed bindings[1] have been created for Tika including a
binding for NodeJS and an OpenShift Cartridge for Apache Tika Server.

Discussions have started to take place on the mailing list about a potential
Tika meetup at ApacheCon EU 2014.  Nick Burch is also presenting a talk titled
'What's With The 1s and 0s?' on using Tika and other related tools to analyse
binary content[2].

Mailing list activity on dev@ was at 393, 364 and 119 messages in August,
September and October 2014, respectively. user@ was at 23, 38 and 12 messages,
during the same timeframe.

[1] http://s.apache.org/Y7w
[2] http://sched.co/1pbkX7n

16 Jul 2014 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.5) was made in February 2014. Work has continued on version
1.6 with many bug fixes and new features, including many new file formats.
A discussion thread has started for a 1.6 release candidate.

Community
=========================
The Tika PMC added Lewis John McGibbney as a committer and PMC Member in June
2014.

A Tika Hackathon session took place at the ApacheCon NA 2014 conference, kicking
off improvements to our JAX-RS module. There were also presentations by
Annie Bryant, Nick Burch and Jukka Zitting as part of the main conference.

Mailing list activity on dev@ was at 298, 671 and 17 messages in May,
June and July 2014, respectively. user@ was at 15, 31 and 18 messages,
during the same timeframe.

16 Apr 2014 [Dave Meikle / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================

The last release (1.5) was made in February 2014.  Since then progress has
been steady on version 1.6 with a number of bug fixes and improvements.

Community
=========================
No new committers or PMC members were added since the last report.  Prior to
this the last new Committer and PMC Member was added in January 2014.

Tika is well represented at ApacheCon NA with four talks from three different
speakers (Jukka Zitting, Nick Burch and Annie Burgess). There are also plans to
conduct a couple of Hackathon sessions during the conference.

Mailing list activity on dev@ was at 293, 226 and 29 messages in February,
March and April 2014, respectively. user@ was at 26, 35 and 1 messages,
during the same timeframe.

15 Jan 2014 [Dave Meikle / Sam]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================

The last release (1.4) was made in July 2013. Work has progressed on version
1.5 with a number of bug fixes and improvements. A discussion thread is
underway on creating a version 1.5 release candidate.

Community
=========================
The Tika PMC has voted to add Hong-Thai Nguyen as a committer and PMC Member,
with ACK to board earlier this week.  Prior to this, the last committer and
PMC Member was added in July 2013.

Discussion has progressed around integrating Any23 components of value into
Tika. This is not in full swing yet; however, there is broad agreement on the
approach, with some initial patches being proposed and integrated.

Mailing list activity on dev@ was at 84, 155 and 28 messages in November,
December 2013 and January 2014, respectively. user@ was at 14, 10 and 0
messages during the same timeframe.

16 Oct 2013 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================

The last release (1.4) was made in July 2013 and work is currently underway
on version 1.5 with 23 issues currently resolved, comprising a mixture of
bug fixes and new features.

Community
=========================
The Tika PMC added Tim Allison as a committer and PMC Member in July 2013.

Chris Mattmann has won a National Science Foundation proposal for a project
at the University of Southern California to deliver an open source framework
for metadata exploration, automatic text mining and information retrieval of
polar data using Apache Tika[1].

Mailing list activity on dev@ was at 103, 86 and 29 messages in August,
September and October 2013, respectively. user@ was at 4, 18 and 0 messages,
during the same timeframe.

[1] http://s.apache.org/QqY

17 Jul 2013 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
Version 1.4 was released on the 2nd of July[1]. This release included
several important bugfixes and new features, including improvements to the
REST server and parser components.

Work is now underway on version 1.5.

Community
=========================
No new committers or PMC members were added since the last report, with both
the last committer and PMC member added in August 2012.

Mailing list activity on dev@ was at 126, 138 and 54 messages in May,
June and July 2013, respectively. user@ was at 7, 15 and 8 messages,
during the same timeframe.

[1] http://s.apache.org/7hB

17 Apr 2013 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
Version 1.3 was released on the 22nd of January[1]. This release included
several important bugfixes and new features, including better handling of
embedded files.

Work is now underway on version 1.4 with 15 issues resolved and 20 open
to date.

Community
=========================
No new committers or PMC members were added since the last report.

We have added a potential new feature as part of the ASF's potential projects
within the Google Summer of Code program[2].

Mailing list activity on dev@ was at 174, 70 and 11 messages in February,
March and April 2013, respectively. user@ was at 53, 32 and 10 messages
during the same timeframe.

[1] http://s.apache.org/PDH
[2] https://issues.apache.org/jira/browse/TIKA-605

16 Jan 2013 [Dave Meikle / Doug]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
Work continues on version 1.3 with 47 resolved and 18 open/in progress
JIRA tickets adding new features and providing bug fixes, with a
discussion thread underway to assess the need for a new release.

Community
=========================
No new committers or PMC members were added since the last report.

Jukka Zitting presented a session at ApacheCon EU titled "Content
Extraction With Apache Tika" [1].

Mailing list activity on dev@ was at 124, 109 and 21 messages in
November, December 2012 and January 2013, respectively. user@ was at
19, 18 and 6 messages, during the same timeframe.

[1] http://s.apache.org/CUc

17 Oct 2012 [Dave Meikle / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognised by IANA and other
standards bodies.

Releases
=========================
We released version 1.2 on the 12th of July 2012[1]. This contained
new features such as the JAX-RS based network server and XMP metadata
handling, along with new file formats and parser improvements.

Work is currently underway on version 1.3 with 22 resolved and 19
open/in progress JIRA tickets, adding support for open graph metadata,
correct rounding of geodata information and improved mime type
detection for JPEG 2000 formats.

Community
=========================
The Tika PMC added Sergey Beryozkin (July 2012), Ingo Renner (July
2012) and Jörg Ehrlich (August 2012) as PMC members and committers.

As sponsor, the Tika PMC voted to recommend the graduation of the Any23
incubator project to a TLP. This passed[3] and following the Incubator
PMC vote, the board approved the graduation resolution.

The Tika PMC voted to recommend Dave Meikle as the new chair[4]. This
was accepted by the board in August 2012 and Chris has now handed
duties over to Dave.

Jukka Zitting is scheduled to speak about Tika at ApacheCon Europe.
The session is titled 'Content extraction with Apache Tika'[5] and
shows how Tika can be used with a Lucene or Solr search index.

Mailing list activity on dev@ was at 173, 61 and 7 messages in August,
September and October 2012, respectively. user@ was at 34, 31 and 1
messages, during the same timeframe.

[1] http://s.apache.org/Vzr
[2] http://s.apache.org/HoO
[3] http://s.apache.org/gHE
[4] http://s.apache.org/kBQ
[5] http://s.apache.org/lnR

15 Aug 2012

Change the Apache Tika Project Chair

 WHEREAS, the Board of Directors heretofore appointed Chris Mattmann
 to the office of Vice President, Apache Tika, and

 WHEREAS, the Board of Directors is in receipt of the resignation
 of Chris Mattmann from the office of Vice President, Apache Tika,
 and

 WHEREAS, the Project Management Committee of the Apache Tika
 project has chosen to recommend David Meikle as the successor
 to the post;

 NOW, THEREFORE, BE IT RESOLVED, that Chris Mattmann is relieved and
 discharged from the duties and responsibilities of the office
 of Vice President, Apache Tika, and

 BE IT FURTHER RESOLVED, that David Meikle be and hereby is
 appointed to the office of Vice President, Apache Tika, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification, or
 until a successor is appointed.

 Special Order 7A, Change the Apache Tika Project Chair, was
 approved by Unanimous Vote of the directors present.

25 Jul 2012 [Chris A. Mattmann / Sam]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

Releases
=========================
Progress towards the 1.2 release continues. There have been a few
recent threads discussing making an RC ([1] and [2]). We anticipate
the 1.2 RC and official release arriving in the next month or so.

The 1.2 RC addresses 63 issues [3] including new features (e.g.,
Tika JAX-RS network server [4]), bug fixes (e.g., misuse of HTTP
content-encoding header [5]) and an enhanced approach to dealing
with metadata key naming and representation [6] including XMP
support.

Community
=========================
The Tika PMC added Ray Gauss as a Tika PMC member and
committer in May 2012.

The Tika PMC is still sponsoring the Any23 incubator project [7],
which is progressing along nicely and getting ready to make their
first Incubator release.

Chris started a thread [8] on private to discuss potentially rotating
the chair. So far there hasn't been strong positive or negative reception
to this suggestion.

Mailing list activity on dev@ was at 174, 102 and 134 messages in May,
June and July 2012, respectively. user@ was at 26, 28 and 45 messages,
during the same timeframe.


[1] http://s.apache.org/MFq
[2] http://s.apache.org/AMZ
[3] http://s.apache.org/w9
[4] https://issues.apache.org/jira/browse/TIKA-593
[5] https://issues.apache.org/jira/browse/TIKA-431
[6] http://wiki.apache.org/tika/MetadataRoadmap
[7] http://incubator.apache.org/any23/
[8] http://s.apache.org/P7c

18 Apr 2012 [Chris A. Mattmann / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognized by IANA and other
standards bodies.

Releases
=========================
We released Tika 1.1 on 3/23/12.

The current work on Tika 1.2 includes 14 of 33 issues already fixed.
These issues include a cool new Tika JAX-RS network server [2] that
really helped foster good will between the Apache Tika and CXF
communities. Sergey Beryozkin from CXF, and Maxim Valyanskiy from
Tika really led the way. Besides the network server, MIME type
support for the scientific data file format FITS, used heavily in
the astronomy community, was added [3]. The ability to extract
embedded images from PowerPoint files [4] and improvements to
the way that Tika loads Detectors and Parsers in an OSGi environment
[5] were also added in the current trunk development branch.

There has been discussion of adding GDAL support to Tika [6], which
would add hundreds of spatial formats, and the ability to parse and
detect them, to Tika.

Community
=========================
No new PMC members/committers were added in the last quarter.

The Tika PMC is still sponsoring the Any23 incubator project [7],
which is progressing along nicely and getting ready to make their
first Incubator release.

Mailing list activity on dev@ remained steady in January, February
and March 2012 (189, 125 and 200 messages, respectively) but slowed
in April 2012 (48 messages), while the user activity remained
consistent in January, February and March 2012 (51, 46 and 37
messages). No user questions in April 2012 yet.


[1] http://s.apache.org/dWz
[2] https://issues.apache.org/jira/browse/TIKA-593
[3] https://issues.apache.org/jira/browse/TIKA-874
[4] https://issues.apache.org/jira/browse/TIKA-883
[5] https://issues.apache.org/jira/browse/TIKA-884
[6] https://issues.apache.org/jira/browse/TIKA-605
[7] http://s.apache.org/gJ

24 Jan 2012 [Chris A. Mattmann / Roy]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognized by IANA and other
standards bodies.

Releases
=========================
We released Tika 1.0 on 11/7/11 [1] to coincide with ApacheCon NA 2011.
Since then we've been working on the 1.1 release, with 42 issues already
resolved in JIRA [2] and 1.1 likely to be shipped in the next quarter.

Community
=========================
We added Jerome Charron and Antoni Mylka to the Tika PMC in November 2011.

At ApacheCon NA, and since, there have been some discussions regarding
Tika and the ODF Toolkit [3], as well as the Tika PMC's sponsorship
of the Any23 incubator project [4], which is progressing along.

Mailing list activity on dev@ remained steady in November and
December 2011 (259 and 287 messages, respectively) but slowed in
January 2012 (21 messages), while the user activity remained consistent
in November and December 2011 (47 and 65 messages), but has been
quiet in January 2012 (12 messages).

Chris Mattmann gave a talk at ApacheCon NA 2011 on "Apache Tika:
One Point Oh!" [5] commemorating the upcoming 1.0 Tika release.
Chris also had some discussions about Any23 and Tika with Lewis
John McGibbney at ApacheCon NA 2011. Lewis is a Nutch PMC member,
the candidate for the Gora VP with the current board resolution and
an Any23 Incubator committer.

Press
=========================
Chris, Jukka and Sally have published a press release [6] about Tika's
use at NASA, as well as its use at Day Software and other companies. The
press release went out during ApacheCon NA 2011 and was perfect timing
with the event.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting completed the Manning book called
"Tika in Action" [7] in time for ApacheCon NA 2011. The book is now
available in print, ebook and mobi editions. Yay!

[1] http://s.apache.org/AE6
[2] http://s.apache.org/HLX
[3] https://issues.apache.org/jira/browse/TIKA-737
[4] http://s.apache.org/gJ
[5] http://na11.apachecon.com/talks/19391
[6] http://s.apache.org/N0I
[7] http://manning.com/mattmann/

26 Oct 2011 [Chris A. Mattmann / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognized by IANA and other
standards bodies.

Releases
=========================
We rolled the Tika 0.10 release on 9/30/2011 [1]. We opted for a 0.10
instead of 1.0 to try and time the 1.0 release with ApacheCon. We've got
a few weeks left and are going to try and make it!

Community
=========================
We added Michael McCandless to the Tika PMC on August 29th, 2011.

The Tika PMC agreed to sponsor the Any23 (Anything to Triples)
Incubator project [3]. Any23 is a semantic understanding toolkit,
whose goal is to extract information from, to detect, and to
reason over most of the current semantic document formats including
RDF, OWL, etc. Any23 leveraged Tika in its existing framework at
Googlecode, and we see the projects hopefully having a lot of synergy
going forward. Any23 was accepted into the Incubator on October 1,
2011 [4].

Mailing list activity on dev@ is growing (197, 356, and 186 in
August, September and October 2011, respectively), while the user
activity grew a little bit in August and September (70 and 66
messages), but has been relatively quiet in October (3 messages).

Chris Mattmann will give a talk at ApacheCon NA 2011 on "Apache
Tika: One Point Oh!" [5] commemorating the upcoming 1.0 Tika release.

Press
=========================
Chris, Jukka and Sally have drafted a press release about Tika's use at
NASA, as well as its use at Day Software and other companies. The goal
is to have this coincide with the 1.0 release and ApacheCon NA 2011.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called
"Tika in Action" [6]. The book is in its final copyediting stages and
should be in print in time for ApacheCon NA 2011.


[1] http://tika.apache.org/0.10/index.html
[2] http://s.apache.org/HzK
[3] http://s.apache.org/HWO
[4] http://s.apache.org/gJ
[5] http://na11.apachecon.com/talks/19391
[6] http://manning.com/mattmann/

20 Jul 2011 [Chris A. Mattmann / Jim]

Releases
=========================
Progress towards the 1.0 release continues and we hope to roll a 1.0 in
the Q3 timeframe.

Development activity is progressing, and there are a number of
issues being worked on via the user and dev lists. Notably, the Tika
command line interface now outputs JSON [1], and new document formats
are being worked on and/or improved: ole2 and ooxml [2], the pcap
format [3], the CHM format [4], the PRT format [5] and some new
Font parsers [6].

Based on dev discussion, Tika's MIME identifier is becoming more
prominently used in the Aperture project [7].

There was also some discussion regarding Tika's relationship with
Apache OOo [8].

Community
=========================
No new committers or PMC members were elected in this quarter.

Mailing list activity on dev@ remains steady (near the ~150 message
range), while user@ is coming in at around ~50 or so messages per
month.

Chris Mattmann will give a talk at ApacheCon NA 2011 on "Apache
Tika: One Point Oh!" commemorating the upcoming 1.0 Tika release.

Chris Mattmann's CS572 Search Engines class students at USC are
doing final projects related to Search technologies, and one of
them is from Fernando Arreola [6] who is contributing new font
parsers as mentioned above.

Press
=========================
Chris and Jukka and Sally are coordinating with Priscilla Vega from JPL to
make a press release about Tika's use at NASA, as well as its use at Day
Software and other companies. The goal is to have this coincide with the
1.0 release and OSCON.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika
in Action", and it is now complete and handed off to production. All
Chapters of the book and the appendices are now available on the MEAP page
[9]. The book is set to be published in the Q2/Q3 timeframe of 2011.

[1] https://issues.apache.org/jira/browse/TIKA-213
[2] https://issues.apache.org/jira/browse/TIKA-652
[3] https://issues.apache.org/jira/browse/TIKA-658
[4] https://issues.apache.org/jira/browse/TIKA-245
[5] https://issues.apache.org/jira/browse/TIKA-679
[6] http://s.apache.org/17Z
[7] http://s.apache.org/Z4r
[8] http://s.apache.org/3aC
[9] http://www.manning.com/mattmann/

20 Apr 2011 [Chris A. Mattmann / Jim]

Releases
=========================
We made our 0.9 release [1] in February 2011. This release included some
critical bug fixes, including a fix that re-enabled extraction of metadata
from Scientific Data File Formats (HDF+NetCDF) from the command line. In
addition, the release included a patch [2] that significantly reduced the
number of pulled in dependencies having to do with NetCDF. Support for
parsing via external forking was also added [3]. See [4] for a full list of
changes.

Progress towards the 1.0 release continues and we hope to roll a 1.0 in the
Q2/Q3 timeframe.

Community
=========================

We added Oleg Tikhonov to the Tika PMC in April 2011 [5].

Mailing list activity on user@ and dev@ remains steady (near the ~100 message
range).

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We completed a full draft of the
entire book, and Chapters 9 and 10 are now available on the MEAP page [6].
The book is set to be published in the Q2/Q3 timeframe of 2011.

[1] http://s.apache.org/1lE
[2] http://issues.apache.org/jira/browse/TIKA-596
[3] http://issues.apache.org/jira/browse/TIKA-556
[4] http://www.apache.org/dist/tika/CHANGES-0.9.txt
[5] http://s.apache.org/Jur
[6] http://www.manning.com/mattmann/

19 Jan 2011 [Chris A. Mattmann / Geir]

Releases
=========================
We made our 0.8 release [1] in November 2010. It's been
a long time coming, and there were over 98 JIRA issues [2]
addressed in the release. Work progresses towards a patch
release (either 0.8.1 or 0.9) and Chris Mattmann plans to RM
it and roll a release candidate hopefully in the next month.
This release should fix a number of smaller issues found after
folks have upgraded to 0.8.


Community
=========================

We added Maxim Valyanskiy to the Tika PMC in November 2010 [3].

Chris Mattmann gave a talk on Tika titled "Scientific Data
Curation and Processing with Apache Tika" [4] at ApacheCon NA
in November 2010 during the Lucene and friends session. The
morning talk was well attended and it was great to finally meet
everyone in person!


Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book
called "Tika in Action", and it is progressing steadily. We
completed our 2/3 book review and Chapters 1-8 of the book are
now available [5] through Manning's Early Access Program or MEAP.

[1] http://s.apache.org/W1Dh
[2] http://s.apache.org/73R
[3] http://s.apache.org/dj0
[4] http://s.apache.org/2ak
[5] http://www.manning.com/mattmann/

20 Oct 2010 [Chris A. Mattmann / Sam]

Releases
=========================
Progress towards an 0.8 release continues. There has been some recent
activity by Jukka Zitting [1] towards allowing for easily using Apache Tika
in environments requiring Compressed RTF / TNEF / LZFU parsing. Along these
lines, a broader discussion [2] ensued of how to use Apache Tika with parsing
libraries that are GPL-licensed. It was determined that plugins could be
developed and hosted externally (e.g., at Github) and used with an official
Apache Tika release from the ASF just by dropping the plugin jars onto a
deployed version of Apache Tika's classpath. This provides a great solution
for folks who want to use Apache Tika in environments where they need to
parse formats which require non-ASLv2 friendly libraries.

We've had some nice documentation updates occur since the last report as
well. Arturo Beltran contributed a "get Tika parsing up and running in 5
minutes" quick start guide in TIKA-464 [3]. Nick Burch also added
documentation on the Container-aware Tika detection capabilities in TIKA-477
[4]. Paul Jakubik started a great wiki discussion on container-based
Metadata formats in [5], the first major use of the Tika wiki [6]. Chris
Mattmann has since done a bit of reorganizing of information, adding the
Tika logo to the wiki as well.

Other various notable development items include an RSS Feed Parser (TIKA-466
[7]) from Julien Nioche and Chris Mattmann, Container-aware Mime Detection
(TIKA-477 [4]) from Nick Burch, Jukka Zitting et al., and various Charset
improvements (e.g., TIKA-471 [8] and TIKA-529 [9]) from Ken Krugler.

Tika's website is now hooked up to SVNpubsub thanks to Jukka Zitting [10].

One of the blockers to the 0.8 release, getting NetCDF pushed to Maven
Central, has seen some good progress as of late. Chris Mattmann has created
a space for the NetCDF jars in the Sonatype OSS free hosting area that is
synced to Maven Central [11].


Community
=========================

There have been no new Tika PMC members or committers elected in this
quarter.

A podcast for Tika recorded in April 2010 between Chris Mattmann and Rich
Bowen was posted on the Feathercast.org website and mentioned on the Tika
mailing lists [12].

The Tika community was consulted for two website updates. The first site
update was to post a link and image of the Tika in Action book [13]. The
second site update was to allow for the selection of search engine provider
used to search the Tika website [14].

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We are currently doing our 2/3 book
review and Chapters 1-5 of the book are now available [15] through Manning's
Early Access Program or MEAP.

[1] http://github.com/jukka/jtnef
[2] http://s.apache.org/JGz
[3] https://issues.apache.org/jira/browse/TIKA-464
[4] https://issues.apache.org/jira/browse/TIKA-477
[5] http://wiki.apache.org/tika/MetadataDiscussion
[6] http://wiki.apache.org/tika/
[7] https://issues.apache.org/jira/browse/TIKA-466
[8] https://issues.apache.org/jira/browse/TIKA-471
[9] https://issues.apache.org/jira/browse/TIKA-529
[10] https://issues.apache.org/jira/browse/TIKA-473
[11] https://issues.apache.org/jira/browse/TIKA-407
[12] http://s.apache.org/Zmd
[13] http://s.apache.org/XUK
[14] http://s.apache.org/RgA
[15] http://manning.com/mattmann/

Shane appreciates the detail in this report.

21 Jul 2010 [Chris A. Mattmann / Shane]

Releases
=========================
There's been a bunch of development activity and mailing list activity in
Tika on a broad range of issues, from charset detection improvement
[1] from Ken Krugler, to BoilerPlate extraction [2] (also from Ken), to
improving HTML parsing overall [3] [4] [5] with contributions from Julien
Nioche, Ken Krugler, Jukka Zitting, and our new committer Nick Burch (see
Community section). There has also been a flurry of activity on new Parsers,
improving parsing detection with Timeouts, Geospatial information
representation, improving PDF parsing and a host of other issues that
indicate the development community of Tika is thriving.

We've slipped a bit on our original intention to release 0.8 in this month's
timeframe. One thing that will need to get solved is the upload of some
current external dependencies (e.g., NetCDF, Boilerpipe) to Maven Central.
Chris Mattmann and Ken Krugler volunteered to take the lead on this. 0.8
isn't far off, but it's dependent on getting those jars up to Maven Central.

Community
=========================
The Tika PMC elected to add Nick Burch [6] to the Tika PMC and committers
group. Nick is an ASF member, and the VP of Apache POI, an important
external library dependency of many of the Tika parsers. Welcome, Nick!

Chris Mattmann is teaching CS572: Information Retrieval and Web Search
Engines at USC this summer [7], and several of the final projects in his
class are related to Tika. There is a plan to contribute the final code
produced from some of the students in the form of JIRA issues and patches.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We have completed our 1/3 book
review and the book is now available [8] through Manning's Early Access
Program or MEAP.

IBM Developerworks Article on Tika
==================================
Chris Mattmann and Oleg Tikhonov published a short IBM Developerworks intro
article on Tika [9].

[1] http://issues.apache.org/jira/browse/TIKA-459
[2] http://issues.apache.org/jira/browse/TIKA-462
[3] http://issues.apache.org/jira/browse/TIKA-394
[4] http://issues.apache.org/jira/browse/TIKA-460
[5] http://issues.apache.org/jira/browse/TIKA-463
[6] MID: AANLkTinayUCu9pgv8p-ds3SFFMxjpCMLGkxhMTQiTFm7@mail.gmail.com
[7] http://sunset.usc.edu/classes/cs572_2010/
[8] http://manning.com/mattmann/
[9] http://s.apache.org/FXa

16 Jun 2010 [Chris A. Mattmann / Greg]

This is the second report from Apache Tika in its new TLP status approved at
the April 2010 board meeting.

TLP migration
=========================
All complete! Website has been updated, SVN taken care of, mailing lists
migrated and UNIX groups and domain taken care of. Thanks to Gavin for
handling this.

Releases
=========================
Since our last report, we've committed some important bug fixes, including
TIKA-379 [1], which makes HTML elements and attributes available in the
parsed XHTML provided by Tika, and some mime type detection fixes, in
particular TIKA-417 [2]. We still think we are on target to release 0.8
within the next month or so.

License Issue
=========================
During the last board meeting, a question was raised regarding the licensing
of UCAR/NCAR's NetCDF Java library, which was used to implement the TIKA-400
netCDF Tika Parser [3]. The question concerned an "advertising" clause in the
UCAR/NCAR license, and the board asked Chris to follow up on it. Chris took
the issue over to legal-discuss@ and a resolution was reached [4], which
included adding some text to NOTICE.txt and LICENSE.txt in Tika, fulfilling
the advertising clause via the Apache NOTICE mechanism. This issue was
tracked and fixed in Tika in TIKA-432 [5], and is now solved in the current
0.8 trunk, removing a roadblock to the 0.8 release.

Community
=========================
The Tika PMC elected to add Julien Nioche [6] to the Tika PMC and committers
group. Julien is a Nutch committer, and has been providing quality patches
to Tika (incl. the fix for TIKA-379) and good mailing list support.

Jukka Zitting presented on Tika at the Lucene Eurocon [7] and Berlin
Buzzwords [8] conferences.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We are preparing for a 1/3 book
review likely in the next week or so, and the book is about to be available
in Manning's Early Access Program or MEAP, electronically, any day now.

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-379
[2] http://issues.apache.org/jira/browse/TIKA-417
[3] http://issues.apache.org/jira/browse/TIKA-400
[4] http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201005.mbox/%3CAANLkTinXO4AZkYDm0L83SipPXl48kYC8enFRhamJCBvJ@mail.gmail.com%3E
[5] http://issues.apache.org/jira/browse/TIKA-432
[6] http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3CC83020F3.14F6B%25Chris.A.Mattmann@jpl.nasa.gov%3E
[7] http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
[8] http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630

19 May 2010 [Chris A. Mattmann / Jim]

This is the first report from Apache Tika in its new TLP status approved at
the April 2010 board meeting.

TLP migration
=========================
Progress is being made in transitioning Tika off of the Lucene site and
infrastructure and into its new TLP home. The current status is:

(Gavin from the INFRA team grouped the below into a TLP migration task [1])

1. mailing lists migration, filed INFRA-2645 [2], Gavin working on it as of
last Saturday

2. SVN, filed INFRA-2646 [3], status is same as for #1

3. UNIX groups, and home on www.apache.org/dist [4], status same as #1

4. Domain name for tika.apache.org [5], status same as #1

Releases
=========================
Steady progress on the 0.8 Tika release is being made, with contributions
made to add more file formats (netCDF is a recent addition), to allow Tika
to be used in server environments with their own classloaders (like Apache
SOLR), and to improve image metadata extraction. We hope to release 0.8
within the next month or so.

Community
=========================
The Tika community reached out to the NetCDF community in order to get their
NetCDF jars released to Maven central (currently TIKA-400 [6] relies on
NetCDF jars from an external Maven2 repository that's not synced with
Central). Jukka Zitting pointed out that this isn't a best practice, so we
are working with the NetCDF community to resolve this, and they have been
receptive [7], in particular John Caron from UCAR/NCAR is working to help us
out.

[1] https://issues.apache.org/jira/browse/INFRA-2692
[2] https://issues.apache.org/jira/browse/INFRA-2645
[3] https://issues.apache.org/jira/browse/INFRA-2646
[4] https://issues.apache.org/jira/browse/INFRA-2647
[5] https://issues.apache.org/jira/browse/INFRA-2676
[6] https://issues.apache.org/jira/browse/TIKA-400
[7] http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/201004.mbox/%3C4BD758B7.60202@unidata.ucar.edu%3E

The thredds/cdm dependency appears to have an "advertising" clause. To be followed up on legal-discuss.

21 Apr 2010

Establish the Apache Tika Project

 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to content detection and analysis for
 distribution at no charge to the public.

 NOW, THEREFORE, BE IT RESOLVED, that a Project Management
 Committee (PMC), to be known as the "Apache Tika Project",
 be and hereby is established pursuant to Bylaws of the
 Foundation; and be it further

 RESOLVED, that the Apache Tika Project be and hereby is
 responsible for the creation and maintenance of software
 related to a content analysis and detection toolkit; and be it further

 RESOLVED, that the office of "Vice President, Apache Tika" be
 and hereby is created, the person holding such office to
 serve at the direction of the Board of Directors as the chair
 of the Apache Tika Project, and to have primary responsibility
 for management of the projects within the scope of
 responsibility of the Apache Tika Project; and be it further

 RESOLVED, that the persons listed immediately below be and
 hereby are appointed to serve as the initial members of the
 Apache Tika Project:

   * Chris A. Mattmann (mattmann@apache.org)
   * Jukka Zitting (jukka@apache.org)
   * Ken Krugler (kkrugler@apache.org)
   * Keith Bennett (kbennett@apache.org)
   * Mark Harwood (mharwood@apache.org)
   * Dave Meikle (dmeikle@apache.org)
   * Sami Siren (siren@apache.org)
   * Rida Benjelloun (ridabenjelloun@apache.org)

 NOW, THEREFORE, BE IT FURTHER RESOLVED, that Chris A. Mattmann
 be appointed to the office of Vice President, Apache Tika, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification,
 or until a successor is appointed; and be it further

 RESOLVED, that the Apache Tika Project be and hereby
 is tasked with the migration and rationalization of the Apache
 Lucene Tika sub-project; and be it further

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Tika sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

 Special Order 7C, Establish the Apache Tika Project, was
 approved by Unanimous Vote of the directors present.

15 Oct 2008

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community

 * Dave Meikle was just voted in as a new committer.
 * Paolo Mottadelli will present Tika at ApacheCon US.

Development

 * Tika 0.2 should be released soon.
 * Usage documentation has been added to the website.

Issues before graduation:

 * The current plan is to graduate as a Lucene subproject, which could
happen soon as the incubation criteria seem to be met.

16 Jul 2008

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community

 * Tika community remains relatively small, with just a handful of active
members

Development

 * Work towards Tika 0.2 continues, Chris Mattmann has volunteered to be the
release manager

Issues before graduation:

 * Increase the size and diversity of the community (or graduate into a
Lucene subproject?)

16 Apr 2008

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community

 * Niall Pemberton joined the project as a committer and PPMC member
 * The number of issues reported by external contributors is growing
gradually
 * There was a Fast Feather Talk on Tika at ApacheCon EU 2008
 * We have good contacts especially with Apache POI and PDFBox

Development

 * We are working towards Tika 0.2
 * Metadata handling improvements are being discussed

Issues before graduation:

 * Increase the size of the community

17 Oct 2007

Tika is a toolkit for detecting and extracting metadata and structured text
content from various documents using existing parser libraries. Tika entered
incubation on March 22nd, 2007.

Community

There have been a number of positive developments within Tika during the last
few months. The traffic on the Tika mailing list has increased significantly
(with typically 2 or 3 questions, and 1 or 2 commits, every day or every
other day), and there have been a lot of recent inquiries from external
projects wanting to collaborate with Tika (including Aperture, PDFBox and a
fellow developing a JSON library currently hosted at Google Code). In
addition, Tika's architecture has become a recent topic of discussion (as
we'll see below).

We recently elected Keith Bennett as a new committer to Tika. Keith has been
spearheading many of the new patches committed to Tika, as well as
participating in discussions about the architecture, and future direction of
the project.

Tika will be represented at the "Fast Feather" track at ApacheCon US by
Jukka Zitting. The rest of the community is helping to create the content
for the presentation. The abstract is listed below:

Tika is a new content analysis framework born from the desire to factor out
commonality from the Apache Nutch search engine framework. Tika provides a
mime detection framework, an extensible parsing framework and metadata
environment for content analysis. Though in its nascent stages, progress on
Tika has recently taken shape and the project is nearing a stable 0.1
release. In this talk, we'll describe the core APIs of Tika and discuss its
use in several distinct domains including search engines, scientific data
dissemination and an industrial setting.


Development

There has been a flurry of JIRA issues and code activity [1], including 47
issues currently in JIRA (with 32 resolved issues, 14 closed issues, and 2
open major/minor issues in progress).

Tika's Parser interface (one of its key components) has just undergone a
major overhaul led by Jukka Zitting, and Chris Mattmann has recently
contributed a MimeType system (with help from fellow Apache Nutch committer
Jerome Charron) to Tika. We also cleaned up and refactored large parts of
the rest of the code (removing references to LiusLite and branding the
project wherever possible with the Tika name), in preparation for an
upcoming 0.1 release.

Chris Mattmann has led an effort to carve out the existing MimeType
detection system in Apache Nutch [2] and replace it with Tika's improved
MimeType detection system. There is a patch sitting in JIRA right now [3],
and barring objections, Nutch will rely on Tika for its MimeType detection
abilities.

Also active recently were committers Bertrand Delacretaz, Sami Siren and
Rida Benjelloun, committing patches and improvements wherever needed.


Issues before graduation

No changes since our last report: the Tika project is still at an early
stage of incubation. We need to continue bringing in the initial codebases
and are targeting an initial incubating release (0.1) probably within the
next month. We also need to work on growing the community and figuring out
how to best interact with external parser projects.


 1. http://issues.apache.org/jira/browse/TIKA
 2. http://lucene.apache.org/nutch/
 3. http://issues.apache.org/jira/browse/NUTCH-562

18 Jul 2007

Tika is a toolkit for detecting and extracting metadata and structured text
content from various document formats using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community:
 * The Tika mailing list has seen increased activity in recent weeks,
with some new people showing interest in Tika's goals.
 * Grant Ingersoll brought the Aperture framework to our attention
(http://aperture.sourceforge.net/), which has similar goals to Tika. We will
look at possible synergies.

Development:
 * No code has been committed since our last report, but some initial code
is ready in JIRA and should be committed soon.

Issues before graduation:
 * No changes since our last report: the Tika project is still at an early
stage of incubation. We need to continue bringing in the initial codebases
and probably target an initial incubating release later this year. We
also need to work on growing the community and figuring out how to best
interact with external parser projects.

20 Jun 2007

Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser libraries.
Tika entered incubation on March 22nd, 2007.

Community

The Tika mailing lists have been relatively quiet lately, probably
because with little code we don't yet have many concrete issues to talk
about.

Development

We saw the first piece of Tika code when Chris A. Mattmann ported the
Nutch metadata framework to Tika. Rida Benjelloun has created a version
of the Lius codebase to be included in Tika, and the code is currently
in the issue tracker.

Issues before graduation

The Tika project is still at an early stage of incubation. We need to
continue bringing in the initial codebases and probably target an
initial incubating release later this year. We also need to work on
growing the community and figuring out how to best interact with
external parser projects.

16 May 2007

Tika is a toolkit for detecting and extracting metadata and structured text
content from various documents using existing parser libraries.

Incubating since: March 22nd, 2007.

__Community__

We had a good project bootstrap meeting as a part of the text analysis BOF
at ApacheCon EU in Amsterdam. The resulting ideas were summarized on the
project mailing list, and the first design threads have started.

__Development__

We've started discussing the design of the Tika toolkit. It seems like we
will select one of the existing codebases listed in the project proposal as
the basis of an early 0.1 release, and start refactoring the code into a
more generic toolkit. The Tika svn tree is still empty, but I expect us to
see the first code commits before the next report.

__Infrastructure__

All the initial infrastructure is now in place. There is still some activity
on the temporary Tika wiki on the Google Project hosting service, so we may
end up requesting a Tika wiki to be set up on the ASF infrastructure.

__Issues before graduation__

The Tika project is still at an early stage of incubation. The most
important tasks before graduation are to develop and release the Tika
codebase and to grow a diverse and sustainable project community.

25 Apr 2007

iPMC Reviewers: rdonkin

Tika is a toolkit for detecting and extracting metadata and structured text
content from various documents using existing parser libraries. Tika entered
incubation on March 22nd, 2007.

The Tika project has just started. The basic infrastructure (mailing lists,
subversion, issue tracker, web site) is mostly in place; the only thing
still missing is one committer account. We expect to get started with the
actual design and code work during the next few weeks.