
This was extracted (@ 2017-06-08 23:10) from a list of minutes which have been approved by the Board.
Please note: the Board typically approves the minutes of the previous meeting at the beginning of every Board meeting; therefore, the list below does not normally contain details from the minutes of the most recent Board meeting.


Tika

19 Apr 2017 [Dave Meikle / Shane]

=== Apache Tika Status Report : April 2017 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release, Tika 1.14, was made on Wed Oct 19 2016.
Version 1.15 has been the focus of the last period, with a release candidate
to be built once the new POI release is out.

There are lots of new features, such as WordPerfect and QuattroPro parsers,
a SAX-based parser for Office formats, new language detectors, inline image
extraction in PDFs, and a new tika-eval module to allow evaluation of extraction
between different systems.

Work has also continued on the new 2.X branch.

Community
=========================

The team have been actively publicising different uses of Apache Tika including
the Panama Papers investigation, which won the Pulitzer Prize[1], and the new
Google / Elastic cloud search offering[2].

Apache Tika is participating in the Google Summer of Code with two mentors in
Chris Mattmann and Thamme Gowda. We are looking forward to receiving proposals.

Mailing list activity on dev@ was at 235, 405 and 175 messages in Feb, Mar and
Apr 2017, respectively. user@ was at 14, 6 and 6 messages, during the same
timeframe.

[1] https://s.apache.org/XW3A
[2] https://s.apache.org/yzvc

18 Jan 2017 [Dave Meikle / Isabel]

=== Apache Tika Status Report : January 2017 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release, Tika 1.14, was made on Wed Oct 19 2016. Version 1.15 is
now underway, with new features so far including the WordPerfect and
QuattroPro parsers, a SAX-based parser for Office formats, and inline image
extraction in PDFs.

Work has also continued on the new 2.X branch.

Community
=========================

Luís Filipe Nassif was added as a committer on Tue Oct 18 2016.

Tika has featured in an article from Chris Mattmann entitled 'Searching deep
and dark: Building a Google for the less visible parts of the web'[1].

Mailing list activity on dev@ was at 351, 359 and 162 messages in Nov, Dec and
Jan 2016/17, respectively. user@ was at 41, 2 and 17 messages, during the same
timeframe.

[1] https://s.apache.org/qzen

19 Oct 2016 [Dave Meikle / Isabel]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release of Tika was made in May 2016. Version 1.14 has been
progressing and will be released soon, including new features such as the
Image Recognition parser, OCR within PDFs, and support for a range of new
MIME types.

Work has also continued on the new 2.X branch.

Community
=========================

There have been no new committers added in this period. The last committer
joined in June 2016.

Tim Allison has blogged via the Open Preservation Foundation on the regression
pack work he kicked off to make Apache PDFBox, Apache POI and Apache Tika more
robust[1].

Mailing list activity on dev@ was at 242, 476 and 115 messages in Aug, Sep and
Oct 2016, respectively. user@ was at 5, 56 and 14 messages, during the same
timeframe.

[1] https://s.apache.org/z9QL

20 Jul 2016 [Dave Meikle / Chris]

=== Apache Tika Status Report : July 2016 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.13 release of Tika was made in May 2016[1], including various upgrades of dependencies,
fixes including one for a security vulnerability (CVE-2016-4434)[2], and improvements around Named
Entity Recognition. The 2.X stream is now being actively worked on.

Community
=========================

The Tika PMC added Thamme Gowda as a committer and PMC member in June 2016.

Chris Mattmann is mentoring Anastasija Mensikova as part of the Google Summer of Code 2016.
She is working on integrating OpenNLP's Sentiment Analysis in Tika.

Mailing list activity on dev@ was at 344, 328 and 77 messages in May, Jun and Jul
2016, respectively. user@ was at 41, 6 and 17 messages, during the same timeframe.

[1] https://s.apache.org/AFE1
[2] https://s.apache.org/Bbth

20 Apr 2016 [Dave Meikle / Chris]

=== Apache Tika Status Report : April 2016 ===

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content
detection, analysis, and extraction. It allows a user to understand, and
leverage information from, a growing list of over 1200 different file types
including most of the major types in existence (MS Office, Adobe, Text,
Images, Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.12 release of Tika was made in February
2016[1], including various fixes and a new NamedEntityParser using Apache
OpenNLP. A 1.13 release is imminent with work well underway on the 2.X stream.

Community
=========================
Along with Apache Solr, Tika was part of the stack that
powered[1] the analysis of the Panama Papers leak.

A Twitter account, @ApacheTika, has been set up for the project to help aid
announcements and engage with the community.

The Tika PMC added Bob Paulin as a committer and PMC member in September 2015.

Mailing list activity on dev@ was at 463, 423 and 259 messages in Feb, Mar and
Apr 2016, respectively. user@ was at 85, 10 and 21 messages, during the same
timeframe.

[1] https://s.apache.org/6A7M
[2] https://s.apache.org/QMJ3

20 Jan 2016 [Dave Meikle / Brett]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.11 release of Tika was made in October 2015[1], improving MIME type
support and adding a parser for GROBID (GeneRation Of BIbliographic Data
Discussions). Work has also started on the 2.X stream.

Community
=========================
The Tika PMC added Bob Paulin as a committer and PMC member in September 2015.

Mailing list activity on dev@ was at 170, 185 and 136 messages in Nov, Dec and
Jan 2015/16, respectively. user@ was at 9, 3 and 6 messages, during the same
timeframe.

[1] http://s.apache.org/TDD

21 Oct 2015 [Dave Meikle / Bertrand]

Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.10 release of Tika was made in August, upgrading support to Java 7 and
adding new features to make configuration easier. Discussions are underway for
the 1.11 release as well as progression towards a 2.X stream.

Community
=========================
The Tika PMC added Bob Paulin as a committer and PMC member in September.

There were two talks on Tika by Nick Burch and Michael Starch[1][2] at
ApacheCon Big Data in Budapest, as well as a gathering of interested parties.

Apache Tika has now been wrapped as a Perl Module[4], extending the list of
community client libraries available.

Mailing list activity on dev@ was at 397, 388 and 150 messages in Aug, Sep and
Oct 2015, respectively. user@ was at 45, 25 and 23 messages, during the same
timeframe.

[1] http://sched.co/3zt7
[2] http://sched.co/40Zd
[3] http://s.apache.org/Mmo
[4] https://metacpan.org/release/RIBUGENT/Apache-Tika-0.04

15 Jul 2015 [Dave Meikle / Rich]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major types
in existence (MS Office, Adobe, Text, Images, Video, Code, and science data) as
recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The 1.9 release of Tika was made last month (June 2015) with new features such
as cTakes[1] integration and probabilistic MIME detection. Work has now started
on the 1.10 development stream.

Community
=========================
The Tika PMC added Luis Filipe Nassif in March 2015 and Giuseppe Totaro in
April 2015 as committers and PMC Members.

The authors (Chris Mattmann and Jukka Zitting) of the Tika in Action book have
donated the examples from the book to Tika. These have been included in a new
tika-examples sub-module.

There have been articles published on NASA and the Jet Propulsion Lab’s
involvement in the Memex project, which features Apache projects including
Tika[1].

Mailing list activity on dev@ was at 317, 307 and 63 messages in May, June and
July 2015, respectively. user@ was at 27, 39 and 9 messages, during the same
timeframe.

[1] https://wiki.apache.org/tika/cTAKESParser
[2] http://s.apache.org/Mmo

22 Apr 2015 [Dave Meikle / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.7) was made in January 2015. There has been much progress
since then with a release candidate for 1.8 currently being voted on.

Community
=========================
The Tika PMC added Luis Filipe Nassif in March 2015 and Giuseppe Totaro in
April 2015 as committers and PMC Members.

There are six talks related to Tika scheduled to take place at ApacheCon NA
2015.

Chris Mattmann has registered to be a mentor for Google Summer of Code 2015
with some Tika issues marked as potential projects.

Mailing list activity on dev@ was at 445, 891 and 107 messages in February,
March and April 2015, respectively. user@ was at 15, 15 and 1 messages,
during the same timeframe.

21 Jan 2015 [Dave Meikle / Ross]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.7) was made in January 2015[1], with many new features
including an OCR Parser based on Tesseract, improvements to the Tika JAXRS
Server and a number of parser fixes & enhancements.

Discussions have now started on the dev@ list to outline a roadmap[2] for what
a new 2.X stream could look like for the evolution of Tika.

Community
=========================
The Tika PMC added Konstantin Gribov as a committer and PMC Member in January
2015.

Mailing list activity on dev@ was at 389, 253 and 289 messages in November,
December 2014 and January 2015, respectively. user@ was at 10, 28 and 29 messages,
during the same timeframe.

[1] http://s.apache.org/u0p
[2] http://s.apache.org/DSm

15 Oct 2014 [Dave Meikle / Chris]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.6) was made in September 2014. Work has started on version
1.7 with bug fixes already complete and new features such as OCR Parsing being
worked on.

Community
=========================
The Tika PMC added Ann Bryant Burgess as a committer and PMC Member in August
2014.

New community developed bindings[1] have been created for Tika including a
binding for NodeJS and an OpenShift Cartridge for Apache Tika Server.

Discussions have started to take place on the mailing list about a potential
Tika meetup at ApacheCon EU 2014.  Nick Burch is also presenting a talk titled
'What's With The 1s and 0s?' on using Tika and other related tools to analyse
binary content[2].

Mailing list activity on dev@ was at 393, 364 and 119 messages in August,
September and October 2014, respectively. user@ was at 23, 38 and 12 messages,
during the same timeframe.

[1] http://s.apache.org/Y7w
[2] http://sched.co/1pbkX7n

16 Jul 2014 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
The last release (1.5) was made in February 2014. Work has continued on version
1.6 with many bug fixes and new features, including many new file formats.
A discussion thread has started for a 1.6 release candidate.

Community
=========================
The Tika PMC added Lewis John McGibbney as a committer and PMC Member in June
2014.

A Tika Hackathon session took place at the ApacheCon NA 2014 conference, kicking
off improvements to our JAX-RS module. There were also presentations by
Annie Bryant, Nick Burch and Jukka Zitting as part of the main conference.

Mailing list activity on dev@ was at 298, 671 and 17 messages in May,
June and July 2014, respectively. user@ was at 15, 31 and 18 messages,
during the same timeframe.

16 Apr 2014 [Dave Meikle / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================

The last release (1.5) was made in February 2014.  Since then progress has
been steady on version 1.6 with a number of bug fixes and improvements.

Community
=========================
No new committers or PMC members were added since the last report.  Prior to
this the last new Committer and PMC Member was added in January 2014.

Tika is well represented at ApacheCon NA with four talks from three different
speakers (Jukka Zitting, Nick Burch and Annie Burgess). There are also plans to
conduct a couple of Hackathon sessions during the conference.

Mailing list activity on dev@ was at 293, 226 and 29 messages in February,
March and April 2014, respectively. user@ was at 26, 35 and 1 messages,
during the same timeframe.

15 Jan 2014 [Dave Meikle / Sam]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================

The last release (1.4) was made in July 2013. Work has progressed on version
1.5 with a number of bug fixes and improvements. A discussion thread is
underway on creating a version 1.5 release candidate.

Community
=========================
The Tika PMC has voted to add Hong-Thai Nguyen as a committer and PMC Member,
with ACK to board earlier this week.  Prior to this, the last committer and
PMC Member was added in July 2013.

Discussion has progressed around integrating Any23 components of value into
Tika. This is not in full swing yet however there is broad agreement on the
approach, with some initial patches being proposed and integrated.

Mailing list activity on dev@ was at 84, 155 and 28 messages in November,
December 2013 and January 2014, respectively. user@ was at 14, 10 and 0 messages,
during the same timeframe.

16 Oct 2013 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================

The last release (1.4) was made in July 2013 and work is currently underway
on version 1.5 with 23 issues currently resolved, comprising a mixture of
bug fixes and new features.

Community
=========================
The Tika PMC added Tim Allison as a committer and PMC Member in July 2013.

Chris Mattmann has won a National Science Foundation proposal for a project
at the University of Southern California to deliver an open source framework
for metadata exploration, automatic text mining and information retrieval of
polar data using Apache Tika[1].

Mailing list activity on dev@ was at 103, 86 and 29 messages in August,
September and October 2013, respectively. user@ was at 4, 18 and 0 messages,
during the same timeframe.

[1] http://s.apache.org/QqY

17 Jul 2013 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
Version 1.4 was released on the 2nd of July[1]. This release included
several important bugfixes and new features, including improvements to the
REST server and parser components.

Work is now underway on version 1.5.

Community
=========================
No new committers or PMC members were added since the last report, with both
the last committer and PMC member added in August 2012.

Mailing list activity on dev@ was at 126, 138 and 54 messages in May,
June and July 2013, respectively. user@ was at 7, 15 and 8 messages,
during the same timeframe.

[1] http://s.apache.org/7hB

17 Apr 2013 [Dave Meikle / Jim]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognised by IANA and other standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
Version 1.3 was released on the 22nd of January[1]. This release included
several important bugfixes and new features, including better handling of
embedded files.

Work is now underway on version 1.4 with 15 issues resolved and 20 open
to date.

Community
=========================
No new committers or PMC members were added since the last report.

We have added a potential new feature as part of the ASF's potential projects
within the Google Summer of Code program[2].

Mailing list activity on dev@ was at 174, 70 and 11 messages in February,
March and April 2013, respectively. user@ was at 53, 32 and 10 messages,
during the same timeframe.

[1] http://s.apache.org/PDH
[2] https://issues.apache.org/jira/browse/TIKA-605

16 Jan 2013 [Dave Meikle / Doug]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognised by IANA and other
standards bodies.

Issues
=========================
There are no issues that need the board's attention.

Releases
=========================
Work continues on version 1.3 with 47 resolved and 18 open/in progress
JIRA tickets adding new features and providing bug fixes, with a
discussion thread underway to assess the need for a new release.

Community
=========================
No new committers or PMC members were added since the last report.

Jukka Zitting presented a session at ApacheCon EU titled "Content
Extraction With Apache Tika" [1].

Mailing list activity on dev@ was at 124, 109 and 21 messages in
November, December 2012 and January 2013, respectively. user@ was at
19, 18 and 6 messages, during the same timeframe.

[1] http://s.apache.org/CUc

17 Oct 2012 [Dave Meikle / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognised by IANA and other
standards bodies.

Releases
=========================
We released version 1.2 on the 12th of July 2012[1]. This contained
new features such as the JAX-RS based network server and XMP metadata
handling, along with new file formats and parser improvements.

Work is currently underway on version 1.3 with 22 resolved and 19
open/in progress JIRA tickets, adding support for open graph metadata,
correct rounding of geodata information and improved mime type
detection for JPEG 2000 formats.

Community
=========================
The Tika PMC added Sergey Beryozkin (July 2012), Ingo Renner (July
2012) and Jörg Ehrlich (August 2012) as PMC members and committers.

As sponsor, the Tika PMC voted to recommend the graduation of the Any23
incubator project to a TLP. This passed[3] and following the Incubator
PMC vote, the board approved the graduation resolution.

The Tika PMC voted to recommend Dave Meikle as the new chair[4]. This
was accepted by the board in August 2012 and Chris has now handed
duties over to Dave.

Jukka Zitting is scheduled to speak about Tika at ApacheCon Europe.
The session is titled 'Content extraction with Apache Tika'[5] and
shows how Tika can be used with a Lucene or Solr search index.

Mailing list activity on dev@ was at 173, 61 and 7 messages in August,
September and October 2012, respectively. user@ was at 34, 31 and 1
messages, during the same timeframe.

[1] http://s.apache.org/Vzr
[2] http://s.apache.org/HoO
[3] http://s.apache.org/gHE
[4] http://s.apache.org/kBQ
[5] http://s.apache.org/lnR

15 Aug 2012

Change the Apache Tika Project Chair

 WHEREAS, the Board of Directors heretofore appointed Chris Mattmann
 to the office of Vice President, Apache Tika, and

 WHEREAS, the Board of Directors is in receipt of the resignation
 of Chris Mattmann from the office of Vice President, Apache Tika,
 and

 WHEREAS, the Project Management Committee of the Apache Tika
 project has chosen to recommend David Meikle as the successor
 to the post;

 NOW, THEREFORE, BE IT RESOLVED, that Chris Mattmann is relieved and
 discharged from the duties and responsibilities of the office
 of Vice President, Apache Tika, and

 BE IT FURTHER RESOLVED, that David Meikle be and hereby is
 appointed to the office of Vice President, Apache Tika, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification, or
 until a successor is appointed.

 Special Order 7A, Change the Apache Tika Project Chair, was
 approved by Unanimous Vote of the directors present.

25 Jul 2012 [Chris A. Mattmann / Sam]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information from, a
growing list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code, and science
data) as recognized by IANA and other standards bodies.

Releases
=========================
Progress towards the 1.2 release continues. There have been a few
recent threads discussing making an RC ([1] and [2]). We anticipate
the 1.2 RC and official release arriving in the next month or so.

The 1.2 RC addresses 63 issues [3] including new features (e.g.,
Tika JAX-RS network server [4]), bug fixes (e.g., misuse of HTTP
content-encoding header [5]) and an enhanced approach to dealing
with metadata key naming and representation [6] including XMP
support.

Community
=========================
The Tika PMC added Ray Gauss as a Tika PMC member and
committer in May 2012.

The Tika PMC is still sponsoring the Any23 incubator project [7],
which is progressing along nicely and getting ready to make their
first Incubator release.

Chris started a thread [8] on private to discuss potentially rotating
the chair. So far there hasn't been strong positive or negative reception
to this suggestion.

Mailing list activity on dev@ was at 174, 102 and 134 messages in May,
June and July 2012, respectively. user@ was at 26, 28 and 45 messages,
during the same timeframe.


[1] http://s.apache.org/MFq
[2] http://s.apache.org/AMZ
[3] http://s.apache.org/w9
[4] https://issues.apache.org/jira/browse/TIKA-593
[5] https://issues.apache.org/jira/browse/TIKA-431
[6] http://wiki.apache.org/tika/MetadataRoadmap
[7] http://incubator.apache.org/any23/
[8] http://s.apache.org/P7c


18 Apr 2012 [Chris A. Mattmann / Bertrand]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognized by IANA and other
standards bodies.

Releases
=========================
We released Tika 1.1 on 3/23/12.

The current work on Tika 1.2 includes 14 of 33 issues already fixed.
These issues include a cool new Tika JAX-RS network server [2] that
really helped foster good will between the Apache Tika and CXF
communities. Sergey Beryozkin from CXF, and Maxim Valyanskiy from
Tika really led the way. Besides the network server, MIME type
support was added for FITS [3], a scientific data file format used
heavily in the astronomy community. Also added in the current trunk
development branch were the ability to extract embedded images from
PowerPoint files [4] and improvements to the way that Tika loads
Detectors and Parsers in an OSGi environment [5].

There has been discussion of adding GDAL support to Tika [6], which
would give Tika the ability to parse and detect hundreds of spatial
formats.

Community
=========================
No new PMC members/committers were added in the last quarter.

The Tika PMC is still sponsoring the Any23 incubator project [7],
which is progressing along nicely and getting ready to make their
first Incubator release.

Mailing list activity on dev@ remained steady in January, February
and March 2012 (189, 125 and 200 messages, respectively) but slowed
in April 2012 (48 messages), while user@ activity remained
consistent in January, February and March 2012 (51, 46 and 37
messages). There have been no user questions in April 2012 yet.


[1] http://s.apache.org/dWz
[2] https://issues.apache.org/jira/browse/TIKA-593
[3] https://issues.apache.org/jira/browse/TIKA-874
[4] https://issues.apache.org/jira/browse/TIKA-883
[5] https://issues.apache.org/jira/browse/TIKA-884
[6] https://issues.apache.org/jira/browse/TIKA-605
[7] http://s.apache.org/gJ

24 Jan 2012 [Chris A. Mattmann / Roy]

What is Tika?
=========================
Apache Tika is a dynamic toolkit for content detection, analysis, and
extraction. It allows a user to understand, and leverage information
from, a growing list of over 1200 different file types including most
of the major types in existence (MS Office, Adobe, Text, Images,
Video, Code, and science data) as recognized by IANA and other
standards bodies.

Releases
=========================
We released Tika 1.0 on 11/7/11
[1] to coincide with ApacheCon NA 2011. Since then we've been
working on the 1.1 release, with 42 issues already resolved in JIRA
[2] and 1.1 likely to be shipped in the next quarter.

Community
=========================
We added Jerome Charron and
Antoni Mylka to the Tika PMC in November 2011.

At ApacheCon NA, and since, there have been some discussions regarding
Tika and the ODF Toolkit [3], as well as the Tika PMC's sponsorship
of the Any23 incubator project [4], which is progressing along.

Mailing list activity on dev@ remained steady in November and
December 2011 (259 and 287 messages, respectively) but slowed in
January 2012 (21 messages), while the user activity remained
consistent in November and December 2011 (47 and 65 messages), but
has been quiet in January 2012 (12 messages).

Chris Mattmann gave a talk at ApacheCon NA 2011 on "Apache Tika:
One Point Oh!" [5] commemorating the upcoming 1.0 Tika release.
Chris also had some discussions about Any23 and Tika with Lewis
John McGibbney at ApacheCon NA 2011. Lewis is a Nutch PMC member,
the candidate for Gora VP in the current board resolution, and
an Any23 Incubator committer.

Press
=========================
Chris, Jukka, and Sally have
published a press release [6] about Tika's use at NASA, as well as
its use at Day Software and other companies. The press release went
out during ApacheCon NA 2011, timed perfectly with the event.

Tika In Action
==========================
Chris Mattmann and Jukka
Zitting completed the Manning book called "Tika in Action" [7] in
time for ApacheCon NA 2011. The book is now available in print,
ebook and mobi editions. Yay!

[1] http://s.apache.org/AE6
[2] http://s.apache.org/HLX
[3] https://issues.apache.org/jira/browse/TIKA-737
[4] http://s.apache.org/gJ
[5] http://na11.apachecon.com/talks/19391
[6] http://s.apache.org/N0I
[7] http://manning.com/mattmann/

26 Oct 2011 [Chris A. Mattmann / Jim]

What is Tika?
=========================
Apache Tika is a dynamic
toolkit for content detection, analysis, and extraction. It allows
a user to understand, and leverage information from, a growing
list of over 1200 different file types including most of the major
types in existence (MS Office, Adobe, Text, Images, Video, Code,
and science data) as recognized by IANA and other standards bodies.

Releases
=========================
We rolled the Tika 0.10 release
on 9/30/2011 [1]. We opted for a 0.10 instead of 1.0 to try and
time the 1.0 release with ApacheCon. We've got a few weeks left and
are going to try and make it!

Community
=========================
We added Michael McCandless to the Tika PMC on August 29th, 2011.

The Tika PMC agreed to sponsor the Any23 (Anything to Triples)
Incubator project [3]. Any23 is a semantic understanding toolkit,
whose goal is to extract information from, to detect, and to
reason over most of the current semantic document formats including
RDF, OWL, etc. Any23 leveraged Tika in its existing framework at
Google Code, and we see the projects hopefully having a lot of synergy
going forward. Any23 was accepted into the Incubator on October 1,
2011 [4].

Mailing list activity on dev@ is growing (197, 356, and 186 in
August, September and October 2011, respectively), while the user
activity grew a little bit in August and September (70 and 66
messages), but has been relatively quiet in October (3 messages).

Chris Mattmann will give a talk at ApacheCon NA 2011 on "Apache
Tika: One Point Oh!" [5] commemorating the upcoming 1.0 Tika release.

Press
=========================
Chris and Jukka and Sally have
drafted a press release about Tika's use at NASA, as well as its
use at Day Software and other companies. The goal is to have this
coincide with the 1.0 release and ApacheCon NA 2011.

Tika In Action
==========================
Chris Mattmann and Jukka
Zitting are writing a Manning book called "Tika in Action" [6]. The
book is in its final copyediting stages and should be in print in
time for ApacheCon NA 2011.


[1] http://tika.apache.org/0.10/index.html
[2] http://s.apache.org/HzK
[3] http://s.apache.org/HWO
[4] http://s.apache.org/gJ
[5] http://na11.apachecon.com/talks/19391
[6] http://manning.com/mattmann/

20 Jul 2011 [Chris A. Mattmann / Jim]

Releases
=========================
Progress towards the 1.0 release
continues and we hope to roll a 1.0 in the Q3 timeframe.

Development activity is progressing, and there are a number of
issues being worked on via the user and dev lists. Notably, the Tika
command line interface now outputs JSON [1], and new document formats
are being added and/or improved: OLE2 and OOXML [2], the pcap
format [3], the CHM format [4], the PRT format [5] and some new
font parsers [6].

Based on dev discussion, Tika's MIME identifier is becoming more
prominently used in the Aperture project [7].

There was also some discussion regarding Tika's relationship with
Apache OOo [8].

Community
=========================
No new committers or PMC members were elected in this quarter.

Mailing list activity on dev@ remains steady (around 150 messages
per month), while user@ is coming in at around 50 messages per
month.

Chris Mattmann will give a talk at ApacheCon NA 2011 on "Apache
Tika: One Point Oh!" commemorating the upcoming 1.0 Tika release.

Chris Mattmann's CS572 Search Engines class students at USC are
doing final projects related to search technologies; one of
them is by Fernando Arreola [6], who is contributing new font
parsers as mentioned above.

Press
=========================
Chris and Jukka and Sally are coordinating with Priscilla Vega from JPL to
make a press release about Tika's use at NASA, as well as its use at Day
Software and other companies. The goal is to have this coincide with the
1.0 release and OSCON.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika
in Action", and it is now complete and handed off to production. All
chapters of the book and the appendices are now available on the MEAP page
[9]. The book is set to be published in the Q2/Q3 timeframe of 2011.

[1] https://issues.apache.org/jira/browse/TIKA-213
[2] https://issues.apache.org/jira/browse/TIKA-652
[3] https://issues.apache.org/jira/browse/TIKA-658
[4] https://issues.apache.org/jira/browse/TIKA-245
[5] https://issues.apache.org/jira/browse/TIKA-679
[6] http://s.apache.org/17Z
[7] http://s.apache.org/Z4r
[8] http://s.apache.org/3aC
[9] http://www.manning.com/mattmann/

20 Apr 2011 [Chris A. Mattmann / Jim]

Releases
=========================
We made our 0.9 release [1] in February 2011. This release included some
critical bug fixes including a fix that re-enabled extraction of metadata
from Scientific Data File Formats (HDF+NetCDF) from the command line. In
addition, the release included a patch [2] that significantly reduced the
number of pulled in dependencies having to do with NetCDF. Support for
parsing via external forking was also added [3]. See [4] for a full list of
changes.

Progress towards the 1.0 release continues and we hope to roll a 1.0 in the
Q2/Q3 timeframe.

Community
=========================

We added Oleg Tikhonov to the Tika PMC in April 2011 [5].

Mailing list activity on user@ and dev@ remains steady (around 100
messages).

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We completed a full draft of the
entire book, and Chapters 9 and 10 are now available on the MEAP page [6].
The book is set to be published in the Q2/Q3 timeframe of 2011.

[1] http://s.apache.org/1lE
[2] http://issues.apache.org/jira/browse/TIKA-596
[3] http://issues.apache.org/jira/browse/TIKA-556
[4] http://www.apache.org/dist/tika/CHANGES-0.9.txt
[5] http://s.apache.org/Jur
[6] http://www.manning.com/mattmann/

19 Jan 2011 [Chris A. Mattmann / Geir]

Releases
=========================
We made our 0.8 release [1] in November 2010. It's been
a long time coming, and there were over 98 JIRA issues [2]
addressed in the release. Work progresses towards a patch
release (either 0.8.1 or 0.9) and Chris Mattmann plans to RM
it and roll a release candidate hopefully in the next month.
This release should fix a number of smaller issues found after
folks have upgraded to 0.8.


Community
=========================

We added Maxim Valyanskiy to the Tika PMC in November 2010 [3].

Chris Mattmann gave a talk on Tika titled "Scientific Data
Curation and Processing with Apache Tika" [4] at ApacheCon NA
in November 2010 during the Lucene and friends session. The
morning talk was well attended and it was great to finally meet
everyone in person!


Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book
called "Tika in Action", and it is progressing steadily. We
completed our 2/3 book review and Chapters 1-8 of the book are
now available [5] through Manning's Early Access Program or MEAP.

[1] http://s.apache.org/W1Dh
[2] http://s.apache.org/73R
[3] http://s.apache.org/dj0
[4] http://s.apache.org/2ak
[5] http://www.manning.com/mattmann/

20 Oct 2010 [Chris A. Mattmann / Sam]

Releases
=========================
Progress towards a 0.8 release continues. There has been some recent
activity by Jukka Zitting [1] towards making it easy to use Apache Tika
in environments requiring Compressed RTF / TNEF / LZFU parsing. Along these
lines, a broader discussion [2] ensued about how to use Apache Tika with
parsing libraries that are GPL-licensed. It was determined that plugins
could be developed and hosted externally (e.g., at GitHub) and used with an
official Apache Tika release from the ASF just by dropping the plugin jars
onto a deployed version of Apache Tika's classpath. This provides a great
solution for folks who want to use Apache Tika in environments where they
need to parse formats which require non-ASLv2 friendly libraries.

We've had some nice documentation updates occur since the last report as
well. Arturo Beltran contributed a "get Tika parsing up and running in 5
minutes" quick start guide in TIKA-464 [3]. Nick Burch also added
documentation on the Container-aware Tika detection capabilities in TIKA-477
[4]. Paul Jakubik started a great wiki discussion on container-based
Metadata formats in [5], the first major use of the Tika wiki [6]. Chris
Mattmann has since done a bit of reorganizing of information, adding the
Tika logo to the wiki as well.

Other various notable development items include an RSS Feed Parser (TIKA-466
[7]) from Julien Nioche and Chris Mattmann, Container-aware Mime Detection
(TIKA-477 [4]) from Nick Burch, Jukka Zitting et al., and various Charset
improvements (e.g., TIKA-471 [8] and TIKA-529 [9]) from Ken Krugler.

Tika's website is now hooked up to SVNpubsub thanks to Jukka Zitting [10].

One of the blockers to the 0.8 release, getting NetCDF pushed to Maven
Central, has seen some good progress as of late. Chris Mattmann has created
a space for the NetCDF jars in the Sonatype OSS free hosting area that is
synced to Maven Central [11].


Community
=========================

There have been no new Tika PMC members or committers elected in this
quarter.

A podcast for Tika recorded in April 2010 between Chris Mattmann and Rich
Bowen was posted on the Feathercast.org website and mentioned on the Tika
mailing lists [12].

The Tika community was consulted for two website updates. The first site
update was to post a link and image of the Tika in Action book [13]. The
second site update was to allow for the selection of search engine provider
used to search the Tika website [14].

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We are currently doing our 2/3 book
review and Chapters 1-5 of the book are now available [15] through Manning's
Early Access Program or MEAP.

[1] http://github.com/jukka/jtnef
[2] http://s.apache.org/JGz
[3] https://issues.apache.org/jira/browse/TIKA-464
[4] https://issues.apache.org/jira/browse/TIKA-477
[5] http://wiki.apache.org/tika/MetadataDiscussion
[6] http://wiki.apache.org/tika/
[7] https://issues.apache.org/jira/browse/TIKA-466
[8] https://issues.apache.org/jira/browse/TIKA-471
[9] https://issues.apache.org/jira/browse/TIKA-529
[10] https://issues.apache.org/jira/browse/TIKA-473
[11] https://issues.apache.org/jira/browse/TIKA-407
[12] http://s.apache.org/Zmd
[13] http://s.apache.org/XUK
[14] http://s.apache.org/RgA
[15] http://manning.com/mattmann/

Shane appreciates the detail in this report.

21 Jul 2010 [Chris A. Mattmann / Shane]

Releases
=========================
There's been a bunch of development activity and mailing list activity in
Tika on a broad range of issues, from charset detection improvements
[1] from Ken Krugler, to boilerplate extraction [2] (also from Ken), to
improving HTML parsing overall [3] [4] [5] with contributions from Julien
Nioche, Ken Krugler, Jukka Zitting, and our new committer Nick Burch (see
Community section). There has also been a flurry of activity on new Parsers,
improving parsing detection with Timeouts, Geospatial information
representation, improving PDF parsing and a host of other issues that
indicate the development community of Tika is thriving.

We've slipped a bit on our original intention to release 0.8 in this month's
timeframe. One thing that will need to get solved is the upload of some
current external dependencies (e.g., NetCDF, Boilerpipe) to Maven Central.
Chris Mattmann and Ken Krugler volunteered to take the lead on this. 0.8
isn't far off, but it's dependent on getting those jars up to Maven Central.

Community
=========================
The Tika PMC elected to add Nick Burch [6] to the Tika PMC and committers
group. Nick is an ASF member, and the VP of Apache POI, an important
external library dependency of many of the Tika parsers. Welcome, Nick!

Chris Mattmann is teaching CS572: Information Retrieval and Web Search
Engines at USC this summer [7], and several of the final projects in his
class are related to Tika. There is a plan to contribute the final code
produced from some of the students in the form of JIRA issues and patches.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We have completed our 1/3 book
review and the book is now available [8] through Manning's Early Access
Program or MEAP.

IBM Developerworks Article on Tika
==================================
Chris Mattmann and Oleg Tikhonov published a short IBM Developerworks intro
article on Tika [9].

[1] http://issues.apache.org/jira/browse/TIKA-459
[2] http://issues.apache.org/jira/browse/TIKA-462
[3] http://issues.apache.org/jira/browse/TIKA-394
[4] http://issues.apache.org/jira/browse/TIKA-460
[5] http://issues.apache.org/jira/browse/TIKA-463
[6] MID: AANLkTinayUCu9pgv8p-ds3SFFMxjpCMLGkxhMTQiTFm7@mail.gmail.com
[7] http://sunset.usc.edu/classes/cs572_2010/
[8] http://manning.com/mattmann/
[9] http://s.apache.org/FXa

16 Jun 2010 [Chris A. Mattmann / Greg]

This is the second report from Apache Tika in its new TLP status approved at
the April 2010 board meeting.

TLP migration
=========================
All complete! Website has been updated, SVN taken care of, mailing lists migrated
and UNIX groups and domain taken care of. Thanks to Gavin for handling this.

Releases
=========================
Since our last report, we've committed some important bug fixes including
TIKA-379 [1], which allows HTML elements and attributes to be
available in the parsed XHTML provided by Tika, and some MIME type detection
fixes, in particular TIKA-417 [2]. We still think we are on target to release
0.8 within the next month or so.

License Issue
=========================
During the last board meeting, a question was raised regarding the
licensing of UCAR/NCAR's NetCDF Java library, which was used to
implement TIKA-400 netCDF Tika Parser [3]. The question concerned an
"advertising" clause in the UCAR/NCAR license, and the board
asked Chris to follow up on it. Chris took the issue over to legal-discuss@
and a resolution to the matter was reached [4], which included adding
some text to NOTICE.txt and LICENSE.txt in Tika, fulfilling the advertising
clause via the Apache NOTICE mechanism. This issue was tracked and fixed in
Tika in TIKA-432 [5], and is now solved in the current 0.8 trunk, removing a
roadblock to the 0.8 release.

Community
=========================
The Tika PMC elected to add Julien Nioche [6] to the Tika PMC and committers
group. Julien is a Nutch committer, and has been providing quality patches
to Tika (incl. the fix for TIKA-379) and good mailing list support.

Jukka Zitting presented on Tika at the Lucene Eurocon [7] and Berlin
Buzzwords [8] conferences.

Tika In Action
==========================
Chris Mattmann and Jukka Zitting are writing a Manning book called "Tika in
Action", and it is progressing steadily. We are preparing for a 1/3 book
review likely in the next week or so, and the book is about to be available
in Manning's Early Access Program or MEAP, electronically, any day now.

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-379
[2] http://issues.apache.org/jira/browse/TIKA-417
[3] http://issues.apache.org/jira/browse/TIKA-400
[4] http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201005.mbox/%3CAANLkTinXO4AZkYDm0L83SipPXl48kYC8enFRhamJCBvJ@mail.gmail.com%3E
[5] http://issues.apache.org/jira/browse/TIKA-432
[6] http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3CC83020F3.14F6B%25Chris.A.Mattmann@jpl.nasa.gov%3E
[7] http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
[8] http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630

19 May 2010 [Chris A. Mattmann / Jim]

This is the first report from Apache Tika in its new TLP status approved at
the April 2010 board meeting.

TLP migration
=========================
Progress is being made in transitioning Tika off of the Lucene site and
infrastructure and into its new TLP home. The current status is:

(Gavin from the INFRA team grouped the below into a TLP migration task [1])

1. mailing lists migration, filed INFRA-2645 [2], Gavin working on it as of
last Saturday

2. SVN, filed INFRA-2646 [3], status is same as for #1

3. UNIX groups, and home on www.apache.org/dist [4], status same as #1

4. Domain name for tika.apache.org [5], status same as #1

Releases
=========================
Steady progress on the 0.8 Tika release is being made, with contributions
made to add more file formats (netCDF is a recent addition), to allow Tika
to be used in server environments with their own classloaders (like Apache
SOLR), and to improve image metadata extraction. We hope to release 0.8
within the next month or so.

Community
=========================
The Tika community reached out to the NetCDF community in order to get their
NetCDF jars released to Maven central (currently TIKA-400 [6] relies on
NetCDF jars from an external Maven2 repository that's not synced with
Central). Jukka Zitting pointed out that this isn't a best practice, so we
are working with the NetCDF community to resolve this, and they have been
receptive [7], in particular John Caron from UCAR/NCAR is working to help us
out.

[1] https://issues.apache.org/jira/browse/INFRA-2692
[2] https://issues.apache.org/jira/browse/INFRA-2645
[3] https://issues.apache.org/jira/browse/INFRA-2646
[4] https://issues.apache.org/jira/browse/INFRA-2647
[5] https://issues.apache.org/jira/browse/INFRA-2676
[6] https://issues.apache.org/jira/browse/TIKA-400
[7] http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/201004.mbox/%3C4BD758B7.60202@unidata.ucar.edu%3E

The thredds/cdm dependency appears to have an "advertising" clause. To be followed up on legal-discuss.

21 Apr 2010

Establish the Apache Tika Project

 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to content detection and analysis for
 distribution at no charge to the public.

 NOW, THEREFORE, BE IT RESOLVED, that a Project Management
 Committee (PMC), to be known as the "Apache Tika Project",
 be and hereby is established pursuant to Bylaws of the
 Foundation; and be it further

 RESOLVED, that the Apache Tika Project be and hereby is
 responsible for the creation and maintenance of software
 related to a content analysis and detection toolkit; and be it further

 RESOLVED, that the office of "Vice President, Apache Tika" be
 and hereby is created, the person holding such office to
 serve at the direction of the Board of Directors as the chair
 of the Apache Tika Project, and to have primary responsibility
 for management of the projects within the scope of
 responsibility of the Apache Tika Project; and be it further

 RESOLVED, that the persons listed immediately below be and
 hereby are appointed to serve as the initial members of the
 Apache Tika Project:

   * Chris A. Mattmann (mattmann@apache.org)
   * Jukka Zitting (jukka@apache.org)
   * Ken Krugler (kkrugler@apache.org)
   * Keith Bennett (kbennett@apache.org)
   * Mark Harwood (mharwood@apache.org)
   * Dave Meikle (dmeikle@apache.org)
   * Sami Siren (siren@apache.org)
   * Rida Benjelloun (ridabenjelloun@apache.org)

 NOW, THEREFORE, BE IT FURTHER RESOLVED, that Chris A. Mattmann
 be appointed to the office of Vice President, Apache Tika, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification,
 or until a successor is appointed; and be it further

 RESOLVED, that the Apache Tika Project be and hereby
 is tasked with the migration and rationalization of the Apache
 Lucene Tika sub-project; and be it further

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Tika sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

 Special Order 7C, Establish the Apache Tika Project, was
 approved by Unanimous Vote of the directors present.

15 Oct 2008

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community

 * Dave Meikle was just voted in as a new committer.
 * Paolo Mottadelli will present Tika at ApacheCon US.

Development

 * Tika 0.2 should be released soon.
 * Usage documentation has been added to the website.

Issues before graduation:

 * The current plan is to graduate as a Lucene subproject, which could
happen soon as the incubation criteria seem to be met.

16 Jul 2008

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community

 * Tika community remains relatively small, with just a handful of active
members

Development

 * Work towards Tika 0.2 continues; Chris Mattmann has volunteered to be the
release manager

Issues before graduation:

 * Increase the size and diversity of the community (or graduate into a
Lucene subproject?)

16 Apr 2008

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community

 * Niall Pemberton joined the project as a committer and PPMC member
 * The number of issues reported by external contributors is growing
gradually
 * There was a Fast Feather Talk on Tika in ApacheCon EU 2008
 * We have good contacts especially with Apache POI and PDFBox

Development

 * We are working towards Tika 0.2
 * Metadata handling improvements are being discussed

Issues before graduation:

 * Increase the size of the community

17 Oct 2007

Tika is a toolkit for detecting and extracting metadata and structured text
content from various documents using existing parser libraries. Tika entered
incubation on March 22nd, 2007.

Community

There have been a number of positive items within Tika during the last few
months. The traffic on the Tika mailing list has increased significantly
(with typically 2 or 3 questions and 1 or 2 commits every day or every other
day), and there have been a lot of recent inquiries from external projects
wanting to collaborate with Tika (including Aperture, PDFBox and a fellow
developing a JSON library currently hosted at Google Code). In addition,
Tika's architecture has become a recent topic of discussion (as we'll see
below).

We recently elected Keith Bennett as a new committer to Tika. Keith has been
spearheading many of the new patches committed to Tika, as well as
participating in discussions about the architecture, and future direction of
the project.

Tika will be represented at the "Fast Feather" track at ApacheCon US by
Jukka Zitting. The rest of the community is helping to create the content
for the presentation. The abstract is listed below:

Tika is a new content analysis framework born from the desire to factor out
commonality from the Apache Nutch search engine framework. Tika provides a
mime detection framework, an extensible parsing framework and metadata
environment for content analysis. Though in its nascent stages, progress on
Tika has recently taken shape and the project is nearing a stable 0.1
release. In this talk, we'll describe the core APIs of Tika and discuss its
use in several distinct domains including search engines, scientific data
dissemination and an industrial setting.


Development

There has been a flurry of JIRA issues and code activity [1] (including 47
issues currently in JIRA, with 32 resolved issues, 14 closed issues, and 2
open major/minor issues in progress).

Tika's Parser interface (one of its key components) has just undergone a
major overhaul led by Jukka Zitting, and Chris Mattmann has recently
contributed a MimeType system (with help from fellow Apache Nutch committer
Jerome Charron) to Tika. We also cleaned up and refactored large parts of
the rest of the code (removing references to LiusLite and branding the
project wherever possible with the Tika name), in preparation for an
upcoming 0.1 release.

Chris Mattmann has led an effort to carve out the existing MimeType
detection system in Apache Nutch [2] and replace it with Tika's improved
MimeType detection system. There is a patch sitting in JIRA right now [3],
and barring objections, Nutch will rely on Tika for its MimeType detection
abilities.

Also active recently were committers Bertrand Delacretaz, Sami Siren and
Rida Benjelloun, committing patches and improvements wherever needed.


Issues before graduation

No changes since our last report: the Tika project is still at an early
stage of incubation. We need to continue bringing in the initial codebases
and are targeting an initial incubating release (0.1) probably within the
next month. We also need to work on growing the community and figuring out
how to best interact with external parser projects.


[1] http://issues.apache.org/jira/browse/TIKA
[2] http://lucene.apache.org/nutch/
[3] http://issues.apache.org/jira/browse/NUTCH-562

18 Jul 2007

Tika is a toolkit for detecting and extracting metadata and structured text
content from various document formats using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community:
 * The Tika mailing list has seen increased activity in the last weeks,
with some new people showing interest in Tika's goals.
 * Grant Ingersoll brought the Aperture framework to our attention
(http://aperture.sourceforge.net/), which has similar goals to Tika. We will
look at possible synergies.

Development:
 * No code has been committed since our last report, but some initial code
is ready in JIRA and should be committed soon.

Issues before graduation:
 * No changes since our last report: the Tika project is still at an early
stage of incubation. We need to continue bringing in the initial codebases
and probably target an initial incubating release later this year. We
also need to work on growing the community and figuring out how to best
interact with external parser projects.

20 Jun 2007

Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser libraries.
Tika entered incubation on March 22nd, 2007.

Community

The Tika mailing lists have been relatively quiet lately, probably
because with little code we don't yet have many concrete issues to talk
about.

Development

We saw the first piece of Tika code when Chris A. Mattmann ported the
Nutch metadata framework to Tika. Rida Benjelloun has created a version
of the Lius codebase to be included in Tika, and the code is currently
in the issue tracker.

Issues before graduation

The Tika project is still at an early stage of incubation. We need to
continue bringing in the initial codebases and probably target an
initial incubating release later this year. We also need to work on
growing the community and figuring out how to best interact with
external parser projects.

16 May 2007

Tika is a toolkit for detecting and extracting metadata and structured text
content from various documents using existing parser libraries.

Incubating since: March 22nd, 2007.

__Community__

We had a good project bootstrap meeting as a part of the text analysis BOF
at the ApacheCon EU in Amsterdam. The resulting ideas were summarized on the
project mailing list, and the first design threads have started.

__Development__

We've started discussing the design of the Tika toolkit. It seems like we
will select one of the existing codebases listed in the project proposal as
the basis of an early 0.1 release, and start refactoring the code into a
more generic toolkit. The Tika svn tree is still empty, but I expect us to
see the first code commits before the next report.

__Infrastructure__

All the initial infrastructure is now in place. There is still some activity
on the temporary Tika wiki on the Google Project hosting service, so we may
end up requesting a Tika wiki to be set up on the ASF infrastructure.

__Issues before graduation__

The Tika project is still at an early stage of incubation. The most
important tasks before graduation are to develop and release the Tika
codebase and to grow a diverse and sustainable project community.

25 Apr 2007

iPMC Reviewers: rdonkin

Tika is a toolkit for detecting and extracting metadata and structured text
content from various documents using existing parser libraries. Tika entered
incubation on March 22nd, 2007.

The Tika project has just started. The basic infrastructure (mailing lists,
subversion, issue tracker, web site) is mostly in place; the only thing
still missing is one committer account. We expect to get started with the
actual design and code work during the next few weeks.