B Oracle Text Supported Document Formats

This appendix lists the document formats supported by the automatic (AUTO_FILTER) filtering technology.

B.1 About Document Filtering Technology

The automatic filtering technology in Oracle Text uses the HTML Export technology provided by Oracle Outside In. This technology also enables you to convert documents to HTML for document presentation with the CTX_DOC package.

To use automatic filtering for indexing and DML processing, you must specify the AUTO_FILTER object in your filter preference.

To use automatic filtering technology for converting documents to HTML with the CTX_DOC package, you need not use the AUTO_FILTER indexing preference, but you must still set up your environment to use this filtering technology, as described in this appendix.

Note:

The underlying technology used by Oracle Text was migrated to Oracle Outside In HTML Export in release 11.1.0.7. See "Formats No Longer Supported in 11.1.0.7" for a list of formats that are no longer supported as a result of this migration. Applications that require support for those formats can use USER_FILTER to plug in third-party filtering technology supporting those formats. See "USER_FILTER" for more information.

B.1.1 Latest Updates for Patch Releases

The supported platforms and formats listed in this appendix apply to this release. The list of supported formats is updated for patch releases.

B.1.2 Restrictions on Format Support

Password-protected documents and documents with password-protected content are not supported by the AUTO_FILTER filter.

For other limitations, refer to the sections in this appendix concerning specific document types.

B.1.3 Supported Platforms for AUTO_FILTER Document Filtering Technology

Several platforms can take advantage of AUTO_FILTER filter technology.

B.1.3.1 Supported Platforms

AUTO_FILTER filter technology is supported on the following platforms:

  • Windows (x86 32-bit) Windows 2000, Windows 2003, Windows XP, and Windows Vista

  • Windows (Itanium 64-bit) Windows .Net Server 2003 Enterprise Edition

  • Windows (x86 64-bit) Windows 2003 x64 Standard, Enterprise, and Datacenter Editions (64-bit Extended Systems)

  • HP-UX (PA-RISC 64-bit) 11i

  • HP-UX (Itanium 64) 11i

  • IBM AIX (32-bit pSeries) 5.1 - 5.3

  • iSeries (OS/400 using PASE) V5R2

  • Red Hat Linux (x86) Advanced Server 3, 4, and 5

  • Red Hat Linux (x86) Red Hat Enterprise Linux (RHEL) 4

  • Red Hat Linux (Itanium 64) Advanced Server 3, 4, and 5

  • Red Hat Linux (zSeries, 31-bit) Advanced Server 3 and 4

  • Red Hat Enterprise Linux AS/ES 3.0, 4.0, and 5.0, x86-64 (AMD64/EM64T)

  • Oracle Enterprise Linux 4.0 and 5.0, x86-64 (AMD64/EM64T)

  • SuSE Linux (x86) 9, 10, and Enterprise Server 9.0

  • SuSE Linux (x86 64-bit) SUSE Enterprise Server (SLES) 9, 10

  • SuSE Linux (Itanium 64) Enterprise Server 8

  • SuSE Linux (zSeries, 31-bit) 9

  • Sun Solaris (SPARC 64-bit) 9.x - 10.x

  • Sun Solaris (x86 64-bit) 10.x

Note that some of these platforms may not be supported by the Oracle Database.

B.1.4 Filtering on PDF Documents and Security Settings

A PDF document can have different levels of security settings as follows:

Table B-1 AUTO_FILTER Behavior with PDF Security Settings

Security Level | Description | PDF Version | Encryption | AUTO_FILTER Support Level
Level 1 | Requires a password for opening the document. | 1.2+ | 40 bit RC4 | Not supported.
Level 1 | | 1.4+ | 128 bit RC4 | Not supported.
Level 1 | | 1.5+ | 128 bit RC4 | Not supported.
Level 1 | | 1.6+ | 128 bit AES | Not supported.
Level 1 | | 1.7+ | 256 bit AES | Not supported.
Level 2 | Disallows user printing of the document. | 1.2+ | 40 bit RC4 | Supported.
Level 2 | | 1.4+ | 128 bit RC4 | Supported.
Level 2 | | 1.5+ | 128 bit RC4 | Supported.
Level 2 | | 1.6+ | 128 bit AES | Not supported.
Level 2 | | 1.7+ | 256 bit AES | Not supported.
Level 3 | Disallows user modification or change of the document. | 1.2+ | 40 bit RC4 | Supported.
Level 3 | | 1.4+ | 128 bit RC4 | Supported.
Level 3 | | 1.5+ | 128 bit RC4 | Supported.
Level 3 | | 1.6+ | 128 bit RC4 | Not supported.
Level 3 | | 1.7+ | 256 bit AES | Not supported.
Level 4 | Disallows the user from copying or extracting content from the document. | 1.2+ | 40 bit RC4 | Supported.
Level 4 | | 1.4+ | 128 bit RC4 | Supported.
Level 4 | | 1.5+ | 128 bit RC4 | Supported.
Level 4 | | 1.6+ | 128 bit AES | Not supported.
Level 4 | | 1.7+ | 256 bit AES | Not supported.
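The support matrix in Table B-1 can be written as a small lookup. The following Python sketch is illustrative only: the names SUPPORT and auto_filter_supports are hypothetical and not part of Oracle Text, and the data simply restates the rows of the table.

```python
# Rows of Table B-1, keyed by security level and by the PDF version of the row
# (written without the trailing "+"). True means the table says "Supported".
_RESTRICTED = {"1.2": True, "1.4": True, "1.5": True, "1.6": False, "1.7": False}

SUPPORT = {
    1: {version: False for version in _RESTRICTED},  # open password: never supported
    2: dict(_RESTRICTED),  # printing disallowed
    3: dict(_RESTRICTED),  # modification disallowed
    4: dict(_RESTRICTED),  # copying or extraction disallowed
}


def auto_filter_supports(level: int, pdf_version: str) -> bool:
    """Return the Table B-1 entry for a security level and PDF version row."""
    try:
        return SUPPORT[level][pdf_version]
    except KeyError:
        raise ValueError(
            f"no Table B-1 row for level {level}, PDF version {pdf_version}"
        ) from None
```

For example, auto_filter_supports(2, "1.4") returns True, while any Level 1 row returns False.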


B.1.5 PDF Filtering Limitations

The following limitations apply when filtering PDF files:

  • Multi-byte PDFs are supported, provided the PDF document is created using Character ID-keyed (CID) fonts, predefined CJK CMap files, or ToUnicode font encodings, and the document does not contain embedded fonts.

  • Embedded fonts in a PDF document are not filtered correctly. They are usually displayed using the question mark (?) replacement character.

  • Hyperlinks in a PDF are not active when displayed in a browser or a viewing window.

  • Annotations, such as notes, sound, or movies, are not supported.

B.1.6 Environment Variables

No environment variables need to be set by the user.

B.1.7 General Limitations

AUTO_FILTER filter technology has the following limitations:

  • Any ASCII characters less than 0x20 (decimal 32) are converted to hexadecimal numbers.

  • Files larger than 2 GB are not handled.
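A pre-check along the following lines can flag documents that would run into these limitations before they are loaded. This is a minimal Python sketch under stated assumptions: the helper names are hypothetical, and the size limit is read as 2 * 1024**3 bytes, which the text does not confirm.

```python
import os

# Assumption: the 2 GB limit is interpreted as 2 * 1024**3 bytes. The source
# does not say whether the limit is binary or decimal.
MAX_FILTER_BYTES = 2 * 1024**3


def exceeds_size_limit(path: str) -> bool:
    """Return True if the file is larger than the assumed 2 GB limit."""
    return os.path.getsize(path) > MAX_FILTER_BYTES


def low_ascii_bytes(data: bytes) -> list[int]:
    """Return the distinct byte values below 0x20 found in data, sorted."""
    return sorted({value for value in data if value < 0x20})
```

A non-empty result from low_ascii_bytes indicates characters that the filter would convert to hexadecimal numbers. Note that the range below 0x20 includes tab, line feed, and carriage return.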

B.2 Supported Document Formats

The tables in this section list the document formats that Oracle Text supports for filtering.

Document filtering is used for indexing, for DML processing, and for converting documents to HTML with the CTX_DOC package.

Note:

These lists do not cover every format that Oracle Text can process. The USER_FILTER and PROCEDURE_FILTER filters enable Oracle Text to process any document format, provided that an external filter exists that can convert the document to a textual format such as plain text, HTML, or XML.

B.2.1 Word Processing and Desktop Publishing Formats

Format Version
Adobe FrameMaker (MIF) Versions 3.0, 4.0, 5.0, and 6.0 and Japanese 3.0, 4.0, 5.0, and 6.0 (text only)
ANSI Text 7 and 8 bit
ASCII Text 7 and 8 bit
DEC WPS Plus (DX) Versions through 3.1
DEC WPS Plus (WPL) Versions through 4.1
DisplayWrite 2 and 3 (TXT) All versions
EBCDIC All versions
Enable Versions 3.0, 4.0, and 4.5
First Choice Versions through 3.0
Framework Version 3.0
Hangul Versions 97, 2002, and 2005
IBM FFT All versions
IBM Revisable Form Text All versions
IBM Writing Assistant Version 1.01
Just System Ichitaro Versions 4.x through 6.x, 8.x through 13.x and 2004
JustWrite Versions through 3.0
Legacy Versions 1.1
Lotus AMI/AMI Professional Versions 3.1
Lotus Manuscript Version 2.0
Lotus Notes DXL All versions
Lotus Notes NSF All versions (File ID support only)
Lotus Word Pro (non-Windows) Versions SmartSuite 97, Millennium, and Millennium 9.6 (text only)
Lotus Word Pro (Windows) Versions SmartSuite 96, 97, and Millennium and Millennium 9.6
MacWrite II Version 1.1
MASS11 Versions through 8.0
Microsoft Rich Text Format (RTF) All versions
Microsoft Word (DOS) Versions through 6.0
Microsoft Word (Mac) Versions 4.0 - 2004
Microsoft Word (Windows) Versions through 2007
Microsoft WordPad All versions
Microsoft Works (DOS) Versions through 2.0
Microsoft Works (Mac) Versions through 2.0
Microsoft Works (Windows) Versions through 4.0
Microsoft Windows Write Versions through 3.0
MultiMate Versions through 4.0
Navy DIF All versions
Nota Bene Version 3.0
Novell Perfect Works Version 2.0
Novell/Corel WordPerfect (DOS) Versions through 6.1
Novell/Corel WordPerfect (Mac) Versions 1.02 through 3.0
Novell/Corel WordPerfect (Windows) Versions through 12.0
Office Writer Versions 4.0 - 6.0
OpenOffice Writer (Windows and UNIX) OpenOffice version 1.1 and 2.0
PC-File Letter Versions through 5.0
PC-File+ Letter Versions through 3.0
PFS:Write Versions A, B, and C
Professional Write Plus (Windows) Version 1.0
Q&A (DOS) Version 2.0
Q&A Write (Windows) Version 3.0
Samna Word Versions through Samna Word IV+
Signature Version 1.0
SmartWare II Version 1.02
Sprint Versions through 1.0
StarOffice Writer Version 5.2 (text only) and 6.x through 8.x
Total Word Version 1.2
Unicode Text All versions
UTF-8 All versions
Volkswriter 3 and 4 Versions through 1.0
Wang PC (IWP) Versions through 2.6
WordMARC Versions through Composer Plus
WordStar (Windows) Version 1.0
WordStar 2000 (DOS) Versions through 3.0
XyWrite Versions through III Plus

B.2.2 Spreadsheet Formats

Format Version
Enable Versions 3.0, 4.0, and 4.5
First Choice Versions through 3.0
Framework Version 3.0
Lotus 1-2-3 (DOS & Windows) Versions through 5.0
Lotus 1-2-3 (OS/2) Versions through 2.0
Lotus 1-2-3 Charts (DOS & Windows) Versions through 5.0
Lotus 1-2-3 for SmartSuite Versions 97 - Millennium 9.6
Lotus Symphony Versions 1.0, 1.1, and 2.0
Lotus Symphony (Documents, Presentations, Spreadsheets) Version 1.2
Mac Works Version 2.0
Microsoft Excel Charts Versions 2.x - 7.0
Microsoft Excel (Mac) Versions 3.0 - 4.0, 98, 2001, 2002, 2004, and v.X
Microsoft Excel (Windows) Versions 2.2 through 2007
Microsoft Multiplan Version 4.0
Microsoft Works (Windows) Versions through 4.0
Microsoft Works (DOS) Versions through 2.0
Microsoft Works (Mac) Versions through 2.0
Mosaic Twin Version 2.5
Novell Perfect Works Version 2.0
PFS:Professional Plan Version 1.0
Quattro Pro (DOS) Versions through 5.0 (text only)
Quattro Pro (Windows) Versions through 12.0 (text only)
SmartWare II Version 1.02
StarOffice/OpenOffice Calc (Windows and UNIX) StarOffice versions 5.2 (text only) through 8.x and OpenOffice versions 1.1 and 2.0
SuperCalc 5 Version 4.0
VP Planner 3D Version 1.0

B.2.3 Presentation Formats

Format Version
Corel/Novell Presentations Versions through 12.0
Harvard Graphics (DOS) Versions 2.x and 3.x
Harvard Graphics (Windows) Windows versions
Freelance (Windows) Versions through Millennium 9.6
Freelance (OS/2) Versions through 2.0
Microsoft PowerPoint (Windows) Versions 3.0 through 2007; Versions 97 - 2003 (support for read-only files)
Microsoft PowerPoint (Mac) Versions 4.0 through v.X; Versions 97 - 2003 (support for read-only files)
StarOffice/OpenOffice Impress (Windows and UNIX) StarOffice versions 5.2 (text only) and 6.x through 8.x (full support) and OpenOffice versions 1.1 and 2.0 (text only)

B.2.4 Database Formats

Format Version
Access Versions through 2.0
dBASE Versions through 5.0
DataEase Version 4.x
dBXL Version 1.3
Enable Versions 3.0, 4.0, and 4.5
First Choice Versions through 3.0
FoxBase Version 2.1
Framework Version 3.0
Microsoft Works (Windows) Versions through 4.0
Microsoft Works (DOS) Versions through 2.0
Microsoft Works (Mac) Versions through 2.0
Paradox (DOS) Versions through 4.0
Paradox (Windows) Versions through 1.0
Personal R:BASE Version 1.0
R:BASE 5000 Versions through 3.1
R:BASE System V Version 1.0
Reflex Version 2.0
Q & A Versions through 2.0
SmartWare II Version 1.02

B.2.5 Archive File Format

When an archive file is filtered, the contents of every file in the archive, including files in subfolders, are exported to a single output file.
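The single-output behavior described above can be pictured with a short Python sketch. It illustrates the idea only and is not how AUTO_FILTER is implemented: the function name flatten_zip is hypothetical, only ZIP archives are handled, and every member is treated as UTF-8 text, whereas a real filter would process each member according to its own format.

```python
import io
import zipfile


def flatten_zip(archive_bytes: bytes) -> str:
    """Concatenate the text of every file in a ZIP archive into one string."""
    parts = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content of their own
            # Assumption: members are plain UTF-8 text.
            parts.append(archive.read(info).decode("utf-8", errors="replace"))
    return "\n".join(parts)
```

Files in subfolders appear in the output in the same way as top-level files, because a ZIP archive stores them as ordinary members whose names contain a path.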

Table B-2 lists the archive formats that Oracle Text supports.

Table B-2 Supported Archive File Formats

Format | Version
GZIP | -
Microsoft Binder | Versions 7.0 - 97 (conversion of files contained in the Binder File is supported only on Windows)
UUEncode | -
UNIX Compress | -
UNIX Tar | -
ZIP | PKWARE versions through 2.04g
LZA Self-Extracting Compress | -
LZH Compress | -

B.2.6 Email Formats

Format Version
Microsoft Outlook Folder (PST) Microsoft Outlook Folder and Microsoft Outlook Offline Folder files versions 97, 98, 2000, 2002, 2003, and 2007
Microsoft Outlook Message (MSG) Microsoft Outlook Message and Microsoft Outlook Form Template versions 97, 98, 2000, 2002, 2003, and 2007
MIME MIME-encoded mail messages.

B.2.6.1 MIME Support Notes

The following formats are supported:

  • MIME formats

    • EML

    • MHT (Web Archive)

    • NWS (Newsgroup single-part and multi-part)

    • Simple Text Mail (defined in RFC 2822)

  • TNEF format

  • MIME encodings, including

    • base64 (defined in RFC 1521)

    • binary (defined in RFC 1521)

    • binhex (defined in RFC 1741)

    • btoa

    • quoted-printable (defined in RFC 1521)

    • utf-7 (defined in RFC 2152)

    • uue

    • xxe

    • yenc

In addition, the body of a message can be encoded in several ways. The following encodings are supported:

  • HTML

  • RTF

  • TNEF

  • Text/enriched (defined in RFC 1523)

  • Text/richtext (defined in RFC 1341)

  • Embedded mail message (defined in RFC 822) - this is handled as a link to a new message

The attachments of a MIME message can be stored in many formats. Oracle Text processes all attachment types that its filtering technology supports.
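To see which body encodings and attachment types a given message contains, the message can be inspected before it is indexed. The following Python sketch uses the standard library email package. The function name list_parts is hypothetical, and the sketch only reports what the message declares. It does not reproduce Oracle Text filtering.

```python
from email import message_from_bytes, policy


def list_parts(raw: bytes) -> list[tuple[str, str]]:
    """Return (content type, transfer encoding) for each leaf part of a message."""
    message = message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # containers only group other parts
        # A part without this header defaults to 7bit.
        encoding = part.get("Content-Transfer-Encoding", "7bit")
        parts.append((part.get_content_type(), str(encoding).lower()))
    return parts
```

Each tuple can then be compared against the format and encoding lists in this section to decide whether a part is likely to be filtered.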

B.2.7 Other Formats

Format Version
ASF (subformats: WMA/WMV/DVR-MS) Metadata extraction only
Executable (EXE, DLL) -
HTML Versions through 3.0, with some limitations
IBM Lotus Notes DXL All versions
IBM Lotus Notes NSF All versions (File ID support only)
ISO Base Media (subformats: QuickTime/MPEG-4/MPEG-7) Media extraction only
Macromedia Flash Macromedia Flash 6.x, Macromedia Flash 7.x, and Macromedia Flash Lite (text only)
Microsoft Office 2007 (support for SmartArt created using SP2 MSO)
Microsoft Office 2008 for Mac (Word, PowerPoint, Excel)
Microsoft Project Versions 98 - 2003 (text only)
Microsoft Project Version 2007 (File ID support only)
Microsoft Publisher 2003/2007 (File ID support only)
Microsoft XPS Metadata extraction only
MP3 ID3 information
RIFF (subformats: WAV/AVI) Metadata extraction only
RPIX File ID support only
vCard, vCalendar Version 2.1
Windows Executable -
WML Version 5.2
XML Text only
Yahoo Instant -

B.2.8 Graphic Formats

Table B-3 lists the graphic formats that the AUTO_FILTER filter recognizes. Indexing a text column that contains any of these formats produces no error, so it is safe for the column to contain them.

Formats are categorized as either embedded graphics or standalone graphics. Embedded graphics are inserted or referenced within a document.

Note:

The AUTO_FILTER filter cannot extract textual information from graphics.

Table B-3 Supported Graphics Formats for AUTO_FILTER Filter

Graphics Format | Version
Adobe Photoshop (PSD) | Version 4.0
Windows Icon Cursor (ICO) | No specific version
Adobe Illustrator | Versions 7.0 and 9.0
Adobe FrameMaker graphics (FMV) | Vector/raster through 5.0
Adobe Acrobat (PDF) | Versions 1.0, 2.1, 3.0, 4.0, 5.0, 6.0, and 7.0 (including Japanese PDF); Versions 1.6 and 1.7 (PDF packages and portfolios are not supported)
Ami Draw (SDW) | Ami Draw
AutoCAD Interchange and Native Drawing formats (DXF and DWG) | AutoCAD Drawing Versions 2.5 - 2.6, 9.0 - 14.0, 2000i and 2002
AutoShade Rendering (RND) | Version 2.0
Binary Group 3 Fax | All versions
Bitmap (BMP, RLE, ICO, CUR, OS/2 DIB, and WARP) | All versions
CALS Raster (GP4) | Type I and Type II
Corel Clipart format (CMX) | Versions 5 through 6
Corel Draw (CDR) | Versions 3.x - 8.x
Corel Draw (CDR with TIFF header) | Versions 2.x - 9.x
Computer Graphics Metafile (CGM) | ANSI, CALS NIST version 3.0
Encapsulated PostScript (EPS) | TIFF header only
GEM Paint (IMG) | All versions
Graphics Environment Mgr (GEM) | Bitmap and vector
Graphics Interchange Format (GIF) | All versions
Hewlett Packard Graphics Language (HPGL) | Version 2.0
IBM Graphics Data Format (GDF) | Version 1.0
IBM Picture Interchange Format (PIF) | Version 1.0
Initial Graphics Exchange Spec (IGES) | Version 5.1
JBIG2 | JBIG2 graphic embeddings in PDF files
JFIF (JPEG not in TIFF format) | All versions
JPEG (including EXIF) | All versions
Kodak Flash Pix (FPX) | All versions
Kodak Photo CD (PCD) | Version 1.0
Lotus PIC | All versions
Lotus Snapshot | All versions
Macintosh PICT1 and PICT2 | Bitmap only
MacPaint (PNTG) | All versions
Micrografx Draw (DRW) | Versions through 4.0
Micrografx Designer (DRW) | Versions through 3.1
Micrografx Designer (DFS) | Windows 95, version 6.0
Novell PerfectWorks (Draw) | Version 2.0
OS/2 PM Metafile (MET) | Version 3.0
Paint Shop Pro 6 (PSP) | Windows only, versions 5.0 - 6.0
PC Paintbrush (PCX and DCX) | All versions
Portable Bitmap (PBM) | All versions
Portable Graymap (PGM) | No specific version
Portable Network Graphics (PNG) | Version 1.0
Portable Pixmap (PPM) | No specific version
PostScript (PS) | Levels 1-2
Progressive JPEG | No specific version
Sun Raster (SRS) | No specific version
StarOffice/OpenOffice Draw for Windows and UNIX | StarOffice versions 5.2 (text only) through 8.x and OpenOffice versions 1.1 and 2.0
TIFF | Versions through 6
TIFF CCITT Group 3 and 4 | Versions through 6
Truevision TGA (TARGA) | Version 2
Visio (preview) | Version 4
Visio | Versions 5, 2000, 2002, and 2003
Visio 2007 | File ID support only
WBMP | No specific version
Windows Enhanced Metafile (EMF) | No specific version
Windows Metafile (WMF) | No specific version
WordPerfect Graphics (WPG and WPG2) | Versions through 2.0
X-Windows Bitmap (XBM) | x10 compatible
X-Windows Dump (XWD) | x10 compatible
X-Windows Pixmap (XPM) | x10 compatible


B.2.8.1 Graphics Formats Limitations

The AUTO_FILTER filter supports AutoCAD on IBM AIX with the following limitations:

  • Oracle Database release 11.1.0.7 and release 11.2.0.1 support AutoCAD files up to AutoCAD 2002.

  • Support for AutoCAD versions later than 2002 on IBM AIX is available with Oracle Database release 11.2.0.2.

B.2.9 Formats No Longer Supported in 11.1.0.7

Certain document formats are no longer supported after you upgrade from release 11.1.0.6 to 11.1.0.7, because Oracle Text filtering technology migrated to Oracle Outside In HTML Export technology in release 11.1.0.7. To filter these formats, you can plug in a third-party filtering technology using USER_FILTER. See "USER_FILTER" for more information.

Table B-4 lists the formats supported in release 11.1.0.6, but not in 11.1.0.7.

Table B-4 Formats Supported in Release 11.1.0.6 and not in 11.1.0.7

Format | Versions

Word Processing Formats
Applix Words (AW) | 3.11, 4.0, 4.1, 4.2, 4.3, 4.4
JustSystems Ichitaro (JTD) | 2005
Folio Flat File (FFF) | 3.1
Fujitsu Oasys (OA2) | 7
Lotus Word Pro (LWP) | 9.7, 9.8
WordPerfect for Linux | All versions

Desktop Publishing Formats
Adobe FrameMaker (MIF) | 7

Spreadsheet Formats
Applix Spreadsheets (AS) | 4.2, 4.3, 4.4
Lotus 1-2-3 (123) | Millennium Edition R9, 9.8
Microsoft Works Spreadsheet (DOS) | 3.4
Microsoft Works Spreadsheet (Mac) | 3.4
Comma-Separated Values (CSV) | N/A

Presentation Formats
Applix Presents (AG) | 4.0, 4.2, 4.3, 4.4
Lotus Freelance Graphics (PRE) | Millennium Edition R9, 9.8
Microsoft Visio XML Format | 2003

Graphic Formats
SGI RGB Image (RGB) | No specific version
Windows Animated Cursor (ANI) | No specific version
WordPerfect Graphics 2 (WPG2) | 7
Microsoft Office Drawing (MSO) | No specific version
Windows Icon Cursor (ICO) | No specific version
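As noted at the start of this section, formats in Table B-4 can still be filtered by plugging an external filter into USER_FILTER. The following Python sketch shows what such a filter might look like for comma-separated values files. It rests on assumptions that should be checked against the USER_FILTER documentation: the filter is assumed to be invoked with an input file name and an output file name as its two arguments, the input is assumed to be UTF-8, and the names rows_to_lines and main are hypothetical.

```python
import csv
import sys
from typing import Iterable, Iterator


def rows_to_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one plain-text line per CSV record, with fields separated by spaces."""
    for row in csv.reader(lines):
        yield " ".join(field.strip() for field in row)


def main(argv: list[str]) -> int:
    """Assumed calling convention: <script> <input file> <output file>."""
    if len(argv) != 3:
        print("usage: csv_filter <input file> <output file>", file=sys.stderr)
        return 2
    with open(argv[1], newline="", encoding="utf-8") as src, \
            open(argv[2], "w", encoding="utf-8") as dst:
        for line in rows_to_lines(src):
            dst.write(line + "\n")
    return 0
```

To run the sketch as a standalone script, end the file with raise SystemExit(main(sys.argv)). The output is plain text, which is one of the textual formats an external filter is expected to produce.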