Text Filters
This collection of filters aids in the retrieval and formatting of internet-based news leads and helps compile the data into an input file to be read by KEDS and TABARI. The processes involved in this task include downloading the lead sentences from a web-based source, ordering the information chronologically, and formatting the specific source codes and identifiers for interpretation by KEDS or TABARI. The tasks performed by each individual filter are described below.
The filters are listed in reverse chronological order, with the programs that we have used in our most recent research listed first. Due to changes in data service (NEXIS, Factiva) formats over time, programs that are more than a couple of years old will probably not work without modification, but we are leaving the older code available since it might provide templates for writing other filters. That said, Perl is so vastly superior to C, C++, and Pascal for text processing that the Perl-based filters are generally the only versions worth bothering with unless you are dealing with archived downloads.
Advisory: The Perl programs should work on Macintosh, Unix, Linux, and Windows operating systems. However, make sure that you have converted the source code and any input files to the appropriate operating system file format before running them: if a program appears to be behaving erratically, the cause is quite likely a file incompatibility (e.g. a Windows program trying to read a Unix file). Click here for further discussion on the merits of Perl.
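The file-format conversion mentioned in the advisory mostly comes down to line endings. The following is a minimal sketch in Python (not the Perl used by the filters themselves) of converting a file to Unix format; the function names `to_unix_newlines` and `convert_file` are hypothetical and are not part of any program distributed here.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Macintosh (CR) line endings to Unix (LF)."""
    # Replace CRLF first so a Windows line ending does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as raw bytes and write a Unix-format copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_newlines(data))
```

Working on raw bytes avoids any guess about the character encoding of the downloaded stories. Converting in the other direction (for a Windows tool reading a Unix file) would need the reverse replacement.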
NewNexisFormat.pl (Perl)
This Perl program reformats stories downloaded from the LexisNexis Academic Universe system into the TABARI format. It replaces the older "nexispider.pl", which did the downloading automatically but no longer works due to changes in the NEXIS web site. The program is currently set to process only Agence France Presse records but should be easy to modify for other sources. Because LexisNexis downloads are sent in Windows file format, the program automatically converts these to Unix format.
Last update: 22 March 2008
Download NewNexisFormat.pl source code -- this will open as a page of text; "Save" this in your browser.
"Read.Me" file that explains how to do the LexisNexis formatting.
Zipped file containing NewNexisFormat.pl, nexisreverse.pl and NewNexisFormat.readme.txt.
Factiva.Reutlead.filter.pl (Perl)
This Perl program processes a set of Reuters stories downloaded from Factiva using the email option, and formats the lead sentences of those stories into the KEDS/TABARI format. The program takes as input the names of two output files, one for the formatted leads and one for the dates, followed by a list of the files containing the stories.
Last update: 16 July 2008
Download Factiva.Reutlead.filter.1b1.pl source code -- this will open as a page of text; "Save" this in your browser.
OSC.filter.pl (Perl)
This Perl program is used for combining files from the U.S. government's Open Source Center (http://opensource.gov) data system after these have been downloaded using the Firefox extension DownThemAll (http://www.downthemall.net). The news reports are combined into a single HTML file, with most of the extraneous HTML code removed.
Last update: 8 April 2008
Download OSC.filter.1b03.pl source code -- this will open as a page of text; "Save" this in your browser.
Download instructions (.pdf) for using DownThemAll and the filter.
FactivaMail.pl (Perl)
This Perl program processes a set of stories downloaded from Factiva using the email option, and formats those into the KEDS/TABARI format. The input to the program is a list of the files containing the stories. The program is currently set to output only the lead sentences from Agence France Presse records.
Last update: 5 July 2004
Download FactivaMail.pl source code -- this will open as a page of text; "Save" this in your browser.
nexisreverse.pl (Perl)
This Perl program reverses the order of stories that were downloaded from NEXIS using the nxdnldformat.pl or nexispider.pl programs (more generally, it will reverse the order of any "KEDS-formatted" files). The program solves the problem of NEXIS downloading stories in reverse chronological order, while event data coding usually needs records in chronological order. The program also combines multiple downloads into a single file, and eliminates stories that have identical first lines. The current version gets only lead sentences, but this is easily changed.
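The three operations described above (combining downloads, dropping stories whose first lines repeat, and reversing the order) can be sketched as follows. This is an illustrative Python sketch, not the code of nexisreverse.pl. It assumes, for illustration only, that records are separated by blank lines; the actual KEDS input layout may differ, and the function names are hypothetical.

```python
def parse_records(text: str) -> list[list[str]]:
    """Split text into records; here a record is a run of non-blank lines (an assumption)."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


def reverse_and_merge(downloads: list[str]) -> str:
    """Combine downloads, drop records with a repeated first line, and reverse the order.

    `downloads` holds the text of each download, most recent download first,
    with the records inside each download also listed most recent first.
    """
    seen = set()
    kept = []
    for text in downloads:
        for record in parse_records(text):
            if record[0] in seen:
                continue  # a record with this first line was already kept
            seen.add(record[0])
            kept.append(record)
    kept.reverse()  # most-recent-first becomes earliest-first
    return "\n\n".join("\n".join(r) for r in kept) + "\n"
```

Reversing whole records, rather than lines, keeps each story's text attached to its own header line. If the downloads overlap in time or are supplied in a different order, sorting the records by date would be safer than a plain reversal.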
Last update: 26 January 2003
Download nexisreverse.pl source code -- this will open as a page of text; "Save" this in your browser.
ActorFilter
This program locates potential new actor names in a file of KEDS input records by looking for strings of consecutive capitalized words and comparing these against existing sets of actor names and a list of stop words. It produces a keyword-in-context index of the new actors sorted by frequency. Documentation in .pdf and MS-Word format is included. The beta version of the program was available only for the Macintosh. The Java version was created in March 2001; both are available here.
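The approach described above can be sketched in a few lines. The following Python sketch is not the ActorFilter code; the regular expression, the trimming of stop words from the ends of a phrase, the 20-character context window, and the name `candidate_actors` are all assumptions made for illustration.

```python
import re
from collections import Counter

# A run of one or more capitalized words separated by single spaces.
CAP_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")


def candidate_actors(records, known_actors, stop_words, window=20):
    """Return (phrase, count, contexts) tuples for unknown phrases, most frequent first."""
    known = {a.upper() for a in known_actors}
    stops = {w.upper() for w in stop_words}
    counts = Counter()
    contexts = {}
    for text in records:
        for match in CAP_RUN.finditer(text):
            words = match.group().split()
            # Trim stop words (such as a sentence-initial "The") from both ends.
            while words and words[0].upper() in stops:
                words.pop(0)
            while words and words[-1].upper() in stops:
                words.pop()
            phrase = " ".join(words)
            if not phrase or phrase.upper() in known:
                continue
            counts[phrase] += 1
            # Keep a keyword-in-context snippet around the original match.
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            contexts.setdefault(phrase, []).append(text[start:end])
    return [(p, n, contexts[p]) for p, n in counts.most_common()]
```

A simple pattern like this misses names containing lowercase particles or all-capital acronyms, so a practical filter would need a broader definition of a capitalized word.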
Beta version uploaded: 12 October 1997
ActorFilter program and manual (.sit)
Java version 2.03 updated: 13 June 2001
ActorFilter program (java version)
nexispider.pl (Perl)
This Perl program followed a set of linked news stories generated by the NEXIS Academic Universe system and then formatted those stories into the KEDS format. The input to the program is the initial URL for a linked set of stories. The code contains extensive internal documentation and should be easy to modify for other sources. The program ceased to work when LexisNexis changed the format of its web site in summer 2007 but might be useful as the basis of some other URL-following, HTML-reformatting program.
Last update: 15 August 2003
Download nexispider.pl source code -- this will open as a page of text; "Save" this in your browser.
NEXIS_Filter (C)
This is the HLEAD_Filter program translated into C. It is considerably faster than the Pascal version, at least on the Macintosh. The program will do both lead and full-story filtering; it uses a file called "filter.abbrev" to distinguish periods at the end of abbreviations from those at the end of sentences. It also incorporates a secondary filter that will skip stories where the "HEADLINE:" segment contains certain strings.
Documentation in MS-Word format is included. The .zip version of the files is for the benefit of those using Windows or UNIX: the program code and documentation are in ASCII text files. The code in the .zip file is not quite as debugged as the .sit version, so if you intend to do a lot of filtering, get the .sit file.
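The role of an abbreviation list in sentence filtering can be illustrated with a short sketch. The Python below is not the NEXIS_Filter code (which is in C), and the one-abbreviation-per-line layout assumed for the list is a guess made for illustration, not the documented format of "filter.abbrev". The function names are hypothetical.

```python
def load_abbreviations(lines):
    """Build a set of abbreviations from an iterable of lines, one per line (assumed layout)."""
    return {line.strip().lower() for line in lines if line.strip()}


def split_sentences(text, abbreviations):
    """Split text after tokens ending in a period, unless the token is a listed abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))  # trailing text without a final period
    return sentences


def lead_sentence(text, abbreviations):
    """Return the first sentence of a story, or an empty string if there is no text."""
    sentences = split_sentences(text, abbreviations)
    return sentences[0] if sentences else ""
```

Lead filtering then keeps only the result of `lead_sentence`, while full-story filtering keeps every sentence. An abbreviation that happens to end a sentence is not split in this sketch, which is a known limitation of the list-based approach.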
Last update: 14 February 1998
NEXIS_Filter program, source code and files (.sit)
NEXIS_Filter source code and files (.zip)
RBBFilter (C)
Modification of NEXIS_Filter that works with the Reuters Business Briefing download format. It has all of the facilities of NEXIS_Filter, plus somewhat better handling of material in quotations and automatic elimination of very short sentences.
Last update: 25 January 1998
RBBFilter program, source code and files (.sit)
FactivaFilter (C)
Modification of RBBFilter that works with leads saved from the Web version of the Dow Jones Interactive/Factiva service. Different browsers seem to vary a bit in how they save the material, so it may require a bit of modification to work at your site.
Last update: 31 January 2000
FactivaFilter program, source code and files (.sit)
FactivaFilter program, source code and files (.zip)
NEXIS_Verify (C)
This program goes through a list of KEDS records and checks the dates for missing intervals, bad date formats and the like. Date strings are tagged on any of the following conditions:
- Date is not in the range 790415 to 981231
- Consecutive dates are separated by more than 4 days
- A date occurs earlier than the previous date.
These conditions can be modified using an assortment of parameters in the program.
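The three checks listed above can be sketched as follows. This Python sketch is not the NEXIS_Verify code (which is in C); the function names, the problem labels, and the keyword parameters are hypothetical. The defaults mirror the values stated above, and the sketch assumes dates are six-digit YYMMDD strings with two-digit years interpreted by Python's `%y` rule (69 to 99 as 1969 to 1999).

```python
from datetime import datetime


def parse_yymmdd(s):
    """Return a date for a YYMMDD string, or None if it is malformed."""
    if len(s) != 6 or not s.isdigit():
        return None
    try:
        return datetime.strptime(s, "%y%m%d").date()
    except ValueError:  # e.g. month 13 or 30 February
        return None


def verify_dates(date_strings, first="790415", last="981231", max_gap_days=4):
    """Return (index, date_string, problem) tuples for dates that should be tagged."""
    lo, hi = parse_yymmdd(first), parse_yymmdd(last)
    problems = []
    previous = None
    for i, s in enumerate(date_strings):
        d = parse_yymmdd(s)
        if d is None:
            problems.append((i, s, "bad format"))
            continue  # keep comparing later dates against the last valid one
        if not lo <= d <= hi:
            problems.append((i, s, "out of range"))
        if previous is not None:
            gap = (d - previous).days
            if gap > max_gap_days:
                problems.append((i, s, "gap"))
            elif gap < 0:
                problems.append((i, s, "out of order"))
        previous = d
    return problems
```

Comparing parsed dates rather than the raw six-digit numbers means that gaps are measured in calendar days, so the step from 790430 to 790501 counts as one day.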
Based on long experience with the vagaries of Reuters and NEXIS records, we strongly recommend running this routine before coding, particularly if you subsequently intend to aggregate data using KEDS_Count.
Both the compiled program and the source code are included.
Last update: 25 January 1998
NEXIS_Verify program and source code (.sit)
HLEAD Filter (Pascal)
This folder contains the Pascal source code for two programs for converting Reuters text downloaded from the NEXIS data service into the KEDS format. Edit_HLEAD will reformat leads downloaded using the NEXIS HLEAD segment and .tty interface into the KEDS input format (as well as eliminating duplicate stories). To run the program, simply respond to the file selection dialog requests when asked for the names of the input and output files. This program is quite sensitive to the exact NEXIS format, so the commented Pascal source code and Macintosh resource file have also been included (Edit_HLEAD.text and Edit_HLEAD.RSRC); it is also relatively straightforward to modify this program to work in a DOS/Windows environment.
Reverse_Reuters is a program that reverses the chronological order of KEDS-formatted records; it is used to change the recent-to-earliest ordering used by NEXIS to the earliest-to-recent ordering used in time series analysis. (Using the SORT; DATE command in NEXIS accomplishes the same thing...)
The programs in the "Text Filter" folder are two more general filters, along with their documentation. These programs formed the basis of the later C-language text filter.
Last update: circa 1996
