Wikibench
=========
Wikibench contains three software modules:
1. wikiloader: parses archived Wikipedia XML dumps and inserts data
into the database
2. tracebench: pre-processes a Wikipedia trace so that it can be used
by the load injector
3. wikijector: the load injector
Wikiloader can be used independently of tracebench and wikijector.
Tracebench and wikijector, however, must be used together, following
the workflow described below.
Up-to-date software, documentation, datasets and publications about
WikiBench can always be found at http://www.wikibench.eu/
/*******************************************************************/
wikiloader
Prerequisites
1. Expat library for XML parsing. It can be downloaded from:
http://sourceforge.net/projects/expat/. Some Linux distributions
have this package in their software repositories.
Characteristics
The following features of the program (also referred to as the dumper
in this file) are optional and can be activated using the parameters
described in the following sections.
1. Deletion of records from the tables in which data will be inserted,
with the purpose of avoiding primary key conflicts.
2. Temporarily disabling indexing for the tables in which data will be
inserted, in order to speed up the process. Indexing is re-enabled
after the insertion process is over.
3. Logging the number of pages processed into a file in order to
recover from failures. The user can choose the interval at which the
number of processed pages is written to the log file.
4. Restarting a failed insertion process. If logging was enabled, the
last logged page count can be used to restart the process from the
failure point.
Run and compile
The package contains a script, "wikiloader.sh", that compiles the
dumper and starts running it. The script reports how long the dumper
ran, measured in seconds.
A typical usage of the dumper is the following:
./wikiloader.sh [parameters]
Execution parameters
The dumper supports the following parameters:
./wikiloader.sh -h                        ; displays a list of the possible
                                          ; parameters of the script
./wikiloader.sh -d                        ; deletes existing records from the
                                          ; tables: page, revision, text
./wikiloader.sh -i                        ; turns indexing off before inserting
                                          ; data into the tables, and turns it
                                          ; back on once the whole dumping
                                          ; process has finished
./wikiloader.sh -l filepath page_interval ; turns on logging; filepath is the
                                          ; path of the log file, and
                                          ; page_interval is the number of
                                          ; processed pages between successive
                                          ; writes of the current page_id to
                                          ; the log file
./wikiloader.sh -r page_id                ; restarts page insertion from the
                                          ; given page_id after a failure
./wikiloader.sh                           ; default
The dumper can be run with any combination of the above parameters,
with the condition that if an option requires more than one parameter
(such as "-l"), the parameters must be given in the order shown
above.
For example, "./wikiloader.sh -d -l filepath page_interval" will work
correctly: it deletes the data in the tables and, during the insert
statements, logs the page count to the file. However,
"./wikiloader.sh -d -l page_interval filepath" will result in an
error. The default run does not delete the data in the tables, does
not remove indexes from the tables, does not log page ids to a file,
and cannot be used to restart the dumping from a certain page id.
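For illustration, a hypothetical logging-and-restart session (the log
path, interval and page id below are made-up values) could look like:
./wikiloader.sh -l /tmp/wikiloader.log 1000 ; log the current page_id
                                            ; every 1000 pages
./wikiloader.sh -r 451000                   ; after a failure, resume from
                                            ; the last page_id in the log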
Upon execution, a console dialog is launched and you will be asked to
provide the following additional parameters: database name, username
and the path to the dump file. After the sources are compiled and the
script begins executing, MySQL will prompt you for the password
associated with that user.
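A hypothetical session (the prompt wording and values below are
illustrative, not verbatim output of the script) might look like:
Database name: wikidb
Username: root
Path to dump file: /data/enwiki-pages-articles.xml
Enter password: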
/*******************************************************************/
Workflow for tracebench and wikijector
Prerequisites
1. Sun Java - 1.5, 1.6
2. Java-MySQL connector, which can be found in the package
libmysql-java. It is advisable to add the path to mysql.jar to the
CLASSPATH (see the example after this list).
3. HttpComponents-Client and HttpComponents-Core (can be downloaded
from http://hc.apache.org/downloads.cgi)
4. Ant
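For example, if libmysql-java placed the connector in /usr/share/java
(a common, but not universal, location), the CLASSPATH can be extended
as follows:
export CLASSPATH=$CLASSPATH:/usr/share/java/mysql.jar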
Step 1: Preparing the traces
Using the script sort_trace.sh (which can be found in the "scripts"
folder) you can prepare your traces for use with WikiBench. The script
sorts the traces chronologically and deletes the line numbers at the
beginning of each line. It can sort data from multiple .gz archives
and can also select only the traces from a certain day, as required by
the user.
Script usage:
./sort_trace.sh <path_to_traces_folder> <path_to_result_file> <date>
./sort_trace.sh <path_to_traces_folder> <date>
./sort_trace.sh <path_to_traces_folder>
The user can specify a result file where the sorted traces will be
saved. The name of this file must have a ".txt" extension, even
though the result of the script will be a .bz2 archive. The script
processes all the .gz archives in the traces folder. If a specific
date is given, only the traces from that date will appear in the
result file.
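As an illustration, with made-up paths and a made-up date value (check
the script itself for the exact date format it expects):
./sort_trace.sh /data/wiki-traces /data/sorted_trace.txt 2010-01-23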
Step 2: tracebench
After the traces are sorted they can be piped into tracebench. In
order to run tracebench it is necessary to first build TraceBench.jar:
ant build -Dconnector=<path to java-mysql connector>
java -jar build/lib/TraceBench.jar <reduction in permil> '<db uri>' <plsampling|sampling> <date_ts|epoch_ts>
Parameters:
<reduction in permil> - the reduction value, an integer between 0 and
100. If it is 0, the tool is quicker, as it will only remove unwanted
trace lines without performing further sampling.
<db uri> - standard MySQL URI for the MediaWiki database. For more
information please check:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html.
ex: jdbc:mysql://localhost/wikidb?user=root&password=pass
<plsampling|sampling> - the sampling method
plsampling - page-level sampling. It depends entirely on the page
title: the number of pages in the traces is reduced by removing
selected page names from the trace completely. There are many ways of
requesting a page, and related services, by page name, and some very
rare forms might be missed by this parser. For the benchmark this does
not influence the results, considering the extremely small percentage
of these rare requests.
sampling - some pages and files are treated as static, even though
they are not. Obvious static files are the skin-1.5/ files, the
/images/ files and the css files. The x most popular wiki pages are
also treated and sampled like static files, because we do not want to
risk such files being removed by page-level sampling (which would
reduce the trace size considerably!). Such pages include the
Main_Page, and the css and javascript 'pages'.
<date_ts|epoch_ts> - the timestamp format of the traces. Later traces
use date_ts.
tracebench uses standard input and output. Its output can be further
archived or piped directly into WikiBench.
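For example, a hypothetical end-to-end tracebench invocation that
reads a sorted trace, applies page-level sampling and archives the
output (paths, credentials and the reduction value are placeholders):
bunzip2 -c /data/sorted_trace.txt.bz2 | java -jar build/lib/TraceBench.jar 10 'jdbc:mysql://localhost/wikidb?user=root&password=pass' plsampling date_ts | gzip > /data/sampled_trace.gz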
Step 3: wikijector
wikijector can run in two modes: controller and worker. The controller
must be started first, with the number of workers specified as an input
parameter. After that, the declared number of workers can be started
one by one.
Basic usage:
Run controller:
./wikijector.sh controller <path to httpcomponents-core/lib> <path to httpcomponents-client/lib> <controller hostname> <number of workers>
Run controller in verbose mode:
./wikijector.sh vcontroller <path to httpcomponents-core/lib> <path to httpcomponents-client/lib> <controller hostname> <number of workers>
Run worker:
./wikijector.sh worker <path to httpcomponents-core/lib> <path to httpcomponents-client/lib> <controller hostname or IP address> <number_of_threads> <string_with_SUT_hostname(s)> <string_with_port_of_SUT> <timeout_in_ms> <path to logfile>
When running the wikijector, the URL of a wiki page is used to issue
an HTTP request. This URL is based on the traces sorted and sampled in
the previous steps. The URL is composed from a relative path that in
the sorted traces has the form /wiki/Main_Page, where "wiki" is the
name of the Wikipedia installation found in the folder "/var/www/".
The wikijector software rewrites this URL to
"/mediawiki/index.php/Main_Page" for its own use. If you are using
your own MediaWiki installation, be sure to modify the source file
wikibench/threads/FetchThread.java and replace "/mediawiki" with the
name of your MediaWiki installation. This applies only to content
pages; for css and script files you also need to replace "/w/" with
the name of your MediaWiki installation.
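As a minimal sketch, assuming your MediaWiki installation is named
"wiki" (adjust the replacements to your own installation name), the
substitutions could be made with sed before rebuilding wikijector:
sed -i 's|/mediawiki|/wiki|g' wikibench/threads/FetchThread.java  # content pages
sed -i 's|/w/|/wiki/|g' wikibench/threads/FetchThread.java        # css and script files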
Full tracebench - wikijector workflow:
a. Run controller:
If you are running this software from the WikibenchServer virtual
machine, there is no need to build or run tracebench, as the machine
contains a large variety of sampled traces from the interval 23-31
January 2010. These sampled traces are in the folder
/home/wikiuser/samples.
gunzip -c plasampling.xxx.gz | ./wikijector.sh controller <path to httpcomponents-core/lib> <path to httpcomponents-client/lib> <controller hostname> <number of workers>
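For instance, assuming the HttpComponents libraries are unpacked under
/opt and two workers will be attached (all values below are
illustrative):
gunzip -c /home/wikiuser/samples/plasampling.xxx.gz | ./wikijector.sh controller /opt/httpcomponents-core/lib /opt/httpcomponents-client/lib localhost 2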
If you are running this software from your own machine, we recommend
building tracebench first:
ant build -Dconnector=<path to java-mysql connector>
and then piping its output into the wikijector controller:
bunzip2 -c <path_to_trace_file> | java -jar build/lib/TraceBench.jar <reduction in permil> '<db uri>' <plsampling|sampling> <date_ts|epoch_ts> | ./wikijector.sh controller <path to httpcomponents-core/lib> <path to httpcomponents-client/lib> <controller hostname> <number of workers>
b. Run workers:
./wikijector.sh worker <path to httpcomponents-core/lib> <path to httpcomponents-client/lib> <controller hostname or IP address> <number_of_threads> <string_with_SUT_hostname(s)> <string_with_port_of_SUT> <timeout_in_ms> <path to logfile>
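A hypothetical worker invocation matching the controller example above
(hostnames, port, thread count, timeout and log path are all
placeholders):
./wikijector.sh worker /opt/httpcomponents-core/lib /opt/httpcomponents-client/lib localhost 10 'wiki.example.com' '80' 5000 /tmp/worker1.log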