Web scraping with Java

#1
I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide an API either; I want to iterate over all of its pages using some `pageID` and extract the HTML titles and other data from their DOM trees.

Are there ways other than web scraping?
Reply

#2
Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.
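
For example, a minimal sketch with HtmlCleaner, one of the parsers mentioned (the URL is a placeholder; the calls follow HtmlCleaner's documented `clean`/`findElementByName` API):

```java
import java.net.URL;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class CleanerExample {
    public static void main(String[] args) throws Exception {
        // Parse a (possibly malformed) HTML page into a DOM-like tree.
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(new URL("https://example.com/page?pageID=1")); // placeholder URL

        // Find the <title> element anywhere in the tree and print its text.
        TagNode title = root.findElementByName("title", true);
        if (title != null) {
            System.out.println(title.getText());
        }
    }
}
```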
Reply

#3
Your best bet is to use Selenium WebDriver, since it:

1. Provides visual feedback to the coder (you can watch your scraping in action and see where it stops).
2. Is accurate and consistent, as it directly controls the browser you use.
3. Is slow. It doesn't hit web pages as fast as HtmlUnit does, but sometimes you don't want to hit them too fast.

HtmlUnit is fast but handles JavaScript and AJAX poorly.
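
For instance, a minimal WebDriver sketch, assuming the Selenium Java bindings and a matching chromedriver on the PATH (the URL is a placeholder):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumScrape {
    public static void main(String[] args) {
        // Starts a real Chrome instance, so you can watch the scrape run.
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com/page?pageID=1"); // placeholder URL
            System.out.println(driver.getTitle());           // the page <title>
            // Any rendered DOM element can be queried, JavaScript-generated ones included:
            String heading = driver.findElement(By.cssSelector("h1")).getText();
            System.out.println(heading);
        } finally {
            driver.quit(); // always close the browser
        }
    }
}
```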
Reply

#4
**HtmlUnit** can be used for web scraping; it supports invoking pages and filling and submitting forms. I have used it in my project; it is a good Java library for web scraping.
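
A minimal sketch of that flow (the URL is a placeholder; note that in HtmlUnit 3.x the package is `org.htmlunit` rather than `com.gargoylesoftware.htmlunit`):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitScrape {
    public static void main(String[] args) throws Exception {
        // WebClient is HtmlUnit's headless browser.
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(false); // faster for static pages
            HtmlPage page = webClient.getPage("https://example.com/page?pageID=1"); // placeholder URL
            System.out.println(page.getTitleText()); // contents of <title>
        }
    }
}
```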
Reply

#5
Mechanize for Java would be a good fit for this, and as Wadjy Essam mentioned, it uses JSoup for the HTML. Mechanize is a stateful HTTP/HTML client that supports navigation, form submissions, and page scraping.
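
Mechanize's own API isn't shown in the post, so as a rough sketch of the same fetch-and-submit idea using JSoup directly (which mechanize builds on); the URL and form field name are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFormPost {
    public static void main(String[] args) throws Exception {
        // Submit a form by POSTing its fields directly; URL and field are hypothetical.
        Document result = Jsoup.connect("https://example.com/search") // placeholder URL
                .data("query", "java web scraping")                   // hypothetical form field
                .post();
        System.out.println(result.title());
    }
}
```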
Reply

#6
There is also Jaunt, a Java library for web scraping and JSON querying.

Reply

#7
If you wish to automate the scraping of a large number of pages or a lot of data, you could try Gotz ETL.

It is completely model-driven, like a real ETL tool: the data structure, task workflow, and pages to scrape are defined with a set of XML definition files, and no coding is required. Queries can be written either as selectors with JSoup or as XPath with HtmlUnit.

Reply

#8
# jsoup

Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "_Java HTML parsers_". One of them is Jsoup.

You can navigate the page using the DOM if you know the page structure; see jsoup's DOM navigation documentation.

It's a good library and I've used it in my recent projects.
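
A minimal sketch of the `pageID` loop from the question, assuming a placeholder URL pattern:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTitles {
    public static void main(String[] args) throws Exception {
        // Iterate over pages by pageID and print each HTML title;
        // the URL pattern is a placeholder for the real site.
        for (int pageID = 1; pageID <= 10; pageID++) {
            Document doc = Jsoup.connect("https://example.com/page?pageID=" + pageID).get();
            System.out.println(pageID + ": " + doc.title());
            // Other DOM content is reachable via CSS selectors, e.g. doc.select("h1")
        }
    }
}
```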
Reply

#9
For tasks of this type I usually use Crawler4j plus jsoup.

With crawler4j I download the pages from a domain; you can specify which URLs to crawl with a regular expression.

With jsoup I then parse the HTML data I searched for and downloaded with crawler4j.

Normally you could also download the data with jsoup alone, but crawler4j makes it easier to find links. Another advantage of crawler4j is that it is multithreaded, and you can configure the number of concurrent threads.
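
A minimal sketch of that combination, following crawler4j's usual controller/crawler pattern (the domain, regex, and storage folder are placeholders):

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class DomainCrawler extends WebCrawler {
    // Regex deciding which URLs to follow -- domain and pattern are placeholders.
    private static final Pattern PAGE_PATTERN =
            Pattern.compile("https://example\\.com/page\\?pageID=\\d+");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return PAGE_PATTERN.matcher(url.getURL()).matches();
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            // Hand the raw HTML to jsoup for the actual extraction.
            System.out.println(org.jsoup.Jsoup.parse(html).title());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder folder
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("https://example.com/"); // placeholder seed
        controller.start(DomainCrawler.class, 4);   // 4 concurrent crawler threads
    }
}
```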
Reply

#10
Normally I use Selenium, which is software for test automation.
You can control a browser through a WebDriver, so you will not have problems with JavaScript, and it is usually not detected much if you use the full browser. Headless browsers are more easily identified.
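
A small sketch of toggling that choice with Chrome via Selenium (the `--headless=new` flag applies to recent Chrome versions; the URL is a placeholder):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessToggle {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        // Comment this line out to run the full, visible browser instead.
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);
        driver.get("https://example.com"); // placeholder URL
        System.out.println(driver.getTitle());
        driver.quit();
    }
}
```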
Reply


